A non-exhaustive list of the alerts you may see in the #datawarehouse-infra Slack channel and their meanings.
Pingdom Alerts
Report the up/down status of the Alli Data platform. Up is good; Down means the platform is not responding.
AWS Alarms
| Alarm | Meaning |
|---|---|
| Long Running Autoscaling Instances | The number of autoscaling instances has not dropped below 3 for longer than 14 hours. |
| Queue Backup | The datasource job queue has held more than 500 jobs for longer than 2 hours. |
| High Redshift Connections | More than 200 Redshift connections have been open for longer than 2 minutes. |
| No Redshift Automated Snapshots | No automated Redshift snapshot has completed in the last 12 hours. |
| Redshift Unhealthy | The Redshift health check, which runs every 5 minutes, reports the cluster as unhealthy. This alert also fires when Redshift is automatically restarted during off-peak hours. |
| Redshift Out of Space | More than 90% of Redshift disk space has been in use for longer than 30 minutes. |
| Redshift Blocking Queries | One or more Redshift queries have been blocking other queries for longer than 90 minutes. |
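Most of the alarms above share one shape: a metric stays past a threshold for longer than a set duration. The Python sketch below models that rule so the table's numbers can be read precisely. It is an illustration only, not the real AWS alarm configuration. The `ThresholdAlarm` class, the `(timestamp, value)` sample format, and the exact comparisons are assumptions. For example, "hasn't dropped below 3" is read as `>= 3`, and "longer than" is read as a strict `>`. The snapshot, health-check, and blocking-query alarms are left out because they are not simple sustained-threshold rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Sequence, Tuple

# Hypothetical sample format: (time the metric was recorded, metric value).
Sample = Tuple[datetime, float]


@dataclass(frozen=True)
class ThresholdAlarm:
    """Hypothetical model of a 'metric past threshold for longer than X' alarm."""
    name: str
    breaching: Callable[[float], bool]  # True when a single sample is out of bounds
    duration: timedelta                 # how long the breach must persist

    def is_firing(self, samples: Sequence[Sample]) -> bool:
        """Fire if the trailing run of breaching samples spans more than `duration`."""
        if not samples:
            return False
        latest_time, latest_value = samples[-1]
        if not self.breaching(latest_value):
            return False
        # Walk backwards to find where the current breaching streak started.
        streak_start = latest_time
        for ts, value in reversed(samples):
            if not self.breaching(value):
                break
            streak_start = ts
        return latest_time - streak_start > self.duration


# Thresholds taken from the table; the comparison operators are assumptions.
ALARMS: List[ThresholdAlarm] = [
    ThresholdAlarm("Long Running Autoscaling Instances", lambda n: n >= 3, timedelta(hours=14)),
    ThresholdAlarm("Queue Backup", lambda n: n > 500, timedelta(hours=2)),
    ThresholdAlarm("High Redshift Connections", lambda n: n > 200, timedelta(minutes=2)),
    ThresholdAlarm("Redshift Out of Space", lambda pct: pct > 90, timedelta(minutes=30)),
]

if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    # 250 connections sampled once a minute for 3 minutes (minutes 0 to 3).
    connections = [(start + timedelta(minutes=m), 250) for m in range(4)]
    print(ALARMS[2].is_firing(connections))  # True: above 200 for 3 minutes
```

A single sample back inside the threshold resets the streak. Under this model, a brief dip in Redshift connections restarts the 2-minute clock.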