Infrastructure Alerts
A non-exhaustive list of the alerts you may see in the #datawarehouse-infra Slack channel and their meanings.
Pingdom Alerts
These alerts report the status of the Alli Data platform. Up means the platform is available; Down means it is not.
AWS Alarms
Alarm | Meaning |
---|---|
Long Running Autoscaling Instances | The number of autoscaling groups hasn’t dropped below 3 for longer than 14 hours. |
Queue Backup | More than 500 datasource jobs have been queued for longer than 2 hours. |
High Redshift Connections | The number of Redshift connections has been higher than 200 for longer than 2 minutes. |
No Redshift Automated Snapshots | No automated snapshot has completed in a 12-hour period. |
Redshift Unhealthy | Redshift reports its status as unhealthy; the check runs every 5 minutes. This alert also fires when Redshift is automatically restarted during non-peak hours. |
Redshift Out of Space | Redshift disk space has been over 90% used for longer than 30 minutes. |
Redshift Blocking Queries | One or more Redshift queries have been blocking for longer than 90 minutes. |
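
Most of the alarms above follow the same pattern: a metric stays beyond a threshold for longer than some duration. As a rough illustration only, the sketch below shows how two of the Redshift rules could be expressed as parameters for a CloudWatch metric alarm. The metric names `DatabaseConnections` and `PercentageDiskSpaceUsed` and the `ClusterIdentifier` dimension are standard in the `AWS/Redshift` namespace, but the class name, the sampling periods, the `Average` statistic, and the cluster name are assumptions made for this example, not the actual configuration behind these alerts.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class SustainedThresholdRule:
    """Hypothetical helper: 'metric above X for longer than Y'."""
    alarm_name: str
    metric_name: str
    threshold: float
    sustained_seconds: int       # the "for longer than ..." part of the rule
    period_seconds: int          # assumed sampling period, not from the table
    namespace: str = "AWS/Redshift"
    statistic: str = "Average"   # assumed; the real alarms may use another

    def to_alarm_kwargs(self, cluster_id: str) -> dict:
        # CloudWatch expresses duration as a number of consecutive periods,
        # so the duration is rounded up to whole periods.
        evaluation_periods = ceil(self.sustained_seconds / self.period_seconds)
        return {
            "AlarmName": self.alarm_name,
            "Namespace": self.namespace,
            "MetricName": self.metric_name,
            "Dimensions": [{"Name": "ClusterIdentifier", "Value": cluster_id}],
            "Statistic": self.statistic,
            "Period": self.period_seconds,
            "EvaluationPeriods": evaluation_periods,
            "Threshold": self.threshold,
            "ComparisonOperator": "GreaterThanThreshold",
        }


# Thresholds and durations come from the table; periods are assumed.
HIGH_CONNECTIONS = SustainedThresholdRule(
    "High Redshift Connections", "DatabaseConnections", 200, 2 * 60, 60
)
OUT_OF_SPACE = SustainedThresholdRule(
    "Redshift Out of Space", "PercentageDiskSpaceUsed", 90, 30 * 60, 300
)

if __name__ == "__main__":
    # "example-cluster" is a placeholder cluster identifier.
    for rule in (HIGH_CONNECTIONS, OUT_OF_SPACE):
        print(rule.to_alarm_kwargs("example-cluster"))
```

The resulting dictionaries use the parameter names accepted by the CloudWatch `PutMetricAlarm` API, so with boto3 they could be passed as `cloudwatch.put_metric_alarm(**kwargs)`. Counting consecutive periods only approximates "for longer than", and the non-Redshift alarms (queue backup, autoscaling) would need their own metrics.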