Infrastructure Alerts

A non-exhaustive list of the alerts you may see in the #datawarehouse-infra Slack channel and their meanings.

Pingdom Alerts:

These alerts report the status of the Alli Data platform. Up means the platform is responding; Down means it is not.

AWS Alarms:

| Alarm | Image | Meaning |
| --- | --- | --- |
| Long Running Autoscaling Instances | image-20200505-181757.png | The number of autoscaling groups hasn’t dropped below 3 for longer than 14 hours. |
| Queue Backup | image-20200505-181840.png | More than 500 datasource jobs have been queued for longer than 2 hours. |
| High Redshift Connections | image-20200505-181913.png | The number of Redshift connections has been higher than 200 for longer than 2 minutes. |
| No Redshift Automated Snapshots | image-20200505-182008.png | No automated snapshot has completed over a 12-hour period. |
| Redshift Unhealthy | image-20200505-183238.png | Redshift reports an unhealthy status; health is checked every 5 minutes. This alert also fires when Redshift is automatically restarted during non-peak hours. |
| Redshift Out of Space | image-20200505-191323.png | Redshift disk space has been over 90% used for longer than 30 minutes. |
| Redshift Blocking Queries | Screen Shot 2020-06-02 at 9.17.25 AM.png | One or more Redshift queries have been blocking for longer than 90 minutes. |
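
Most of the alarm thresholds listed above share one shape: a metric has to stay past a limit for a sustained period before the alarm fires. As a rough illustration only, the Python sketch below encodes three of those thresholds and checks a series of timestamped samples against them. The `AlarmRule` class, the `is_firing` function, the rule keys, and the sample format are all hypothetical; this is not how the actual AWS alarms are configured or evaluated.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class AlarmRule:
    """Hypothetical description of a 'sustained breach' alarm."""
    name: str
    threshold: float      # metric must be strictly above this value
    duration: timedelta   # ...for strictly longer than this


# Thresholds copied from the alarm descriptions; the keys are made up.
RULES: Dict[str, AlarmRule] = {
    "queue_backup": AlarmRule("Queue Backup", 500, timedelta(hours=2)),
    "high_redshift_connections": AlarmRule(
        "High Redshift Connections", 200, timedelta(minutes=2)
    ),
    "redshift_out_of_space": AlarmRule(
        "Redshift Out of Space", 90, timedelta(minutes=30)
    ),
}

# One metric reading: (time of reading, value).
Sample = Tuple[datetime, float]


def is_firing(rule: AlarmRule, samples: List[Sample]) -> bool:
    """Return True if the latest run of samples above the threshold
    spans longer than the rule's duration."""
    breach_start: Optional[datetime] = None
    last_seen: Optional[datetime] = None
    for timestamp, value in sorted(samples):
        if value > rule.threshold:
            # Remember when the current breach began.
            if breach_start is None:
                breach_start = timestamp
        else:
            # Any sample at or below the threshold resets the breach.
            breach_start = None
        last_seen = timestamp
    if breach_start is None or last_seen is None:
        return False
    return (last_seen - breach_start) > rule.duration
```

For example, feeding `is_firing` four one-minute samples of 250 connections with the `high_redshift_connections` rule returns `True`, because the breach spans 3 minutes. If one of the middle samples drops to 150, the breach restarts and the function returns `False`.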