Prometheus Alerting Rules and Metadata

In which we discuss the complexity of building a useful alerting system for our HL7 messaging engine, where different systems have different priorities and thresholds.

Prometheus is an open-source monitoring system: it scrapes time-series metrics from exporters over HTTP, stores them in its own time-series database, and evaluates recording and alerting rules written in its query language, PromQL. Alerts that fire are handed to the companion Alertmanager, which groups them and routes notifications to people.

Like most hospitals, my employer has an HL7 interface engine. Ours is implemented on top of Red Hat (JBoss) Fuse, which combines technologies like the ActiveMQ message broker and Apache Camel integration patterns into a robust engine that we can use for more than just HL7 processing.

We have written a custom exporter that collects information about queued messages from the message broker within our engine. Of interest are queues with too many messages, and queues with long delays.

Metrics #

The metrics collected look like this:

broker_queue_messages{direction="In",iface="ADT",instance="127.0.0.1:9000",job="esb_queues",org="UHN",queue="q-In-ADT-UHN-EPR-ADT-MSH-Soft",sys="EPR"}    1
broker_queue_messages{direction="In",iface="ADT",instance="127.0.0.1:9000",job="esb_queues",org="UHN",queue="q-In-ADT-UHN-EPR-ADTOMI-UHN-MedStreaming",sys="EPR"}    1
broker_queue_messages{direction="Out",iface="ADT",instance="127.0.0.1:9000",job="esb_queues",org="UHN",queue="q-Out-ADT-UHN-MPF",sys="MPF"}    2290
broker_queue_messages{direction="Out",iface="ADT",instance="127.0.0.1:9000",job="esb_queues",org="UHN",queue="q-Out-ADT-UHN-Muse",sys="Muse"}    2
broker_queue_messages{direction="Out",iface="ADT",instance="127.0.0.1:9000",job="esb_queues",org="UHN",queue="q-Out-ADT-UHN-NODR",sys="NODR"}    1
broker_queue_messages{direction="Out",iface="ADTSIU",instance="127.0.0.1:9000",job="glf_queues",org="SIMS",queue="q_Out_ADTSIU_SIMS_EPR",sys="EPR"}    1

broker_queue_oldest_message_seconds{direction="Out",iface="ADT",instance="127.0.0.1:9000",job="esb_queues",org="UHN",queue="q-Out-ADT-UHN-GRASP",sys="GRASP"}    1568041691
broker_queue_oldest_message_seconds{direction="Out",iface="ADT",instance="127.0.0.1:9000",job="esb_queues",org="UHN",queue="q-Out-ADT-UHN-HorizonCardiology",sys="HorizonCardiology"}    1568041748
broker_queue_oldest_message_seconds{direction="Out",iface="ADT",instance="127.0.0.1:9000",job="esb_queues",org="UHN",queue="q-Out-ADT-UHN-MPF",sys="MPF"}    1568039781
broker_queue_oldest_message_seconds{direction="Out",iface="Notification",instance="127.0.0.1:9000",job="glf_queues",org="SIMS",queue="q_Out_Notification_SIMS_Alerting",sys="Alerting"}    1568041542

Since broker_queue_oldest_message_seconds holds the enqueue time of each queue's oldest message as a Unix timestamp, the age of that message is time() minus the metric. A typical alert rule for these metrics might look something like:

- alert: MessagesNotFlowing
  expr: time() - broker_queue_oldest_message_seconds > 300
  labels:
    severity: warning
  annotations:
    summary: Messages on interface {{ $labels.queue }} have been delayed by {{ $value | humanizeDuration }}.
    description: Messages are queuing on an interface. Please contact the downstream system owner to have them check their receiver.

Unfortunately, this is not terribly useful, because some queues are backed up all the time (OLIS), while other queues are very sensitive to delays (Ultra Rapid Response). We need a way to handle both extremes.

Ignored Queues #

Some destinations we simply do not care about (OLIS); some destinations are not yet active and should be ignored. To handle this, we create a new metric, queue_is_ignored, and rewrite the expression above as:

expr: time() - broker_queue_oldest_message_seconds > 300 UNLESS ON (queue) queue_is_ignored

This will suppress alerts for any queues where queue_is_ignored has a value.
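
The ignore metric only needs a queue label, and its value does not matter: the UNLESS operator only checks whether a matching series exists. A sketch of what the entries might look like (these queue names are invented for illustration):

queue_is_ignored{queue="q-Out-ADT-UHN-NotYetLive"} 1
queue_is_ignored{queue="q-Out-ADT-UHN-DecommissionedSystem"} 1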

Per Queue Thresholds #

This section is lifted mostly from a useful article at Robust Perception: Using Time Series as Alert Thresholds (https://www.robustperception.io/using-time-series-as-alert-thresholds).

We define two new metrics, esb_queue_max_messages and esb_queue_max_delay_seconds, each with a label for the queue name and the required threshold as the value:

esb_queue_max_delay_seconds{queue="q-Out-ADT-UHN-EDNotification"} 900
esb_queue_max_messages{queue="q-Out_ORU_ON_OLIS"} 5000

We can then compare against these new metrics, using GROUP_LEFT to handle any many-to-one matching. The expression below also includes a default (300 seconds) for any queues that do not have a matching time series in the threshold metric:

- alert: MessagesNotFlowing
  expr: |
    # Alert based on per-queue thresholds, with a default.
      time() - broker_queue_oldest_message_seconds > ON (queue) GROUP_LEFT()
      (
          esb_queue_max_delay_seconds
        OR ON (queue)
          COUNT BY (queue) (broker_queue_oldest_message_seconds) * 0 + 300
      )
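
To make the default concrete: evaluated against the example data above, the parenthesized right-hand side yields one threshold per queue, roughly (metric names elided):

{queue="q-Out-ADT-UHN-EDNotification"}    900
{queue="q-Out-ADT-UHN-GRASP"}    300
{queue="q-Out-ADT-UHN-HorizonCardiology"}    300
{queue="q-Out-ADT-UHN-MPF"}    300
{queue="q_Out_Notification_SIMS_Alerting"}    300

and the left-hand side (the age of each queue's oldest message) is compared against its own queue's threshold.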

Alerting Labels #

Some downstream systems are more important than others. Some downstream systems do not have on-call support staff. Some downstream systems would like to be notified when their queues are backing up. To handle these cases, we define a new metric:

esb_queue_labels{queue="q-Out-ADT-HSSO-eNotification",priority="medium",timeperiod="workhours"} 1
esb_queue_labels{queue="q-Out-ADT-MSH-Soft",priority="high",timeperiod="24x7"} 1
esb_queue_labels{queue="q-Out-ADT-MSH-Soft",priority="high",timeperiod="24x7",email_to="someone@uhn.ca"} 1

We can then fold the labels from this metric into our alerts with a GROUP_LEFT expression. (The email_to label shown in the last example line is optional; note that each queue should appear only once in esb_queue_labels, because GROUP_LEFT requires a unique match on the right-hand side.)

- expr:    broker_queue_oldest_message_seconds + ON (queue) GROUP_LEFT(priority,timeperiod,email_to) (0 * esb_queue_labels)
  record:  queue:broker_queue_oldest_message_seconds:labels

The ON (queue) expression tells Prometheus to match metrics on the left with metrics on the right based only on the queue label. The GROUP_LEFT() expression merges the listed labels into the resulting metrics. The magic of using addition in the main expression, but multiplying by 0 on the right-hand side, means the resulting metric has the same value as before, but with the labels merged in.

# Before
broker_queue_oldest_message_seconds{direction="Out",iface="ADT",instance="127.0.0.1:9000",job="esb_queues",org="UHN",queue="q-Out-ADT-UHN-GRASP",sys="GRASP"}

# After
queue:broker_queue_oldest_message_seconds:labels{direction="Out",iface="ADT",instance="127.0.0.1:9000",job="esb_queues",org="UHN",priority="low",queue="q-Out-ADT-UHN-GRASP",sys="GRASP",timeperiod="workhours"}

Default values #

Many operations in PromQL are effectively set operations. Unfortunately, the expression above:

broker_queue_oldest_message_seconds + ON (queue) GROUP_LEFT(priority,timeperiod,email_to) (0 * esb_queue_labels)

is a set intersection operation; it removes any time series in broker_queue_oldest_message_seconds for which there is no matching queue in esb_queue_labels. This means we don’t get any alerts for queues if we have forgotten to add an entry to the label metric.

Changing the expression to include all time series, with a default right-hand value of 0 for any queue that has no entry in esb_queue_labels, fixes this problem:

broker_queue_oldest_message_seconds + ON (queue) GROUP_LEFT(priority,timeperiod,email_to) (0 * esb_queue_labels OR ON (queue) 0 * COUNT BY (queue) (broker_queue_oldest_message_seconds))

The COUNT BY (queue) term manufactures a series for every queue that appears in broker_queue_oldest_message_seconds; multiplying it by 0 keeps the recorded value unchanged, and because these default series carry no priority, timeperiod, or email_to labels, unlabelled queues simply pass through instead of being dropped.

Extra Notifications #

Using the label merging mechanism above, we can also add extra recipients for a given queue using the email_to label.
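
A sketch of how the email_to label might be consumed, as a fragment of an Alertmanager configuration (the receiver name is invented, Alertmanager's email_config accepts a template in its to field, and the global SMTP settings are omitted):

route:
  routes:
    # copy any alert that carries an email_to label to a templated email receiver,
    # while letting it continue through the normal routing tree
  - matchers: [ 'email_to=~".+"' ]
    receiver: queue-owner-email
    continue: true

receivers:
- name: queue-owner-email
  email_configs:
  - to: '{{ .CommonLabels.email_to }}'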

Textfile Exporter Complications #

There are a few mechanisms for creating metrics manually. Metrics can be added directly to Prometheus using recording rules:

- record: esb_queue_labels
  expr: 1
  labels:
    queue: q-Out-ADT-UHN-GRASP
    timeperiod: workhours

This is unwieldy when there are many queues and labels to maintain.

Metrics can also be exported using the node_exporter, by placing a file with a .prom extension containing metrics in the textfile collector directory, for example /var/lib/node_exporter/textfile_collector (the location is set by the node_exporter's --collector.textfile.directory flag, so yours may differ). This is the mechanism we are using. A complication, however, is that our production cluster has two redundant ESB servers, each running node_exporter, so Prometheus scrapes two copies of each metric:

esb_queue_labels{instance="uhn-vi-esb-001p",job="node",priority="low",queue="q-Out-ADT-UHN-GRASP",timeperiod="workhours"}    1
esb_queue_labels{instance="uhn-vi-esb-002p",job="node",priority="low",queue="q-Out-ADT-UHN-GRASP",timeperiod="workhours"}    1

To work around this, we append “_instance” to the name of each metric published through the textfile collector, and then use recording rules in Prometheus to fold the copies back into a single metric:

- expr:   avg(esb_queue_labels_instance) WITHOUT (instance)
  record: esb_queue_labels
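
For reference, the .prom file on each ESB server contains the _instance variants of all of the metrics discussed above; a sketch, reusing the example queues from earlier (the NotYetLive queue is invented):

queue_is_ignored_instance{queue="q-Out-ADT-UHN-NotYetLive"} 1
esb_queue_labels_instance{queue="q-Out-ADT-UHN-GRASP",priority="low",timeperiod="workhours"} 1
esb_queue_max_delay_seconds_instance{queue="q-Out-ADT-UHN-EDNotification"} 900
esb_queue_max_messages_instance{queue="q-Out_ORU_ON_OLIS"} 5000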

Final Configuration #

groups:
- name: activemq
  rules:

    # we have a static file, loaded by the node_exporter's textfile support, containing definitions of
    # labels and thresholds for our queues. We need to combine all instances of these metrics into a single
    # metric that can be used in the alert rules below.
  - expr:   avg(queue_is_ignored_instance) WITHOUT (instance)
    record: queue_is_ignored
  - expr:   avg(esb_queue_labels_instance) WITHOUT (instance)
    record: esb_queue_labels
  - expr:   avg(esb_queue_max_delay_seconds_instance) WITHOUT (instance)
    record: esb_queue_max_delay_seconds
  - expr:   avg(esb_queue_max_messages_instance) WITHOUT (instance)
    record: esb_queue_max_messages


    # start with 1m averages, to ignore tiny spikes, or missed export cycles.
  - expr:   avg_over_time(broker_queue_messages{queue!~"(DLQ|JNL).*"}[1m])
    record: queue:broker_queue_messages:avg1m

    # mix in the alerting labels for the queue (including the ignore metric)
  - expr: |
        queue:broker_queue_messages:avg1m + ON (queue) GROUP_LEFT (priority, timeperiod, email_to)
        (0 * esb_queue_labels OR ON (queue) 0 * COUNT BY (queue) (queue:broker_queue_messages:avg1m))
        UNLESS ON (queue) queue_is_ignored
    record: queue:broker_queue_messages:avg1m:labels

    # alert based on thresholds with a default of 300 messages
    # See https://www.robustperception.io/using-time-series-as-alert-thresholds
  - alert: TooManyMessagesQueued
    expr: |
        queue:broker_queue_messages:avg1m:labels > ON (queue) GROUP_LEFT()
        (
            esb_queue_max_messages
        OR ON (queue)
            COUNT BY (queue) (queue:broker_queue_messages:avg1m) * 0 + 300
        )
    labels:
    # no labels because this alert switches on 'priority' pulled from esb_queue_labels
    annotations:
      summary: Queue {{ $labels.queue }} has {{ $value | printf "%.0f" }} messages.
      description: >
        Messages are queuing up on an interface. If this is an outbound queue,
        this could be a transient caused by a burst of messages; monitor the
        interface. If this is an inbound queue, there is probably a problem
        with the mapper that should be investigated immediately.


    # start with 1m averages, to ignore tiny spikes.
  - expr:    (time() - avg_over_time(broker_queue_oldest_message_seconds{queue!~"(DLQ|JNL).*"}[1m]))
    record:  queue:broker_queue_oldest_message_seconds:avg1m

    # mix in the alerting labels for the queue (including the ignore metric)
  - expr: |
        queue:broker_queue_oldest_message_seconds:avg1m + ON (queue) GROUP_LEFT(priority,timeperiod,email_to)
        (0 * esb_queue_labels OR ON (queue) 0 * COUNT BY (queue) (queue:broker_queue_oldest_message_seconds:avg1m))
        UNLESS ON (queue) queue_is_ignored
    record:  queue:broker_queue_oldest_message_seconds:avg1m:labels

    # alert based on thresholds with a default of 300 seconds
    # See https://www.robustperception.io/using-time-series-as-alert-thresholds
  - alert: MessagesNotFlowing
    expr: |
        queue:broker_queue_oldest_message_seconds:avg1m:labels > ON (queue) GROUP_LEFT()
        (
            esb_queue_max_delay_seconds
          OR ON (queue)
            COUNT BY (queue) (queue:broker_queue_oldest_message_seconds:avg1m) * 0 + 300
        )
    labels:
    # no labels because this alert switches on 'priority' pulled from esb_queue_labels
    annotations:
      summary: Messages on interface {{ $labels.queue }} have been delayed by {{ $value | humanizeDuration }}.
      description: Messages are queuing on an interface. Please contact the downstream system owner to have them check their receiver.

Alertmanager Configuration #

FIXME
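
A rough sketch of the routing this design implies: route on the priority label pulled in from esb_queue_labels, page 24x7 only for high-priority queues, and only notify work-hours-only systems during work hours. Everything below is an assumption (the receiver names, the workhours interval, and the notification mechanisms are invented), and it needs a reasonably recent Alertmanager for matchers and active_time_intervals:

route:
  receiver: esb-default
  group_by: [ queue ]
  routes:
    # high priority queues page the on-call, 24x7
  - matchers: [ 'priority="high"' ]
    receiver: esb-oncall-pager
    # work-hours-only systems are emailed, and only during work hours
  - matchers: [ 'timeperiod="workhours"' ]
    receiver: esb-team-email
    active_time_intervals: [ workhours ]

time_intervals:
- name: workhours
  time_intervals:
  - weekdays: [ 'monday:friday' ]
    times:
    - start_time: '08:00'
      end_time: '17:00'

receivers:
- name: esb-default
  email_configs:             # global SMTP settings omitted
  - to: 'esb-team@example.org'
- name: esb-team-email
  email_configs:
  - to: 'esb-team@example.org'
- name: esb-oncall-pager
  pagerduty_configs:
  - routing_key: '<secret>'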

Troubleshooting #

Here is a PromQL query that will list all broker queues that are not ignored and do not have alerting labels defined:

queue:broker_queue_messages:avg1m{job="esb_queues",direction="Out"} UNLESS ON (queue) esb_queue_labels UNLESS ON (queue) queue_is_ignored

The inverse query:

esb_queue_labels UNLESS ON (queue) queue:broker_queue_messages:avg1m{job="esb_queues",direction="Out"}

will list all entries in esb_queue_labels that do not correspond to an active queue, which is useful for detecting typos and decommissioned queues.

Thanks to cks https://utcc.utoronto.ca/~cks/space/blog/ for these two useful queries!
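
A third variation along the same lines (assuming the recording rules above) can flag ignore entries for queues that no longer exist:

queue_is_ignored UNLESS ON (queue) queue:broker_queue_messages:avg1m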