Alerts

Scalyr can notify you under conditions you specify. You can generate alerts based on any type of data (e.g. system metrics, access logs, or custom metrics), using simple or complex rules. Some examples:

  • CPU usage exceeds 70%
  • More than 50 error pages served in a 5-minute period
  • For /renderImage requests returning less than 5MB of data, 90th percentile latency over the last 10 minutes was more than twice the equivalent latency for /home requests

Creating and Managing Alerts

Alerts are specified in the /scalyr/alerts configuration file. You generally don't need to edit this file directly. The easiest way to create an alert is to bring up a graph of the value you want to alert on, and then click the Save Search button and select As Alert. However, this page will describe alerts in terms of their appearance in the configuration file. You will sometimes need to edit the configuration file directly to access advanced features, such as alert groups (sections). This file has the following structure:

{
  alertAddress: "email@example.com",

  alerts: [
    // Alert if there are more than 10 log messages containing the word "error"
    // in a single minute.
    {
      trigger: "count:1m(error) > 10",
      description: "Excessive error messages; "
                 + "see http://example.com/playbook/serverErrors"
    },
    // Alert if mean usermode CPU usage on server1 exceeds 0.5 cores over 10 minutes.
    {
      trigger: "mean:10m($source='tsdb' $serverHost='server1' metric='proc.stat.cpu_rate' type='user') > 50",
      description: "Server1 CPU high"
    }
  ]
}

alertAddress is the e-mail address to which alerts are sent. (If you do not specify an alertAddress, alerts are sent to the e-mail address associated with your Scalyr account.) The alerts list specifies the conditions to monitor. To add, remove, or edit alerts, edit the /scalyr/alerts configuration file.

The general form of an alert specification is as follows:

  {
      trigger:                "...expr...",
      description:            "...",        // optional
      silenceUntil:           "...",        // optional
      silenceComment:         "...",        // optional
      gracePeriodMinutes:     nnn,          // optional
      renotifyPeriodMinutes:  nnn,          // optional
      resolutionDelayMinutes: nnn           // optional
  }

Trigger expressions, silences, grace periods, and renotification periods are discussed in later sections. The description can contain anything you like; Scalyr displays the description along with the alert. You can include URLs (e.g. a link to a playbook page) in the alert description.

Trigger Expressions

Alerts can be based on any data in Scalyr, using the same query language as the graph view. The heart of an alert trigger is an expression like this:

  FUNCTION:TIME([ATTR where] FILTER)

FUNCTION is one of the functions listed below. TIME specifies the time window to be considered, e.g. 5m for 5 minutes. (The unit can be 'seconds', 'minutes', 'hours', 'days', 'weeks', or various abbreviations.) ATTR is the name of a numeric event field, and FILTER is a filter expression. The function should then be compared with a threshold value, using standard operators such as < and >. Examples:

// Alert if more than 10 log messages in the last minute contain the word "error".
count:1m(error) > 10

// Alert if the average response size in requests for /home, over the last 5 minutes, is less than 100.
mean:5m(bytes where path == '/home') < 100

// Alert if 95th percentile latency of requests for /home over the last hour exceeds 1000 milliseconds.
p[95]:1h(latency where path == '/home') >= 1000

You can use any of the functions described in the Graph Functions reference, plus the following:

  • count — the number of matching events over the entire time period
  • countPerSecond — the number of matching events per second
  • mean — the average field value
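For instance, countPerSecond expresses rate-based conditions directly. A minimal sketch, assuming your access log defines a numeric "status" field:

// Alert if 500-status responses average more than 1 per second over the last 5 minutes.
// (The "status" field name is an assumption; use whatever field your parser defines.)
countPerSecond:5m(status == 500) > 1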

Functions can be combined using standard operators such as +, -, *, /, <, >, <=, >=, &&, ||, and !. For instance:

// Alert if there are at least 10% as many "server error" messages as "success" messages. Note that
// "server error" needs quotes, because it contains a space, but "success" does not.
count:1m('server error') > count:1m(success) * 0.1

// Alert if mean latency for /home requests is greater than 200 milliseconds, but only if there
// are at least 20 such requests to take an average from.
mean:1m(latency where path == '/home') > 200 && count:1m(path == '/home') >= 20

Special Functions

You can use the following special functions in a trigger expression:

  • hourOfDay(): the current hour of the day (0 - 23), in GMT.
  • hourOfDay(timeZone): the current hour of the day (0 - 23), in the specified time zone, e.g. "PST".
  • dayOfWeek(): the current day of the week (0 for Sunday, 1 for Monday, 6 for Saturday), in GMT.
  • dayOfWeek(timeZone): the current day of the week (0 for Sunday, 1 for Monday, 6 for Saturday), in the specified time zone, e.g. "PST".
  • dayOfMonth(): the current day of the month (1 for the first day of the month), in GMT.
  • dayOfMonth(timeZone): the current day of the month (1 for the first day of the month), in the specified time zone, e.g. "PST".

The hourOfDay(), dayOfWeek(), and dayOfMonth() functions can be used to write rules that only trigger during certain times of the day, week, or month. For example, the following rule will trigger if the message "success" occurs fewer than 5 times per second during business hours:

countPerSecond:10m(success) < 5
&& dayOfWeek("PST") >= 1 && dayOfWeek("PST") <= 5
&& hourOfDay("PST") >= 9 && hourOfDay("PST") <= 17

Grace Period

Sometimes you may not want to be notified unless an alert remains triggered for several minutes in a row. For instance, it may be normal for a server to experience a brief spike of high latency, but you want to know if the high latency persists for more than a minute or two. In this case, add a "gracePeriodMinutes" field to your alert. For example, the following alert generates a notification only if the average latency exceeds 500 milliseconds for three minutes in a row:

{
  trigger:            "mean:1m(latency where source == 'accessLog') > 500",
  description:        "Frontend latency is high",
  gracePeriodMinutes: 3
}

The gracePeriodMinutes field can also be specified at the top level of the /scalyr/alerts file, or in an alert section, in which case it provides a default value for all alerts in the file or section. (Alert sections are described below, under "Specifying Alert Recipients".)
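For example, a minimal sketch of a file-level default with a per-alert override (the specific values are illustrative):

{
  gracePeriodMinutes: 3,  // default grace period for every alert in this file
  alerts: [
    {
      trigger: "mean:1m(latency where source == 'accessLog') > 500",
      description: "Frontend latency is high"
      // inherits the 3-minute default
    },
    {
      trigger: "count:1m(error) > 10",
      description: "Excessive error messages",
      gracePeriodMinutes: 10  // this alert overrides the default
    }
  ]
}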

Repeated Notifications

If an alert remains triggered for a long time, Scalyr will notify you again, as a reminder that the alert condition is still in effect. By default, a new message is sent once per hour. You can control this by adding a "renotifyPeriodMinutes" field to your alert. For instance, the following alert will generate a new notification once every four hours:

{
  trigger:               "mean:1m(latency where source == 'accessLog') > 500",
  description:           "Frontend latency is high",
  renotifyPeriodMinutes: 240
}

Set renotifyPeriodMinutes to 0 to disable repeated notifications.

You can change notification behavior for a group of alerts, or all of your alerts, by setting renotifyPeriodMinutes at a higher level in the alert configuration file. To disable repeated notifications for all alerts, add the following line near the top of /scalyr/alerts, just after the initial {:

renotifyPeriodMinutes: 0,

This sets the default for all alerts to "no repeated notifications". You can still set renotifyPeriodMinutes for individual alerts or alert groups if you want to use repeated notifications for some alerts.
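For example, a sketch of an alert group that restores hourly reminders while the file-level default disables them (group structure as described below, under "Specifying Alert Recipients"):

{
  renotifyPeriodMinutes: 0,  // file-level default: no repeated notifications
  alerts: [
    {
      renotifyPeriodMinutes: 60,  // this group: send a reminder every hour
      alerts: [
        {
          trigger: "mean:1m(latency where source == 'accessLog') > 500",
          description: "Frontend latency is high"
        }
      ]
    }
  ]
}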

renotifyPeriodMinutes applies only to an alert that remains continuously triggered for a long period of time. If an alert triggers, resolves, and then triggers again, a new notification is sent regardless of renotifyPeriodMinutes.

If you wish to temporarily halt notifications for an alert, you can use the Silence button. This allows you to specify a time period during which all notifications for that alert are disabled.

When you specify PagerDuty or OpsGenie as your alert delivery channel, Scalyr does not send repeated notifications. PagerDuty and OpsGenie have their own systems for managing unresolved alerts.

Resolution Delay

When an alert is resolved, Scalyr waits a few minutes before notifying you. This is to avoid a flood of alternating "triggered" and "resolved" messages if the alert condition is flickering on and off. You can control this by adding a "resolutionDelayMinutes" field to your alert. For instance, the following alert will wait ten minutes before generating a resolved message:

{
  trigger:                "mean:1m(latency where source == 'accessLog') > 500",
  description:            "Frontend latency is high",
  resolutionDelayMinutes: 10
}

The default delay is 5 minutes. Set resolutionDelayMinutes to 0 to force resolved messages to be sent immediately. Set resolutionDelayMinutes to the special value 9999 to not send "resolved" messages.

You can change notification behavior for a group of alerts, or all of your alerts, by setting resolutionDelayMinutes at a higher level in the alert configuration file. To disable notification delays for all alerts, add the following line near the top of /scalyr/alerts, just after the initial "{":

resolutionDelayMinutes: 0,

This sets the default for all alerts to send "resolved" messages immediately. You can still set resolutionDelayMinutes for individual alerts or alert groups. Conversely, to turn off "resolved" messages by default for all alerts, use the special value 9999:

resolutionDelayMinutes: 9999,

Viewing Alerts

https://www.scalyr.com/alerts shows the current status and recent history of each alert. Status is either "red" (the alert condition is met, so the alert is triggered) or "green" (the condition is not met). A series of red/green boxes shows the alert status over the last hour, one box per minute.

You can click on any alert to see a detailed description of the alert, and (via a further link) view graphs of the data on which the alert is based. From the graph view, you can drill down and explore the data using all the usual tools of Scalyr.

Alert Messages

When an alert triggers (the alert condition becomes true), an e-mail message is sent to the address specified in /scalyr/alerts. The message identifies the alert, and contains a link to a graph of the data on which the alert is based.

When the alert condition ceases to be true, Scalyr waits a few minutes and then sends an "alert resolved" message. The delay is to avoid a flurry of messages if a borderline alert repeatedly flips between red and green. You can always see the latest alert status at https://www.scalyr.com/alerts.

To protect your inbox, Scalyr imposes a limit on the number of e-mail messages you are sent per hour. If the limit is exceeded, alert messages will be delayed for a few minutes and then sent in a batch (several alerts in one e-mail).

If you would prefer to receive a separate message for each alert, add the prefix "unbatched:" to your e-mail address. For instance:

alertAddress: "unbatched:frontend-team@example.com"

Silencing Alerts

Sometimes you may want to temporarily disable, or "silence", an alert. You can do this from https://www.scalyr.com/alerts, by moving the mouse over the alert in question and clicking the "Silence" button. This causes a silence directive to be added to the /scalyr/alerts file. You can also edit the file directly:

{
  trigger: "count:1m(error) > 10",
  description: "Excessive error messages",
  silenceUntil: "Feb. 10 2013",
  silenceComment: "Spurious errors, will be fixed in next server push"
},

This causes the alert to be ignored until the specified time is reached. You can use any date/time format documented on the Time Range reference, except for relative formats (such as "4 hours").

The silenceComment field is optional; you can use it to record a reminder of why you silenced the alert.

Alert Templates

If you are running multiple servers, you will probably want to create alerts for each server. Alert templates make it easy to define your alerts once and apply them separately to each server. Here is an example, defining alert templates for "high CPU usage" and "root disk almost full", and applying them to three servers:

{
  alerts: [
    {
      templateParameters: [
        {host: "host1"},
        {host: "host2"},
        {host: "host3"}
      ],
      alerts: [
        {
          description: "#host#: high CPU usage",
          trigger: "mean:5m($source='tsdb' $serverHost='#host#' metric='proc.stat.cpu_rate' type='user') > 400.0"
        }, {
          description: "#host#: root disk almost full"
          trigger: "mean:10m($source='tsdb' metric='df.1kblocks.free' $serverHost == '#host#' mount=='/' ) < 500000"
        }
      ]
    }
  ]
}

Let's break this down. The first two lines are standard boilerplate for the /scalyr/alerts file, defining a list of alerts. The '{' on line three begins an entry in the alerts list. This entry has two properties, "templateParameters" and "alerts". The templateParameters list specifies that these alerts should be applied three times, to servers "host1", "host2", and "host3". The inner alerts list defines the alerts, using the placeholder #host#, which will be filled in with host1, host2, and host3. You can use placeholders to customize the description, trigger, and alertAddress properties of an alert.
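For instance, a hypothetical per-host alertAddress using the same placeholder (the address pattern is purely illustrative):

alertAddress: "#host#-oncall@example.com",  // hypothetical: becomes host1-oncall@example.com, host2-oncall@example.com, ...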

You can use additional template parameters to customize your alerts. For instance, suppose your servers use different processors and you want to alert at a different CPU threshold on each server. You could add a cpuLimit parameter for this:

{
  alerts: [
    {
      templateParameters: [
        {host: "host1", cpuLimit: 240},
        {host: "host2", cpuLimit: 400},
        {host: "host3", cpuLimit: 300}
      ],
      alerts: [
        {
          description: "#host#: high CPU usage",
          trigger: "mean:5m($source='tsdb' $serverHost='#host#' metric='proc.stat.cpu_rate' type='user') > #cpuLimit#"
        }, {
          description: "#host#: root disk almost full"
          trigger: "mean:10m($source='tsdb' metric='df.1kblocks.free' $serverHost == '#host#' mount=='/' ) < 500000"
        }
      ]
    }
  ]
}

When the Scalyr Agent uploads data from each of your servers, it automatically sets $serverHost to the server hostname. So, when defining a templated alert to monitor multiple servers, it's usually easiest to use this field to identify the servers, as shown in this example. Replace "host1", "host2", etc. with the hostnames of your servers. You can attach additional fields to each server, e.g. to work with groups of servers; see the Manage Groups of Servers solution page.

Per-Server Alerts

Rather than explicitly listing parameter values for an alert template, you can configure a template to automatically apply to all your servers. You can also configure it to apply to selected servers, e.g. all frontends or all database servers.

To do this, instead of a templateParameters clause, specify a byHosts clause. Here is the first example from the previous section, rewritten using byHosts:

{
  alerts: [
    {
      byHosts: {
        filter: "",              // Blank means "all hosts"
        fields: ["serverHost"],  // Retrieve the "serverHost" (hostname) field for use in alert templates.
                                 // (You can specify "serverHost", "serverIP", and/or any server-level fields
                                 // defined in the Scalyr Agent configuration.)
        maxAgeHours: 4           // Ignore hosts which have not sent any data in the last 4 hours
      },
      alerts: [
        {
          description: "#serverHost#: high CPU usage",
          trigger: "mean:5m($source='tsdb' $serverHost='#serverHost#' metric='proc.stat.cpu_rate' type='user') > 400.0"
        }, {
          description: "#serverHost#: root disk almost full"
          trigger: "mean:10m($source='tsdb' metric='df.1kblocks.free' $serverHost == '#serverHost#' mount=='/' ) < 500000"
        }
      ]
    }
  ]
}

A pair of alerts will be created for each server that has generated logs or metric data in the last four hours. When you add or remove a server, e.g. in an EC2 autoscaling group, the alert list will automatically adjust. Note that alerts for a removed server will linger for several hours, according to maxAgeHours. (When a server stops sending data, Scalyr can't tell whether you've deliberately terminated it, or the server has crashed, lost power or network, or experienced some other problem. maxAgeHours allows the server's alerts to persist so that you can be warned of unplanned server termination.)

All three fields of the byHosts clause are optional. If you don't specify a filter, the alerts will apply to every server. If you don't specify a fields list, the serverHost (hostname) and serverIP (IP address) fields are provided. If you don't specify maxAgeHours, 4 hours is used. Alerts may linger for an hour or so beyond the period specified in maxAgeHours.

Here are some sample filters:

    // All servers whose hostname contains "frontend"
    filter: "serverHost contains 'frontend'"

    // All servers whose agent configuration includes a server-level field named
    // "scope", with value "staging".
    filter: "scope == 'staging'"

    // All servers having logs tagged with parser name "xxx".
    filter: "'parser:xxx'"

    // All servers where "xxx" appears anywhere in the file name or parser name of any log.
    filter: "'xxx'"

You can use the full Scalyr query language to select servers. Your filter expression can reference serverHost (the server's hostname), serverIP (the server's IP address), and any server-level fields defined in the Scalyr Agent configuration. In addition, you can select based on log files and log parsers, using the text search syntax. When you use a text search filter, each server is treated as having the following text:

[parser:xxx] [parser:yyy] [log:aaa] [log:bbb] ...

listing each log file for that server, and any parsers associated with those log files in the Scalyr Agent configuration.
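Putting this together, here is a sketch of a byHosts clause restricted to frontend servers, retrieving a server-level "scope" field for use in the alert description (the "scope" field is an assumption; it must be defined in your Scalyr Agent configuration):

{
  alerts: [
    {
      byHosts: {
        filter: "serverHost contains 'frontend'",
        fields: ["serverHost", "scope"]  // "scope" is assumed to be a server-level field from the agent configuration
      },
      alerts: [
        {
          description: "#serverHost# (#scope#): high CPU usage",
          trigger: "mean:5m($source='tsdb' $serverHost='#serverHost#' metric='proc.stat.cpu_rate' type='user') > 400.0"
        }
      ]
    }
  ]
}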

Managing Templated Alerts

To create an alert template, you must directly edit the /scalyr/alerts configuration file. However, you can use the shortcuts on the https://www.scalyr.com/alerts page to manage templated alerts like any other alert. If you edit or delete a templated alert, all instances of the template are affected.

When you silence a templated alert, you can choose to silence a specific instance, or all instances of the template. For instance, suppose you click the Silence button for the "host2: high CPU usage" alert generated from the template above. The silence confirmation dialog will contain a field "Silence alerts matching:", with value host=="host2", meaning that only the host2 instance of this alert would be silenced. If you delete the host=="host2" clause, then all instances of the alert would be silenced.

Specifying Alert Recipients

Anywhere you specify an e-mail address for alerts, you can also give multiple addresses, separated by commas or semicolons. All of the listed addresses will receive the alert.

You can direct each alert to a different list of e-mail addresses, by adding an alertAddress field to the alert:

{
  alerts: [
    {
      trigger: "count:1m(error) > 10 $tier='frontend'",
      alertAddress: "frontend-team@example.com"
    },
    {
      trigger: "count:1m(error) > 10 $tier='backend'",
      alertAddress: "backend-team@example.com, ops-team@example.com"
    }
  ]
}

Any alert that does not specify an alertAddress defaults to the address specified at the top of the file, or to the e-mail address associated with your account. You can also group alerts into sections, and specify an alertAddress on the section:

{
  alerts: [
    // Frontend alerts
    {
      alertAddress: "frontend-team@example.com",
      alerts: [
        {
          trigger: "count:1m(error) > 10 $tier='frontend'"
        },
        {
          ...more alerts...
        }
      ]
    },
    // Backend alerts
    {
      alertAddress: "backend-team@example.com",
      alerts: [
        {
          trigger: "count:1m(error) > 10 $tier='backend'"
        },
        {
          ...more alerts...
        }
      ]
    }
  ]
}

PagerDuty Integration

You can use PagerDuty to deliver Scalyr alert notices. Simply create a "Generic API" service in PagerDuty. In your Scalyr alert configuration, enter an alertAddress of the following form:

pagerduty:XXXXX

Where XXXXX is the "Service API Key" for your Generic API service.
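For example, a sketch that keeps e-mail as the file-level default while routing one alert to PagerDuty (the key shown is a placeholder):

{
  alertAddress: "ops-team@example.com",
  alerts: [
    {
      trigger: "count:1m(error) > 10",
      description: "Excessive error messages",
      alertAddress: "pagerduty:XXXXX"  // replace XXXXX with your Service API Key
    }
  ]
}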

For detailed instructions on setting up PagerDuty integration, see Alert Using PagerDuty.

OpsGenie Integration

You can use OpsGenie to deliver Scalyr alert notices. Simply create a Scalyr integration at https://www.opsgenie.com/integration, and copy the resulting API key. Then, in your Scalyr alert configuration, enter an alertAddress of the following form:

opsgenie:XXXXX

Where XXXXX is the API key generated on the OpsGenie integrations page.

For detailed instructions on setting up OpsGenie integration, see Alert Using OpsGenie.

HipChat Integration

You can use HipChat to deliver Scalyr alert notices. In your Scalyr alert configuration, enter an alertAddress of the following form:

hipchat:room=ROOM-ID&token=XXX

Where XXX is a HipChat API token, available at https://www.hipchat.com/admin/api.

Scalyr will send a message to HipChat when an alert triggers, and again when it resolves. By default, Scalyr instructs HipChat to add a notification to the message only when the alert triggers. (The notification may take the form of a sound, pop-up window, or something else, depending on how you have configured HipChat.) You can customize this behavior by appending an option to the address:

  • hipchat:room=ROOM-ID&token=XXX&notify=none: don't add a notification to any alert messages
  • hipchat:room=ROOM-ID&token=XXX&notify=trigger: add a notification for trigger messages, but not resolution messages
  • hipchat:room=ROOM-ID&token=XXX&notify=all: add a notification for both trigger and resolution messages

Note that messages are always added to the HipChat room. The notify setting only specifies whether a sound or other notification should accompany the message.

For detailed instructions on setting up HipChat integration, see Alert Using HipChat.

Slack Integration

You can use Slack to deliver Scalyr alert notices. In your Scalyr alert configuration, enter an alertAddress of the following form:

webhook-trigger:POST https://hooks.slack.com/services/INTEGRATION_PATH[[{\"text\": \"#title# <#link#>\",\"username\": \"scalyr\", \"icon_emoji\": \":sos:\"}]]

Where INTEGRATION_PATH is obtained in the Slack integration settings. See Alert Using Slack for details.

Webhook Integration

You can deliver Scalyr alert notices to any third-party service that accepts notifications via a GET, PUT, or POST HTTP request. In your Scalyr alert configuration, enter an alertAddress in one of the following forms:

webhook-trigger:GET URL
webhook-resolve:GET URL

webhook-trigger:PUT URL[[BODY]]
webhook-resolve:PUT URL[[BODY]]

webhook-trigger:POST URL[[BODY]]
webhook-resolve:POST URL[[BODY]]

Where URL is a complete address beginning with "http://" or "https://", and BODY is the body of the HTTP request. For PUT and POST requests, you can optionally specify a Content-Type, as follows:

webhook-trigger:PUT https://api.example.com/operation[[{"foo": true}]]&content-type=application/json

A webhook-trigger recipient is notified whenever an alert triggers (i.e. when the alert condition becomes true). A webhook-resolve recipient is notified when an alert resolves (the alert condition becomes false). You will generally want to specify both a webhook-trigger and a webhook-resolve recipient. A complete example:

alertAddress:
    "webhook-trigger:GET https://api.example.com/notify-alert?message=alert%20#title#%20triggering"
  + ","
  + "webhook-resolve:GET https://api.example.com/notify-resolved?message=alert%20#title#%20resolved"

Substitution Tokens

You can embed tokens in a webhook URL or body, which will be replaced by information about the alert. The following tokens are supported:

  • #trigger#: the trigger expression that determines when this alert fires
  • #description#: the description you've specified for this alert; if you didn't specify a description, the trigger expression is used
  • #title#: the first line of the description
  • #link#: a link to this alert
  • #id#: a short token identifying this alert
Any other use of # is left alone. In particular, if your webhook contains a sequence like #foo#, it will be left unchanged.
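For example, a sketch of a POST webhook whose JSON body uses several tokens (the endpoint and JSON field names are hypothetical):

// The endpoint and body field names below are hypothetical; adjust them for your service.
alertAddress: "webhook-trigger:POST https://api.example.com/alerts[[{\"title\": \"#title#\", \"trigger\": \"#trigger#\", \"link\": \"#link#\"}]]&content-type=application/json"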

Alert Logging

Scalyr generates three kinds of log records which you can use to review your alert history:

Alert state records are generated every 60 seconds, and indicate whether the alert condition is met. Each record has the following fields:

  • tag: always alertState
  • state: 2 if the alert condition is met (i.e. the alert is triggered), 1 otherwise
  • trigger: the alert's trigger condition
  • description: the alert's description

An alert notification record is generated whenever Scalyr sends an "alert triggered" or "alert resolved" message:

  • tag: always alertNotification
  • newState: 2 if the alert condition is met (i.e. the alert is triggered), 1 otherwise
  • trigger: the alert's trigger condition
  • description: the alert's description
  • isRenotification: true for a repeated notification of an alert that has been triggered for a long time, false otherwise

A state change record is generated when an alert's state changes between "triggered" and "not triggered":

  • tag: always alertStateChange
  • newState: 2 if the alert condition is met (i.e. the alert is triggered), 1 otherwise
  • trigger: the alert's trigger condition
  • description: the alert's description

alertNotification and alertStateChange records are similar. An alertStateChange record is generated immediately when the alert's state changes. If the alert has a grace period, then the alertNotification record may be delayed or omitted. And each renotification generates an additional alertNotification record, but not an alertStateChange record.

If you're using alert templates, then the template parameters will be included as additional fields in each log record.
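For instance, here is a sketch of a search filter you might use to review recent state changes for one instance of the templated alerts defined earlier (assuming the standard query filter syntax):

// State changes for the "host2" instance of a templated alert ("host" is the template parameter).
tag == 'alertStateChange' newState == 2 host == 'host2'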