Service Performance Monitoring (SPM)

Service Performance Monitoring (SPM) allows you to aggregate logs into services and operations. After a mapping process, users can easily view RED (Rate, Error, Duration) metrics and graphs for each service, and for each operation under that service. A standardized overview of your architecture is available on a single page.

How SPM Works

The heart of SPM is a data model that aggregates logs into services and operations. The following terms are defined:

  • Service: a free-standing system component that provides a well-defined set of functionality. Services generally consist of an interchangeable set of containers, VMs, or servers all running the same code, or of code running on a platform like an AWS Lambda Application.
  • Operation: a single function provided by a service. For example, a database service has operations such as "add record" and "get record". For AWS Lambda, this is likely a Lambda function.
  • Request: an individual invocation of a service. A specific occasion on which the service is asked to perform an operation.

Raw data always consists of logged requests. User-defined mappings tell Scalyr which requests are associated with which service and operation. The mappings also tell Scalyr which fields to use when calculating error and duration. Requests are aggregated to calculate rate and provide operation-level and service-level performance data.

For example, nginx access logs contain URI paths for "/app/home" and "/app/photo". Nginx is mapped as the service, and each URI path is mapped as an operation. The number of requests for each operation is used to calculate rate data. The HTTP status field in the data is mapped appropriately to calculate errors. And the request_time field is mapped to calculate duration.

This process generates a number of fields with the prefix "sca:":

  • sca:service: The name of the service. You can use this field to quickly inspect your data by service.
  • sca:operation: The name of the operation. You can use this field to quickly inspect your data by operation.
  • sca:outcome: The outcome for the request, standardized through user-defined mappings to either "error" or "success". Allows you to compare outcomes across all of your services and operations.
  • sca:latency: Duration for the request, standardized to milliseconds across all your services and operations. You can graph this field, break it down by service or operation, and use it to construct Alerts.
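For example, the sca: fields can be used directly in searches. The two queries below are a sketch in Scalyr query syntax, using the service and operation names from the examples in this document: the first finds all failed requests for the nginx2 service, and the second finds addEvents requests slower than 500 ms (sca:latency is numeric and standardized to milliseconds).

```
sca:service == 'nginx2' sca:outcome == 'error'

sca:operation == 'addEvents' sca:latency > 500
```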

If you are importing trace data from OpenCensus or OpenTelemetry, three additional "sca:"-prefixed fields are generated:

  • sca:span: The span ID.
  • sca:parentSpan: The parent span ID.
  • sca:trace: The trace ID.

A future version will aggregate logged requests (spans) further, using parent span IDs and trace IDs, to provide full tracing capability. Rather than instrumenting code for a tracing framework, users can simply add a few fields to their logs and obtain the same functionality.

At present, the number of operations which are automatically tracked and visible in the UI is limited to 20.

SPM Overview

You can easily monitor the real-time performance of all your services and operations on a single page. Performance data is presented according to the RED method:

  • Rate: The average number of requests per second.
  • Error: The average number of errors per second.
  • Duration: The average amount of time it takes to fulfill requests, in milliseconds. Also known as latency.

These three metrics provide a standardized overview of your architecture. Furthermore, they are useful for assessing customer satisfaction.

(1) To monitor the performance of your services, select Logs from the main menu.

(2) Then select the Services tab.

(3) Services and operations are listed hierarchically, with operations under their parent service. They are color-coded for readability.

In this example we have two services, nginx2 and openCensus. nginx2 has two operations, "addEvents" and "uploadLogs", while openCensus has one, "trace".

Note that if you are importing trace data from OpenCensus or OpenTelemetry, the data is treated as a specific operation (sca:operation == "trace").

(4) For each service, the average number of requests per second over the last four hours is displayed. For each operation, the graph displays the number of requests per second over the past four hours, and over the past four hours one week ago. The percentage change compared to the previous week is displayed, prefixed by an up or down arrow indicating an increase or decrease. The average number of requests per second for the operation, over the last four hours, is displayed in parentheses.

(5) For each service and operation, error data is displayed as the percentage of requests for that service or operation that are errors, averaged over the past four hours.

(6) For each service, the average duration for fulfilling requests over the past four hours is displayed. For each operation, the graph plots the duration of requests over the past four hours, and over the past four hours one week ago. The percentage change compared to the previous week is displayed, prefixed by an up or down arrow indicating an increase or decrease. The average duration over the last four hours is displayed in parentheses.

Editing the Configuration File

Services, operations, and their associated RED metrics are specified by a configuration file in an augmented JSON format. This topic describes the configuration syntax. For more information on Scalyr configuration files, refer to Configuration Files.

The SPM configuration file is located at /scalyr/mapping. To create this file, click on your login at the top right of the page, select Config Files, and then select the Create New File button on the upper-right.

Here is an example of the configuration file syntax:

{
  mappings: [
    {
      filter: "tag = 'openCensusDogfood'",
      serviceName: {
        type: "fixed",
        source: "openCensus"
      },
      operationName: {
        type: "fixed",
        source: "trace"
      },
      latency: {
        type: "rewritefield",
        action: "timeDelta",
        startTime: "startTimeMs",
        endTime: "endTimeMs"
      },
      traceId: {
        type: "field",
        source: "traceId"
      },
      spanId: {
        type: "field",
        source: "spanId"
      },
      parentSpanId: {
        type: "field",
        source: "parentSpanId"
      },
      isError: "status > 399"
    }, {
      filter: "uri = '/addEvents' request_time",
      serviceName: {
        type: "fixed",
        source: "nginx2"
      },
      operationName: {
        type: "fixed",
        source: "addEvents"
      },
      latency: {
        type: "latencyfield",
        source: "request_time",
        units: "seconds"
      },
      isError: "status > 399"
    }, {
      filter: "uri contains 'uploadLogs' request_time",
      serviceName: {
        type: "fixed",
        source: "nginx2"
      },
      operationName: {
        type: "fixed",
        source: "uploadLogs"
      },
      latency: {
        type: "latencyfield",
        source: "request_time",
        units: "seconds"
      },
      isError: "status > 399"
    }
  ]
}

Note that each operation is a clause in the mappings field. The example above defines three operations: "trace", "addEvents", and "uploadLogs". One of these ("trace") belongs to the "openCensus" service, while the other two belong to the "nginx2" service.


Each operation has the following subfields:

filter: Expressed in Scalyr Query Language, this filters for the log messages associated with an operation.

If you are using a regex to map an operation, the regex must match a fixed number of operations. For example, if the data mapping rule for the operation is { source: "url", match: ".*method=(.*)", replace: "$1" }, you must ensure the regex matches a fixed number of method names. To prevent an explosion of regex matches, we recommend writing a very tight filter whenever a regex is used for operation mapping.
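The following sketch shows how such a regex rule might sit inside a full mapping clause. The filter, service name, and field names here are hypothetical, and the type: "field" subfield for a regex-based operationName is an assumption based on the rule quoted above; check your account's configuration reference before relying on it:

```
{
  filter: "url contains 'method=' request_time",   // tight filter (hypothetical)
  serviceName: {
    type: "fixed",
    source: "myService"                            // hypothetical service name
  },
  operationName: {
    type: "field",                                 // derive the name from a field
    source: "url",
    match: ".*method=(.*)",
    replace: "$1"
  },
  latency: {
    type: "latencyfield",
    source: "request_time",
    units: "seconds"
  },
  isError: "status > 399"
}
```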

serviceName and operationName: These two fields specify the names of the service and operation, respectively. Each contains a clause with two subfields, type and source. For service and operation names, the value of type is "fixed", and the source field defines the name of the service or operation. These names are visible on the Services tab of the Logs page.

For example, the above file specifies three operationNames (type: "fixed" for all 3; source: "trace", "addEvents", and "uploadLogs"), and two serviceNames (type: "fixed" for both; source: "openCensus" and "nginx2"). The Services tab of the Logs page will show the "openCensus" service with the "trace" operation under it, and the "nginx2" service with the "addEvents" and "uploadLogs" operations under it.

latency: This specifies which numeric field or fields to use as the measure of duration (latency). latency is a clause with subfields. When a single field in the (filtered) data can specify duration, the clause contains three subfields:

  • type: This will have a value of "latencyfield".
  • source: A numeric field in your (filtered) data containing the measure of latency for the respective operation.
  • units: The units of the metric specified in source. Values can be "seconds", "milliseconds", and "nanoseconds", or "s", "ms", and "ns".

For example, in the configuration file above, the "addEvents" and "uploadLogs" operations both have a request_time field in the (filtered) data which measures latency, in seconds. Thus type: "latencyfield", source: "request_time", and units: "seconds".
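In other words, units tells Scalyr how to normalize the source value into sca:latency, which is always in milliseconds. Conceptually, with an illustrative value:

```
request_time = 0.042            // seconds, as logged by nginx
sca:latency  = 0.042 * 1000     // = 42 milliseconds
```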

In some cases latency must be calculated from two numeric fields in your (filtered) data. The latency clause then specifies four subfields:

  • type: This will have a value of "rewritefield".
  • action: This will have a value of "timeDelta".
  • startTime: The field in your (filtered) data that contains the start time.
  • endTime: The field in your (filtered) data that contains the end time.

For example, in the configuration file above, the filtered data for the "trace" operation contains two fields, startTimeMs and endTimeMs, that can be used to calculate latency. Thus type: "rewritefield", action: "timeDelta", startTime: "startTimeMs", and endTime: "endTimeMs".
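Conceptually, timeDelta subtracts the start time from the end time. The values below are illustrative, and assume both fields are already in milliseconds:

```
startTimeMs = 1651000000000              // span start (illustrative)
endTimeMs   = 1651000000250              // span end (illustrative)
sca:latency = endTimeMs - startTimeMs    // = 250 milliseconds
```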

isError: Expressed in Scalyr Query Language, isError defines "error" and "success" for each request associated with an operation. For user-facing operations this tends to be a mapping of the field in your (filtered) data which contains the HTTP status code. In the configuration file above, all three operations define isError: "status > 399".

For machine-facing operations, isError tends to refer to a field logging the outcome of a request. For example, isError: "outcome == 'failure'".
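For instance, a complete mapping clause for such a machine-facing service might end with an isError expression of that form. Every name and field in this sketch is hypothetical:

```
{
  filter: "tag = 'queueWorker'",
  serviceName: { type: "fixed", source: "queueWorker" },
  operationName: { type: "fixed", source: "processJob" },
  latency: { type: "latencyfield", source: "elapsed_ms", units: "milliseconds" },
  isError: "outcome == 'failure'"    // outcome field logs the request's result
}
```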

Note that for each operation you only define Error and Duration from the RED (Rate, Error, Duration) triad. Scalyr calculates Rate from the number of requests per second associated with each operation.

Once your mappings are saved, it may take a few minutes for SPM graphs to populate on the Services tab of the Logs page.