Telemetry metrics roll-ups

Our Telemetry aggregation system has been serving us well for quite some time. As Telemetry evolved though, maintaining the codebase and adding new features such as keyed histograms has proven to be challenging. With the introduction of unified FHR/Telemetry, we decided to rewrite the aggregation pipeline with an updated set of requirements in mind.

Metrics

A ping is the data payload that clients submit to our server. The payload contains, among other things, over thousand metrics of the following types:

  • numerical, like e.g. startup time
  • categorical, like e.g. operating system name
  • distributional, like e.g. garbage collection timings

Distributions are implemented with histograms that come in different shapes and sizes. For example, a keyed histogram represent a collection of labelled histograms. It’s not rare for keyed histograms to have thousands of possible labels. For instance, MISBEHAVING_ADDONS_JANK_LEVEL, which measures the longest blocking operation performed by an add-on, has potentially a label for each extensions.

The main objective of the aggregator is to create time or build-id based aggregates by a set of dimensions:

  • channel, e.g. nightly
  • build-id or submission date
  • metric name, e.g. GC_MS
  • label, for keyed histograms
  • application name, e.g. Fennec
  • application version, e.g. 41
  • CPU architecture, e.g. x86_64
  • operating system, e.g. Windows
  • operating system version, e.g. 6.1
  • e10s enabled
  • process type, e.g. content or parent

As scalar and categorical metrics are converted to histograms during the aggregation, ultimately we display only distributions in our dashboard.

Raw Storage

We receive millions of pings each day over all our channels. A raw uncompressed ping has a size of over 100KB. Pings are sent to our edge servers and end up being stored in an immutable chunk of up to 300MB on S3, partitioned by submission date, application name, update channel, application version, and build id.

As we are currently collecting v4 submissions only on pre-release channels, we store about 700 GB per day; this considering only saved_session pings as those are the ones being aggregated. Once we start receiving data on the release channel as well we are likely going to double that number.

As soon as an immutable chunk is stored on S3, an AWS lambda function adds a corresponding entry to a SimpleDB index. The index allows Spark jobs to query the set of available pings by different criteria without the need of performing an expensive scan over S3.

Spark Aggregator

A daily scheduled Spark job performs the aggregation, on the data received the day before, by the set of dimensions mentioned above. We are likely going to move from a batch job to a streaming one in the future to reduce the latency from the time a ping is stored on S3 to the time its data appears in the dashboard.

Two kinds of aggregates are produced by aggregator:

  • submission date based
  • build-id based

Aggregates by build-id computed for a given submission date have to be added to the historical ones. As long as there are submissions coming from an old build of Firefox, we will keep receiving and aggregating data for it. The aggregation of the historical aggregates with the daily computed ones (i.e. partial aggregates) happens within a PostgreSQL database.

Database

There is only one type of table within the database which is partitioned by channel, version and build-id (or submission date depending on the aggregation type).

As PostgreSQL supports natively json blobs and arrays, it came naturally to express each row just as a couple of fields, one being a json object containing a set of dimensions and the other being an array representing the histogram. Adding a new dimension in the future should be rather painless as dimensions are not represented with columns.

When a new partial aggregate is pushed to the database, PostreSQL finds the current historical entry for that combination of dimensions, if it exists, and updates the current histogram by summing to it the partially aggregated histogram. In reality a temporary table is pushed to the database that contains all partial aggregates which is then merged with the historical aggregates, but the underlying logic remains the same.

As the database is usually queried by submission date or build-id and as there are milions of partial aggregates per day deriving from the possible combinations of dimensions, the table is partitioned the way it is to allow the upsert operation to be performed as fast as possible.

API

An inverted index on the json blobs allows to efficiently retrieve, and aggregate, all histograms matching a given filtering criteria.

For example,


select aggregate_histograms(histograms)
from build_id_nightly_41_20150602
where dimensions @> '{"metric": "SIMPLE_MEASURES_UPTIME",
                      "application": "Firefox",
                      "os": "Windows"}'::jsonb;

retrieves a list of histograms, one for each combination of dimensions matching the where clause, and adds them together producing a final histogram that represents the distribution of the uptime measure for Firefox users on the nightly Windows build created on the 2nd of June 2015.

Aggregates are made available through a HTTP API. For example, to retrieve the aggregated histogram for the GC_MS metric on Windows for build-ids of the 2015/06/15 and 2015/06/16:

curl -X GET "http://aggregates.telemetry.mozilla.org/aggregates_by/build_id/channels/nightly/?version=41&dates=20150615&metric=GC_MS&os=Windows_NT"

 which returns

{"buckets":[0, ..., 10000],
 "data":[{"date":"20150615",
          "count":239459,
          "histogram":[309, ..., 5047],
          "label":""}],
 "kind":"exponential",
 "description":"Time spent running JS GC (ms)"}

Dashboard

Our intern, Anthony Zhang, did a phenomenal job creating a nifty dashboard to display the aggregates. Even though it’s still under active development, it’s already functional and thanks to it we were able to spot a serious bug in the v2 aggregation pipeline.

It comes with two views, the histogram view designed for viewing distributions of measures:

histogramand an evolution view for viewing the evolution of aggregate values for measures over time:

evolutionAs we started aggregating data at the beginning of June, the evolution plot looks rightfully wacky before that date.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s