How We Monitor High Scale Search at a Glance

One of our key missions on the search team at Shutterstock is to constantly improve the reliability and speed of our search system.  To do this well, we need to be able to measure many aspects of our system’s health.  In this post we’ll go into some of the key metrics that we use at Shutterstock to measure the overall health of our search system.

The image above shows our search team’s main health dashboard.  Anytime we get an alert, a single glance at this dashboard can usually point us toward which part of the system is failing.

On a high level, the health metrics for our search system focus on its ability to respond to search requests, and its ability to index new content.  Each of these capabilities is handled by several different systems working together, and requires a handful of core metrics to monitor its end-to-end functionality.

One of our key metrics is the rate of traffic that the search service is currently receiving.  Since our search service serves traffic from multiple sites, we also have other dashboards that break down those metrics further for each site.  In addition to the total number of requests we see, we also measure the rate of memcache hits and misses, the error rate, and the number of searches returning zero results.

One of the most critical metrics we focus on is our search service latency.  This varies greatly depending on the type of query, number of results, and type of sort order being used, so this metric is also broken down into more detail in other dashboards.  For the most part we aim to maintain response times of 300ms or less for 95% of our queries.  Our search service runs a number of different processes before running a query on our Solr pool– language identification, spellcheck, translation, etc, so this latency represents the sum total of all those processes.

In addition to search service latency, we also track latency on our Solr cluster itself.  Our Solr pool will only see queries that did not have a hit in memcache, so the queries that run there may be a little slower on average.

When something in the search service fails or times out, we also track the rate of each type of error that the search service may return.  At any time there’s always a steady stream of garbage traffic from bots generating queries that may error out, so there’s a small but consistent stream of failed queries.  If a search service node is restarted we may also see a blip in HTTP 502 errors, although that’s a problem we’re trying to address by improving our load balancer’s responsiveness in taking nodes out of the pool before they’re about to go down.

A big part of the overall health of our system also includes making sure that we’re serving up new content in a timely manner.  Another graph on our dashboard tracks the volume and burndown of items in our message queues which serves as our pipeline for ingesting new images, videos, and other assets into our Solr index.  This ensures that content is making it into our indexing pipeline, where all the data needed to make it searchable is processed.  If the indexing system stops being able to process data, then that will usually cause the burndown rate of each queue to come to a halt.

There’s other ways in which our indexing pipeline may fail too, so we also have another metric that measures the amount of content that is making it through our indexing system, getting into Solr, and showing up in the actual output of Solr queries.  Each document that goes into Solr receives a timestamp when it was indexed.  One of our monitoring scripts then polls Solr at regular intervals to see how many documents were added or modified in a recent window of time.  This helps us serve our contributors well by making sure that their new content is being made available to customers in a timely manner.

Behind the scenes we also have a whole host of other dashboards that break out the health and performance of each system covered in this dashboard, as well as metrics for other services in our search ecosystem.  When we’re deploying new features or troubleshooting issues, having metrics like these helps us very quickly determine what the impact is and guides us to quickly resolving it.

This article first appeared on Shutterbits.