Etsy - A Deep Dive into Monitoring with Skyline by Abe Stanway - Transcript

(Original post with video of talk here)

Abe Stanway: Okay. Hi, I’m Abe. I’m a data engineer at Etsy and today we’re going to talk about Skyline, of which I was the primary author. And so we’re going to talk about how we monitor, why we decided to build this, and how it advances the art of monitoring. So let’s start.

So Etsy is the world’s handmade and vintage marketplace. We are based right here in Dumbo, so this wasn’t too much of a pain to get up here. We have a large stack. We’ve got a lot of stuff going on. Specifically, or actually not specifically at all, these are just some of the numbers for some of the servers that we’re dealing with – 41 MySQL shards, 24 API servers, 72 web servers, 42 Gearman boxes, a 150-node Hadoop cluster, 15 memcached boxes, around 60 search machines, and a lot more than that – probably on the order of another hundred to two hundred boxes, for sure, running the various other services that come with everything we have.

And that’s not to mention the app itself, which is running on top of all these machines, and all the services that are actually running on these machines. In addition to that, we practice something called continuous deployment, which is kind of the new hotness that’s developed alongside DevOps, right. It means always deploying, every single day – we deploy around thirty to sixty times a day, every day, and we make this really, really easy to do for all our engineers.

Deploying, of course, means a lot of different things to a lot of different people. In our case, and probably everyone’s case, it means taking new code and putting it on your web servers to be served. And this can get tricky because people are always visiting your site, all the time, and in Etsy’s case we’re always making money, every second of the day. Sometimes we’re making more money than at other times, but we’re always making money, and so, you know, if we want to make our services better we have to figure out a way to deploy without taking the site down. And so we have this whole continuous deployment infrastructure that lets us do that efficiently and quickly, so that when someone has an idea to make their feature – the one that they own – better, they can update it, push a button and deploy it, and the whole process takes about ten minutes. So we’re very, very proud of that.

Of course with this process, the whole point is democratizing deploys. It’s not like you take your code changes and then go hand them to the high priest of deployment, somewhere deep in the bowels of the ops department. Now everyone deploys. At Etsy we’ve got about 250 committers and not all of those people are engineers. We have PMs who deploy, designers who deploy. In fact it’s so easy to deploy that on your first day, you’re expected to deploy new code to the web stack, like putting your face on the About page. And the way you do that is with a big green button – you literally just press ‘Deploy,’ and then it does it. And it’s terrifying.

And so, you’re probably thinking: we’ve got hundreds of these boxes, these servers, they’re hosting constantly evolving code that’s changing up to 60 times a day. It must be a miracle that we stay up at all. How do we not burst into a pile of flames, given that there are so many haphazard ways for things to go wrong? Well, we don’t optimize for avoiding failure; we optimize for recovering from failure.

The concept here is called mean time to failure versus mean time to recovery. And it’s a spectrum. If you optimize for mean time to failure, because you want to never fail, that usually means your mean time to recovery is very, very long. So for example, we really, really heavily protect our credit card stuff; that’s what all the PCI compliance is for, because the mean time to recovery there is really, really long and painful. And so it’s much less painful to just optimize for increasing the time to failure than for reducing the recovery time, because once you lose credit card information you’re in a lot of trouble.

However, if you’re dealing with stuff in the web stack, that is easier to fix: if a process dies you just bring it back up. If there’s a bug in your code you redeploy five minutes later – you fix it. This is optimizing for mean time to recovery. So we anticipate failure; we anticipate that things are going to go wrong. We anticipate that boxes are going to have hard drive failures, and that human developers are not infallible and that they will introduce bugs into the code despite their best efforts. But we don’t penalize them for that. We look at it from a systemic view, and we try to make it as fast as possible to recover once we have detected the failure. And that last part is crucial – detecting the failure. How do you detect failures? How do you know when something’s going wrong, when you have all these different people doing all these different things, on all these different services and servers?

And to that end, you have to realize that you can’t fix what you don’t measure, so we’ve taken a pretty hardline stance on this: measuring every single thing in our infrastructure. To do that we’ve got a bunch of tools – some of them you may be familiar with, and some of them you won’t be. On the not-homemade side, we use Ganglia for deeper operational-level metrics, Graphite as a time series database to store our time series, and Nagios for alerting. We use StatsD, which you probably know about, which is an aggregating daemon – we’ll go into it in a bit. Then there’s Supergrep – which I’ll explain – and then Skyline and Oculus, which are what this talk is about.

So we’ll start with the most basic thing: Supergrep. Supergrep is a really, really, really simple tool that allows us to detect errors when they occur. And all it is is a node app that’s tailing log files, literally. It’s tail -f on your Apache access logs and whatever other logs you want, piped through grep -v to strip out the things you already know are going to happen, and then grep for ‘error’ – error, failure, whatever indications of failure there are in the logs. You just load this up in a web page and the log lines go streaming by. And then when there’s an error in there – since the whole site is being trafficked very heavily – the entire page gets flooded red with the error, and you’ll know something’s wrong. The shorthand for that is ‘Supergrep is angry.’ So this is a really, really, really simple way to know if there are errors being thrown. And when there are errors being thrown you can go fix them. That’s how we do it.
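
To make the idea concrete (Supergrep itself is a node app and isn’t shown here), below is a minimal Python sketch of the same tail-and-filter pattern; the log path, the noise list and the error markers are all hypothetical placeholders, not Etsy’s actual setup:

```python
import time

EXPECTED_NOISE = ("favicon.ico",)            # things you already know will show up
ERROR_MARKERS = ("error", "fatal", " 500 ")  # whatever indicates failure in your logs

def follow(path):
    """Yield lines appended to the file, like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.1)
                continue
            yield line

for line in follow("/var/log/httpd/access_log"):  # hypothetical path
    lowered = line.lower()
    if any(noise in lowered for noise in EXPECTED_NOISE):
        continue  # the `grep -v` step: drop known noise
    if any(marker in lowered for marker in ERROR_MARKERS):
        print(line, end="")  # this is what floods the Supergrep page red
```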

So if you push you kind of go hang out with Supergrep, and you look to see if everything looks good. And then you have more confidence that your push didn’t break everything.

But that’s not really enough, because there are certainly many ways for a service to break, to be non-functioning, to have failed, without it throwing a specific error telling you that it’s failing. There are a lot more subtle ways for things to go wrong. And one of the ways you’ll know about those is with StatsD. What we do with StatsD – and this is the mythical monitoring desk that I referred to in the abstract for the talk – is also pretty simple. It’s a node app, about a three-hundred-line node file. It just sits there and it listens for pings. These pings look like this in your code.

So you’re sitting there in your application logic, and when you do some particular thing, you call StatsD – increment whatever your metric is, with a namespace in the name. We’ve got a bunch of different ways to record things – counters, gauges, timers for speed metrics – but they all basically do the same thing. And the purpose is to graph metrics that otherwise would not be graphed by Supergrep or by Ganglia. So they tend to cover application-level stuff, but not exclusively – you can throw a StatsD call anywhere there’s code running. And then it’ll instantly get aggregated by the StatsD daemon running on one box, and then it’ll get flushed to Graphite. And so, with one line of code, you’ve got an entire historical Graphite series from that point forward.
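
As a rough illustration of what those calls look like – using the Python statsd client here rather than Etsy’s actual application code, and with made-up metric names and business logic:

```python
import statsd

stats = statsd.StatsClient("localhost", 8125)  # StatsD's default UDP port

def handle_checkout(order):
    stats.incr("checkout.attempted")            # counter: one more attempt
    with stats.timer("checkout.duration"):      # timer: how long this took
        result = process_order(order)           # hypothetical business logic
    stats.incr("checkout.succeeded" if result else "checkout.failed")
    return result
```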

And then once it’s in Graphite, rollup is taken care of, aggregation is taken care of, you’ve got all these Graphite functions, you can render a graph of the data, or you can pull it out as JSON and do all the scripting stuff you want. All of that comes from the StatsD integration.

Like I said before, this comes from the philosophy of “If it moves, graph it.” If anything is moving in our infrastructure, we want to graph it. Even if you don’t think it’s important, even if you don’t think it’s a business-level metric that could possibly ever be involved in a failure, you want to graph it, because you don’t know what’s going to happen. Even if it doesn’t move, you still want a graph of it. Put a StatsD call in, because it’s so easy. There’s no friction to doing it, and that’s by design – zero friction. And so this way we have these metrics infiltrating everything in our stack. Once you have those metrics, you put them in dashboards. We have a dashboarding app that we open sourced for our own use; nothing special. It’s just a wrapper around Graphite, and what we do with it is each team has the ability to make their own dashboards, where they cherry-pick the most important metrics that they think are relevant to them and put them in dashboards for easy access.

That was the flow before Skyline. Of course, the problem with dashboards is that when you have an avalanche of metrics like we have – and the specific number is a quarter million, between StatsD and Ganglia and a couple of other metric namespaces, a quarter million total – that’s an absurd amount. It’s way too much for dashboards. And we’ve got 250 committers. If we were to have every committer looking at graphs all day, that’s a graph-to-engineer ratio of a thousand to one. You’d have to have every one of your engineers looking at a thousand graphs every day just to monitor everything that you have metrics for. So dashboards really don’t cut it at that scale.

And so what you end up with is a situation where things can be going wrong and no one’s seeing it, because: A, there’s no dashboard for it; or B, if there is a dashboard, no one’s looking at the dashboard. And if no one’s looking at the dashboard, or is otherwise not aware of this metric that’s spiking or doing something it’s not supposed to be doing, no one can act on it and either decide it’s a failure and fix it, or decide that it’s normal and not fix it, but at least be aware that that is what it looks like.

To kind of counteract this, we do use Nagios; we’re very heavy users of Nagios – or ‘nachos,’ if that’s how you pronounce it. But Nagios sucks. I can’t even tell you how much Nagios sucks. There’s a lot of friction to developing a Nagios plugin, and you don’t necessarily have confidence that the plugin works: you can write the plugin and it’ll pass because you think it’s right, but to really verify it you’d have to take your service down and see if it fires, and you can’t do that very often. So you end up with a check that you only think is working. That’s dangerous. Nagios is also stupid – the thresholds are things you manually set. Say you’ve got disk space slowly filling up on one box, and you give it a threshold – 85% – alert me when this happens, so I can go clean the box up. And then Nagios decides to wake you up at three in the morning, because: “Look at the disk space, it’s at 85% now.” But that’s been slowly climbing for three years; it did not have to wake you up at that particular moment.

So it doesn’t know. It really doesn’t know what normal is, or how it should structure its alerts, and you can’t just tell it to; that’s really hard. So Skyline was meant to take this the next step forward. Of course, there obviously aren’t Nagios alerts for everything. We have this huge pile of metrics and there isn’t a hundred percent Nagios coverage for them. There is an argument that says maybe you don’t need a hundred percent coverage – that it’s really only worth monitoring the things you already know about – but the design of Skyline is to take the long tail of everything, to just say anything could be relevant. Err on the side of noise, and surface the metrics that are spiking. That’s all it does. And then it’s up to you, as the engineer, to act on those, if you choose to.

Kale is the thing that we came up with to do this, and it’s composed of both Skyline and Oculus. Skyline is the anomaly detection part of things and Oculus is the metrics correlation part of things. We’re not going to talk about Oculus today, but the way it works is: you take a metric and you can see what other metrics at that time look similar to it – literally by how the graph looks – with the logic being: if something’s messing up, it’s probably not messing up in isolation, and there are probably other things that are also messing up. That’s to help you get a broader picture of the problem, diagnose it, and then fix it. But before you can do that, you have to know that you have something wrong in the first place, and that’s where Skyline comes in.

So the question becomes: how do you analyze a timeseries for anomalies, in real time? And it’s actually pretty easy – you could just make a lot of HTTP requests to the Graphite API. Graphite has a really, really awesome API that you can hit and get perfectly structured JSON data back; you can take that JSON, run algorithms on it, and just keep hitting it once every ten seconds or so, for as many metrics as you want. That’s one way to do it, and that’s essentially the job Skyline exists to do.
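
A minimal sketch of that naive polling approach, assuming a reachable Graphite host and a couple of made-up metric names – this is exactly the approach that stops working in the next paragraph:

```python
import time
import requests

GRAPHITE = "http://graphite.example.com"              # assumption for illustration
METRICS = ["stats.web.requests", "stats.web.errors"]  # hypothetical targets

def fetch_values(target):
    """Graphite's render API returns [{"target": ..., "datapoints": [[value, ts], ...]}]."""
    resp = requests.get(GRAPHITE + "/render",
                        params={"target": target, "from": "-1hours", "format": "json"})
    return [v for v, _ts in resp.json()[0]["datapoints"] if v is not None]

def looks_anomalous(values):
    """Crude three-sigma check on the latest datapoint."""
    if len(values) < 30:
        return False
    mean = sum(values) / len(values)
    stdev = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return stdev > 0 and abs(values[-1] - mean) > 3 * stdev

while True:
    for metric in METRICS:
        if looks_anomalous(fetch_values(metric)):
            print("anomaly:", metric)
    time.sleep(10)  # fine for a handful of metrics; hopeless for a quarter million
```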

But that doesn’t work when you hit our scale, when you hit a quarter million metrics, because Graphite goes to disk to retrieve historical data, and it can’t handle a quarter million requests for all these timeseries every ten seconds, if that’s the resolution you want. It’s not going to happen. So you need a dedicated service. That’s what Skyline is.

So, to reiterate, Skyline is a real-time anomaly detection system. Let’s dig into that. ‘Real time’ means a lot of different things in a lot of different contexts, and in this context I’m kind of lying. It’s real time enough, on two levels. The first level is that we use a StatsD flush resolution of ten seconds, which means that no matter how many StatsD calls it receives, it flushes to Graphite every ten seconds. So you’re not going to get higher-resolution StatsD data, unless you specifically set it to one second. But the convention with StatsD is to stick with ten seconds.

With Ganglia, we use one-minute resolution, so you’re not going to get better than one-minute resolution there. You can tweak this, but we don’t; that’s something we’re working on. So best case scenario, even if you could analyze, process and store instantly, it’s going to be ten seconds behind for the StatsD metrics and one minute for the Ganglia ones. And of course you’re not doing just that; you’re also processing them. You’re going to store them somehow, and then take them out and run your algorithms on them. Right now that takes about 70 seconds with our throughput. That’s not so bad – it’s really not bad. And it’s even better when you consider that it’s still faster than you would have discovered the problem otherwise, and that’s really where it shines. As long as it’s faster than you would have discovered it otherwise – which in most of these cases is never – it’s good enough. I’d like to think we can optimize it further, but it’s still an improvement.

So, on to the architecture. When you’re talking about making all these requests and playing with data at the speed that we would like, memory is a much better storage medium than disk – which is why Graphite wouldn’t work in the first place: it stores its metrics in whisper files, which is its own database format on disk. And we don’t want to have to go to disk to get a particular timeseries out; we want it already in memory, easily accessible. So we chose Redis to do this. Redis has a great API. It’s rock solid, at least for what we’re doing with it. We don’t use Redis Cluster yet, which I don’t think is mature at this point. But it’s got a lot of cool data structures, a really great community, and it’s a really solid tool, so we were excited to use it. And the question then becomes: how do you get a quarter million timeseries into Redis, on time?

And the answer is to stream it – if you can find a stream. So we looked around our stack for a stream of metrics, because how else are you going to get all these varied Ganglia metrics and StatsD metrics into one place? We didn’t know what to do at first, but then we looked at Graphite and realized that everything is already going there, so we’d go with Graphite. Cool. In addition to that, we also use this Graphite relay; it’s called carbon-relay. What it’s built for is redundancy. So you have your StatsD and your Ganglia – StatsD is already one layer of aggregation and it’s sending everything to Graphite, and Ganglia’s sending everything too. And then Graphite takes all that stuff on its carbon listener and stores it. If you do it that way, that’s cool, but then if that box goes down one day, you’ve just lost all your metrics and you’re blind. So you need a backup.

So what we do is we have a relay listening. Our stream – StatsD and Ganglia – comes into the relay, and the relay will immediately, upon receiving the metrics, send them over TCP to a backup Graphite and to the local Graphite on the same box. So now you’re essentially duplicating everything to two Graphites. That way, you have redundancy. And the way it does this is by sending pickles over the wire to the other, backup Graphite. So if one box goes down, if you have a problem, you have another one; you can just bring it back up and keep going, keep deploying, keep operating.

And so we saw this and we were like: okay, we can just trick Graphite and say, instead of relaying to a backup Graphite, relay to Skyline. Skyline’s perfectly capable of listening on a port, which is what Graphite does, and consuming data, which is exactly what Graphite does. So we just speak the Graphite protocol – the relay sends pickles down a TCP connection, and we just have a listener that unpickles them – and the relay doesn’t know that it isn’t another Graphite on the other end. So the relay is now sending to three things: the original carbon-cache listener on localhost, the redundant backup Graphite on the other box, and Skyline. We’ve got three separate streams. The upside is that Skyline now has this awesome stream of data, and that includes Ganglia. And that’s great, because out of the box Ganglia just writes RRD files, so the stream saves us the whole step of going in and importing everything from the RRD files. Cool.
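
A minimal sketch of what posing as “another Graphite” amounts to: a TCP listener that reads the relay’s length-prefixed pickles and unpacks the (metric, (timestamp, value)) tuples. The port number and the store() stub are placeholders for illustration:

```python
import pickle
import socketserver
import struct

class MetricHandler(socketserver.BaseRequestHandler):
    def handle(self):
        buf = b""
        while True:
            chunk = self.request.recv(4096)
            if not chunk:
                break
            buf += chunk
            # carbon's pickle protocol: a 4-byte big-endian length header,
            # then a pickled list of (metric, (timestamp, value)) tuples.
            while len(buf) >= 4:
                (length,) = struct.unpack("!L", buf[:4])
                if len(buf) < 4 + length:
                    break  # wait for the rest of this message
                payload, buf = buf[4:4 + length], buf[4 + length:]
                for metric, (timestamp, value) in pickle.loads(payload):
                    store(metric, timestamp, value)

def store(metric, timestamp, value):
    print(metric, timestamp, value)  # stand-in for the Redis append described below

if __name__ == "__main__":
    socketserver.ThreadingTCPServer(("0.0.0.0", 2024), MetricHandler).serve_forever()
```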

So we have our stream and it’s coming into our box. We’ve solved the hard part – maybe. Now what do we do with the data? We have to keep it somewhere. Unless we’re going to do strictly online stream processing – which we’re not going to talk about – we have to store the data somewhere, to then take it out and analyze it over some duration. And when you think about doing this – we’re keeping it in memory, we’re keeping it in Redis – you’ve got two things in mind. You have to minimize both I/O and your memory footprint, both for fairly obvious reasons. I/O, because you’re constantly putting things in and taking them out again for the analysis stage, and Redis can only handle so much. Even though it’s really, really good, it can only handle so much.

Similarly with memory: you just have a limited amount of it. You’re storing a limited number of metrics for a limited duration, and you want each one to be as small as possible so that you can pack as much as possible within those confines. So to do this, we chose to store the data as strings in Redis, and that might seem kind of counterintuitive – Redis has a lot of cool data structures that might seem better suited for storing timeseries, like lists, for one. But the problem with lists is that taking them out isn’t a cheap constant-time operation, which is not ideal. Whereas with strings, it’s constant-time appends and constant-time retrievals, which is awesome. Actually, as the strings get longer and longer Redis doesn’t handle them all that well, but it’s still orders of magnitude better than any other data structure in Redis, so we deal with it. And then we use APPEND – so, normally you would have to do a GET and then a SET to update a series; APPEND is an atomic operation and the datapoints are guaranteed to stay in order, so it’s super, super awesome, and it cuts our operations in half. We only GET when we want to analyze a timeseries, and we just APPEND when we want to update it.

So we have our strings in Redis, and now we have to think about how we’re going to structure this. You can’t just store raw data in a string; you need a serialization format. So you might think that JSON is the answer, and we did think that for a while. It’s straightforward, and it’s easily appendable: say a datapoint comes in and that’s our series. Another datapoint comes in, you just append it on the end, and then another comes in and you append that, and you’ve almost got JSON there. When you want to take it out, you just grab the whole metric, append brackets on either end, and then it’s valid JSON that can be decoded in whatever language you’re dealing with.

This seemed all right, but we found that over half our CPU time was spent decoding JSON, which is a silly, silly problem to have. We want to save CPU cycles for the hard things, like analyzing the actual timeseries – things that are harder to optimize. JSON decoding is just a low-level problem that shouldn’t be eating that. And we can look into why that is. So here’s some JSON data – it’s an array with two things in it – and if we break it down, we realize that the things we care about are in the minority. We only care about the 1 and the 2. We don’t care about the comma or the brackets, but that stuff is there because the CPU needs it to decode the JSON and turn it into an actual object in your language. And the way it does this is it does a scan. It looks at: okay, this is a bracket; this is a 1; this is a comma, so there’s a new element; this is a 2; I still don’t know what’s going on. Oh, this is a closing bracket – okay. Is there anything else? No. Okay, good, I have an array. Bam! What it’s done there is scan the entire string. And there’s no metadata telling it how much is coming or how much memory to pre-allocate, because it doesn’t know. That’s a waste of CPU cycles.

So we looked around for formats that do have metadata in them, and we found MessagePack, which is really, really awesome. It’s a binary serialization protocol. It’s in the same class of things as Protobuf, Avro and Thrift, which are all meant for RPC, for communication – they’re not really meant for storage – but they do solve the problems we had, which were speed of decoding and size. Because when you want to send something over the wire, you want it to be as small as possible and you want it to be easily understood at both ends. So this actually solved our problem, and we can look at what MessagePack looks like. It has metadata, and the metadata consists of headers. The header we care about here is the size of the array, which can be a 16 or 32 bit big-endian integer that basically tells the CPU: okay, this is how big my array is. You have to obey that, obviously, but then you just append your two things after it and that’s the whole thing. So now the bytes we actually care about are the majority of the data we’re storing – we’re storing a minimum of metadata – and MessagePack also supports appending, so you just keep appending onto the string, and you can unpack it straight back out, which is really super cool. And the result of switching over from JSON to MessagePack is that it cut the decode time in half. It’s still a big cost for us, and until something better comes along we’re stuck with it, but it’s significantly better. So I would recommend, in just about every situation where you think you’re using JSON, to use MessagePack instead, unless you’re dealing with the browser. For anything internal there’s no reason for JSON, and it’s really a drop-in replacement. The people who wrote the MessagePack APIs wrote them with that in mind – to just replace your JSON with it.
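
A minimal sketch of that storage scheme, assuming the Python `redis` and `msgpack` packages and a made-up key prefix: each datapoint is packed and APPENDed onto one Redis string per metric, and reading it back is one GET plus a streaming unpack:

```python
import msgpack
import redis

r = redis.StrictRedis(host="localhost", port=6379)

def append_datapoint(metric, timestamp, value):
    # One atomic APPEND per datapoint -- no GET/SET round trip.
    r.append("metrics." + metric, msgpack.packb([timestamp, value]))

def read_timeseries(metric):
    raw = r.get("metrics." + metric) or b""
    unpacker = msgpack.Unpacker()
    unpacker.feed(raw)     # concatenated MessagePack objects unpack one after another
    return list(unpacker)  # [[timestamp, value], [timestamp, value], ...]

append_datapoint("stats.web.requests", 1370000000, 42)
append_datapoint("stats.web.requests", 1370000010, 45)
print(read_timeseries("stats.web.requests"))
```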

So now we’ve got things coming in, they’re being stored as timeseries in Redis, and they’re being appended – and if we let that go on forever we’re going to run out of memory, because we just keep appending and appending. We’re not doing a GET and trimming on every write, because that’s more I/O. So we have a process called the Roomba that goes through occasionally and cleans the data. It does have to do a GET, but it runs much less often, so the cost is kind of negligible. It runs through, gets everything, trims off the excess duration, and puts it back again. And in this way, we store a continuous rolling duration of 24 hours for each metric.
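
A minimal sketch of that Roomba-style cleanup, reusing the Redis/MessagePack layout from the sketch above (and ignoring the race with concurrent writers that a real implementation has to worry about):

```python
import time
import msgpack
import redis

r = redis.StrictRedis()
FULL_DURATION = 24 * 60 * 60  # keep a rolling 24 hours per metric

def trim(key):
    raw = r.get(key)
    if raw is None:
        return
    unpacker = msgpack.Unpacker()
    unpacker.feed(raw)
    horizon = time.time() - FULL_DURATION
    kept = [point for point in unpacker if point[0] > horizon]  # point = [timestamp, value]
    r.set(key, b"".join(msgpack.packb(point) for point in kept))

for key in r.keys("metrics.*"):  # run this occasionally, not on every write
    trim(key)
```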

Unfortunately we can’t go much above that, because then your performance really starts to suffer. We wrote this in Python, as I mentioned before. It might seem like a bad idea to write a heavily concurrent architecture in Python, but we chose it because it has really, really good math options, like NumPy, SciPy, statsmodels and pandas. Those are really lively libraries and they’re pretty good. And so even though Python’s not great for parallelism and concurrency, we had more faith in our skills as engineers than as statisticians, so we opted to slog through the engineering in favor of an easier time handling the math. That’s why we chose it.

The rest of the architecture is pretty simple. That’s the entire consumption side, and on the other side we have an Analyzer, which is based on a simple map-reduce-ish design. It basically spins up a bunch of processes – however many cores you have is how many processes you should have – and then, at the beginning of each run, it assigns all the keys in Redis across the processes, and each process is responsible for decoding its slice of keys and analyzing them and everything. At the end of the run, all the results get compiled into one JSON blob that the frontend goes and fetches every couple of seconds. So in this way it’s refreshing, and you’re seeing a constantly updated view of what is anomalous.
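
A minimal sketch of that shape – one worker per core, each taking a slice of the Redis keys, with the combined results dumped as a single JSON blob for the frontend to poll. The key prefix, output path and the crude three-sigma check are stand-ins for the real thing:

```python
import json
import multiprocessing
import statistics

import msgpack
import redis

def analyze_slice(keys):
    r = redis.StrictRedis()
    anomalies = []
    for key in keys:
        unpacker = msgpack.Unpacker()
        unpacker.feed(r.get(key) or b"")
        values = [v for _ts, v in unpacker]
        if len(values) > 30:
            mean, stdev = statistics.mean(values), statistics.pstdev(values)
            if stdev and abs(values[-1] - mean) > 3 * stdev:
                anomalies.append({"metric": key.decode(), "datapoint": values[-1]})
    return anomalies

if __name__ == "__main__":
    keys = redis.StrictRedis().keys("metrics.*")
    cores = multiprocessing.cpu_count()
    slices = [keys[i::cores] for i in range(cores)]   # one slice of keys per process
    with multiprocessing.Pool(cores) as pool:
        results = pool.map(analyze_slice, slices)
    with open("/tmp/anomalies.json", "w") as f:
        json.dump([a for chunk in results for a in chunk], f)  # frontend polls this file
```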

So here’s what that looks like. We’ll do a demo in a bit so that you get a taste of it. All the metric names are there, so you can go through them, scroll through them. The actual datapoint that triggered detection is there too – that’s what the red line is, so you can visualize what the computer is thinking; it makes it easier. And then you get two views of your metric. The first view is the last hour, and the second view is the last day. That’s just for more context, because oftentimes what looks anomalous within an hour does not look anomalous when you consider the broader context, and vice versa. So you get both views. And again, we only store about 24 hours – exactly 24 hours. Cool.

Okay. So I’ve been talking about anomalies; now let’s define what it means to be anomalous. The way we handle the actual analysis part is through a consensus model. The way that works is it’s a vote, basically. We have a bunch of algorithms, each timeseries is run through all of them, and each one returns True or False – this is anomalous, this is not anomalous. What you go with is the majority of them. You can set this threshold yourself – like if six say yes, it’s an anomaly, or seven if you want to be stricter. If the vote clears that bar, we consider it an anomaly and we surface it.
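
In code, the consensus model is basically just counting votes. A minimal sketch, where each algorithm is any function that takes a timeseries and returns True or False (the individual checks are sketched in the next few sections):

```python
def run_consensus(timeseries, algorithms, consensus):
    """Flag the series only if at least `consensus` of the algorithms vote True."""
    votes = sum(1 for algorithm in algorithms if algorithm(timeseries))
    return votes >= consensus

# Usage, assuming the algorithm functions sketched below (with more algorithms
# in play, a threshold like six or seven is what the talk describes):
#   run_consensus(series, [stddev_from_average, histogram_bins,
#                          least_squares, median_absolute_deviation], consensus=3)
```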

The reason we do this is because it helps with mismatches between the model and the timeseries – you know, the actual data – and whatever your algorithm assumes about it. Like, one idea that was proposed was: why don’t we go the finance route and just make it strictly a forecasting problem – you do an ensemble of forecasts, average them out, and that’s your forecast, and if the actual value is significantly different from that, then that’s your anomaly. But the problem with that is that it’s opinionated. We found that we had so many different kinds of data that it didn’t really work in all cases, and that really limits what you can do. You can still do forecasts within this consensus approach, but what you can’t do is get a sense of how anomalous something is, because these are all different algorithms – how anomalous something is under one algorithm doesn’t really relate to how anomalous it seems under another.

That’s one downside of doing it this way, but the upside is that you can implement anything you can get your hands on, as long as it returns True or False. So that’s how we opted to do it.

The basic idea behind the algorithms that we have implemented at this point is as follows: a metric is anomalous if its latest datapoint is more than three standard deviations away from its moving average. So you’ve got your timeseries, and you’ve got the standard deviation of the timeseries, which captures the normal variation, the normal noise. You’ve got one standard deviation up and down, two standard deviations, and if your latest datapoint comes in way up here, or way down here – if its distance from the moving average is more than three standard deviations above or below – you’d be justified in saying this is an anomaly and I want to know about it.
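
A minimal sketch of that rule, with the timeseries as a list of (timestamp, value) pairs as in the storage sketches above (a plain mean stands in for the moving average here; the exponentially weighted variant comes later):

```python
import statistics

def stddev_from_average(timeseries):
    values = [v for _ts, v in timeseries]
    if len(values) < 30:
        return False  # not enough data to say anything sensible
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return False
    # Anomalous if the latest datapoint is more than three standard
    # deviations away from the average.
    return abs(values[-1] - mean) > 3 * stdev
```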

This comes from Statistical Process Control, which was invented in the 1920s, I think, by a guy named Walter Shewhart. This was when assembly lines were coming along; these were huge operational changes, you don’t know which parts are working, so you have what are called control charts, which are essentially exactly this. You have bands of normal variation, and then when your screw or whatever falls outside them – when the width of the hole that’s being drilled is outside that line – you know you have an anomaly and you should fix something. And to visualize this one more time: here’s your regular bell curve. This is the normal distribution, and your mean is right there in the middle. These are values of the data, and when your datapoint lies outside the three-sigma bounds, that’s how you know you’ve got an anomaly – in the best-case scenario.

To implement this, we riffed on it; we approximate it in a bunch of different ways. The first is histogram binning. We take the data – here’s a real timeseries from Graphite that got surfaced by Skyline a couple of hours ago while I was making slides – and we take our most recent datapoint. It’s drawn poorly, you can’t really see it, but you’ve got your datapoint there, and it’s about 40, okay. And then you make a histogram of the entire timeseries, which ends up looking something like a distribution – histograms are distributions of the data. A smaller bar means fewer values fell into that bin. So a value around 40 or 45 out here is in what would be the three-sigma area of this distribution, because the bars out there are really, really small. And so to check, you just check which bin the datapoint – the 40 – falls in, and you see it right there: it’s a really small bin, so that’s an anomaly.
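
A minimal sketch of the histogram-binning check – bin the series, find the bin the latest datapoint lands in, and flag it if that bin is nearly empty. The bin count and the “nearly empty” cutoff are arbitrary illustration choices:

```python
import numpy as np

def histogram_bins(timeseries):
    values = np.array([v for _ts, v in timeseries], dtype=float)
    if len(values) < 30:
        return False
    counts, edges = np.histogram(values, bins=15)
    latest = values[-1]
    # Which bin does the latest datapoint fall into?
    index = int(np.searchsorted(edges, latest, side="right")) - 1
    index = min(max(index, 0), len(counts) - 1)
    # A bin containing (almost) nothing but the latest point means it's rare.
    return counts[index] <= 1
```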

Another way to do it is with linear regression. Take the best-fit line using ordinary least squares, a very common linear regression algorithm. So you have your data again, and you fit a regression line to it. There are problems with this, which I’ll go into in a bit, but bear with me for now. You fit your regression line and then you take the residuals, which are the errors of each datapoint against the regression line. You take all of those, find their distribution, and then you get your three-sigma bound – three standard deviations of the residuals – and that point way out there would be a big anomaly. So that’s another way to do it.
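
A minimal sketch of that check: fit an ordinary least squares line, look at the residuals, and flag the latest datapoint if its residual is more than three standard deviations out:

```python
import numpy as np

def least_squares(timeseries):
    x = np.array([ts for ts, _v in timeseries], dtype=float)
    y = np.array([v for _ts, v in timeseries], dtype=float)
    if len(y) < 30:
        return False
    slope, intercept = np.polyfit(x, y, 1)    # the best-fit regression line
    residuals = y - (slope * x + intercept)   # each point's error against that line
    stdev = residuals.std()
    if stdev == 0:
        return False
    return abs(residuals[-1]) > 3 * stdev
```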

A third way is median absolute deviation, where you take the median of your timeseries – the middle value, which will usually sit in the tallest bin of the histogram – and then you calculate the residuals with respect to that. So you calculate how far off every other value is from the median, and then you take the median of those deviations. And if the latest point is outside three sigma, or whatever test statistic you want to use, then it’s an anomaly again. So these are all different ways to do it.
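
And a minimal sketch of the median-absolute-deviation check, using a factor of three to mirror the three-sigma style of the others (the real cutoff is a tunable choice):

```python
import numpy as np

def median_absolute_deviation(timeseries):
    values = np.array([v for _ts, v in timeseries], dtype=float)
    if len(values) < 30:
        return False
    deviations = np.abs(values - np.median(values))  # how far each point is from the median
    mad = np.median(deviations)                      # the median of those deviations
    if mad == 0:
        return False
    # How extreme is the latest datapoint, measured in units of the MAD?
    return deviations[-1] / mad > 3
```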

I think I have one more here. All right, so this is a broader technique that we use too. Instead of using plain averages, it’s generally recommended to use exponentially weighted moving averages, and that’s because data – monitoring data, operational data – gets less valuable as it gets older. When we’re capacity planning, we don’t really care what our page views per second were back in 2009. It’s not really relevant; our whole stack has changed. And we can apply that to this problem as well. Instead of taking the entire timeseries and treating all the values equally, we can add a decay factor, so that the more recent values are treated with more importance. That way, we decide what’s normal based on what’s happening right now, and we don’t care as much about what happened in the past. This discounts the older values a bit, which makes things a little more accurate and a little less noisy. You can apply this exponentially weighted technique basically anywhere – just weight things according to where they are in your data. So that’s one technique.
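
A minimal sketch of the exponentially weighted variant, using pandas for the decayed mean and standard deviation; the span of 50 datapoints is an arbitrary choice for illustration:

```python
import pandas as pd

def stddev_from_ewma(timeseries):
    values = pd.Series([v for _ts, v in timeseries], dtype=float)
    if len(values) < 30:
        return False
    ewm = values.ewm(span=50)
    mean = ewm.mean().iloc[-1]   # decayed average, dominated by recent datapoints
    stdev = ewm.std().iloc[-1]   # decayed standard deviation
    if not stdev or pd.isna(stdev):
        return False
    return abs(values.iloc[-1] - mean) > 3 * stdev
```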

But there are a lot of problems with all of this. Skyline works, but there are lots of things it doesn’t handle, and I want to go into them right now, so we can think about how to solve these problems together. These are the four horsemen of the modelpocalypse: Seasonality, Spike Influence, Normality, and Parameters. The first one’s pretty straightforward. You might think this spike is an anomaly, and you see it in Skyline and you’re like: “Hey, thank you, Skyline. Thank you for showing me this awesome thing that doesn’t look right at all.” Until you look at the 24-hour view and realize that it’s probably a cron job; it’s highly seasonal, it’s supposed to be happening. In fact, I want to know when it doesn’t happen. How do I get Skyline to tell me that? How do I know? Your regression lines are going to totally fail here, because the average of this – and particularly the regression – runs right through the middle and is totally worthless; it can’t tell you anything. And the deviations from it are also worthless. The distribution is just not going to be a normal distribution. So our algorithms totally fall apart with seasonality, and that’s one of the problems we’re working on.

The thing is, once you’ve detected that something is seasonal, it’s easier to deal with, but those techniques are very expensive – we’ll get into that. So, Spike Influence. Say you detect an anomaly – you’ve got your spike and everything is working well. But with this whole three-sigma paradigm, your moving average – even if it’s exponentially weighted – is now being artificially bumped up by that spike, and so is your deviation, because the spike drags everything up with it. So you’ll miss a second anomaly that comes along afterwards, even though it would have breached the old threshold. If the first anomaly hadn’t happened, the second one would have been detected; but since it did happen, the average moved up, the three-sigma threshold moved up, and you’ve now got an anomaly that slips right by you. That’s a problem. To fix it, you’d have to remove the outliers, or retroactively go back and change your data. We’re reluctant to do those things, because we’re trying to detect the outliers in the first place, and we don’t necessarily want to rewrite data. So that’s another problem.

The third problem is the normal distribution, and this is the really big one. Real-world data does not really follow a normal distribution. All these algorithms were developed in what amounts to a pristine data environment, where everything has a nice normal Gaussian curve and life is dandy. But that’s not true for a lot of real-world data. The reason Statistical Process Control was able to get away with it, and was able to be so influential, is that it was dealing with fundamentally different things than we’re dealing with. They were dealing with car assembly lines and with screws that are supposed to have defined widths; they have specs, and they know exactly what they want. And it’s very easy – if you have a drill bit that drills holes all day long, you’re going to get a really, really nice distribution curve, because it’s doing the same thing over and over, and you know what it’s supposed to be doing. But we don’t have that. We don’t have that luxury. There are no specs for our systems. We build them because we have to serve users; we don’t spec them out.

You have no idea what normal is, really, beyond the fact that you can log into Etsy.com and it serves you. Beyond that, it’s not immediately apparent what a normal flow of data through the system looks like, because it’s a complex thing with lots of different moving parts. Like, this is the disk I/O wait on one of our databases. It looks terrible. But our site’s up, so is it normal? Is it not normal? I don’t really know. I don’t have access to all the data from all the components that are making that particular graph behave the way it does. So I don’t know whether this is actually a good thing or not, and as a result I don’t get the luxury of knowing what is normal and what is not normal.

To reiterate: when you have simple systems – just a drill bit and a piece of wood – you have very simple definitions of what anomalous means. But when you have complex systems, you have very complex definitions of anomalous, and you can’t really reduce that to a simple “this metric is spiking.” It doesn’t automatically translate to “your system is failing,” and that’s something to watch out for. And of course, in addition to complex systems being complex, even if you could nail them down at any given moment, they’re evolving, because we’re deploying all the time. Our systems are changing; we’re bringing servers up, we’re taking them down. We’re changing the code, we’re getting hit with more traffic because Obama tweeted about us.

These are the things that happen, and we have to be able to adapt to them. Our monitoring has to be able to adapt too. And so Skyline was built so that it can adapt – it’s more adaptive than Nagios, which just triggers on a threshold. But it’s still not enough, and we’re still trying to figure out how to make it more adaptive. One example: all the time we take out shards to do rebalancing or something, and Skyline will detect that, which is awesome, but it’s a false positive – it’s something we already know is happening. What’s a good way to shut that off? And now we’re getting into the area of alert design and user interfaces – which is not the focus of this talk – but it’s another thing to think about.

And the fourth horseman of the modelpocalypse is Parameters – model parameters. To back up to a broader view: we’re doing capacity planning right now, so I had this graph handy. This is our page views from 2009 to right now, in blue. I modeled it, because I wanted to make a nice prediction, and the prediction from the model is the other line. It’s a nice model; it looks pretty good – visually, it looks like what our demand looks like in terms of page views. The problem is, I spent a lot of time making this model: custom-fitting it to the data, making sure my assumptions were good, fitting the parameters. Parameters in general – that’s a term you might see around, ‘parametric’ versus ‘non-parametric.’ In general, when you say something is parametric, it’s not machine learning – and I could be completely wrong about this, because I don’t do much machine learning – but machine learning is when you just say: “Machine, figure out what the best model is. Figure out what parameters you need and what those values should be.” Whereas in parametric models you set the parameters yourself. These are the different knobs of your model that you set to fit it the right way. So the parameters I had for this model were that it’s seasonal, so I had to say: treat every datapoint on a 365-day season, so only compare it to the datapoints from 365 days ago. Give a certain amount of weight to the overall trend, but also a certain amount of weight to the seasonal trend – those are different kinds of effects. Give it an exponentially weighted smoothing factor to discount the older data, and do all these things. And then, once you have all that, once you’ve defined your parameters, you want to fit it. We do use a bit of machine learning here – it’s about the most basic application of it – where you train on your data, basically running it against all the different parameter values until you find the lowest mean squared error, until it fits best, and then there’s your model.
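
To give a toy-scale sense of what that fitting step means, here is a minimal sketch that grid-searches the smoothing factor of a simple exponential smoothing model for the lowest one-step-ahead mean squared error – doing anything like this per metric, across a quarter million metrics, is exactly the expense being described:

```python
import numpy as np

def one_step_mse(values, alpha):
    """Mean squared error of forecasting each point from the smoothed history."""
    forecast = values[0]
    errors = []
    for v in values[1:]:
        errors.append((v - forecast) ** 2)
        forecast = alpha * v + (1 - alpha) * forecast  # exponential smoothing update
    return float(np.mean(errors))

def fit_alpha(values):
    """Pick the smoothing factor that fits the data best."""
    candidates = np.linspace(0.05, 0.95, 19)
    return min(candidates, key=lambda a: one_step_mse(values, a))

# Usage on some hypothetical daily page-view counts:
#   alpha = fit_alpha(page_views)
```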

Those are two things that are really, really expensive to do, and they do not lend themselves to being done automatically, on a massive scale, across a quarter million moving metrics. That’s a really big problem, because then you’re dealing with models that don’t even fit your data, and you’re trying to detect anomalies with those models. When it’s not feasible to hand-fit all your models to the data, what do you do? That’s the problem we’re dealing with. You can’t have good anomaly detection without good models. And so, as you might have guessed, a robust set of algorithms is the current focus of this project. We’ve got the architecture pretty much set – and it’s nice, too, the idea that it runs on one box; I like to think of it that way, ideally. But the hard part now – and this is going to take a while – is figuring out how to solve these problems with our models and algorithms, and with our data. So that’s all.