Redis at Pinterest - Transcript
What’s up guys? You’re listening to g33ktalk. Coming up, we’ve got a recording from a recent Redis Meetup. Aren Sandersen is going to talk to us about Redis at Pinterest. And if you’re not there already, be sure to check out our website for more great start-up engineering content at g33ktalk.com.
Aren: Hi everybody. My name’s Aren Sandersen. I’m the head of technical operations at Pinterest, right down the street here actually, at 7th and Brannan; we just moved to the city not too long ago. Some of you might have seen this at Redis Conf, where I gave a similar overview of what we do with Redis at Pinterest at a fairly large scale: things to keep in mind that we learned sort of the hard way, so that as your sites grow, you don’t make a lot of the same mistakes we did.
All right, so a little bit about Pinterest, if you don’t know what it is. It’s an online pin board to organize and share what inspires you. So I’ll give you an example of, this is my Pinterest board. I’ve got my Pinterceptors and projects I’ve done, watch collection. Things I’m interested in and want to hang onto. So here’s an example of one of my pins. This is my air compressor powered Nerf gun that I built. Nobody in the office will play Nerf with me anymore. Can you guys see over here enough? Okay, I don’t want to block it. Yeah, needless to say, nobody wants to play Nerf gun with me anymore in the office.
So a little bit about the architecture: it was built on AWS from pretty early on, and this actually turned out to be a pretty wise decision by the people before me who made it. This graph is from a while ago, but it gives you an idea of Pinterest’s growth. Of course, we’re missing a lot of labels and axes and things like that. But when the curve sort of does this, there’s no way physical infrastructure, you know, buying machines, racking them, stacking them, would ever have kept up with the growth the site was experiencing, so it was really critical that it was built on a cloud architecture.
So here’s an overview of what it looks like. Amazon’s Elastic Load Balancer on top, a bunch of front-end web servers, hundreds and hundreds of those, and then lots of back-end services. We’ve got our data layer, which is a proxy to MySQL; some anti-spam stuff that we’ve built; Pyres, which is Python’s version of Ruby’s Resque; and a lot of caching. It’s really important to have tons of caching: we get about a 98% hit rate on our caches, so we’ve got a lot of those. And then a lot of Thrift-based and HTTP-based services: a feed service, a follower service, and several other services.
The front-ends can find a lot of these back-ends using Zookeeper-based service discovery, which is pretty awesome when you get to that point. Just being able to fire up a machine and, when it comes up, it’s ready: it registers with Zookeeper and then starts getting traffic automatically. So it’s a way to do mid-tier load balancing without actually needing to have a load balancer, use ELB, or put anything in the middle there.
All right, so I thought I’d show you, given this diagram, the areas where we actually use Redis. So Resque, or rather Pyres, uses Redis for queuing: you drop a task in there, saying this is the task and the class name, then a Pyres worker machine picks it up and runs with it. So we run a lot of those queues. In our feeds there’s tons and tons of Redis usage; I’ll get to that in a little bit.
Our follower service, this is our graph of who’s following whom, and this is all in Redis and all in memory to keep it quick. In-memory caching, of course, has a ton of Redis usage; we’ll get to that. And anti-spam: this team relies really heavily on Redis to keep signatures of known spammy items and users. I believe they’re also using it for some rate limiting to keep the bad people off the site.
All right, so diving a little deeper: if you’ve used Pinterest, you might have seen these weird-looking IDs that we have. Step aside here, you can see that a Pin ID is kind of a monstrosity, and this is what it contains. It’s a 64-bit integer divided up into 3 sections. First, there’s a shard ID. Then we’ve got the type, what kind of data this ID is. And then a local ID: once you go to that shard, what’s the local ID, which is basically the position in the table in that database. So this structure allows up to 64k shards, and we’re only using a small fraction of that, so we have lots of room for expansion in the future. We use about, I think, 8,192 virtual shards, and that applies to our databases: we run that many virtual databases, not that many physical machines, across dozens and dozens of machines. And that applies to our Redis installation as well.
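The ID layout described above can be sketched as simple bit-packing. The exact field widths here (16-bit shard ID for up to 64k shards, 10-bit type, 38-bit local ID) are an assumption for illustration, not the confirmed Pinterest layout:

```python
# Sketch of the sharded 64-bit ID scheme described in the talk.
# Bit widths are assumed: 16 bits of shard ID (up to 64k shards),
# 10 bits of type, 38 bits of local ID.
SHARD_BITS, TYPE_BITS, LOCAL_BITS = 16, 10, 38

def pack_id(shard_id, type_id, local_id):
    # Place the shard in the high bits, then the type, then the local ID.
    return (shard_id << (TYPE_BITS + LOCAL_BITS)) | (type_id << LOCAL_BITS) | local_id

def unpack_id(global_id):
    # Reverse the packing: mask out each field from low to high.
    local_id = global_id & ((1 << LOCAL_BITS) - 1)
    type_id = (global_id >> LOCAL_BITS) & ((1 << TYPE_BITS) - 1)
    shard_id = global_id >> (TYPE_BITS + LOCAL_BITS)
    return shard_id, type_id, local_id
```

Routing a request then just means unpacking the shard ID from the object’s public ID and looking up which physical machine currently hosts that virtual shard.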
All right, so I’ll talk a little bit about why Redis is ideal for us. A user’s home feed, what the user sees, is sort of like an activity stream, sort of what you’d see on the front page of Facebook: the people you’re following and what they’ve Pinned recently, in chronological order. So that’s just a list of Pins. The follower graph: you have a set of followers and a set of followees. Your boards: every board you have is a set of Pins.
And consider what happens when I create a Pin. So I Pin something. Our back-end process goes out, gets the list of followers I have, takes that Pin ID, and pushes it onto the head of every follower’s home feed. So you can see, that’s basically Pinterest in a nutshell right there, kind of giving it away. Take Redis, add some HTML and a couple million users, or a lot more than that actually, and there you go: Pinterest.
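The fan-out just described, fetch my followers and push the new Pin ID onto the head of each follower’s home-feed list, can be sketched like this. The key names are made up for illustration, and a tiny in-memory stand-in replaces a real Redis client (which would use SMEMBERS and LPUSH) so the sketch runs without a server:

```python
# Sketch of pin fan-out. Key names ("followers:<id>", "homefeed:<id>")
# are assumptions; FakeRedis implements just enough of the Redis API
# (sadd/smembers/lpush/lrange) to run the sketch without a server.
from collections import defaultdict

class FakeRedis:
    def __init__(self):
        self.sets = defaultdict(set)
        self.lists = defaultdict(list)
    def sadd(self, key, *members):
        self.sets[key].update(members)
    def smembers(self, key):
        return set(self.sets[key])
    def lpush(self, key, value):
        # LPUSH prepends: newest item ends up at the head of the list.
        self.lists[key].insert(0, value)
    def lrange(self, key, start, stop):
        end = None if stop == -1 else stop + 1
        return self.lists[key][start:end]

r = FakeRedis()

def fan_out_pin(pinner_id, pin_id):
    # Push the new pin onto the head of every follower's home feed.
    for follower_id in r.smembers(f"followers:{pinner_id}"):
        r.lpush(f"homefeed:{follower_id}", pin_id)
```

In production this loop would typically run asynchronously, for example as a Pyres background task, rather than inline with the user’s request.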
So the one drawback here is that, as we know, Redis is an in-memory database, and this consumes a lot of RAM. Even for users that don’t come back to the site, we still keep their feeds in memory. So that’s one of the major drawbacks with this approach, but we’re working on it.
All right, so one of the early uses of Redis was just as a much better in-memory cache. We didn’t use any of its persistence; all the truth was still stored in MySQL. But we used it as a cache because it allowed paging of large sets and some atomicity that we needed. So imagine you stored every user’s home feed as a blob, a pretty large blob, in Memcache. There is an append feature in Memcache, so you’re fine there. But when you want to page into, say, the last half, or you want to get page 10 of 20 or 50 or whatever of the user’s home feed, you would have to pull that whole blob down, page into it, take out some chunk, and then throw the rest away. That turns out to be extremely wasteful. With Redis lists, of course, you can just index into it, pull the data you want, and show that to the user. So it turned out to be much more efficient.
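The paging win described above comes from Redis returning only the requested slice of a list (a single LRANGE call) instead of the whole blob. A minimal sketch, with a plain Python list standing in for the Redis list and an assumed key name in the comment:

```python
# Sketch of paging a home feed stored as a Redis list. In production
# this slice would be one server-side call, roughly:
#   LRANGE homefeed:<user_id> start start+page_size-1
# (key name is an assumption). Here `feed` stands in for the list.
def get_feed_page(feed, page, page_size=25):
    start = page * page_size
    return feed[start:start + page_size]
```

The Memcache-blob approach would instead transfer the entire feed over the network for every page view and discard everything outside the slice.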
We still use tons of Memcache for, you know, really small objects that don’t need any sort of paging like that, and we use a lot of it. As I said, we have hit rates around 90% for our Memcache and our other caches as well. And also, we do run Redis as a persistent store now. We didn’t for a long time, but we switched over to that for some of our stores. So a lot of our objects in Redis now are not backed up in the database; Redis is the only store.
Let’s talk a little bit about the operational take-aways we’ve had. Very important: pre-shard if you can. Or if you haven’t pre-sharded, then when it’s time to do it, make a lot of shards; I think that will really help you. As I said, our follower cluster has 8k shards, and these aren’t that many boxes. There are dozens and dozens of machine pairs, but it’s certainly not a one-to-one ratio. We have the ability to have 64k, but one thing you have to watch out for is the connection count on Redis. It can’t take unlimited connections, or more realistically, the OS can’t handle that many connections. So if you can imagine: 50 processes per web machine, maybe a connection pool with 10 connections per pool, 100 machines, a shard factor of 16, that’s like 800,000 connections trying to go to one machine. That’s not going to work. So just do the math on that. Make sure your architecture is going to keep the connection count preferably below 100,000 connections per machine. We’ll talk in a bit about some of the things you can do to mitigate that.
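The connection math in that example works out like this:

```python
# The back-of-the-envelope connection count from the talk.
processes_per_web_box = 50   # web worker processes per machine
connections_per_pool = 10    # pooled connections per process
web_machines = 100
shards_per_redis_box = 16    # each shard gets its own connections

connections = (processes_per_web_box * connections_per_pool
               * web_machines * shards_per_redis_box)
# 800,000 connections aimed at a single Redis box, far beyond the
# roughly 100,000 per machine the talk suggests staying under.
```

Proxies like Twemproxy (mentioned later in the talk) attack exactly this multiplication by collapsing all of a web box’s connections into one per back-end.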
All right, so we also learned you need to stagger your BGSAVEs to improve performance. What we do is use append-only files everywhere, and then we have a cruncher that goes through and, one by one, does a BGSAVE for our back-up process. If they all BGSAVE at once, you can really crush your IO and take your master down; you definitely want to avoid that, because staggering just improves performance. But you actually do need to handle the worst case, which is when the slaves drop offline. Say there’s a networking blip in EC2 or whatever hosting provider you’re using, and then the blip’s over. All the slaves are behind; they all reconnect and force a BGSAVE from all 10 or 32 or however many processes are running on that box. If you don’t account for this and at least make sure it’s possible, then you risk the master going down again. So staggering BGSAVEs is much better, but you need to be able to survive all of them at once is what we’ve learned.
And I touched on it briefly, but use the right persistence strategy for your dataset. Give it some thought and make sure it’s going to work for you. One of the few places on Amazon where we actually use EBS is our Redis boxes, kind of reluctantly, but it’s the only way we got the IO we needed to support all the slaves connecting to the master at once.
All right, so a little bit about space saving. We’ve noticed that most follower lists, who’s following whom, are actually really short. Using a Redis set turned out to be really convenient but also very wasteful in terms of space, and since this is all in RAM, that can get very expensive very quickly. So here’s a trick used by Instagram and some others; you’ve probably seen it on Instagram’s engineering blog. We store the entire follower graph in memory, as I said.
Basically, group users into groups of 1,000 and keep a hash table for each group, with each follower list kept as a comma-separated string. Here’s an example of how you’d find the follower info for a particular user: divide the user ID by 1,000 and use that as the key of the hash table, then mod by 1,000 and look up that field within the hash for the follower info. Of course, as anyone may have noticed, the problem here is a pretty big race condition: if two processes grab that string to update it, pull it out, add something new to the end, and write it back to Redis, one of those writes is going to get overwritten. So you could lose data this way.
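The lookup just described, divide by 1,000 for the bucket key and mod by 1,000 for the field, can be sketched like this. The key naming is an assumption, and a plain dict of dicts stands in for the Redis hashes (which would be read with HGET on the key and field):

```python
# Sketch of the bucketing trick: users grouped 1,000 per Redis hash,
# follower IDs stored as a comma-separated string. Key naming
# ("followers:<bucket>") is an assumption; `store` stands in for
# the Redis hashes (HGET key field).
BUCKET = 1000

def keys_for(user_id):
    # bucket key and field within that bucket's hash
    return f"followers:{user_id // BUCKET}", user_id % BUCKET

def get_followers(store, user_id):
    key, field = keys_for(user_id)
    raw = store.get(key, {}).get(field, "")
    return [int(x) for x in raw.split(",") if x]
```

The space saving comes from Redis storing small hashes very compactly, and from a comma-separated string avoiding the per-member overhead of a real set.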
So here’s our solution. Is that going in and out or is that just me? No, it’s good? Okay. The way we solve that is Lua, in Redis 2.6. You can actually write a Lua script to do this for you, which maintains atomicity and avoids that race condition. So here’s the Lua script. I forget which one this is, but it gives an example: the script effectively holds the key while it adds or removes or appends whatever you need and saves it back, so you don’t have the race condition you would otherwise. And then at a length of about 10,000 IDs this gets way too slow, so at that point we just bite the bullet: we migrate it to a set and have a different mechanism to handle those really, really large follower sets.
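A sketch of what such a Lua script might look like; this is illustrative, not Pinterest’s actual script. With a client like redis-py it would be registered once via register_script and invoked per update, and because Redis executes scripts atomically, two concurrent appends can’t clobber each other:

```python
# Illustrative Lua script for an atomic append to a comma-separated
# follower string stored in a hash field. Script text and key layout
# are assumptions; Redis runs the whole script atomically, which is
# what closes the read-modify-write race.
APPEND_FOLLOWER = """
local current = redis.call('HGET', KEYS[1], ARGV[1])
if current and #current > 0 then
    current = current .. ',' .. ARGV[2]
else
    current = ARGV[2]
end
redis.call('HSET', KEYS[1], ARGV[1], current)
return current
"""

# The same append logic in Python, for illustration only. Done
# client-side like this, it would reintroduce the race condition
# the Lua version exists to avoid.
def append_follower(current, new_id):
    return f"{current},{new_id}" if current else str(new_id)
```

With redis-py, registering and calling it would look roughly like `script = r.register_script(APPEND_FOLLOWER)` followed by `script(keys=[bucket_key], args=[field, new_id])`.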
So a little bit about what we’re working on, or what’s important to us at Pinterest. Easy failover. Right now, and this isn’t necessarily Redis-related, this is just the infrastructure around Redis, it’s kind of a pain to switch from master to slave, especially if you lose the master: you fail over to the slave and then you’ve got to bring up a new one. So we’re working on automating this process. It probably won’t be fully automatic right away; someone’s still going to get paged and have to get out of bed and actually do this process, but making it easy is really important to us. Right now, you actually have to do a deploy: change which master you’re pointing to and redeploy the web or mid-tier. That’s just not very fun, so we’re making it better. We’re putting our configs in Zookeeper and some other cool things like that. I’m happy to chat about that.
It’d be neat to do automatic failure detection. I think some of these goals might be obviated by the new clustered Redis work coming out of the Redis project. We haven’t played with that yet, but we’re excited to try it out. And Twitter has Twemproxy, or Nutcracker as it’s also called, which is great for Memcache proxying. I don’t know if anyone’s ever played with it, but you basically run one of these proxies on each of your web servers, especially if you have multiple processes that talk to multiple Memcaches, and it makes sure that the box only has one connection to each of your back-end Memcaches. This can really cut down the number of connections each Memcache needs to support and really, really help scaling. So when you’re hitting too many connections per machine, take a look at this project. It’s open source. It’s really great.
Last little plug for us, right down the street we’re hiring. This was taken a while ago. We’re slightly bigger than that now. But we’d love to talk to you if you’re interested and happy to take questions. There’s some contact information if you want to get a hold of me.
Audience Member: When do you use Zookeeper and when do you use Puppet?
Aren: Yeah, yeah. A question about Zookeeper versus Puppet, so Puppet, for us, and I’m sure there’s a lot of overlap in some of these products, but Puppet, for us, is used when a machine first comes up. So we do a fair amount of autoscaling, adding more machines at peak and taking them away. And when a machine comes up, we have our base AMI image and then we use Puppet to make sure it’s up to date with all the packaging and everything. And also then, Puppet runs every couple hours to make sure the machines are completely updated if we’ve added things that day.
So those are the things machines need every couple of hours. Zookeeper is for sort of instantaneous information, like what the configuration is right now. Within 10 seconds of a key in Zookeeper changing, the machines will know; it’s a very different model. Zookeeper has three to five distributed nodes, and clients keep a persistent connection to it; if something changes, a small amount of data, not a large amount, you’ll know right away. So they both do configuration management, but at very different levels, if that makes sense. And Zookeeper also has the great feature that you can use it for service discovery: everyone who’s available to offer, let’s say, our database proxy layer registers when they boot up, and then a watcher can come in, ask what all the database proxies available to it are and what their IP addresses are, and use that as its working set. So they’re similar, they’re both great, and they’re certainly great in conjunction.
Audience Member: What are your cache rates and what actually are you caching?
Aren: Yeah, great question. Did everyone hear that? Our cache hit rates are 90%. What are we actually caching? We don’t tend to cache HTML. I can tell you the main reason you should not cache HTML unless you’re really careful: if you cache a bunch of HTML, it has assets in it, and those assets probably have HTTP links when you’re testing. But if you offer HTTPS, then all those cached fragments are not going to work. So unless you keep separate HTTP and HTTPS caches, that tends to run into trouble. We cache little JSON snippets of objects, so board information, Pin information, just really small objects in Memcache for the most part. We use Redis for most of the heavy lifting in terms of paging and things like that.
A question in the back?
Audience Member: At what point do you need to stagger the BG saves?
Aren: Yeah. Good question. So at what point do you need to stagger the BG saves? I mean I guess it’s kind of different for everyone. I mean keep an eye on your performance. When you notice that you’re wasting a lot of time with IO, waiting on IO, for wherever those BG saves are going, then you want to look into it.
Audience Member: Where is it for you?
Aren: Where is it for us? That was implemented soon after I joined. So I don’t actually know what the tipping point was. I do know that IO tuning with Redis was something that an employee spent a lot of time on to make sure that was done well.
Aren: More questions? Right here?
Audience Member: What Redis Clients are you using?
Aren: Good question. What Redis clients are we using? I don’t know. Python. Yeah, I’m not sure which one we’re using. I don’t know if there’s several or not. It’s actually one area I don’t know.
Audience Member: [inaudible]
Aren: Yeah. The process by which we switch from that list? Probably; I can look into it. I think it’s not more than a function. The question is about the piece of code we use to switch from this hash-table storage of comma-separated values to a Redis set, and how we might accomplish that. I’ll look into it, but I don’t think it’s more than a function or so. When something notices that a list is too big, it probably fires a task into one of our background processes, which fixes it up behind the scenes. But I’m more than happy to look into that, and if it’s something we can share, I will.
Right here? Yeah.
Audience Member: What don’t you store in MySQL anymore?
Aren: What don’t we store in MySQL anymore. We do not store the follower graph in MySQL anymore; that is purely Redis. We take a lot of backups, but thus far no problems. No problems at all. That was kind of a big leap for us, to move away from everything being persisted in MySQL. This was before my time, but they tried a lot of back-end databases and kind of struggled with all of them. MySQL just sort of worked, and the idea was to put everything in there. We’ve slowly become more willing to trust other products, and so far, no problems.
Audience Member: Do you let the append only file grow forever?
Aren: I think we save every hour from our Redis, and we upload those to S3, and we age out the history slowly. So you might have every hour for the last day, every day for the last week, and so on, backwards. So, the question is do we let the append-only file grow forever. I know we use the rewriting function, so we rewrite it periodically. I don’t know the frequency, but we need the data, so I don’t—
Audience Member: [inaudible]
Aren: It’s always there. So does that answer your question? Okay.
Audience Member: What bad experiences have you had?
Aren: We’ve had a lot of bad experiences. We’ve tried, I think, every combination of Redis persistence ideas. We’ve settled, for the most part, I can’t say for every single service, but I think for the most part, they’re all running append-only, which has worked out pretty well.
Audience Member: If you could request a feature in Redis what would it be?
Aren: If I could request a feature in Redis, what would it be? I think it would be the automatic clustering and failover stuff that’s already being worked on. One of our biggest pain points is having to switch over: switch over to the slave, bring up a new master, bring up a new slave, switch that to master, bring up another slave, get machines syncing on all of that. It takes a lot of human time, and we would love more automated tools around that. We’re writing some of our own, but I know the community’s building some of that stuff now, and I can’t wait to see it. I saw a demo of it at Redis Conf and it looked amazing.
More questions? In the back.
Audience Member: Does using the comma-separated list really buy you that much over just using sets everywhere?
Aren: The question is, does using the comma-separated list really buy you that much over just using sets everywhere. I wasn’t the one who worked on that project, so I can’t say for sure, but I know we were able to cut down the number of servers needed to hold the entire data set by a pretty good margin. My impression is that the memory overhead needed to maintain discrete objects in a set is substantially higher than what you get by just using a blob of data like a comma-separated list.
Audience Member: [inaudible]
Aren: No, it is a comma-separated list, literally. It is a hash table of strings of comma-separated IDs, and then we use the Lua code to maintain it and get the atomicity we need. And Instagram noticed it too; it saved a lot of memory. Question?
Audience Member: How do you unit test that Lua code to make sure it’s doing the right thing?
Aren: Good question. The question is how do we unit test that Lua code to make sure it’s doing the right thing. Unit tests, what? No, Pinterest is actually much more mature now; we have really high unit test coverage. That’s a really good question, and I don’t think we’ve really unit tested it. Fortunately, the Lua code is only a couple of chunks to manage that particular thing, so hopefully, by now, it’s pretty well exercised. I don’t have a good answer to that. If we ever add a lot more Lua code, I would look into how we can unit test it. I’m a big believer in unit tests, so it’s a great point.
In the back.
Audience Member: What do you get most nervous about in terms of the Redis?
Aren: Yeah, what do I get most nervous about in terms of Redis. One thing that’s pretty nerve-wracking with Redis is when a machine goes down and comes back up. It takes a long, long time to load the data set into memory, so a master crashing for whatever reason, like sometimes we don’t watch the memory and the OOM killer decides it doesn’t get to live anymore, takes a long time to come back up. And so you have to make the decision: okay, do you fail everything over to the slave, or do you wait like an hour for this one shard to come up? Those scenarios really aren’t very fun. It would be nice if it came up faster, but we should also just be a little better at making sure they don’t get killed.
In the back.
Audience Member: What’s the user experience like?
Aren: Yeah, good question. What’s the user experience like? You could see how many shards we have, thousands; I guess it depends on how many are in each process. Let’s say there are 16 or 32 in each process, so an outage is only going to affect a subset of our users. We’re pretty good about returning error codes and handling them further down the stack, so in theory, users are just going to get a message saying their graph’s not available, or to check back later. It’s not ideal, but we’re pretty good about making sure the whole site doesn’t show an error page when just one aspect isn’t available. They can still see their home feed if their follower feed is down, and so on.
Audience Member: How do you size Amazon instances to the number of Redis?
Aren: Yeah, a question about sizing Amazon instances for the number of Redis processes. One Redis per core, I think, is ideal. Redis is single-threaded, so more than one core per process isn’t going to buy you a whole lot. Some people say it’s sort of N minus one, like give the OS a CPU to work with. The Redis folks probably have better answers, or similar answers, than we do, but that’s what we found in terms of sizing the Redis machines.
Audience Member: Do you have any tips for running Redis on Amazon Cloud?
Aren: Tips for running Redis on Amazon’s cloud. I don’t know that it’s much different than anywhere else. I think sizing is important. I can’t think of anything off the top of my head that wouldn’t apply to other sorts of infrastructure. IO on Amazon can be a bottleneck, so look into EBS if you need a lot more IO, and benchmark things if you’re using the ephemeral disks, because their IO is not great. So I think that’s the one drawback and something to keep in mind.
Anybody else? All right, I’ll be around afterwards, I hope, for a while. Grab me if you have questions or want to talk about Pinterest. Thank you.