How to Build an Engineering Team: Dag Liodden (Tapad) Interview Transcript
Vid 1 -
Pete: Hey, it’s Pete. I’m here in New York. And today we’re talking to Dag from Tapad. Dag’s the founder and VP of engineering. Thanks for being with us, Dag.
Dag: Thank you for coming.
Pete: So, first of all, tell us a little bit at a high level, what is Tapad? What do you guys do?
Dag: So, Tapad is a cross-platform, or I’d say multi-platform, advertising company. We’re what’s called a demand-side platform, so our customers are the actual advertisers, and they buy advertising through us. We’re in what’s called the real-time bidding space. This means that we are constantly processing a lot of potential buy opportunities for ads, and we figure out in real time if you want to buy an ad and how much you’re willing to pay for it.
And then we figure out which ad, or which banner, or whatever you want to place there. And we do all this in real time. So it’s a low-latency, high-throughput kind of business.
Pete: Okay. And tell me about the mobile aspect to your business.
Dag: So the mobile aspect is this: there have been mobile ad networks for quite some time. They are maybe not unsophisticated, but the mobile space has been lagging behind the progress made in the online display space over the past few years, especially on the real-time bidding side.
So we’re trying to bring some of the more sophisticated use cases, or uses of advertising technology, to the mobile space.
And we also allow customers to do cross-platform targeting so you don’t have to go to one provider to buy mobile advertising, and to another one to get the online display. We’re more a one-stop shop.
We do interesting things based on what people are doing on their mobiles. We can use that for targeting them on their desktops and vice versa, which has started to resonate really well with a lot of advertisers.
Pete: Okay. So technically this means that because you guys are doing real time bidding, there’s a high volume, low latency requirement for your architecture. Is that correct?
Dag: Yeah. So basically how it works, is that whenever an app or mobile webpage or a traditional webpage is being served there can be multiple ad units on the page or in that app. As the page is being rendered or as the application loads pretty much, we have to make a decision or the whole network or the whole system has to make a decision, which includes quite a few jumps from, let’s say, a mobile website.
They call an ad exchange, the ad exchange calls us and a number of other vendors or demand-side platforms, and we all look at the data that we’re getting. We compare that to the campaigns we’re running and we return basically a bid, saying, OK, this impression is worth this much money to us.
And all this has to happen before the user actually notices, because you don’t want to show a page with a lot of white blanks on it, or some sort of loading spinner. So the requirements are pretty strict. You typically have to respond within 60 milliseconds, and this includes network round-trip time, so the total processing time is very, very short.
And I mean there are companies that have higher volumes than we do in the online display space, but we are currently processing about 25,000 requests per second on our servers.
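The time budget Dag describes can be sketched in Scala, the language Tapad uses. This is a minimal illustration, not Tapad's code: the 60 ms deadline comes from the interview, while the object name, the method names, and the round-trip figure passed in by the caller are assumptions made for the example.

```scala
object BidBudget {
  // Total response deadline mentioned in the interview, in milliseconds.
  val TotalDeadlineMs: Long = 60L

  // Time left for our own processing once the network round trip is subtracted.
  // The round-trip time is an assumption supplied by the caller.
  def processingBudgetMs(networkRoundTripMs: Long): Long =
    math.max(0L, TotalDeadlineMs - networkRoundTripMs)

  // Return the bid only if it was computed within the budget; otherwise no bid.
  def bidIfOnTime(bid: Double, elapsedMs: Long, networkRoundTripMs: Long): Option[Double] =
    if (elapsedMs <= processingBudgetMs(networkRoundTripMs)) Some(bid) else None
}
```

With a hypothetical 40 ms round trip, `BidBudget.processingBudgetMs(40)` leaves 20 ms for lookups and bid logic, and a bid that took 25 ms to compute is dropped rather than sent late.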
Pete: And that’s one reason I heard that you guys actually made an interesting decision to move off the cloud. Can you talk to us a little bit about that?
Dag: Yeah, I mean, we started out, like a lot of companies, on the Amazon cloud, and EC2 is excellent for getting started. It makes it very cheap to start running a service. Generally, though, running on a public cloud where you have a lot of shared infrastructure makes it hard to really find the hot spots, to find what’s causing jitter in your performance.
So when you’re running on virtualized hardware, you’re running on shared network infrastructure, you’re running on typically shared storage. And we’re also, we’re a JVM shop, so we also have garbage collectors which add to the whole mix. If you have too many moving parts it makes it really hard to pinpoint where are our main sources of performance issues.
Are they something we control or are they in the actual infrastructure?
And we realized that in many cases this was actually in the infrastructure. I’m sure you can cope with that in many ways. To us being able to respond on a timely basis, every single time is very important. So, to us the average response time doesn’t really matter much. What matters to us is how many requests or how many incoming requests are not handled within that very stringent time that we have to process it.
And if we are losing ten percent of all requests because we have very high latency above our ninetieth percentile, that might actually be ten percent lost business for us. Because we cherry-pick every single request, and if we miss one out of ten, that’s ten dollars, basically.
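The difference between average latency and the share of requests that miss the deadline can be made concrete with a small Scala sketch. The sample latencies and all names here are made up for illustration, and the code assumes a non-empty sample; it only shows why a healthy-looking mean can hide lost bids.

```scala
object TailLatency {
  // Mean latency across all samples (assumes a non-empty sequence).
  def mean(latenciesMs: Seq[Double]): Double =
    latenciesMs.sum / latenciesMs.size

  // Fraction of requests that exceeded the deadline and were therefore lost.
  def missRate(latenciesMs: Seq[Double], deadlineMs: Double): Double =
    latenciesMs.count(_ > deadlineMs).toDouble / latenciesMs.size

  // Nearest-rank percentile: the smallest sample with at least p percent
  // of all samples at or below it.
  def percentile(latenciesMs: Seq[Double], p: Double): Double = {
    val sorted = latenciesMs.sorted
    val rank = math.ceil(p / 100.0 * sorted.size).toInt
    sorted(math.max(rank, 1) - 1)
  }
}
```

For nine requests at 10 ms and one at 200 ms, the mean is 29 ms, comfortably under a 60 ms deadline, yet `missRate` reports that one request in ten was too late to bid on.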
The other factor is actually in terms of cost we see that we can, when you start scaling up a little bit, start buying more specialized hardware, more expensive hardware then very quickly you get a better performance ratio with your own cloud. And we also have some hardware that you don’t even get in the cloud at all.
Like, we’re using solid-state disks for our key-value stores. And on the analytics side we have some very, very high-memory hardware, and RAM is something that’s often very expensive in the cloud.
Vid 2 -
Pete: So you mentioned that your stack, well you mentioned JVM. So you’re a Java shop?
Dag: We’re not a Java shop. We’re a Java Virtual Machine shop, which means that we like the whole Java ecosystem, but we’re not particularly fond of the Java programming language itself.
Pete: That’s awesome. So tell me more.
Dag: So we are a Scala shop. Almost all our server, well actually all of our code is in Scala now. We still use some Java frameworks, but the coding we do is almost without any exceptions in Scala.
Pete: I’ve heard interesting things about Scala, not having used it myself, but it seems to be this weird mix of OO and functional programming. Is that odd to get used to?
Dag: Well, first of all, yeah you’re right. A lot of people say that it’s weird to mix functional programming with OOP the way Scala does it. I personally think that is why Scala will be successful. But yeah, it’s a functional/OOP hybrid which means you can easily interop with all your existing Java based frameworks.
You can still keep, well, you can still keep an architecture like a Java application, reuse the same patterns in many ways, but then you also have the power of functional programming.
Pete: And how would you explain the reasons that you made that decision based on?
Dag: Well, that was partly a personal choice. I had been following Scala for a few years, actually. I was really fond of the syntax. I wasn’t really that into the functional aspects of it; that kind of grew on me later. But when we started building this, I realized that the tools were starting to be mature enough and the community was getting vibrant enough. I was pretty sure that we were already past the tipping point, where Scala would continue to gain traction and we could actually find developers. And that’s the flip side of it, of course.
Pete: So tell me also a little bit, you mentioned that you interoperate with some Java frameworks.
That’s kind of fascinating to me because it sounds like you’re able to capitalize on the enterprise maturity of certain aspects of Java but still use more cutting edge syntax.
Dag: Yeah, so the reason for this was that it was a trade-off. When we started working with Scala, it was a fairly new language to me, and the functional aspects do require a slightly different mindset in many cases. I didn’t want to throw away everything that we have learned in the Java world over the past 10 years, going from the monolithic EJB 1 and 2 versions over to a much more lightweight type of system.
Also, I knew at that point that not that many developers were actually familiar with Scala, and probably wouldn’t be, so I thought it would be a good idea to stick with architecting the applications somewhat similarly to what you would do in the Java world and then just use the language. This has changed a little bit over the course of the past six to eight months.
We’re probably, probably more idiomatically inclined toward the Scala way of doing some things now, some patterns, but any Java developer will very quickly be familiar with how the application is architected. We use servlets instead of using some of the new, very cool, very interesting, but still fairly unproven web containers.
And we still use the Spring framework for stitching things together, and a lot of the utility classes in the Spring framework are really, really good. They have been proven for years. So we prettied them up with Scala code to make them, well, nicer to work with.
But, yeah, we have a pretty hybrid stack with some Scala-specific things, but most of it is still recognizable to a Java developer.
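One common way to "pretty up" a Java class from Scala is a small implicit wrapper that adds Scala-friendly methods. The sketch below is an assumption about what that could look like, not Tapad's code: it uses `java.util.Properties` from the JDK as a stand-in for a Spring utility class, and `getOption` is a hypothetical method name.

```scala
import java.util.Properties

object PrettyProperties {
  // Enrich the plain Java class with an accessor that returns an Option
  // instead of null when the key is missing.
  implicit class RichProperties(underlying: Properties) {
    def getOption(key: String): Option[String] =
      Option(underlying.getProperty(key))
  }
}
```

After `import PrettyProperties._`, any `Properties` instance gains `getOption`, so the Java object stays untouched while the calling code reads like ordinary Scala.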
Pete: And what about the time it takes to program in Scala? I’ve heard that Scala code can be two to three times more compressed than similar Java code. Have you found in your experience that developing in Scala is faster?
Dag: Yeah, so one of the things that initially appeals to a lot of developers, I think, and this is just the initial phase, is that the Scala syntax is much terser. It’s more DRY, and you generally have to write less boilerplate code, and this is one of the things I also find appealing.
But, to be honest, I don’t think that’s the actual, that’s not the major advantage of doing this in Scala. Because with modern IDEs like Eclipse and IntelliJ, or whatever your flavor is, generating that boilerplate is very easy, but also navigating it so you can see outlines. There’s a lot of tooling support that doesn’t really make that into a huge deal.
But when reasoning over code and solving harder problems, then the functional way of looking at them does make more sense in many cases. It makes the problem or the solution much more obvious when you look at the code, and it just makes it less error prone. I think just reasoning over a functional piece of code, it makes it easy to test in many cases, too, and also there are some very nice things that I didn’t know about when I first started with this, like we don’t have nulls in Scala.
And I think you can talk to pretty much any Java developer, and null pointer exceptions is a very, very large category of runtime exceptions you see in your area or even in your production systems and if you want to program defensively and handle nulls all over, then you actually end up writing a lot of code.
Whereas in Scala you use something called an Option, which forces you to program defensively, but because you’re doing it in a more functional way, it actually doesn’t cost you as much grief as it would if you were doing it imperatively. And that has actually removed a whole class of problems and bugs in the code.
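A short Scala sketch shows the Option style Dag is describing. Scala does still have `null` for Java interop, but idiomatic code wraps possibly missing values in `Option` instead. The campaign data and all names below are hypothetical, chosen only for the example.

```scala
object OptionLookup {
  // Hypothetical campaign data, used only for illustration.
  val bidsByCampaign: Map[String, Double] = Map("spring-sale" -> 2.5)

  // Map#get returns Some(value) when the key is present and None when it is not.
  def bidFor(campaign: String): Option[Double] = bidsByCampaign.get(campaign)

  // Transform the value and supply a default without any explicit null check.
  def describe(campaign: String): String =
    bidFor(campaign)
      .filter(_ > 0.0)
      .map(bid => s"bid $bid")
      .getOrElse("no bid")

  // Option(...) turns a null coming from Java code into None.
  def fromJava(value: String): Option[String] = Option(value)
}
```

The missing-value case has to be handled before the result can be used as a plain `String`, which is the defensive programming the compiler enforces, but it takes one `getOrElse` rather than a chain of `if (x != null)` checks.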
Vid 3 -
Pete: So tell me about the next layer of your architecture. You’re doing some interesting things on the data side using NoSQL technologies and other things. Talk to me a little bit about that.
Dag: For every single bid request we get in we actually have to do some sort of storage lookup. And the way that is done, actually we have several, usually two look-ups per request. So, if we’re doing 25,000 requests per second, then we have to do about 50,000 look-ups.
And also, we do store information about them, so we have also fairly substantial write throughput. And this is like one of those obvious situations where traditional, very general, database technologies fall through. We could probably scale the reads pretty well up to that amount. But the writes would be very, very hard.
And also, because we’re so latency constrained, having a general database engine to add 2-3 extra milliseconds is just a waste of time.
Pete: So because of this, how many different types of data stores do you have?
Dag: So, we have three types, basically. We have the key-value storage, which is used for reads and writes during the time-critical bidding process. Then we have a traditional RDBMS, which in our case is MySQL, that we use for the more business-oriented, standard things that are nice to put in a database.
So, we have our, well, ledgers and transaction logs for money flow. Not for the high-frequency stuff, but for the more low-frequency stuff. And then we also have a more big-data type of store, which we use for doing analytics and queries on the really, really big data sets. We actually have a fourth.
We also have our unstructured data, or it’s actually structured, but we have just raw logs that we can run more ad hoc MapReduce queries over if everything else fails.
Pete: And I think we spoke previously. You’re looking at some different solutions for NoSQL data stores. Talk to me a little about the options there.
Dag: Yeah, there are a lot of options for NoSQL right now. First of all, I like the new explanation for NoSQL, which is “not only SQL”. This is all about finding the right tools for the job, and the different NoSQL solutions are usually very specialized. There are some like Mongo and CouchDB that are fairly general NoSQL solutions, and they also have sweet spots, but in our case we were looking for something very, very simple.
We just actually have keys and values; that’s all we need. We needed something with extremely low latency that performed consistently and predictably. We looked at a number of different projects, including open source projects, and we actually started out running on open source projects. I’m not going to bash them right now, but we ended up actually going for a commercial vendor.
Which I’m not going to sell either right now.
Well, yeah. They’re called Citrusleaf. It’s a really good company, actually. They deserve a lot of credit. They took a lot of the jitter in our performance right out of the equation. So, a very, very simple yet very capable NoSQL solution that performs insanely well. This is the stuff that we’re running on our SSD drives in a fairly small cluster, and the throughput it can take is still immense.
Pete: So you mentioned that there’s faster technology out there than the typical MapReduce, Hadoop stack, that many engineers are becoming familiar with. Talk to me a little bit more about the options there, and what you guys are looking at.
Dag: First of all, MapReduce is just a principle for how you can split big data into smaller chunks, process them in parallel, and then reduce the results into something meaningful in the end. There are a lot of generalized MapReduce frameworks. The most well-known one is called Hadoop, which is basically a rather large stack of software founded on the principle of MapReduce.
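The split, map, and reduce steps can be sketched on a single machine with plain Scala collections. This is only an illustration of the principle, not Hadoop's API: the `Impression` record, the function names, and the chunk size are assumptions, and the chunks are processed sequentially here where a real framework would distribute them.

```scala
object MiniMapReduce {
  // A hypothetical log record: one ad impression and the platform that saw it.
  final case class Impression(platform: String, region: String)

  // Map phase: emit a (key, 1) pair for each record in a chunk.
  def mapPhase(chunk: Seq[Impression]): Seq[(String, Int)] =
    chunk.map(i => (i.platform, 1))

  // Reduce phase: sum the counts emitted for each key.
  def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (key, group) => key -> group.map(_._2).sum }

  // Split the data into chunks, map each chunk independently, then reduce.
  // chunkSize must be positive.
  def countByPlatform(data: Seq[Impression], chunkSize: Int): Map[String, Int] =
    reducePhase(data.grouped(chunkSize).toSeq.flatMap(mapPhase))
}
```

Because each chunk is mapped without looking at any other chunk, the map phase is the part a framework can spread across many machines; only the reduce phase needs to bring results for the same key together.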
We started using Hadoop. It does have some strengths, in that it’s open source and it’s fairly easy to get started with. We’ve had some issues running into bugs. It’s fairly difficult to get everything up and running there. But it worked all right. But for the things we do, we need shorter response times.
We needed to do queries over really large data sets, and we just found out that Hadoop didn’t perform well enough for us. We were using Amazon’s Elastic MapReduce, their cloud offering, which also works pretty well, but it just didn’t give us the performance we wanted. The fundamental reason for this, at least to a large degree, is probably that when you are using Hadoop, you are using raw log files underneath.
If you have thirty terabytes of data, then you actually have to read thirty terabytes off disk. Whereas very often you just want to pick a few fields. Let’s say one of the queries we might run over this data set is: over the course of a month, how many ad impressions could we sell to iPhone users in North America? All we’re interested in there is the platform we detected and maybe some IP ranges or DMA or something like that. Just a couple of fields.
Pete: So you’re talking about selecting a value out of just one column, out of a data set that might have many columns in the row.
Dag: Yeah. So, that’s exactly it, right? And so Hadoop is brute forcing a lot of these things. What you really want to do is just, well, this is just like the poster-boy for column-oriented databases. There are a few open source ones there, too. So, HBase is one, but Cassandra is probably one of the more well-known there.
HBase, then Cassandra. There is an interesting project that I know the DataStax guys are doing, running Hadoop on top of Cassandra, which then at least takes the storage and load, the I/O problem, out of the equation. But it turns out that there are a lot of proprietary commercial vendors here that offer solutions that still just blow anything else out of the water.
Some of them are actually hybrid open source as well, like Infobright, which is based on MySQL and has a proprietary storage engine. And it has a community edition, but, and we’re not using them, if you want these solutions you actually have to pay for them.
At least for now. And they use compression and column-oriented storage. I recommend that anybody who’s looking for analytics, and especially real-time analytics, take a look at these offerings. I know that if we were trying to do this with something like Hadoop, well, we would have made it work.
We could do rollups on a regular basis, all those kinds of traditional workarounds, but with these solutions you can actually just run your queries in SQL and they will respond in sub-second time, even over hundreds of millions of rows.
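The contrast between scanning whole rows and reading only the needed columns, as in the iPhone-users-in-North-America query above, can be sketched in Scala. This is a toy in-memory model under stated assumptions, not how any of the vendors mentioned store data: the field names are hypothetical, and real column stores add compression and on-disk layouts that are not shown.

```scala
object ColumnScan {
  // Row-oriented: each record keeps all of its fields together.
  final case class Row(platform: String, region: String, userAgent: String, bytes: Long)

  // Column-oriented: one vector per field; the same index identifies a record.
  final case class Columns(
    platform: Vector[String],
    region: Vector[String],
    userAgent: Vector[String],
    bytes: Vector[Long]
  )

  def toColumns(rows: Seq[Row]): Columns =
    Columns(
      rows.map(_.platform).toVector,
      rows.map(_.region).toVector,
      rows.map(_.userAgent).toVector,
      rows.map(_.bytes).toVector
    )

  // Row scan: every full record is visited, although only two fields matter.
  def countRows(rows: Seq[Row], platform: String, region: String): Int =
    rows.count(r => r.platform == platform && r.region == region)

  // Column scan: only the two relevant columns are ever read.
  def countColumns(cols: Columns, platform: String, region: String): Int =
    cols.platform.indices.count(i => cols.platform(i) == platform && cols.region(i) == region)
}
```

Both functions return the same count, but the column version never touches `userAgent` or `bytes`. On disk, that is the difference between reading the whole data set and reading a couple of fields.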
Pete: And a couple of those players again are?
Dag: So the ones that we’ve been looking at are Vertica, Infobright, Aster Data, Greenplum, and another player called VectorWise, which is from Ingres.
So, they all perform incredibly well. They do start to change a bit when you start doing. So, we have two types of big data basically. We have the impression tracking, which is based per campaign. That’s still in the region of maybe a couple hundred million rows that we do analytics on.
And then we have this firehose, which is in the thirty-billion-row region. When you start getting up to those levels, they start to differ a bit due to the way they’re architected: some are easy to scale out, whereas others actually have to be scaled up. So they’re different, but they’re all worth a look.
Vid 4 -
Pete: Tell me a little bit about the team that you’re building here and why an engineer would want to come and work at Tapad.
Dag: Well, that’s an interesting question, and of course it’s not a hard one to answer, actually. We’re looking to build a team of really, really skilled engineers, both on the technical side and on the team dynamics side. We’re looking for people who already have a proven track record and who can come in, get a problem presented to them, and then start working on it with us and solving it.
If you want to attract really good developers, then you typically have to give them something that is interesting. And Scala has, no doubt about it, caused a lot of interest in the Java community and also in other language communities, and being slightly esoteric is a good thing if you want to attract people.
Pete: Yeah, that’s an interesting strategy. And you have a unique perspective on repeatable jobs versus jobs that require a lot of brain power. Tell me about that.
Dag: We have a slightly, I’m not sure if egalitarian is the word here, but we don’t hire first-rate and second-rate developers here. We expect everybody to perform on the same level. And this also means that we’re not going to have people coming in who have to do all of the mundane tasks. Everybody on the team has to do simple things every now and then, but if the simple thing is something that’s repeatable, then it should be automated instead of spending time on it the next time.
So, we automate the simple and we spend our brain power on solving the harder and more interesting problems. There will always be a balance between the really fun stuff, and then the more. Well, most things are fun actually, but we can never guarantee that every single day will be super exciting.
Pete: That’s great.
Dag: But I think when you put a lot of really, really good engineers into the same room, they will always come up with really, really good solutions, and they will be able to implement them in a fraction of the time that a huge team of less-skilled developers would. That’s kind of our founding principle, and it applies to the whole organization, not only to the engineering team.
So, you will work with peers in product management and in sales that are really awesome people.
Pete: So, anything else that’s awesome about Tapad that we missed that you want to talk about?
Dag: I think we’ve covered a lot. I’m sure we have missed a lot.
I think the key takeaway here is the team we’re building and the fun we’re having. Everybody’s pulling in the same direction. Nobody is slacking. That doesn’t mean we’re working people twenty-four/seven here. Actually, one of my principles is that I want people to have a sustained pace all along.
We don’t burn the midnight oil every time. We expect people to pull through if we’re in a tough spot, but now, the team is also outside engineering. That’s important. Everybody here is pleasant to work with.
Pete: Yeah. Great. Awesome. Well. Thanks for chatting with us. This has been really great.
Dag: Thank you.
Pete: Yeah. I appreciate it.