Transcript: Building analytical applications on Hadoop
(Original post with audio and slides is here)
Josh Wills: All right. My name is Josh Wills. I am the Director of Data Science at Cloudera. I’ve been at Cloudera for, how long has it been now, actually? I’m a data scientist. I should know this. I think it’s been like 16 months or so, roughly speaking. That sounds good, 16 months. Before that I was at Google. I worked at Google for a little less than four years. I did kind of like a proper tour of duty at Google. My first job at Google was working on the ad auction. So you do a search on Google and the ads show up. That was me. You’re welcome. Thank you all who have heard that joke before for laughing politely. I appreciate that.
After that, I worked on a bunch of data infrastructure stuff. So I worked on things like logging, building dashboards, building experiment frameworks, doing friend recommendations, all this kind of stuff, all trying to analyze user behavior. A lot of it got used in Google+. A lot of it got used in Google News. A lot of it got used in Google Mobile Search. So I’ve done pretty much everything with respect to data that you possibly can do in about a four year span, roughly speaking.
So now I have this really cool job at Cloudera, where I am a data scientist at an enterprise software company that, as you probably suspect, doesn’t have a ton of data. So I mostly just get to kind of go around and work on problems that I think are interesting and then go talk about it. So it’s a pretty sweet gig. They pay me and everything. It’s really cool.
Anyway, so let’s get started. That was a lot of the about me, but there’s one more thing about me and it’s this tweet I did. If I was to die tomorrow, this would be like my major contribution to western civilization. And that is a proper, defensible definition of what is a data scientist. So my definition is someone who is better at statistics than any software engineer and better at software engineering than any statistician. And to the extent that being re-tweeted a lot on Twitter is a sign of credibility, this is a pretty popular definition. People typically kind of within the sort of the constraints of 140 characters, this is a pretty good definition of what it means to be a data scientist, all right?
So a lot of people, like my grandma, ask me what is data science, that kind of thing. Today we’re talking about analytical applications. And so I thought it would be useful if we started out by talking a little bit about what we mean by analytical applications. Or what exactly I mean when I say that. Since our field right now is kind of overrun by big data and data science and all these words that have a lot more meanings sort of poured into them than like any, sort of like their size can support, if that makes sense? Right, there’s like a lot of different ideas floating around about what this stuff means. So I want to tell you like what I mean by analytical application.
And what I primarily mean is sort of applications that allow users to work with and make decisions from data. And I realized as soon as I wrote that down that it was like way, way, way too broad of a definition, right? Like what application possibly doesn’t have data? So this might have to be more of a kind of like let me show you what I mean exercise than a like sort of define it precisely sort of exercise. But like let’s talk a little bit about this, right?
So the canonical example right has got to be dashboards. And everybody knows dashboards. Dashboards is like data science 101, writing a dashboard. I think this is something based off of Click View. If you guys have ever seen Click View, it’s a tool that like helps you build dashboards off of relational databases very quickly and easily. And there’s all kinds of dash boarding tools. There’s like you can use D3. You can use Click View. You can use Micro Strategy. You can use Tableau. There’s all kinds of stuff you can do.
And it’s sort of funny because I mean one thing that’s strange for me about these tools is they all support like pie charts. Do you know what I mean? Or like, even worse actually in some cases, 3D pie charts. And it seems like there should be some kind of like warning or like a pop-up box or like you guys remember Clippy right? Like Clippy for like Microsoft? It’s almost like, imagine like Clippy except it was like a little head shot of Edward Tufte. And it was like you’re about to use a pie chart. Are you sure? You know what I mean? Or like try to like try to talk you out of it or like recommend a like better like data visualization method, right? I think, I mean there’s like, I’m like I’m not sure about this. I’ve been doing some work on it, but I think even in the case of like if you’re trying to illustrate how much of a pie has been eaten, there’s a better choice of graphic than a pie chart.
None the less, right, dashboards. Dashboards are a great analytical application. There’s a lot of more fun stuff that we can talk about. Have you guys seen Cross Filter? Has anyone seen Cross Filter out of like the D3 guys? Like Mike Bostock and those guys? Cross Filter is fascinating. So if you go like you Google Cross Filters will come up. This example application they have of using Cross Filter is showing flight information. Flight arrival times, flight delays, all that kind of stuff. And it’s this really beautiful, fantastic, interactive application for analyzing large amounts of time-series oriented data. Like interactively, right? So there’s all this data. You get this fantastic user interface for like working with it and interacting with it, even though you, yourself don’t really know D3js. You don’t really know where this data came from or maybe even what an airplane is. It doesn’t matter. You can still use this application to do really kind of very interesting analytics.
Some topical examples of analytical applications. You guys, everyone enjoy the election the other day? That was pretty cool, right? Yeah, come one, yeah, all right. Well, this is San Francisco for God’s sake, right? We voted like 97% for Obama. We should be happy. Anyway.
Audience Member: 84.
Josh Wills: Eighty-four, is that what it was? Eighty-four? Okay. That’s like, that’s probably lower than 2008. That’s kind of surprising. The New York Times, again, this is basically, actually, maybe we should just put up a picture of Mike Bostock during this talk. Like this is God. Did this wonderful, wonderful, wonderful set of visualizations. I was like for you guys. I’m at work on a Tuesday and I’m sitting there and it’s like 5:00 pm. And election results are coming in from all over the country. And I’m on the New York Times website using their like visualizations. And I know I should go home and watch like TV with my friends and watch the election results come in, but I don’t want to because these visualizations are so good. And this is like such a better way to consume information about what’s happening in the election right now.
So this particular graphic I love. This was on the sort of the vote differential by county relative to 2008. So the red arrow that is sort of moving to the right there, which I suppose is appropriate, are counties that shifted to vote more for Romney than they did in 2008. And then the blue arrows, which are a bit more sparse and kind of tougher to see, are votes shifting toward Obama in different counties. So they had this big, really nice graphic of like all these different shifts. And you’ve got, basically you saw this in real time as the data was coming in. As like county results were being reported, this graphic was being updated. And I’m pretty sure it’s still up, so if you want to like, if you have a laptop open or a phone or whatever, or probably mostly a laptop and you want to go play with this, I highly encourage it. And it will not bother me at all. You won’t like hurt my feelings or whatever.
You could zoom in on individual states and then mouse over counties and like just literally see. In particular, in Florida you could see the ways in which Tampa was voting substantially similar to how they did in 2008. And if you knew anything about electoral politics you would know that’s a really good sign for Obama. That was a very positive thing for him. That was more likely than not, he was going to win because Tampa was voting the way Tampa did in 2008. Like that kind of thing. All right.
So when I’m talking about analytical applications, I want to make a distinction here, and it’ll be an important distinction in the rest of the talk, between an analytical application and a framework or a tool that helps you build analytical applications. So in these examples, this is all tool plus data, right? This sort of complete set formatted and packaged in a way that makes it available to a less technical user. Somebody who does not necessarily know nearly as much about data analysis, visualization, all that kind of stuff as you do. That is, in my mind, an analytical application. Things like AR, SAS, Click View, all these different tools, these things are things you can use to build analytical applications. They are not analytical applications in and of themselves as I am defining them for the purpose of this talk. Does that make sense to everybody?
Right, it’s like I could build something really cool with AR. I could build a cool website. Actually, did you guys see the AR studio guys today released a new framework called Shiny for doing like AR web aps? It looks really sweet. I’m like totally excited to play with it. But anyway, that’s just an important distinction. I want to throw that out there.
So I want to talk first a little bit about how we develop analytical applications. And this is in a general case. This is not in a Hadoop case quite yet. We’ll get there, but we’re not quite there yet. And so again, being topical, I want to talk a little bit about the sort of the process of predicting who was going to win the presidential election, right? If you guys don’t know who this is, this is Nate Silver. And he is the patron saint of math, at this point, for all intents and purposes. He’s a blogger, statistician who worked as a contractor for the New York Times. In a very popular web log website called 538.com. The New York Times picked it up. He does electoral predictions. He used to do baseball predictions. Now he does electoral predictions. He is a rock star. And he’s like really, he’s the guy right now.
So a lot of different people were working on models for predicting who was going to win the presidential election though. Nate was maybe the most famous one, but there were a few of them. And I want to talk about them right now. And talk about sort of the key differences between them. So first one, this is actually the oldest one, I think, that’s around right now, which is the Real Clear Politics model, okay? If you go, this is realclearpolitics.com. And they have kind of a right wing sort of slant as like in terms of their commentary and stuff like that, but they are honest and they post all these polls. And they give like great data. And they make it available to people.
And the way they predict which state is going to win is they take all the polls for a given state and they just compute the average of them. And the guy who’s going to win the average is the guy who’s going to win the state, roughly speaking. And there’s like different sort of strength levels here with respect to how likely they are to win the state or whether they’re like far, far away from like the zero point or is it pretty close, like that kind of stuff. So this is a really simple model, right? This is just here’s some data. Average it. Not rocket science. Not very advanced statistics, right? It’s very, very transparent. It’s totally obvious what they’re doing. You can see the raw data on their site. You can check their calculations. You can verify they’re true.
And as far as a data application goes, it provides actually like sort of many nice interactions. So I can like take one of these maps and I can change states. And I say oh, well, I’m going to assume that Colorado’s going to go to Obama. Like what are the implications of that? Right, so for like basic interactions, the little bit of like interactions you can do with this, both in terms of changing the inputs, like assuming a state is going to go differently and what are the implications of that, and then in terms of the outputs and like doing like drill downs and visualizations and stuff like that of the data. So really kind of like good, simple, straight forward, analytical application.
Now, 538.com, much more complicated model. If you guys, so Nate Silver like timed things very well and wrote a book called The Signal and the Noise that I’m reading right now. If you guys heard of sort of this the allegory of like the fox and the hedgehog? The hedgehog knows one thing really well. The fox knows lots and lots of different things. And it’s a way of describing different ways people see the world. Some people it’s like, they have like their thing and that’s it. And they just, this is the model. This is the preconceived notion. Foxes are much more like, let me try lots of different things and see what works. And see like which ones are consistent with each other. They’re sort of a, they don’t want to like put value judgments on these things kind of like living in San Francisco too long. The fox model is better, yeah. There we go. That was easy.
So Nate uses lots and lots of different things as inputs to his model. He uses the state polls. He uses economic data. He analyzes correlations between states. So if I know something about how the polls are in Wisconsin, that tells me something about how people are likely to vote in Minnesota because they are very demographically similar, all right? Lots and lots of different things, much more complicated than just say a simple averaging of the polls. Not transparent though, fairly opaque. Nate doesn’t share his model. It’s a proprietary thing. It’s kind of like locked down. It’s his pay check basically. He’s not inclined to like open source it to people. He also provides some simple actions in the same way Real Clear Politics does. I mean so you can basically mouse over the states and see the probabilities of the wins. And then down here, although it’s a little tough to see, there’s a histogram of like different electoral probabilities and how likely different electoral outcomes are. So same kind of basic, simple interactions, but with a much richer UI, all right?
So last but not least, the Princeton Election Consortium. And this is kind of like, this has become like the cool kids, you know. This was the place you go if you were like a really statistician kind of bad ass and you wanted to like keep track of what was going on in the election. Again, it’s a fairly simple, fairly open method. They take a recent sample of state polls. They grab the median. Pretty simple to do, right? And then they have this fairly complicated polynomial for taking that median, converting it into a probability and then generating a big electoral outcome based on those probabilities. So it’s a very, kind of like, it’s basically a kind of cool math trick that makes something very complicated, very simple to actually compute.
So for instance, the 538 model runs something like 25,000 simulations. And that takes a few hours to do apparently. This thing can be recomputed in like two seconds. It’s that kind of thing, right? So very transparent, relatively simple model again, like it’s just state polls, medians and this polynomial model. That’s it, right? Well, the thing I really loved about this was that it has a really nice, rich set of interactions. And rich in the sense that it actually lets you change the assumptions about the input data. You can assume that the population as a whole, like say you can assume that the polls are biased 2% in favor of Romney, or 2% in favor of Obama, and then recompute things in a Java Applet they have on the site that lets you see how those kinds of bias assumptions effect the downstream model. And you can see like shift all the probabilities in the states. So you really, you could see like the polls aren’t perfect, right? Although they were obviously did a fantastic job this time. So if they’re wrong by X percent, what are the implications of that? What would happen in that situation? So you get like a really much richer, it’s a much richer way for you as a user to interact with the inputs to the model and not just their outputs. It’s not just like a pretty way to consume it. You can actually change the assumptions the modeler made going in in the first place. Now that was pretty cool. I like that, all right?
So how did these guys do? How did everyone do? This is another useful website, isnatesilverawitch.com. This was actually, it’s really, it’s kind of like and the answer apparently right now is probably. I love this website. The great thing was like there was actually an API for the website as well. So if you like do isnatesilverawitch.com/update.JSON, you get back a little JSON object that has like a probability. Like it was like 61% I think the other day. Something like that. So listen, Nate did really well. Actually, to be honest with you, all three of these sites did really, really well.
I’m going to pop back to PEC just for a second here. Both PEC and Real Clear Politics correctly predicted 49 out of 50 states. People missed Florida. They did not get Florida. The average of the polls made it look like Florida was more likely to go to Romney. It’s not definite yet, but at this point it looks like probably it’s going to go for Obama. Nate managed to get all 50 states. He got everything. Sort of his Florida probability was something like 50.3% or something like that. It was like basically a toss-up, but it was stronger for Obama, so he gave it to Obama.
Even more so than that though, these guys did a really good job of predicting not just who was going to win, but the actual sort of percentage of votes in the different states. Like what fraction of the vote in Ohio was going to go for Obama versus Romney. CNET did an analysis of this across a bunch of different pollsters, a bunch of aggregation models, and Nate Silver did the best by far. He beat everybody. He even beat the PEC and beat Real Clear Politics. So bully for Nate.
Cool sort of like fun thing that came out of this, people started Nate Silver Facts as a hash tag. Let’s go through a few of them right now just because they’re fun. Results #[17:01:0] Silver if they’re significant. I kind of want something here like Nate Silver’s anecdotal experiences are statistically significant or something like that. Do you know what I mean? Nate Silver does not breathe air. He just periodically samples the atmosphere. That was a funnier one? Okay, that was good, all right. The next one’s mine. Let’s see how this one goes over. Nate Silver knows the mean of the Koshi Distribution. Nothing. Nothing? Not really? All right. The Koshi Distribution’s a probability distribution where the mean is undefined. That’s why it’s funny. Anyway, it’s good to know where I stand there. Anyway, I’ll keep working on that.
So here’s the rub though, somebody beat Nate. Nate Silverfax, right, somebody actually beat him, an expert. People know Markos Moulitsas, the Daily Kos guy? Have you ever heard of this guy, right? He actually did better than Nate in predicting the swing states. And he posted it on his blog. He like, I mean, he did it. He proved it. Like the day before he posted, these are my predictions. Then the day after he compared it to what actually happened. And he did better. He basically took all the polling data for the individual swing states, this is, I’m sorry. The photo was designed to be like kind of sad Nate Silver. I mean he might not be sad. He might just be like working really hard. It’s tough to tell. But Kos did better for a couple of reasons. One, he knew that in Colorado and Nevada, the polls tend to underestimate the Latino vote. And so he basically bumped Obama’s margin, his prediction of how much Obama would win by, in those states higher than the polls. And that turned out to be the case. And the other thing he did was, he knew that apparently in Iowa, the Des Moines Register there, whatever poll they do, has correctly predicted like within much tighter than anybody else for the last 20 years, so he didn’t use the average in Iowa. He just used that one poll. Those were basically like the two tricks he did to outperform Nate Silvers’ very complicated, very advance statistical model, right?
So you can think of this as like basically, he’s an expert. He knows a lot about politics. He like lives and breathes this stuff. So he’s capable of basically doing a better job, perhaps, of filtering data than Nate’s model can. Right? He can bring some expert judgment to things, some priors, the model probably wouldn’t feel comfortable making. He can do that. So I guess what I’m saying is Markos again also predicted every state correctly, including Florida, right? So very advanced, like sort of fox versus hedgehog kind of models are amazing and it’s like clearly Nate Silver is some kind of witch, but nonetheless, having an expert armed with really very simple data, just like literally here’s the polls, right? No fancy models, nothing like that. And an expert opinion can do just as well, if not better, than like a very, fairly like fancy advanced technique. I don’t know if this is like the right audience to be sending this message to. Maybe like the business people would feel better hearing this than this sort of people in this room. Anyway, one of those things.
So the analogy I draw here, the analogy I draw between Real Clear Politics with their very simple model, Nate with his very complicated model, and then like Kos with his fairly advanced knowledge, his expert knowledge of politics, with a respect for data. I guess it’s sort of the big caveat there, right? Like expert knowledge plus data, not expert knowledge by itself. Index funds, hedge funds, Warren Buffet.
So if you don’t know anything at all about politics, and you’re basically making a bet on the election, just doing a simple ARs of the polls is probably a really good idea. It doesn’t require that much work. You don’t really need to know that much statistics. It’s like 98% effective. Like this is really good, right? If you need a little extra on top of that, if getting the exact sort of like best prediction of popular vote is really important to you, then you need something more like a Nate Silver model. You need something like a hedge fund, right? Where you’re really looking for absolutely every single edge you can possibly get.
Optionally, maybe even better than that, be Warren Buffet, right? Be an actual expert who has a certain like perspective in a way of interpreting data that outperforms even like the fanciest hedge fund models, at least over the long term, right? All sort of valid approaches here depending on what your level of expertise and what your skill sets are. All are very, very effective, okay?
So if I may, right now, that was my kind of general analytical applications. This is what I think of analytical applications. I sort of assumed when I came here, this is a San Francisco audience. I assumed Hadoop expertise. I assume many of you know more about Hadoop than I do, to be honest with you. I don’t know if that’s fair or not though, based on some conversations I had so far. So how many of you are familiar with Hadoop? We’ll see the less embarrassing question. Okay. Most people are like reasonably familiar with Hadoop. But I see that not like every hand went up. So I’m going to go through, this is kind of Josh’s standard, this is Hadoop spiel. And hopefully, if you know what Hadoop is, I provide a little bit of color commentary that makes it perhaps, more interesting than the 97 other this is Hadoop spiels you’ve heard before. But I’m going to ask you to kind of bear with the people who haven’t done this, because if we don’t do this, the rest of the talk’s not going to make any sense to them. Okay?
So the way I normally do video. You know, I think I just realized something. I’m pretty sure my iPod is on and I’m like hearing the music right now. That’s what it was. I’m not crazy after all. That’s a relief. Thank God I’m recording this talk. So a brief introduction to Hadoop. The best way to understand a new technology is always in terms of technologies you already understand, all right? So let’s go back in time 11 years to 2001, back when I graduated from college. I’m sort of dating myself there. Usually when I say that, half the audience will laugh and be like, oh my God, he’s so young and the other half will be like oh my God, he’s so old. Anyway, data source in 2001. If I have a whole bunch of data and it’s 2001, what are my options? What is available to me? What technology can I use?
Option number one and it’s a very good option, is a database, right? Relational databases. So relational databases have structured schemas. I specify ahead of time, this is how my data will be structured. Database I promise that all the data I give you will match the structure for this particular table. Okay? Super cool thing about databases, you can do really very intensive processing on data where it is stored. You send a little, tiny, itty bitty SQL query to a database and it does like a massive amount of processing on its result of that query and it gives you back a relatively small answer. And that’s kind of a cool idea, right? That’s much better than having to say suck all the data off the database, do your very simple processing on it, and then like maybe write it back again. So this is a great thing about databases. In 2001, databases are like pretty reliable. I’d probably have like a backup somewhere just in case. And they tend to be a little bit expensive at scale. It’s really, instead of a little bit, a lot expensive at scale. All right, that’s option one.
Option two, a filer, a network attached storage, a storage area network, something like that. Cool thing about filers, there’s no schemas whatsoever. They’re just files. I can store anything I want. I can store images. I can store video. I can store audio. It doesn’t matter. I don’t need to cram it into some database schema ahead of time. This is very cool. Downside though, there’s no data processing capability. So if I have to do any processing on this data, I’m back in the situation where I have to kind of pull it off, do whatever processing I’m going to do, write it back again. Not very performant. Networking’s a big bottle neck. This is a problem, right? These systems are super reliable. In fact, that’s what you’re paying so much money for. They basically are promising you, in no uncertain terms, I will not lose your data. And again, we have the problem with they are a little big expensive, again, a lot expensive at scale.
All right, so that’s 11 years ago. Then this happens, right? We have this sort of ridiculous explosion in data. In fact, it exploded so much that the image went off the slide, apparently. So it was completely out of control, right? Ridiculous growth in data. Web application stuff, medical stuff, sensor stuff, like we have data coming out of our ears, right?
Data, we have data coming out of our ears, right? It's absolutely ridiculous right now. Okay. We also have what I think of as a shift in data economics, and this is important, I think to data science in particular. So this is classical data economics. Classical data economics is all about return on bit. How much does it cost me to store this amount of data? How much value can I extract for that amount of data? If the value is greater than the cost, I keep it around or I keep it online. If the value is less than the cost than I put it on tape, or I just throw it away. Basically the same difference, right? That kind of thing. That's sort of classical data economics. Last 10 years though, we had what I call big data economics. This is me obvious abusing a marketing term for my own nefarious needs. I hope you guys will forgive me for it. Big data economics is the case where no individual record is not that valuable, but having an enormous collection of records is invaluable. Everyone in this room has a webpage on their on their phone, or on their laptop, or whatever. Right? A little cache of 100 webpages. Not worth anything. Everybody has webpages. Not a big deal. But if I have eight billion web pages, I can build a search engine, and that's very valuable. Everyone in the room right now has a receipt probably in their pocket right now, or a couple of receipts. Not really worth much of anything. If I have 10 billion receipts, I can build the world's best product recommendation engine. Right? You can build data products by analyzing the relationships between very, very, very large number of things that have no value. Right? That's the kind of cool idea. So, value kind of going up there; it kind of Metcalfe's Law for data in some sense. Data that has strong relationships where I can analyze those relationships cost effectively, can be very, very valuable. All right?
So, we have a lot more data. The technology that's available to us in 2001 to store it are really expensive. What do we do? This is really what Hadoop is designed to solve essentially speaking. This is what Hadoop is about. The way to think of Hadoop if you know databases, and you know filers, is to combine aspects of both of these systems. So, Hadoop first and foremost at its core is just a file system. You can store anything you want on Hadoop just like you would use a filer or #[27:14] or something like that. Like a database though, we are going to send a code to the data. We're not going to pull data off, do processing on it, and then write it back. We are going send out very tiny amount of code to a very large amount of data stored in the Hadoop cluster. We are going to process all the data there. All right? So processed data is stored like a database. Store anything you want like a filer. Of course, there's no free lunch, right? We'll talk about that later, but that the real basic way to think about it.
So, core components of the system, Hadoop's distributed file system, right? This is based on Google's file system. Google wrote a paper about this. They published it, and the Hadoop guys use it as a basis for that. People, most people keep their file system right by like file systems are made of blocks. Well, each file is a whole bunch of blocks and the file system keeps track of which names map to which blocks. On your #[28:04] laptop file system a block seem to be like 4 kilobytes, like pretty tiny, could use about being big files, so typically more like 128 megabytes for a block, 256 megabytes for a block, their big blocks. Each block is replicated to multiple nodes in the cluster three times. This is basically done to ensure reliability. I have had people ask me, "Josh, Why, how did Google ever come up with this in the first place?" And the reason was, and people don't think of this now because Google is such an unbelievable behemoth, but back in 2000 when they were developing this stuff Google had no money! And, no there was no AdWords. There was nothing like that. There was still massages and good food! But there was really not money to pay for it! So, they had to buy very cheap, crappy hardware! And it was hardware that failed essentially like all the time. And after being burned a few times by having like the hardware that contained their data failing, they decided to start copying like manually coping data to multiple systems, multiple like nodes in their cluster, you know, multiple computers, right? So that if anyone failed that was okay. They had it stored someplace else. When they developed GFS, it was really just about how do we make this thing we are doing already manually, automatic and more efficient so we don't have to do it anymore. That was really the idea. Right? I'm all about, you know, innovation, drive. Sorry. Constraints, drive, innovation. It would be weird innovation, drive, constraints! Constraints, drive, innovation, that makes more sense! Constraints, drive, innovation, the reason like Google had to do this was because they didn't have any money. They couldn't go buy a really expensive filer. They couldn't buy an expensive database to keep this stuff around. It just wasn't an option for them. All right? So that's HGFS.
The processing framework map reduce. Map reduce has three stages to it. There is the map stage, which is embarrassingly parallel. We are going to take each input record from the file and we are going to process them independently with every other record in the file, and we're going to emit key value pairs, okay? The shuffle stage is kind of the brains of the system that's actually implemented by the framework itself. You don't have to do that yourself. What we're going to do is take all those key value pairs that map phase spit out, and we are going to sort them so that all the values for the same key are grouped together; so big block for key 1, big block for key 2 so on and so forth. And then in the reduced stage we're going to process all the values for the same key in one big step. Just integrate through them do whatever processing we need to do on those values. All right? Key idea again, kind of like database. We're processing data where it is stored. Map, we write a little map function in Java, or Python or whatever and we write a little reduced function. Same kind of thing. We send it over to the cluster, and the cluster has a way of executing our code against the data itself and then aggregating the results for us, doing all the shuffle. Map reduce, it seems kind of simple. It turns out there's an enormous variety of problems that can be mapped into one or more mapped reduced jobs. You'd be amazed and there's lots of books online if you're interested in understanding how that works. The genius of map reduced for me as a software developer when I first started using it, was the realization that if you write your map reduced in the same way, and you're not sort of intentionally evil about it, once you've written it you done. If the data volume goes up by two, you just throw more hardware at the problem. You don't have to rewrite the system. I don't know what it's like for you guys, but before I had map reduced I spent a lot of time rewriting code I'd written when the data volume had gone up by two or by five or by 10. That was like that was the way I spent a lot of my time, and I don't do that anymore. And it's awesome because my time is very expensive, and I have lots of stuff to do, and blah, blah, blah. Right? This is like sort of, this is one of the key, maybe really one of the obvious kinds of genius parts of Hadoop to me, and I'm going to talk about the other one later. All right? That's, yeah, good stuff. Okay.
All right on to the fun stuff. Developing analytical applications with Hadoop. Again, keep in mind when I am saying analytical applications I mean code plus data. I am not really talking about framework so much, at least I am trying not to. I am probably going to fail, but I'm going to give it a shot. Okay? That kind of thing. Code plus data, analytical applications. All right, rule number one in developing analytical application on Hadoop. It's really just kind of a general rule. "Novelty is the enemy of adoption". You take something and you give it to people, and it's like completely new, and completely different, it will freak them out, and they won't use it. So, the first analytical applications I developed at Hadoop right after I left Google, and started working on Hadoop stuff, we really were looking for things that people were doing now, on large amounts of data, but were doing badly or slowly or whatever. And it really was sort of easy and nice to translate into Hadoop, so they could basically be doing basically exactly same thing. Ideally like literally running the same commands and the same applications they were running now, except running it on Hadoop instead. All right? So the first application like this is a sort of, I wrote a project called#[33:04] seismic Hadoop. And for various like reasons that are not worth getting into right now, I got very interested seismic data processing when I starting working at Cloudera. Seismic data processing, this is basically how we find oil, essentially. We send like boats that have sonic waves, like air guns kind of attached to them, and like sonic receivers and we bounce sonic waves off the different sort of sub-layers of off the Earth, and we record the reflective waves as they come back to the surface. We analyze the data, and then we can see, it is basically doing an x-ray of the Earth. You can kind of, if you do it properly you can actually see the substructure of the Earth, and it lets you figure out where things like oil and natural gas are. And I love this field for a couple different reasons. The guys used for this are called geophysicists, and they've been doing like open source software on large datasets since like the 80s, and like I've sort of like discovered them. Like there's a tiny community of these guys that work at you know, Chevron and Exxon, and all these companies that one's ever heard of, and they have like these great open-sourced tools they've had around for over 30 years. And so they took one of these tools actually seismic units and I adapted it to run on Hadoop. So we can actually take, like essentially all you're doing in seismic processing is filtering things, sorting things, and adding things. That's fundamentally, that's the heart of seismic data processing. Hadoop, Google lots of things, #[34:25] is good at filtering things, sorting things, and adding things. These are all Hadoop's strong suits. So I literally could take their exact same commands they'd be running on a file stored at their desktop, and take that command and execute it on data that was stored on a Hadoop cluster. They didn't have to change anything.
Now I'm behind the scenes having to do some kind of magic to like parch a command to figure out which stages had to be turned into map reduces, and which stages had to be, or which stage was like the shuffle and what was the reduce, and all that kinds of stuff. But their experience was absolutely no different. Right? No change whatsoever. This is very good. This is very key. This is the novelty is the enemy of adoption. And the Hadoop community realizes this. And so in my opinion, really the best way to get started on Hadoop is a patchy Hive. The patchy Hive is a data warehousing metaphor on top of HGFS and map reduce. You have a sequel base query language, you know, a create table, insert things, select things, group by, where all that stuff sticks around. Okay? Like sort of basic you can look at the code and like roughly figure out what its doing if you know informational databases really well. Hive of course has some map reduce extensions. There's some things that like make sense in map reduce that don't make sense in sequel and vice versa, so there's like certain kind of map reduce operations. There's some like analytic functions like live lead operations that don't really make sense in Hive and so you can't do that in Hive, and like that kind of thing. But it's a great way to get started with a new, again novelty being the enemy of an option. Right? All the tools, all the different, microstrategy, #[36:00], business object. I shouldn't have left out either one of those, but you know, they connect to Hive and they understand Hive. They speak Hive. Hive kind of acts like Tables. One of the things though, we do this a lot we borrow these abstractions. We have developed, we in the sort of analytical, data analytics, processing community have developed an enormous number of really effective abstractions working with data. Spreadsheets. Right? Spreadsheets are an abstraction. We can use them for everything, and we do. It's one of those things where spreadsheets have an original meaning. Right? Spreadsheets was an actual paper thing that was used in accounting like 40 years ago, or whatever. Right? But it's becomes because it was so useful in its general purpose it's become an abstraction that we apply across all kinds of different data sets. Right? So we borrow these kinds of abstractions. Hive does this, #[36:52]Data Mere does this, Karmasphere does this. All these different tools built on top of Hadoop to provide you with star schemas and relational data, you know relational database things kind of semantics. Right? All the objective of making it really easy to get started with Hadoop for you, all right?
Now, there's a couple of problems with this. If you just see Hadoop through the lens of these abstractions that you're familiar with; if you just see Hadoop as like just a really big spreadsheet, or as a really big honking relational database, you don't really fully get the power of Hadoop. You don't really fully get what you can do with it. And I want to talk a little bit about that in the rest, in the rest of this talk. I guess my feeling is if you come to Hadoop and you're expecting, you know, spreadsheets and you're expecting very high performance data warehousing stuff, you probably are going to be disappointed. It's not going to be nearly as good as like the 40 years we spent working on spreadsheets or the #[37:50] eight years that we spend working on relational databases. There's a guy called Michael, Mike Stonebreaker. Anyone heard of Mike Stonebreaker? He is a database guru, #[37:59], like he's a long time database expert, knows all this stuff. And he loves railing on Hadoop, love ragging on Hadoop. He kind of, trolls is not the right word, but. Trolls might be the right word! And the only thing that I can figure is that his objection is that, when it boils down to it, as I see it, is that Hadoop is not a relational database. And I don't know who the hell that told him it was. And I am not sure, like I would like to find that person, and I would like to find that person and have like a stern talking to with them. Because if you're using Hadoop like a relational database, you're not really using it the right way, all right? That's, that's kind of my yeah, all right. That's good.
There's some things we can do to address this. There's some things we can do to make so the Hadoop experience be more like what you're used to with a relational database. A couple weeks ago, this is a good story; this is where the ad, advertising section of the talk kicks into high gear. We at Cloudera released a project called Impala. This is Cloudera Impala. This is our fast query engine on top of Hadoop. This is 100% open source Apache license#[39:03] link's up there. You can download it and play with it, and I highly encourage you to do so. And this is where we take ideas from big relational databases, from massively parallel processing in relational databases and we adapt it to data stored in Hadoop. So you get much, much, much faster, you know, more interactive depending on the query, query ability with Hadoop. You basically make Hadoop feel much more like the relational database abstractions that you're used to working with. Right? And, this is, this is a good thing. This is how, you know, we try to solve some of the, you know, the "leaky abstractions" isn't really the right term, but abstractions that aren't quite what you, you know, abstractions that obviously don't fulfill all the contracts of the abstraction, I think. They aren't exactly what they claim to be. They're sort of like that, but they're also different in some other ways, all right?
But more than that, and what I really, what I really want to talk about tonight, and it only took me about 40 minutes to get to it, is how we move beyond these abstractions. How do we actually kind of let go of the ways that we think of data, the spreadsheet orientation, the relational database orientation? All that kind of stuff. The ways that we currently kind of think about building analytical applications and working with data. And I have a few tricks that I want to, that I want to suggest since we're out there. And I think the first one I want to say is we want to, the first one is, we want to make the abstractions concrete, and this is important. We're not trying to build a relational database, a star schema. A star schema is a really useful abstraction. It is a pattern one can use across all kinds of different data analysis problems. So I can come up to a relational database, see that it has a star schema, and if I know nothing about that problem domain, I'll at least be able to understand roughly how to query it. Right? I'll understand that there's some fact tables, and snowflakes, and I can basically get it, right? But if we just try to take that model, and map out perfectly on to Hadoop. First, of all it's not going to be possible. Hadoop isn't a relational database. It's not going to be reasonable to ever do that perfectly, right? But my bigger concern is we're going to miss sort of the possibility to look at new ways of creating analytical applications that are appropriate for different domains. And so what I really want to move to when we are talking about building analytic applications is actual concrete applications. How do I make this patient data available to this doctor who doesn't even understand sequel? This is, this is the world that I want to speak to. Here's a, I don't mean to pick on the medical people, but here's a biologist, right? Who has to work through 17 terabytes of like genetic sequencing data, and like can't. Doesn't even understand the shell, right? Doesn't do anything of this stuff. Maybe knows a little bit of pearl at most, right. This is, this is a really new, fun challenge for us. We have a new set of users who have to work with larger data sets than they have ever had to work with before, and they have far fewer technical skills. So our job as data scientist is to fill that gap, is to fill that application gap. And my feeling is in the process of doing that, we will discover new abstractions and new patterns equivalent to spreadsheets and equivalent to the like spreadsheets relational databases, star schemas, all the stuff that has been so powerful, and that has served us so well for the last 30 years. I think we are going to be able to discover new ways of doing this stuff; much better on top of Hadoop.
And I have some feelings of how this is going to go, but before I do that quick shameless #[42:35]plug. Teaching a data science course next week. Cloudera about a week or so ago I read a blog post announcing our new data science course. It's going to be three days. Basically, essentially, the course is about building a data product. It's about building a recommender system in particular. Really the course is about how to munge data on Hadoop. Here's a bunch of web logs. Here's a bunch of #[42:59] data. Here's a bunch of database data. How do I combine all together, and prepare it to be input to a machine learning model? That's really kind of what the core of the course is about. I want people to see Hive and see Hadoop and see the way to work with it, the way that I work with it, and the way that I see it after having done it for five years, and done it for Google, and all that kind of stuff. I think it will be helpful. At least, you know, setting a very kind baseline, minimal level of understanding of what you can do differently when you're analyzing data on Hadoop. All right? #[43:53] shameless plug over.
Analytical applications I love. Things Hadoop based things that I really like, very much deeply enjoy. This one, this was actually kind of great. So Google had this, basically everybody had this. Twitter has this, Facebook had it; an experiments dashboard. So I, average developer, right? Just regular software developer can come up with some experiment, some #[43:59]EB test, some multivariate test that I would like to try. I have a system for deploying that experiment and the production, automatically gathering data on it, automatically writing that data into a Hadoop cluster, automatically running map reduced jobs that compute metrics over that data, compute #[44:15] centers over it, and then populate a database that then shows me the output of that. That is a full scale analytical application. Right? Soup to nuts. The UI for me is that I change the configuration file and I push it out to production, and then I wait 24 hours because obviously this stuff is a great batch, and I come back and I have a dashboard that I can now interact with and like get information about how my experiment is performing. This is one of the problems that I think has basically been solved over and over and over again, and that may be a theme of tonight actually like dashboards are a problem that have been solved over and over and again. I think this one is in the boat. Everyone does this themselves. There's not really any reason to, and so far as I can tell like the math is basically the same. I don't really think that there is like any secret sauce to developing these systems. Some company ideally founded by one of you, should start a company basically sells this as a product. Here's the Java PHP, okay, maybe not PHP, Python, you know, C++ libraries that write logs. Here's the language you use for calculating metric on those logs. Here the dashboard that computes intervals off of that data. That's and analytic application. Soup to nuts. Right? Powered by basically purely, not purely by Hadoop, but its center to do data processing. Right? Great, great stuff. Making people who don't know statics, who don't know the right way to compute account like #[45:39] intervals, whatever, make this sort of available to them. Give them a sophisticated interface for understanding how their, like how their ideas are performing. They just have to have the ideas. You have to have the expertise. It's all embedded in the application. All right.
Adverse drug events. This is, I think the second product I worked on at Cloudera. Adverse drug events people you're taking a whole bunch of drugs, and something bad happens to you, something unanticipated happens to you. There's potentially some kind of interaction between all these drugs you were taking, and it causes, you know, you get sick, you go to the hospital, you die, whatever, anything like that. It's nice to kind of trivialize death in that way. Sorry about that. A few million of these happen every year, and when they do they get reported to the FDA. The FDA maintains a database and they are actually kind enough #[46:30] such as myself to publish a non#[46:34] sample of this data. They make it available online. Who's whistling? Yeah. That wasn't like a "Your running over whistle." Right? It was just a, okay. Anyway. To make #[46:49] statement available online. Now, what we're looking for is drug interactions. Okay? So the person is taking a whole bunch of drugs. There's a whole bunch of reactions associated with it, right? So we want to analyze basically sort of kind of a #[47:00] explosion of all the possible drug interactions we have in this data set. All right? This is one of those things were you start out with like 400 megabytes of data when you start the analysis, like just very tiny, small amount of data and in the course of doing it you generate about 2 terabytes of data. Right? So like 4,000 times roughly more data than you had when you started, right? Because you had a big, huge #[47:23] commentarial explosion of all the different drug interactions. This is also an analytical application to me, and let me tell you why. When I am doing this analysis in the same way when Nate Silver is building his model, I am making an enormous amount of assumptions about which patients are similar to each other in some sense. I am assuming that a 52-year-old and a 53-year-old are similar. I am assuming that males and females are similar. I am basically creating a bucketing scheme. Let's say okay everyone in each of these buckets by gender, location, age, whatever is essentially the same for analytical purposes. So these decisions, these assumptions that I make at the start the modeling process have profound implications on what the eventual output is going to be. And I am not in any sense like an expert in this stuff. I am not an epidemiologist. I'm, you know, not a biologist. I'm nothing like that, right? So the assumptions I make when I conduct the analysis, May not nearly, certain may not be as good as the assumptions an expert would make, right? But running this thing, running this pipeline is about 20 map reduce jobs, especially a series of pig scripts, pig being another high level map reduce language that executes this entire thing. And so the thing to do to make this an analytical application would be to configure it so that a biologist and just kind of just click, right? Create a #[48:38]U wire so they can just construct a bucketing scheme and just press a button, having a good run over the data. Right? And that would be a much, much, this is kind of a cool, neat example, and the visualization here is the visualizing the output of like the strengths of the adverse event sort of the relationships of different drugs. But taking this whole pipeline and making it available to biologists, and epidemiologists, and people that actually know this stuff, would be like a far better use of my time. Arguably, as much as other than giving this talk right now actually, so maybe I should, after this talk I'll go do that, so that kind of thing. I meant for that to be funnier than that came out. Sorry about that! Yeah, anyway.
Favor one. I spend a lot of time talking to biologists and doctors, so this is why I want it. The genes sequencing example. You guys were at an age base #[49:30]. Age base is sort of a key value store built on top of Hadoop. The guys at Next Bio did a fantastic presentation about how to make the results of a gene alignment analysis available to the biologist, who actually understand the biology. It's essentially created a search engine for these things. When you do a sort of sequence alignment, if you guys know roughly how gene sequencing works, usually you start with some kind of reference genome, right? There's like a reference set of pairs, and when you take all the little tiny like sort of snips.
And when you take all the little tiny snips, the little short reads out of the machines, the machine spins around and spits out little tiny short reads, twelve characters, 100 characters, something like that, you go hunt through the reference set of characters, and you find matching substrings. It's like doing a jigsaw puzzle in some sense. What inevitably happens, is when you're doing this matching process, you end up with differences, like there's differences between the reference genome, and the genome you just sequenced, and some of those differences are caused by actual biology, and those are the ones we really care about. There was actually a genetic difference between this person and this person, and that's why this person has this disease. But a lot of differences are caused by transcription errors. The machines aren't perfect so they make mistakes, that kind of thing. So if I have like four billion or so little DNA letters in a given genome, there'll be somewhere between three or four million differences between the reference genome and the genome I've just sequenced. So you take three or four million instances per genome or per gene sequencing, you sequence a few thousand genomes, and pretty soon you're talking about a pretty serious amount of data. But the biologists don't really know Sequel that well, at least by and large. So storing this data in Sequel is difficult for a couple of reasons. One, it's not really nicely structured for Sequel. It asks them to be in fairly advanced Sequel, i.e. well beyond biologists Sequel, so with the next bio and with what other folks are doing now, is taking this data and indexing it using Hadoop into Solar, and then basically making a query API for biologists that's just a search engine. Type in name of gene. Boom. Pull back all the variants for a particular genome sequencing that feature that gene. Done. And then they have really – I highly encourage you to watch this – they have some really cool visualization tools, so you can see around that gene, what other genes were different, what correlated genes that you know about historically that are different, they just did a really great job with it. They created this really nice new powerful metaphor for doing analytics on this variant data, on this gene variant data, because they were constrained by the fact that the people who knew the biology well enough, the expert knowledge to be able to understand the data, didn't have the technical skills to really be able to do it. And it was never going to scale teaching the biologists all the SQL they needed to do, they have other stuff to do. I'm assuming, I don't know what biologists do.
Audience Member : What is it called?
Josh Wills: If you just Google HBaseCon and NextBio, it will pop right up. We have all the videos from HbaseCon up on Cloudera's website and you can watch them right there. That's great stuff. Along those lines, at Strata Hadoop World in New York a few weeks ago, my friend Ryan Brush over at Cerner just did this absolutely wonderful presentation on – Cerner if you don't know is an EMR (electronic medical record) company. There's a few different companies like this out there. They have a system where it's again the same kind of thing except they're providing much more patient data, making it available to researchers, using Hbase as a central key-value store, Storm to do real time processing, BatchMap produced jobs to fix errors and then Solar on the end here to again make patient data available to doctors and researchers. So find me all women who are younger than 37 who had leukemia and this particular drug – just type it as a query, have it come back and don't just return the row, return everything you know about that patient to me, return it like the patient has a document almost, updated in real time. It's really, really – actually I don't think that's online yet unfortunately, but this is fantastic. I thoroughly enjoyed the talk. So, a few things in here. One is that take advantage of the fact that Hadoop is just the file system, that you can structure data however you want, you don't have to squeeze everything into a row, you can build documents, you can use time series, you can use gene data, you can build complex structures that represent everything you know about your customer, everything you know about a supplier, whatever you want. You can structure the data in a way that makes sense for the problem. Interactive inputs, especially from me on that adverse drug event problem, not just interactive outputs. Let people actually interact with the model, let them change the assumptions that you're making. Let them go, take that, and then re-run everything, and see what the output is. Think of the guys from Princeton Election Consortium, allow me to change the underlying assumptions of the distribution of the popular vote and the amount of bias in the polls, and let me visualize the outcome of changing those assumptions. And finally, last but not least, simpler interfaces for users who don't have the technical sophistication – simpler interfaces that give you more sophisticated answers that you've been able to have before. If we move beyond the notion of the table spreadsheets dominated world that we're used to, and we think about how we make large quantities of data available and accessible to people who don't have our technical skills, I think we'll come up with some really amazing stuff. The dream for me really is Wolfram/Alpha. You guys know Wolfram/Alpha, right? So Wolfram/Alpha is the online search engine version. You can type in – what would be something appropriately useless? – how did the position of the moons of Jupiter correlate with the GDP of Europe in the 18th century? - something like that. Wolfram/Alpha can do that. Imagine if you could do that on useful data, data you actually cared about? How amazing that would be. And I think Wolfram/Alpha's working on this. I think they have a product that's designed to plug into your infra-structure, and do Wolfram/Alpha-style computable functions on business data, which is fantastic, which is amazing. My concern though, is that it's going to be expensive, and it's probably going to be pretty proprietary, probably fairly reasonable, Ed, you would say so, pretty expensive to write that kind of thing. Apparently it is very expensive, and I'm sure it's very good and they're worth every penny, but working towards this, towards Wolfram/Alpha for business data in an open-sourced fashion, on top of Hadoop, Hbase, Storm, all that good stuff, this is what I really want, this is what I actually want at the end of the day. This is my dream. A little bit – because I promised to talk about this in the talk, and I've got five minutes left, and I promise I'll answer questions for like 15-20 minutes, no problem, but I will talk a little bit about moving beyond MapReduce. This is a kind of a good news/bad news talk. About a year ago, at NIPS (Neural Information Processing Systems), I gave a talk about machine learning on Hadoop. So if you google “Hadoop machine learning” the talk comes up. I talked a lot about a framework called YARN at the core of Hadoop, basically a way of splitting up MapReduce's core job-scheduling and resource-allocation systems into separate applications, so that you could run stuff beyond MapReduce on Hadoop – sort of if Hadoop has a limitation right now, it's the really “everything has to be a MapReduce job” and there's lots of other kinds of processing we'd like to be able to do on a Hadoop cluster. So, the problem right now is we have way way way too many frameworks for doing resource allocation and job scheduling on Hadoop clusters, and they're all basically backed by different big powerful-ish companies. There's a framework called Apache Mesos that came out of Berkeley's AMPLab and that's deployed at Twitter. Facebook today made a big push to open-source a system called Corona, which is their take on the distributed Hadoop, like job schedule or resource allocation thing. We have YARN now. YARN's largely developed out of Yahoo for the most part. Actually I contributed a little bit to it. It's cool. There was a blog announcing YARN beta, YARN alpha, whatever, and they thanked people who had worked on it, and I was the only person who actually worked on it, who had never worked at Yahoo, which was kind of cool for me. Anyway, we have too many frameworks right now for solving the same problem, and I don't honestly know why exactly this happened, or what horrible thing went wrong, but this is not good, because people are going to start other frameworks, like frameworks on top of frameworks, it's basically frameworks all the way down insofar as I can tell. Eventually someone may build something useful, but for now it's just a lot of frameworks. So we'll have a YARN machine learning framework, and we'll have a Mesos machine learning framework, in fact we already have two. We'll have a Corona machine learning framework, and we're not all going to be working together on the same thing. This isn't good, this doesn't help anybody. Why Hadoop took off in the first place is because Yahoo – Doug Cutting was at Yahoo, and had the insight and genius to say “we're going to develop this thing for Yahoo's use cases, but we're going to develop it at the Apache software foundation, so it's going to be open-sourced and available to everyone”. Then Facebook and Twitter when they got big around 2007 and they really needed a system like this, it was there and available for them and they all started collaborating on it together. That's the key to Hadoop's success. Yahoo putting in the ASF in the first place, because Doug insisted on it. Guys like Jeff Hammerbacher, the folks at Twitter, insisting that this was the way to go, and they were going to work on it and collaborate, even if they were competitors or whatever. And we don't have that right now in frameworks, in the sort of resource scheduling, job allocation stuff in Hadoop right now. This is not good. This is sadness for me. We've had this keen brain explosion of frameworks. So I'm a little disappointed. It's not the end of the world, you know what I mean. My hope at this point is that Pinterest will pick a framework and that one will win. You hope that Pinterest will be the negotiator and Pinterest is the cool company right now, so they'll say “you know what, we're going to go with that one” and everybody follows Pinterest and the problem's solved. So if anybody here works at Pinterest, it'd be awesome if you could make a decision in the next couple of weeks, that'd be fantastic. If you need any help, let me know. All right, this is my concern, but let's leave that aside. There's a lot of stuff you can do at MapReduce, beyond Impala. One of the great ones is Spark. Spark is very much worth checking out. If you work at a cool kid's start-up right now, you're probably using Spark. All these frameworks I'm going to talk about right now are in-memory frameworks, and as such they are typically much, much, much faster than Hadoop. Hadoop is very disc-oriented. That's why it's nice and reliable. In-memory systems, that do analytics in memory, are going to be much, much, much faster. A lot of the question between in-memory stuff versus Hadoop is “How do I want to allocate my resources? Do I want to devote all memory on my cluster to loading and analyzing this one data set, or do I want to run like 100 jobs on MapReduce, each of which will take up a little sliver of memory?” It's basically just a resource trade-off thing, as it is with most stuff. Spark's written in Scala, so if you're a cool programming language kid, and I highly encourage becoming one of those people, it really is the language for you. It's got a nice rappel, you can interact with data superfast, and like all these systems, it supports reading and writing day to day HGFS. So no matter what happens, whether it's Mesos, or YARN, or Krone or whatever, everybody supports reading and writing day to day HGFS. HGFS is the standard, it's one, it's over- blah, blah, blah. It's frameworks all the way down to Kraftlab. Kraftlab was developed at Carnegie Mellon. It's basically a lower-level set of primitives that has complex notions of synchronizing data across processors, but still fairly high-level. It still takes care of a lot of the really tedious aspects of distributed systems for you. Instead of map and review steps, you have update and sort steps. Very, very fast, reinstated from HGFS, and the reason I recommend it is because they've written a whole bunch of machine learning libraries on top of it, like Apple Box. You don't have to go reinvent #[1:02: 19] or like fitting a collaborative filtering model or a matrix factorization, it's been written for you already, which is – I'm very lazy, I assume most of you are as well – this is a good thing as far as I'm concerned. A little bit on YARN. So Google has their own framework, a system called Borg. It's funny, a Borg runs Google. Borg is the system that maintains all of Google's computer clusters, and manages resource allocation across them. There's a language for configuring Borg. Say, I want to kick off this job, it's going to run this binary with this arguments, it's going to need this much memory, it's going to need this much disc, all that kind of stuff, it's called BorgConfig. When I started playing with YARN on Hadoop, I noticed it didn't have that, so I sat down and wrote a library that would basically be like Borg for YARN. I'm going to call it Kitten, because it's playing with YARN. That's so cute. Kittens are great, aren't they? So, if you want to get started just messing around with YARN, an example that uses Kitten, it's called BranchReduce. It's my graduate student – at grad school I was an operations research guy, so Branch-and-bound, integer optimization #[1:03:00] I love that stuff. So Branch-and-bound is a way of solving very large, very difficult optimization problems by basically exploring a tree where we make various assumptions about what the values of variables are, and we see how the assumptions lead to better objectives or lower objectives or whatever. Nonetheless, if you're interested in playing with Kitten, or trying out something like YARN, and you're looking for a non-trivial application that uses it, BranchReduce is there for you. Just to give you a feel for what it's like to use YARN. And that, ladies and gentlemen, is it, and I'm happy to answer questions for as long as you like, but I'm going to drink some more water first.
Maybe a little bit more water first.
Okay, questions. Really? I explained everything?
Audience Member: Hi. Can you tell me a little bit more about #[1:04:40]
Josh Wills: #[1:04:41] is actually online as well, so it's a bit like #[1.04.50], so go play, have fun.
Audience Member: What are the differences between Impala and Hive.
Josh Wills: So Hive is really like an abstraction layer on top of MapReduce. So when you write a Hive query, it maps it down to actual MapReduce jobs. Impala doesn't do that. Impala has little Impala agents that run on all your data nodes and your Hadoop cluster, and they run all the time continuously. They're written in C++, and there's no MapReduce. We don't ever do MapReduce jobs. It's all deoptimization and aggregation and filtering is all done in Impala, just as it would be as if you had a big MPP database and you were issuing a query to it. Okay?
Audience Member: So does it replace Hive or does it -
Josh Wills: No, they play very nicely together. Impala uses Hive's metastore to know the scheme of data, and to know where data lives and where to read it, so the two can basically co-exist perfectly happily. It's not designed to be a Hive – you're still like batch-processing jobs, like very large-scale – #[01.05.55] ELT queries as being primarily Hive queries, and then your interactive queries as being Impala queries.
Audience Member: Are there other tools that you can write and pop like Tableau?
Josh Wills: So since Impala uses the Hive metastore, and Impala gets plugged right into Tableau, MicroStrategy, easy-peasy.
Audience Member: Are you still doing badge type of analytic application, so if you really wanted to do these real-time analytics, what would you add to it?
Josh Wills: What would I add to my stock? It would depend on what I was doing . It'd probably be some combination of Storm and Hbase at this point. If we can find the Cerner talk, I'll try and dig it up and link it to you, because it's really good, and it's a really nice real-world example of someone doing this, not messing around.
Audience Member: I've been reading that Nate Silver book as well. #[1:07:05] and I've been following the other pollsters like Rasmussen and he was saying that the problem with his polling, which was one of the worst -
Josh Wills: It was awful. It was like #[1:07:27]
Audience Member: He said the collection problem was because he was phone-banking or phone-calling and that was probably the last polling that'd be done by phone.
Josh Wills: Let's hope so.
Audience Member: Right. Just wondering if #[1:07:41] analysis if #[1:07:45] figure out how technology plays into the polling, if you could see how thatgoes forward, if people are not using phones or if they are using smart phones, or they're using other types?
Josh Wills: How does that work? So I mean you're right. I'm not an expert on this stuff, but I'm going to try to sound like one for the next 45 seconds. The NBC, the #[1:08:03] CBS Times, those kinds of things, they actually would be willing to call cell phone people, and Rasmussen reports did not, and if you don't call the people that are cell phone only, if you're not going to be able to reach them, you're going to miss a ton of Obama voters, because there's so many people less than 30 that are cell phone only including me except I'm older than 30, but nonetheless right. So the problem with doing that is that it's very expensive and so Rasmussen and Gallup and people who missed it badly, can't afford to do that every day the way they can now with the robo-calls. With respect to data collection, we're going to need some cheap way of figuring out what people are going to do, cheaper than calling up all these different cell phones, that has the same sort of statistical validity and all the good stuff we get from our more expensive models. I think it's going to take a couple of election cycles before we figure out how to do that.
Audience Member: But Silver was just averaging all these different polls?
Josh Wills: He was being a little more cleverer than that. He actually assigned weights based on the lean of a particular pollster, so Rasmussen had a particularly conservative-lean. Public policy polling has a particular liberal lean, so he was actually doing smarter weighting than just a simple average.
Audience Member: So you don't see a way to weight #[1:09:21]
Josh Wills: I'm sure we'll always be using hybrids. I just mean that I agree that Rasmussen's probably going to go away as a source of data and my concern is the NBC/CBS's, they aren't done frequently enough, because they're so expensive to really fill the gap with respect to data. I could be wrong, what the hell do I know? I'm a data scientist. I guess that's why I should know, you're right. Okay. More questions.
Audience Member speaks
Josh Wills: I have never crossed paths with IBM's Watson team. I saw them once at a conference and they saw me coming, and kind of ran the other way. No, I'm kidding. I've never actually met them. If you know somebody, I'd love to talk to them. I think it would be fascinating. I'm a huge fan of their work. Actually I feel bad, because you've asked like two questions already. I'm sorry. I'll come back I promise.
Audience Member speaks
Josh Wills: There's a tool Cloudera wrote many years ago called Scoop. It's now part of the ASF, the Apache Software Foundation, as well. It's based on a whole bunch of MapReduce jobs, that are smart about sucking data out of relational database in parallel, and then Cloudera provides connectors for #[1:10:50] for doing fast data transfer using their proprietary API's.
Audience Member speaks.
Josh Wills: It actually makes a copy of the data, right. So if you want to do something clever involving copying write logs, that would actually be fun. We should talk about that.
Audience Member: What kind of ideas would you expect to see in ten years to help all these problems #[1:11:11]
Josh Wills: What would I like to see? Networking. Networking, networking, networking. I'd like buddies because I'm an ex-IBM not for nothing. A few do chip design stuff, and they're always asking me “Josh what do you need in terms of chips for your Hadoop jobs?”, and I'm “I need low power, your chips from four years ago were amazing for the kind of stuff I do. I'm IO bound, I just need low power.” But what I really need, and what I'd be willing to pay ridiculous amounts of money for, is superfast networking. The 68 gigabit per second, whatever that stuff, let's just keep going. Let's not stop any time soon. Networking, networking, networking.
Audience Member: So apparently people are still #[1:12:45]
Josh Wills: It seems like a classic interviewer's dilemma situation. Hadoop is mondo cheap. That's where things begin. It's really, really cheap, it definitely does not provide you all the really key fancy features and stuff like that, like ASIC compliance, that kind of stuff. But it seems to me the people who use it and adopt it don't really care for the most part. It's just not nearly as optimal. The cost-effectiveness and the processing is like straight through put power for the kind of things they have to do. It seems like there will certainly be things added over time that are security stuff, like very much a hard requirement for adoption in a lot of places. No doubt that will be picked up and addressed. But it's just always got to be done in a way that doesn't ever put Hadoop in a situation where you're not running a commodity hardware and where we ever threaten the scalability, like the pedabite, multi pedabite skill ability of Hadoop, so it will be done within those constraints. For people who care about those trade-offs, and my hope is with things like the data-science course, we'll be able to show analysts- if you look at the problem this way, and you analyze it this way, you can do a bunch of really crazy cool stuff that you can't do right now, or that you're doing fairly poorly right now using a relational database using a system like that. I love relational databases, they are fantastic at what they do. I'm a huge fan of relational databases. I've just been working with these big data sets, this technology, for a long time, and I love it too. I love them for different reasons.
Audience Member: Would you say that Impala's like Google's #[1.14.37]
Josh Wills: No, I would not. In the same way as I think of Radiohead's Kid A album as sort of Radiohead's interpretation of electronic music (amazing I'm saying this with a totally straight face, like clearly ridiculous right?), I think of Impala as a cross between Dremel and a system called F1, which is where the guy who developed Impala – Marcel Kornacker – worked on when he was at Google, which is more of a traditional sequel transactional style store, which was built on top of some Google infra-structure called Spanner. And so I think of it really in the same way, and like Radiohead gave their interpretation of electronic music and it wasn't electronic music, it was Radiohead doing electronic music. Impala is not Dremel. It's Marcel Kornacker, if Marcel Kornacker had done Dremel and Hadoop. Does that make sense? There's elements of it, there are patterns that are similar, but I wouldn't say the two are the same.
Audience Member: Do you think relational databases are travelling towards death?
Josh Wills: I think they're going to split up. I'm predicting, and I think this is a Nate Silver kind of super-safe prognostication. If you look at what oracles do with x data, x3, things are going to fork into Hadoop storing massive quantities of data and your analytical database moving to a purely in-memory system, for your very fast very interactive analytics. And this middle stage where we have big slow discs on the system, it's just not going to make sense any more. The economics aren't going to be there. The analytical RDB masses will run upmarket to the superfast memory stuff, and Hadoop will take the big data reservoir stuff. I think that's a pretty safe bet.
Audience Member speaks
Josh Wills: Something like that. Like as an open-source thing. Actually one of our interns Andrew Ferguson who's a CS PHD in Brown is working on that stuff. It's part of his grad programme. I don't know if he's doing it like - Andrew's a very good programmer, a very smart kid, I don't if anyone's doing it as an open-source project that's designed to run places. Does that make sense?