Transcript: Stop Hiring DevOps Experts and Start Growing Them
Jez Humble: Thanks for having me, it’s a pleasure to be here. I brought all my clothes to New York, and I’m wearing them all because it’s quite cold. Although, I hear it’s actually warmer than it was last week.
So, I don’t actually talk about anything really technical to do with deployment pipelines or anything like that. So, it’s kind of a bit odd, but this was the talk these guys decided they wanted to hear. We can talk about that stuff as well, but what I found, basically, is that the technical stuff with continuous delivery isn’t actually the hard bit. The tools exist, the practices exist, the book’s been out a few years, people get the idea. The hard bit is working in an organization where people don’t understand the mindset change that you need to do it.
And so, you know, especially—who works in an organization of more than 1,000 people? Okay, a few of you. So, I go to these organizations a lot, they basically hire me to come and rant at them, and I always ask about continuous integration. That’s the first question I ask: are you doing continuous integration? And then a bunch of people foolishly put their hands up and say, yes, we’re doing continuous integration. So let’s try this exercise right now, because you can use this at home as well, it’s great.
Who’s doing continuous integration? Put your hands up, keep your hands up. Okay, put your hands down unless all the engineers on your team are checking into trunk at least once a day. If that’s not true, if they’re checking into feature branches instead of into trunk, put your hands down. Okay, which is what normally happens. And then, if every check-in doesn’t result in the build and tests being run, put your hands down, otherwise keep them up. Okay, good. And then, if you don’t get the build fixed within ten minutes every time it breaks, put your hands down, otherwise keep them up. Okay, so, there’s two of you doing continuous integration. Congratulations, that’s really cool. Sorry, four of you.
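The three criteria in that exercise can be sketched as a check over commit and build metadata. This is purely illustrative: the function name and data shapes are invented for the example, not from any real CI tool.

```python
from datetime import datetime, timedelta

def is_doing_ci(commits, builds, engineers, now):
    """Audit the three continuous-integration criteria from the talk.

    commits: list of {"id", "author", "branch", "time"} dicts.
    builds:  list of {"commit", "ran_tests", "broken_minutes"} dicts.
    engineers: set of everyone on the team.
    """
    day_ago = now - timedelta(days=1)
    # 1. Every engineer checked into trunk at least once in the last day.
    trunk_authors = {c["author"] for c in commits
                     if c["branch"] == "trunk" and c["time"] >= day_ago}
    if not set(engineers) <= trunk_authors:
        return False
    # 2. Every check-in triggered a build that ran the tests.
    tested = {b["commit"] for b in builds if b["ran_tests"]}
    if any(c["id"] not in tested for c in commits):
        return False
    # 3. Broken builds were fixed within ten minutes.
    if any(b["broken_minutes"] > 10 for b in builds):
        return False
    return True
```

The point of writing it down like this is that each criterion is a hard predicate: miss any one and, by this definition, you are not doing continuous integration.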
It’s really hard. My favorite paper on continuous integration is by James Shore, called Continuous Integration on a Dollar a Day: using an old workstation, a rubber chicken and a bell, and with no tools at all. And it’s really important: continuous integration is not about tools. Lots of people who say they’re doing continuous integration are in fact running Jenkins against their feature branches, which has nothing to do with continuous integration.
It’s a mindset; you don’t need tools at all. You know, the rubber chicken and bell don’t scale, but it’s about the mindset, and this is true of most of the stuff in continuous delivery. It’s not about the tools, it’s not about the technical stuff, it’s about getting into the mindset of focusing on the right things. And that’s also true of innovation, and I’ll come to that in a bit.
So, my way into this topic was getting very cross because someone senior at a large company, who you’ve all heard of, DM’d me on Twitter and said, we want to hire the next John Allspaw, do you know the next John Allspaw? And my response was, no, I don’t know the next John Allspaw. And secondly, the fact that you’re asking me to recommend someone, and the fact that you’re trying to hire someone for this job, tells me there’s something very badly wrong with your company. Because the next John Allspaw should be coming from within your company, and the fact that that’s not happening says there’s something wrong with your company. I actually replied in a slightly less aggressive way, suggesting that perhaps they should try and mentor someone from within, but it was very disturbing to me. And, unfortunately, this is a symptom of something which is just everywhere.
Who has DevOps on their LinkedIn profile? Okay, a few of you. Who gets spam on LinkedIn asking for DevOps people? That’s fucking everyone, right? And it’s brutal. So, there’s this great blog post by a recruiting company, which is nearly two years old now, so there are a few things we’ll laugh at as we read it because it’s so out of date. “DevOps is the latest big trend in hiring,” well, that’s still true. “Everyone is talking about DevOps, and everyone wants them. So, what the hell are DevOps, why do you need them, where can you find and hire them, and what should you be looking for? The dirty secret. Here’s the confusing part: There’s no such thing as a DevOp. You’ll have a damn hard time finding anyone whose business card just says ‘DevOp’,” no longer true. “What you’re actually looking for is a Dev with some experience and knowledge as a system administrator, or possibly a system administrator with some experience and knowledge as a programmer.”
So, who is as comfortable writing Bash scripts as they are writing nice object-oriented Ruby or nice functional Clojure? Okay, about half of you. That’s a pretty amazing statistic. It is very rare—I mean, five years ago, if you’d asked that question to a general meet-up kind of audience, hardly anyone would have put their hand up. So, there’s more of us now than there were. And that’s symptomatic of something which is just generally true, which is that the game has changed. Things are going faster, people are innovating faster, the rate at which we’re developing new products is faster.
There’s this word disruption, which is kind of horrible and annoying, but it does represent the reality that large companies are having traditional businesses disrupted all the time. And this is something that Andrew Shafer says. He has a talk called There’s No Talent Shortage, which he gave at Velocity New York, I think, recently, which is a great talk. I kind of stole this slide from him and then added my conference, FlowCon, for a little subliminal advertising. But you know, all these various things like AWS, DevOps Days, all these books and conferences and so forth: the reason they’re popular and in demand is because companies are struggling with a business problem, which is that they can’t go fast enough and they can’t innovate fast enough.
So, the game has changed. And I think most people in this room will be familiar with one of the key elements of that, which is DevOps, and in particular the talk that Paul Hammond and John Allspaw gave at Velocity Mountain View in 2009, 10+ Deploys per Day: Dev and Ops Cooperation at Flickr. That talk was basically the start of the continuous deployment movement.
This, for me, is what captures DevOps. You’ve got two types of people. One of them is a little bit weird, sits close to the box, thinks too hard: that’s your Dev. One of them pulls levers and turns knobs, is easily excited, yells a lot in emergencies: that’s your Ops. And the key thing about this was not that we fused them into some kind of hybrid Frankenstein DevOps person; it’s that we acknowledged that they have different mindsets and skills, but they actually worked together. That was what enabled them to move really fast at Flickr, achieve high levels of resilience, innovate, and then ultimately get acquired by Yahoo! and have all the senior leadership leave and join Etsy.
So, my favorite slide about DevOps and what DevOps means is from a guy called John Vincent. John Vincent is a sysadmin and I think after a particularly bad deployment, he was on a flight and he drank a little bit too much vodka and wrote a blog entry which looks a bit like this:
I’ll tell you exactly what DevOps means. DevOps means caring about your job enough to not pass the buck. DevOps means caring about your job enough to want to learn all the parts and not just your little world. Developers need to understand the infrastructure, operations people need to understand code, people need to actually work with each other and not just occupy space next to each other.
And that really is what it’s about. Have you seen the slides by Reed Hastings on Netflix culture? Yeah, so anyone who hasn’t should go and see them, because they’re brilliant. There’s this whole thing about how the job is to make sure that everyone is aligned with the overall business goals of the organization, and that what’s in the interest of the organization is also in your interest as a person working for it: the idea of being highly aligned but loosely coupled. And this is a key part of it. We all need to understand a little bit about everything, and we all need to care about it. And the caring is the important part. We have to care about what our customers want, and in doing that, we also have to care about what everybody else wants, and we have to care about getting better all the time so we can better serve our customers and each other.
So, going back to these adverts for DevOps people: what they really want is a culture change. They want to create a culture that looks like this, where we care about each other and we care about our customers, and that’s more important than anything else, in particular more important than whatever our organizational structure and reporting structure happens to be. And the problem is, you can’t hire in culture change. It doesn’t work.
What happens if you take a broken culture and you hire people into the broken culture who are great? What happens is it breaks the people. It doesn’t fix the culture, it breaks the people. If you put people into a system that’s broken, what happens is those people become frustrated because they can’t change things. I mean, culture is hard to change on purpose.
There’s a reason why culture is hard to change: culture is the thing that made the company grow. It’s the fixed attitudes that worked well enough to get the company where it is today, and that makes them hard to change. Hiring can be part of cultural change, but it’s not enough; you also have to have leaders. And I got called out for this. There’s a guy called Pedro Canahuati, who’s head of Ops at Facebook, and he saw this slide (it used to say, you can’t hire in cultural change), and he said, well actually, we changed things at Facebook, and hiring was a big part of it.
There’s this phrase about unicorns and horses. Unicorns are the organizations like Facebook and Twitter and Google, where everyone eats fabulous food and wanders around on clouds and everything’s fabulous. And then there are the organizations the rest of us work in, which are horrible and miserable and you can’t stand it: those are the horses, right? So, unicorns and horses. The dirty secret is that there are no unicorns, and that places like Facebook and Google have all kinds of horrible problems, just the same as everybody else.
So, when Pedro Canahuati was hired into Facebook as head of Ops, it took them six weeks to provision a new rack of servers into a data center. Which is not good, frankly. And his job was to change that and fix it. One of the first things he realized was that the people working for him were doing reactive stuff all the time. They were just panicking and fixing stuff and logging into boxes and running commands, all the time. And what he realized is they had to stop doing that, but also that he had to bring in new people, so that he could change the culture with the new people while keeping the old people doing what they were doing, because otherwise the website would go down all the time, and that would be bad.
And so, by hiring a bunch of people who were really into automation and saying, this is what we’re going to do now and this is going to be the new culture at Facebook, that was a key part of changing the culture at Facebook. And they got it down from six weeks to six days, and now it takes six hours to provision new racks into the data center at Facebook.
What they do now at Facebook, apparently, is wheel a bunch of racks into the data center, plug them in, and then they don’t have to touch anything. There are systems that automatically detect the new network devices, get their IDs, contact the hardware vendor, get the hardware profiles, automatically PXE boot them, do soak testing, provision everything else, and connect them up to the routers and have them serving transactions. They don’t even have to type a command; it all gets auto-detected.
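The shape of that hands-off flow is a staged pipeline that fails fast. Here is a minimal sketch of that idea; the stage names and data structures are invented for illustration and are not Facebook’s actual tooling.

```python
# Stages run strictly in order; a rack only starts serving traffic
# if every stage succeeds. Stage names are illustrative.
PIPELINE = [
    "detect_network_devices",
    "fetch_hardware_profile",   # from the vendor, keyed by device ID
    "pxe_boot",
    "soak_test",
    "provision",
    "connect_to_routers",
]

def provision_rack(rack, stages):
    """stages maps stage name -> callable(rack) -> bool.

    Returns the list of completed stages; stops at the first failure,
    leaving the rack out of service for a human to investigate.
    """
    completed = []
    for name in PIPELINE:
        if not stages[name](rack):
            break  # fail fast rather than serve traffic from a bad rack
        completed.append(name)
    rack["serving"] = len(completed) == len(PIPELINE)
    return completed
```

The design point is that no human types a command at any stage: each stage is a predicate the automation either passes or it doesn’t.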
So, a complete change to the way operations works. Hiring was part of it, but the key thing was they had support from the leadership for doing this. And the other thing they did, which was really important, was making sure that Devs carried pagers. Because it could no longer be the case that operations was solely responsible for fixing stuff (not for breaking stuff, that would be crazy). It became the job of the Devs to answer the pages and fix things.
And then the third thing he did, which was really important, was that they made Dev and Ops both report up to the VP of engineering, and they renamed operations to production engineering. The job of operations became to produce tools that were used by the developers, rather than to sweep up behind the developers doing horrible crappy stuff.
So, hiring can be part of cultural change, but it’s not enough. If you just hire in a bunch of people, what’ll happen is the organization will reject them, or they will reject the organization. When you see an ad for a DevOps person, what people really want is to create a culture of innovation, in order to better compete, better adapt to changing customer needs, and be more resilient to external events. So, what does that look like?
So, number one, a lot of the time we think our job is to build stuff. But that’s only one part of what we do. Our job is not just to build stuff, it’s also to create knowledge, to learn stuff. What we learn in the process of building is at least as important as what we build, and we should try to capture and transmit that knowledge and grow it within the organization.
The other key element of an innovation culture is trust. This is something you see at places like Netflix and Etsy that is really different, and it’s the thing that’s different about the unicorns: leaders trust the people who work for them, and the people trust the leadership. We basically tell people, this is roughly what we want to do, or we agree this is what we want to do, and we don’t tell people how to do it. They just go and work out how to do it, and we trust that they’ll do the right thing. So, trust is a key element of an innovation culture.
Third is experimentation and improvisation. The thing about innovation is, you cannot plan how you’re going to innovate. You can say we want something that looks like this, but you’re not going to know how to get there. Because if you knew how to get there, then by definition it wouldn’t be innovation. Innovation is something that you don’t already know how to do, and this is why creating a detailed plan of something you’re going to create is a stupid idea: if you could create a detailed plan, it wouldn’t be innovation. So, we have to have a culture where people experiment and improvise all the time, in order to work out how to get from A to B, right? Because, by definition, we don’t know how to get from A to B, because we’re innovating.
And then the fourth key element is that we make it safe to fail. Failure is normally treated as a bad thing in organizations. But actually, the only way to learn is to experiment, and sometimes fail. People talk about failed experiments; a failed experiment is an experiment where we didn’t gather statistically significant data that either proved or disproved our hypothesis.
Disproving our hypothesis is not failure; disproving a hypothesis is information. An experiment that doesn’t produce statistically significant data: that’s a failure. But that’s not normally how it works in many organizations. If you say something and it turns out not to be the case, that’s considered failure. That’s not actually true. What we have is not failure; what we have is information.
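To make “statistically significant” concrete, here is a stdlib-only sketch of a two-proportion z-test, a standard way to compare conversion rates between two variants of an experiment. This is one common choice of test, not the only one, and the function name is ours.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value comparing conversion rates conv_a/n_a and conv_b/n_b.

    A small p-value (say, below 0.05) means the data is informative
    either way; an experiment that cannot reach significance is the
    real failure, whichever direction the numbers point.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # For a standard normal, P(|Z| > z) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))
```

Under this framing, a treatment that significantly *underperforms* control is a successful experiment: the hypothesis was cheaply disproved.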
And then the other thing we need to think about: who works at a company where you have an annual review process? Okay, almost everyone, right? Part of that review process needs to be: how well did we do at cultivating knowledge, how well did we do at helping the other people on our team get better at what they do? For me, the most important thing I measure myself on at the end of the year is how well I did at helping other people learn. Not what did I do, but if I go and do something else, will I be completely replaceable? That’s really important.
Things like, how much did I blog, how much did I speak? This is something they do at Etsy and Netflix: they encourage people to speak, they encourage people to publish stuff on GitHub, to open source things, and to increase the amount of awesomeness and knowledge in the world. And they measure people on that, and they reward people on that. That’s really important. If how you reward people isn’t based, in part, on how they grew the people around them and how they created knowledge, then you’re not rewarding people for doing the most important thing.
In sum, in creating an innovation culture, there are a few things that are not very effective. Number one: training. There’s a great book that was written in the 70s in Brazil (I’ve just forgotten both the title and the author, which is embarrassing; I’ll come back to it later), and it basically talks about the bank account model of learning. This is what we find in schools today. The way we teach people in schools is, we sit them in a classroom and we talk at them, and we basically treat them as empty bank accounts to be filled with knowledge. And this guy in Brazil wrote this book about how this is actually a terrible idea, and how the reason we teach in that way is basically to maintain existing power structures and to prevent people from getting ideas about how they can change what’s around them; instead, we just fill them up so that they conform with the existing power structures. And that’s the dominant model of teaching, even today.
And companies are kind of the other side of that: we send people to school and fill them up with knowledge, and then we hire them and suck the knowledge out of them. People are basically hired by companies to be drained of knowledge in the service of the organization.
So, this bank account, transactional model of learning and of productivity is perverse in every way. And it’s terrible. And I do training, which makes me sad, because I don’t think it’s actually very effective at helping people learn. People need to be involved in how they learn, because that’s how you learn to question things, and that’s how you learn to change the system, and that’s how you learn to change culture and create a culture of innovation and improvement, not by sitting passively and listening to what someone says. So, with that, by the way, feel free to ask questions or push back; don’t just listen to me rant. I’ll rant all night quite happily, so interrupt me please.
Buying tools. This is a really horrible thing to do as well. Who has seen a continuous delivery program start with spending lots of money on tools? Let’s buy some tools! Anyone seen that? Yes. So, you are all very lucky, I must say. That’s always the first thing that happens: we’re going to do continuous delivery, we’ve bought a tool. And I’m like, oh shit, that’s a terrible idea, don’t do that. Some of the best things that I have done personally, helping organizations get better at this stuff, have been done just by using Bash, or whatever they happen to have to hand. And then the tools come afterwards.
I’m not saying tools are bad; tools are an important part of what you’re doing. But unless you’re using a really terrible tool, it’s not that important. Anyone using ClearCase? Okay, good. If you are using ClearCase, for God’s sake, stop and buy a tool, right? But if you’re using Subversion, well, it’s not Git, but Subversion is fine, really. It’s okay. Question?
Yeah, the question is, how do you know whether you should invest in tools? So, I’m writing a book, and in order to write a book, I have to read a bunch of books. And I read this book about Toyota which basically says that they were very big on building their own machines.
So, they would have a process and they would say, we need a machine to do this. And they would always try and build their own machines first, rather than buying something off the shelf. Because they considered that part of their capability to be adaptive and to differentiate themselves from other people was that they would build their own physical machines to do things on the production line. Only if something was obviously a utility would they just buy it. Any time they wanted to do something new, they would first try and build it themselves.
And I think if you look at companies like Netflix and Etsy, they’ve built their own tools. The reason they’ve done that is because you learn a bunch of stuff when you build the tools, and that’s actually important to your ability to keep getting better at this stuff. I mean, Etsy built their own A/B testing tool and A/B testing framework, and that helped them learn how A/B testing works and how to do experimental design, and that’s really important.
This is the biggest gap in product development. I used to be a product owner, and I was a terrible product owner, because in Scrum, the teaching for the product owner is: you’re in charge of prioritization, by coming up with requirements and prioritizing them. I was brilliant at that. I came up with loads of requirements, and I prioritized them. I did that all the time, and it was fabulous. But that’s got nothing to do with creating a great product.
And I tell people: learn how to do experimental design for product development. The way to do product development is to take a scientific approach to it. You have a hypothesis about something you think your customers will find valuable, and then you create an experiment, a minimum viable product or a split test, to test the hypothesis, and then you get data. And nobody knows how to do this. People are like, how do you design the experiment, how do you do something that’s not building out the whole feature? That’s the biggest gap in the knowledge of product owners today: how to do experimental design for features and products.
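The split-test half of that loop needs one mechanical piece: assigning each customer to a variant, deterministically, so the same person always sees the same experience. A common way to do that is to hash the user id together with the experiment name; this sketch is one such scheme, not any particular vendor’s, and all the names are illustrative.

```python
import hashlib

def assign_bucket(user_id, experiment, variants=("control", "treatment")):
    """Stable split-test assignment: hash (experiment, user) to a variant.

    The same user always lands in the same variant for a given
    experiment, with no assignment table to store, and different
    experiments bucket independently because the name is in the hash.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return variants[digest[0] % len(variants)]
```

With stable bucketing in place, the experiment itself reduces to: ship the minimum viable version to the treatment bucket, count the outcome you hypothesized about in both buckets, and test the difference for significance.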
And the way that Etsy learned how to do that was by building their own tools for doing it. By building the tools themselves, they learned how to do experimental design, and that’s a key competitive advantage for them. Because they’re not just coming up with a product by pulling a bunch of stuff out of their ass, which is what I did, saying, I think the customers will want this, go and build this. And then you build it and people don’t want it and you’re all, oh no, now I’m out of money.
Making the tools yourself is really important, unless it’s something like version control. I don’t recommend building version control, because it’s a commodity, it’s a utility. But any time you don’t quite know how to do something, a good way to start is by building your own tool. The big enterprises who have all their data in big CRM systems are really struggling right now with things like A/B testing and data mining, because they have to connect it to tools that are not designed for it. It would have been much better if they had just thrown it away and built their own tools.
So, that’s a long answer. The short answer is, if it’s a utility, buy the tool. And a good sign that something’s a utility is that you won’t need to buy the tool, because there’ll be something open source. So, actually, the answer is never buy a tool. The answer is: if it’s open source, use the open source. If it’s not open source, there’s probably a good reason for that, which is that it’s not a utility, and you should build it yourself to gain the knowledge of how to do it.
Sorry it took me so long to get to that. So, anyone who’s seen my blog will know that I’m not big on DevOps teams. It’s a really horrible thing: we have a Dev team and we have an Ops team, and they don’t talk to each other, so we’re now going to have a third team called the DevOps team to not talk to the other two teams. Somewhat ironic to solve a silo problem by creating an additional silo. And then, obviously, hiring people doesn’t work, at least on its own. So, I had a beer, it’s there.
So, in terms of creating a learning organization: who’s read Nassim Taleb’s book Antifragile? Okay. It’s a really interesting book. Nassim Taleb is a fabulously irritating writer, just wildly annoying to read, but also extremely brilliant. He’s the guy who wrote The Black Swan, and then he wrote this book Antifragile.
And his point is actually a very simple concept, and he gives you lots of examples. We have things that are fragile, things that break easily, like the iPhone: you drop it one centimeter onto a small velvet blanket and this happens. And people think that the opposite of fragile is resilient. If we apply stress or volatility to an object and it breaks, it’s fragile. If we apply stress or volatility to a thing and it’s not affected, that means it’s resilient.
And Taleb’s point is that the opposite of fragile is not resilient. The opposite of fragile is antifragile. An antifragile thing, when you apply stress or volatility to it, gets stronger. I apologize for this, it’s kind of gross, but I’m from California, so I’ve learned to embrace my new homeland. The crucial thing is, I know this is a meeting of engineers, but if you go to the gym, the reason you go to the gym is to make yourself stronger, right? You apply stress to yourself and it makes you stronger. You’re not unaffected by it, at least if you’re doing it right, and it doesn’t break you, again, if you’re doing it right. It makes you stronger if you do it enough.
And so, this is a key characteristic of organizations that learn: when things change, they adapt. This is something else that Reed Hastings talks about in the Netflix deck. As organizations grow, they become more complex, talent density tends to go down, and you start relying on processes in order to scale the organization. But the problem is, it’s really hard to change processes, at least at the whole value-stream, systemic level. And then what happens is, the environment changes and it’s really hard for you to change your organization. You’re steering the Titanic.
What Reed Hastings wanted to do was create a large organization that was nevertheless able to adapt rapidly to changes in its environment. And that’s what antifragile is: when the environment applies stress to the organization, the organization actually gets stronger. This is a well-known concept in literature; Nietzsche wrote, “From life’s school of war: what doesn’t kill me makes me stronger.” But it actually applies at the systems and architectural level as well.
So, periodically, John Allspaw emails me PDFs from the master’s course he’s taking in system safety, and he’s like, read this immediately. And one of the things he sent me was this paper by a guy called Westrum, a sociology professor.
He was investigating health care organizations, looking at how different kinds of organizations deal with information, and how their culture processes it. And he categorized them into three different types: pathological, bureaucratic, or generative.
And this really hit home for me. In pathological organizations, information is hidden, people guard knowledge, messengers are shot, and people shirk their responsibilities, or they try to avoid having responsibility. That’s what’s called a cover-your-ass culture: making sure you check the boxes so you can’t be held responsible if something goes wrong is more important than making sure things actually work properly, which is what I call risk-management theatre. Bridging, in other words helping with other people’s stuff, is discouraged; failure is covered up; new ideas are crushed.
Then there are bureaucratic cultures, in which information is ignored, messengers are tolerated, and so on. And then generative cultures, where we actively seek out information, we train people to be able to exchange information, people share responsibilities, we’re rewarded for helping each other out, failure leads to inquiry, and new ideas are welcomed. And this, for me, really distinguishes most of the companies I go and see during my work from the kind of companies that are really good at adapting, becoming antifragile, and innovating. So, I found that really great.
And I think what’s important about a high-trust culture is understanding that any time you have a large number of humans together building a system, what you have is a complex adaptive system. So, who’s heard of Cynefin, this thing here? A few of you. This is the latest big trendy thing. It’s by a guy called Dave Snowden, and he’s Welsh, which is why you don’t call it “cy-nif-in,” you call it “kah-na-vin.” He has a special name for this, because it’s not a categorization framework, it’s a—
Audience: A sense making system.
Jez: A sense-making system, there you go. And so, what this is about is how you deal with change in a system. Simple systems are ones where a particular cause always has a particular effect. If I say, you do this, and you do that, then we can predict with 100% certainty that this other thing will come to pass as a result. So in simple systems there’s a concept of best practice: you tell me a problem, I tell you, go and do this, and that’s going to work. These sometimes exist if we’re really good at setting context, but not very often.
Complicated systems are ones in which there’s not a single cause-and-effect chain; there are multiple possible options, but we can still create causal networks. We can still say that if I see this, it’s because this or this happened. So root cause analysis is still useful, but I can’t say, if you do this, this will happen; I can say, if you do this, one of these things will happen. So there’s this idea of good practice in these kinds of systems.
My PC keeps beeping at me and I don’t know why. My computer is a complex system. I’m going to try plugging it in.
And then we have complex systems. Complex systems are ones in which you cannot predict the behavior of the system by analyzing the behavior of its components. Doing that is called reductionism: the idea that we can take a system, decompose it into a bunch of components, and by understanding how the components work, understand how the system works. That’s not true of complex systems; that’s what a complex system is, one in which reductionism doesn’t apply. In a complex system, given a particular action, I have no idea what the fuck’s going to happen. That’s basically what a complex system is.
In complex systems, because we can’t predict what the effects of our actions are, the idea of good practice or best practice doesn’t apply. If I take a practice that I’ve observed in one complex system and try and transplant it into another complex system, a different thing is going to happen. That’s why, when people adopt methodologies, it doesn’t work a lot of the time. Methodologies, XP, Scrum or whatever, are best understood as a post-hoc rationalization of something that happened to work in a particular context.
So, if you take something that worked in a particular context, in a complex system and transplant it into another complex system, the one thing I can tell you is something different will happen as a result of that. So, people who adopt methodologies, it never works. Because, you’re in a complex system, complex systems are very sensitive to context and initial conditions, and when you try something out, something that you don’t expect will happen. There will be unexpected side effects, which might actually be the real effect, and you won’t know which is which. Yes?
So, the question is, in Ops there’s this concept of a playbook and this is basically, if this happens, here’s my list of things that I should do. So, if we’re in complex systems, then, presumably, a playbook should be of no value, but yet they are, so what does that say? So, I’m going to talk about that later on. My initial response to that is, within complex systems, you get contexts in which these other domains apply.
And, this is what we try and do when we build systems. We try and build systems which are embedded within complex systems, but behave in simple ways that are predictable. And, I think what a playbook does is, a playbook is a recognition that in this particular domain, much of the time when we follow this, these things will happen.
But, I think there’s two kinds of playbooks. There’s playbooks which are like, if I see this, it’s probably this. But there’s one thing we all know as Ops people, which is that a proportion of the time, it is not that. And a good playbook will allow you to fail fast: when it says, if you see this, do this, one of the first steps is going to be, check these other things to find out whether it really is that, right? So, there’s this bit where we actually go out and look and see if that thing really is the case, before we carry on with the playbook. So, part of a playbook is seeking more information and failing fast if the thing we think is true is not true.
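To make that concrete, here’s a minimal sketch of that idea in code. All the names here (`Step`, `run_playbook`, the cache example) are my own illustration, not any real incident-response tool: each playbook step verifies its hypothesis first, and if the situation isn’t what we assumed, we stop following the script and escalate to a human.

```python
# A toy playbook runner: verify the hypothesis before acting, fail fast otherwise.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    hypothesis: str                 # what we think is wrong
    verify: Callable[[], bool]      # go and look: is it really that?
    act: Callable[[], None]         # the remediation to run if it is

def run_playbook(steps: List[Step]) -> str:
    for step in steps:
        if not step.verify():
            # Fail fast: the world doesn't match the playbook's assumption.
            return f"escalate: not '{step.hypothesis}'"
        step.act()
    return "resolved"

# Example: "site is slow" -> we suspect the cache is down. But the cache
# is actually up, so verification fails and we escalate instead of acting.
cache_up = True
steps = [Step("cache is down",
              verify=lambda: not cache_up,
              act=lambda: print("restarting cache"))]
print(run_playbook(steps))
```

The point is just that the check and the action are separate: the playbook encodes “seek more information first,” which is what makes it usable inside a complex system.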
So, that’s kind of my high level answer, does that seem reasonable? We always try and simplify. We always try to create simple structures within complex systems, otherwise we can’t do anything. But, we have to recognize that, when you put human beings into the picture, you create complexity. Human beings plus nice simple thing, creates complex systems. And, part of that is, I follow the playbook, I do something wrong, something unexpected happens.
And, I didn’t know that I was doing something wrong, I just interpreted a particular instruction in a way that the person who wrote it didn’t mean it to be interpreted. So, I didn’t do something wrong, I interpreted it in a different way. Was that wrong? Again, this is an element of complexity is that everything is open to interpretation and we can’t completely specify exactly the meaning and now we’re into postmodernism.
So, one of the things that really struck me early on in my career was this thing called the retrospective prime directive. Who knows the retrospective prime directive? Okay, for those of you who don’t, it’s a really important thing to learn. The retrospective prime directive is something that, whenever you have a retrospective or a post-mortem, you should actually read out at the beginning. And everyone should read it out, a bit like sitting in an exit row on a plane and then saying, yes.
Regardless of what we discover, we understand and truly believe that everyone did the best job that they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand. The retrospective prime directive is basically a statement of the fact that, we’re in a complex system and that we could not have predicted the effects of our actions, because we cannot know the complete initial state of where we were at the time we did things.
So, this is saying, we’re in a complex system, and we understand what that means. And, what that means is, we can’t be blaming people as a result of things going wrong. Because, in a complex system, complex systems drift into failure. Failure is a natural state of a complex system.
In a complex system, you don’t always know that you’re in a failure state. I mean, Air France 447 is an example of that. They didn’t know what was wrong, they just knew that things were beeping at them and making funny noises. And they didn’t know what the fuck was going on. They didn’t really understand why things were beeping at them, because as far as they could see, everything looked fine. And, that’s what complex systems are like. Often, we don’t know that there’s trouble. If you find out your site’s down because of Twitter, not because of your monitoring, that’s an example of being in a complex system.
So, there’s a really nice talk that Dan Milstein did at Velocity this year, where he talked about how to run a post-mortem with humans, not robots. And his key point in this talk is, if you have a post-mortem, if something goes wrong at your company and the response to something going wrong is to find the person who’s responsible, then you’ve already lost.
I mean, what you’ve said is, I don’t understand that I’m in a complex system, I don’t understand that, actually, this is the case. And that, it’s not somebody’s fault, it’s interactions between the components of the system that we couldn’t have predicted, not somebody doing something wrong.
And, in particular, if, when we find the person whose fault it was, we say, don’t do that again, what we’re doing is called hindsight bias. That’s assuming that that person could have known. And, in particular, whenever you see that—and we’ve all seen that—ask yourself, if I had been in that person’s position, is it possible I could have done the same thing that they did? If that had been me, could I have done that? Often, if we’re truthful, the answer is, yes, that could have been me, and I could have done that.
And so, Dan Milstein has this great quote which I really like: “Let’s plan for a future where we’re all as stupid as we are today.” If your answer to something going wrong is, don’t be so stupid next time, I don’t know, man. I’m probably not going to be that much less stupid in a year’s time than I am today. This is what we tell our kids. If I’m in a particularly foul mood and I haven’t had enough sleep and I’m like, don’t do that again, to my kid, my kid’s like, Dad, I’m going to keep making that mistake. So being a father teaches me a lot of things that I’ve then used in consultancy. It’s like, ah, I’ve been past this situation with my two-year-old child. I use my in-my-head voice when I say that, obviously. So, what I’m trying to say here is, failure is an opportunity for learning, and it’s a natural state of complex systems, so we can’t avoid failure, and we can’t blame people when our system goes into a state where things are failing. So, if that gives us information, if that’s a good thing, maybe we should do it more often? Maybe we should actually create failure in our systems? So, who’s heard of Game Days, and is familiar with Game Days? Okay, many of you. So, for those of you who don’t know, this is something that Amazon pioneered.
Jesse Robbins, who had the title of Master of Disaster at Amazon.com, is a volunteer firefighter in his spare time, and, obviously, when you’re a firefighter, one of the things you do is go and help people run fire drills to practice what happens when a fire breaks out. And so, he was like, well, why don’t we do this for our disaster recovery process as well? Having a back-up process and not practicing the restore is useless. Having a disaster recovery process and not testing the disaster recovery process? Useless. So, he basically said, in about six months from now, I’m going to walk into a data centre and turn it off, and everybody has to maintain quality of service. And everybody tried to shut him down, because what happens is everyone’s like, oh, that’s a great idea, that’s a terrible idea, and they try and have the whole thing shut down. But, fortunately, he had the support of the senior leadership, so that didn’t happen, and he actually went into a data centre and turned it off, and they were able to maintain quality of service, and they do that on a regular basis at Amazon.
They have something called disaster recovery testing at Google. They actually have a team; there are some blog posts on ACM Queue by Kripa Krishnan, who runs that team. It’s a team whose job it is to put together a disaster scenario every year at Google. So, one year, it was aliens invading Mountain View, which some would argue has already happened. One year, it was an earthquake in Silicon Valley, and they actually turned off the connection between the Mountain View campus and the rest of Google as part of that disaster exercise. And what happened was a bunch of systems were supposed to fail over to the desktops on people’s desks at Mountain View, except they didn’t fail over because the connection was cut, and it didn’t work.
So they found all these things out as a result of doing that, and, again, it was reversible, so, if it caused anything really bad to happen, they could turn it back on. Another funny thing that happened: there’s a procedure for when you run out of diesel in a data centre and you need to buy diesel, because, otherwise, the data centre will stop working, since the power’s off and you’re running out of diesel. And one of the data centres was running out of diesel, and the power was off, and they didn’t know about this rule, so somebody bought about $50,000 worth of diesel on their credit card. And one of the things they did afterwards was to say, did you know about this procedure? It allows you to authorize emergency purchases.
So, Game Days are basically creating failure in a controlled situation so that we can practice that and get used to it. And, again, the output of that is going to be playbooks. In this situation, here's some things that we've seen, this is what to look for, this is standard operating procedure in this situation. And, obviously, that's good because that gets us to practice it, and one of the outputs of Game Days is that people actually know who to call and know who to talk to and develop relationships with people in other departments so they can work together in a situation like that. And also, they start developing some level of muscle memory, so you don't instantly panic. Instead, you say, okay, well, now I know to do this, this, and this. And that's a really important part of creating culture is to turn conscious actions into unconscious actions that are routinized.
It's this thing about practising something a number of times. It becomes subconscious. It just becomes part of your standard operating procedure for dealing with a particular set of initial conditions. Yes?
Audience: I guess—how often should you have Game Days? I mean, I understand that our needs are very different. I guess, how do you know whether to schedule them at the right frequency needed to…
Jez: So the question is how often do you do Game Days? Well, I think Google's still doing them once a year. I don't know how often Amazon do them. I think, in 99% of cases, the answer to that is at least once because most companies I know do not test their disaster recovery process. I mean, who works in a company where you've actually injected failure into production in order to test your disaster recovery process? Okay, one person in the room, apart from just as a result of screwing things up. So, yeah, for most organisations, the answer is do it once. Most people aren't doing it at all. Do it once, and then do it as much as you can tolerate it, but, if you take this to its logical conclusion, we end up with the idea that we should actually be injecting failure into our systems all the time. That should be part of normal operation, which is exactly what Netflix do with the Simian Army. They're always injecting failure into production, so who's familiar with the Simian Army? Okay, many of you.
So, Simian Army basically started with this tool called Chaos Monkey. Chaos Monkey would randomly shut down boxes in production, and you give it, basically, I think, an IP range, and it would just log into them and shut them down on a schedule randomly, and what it was testing is making sure that they could detect that and bring those systems back up again. And also, if you kill something and nothing happens, that means you can probably leave that thing dead. We've simplified our system.
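The core of what Chaos Monkey does can be sketched in a few lines. This is not Netflix’s actual implementation, just a toy illustration of the idea described above: on a schedule, pick a random instance from a group and terminate it, then check whether the system detects it and recovers (or whether nothing happens at all, in which case that instance was probably dead weight).

```python
# A toy chaos-monkey: kill one randomly chosen instance from a group.
import random

def chaos_monkey(instances, terminate, rng=random):
    """Pick a random instance and terminate it; return the victim."""
    if not instances:
        return None
    victim = rng.choice(instances)
    terminate(victim)   # in real life: a cloud API call to shut the box down
    return victim

# Usage: record which instance got killed so monitoring can be checked.
killed = []
victim = chaos_monkey(["10.0.0.1", "10.0.0.2", "10.0.0.3"], killed.append)
```

The interesting part isn’t the code, of course; it’s that this runs routinely against production, so recovery becomes something the system and the humans practice constantly.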
And now, there's a whole range. This is all on GitHub. There's a whole range of different monkeys. There's Security Monkey that checks security policies for your EC2 boxes on Amazon. There's a whole bunch of different monkeys that inject various types of failure into the organization. But this is the logical conclusion of this. It's like, okay, we're not going to do it every now and again. We're going to do it all the time. We're constantly going to apply stress and volatility to our system—system in terms of computers plus human beings, because there's a human element to this. We're applying volatility and stress to our systems, and our human-plus-computer organization is becoming stronger as a result. So this is how you create anti-fragility in systems engineering, in operations: by continuously applying stress, your system becomes stronger—and simpler, by the way, because creating simplicity and reducing complexity is how you are able to reduce time to restore service, and that's an example of creating pockets of simplicity within a complex system, so that things like playbooks can work.
So, how do we get from A to B? I mean, we have organizations, and they're not like this. Organizational change is not like changing a light bulb. If only it were. The hardest thing you can possibly do within an organization is change the way that people behave.
So, one of my favorite books to come out in the last couple of years is this book, Toyota Kata. Anyone read Toyota Kata? It's a fabulous book. It's written by this guy, Mike Rother, and Mike Rother is a professor at, I think, the University of Michigan, and he, for decades, has studied Toyota. His job is to go and study Toyota, write papers about what Toyota did, and then other companies would hire him to tell them what Toyota was doing so that they could copy it. And then, what would happen is—yeah?
Audience: They wouldn't have to read the book.
Audience: So they wouldn't have to read these papers.
Jez: Right. So they wouldn't have to read the papers, because no one wants to read things that are boring. Exactly. So, he noticed something very worrying, which was that he would go to a factory, see what they did, write it all up, someone would hire him, he'd say, go and do this, this is what Toyota's doing, and then he'd go back to the Toyota factory a few years later, and they weren't doing those things any more. And that was a problem because he'd told everyone that's the way of doing it, and that wasn't true any more. I mean, then he has to write more papers, which, in the context of the academy, is actually a good thing, right? Because then you can get more points and get tenure. But, actually, in real life, it's not a good thing because it means you've fucked it up.
And so, he was upset by this. He had told these people to do these things, and Toyota wasn't doing them any more, and he came to the realization that the things he saw Toyota doing were not best practices or good practices. What they were was countermeasures to a particular situation that existed at a particular time. All the things that Toyota were doing, the practices and the boards and the this and that, were countermeasures.
So, here's the thing we're going to do in response to this thing to see if we can fix it, and you got a snapshot of the countermeasures at a particular time, and he realized that what was important was not the particular countermeasures in place at that time in Toyota, it was the mentality of the people who were able to see a situation, experiment with countermeasures, measure if the countermeasures actually improved the thing they were tracking, and then keep on doing that forever, which is what continuous improvement is, which is what enables you to become adaptive to your changing environment.
So, what he realized was that's the thing that's important: the things that enable people to experiment and try countermeasures and get better at what they're doing, not the particular countermeasures themselves. And so, he went to research how you learn to be a manager at Toyota, and what it means to be a manager at Toyota, and the results of his research are this book, Toyota Kata. And it turns out, to become a manager at Toyota, first of all, you have to have done the work on the shop floor for a certain length of time, a certain number of years, before you become a manager.
And then, about half of the job of the manager is helping the people under them do continuous improvement. So, you don't—there's no training course that you go on to be a manager that teaches you this stuff. You learn to do it by having done it and understanding how to do it. And what he—I mean, he created a framework out of what he observed, and it's called the Improvement Kata. And the Improvement Kata is how you innovate in conditions of uncertainty. We are at A, we want to get to B, we don't know how to get to B, so we can't plan how we're going to get to B, so what do we do?
And this is the heuristic that they use. First of all, you understand the direction. So, this is something you want to do in about two years, where you want to be in about two years' time, that kind of horizon. And then, you grasp the current condition, and the way he suggests doing that is creating a value-stream map of your organization and looking at the value streams. So, the value stream from concept to cash, from idea to measurable customer outcome, let's say. And you see how that process looks, and then, what you do is you have milestones of about a month, and, every month, you grasp the current condition and you establish the next target condition, how you want things to look about a month from now, and then, you don't plan how you're going to get from A to B.
Instead, what happens is, every day, people run experiments. PDCA is Plan, Do, Check, Act. It's called the Deming cycle. It's basically just the scientific method. You design an experiment to try and get towards the target condition, and you run the experiment. So, this is how people do process improvement at Toyota.
And this bit here, the bit that people do every day, there's five steps, and he actually has little cards that you can print out. What's the target condition? What's the actual condition now? What obstacles are preventing us from getting from A to B, and which one of those are you addressing now? What's your next step, the experiment you're going to run to try and move towards the target condition? And when can we go and see what we learned from taking that step, how soon can we measure and learn from running that experiment? And that's how they do process improvement at Toyota.
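The loop structure of those five questions can be sketched as code. This is just my own toy rendering of the experiment loop described above, not Rother's formal framework; the function names and parameters are illustrative assumptions.

```python
# A toy improvement-kata loop: repeat short PDCA experiments until the
# current condition reaches the target condition (or we give up).

def improvement_kata(target, current, next_experiment, run, learn, done,
                     max_iterations=100):
    for _ in range(max_iterations):
        if done(current, target):
            return current
        experiment = next_experiment(current, target)   # Plan: next step?
        result = run(experiment)                        # Do: run it
        current = learn(current, result)                # Check/Act: learn, update
    return current

# Usage: a contrived example where "builds per day" is the condition,
# and each experiment yields one more good build per day.
final = improvement_kata(
    target=10, current=2,
    next_experiment=lambda cur, tgt: 1,
    run=lambda experiment: experiment,
    learn=lambda cur, result: cur + result,
    done=lambda cur, tgt: cur >= tgt,
)
```

The key property is that there's no plan for the whole journey, only a target condition and a cadence of small experiments, each one checked quickly.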
Now, that sounds a bit abstract. One of the coolest things—the other really cool book I've read in the last few years is this book here, A Practical Approach to Large-Scale Agile Development. And this is the story of how the HP LaserJet Firmware Team, which is a 400-person team distributed across three countries, implemented continuous delivery. So, I'm going to talk about this at more length if people are interested later on, but, in the interest of stopping talking and giving people more time to ask questions, I'll just summarize it.
They basically, for this large distributed team, adopted continuous delivery three years before the continuous delivery book came out, so they independently invented and adopted continuous delivery, and they independently invented and adopted the Improvement Kata. So this team at HP invented both of these things themselves as a way of fixing their problem, which was that they were going too slowly. So their direction, the thing they wanted to do in about two years, was a ten-times productivity increase. They wanted an order-of-magnitude increase in their ability to deliver new versions of their firmware to their customer.
Well, they didn't know how to do it, and they decided to rebuild the system, which I don't normally recommend, it's normally a terrible idea, but, in fact, in this particular case, they succeeded. And the way they succeeded is by making sure that, number one, at the end of every month, they had something deliverable, an end-to-end feature that demonstrated some particular architectural approach, or some approach to delivering some high-value feature. They would make sure, at the end of every month, they had some working software that, end to end, delivered a particular feature or set of features. And, number two, that they didn't have detailed plans for innovating.
They don't know how they're going to rebuild the system. They don't know how they're going to achieve the ten-times productivity increase. All they know is they're going to have to invent it and make it up as they go along, and you can't plan it. So, what this is, basically, is these are sprint boundaries. At the sprint boundaries, they would grasp the current condition: where are we now? They would establish the target condition, what they wanted to be the case at the next sprint boundary, and they would not put together a plan as to how they would achieve that.
And, in terms of what that looks like, this was the program level. This is the program level sitting above four hundred engineers. So, they called their sprints mini-milestones because, again, they independently invented all this stuff. They didn't read any books. They didn't hire any consultants. They didn't buy any tools. They just worked it all out by themselves. So, at the beginning of every month, they set out mini-milestone objectives for the next month. So, their Rank 0 objective was a quality threshold: priority-one issues open less than a week, level-two test failures fixed within twenty-four hours. Then, their Rank 1 things they wanted to achieve were things like the quarterly release, all their final P1 change requests fixed, their reliability at or above release criteria.
So they had all these things, and these were agreed between a selection of the engineers on the team and their leadership, the people in charge of the program, and, every month, they'd achieve about 80% of those things. But they wouldn't plan out how they were going to do it. Instead, they just said to the teams, you go and work out how you're going to do this, and the teams would experiment with a bunch of things every day to try and get there.
And that's how, over two years, they ended up achieving a five-times productivity increase, which is not what they wanted, but still incredibly impressive. And they implemented this crazy deployment pipeline, which I can show you, which is totally nuts, that allowed them, with a ten-million-line code base and four hundred developers, to get ten to fifteen good builds a day, a hundred check-ins to trunk a day, about 100,000 lines of code changed every day in this ten-million-line code base, and feedback within twenty-four hours if there were any regressions whatsoever.
So, I mean, it was amazing, but they did it by doing this. They had no plan. They just, every month, would set target objectives, and this is why I have a problem with a lot of the agile frameworks right now, particularly things like the Scaled Agile Framework: they focus on what this guy, Gary Gruver, who wrote this book, calls creating agile teams within the enterprise, rather than creating an agile enterprise. So, they focus on, let's make all the teams do Scrum and be agile, and then we'll build up from there and put some stuff on top of that. And he says that's totally the wrong thing to do.
What he says instead is the teams within the firmware group, he didn't care what methodology they used. There was no prescription. They could do waterfall, they could do XP, they could do whatever they wanted. He didn't really care what methodology they used. What he cared about was that all the teams together achieved the target objective at the end of the month, and, because they had no idea how to do it, they would have to work out amongst themselves how they were going to do that.
There was this horrible thing—so, who's worked on a software project with more than fifty people on it? So, you must have seen this horrible thing where, basically, there's the big meeting at the beginning of the sprint, and then, we get all the requirements, and we put them through a sausage grinder and turn them into lots of little stories, and then we hand out the little stories to the teams, and then, at the end of the month, everyone finishes their little stories, and we try and fit it all together, and then it doesn't work, and then everyone's like, I did my bit, must have been them. And everyone else is like, no, we did our bit, it must have been them. And everyone's right because everyone did their bit, but that's not the point because none of it actually works together because we didn't tell everyone, oh, and by the way, it has to work together. We told everyone, do your little bit.
So that, as a kind of process for doing large-scale bits of work, is fundamentally broken. We shouldn't break the work into little bits and hand it out. We fail at the point that we do that because, again, we're innovating, and we can't plan innovation, because, if you can plan it, it's not innovation. So the idea that we can do this detailed planning and split things up and put it together at the end and it'll magically work is just bogus. Instead, the idea is, well, we don't know how we're going to do it. You work out how to do it, and it's the job of the teams to make sure that they do this, and so either everyone succeeds or nobody succeeds.
And the other psychological element of this is, if I execute a plan that somebody else made up, and the plan doesn't work, that's not my problem. If the plan that I put together, that I made up, doesn't work, that's my problem. So there's this thing about buy-in. If I've made the plan, I'm much more bought in to making sure that it works. We're actually harnessing people's innate creativity and passion to actually go and try and solve the problem themselves instead of treating them like cogs within a machine that we hand these things to that they execute. And it all comes back to this whole thing about bank accounts and the bank account model, that we suck knowledge out of people. It's part of the same mindset, that we treat people as cogs in the machine.
Has anyone come across Theory X and Theory Y? It's from this fifty-year-old management book by Douglas McGregor, where Theory X is where managers assume that people are inherently lazy and not interested in their work, and hence we have to use carrot-and-stick motivation to make them do things, whereas with Theory Y, we assume that people are inherently motivated and actually care about what they're doing, and so we're going to let them come up with their own solutions and solve their own problems. And this whole model of agile or waterfall, in which you break things up into little bits and get people to do them, is fundamentally Theory X. That's fundamentally, we can't trust you to come up with your own solutions to the problems, so we're going to tell you what to do. And that's what creates the behavior in organizations where people are checked out. People are checked out not because that's the natural state of people. Nobody's born checked out. Anyone who's played with little kids knows they're not born being like, oh, when can I go home, I just want my paycheck. No one's born like that. We have it knocked out of us by an education system and organizations that treat people like little cogs in the machine, there to execute the things they are told to do. So, the improvement kata, and this process that they talk about here, is basically saying, we're all going to agree that this is roughly where we want to be in a month, we've got some idea that this is what we're going to do, and we're all going to work out how to do it ourselves, and we're going to harness people's native innovation and creativity and passion to solve those problems. So that is really the mindset that ultimately we need to create, and we do it in the way we treat people, in the way that we manage work, in the way that we build our systems, in the way that we trust people, and in the way that we deal with failure.
So, I just want to conclude with something by Jesse Robbins. Jesse Robbins, “Master of Disaster” at Amazon, co-founded Opscode, which makes Chef, and is now doing a startup to do with monitoring whose name I can’t remember. So, he has a great saying. He says, “Don’t fight stupid. Make more awesome.” And that’s the one thing I want you all to take away from this. People wonder, how am I going to implement DevOps, how am I going to implement continuous delivery, how am I going to deal with this stuff? You can spend a lot of time arguing with people about how you’re going to do things, and coming up with plans and fighting over the plans, and getting nowhere. The way you implement this stuff is by coming in every day, looking around you, and thinking, what one thing can I do today to make life slightly more awesome for the people around me? And if everyone does that every day, that’s how we build adaptive, anti-fragile, brilliant cultures that we actually like working in, and that are capable of building great things. So, yeah, that’s it. Questions?
Thank you. And I should say, ironically, Thoughtworks is hiring. I have an eBook coming out. That’s my advert. Questions?
Audience: I have a thought. I wasn’t going to say anything, but the more I listened to you speak, the more I was like, I should pipe up and do a little shameless plug about where I work. I work at a coding education company. In your first slide, you were talking about managers, about departments needing to talk to one another. I work at Codec; we originally began as a client services firm, and we were providing technology solutions to AT&T, and Twitter, and all these great companies. And we began bringing on a lot of new people, teaching them coding, teaching them how to build these solutions. So then we became a coding education company. Now we run a three-month boot camp; we teach people iOS, Android, Google Glass, how to build apps. Within three months, you can release your own monetizing app, and you begin to learn web and mobile code languages. So, if anybody here is interested in talking to me about the course, you can give me your card before you go, you can sit in on the class a little bit, you can take a look, and it’s a really great place to expand your knowledge base. And, side note, look how many men are in the room, mostly a bunch of dudes. There are very few women here. We do have some special offers and scholarships for women, because our enrollment is usually only about a quarter women, and we’re really big on diversity in tech and creativity, people of all backgrounds. So, there you have a shameless plug.
Jez: Thanks for the shameless plug. I should say, by the way, that guy whose name I forgot: Paulo Freire. His book is called Pedagogy of the Oppressed. And actually, he had to leave Brazil, because his book was considered so revolutionary that the junta at the time weren’t very keen on his ideas. Other questions? Yes.
Audience: I’m kind of stuck on the never-buy-a-tool thing. So, Jesse Robbins, he created Chef, right? So, I guess most people use that, and presumably pay for support, and that’s how they make the money. I was just wondering, for monitoring, configuration management, you know, are you really saying don’t spend any money on tools, and either create them in-house or use open source?
Jez: So, a lot of the things I say, I say to get a rise, and that’s kind of one of them. But I say it for a reason as well, because a lot of the way we learn is by imposing artificial constraints. So, for example, I learned TDD and object-oriented programming basically by working with a bunch of people who are much smarter than me at Thoughtworks. And one of the ways that I learned about [01:05:38] continuous integration was working with this guy called Ivan Moore. Now, most of us who grew up with a computer, our first language was BASIC; Ivan’s first language was Forth, because he had this computer called a Jupiter Ace, which no one else in the world had. He was like the only person who ever bought that thing. Just very clever. And he had this thing where, basically, at the end of the day, if you hadn’t checked in your code, he would just revert the working copy. So, you’re at the end of the day, you’re trying to work out what to do, you can’t back out, and he’s like, right, let’s go home, it’s five o’clock, I want my cup of tea, revert. And what would happen is, you would go to bed, you’d wake up the next day having slept on it, and then you’d get the whole thing done in about an hour, because you’d be like, oh, that’s how I do it.
So that’s an example of creating an artificial constraint as a way of innovating, and I think “don’t buy tools” is an example of that. It’s neither a necessary nor a sufficient condition. You can buy tools and do a great job. You can buy tools and do a shit job. You can not buy tools and do a great job. You can not buy tools and do a shit job. But by not buying tools, what you do is force yourself to learn how to do the thing that the tool would have done for you. Because there’s a problem when you buy a tool, and I’ll give you an example of what happens in the failure case. Now, I told you, don’t buy consultants, don’t buy tools, and I work for a company that sells consultants and tools. Whoops. So we make a tool called Mingle, which is for project management, and initially, when we first made it, we sold it to big companies, and what would happen is this: we made the tool very opinionated so it behaved in certain ways, and people would buy the tool, and they would say, nice tool, but we want to do this. We want to have transitions that you can’t bypass, we want to measure utilization. And we would say, no, don’t do that. And they’d say, no no no, you don’t understand, we want to do that. And we’d be like, no no, you want to go agile, right? That’s why you bought the tool, because you want to go agile. Those things aren’t agile. And they would say, no no, but that’s what we do. And we’d say, no, but you become agile by not doing these things. And they would say, here’s our money, implement the fucking feature. And, you know, we kind of, well... we want the money.
So that is what would happen. People want to implement [01:07:55] so they buy the tool, or they want to implement agile so they buy the agile tool, and then they customize the agile tool so that they can keep doing waterfall with the agile tool. That’s what happens, like, all the time. And I see it with Jenkins: people want to do CI, so they get Jenkins, or they buy whatever, and then they start running it on their feature branches. Right? It’s like, no, that’s not continuous integration. So that’s the failure mode: they buy the tool, and they don’t change their behavior, because behavior is difficult to change, but I have a capital budget, so I buy the tool, and nothing changes. So that’s why I’m saying: impose that constraint to force yourself to change the way you behave. It’s kind of a, yeah. It’s a thought exercise.
Audience: Can you talk about running CI on feature branches?
Jez: Can I talk about, okay, can I talk about running CI on feature branches.
Audience: Or rather, do you define CI as not running it on feature branches? So what is wrong with running whatever you call it on feature branches?
Jez: So here’s the thing about continuous integration. It’s composed of these two words: continuous and integration. And integration means that you’re not on a branch. That’s what integration means. Integration means that the branches are integrated into trunk. And I’ve seen people try to weasel out of this by saying they’re integrating from trunk onto their branches; that’s not what it means. It means integrating into trunk. And continuous, as Mike Roberts will tell you, means a lot more often than you think.
Audience: Is that my point?
Jez: Yeah, that’s yours, that’s yours. I use that all the time, it’s brilliant. So you’ve got Mike Roberts to thank for that, because it’s true. You know, people are like, oh, we’re going to do continuous improvement by doing retrospectives once a month. Once a month? That’s not very continuous. We’re going to do continuous integration by merging our feature branches once a week. Not continuous. So, I can give you the stupid answer, which is: that’s the definition. That doesn’t tell you why you have to do it that way. And I guess your question is why we should do continuous integration. So, that’s going to be—
Audience: Not why should we do continuous integration, but why is running it on feature branches wrong?
Jez: Why is running integration on feature branches wrong. The reason people make this mistake is because they think that what you’re trying to do is make sure the branches will merge into trunk, which is a linear problem, an O(N) problem: will the individual branches all merge into trunk? That’s not the actual problem. If you want to release what everyone’s doing, you have to merge all the branches, and that means the branches all have to merge with each other, and that’s a combinatorial problem. So, I’ve tried to get around this with people working on GitHub by having a thing which takes all the feature branches, tries to merge trunk with each feature branch, and then runs the build and tests. That is often broken. Then, for shits and giggles, I have a little thing that tries to merge all the branches into trunk and run the tests, and I never actually implemented the thing to run the tests, because the branches will never ever merge into trunk. I mean, that build always fails, right? But when you actually want to release the fucking thing, that’s what you have to do. You have to merge all the branches into trunk. So, people say, oh, well, I can just merge my branch, but you’ve forgotten that there are other people on the team, and they have to merge their branches as well to release it. What you’re doing is ignoring the fact that it’s a combinatorial problem. And this is why feature branching doesn’t scale: the more people you have, the merging problem is combinatorial, which scales non-linearly. And also, when you’re in that situation, what you want to do is not merge your branch for as long as possible, because when you do, everyone’s going to hate you, right? You merge your branch in, you’re refactoring some stuff, everyone else on your team pulls that, and they start coming over and shouting at you, right? That’s what happens.
Everyone wants to get in first, but if you get in first, everyone else is going to be pissed off, and then no one can refactor. That’s what happens. Refactoring goes away because everyone’s scared of refactoring because it breaks everyone else. So that’s the ultimate problem. But it’s because it’s a combinatorial problem that gets exponentially worse with the number of branches and the length of time those branches remain unmerged, whereas continuous integration is something that scales. And I’m going to show you a slide, which I show everyone.
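[Editor’s aside: the arithmetic behind “linear versus combinatorial” can be sketched in a few lines. This is a simplified model with illustrative numbers, not anything from the talk: per-branch “does my branch merge into trunk?” checks grow linearly, while release-time integration requires every pair of branches to merge, which grows quadratically in the number of pairs.]

```python
from math import comb

def branch_to_trunk_checks(n_branches: int) -> int:
    # One "will my branch merge into trunk?" check per branch: O(N).
    return n_branches

def pairwise_integrations(n_branches: int) -> int:
    # Releasing everything means every branch must integrate with every
    # other branch: N choose 2 pairs, which grows quadratically.
    return comb(n_branches, 2)

for n in (2, 5, 10, 20):
    print(n, branch_to_trunk_checks(n), pairwise_integrations(n))
```

At 2 branches the two numbers are close; at 20 branches there are 20 trunk checks but 190 pairs of branches that all have to agree, which is the pain Jez is describing.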
Audience: George Carlin’s bit.
Jez: Hey, George Carlin. And for people who don’t know about this, it kind of blows their minds, but it’s very important to bear in mind. Continuous integration does, in fact, scale. All of Google, apart from Android and, I think, Chrome, is in one enormous Perforce repository, on trunk. Everyone at Google works off trunk on Perforce, and all the Google properties are built off trunk. So at Google, if someone makes a change to a library and it breaks the thing you’re working on, you can just revert their change out of trunk. They have two hundred thousand test suites in their code base, they run ten million test suites per day, they have more than sixty million individual test cases per day and growing, and they have more than four thousand CI builds. So, Google is living proof that this works at scale, and organizations like Amazon and Facebook are all working off trunk too. Because it’s the only thing that scales.
Audience: What do you mean by more than four thousand builds? Does that mean they run it more than four thousand times or they have four thousand different CI setups?
Jez: No, I think there are four thousand runs of a thing per day. But what they have is, because their repository is so big, they can’t check it all out every time, so they have this really simple thing based on makefiles which does impact analysis. So when I change a class, it looks at the makefiles and sees which things are affected by that, and just rebuilds those things. And that’s what a CI run is: I see this change, here are the things that are impacted, I’ll just rebuild those things.
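[Editor’s aside: the impact analysis Jez describes can be sketched as reverse reachability over a build dependency graph. The target names and graph below are made up for illustration; this is far simpler than Google’s real build system:]

```python
# Toy build graph: each target maps to the set of targets it depends on.
deps = {
    "search_frontend": {"query_lib", "ui_lib"},
    "ads_frontend": {"query_lib"},
    "query_lib": {"base_lib"},
    "ui_lib": {"base_lib"},
    "base_lib": set(),
}

def affected_targets(changed: str) -> set:
    """Return `changed` plus every target that transitively depends on it."""
    out = {changed}
    grew = True
    while grew:  # keep sweeping until no new dependents are found
        grew = False
        for target, ds in deps.items():
            if target not in out and ds & out:
                out.add(target)
                grew = True
    return out

# Changing base_lib means rebuilding and retesting everything above it;
# changing ui_lib only touches the one frontend that uses it.
print(sorted(affected_targets("base_lib")))
print(sorted(affected_targets("ui_lib")))
```

The CI run then rebuilds and retests only the affected set rather than the whole repository, which is what makes a single huge trunk tractable.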
Audience: There’s more than four thousand a day.
Jez: Is it?
Audience: I don’t think so.
Jez: Is it? This is a couple of years old. This is John Penix, he’s the guy who manages that.
Audience: So Paul had a couple [01:14:31] recent talks, both Facebook and [01:14:34].
Jez: Yeah, Jason Labor did a talk, and I’ve got him doing a talk at QCon this year, but I haven’t got his slides to hand, and it’s probably higher than that now. Yeah, this is a couple of years old.
Audience: A lot of companies that you call [01:14:47] as opposed to [01:14:48], people conflate these with feature branches, but what they often have is what they would call [01:14:53] branches, like we have a branch 1.8, and 1.9, 2.0, and so on and so forth. And I believe that [01:15:02] continuous integration on these branches is better than nothing. Especially since these branches are always released in isolation, so when you release 1.8 you don’t need to work to merge 1.9 into 1.8, because 1.9 comes after that, which we control; we just merge 1.8 into 1.9 afterwards. So it’s not ideal, but it’s probably better. What do you think about that?
Jez: Yeah, so the question is, if you take branches for point releases, and you’re not merging stuff from the branches into trunk because you’re not actively working on those branches, is that okay? And I think that’s fine. I mean, the problem is not branching, the problem is merging. So there’s no problem taking branches; the problem comes when you have to merge them. That’s where you get the pain. So you can take as many branches as you like, and [01:15:57] does a really nice presentation about this, where he goes through the whole thing about branching and continuous integration. I’m just going to get you the... it’s down here somewhere, there we go. So his point is that taking branches for releases is okay, because all you’re merging back into trunk is bug fixes. And those are small, and there aren’t many of them, unless you’re doing a really bad job, so that’s okay. You’re just taking little things and pushing them into trunk; that’s fine. And again, if you’re doing an architectural spike, and you create a branch and do some stuff, and then throw it away and do the real implementation, that’s fine too. The problem comes when you have to merge big gobs of stuff back into trunk at the end. So, as long as the developers are doing new work on trunk, then taking a release branch, pushing to production from the release branch, making bug fixes on the release branch and putting them into trunk, I have no problem with that. I mean, what happens is as you start—
Audience: [01:17:04 – 01:17:06] not really trunk, they are developing on their own branch, and merging back individually into trunk.
Jez: Well, as long as—where you get problems is when people are working on two of those release branches simultaneously. That’s when you get the problems. Because what happens is, you ultimately have to merge, and that integration is always a nightmare. It always takes weeks, and you always have to redo a bunch of work. If you’re doing this right, if you’re doing incremental development on trunk, you get rid of the whole integration and stabilization phase. That just goes away. And what you’re doing is taking the pain—and developers don’t like it. Developers don’t like continuous integration, because what developers want to do is sit in their cube with their headphones on and write code forever. Right? And continuous integration makes you check into trunk at least once a day. And when you do that, version control does its job, which is to be a communication mechanism. That’s what version control is, a communication mechanism, and you’re making developers check in a lot more often so that other people know what they’re doing. And then what happens is, I check something in, other people look at it, they’re like, why are you doing that, and they come and talk to me, and suddenly we’re talking to each other a lot more, and developers don’t like that. I mean, the reason people become developers is so that they don’t have to talk to other people. And now we’re making them talk to each other, and they don’t like it. So developers like to work on their branch, and then integrating it and making it work is somebody else’s problem. And those people get screwed, but we don’t care, because we’ve got our headphones in and we’re writing code.
Audience: It is often very just [01:18:50] branches, because they say, for instance, I want to develop features for the January release and I want to develop [01:19:00] for the February release.
Jez: Right, because they don’t want to have to worry about what other people are doing. My point is, ultimately you are going to have to worry about what other people are doing. I mean, we’re never just going to release the January release and not release the February release, right? You’re going to have to release the February release, ideally in February. So at that point you’re actually going to have to merge the stuff, and then it’s going to be somebody’s problem. And as long as it’s not me, that’s fine. So what you’re talking about is the developer optimizing for getting their work done, by which they mean, it works on my machine, or maybe it works in January. What you’re not optimizing for is how quickly we can release features into production. So the developers are optimizing for the wrong thing. They’re not optimizing for how quickly can I get features to users; they’re optimizing for how can I make sure that we don’t have to worry about what the February release is doing till March. Or maybe April, if the merge goes really badly.
Audience: And there are… right, like what about software that needs certification before it hits the user?
Jez: What if software needs certification before it hits the user? Why does the software need certification before it hits the user?
Audience: There are federal regulations…
Jez: That’s right. And what are the federal regulations checking?
Jez: Not what methods are they using, what are they checking? They’re making sure that it works, right? So, how does filling in a form make sure that it works?
Audience: It’s not filling in a form. It’s peer reviews.
Jez: Peer reviews are fine, I have no problem with peer reviews. My problem with using code review as certification is that the code reviews aren’t being done often enough. If you do the code reviews prior to every release, what happens is you’ve got a big f-ing load of— that diff is enormous, right? And you’re going to tell me that somebody’s going to read that diff, and they’re going to understand it, and they’re going to understand what the implications of those code changes are for the behavior of the system? That’s only true for a trivially simple system. Formal methods only work on trivially simple systems; that’s why people don’t use them anymore. I mean, you cannot predict the effects of a ten-page diff on the codebase. And a lot of the time, you can’t predict the effects of a one-page diff on the codebase. I don’t believe you, and I know it doesn’t work. It’s an example of what I call risk management theatre. When we do code reviews, we should be doing code reviews all the time, continuously. And the way we do code reviews is by pair programming with people. That way, we are continuously reviewing the code that people are writing, we’re continuously checking on it, and we’re asking questions. And then when we check in to trunk, instantly, everybody else gets that code. And then the dev lead on the team is reviewing the Atom feed of what’s going into trunk, so we’re doing code review at a second level. Then we’re running automated tests off trunk which actually validate the functional and non-functional behavior of the system, making sure it’s fit for purpose and there are no known regressions in it. And then we’re doing performance testing, and then we’re doing usability and exploratory testing. And at every stage, we’re feeding back to the stage before, making sure that any problems we catch later, we put tests in place to catch earlier. That’s an effective way to manage risk.
Having a massive diff just before the release and inspecting it to see what effect it will have is nonsense.
Audience: Do we know of any examples in aeronautics or other places where they’re doing continuous integration, or rather, continuous delivery?
Jez: Yep, SpaceX. There’s a pamphlet that, I can’t remember, some consultancy did: a continuous delivery issue of their innovation magazine. There’s an article from the guy who’s head of engineering at SpaceX talking about how they do TDD and continuous delivery at SpaceX, which is one of the reasons why SpaceX is the first commercial company that’s been able to actually launch rockets into space, carry satellites, and dock with the International Space Station. You can actually download it for free. I’ll see if I can—
So, this is Technology Forecast by PwC. And they’ve got “Aerospace industry moving toward a DevOps testing and development environment.” This is the guy who’s head of IT at SpaceX, or rather, head of software development at SpaceX. Yes?
Audience: So, we’ve talked a lot about continuous integration and continuous delivery, and continuous improvement- a lot of things that are continuous. We still haven’t learned how to grow the next John Allspaw.
Jez: Alright. I want one too, man. He’s awesome. So, I think—I mean, if I could give this talk in one sentence, I would say that growing John Allspaws is how you validate that your company is doing the right things. The problem is, you can go and look at what Flickr did, which enabled them to do what they did, and at Etsy and what they did, and the problem is, again, it’s a complex system. We have this cognitive bias called the narrative fallacy, which, again, is something that Nassim Taleb writes about, where as human beings we take a sequence of events and try to reconstruct a story out of them, because that’s how we’ve evolved. And that’s really dangerous. I mean, it’s dangerous for safety analysis, because, again, we try to create a story of how the failure happened, and in complex systems that’s a really dangerous thing to do. As for trying to understand the story of one company and synthesizing that across multiple companies, I would be lying if I told you I knew how that happens. What I’m trying to draw out is kind of high-level heuristics and meta-methods around how to do that. So, for example, the Improvement Kata for me is a meta-method. It’s a way of trying to— the point of the Improvement Kata is that the people doing the work learn how to get better at doing their work. We don’t teach them how to do their job, and we don’t teach them how to get better at their job; we create an environment where their job is to get better at their job, and that’s how you create John Allspaws. So the Improvement Kata is a meta-method which tells people how to go about doing their work in such a way that producing knowledge and growing capability is as important as the product you come out with at the end of it. And that’s what’s key to creating an innovation culture, in my opinion.
Audience: Yes, absolutely. However, you have to start with somebody who wants to get better at their job. Where we have the case of developers who sit with their headphones on, they clearly don’t want to have anything to do with what everybody else is doing. So where do we find, or how do we encourage, that in people who are already there? From scratch, starting new—
Jez: So the problem is you’ve got a bunch of people that don’t behave this way, and you’ve got to change their behavior. The key insight, in my opinion, is that people’s behavior is not just a function of themselves; it’s a function of the system in which they’re working as well. So if people behave a certain way, that’s not just innately the way they behave; how they behave is also a function of the system they work within. There’s a couple of things we can do, and they’re complementary. You can’t just change the system and expect everyone to adapt. And the reason you can’t do that is because, again, if you plan the to-be state and you just change everything, you just create chaos. We’re in a complex system; you can’t predict how people will change in response to changes, and this is why the Improvement Kata works in these small horizons. It’s like, we don’t know how to get from A to B, so we’re going to try to change this thing and we’re going to look at these measurable outcomes.
An example of this is metrics. What they would measure on the HP LaserJet team would change every month. They didn’t have a single metric they would measure forever. They actually worked out what they were going to measure and changed it all the time. So one month they were like, let’s measure the number of defects, or let’s measure the number of builds we’re getting out of trunk. And next month, let’s get this many builds out of trunk instead of that many. They would set the target for the next month, and they would see how people’s behavior changed in response to it. And they would see if things got better, or if people were finding ways to game the system. And in response to that they would say, okay, either that was a good thing to measure and we’ll keep on measuring it and set a new target, or that was a bad thing to measure and we shouldn’t have measured it, and we’re going to try and measure something else instead and see how that changes behavior. So again, complex system: you don’t know what the effects of those changes are going to be, so you mitigate—it’s optionality, this idea that we limit the downside of a particular option that we take. And we limit the downside by having a time horizon, by saying we’re going to do this for a month and see what happens. And if it’s the wrong option, we can see that, cut our losses, and change it. So, it’s that idea of limiting downside by having a time horizon. And the mistake that people make is they ask, what should I measure? I don’t know what you should measure. What you should measure is going to depend on how things are now and what you’re trying to achieve. Based on that, you’ll measure something, and that will then change the state of the system, which means that, A, you’re going to find out if you got towards that thing.
If you got towards that thing, you’re then going to find out if it was a thing you actually wanted to get towards in the first place, which it might not have been. And in the process of doing that, you’re also going to change the existing situation, so then you need to start again and look at— So that, again, is the meta thing that you need to do, which is why asking what should I measure, what process should I put in place- you can’t do that. You’re changing people’s behavior. We have a name for people whose behavior changes unpredictably. We call them psychopaths. So we don’t want people to wildly and unpredictably change their behavior. We build up trust by knowing how people behave. And so we want to change things incrementally and iteratively in order to actually see the effects of what we want to create. Does that answer your question? I kind of went off on a bit of a tangent.
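[Editor’s aside: the time-boxed measure-observe-review loop from the HP LaserJet story can be sketched as a tiny decision procedure. The names, fields, and thresholds here are illustrative only, not HP’s actual process:]

```python
# Hypothetical sketch of one time-boxed metric experiment.

def run_experiment(metric: str, target: float, observed: float) -> dict:
    """Record one time-boxed period of measuring `metric` against `target`."""
    return {"metric": metric, "target": target,
            "observed": observed, "met": observed >= target}

def review(result: dict, gamed: bool) -> str:
    # The downside is limited by the time horizon: if the metric was being
    # gamed, cut your losses and measure something else next period.
    if gamed:
        return "drop metric, try another"
    # Otherwise keep it and set a fresh target for the next period.
    return "keep metric, set new target"

march = run_experiment("green builds from trunk per day", 10, 12)
print(review(march, gamed=False))
```

The point is the loop, not the code: each metric lives only one period, and the review decides whether the metric itself survives, which is what keeps a bad measurement from doing a month’s worth of damage.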
Audience: Good enough.
Jez: Thanks. Good enough, or shut up? Alright. More questions?
Jez: No, I’m not saying don’t work on branches. What I’m really saying is keep the inventory on the branch small. A branch should never be so far away from trunk that you can’t read the diff on a page and understand what it means. That’s what I’m saying. Because think about open source projects. Say you’re the maintainer of an open source project, and say you don’t know who I am, and I email you a ten-page-long patch. What are you going to do? What are you going to say to me? What are you going to tell me to do? Not f- off, but what productive, helpful, nurturing advice are you going to give me? Send smaller patches, right? You’re going to tell me to break it up into smaller things, so that when you get the smaller thing you can read it, understand it, and know whether it’s the right thing to do.
What we’re doing is managing work in process. Every branch is work in process, and what we want to do is limit work in process and make sure branches don’t diverge too far. That’s what I’m saying: don’t let your branches diverge too far from trunk.
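[Editor’s aside: one way to make “don’t diverge too far” concrete is to measure the size of the branch-versus-trunk diff and flag it when it stops fitting on a page. This sketch uses Python’s difflib on in-memory file contents; the one-page budget of 50 changed lines is an arbitrary choice:]

```python
import difflib

def divergence(trunk_lines: list, branch_lines: list) -> int:
    """Count added plus removed lines between trunk and a branch copy."""
    diff = difflib.unified_diff(trunk_lines, branch_lines, lineterm="")
    return sum(1 for line in diff
               if line.startswith(("+", "-"))
               and not line.startswith(("+++", "---")))

PAGE = 50  # rough "one readable page" budget for a diff (arbitrary)

trunk = [f"line {i}" for i in range(100)]
branch = trunk[:90] + [f"rewritten {i}" for i in range(10)]
size = divergence(trunk, branch)  # 10 removed + 10 added = 20 changed lines
print("readable" if size <= PAGE else "too divergent, split the work up")
```

In a real setup you would diff the branch head against the merge base with trunk, but the inventory-limiting idea is the same: when the number goes over budget, integrate before writing more.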
Audience: What do you think of pull requests as a code review mechanism?
Jez: What do I think of pull requests as a code review mechanism? I think they’re a fabulous code review mechanism.
Jez: Yeah, I mean, the thing is, right, a branch is one of these terms that has multiple different meanings. Your working copy is a branch, right? Something on GitHub, your copy of the repo that you give me a pull request from is a branch.
Jez: I think it’s fine. The one thing I will say is that I don’t believe developers should be prevented from checking into trunk. That’s part of creating a high-trust culture. If you don’t let developers check into trunk, what you’re saying is, we don’t trust you. And we’re going to optimize for the situation where we assume you’re going to be wrong, and keep you on a branch, rather than letting people check in on the assumption that developers are sensible and know how to do the right thing, and then catching the exceptional condition, where they do something wrong, with the test suite.
So, you’re saying something about the people you’re working with by not letting them check into trunk. You’re saying, I don’t trust you to do the right thing, and I’m going to optimize for the case where you screw it up.
Jez: I was at Etsy a couple of days ago. John also has this great story where basically he’s walking past a developer, and the developer says to John, “I’ve got this change. Do you think I should release it?” And John turns around to her and says, “I don’t know, should you release it?” And that kind of sums up the culture in a nutshell, which is that you as the developer are responsible for what you do. And developers feel a bit nervous about that. And that’s a good thing. It’s good for you to feel a bit nervous. It’s good for you to be a bit worried about it, because that means you’re taking responsibility for your actions. All of these things that we put in place to prevent developers from doing bad things abstract people away from the consequences of their actions. That’s why I don’t like this review-at-the-end-of-the-release thing, because what happens is you get the feedback all this time afterwards. You don’t know which thing you did was wrong, because the feedback loop’s too long. We abstract people away from the consequences of their actions. People only learn to get better if, when they do something wrong, they get the slap in the face straightaway. At Etsy, they let developers push code out without any checks. In fact, the CI build starts at the same time that they push to production. It’s not that they do the CI build, then there’s a bunch of steps, and then it goes to production. I push to production, and the CI build starts at the same time, in parallel with the push to production, because we trust that the developers have run that change against the CI system and against all the tests before they push.
Jez: Well, I think it should be—again, there’s this great quote. There’s a guy called … who’s CFO at …, who implemented this whole beyond budgeting thing where they don’t do budgeting at … And he has this great quote; he says, “You can’t change command and control through command and control.” You can’t create a high-trust environment by getting everyone in a room and saying, “Right, now we’re going to trust each other. Go!” Right? That doesn’t work. You can’t say, “Okay, now it’s all your problem.” A bunch of people who’ve worked in a command and control system won’t self-organize into this fabulous high-trust thing. That doesn’t happen, right? It’s an ongoing co-creative process of culture change that has to happen. So, pair programming—when we help people adopt these practices, we always say, find volunteers. Start with the people who want to do this. Say, we’re going to try doing this thing; who would like to try it? And what you’ll find, unless you’re really screwed, is that some people will stick up their hands and say, yes, we want to try it. And you start with those people. And then what happens is they do some cool stuff, and then other people are like, that looks cool, maybe I should try that. And then they say, can I try that? Yes, of course you can try it! Don’t hide it away. That’s why I don’t like … because … is hidden away. Do it right in the middle of everything, accept that things are going to go wrong, give people the option to try it, and then more people want to do it. And then what happens is, once you get over 50% of the organization, the people who didn’t want to do it before, the people who just want to be normal and do what everybody else does, they’re not the normal people anymore; they’re the outliers. Those people don’t like being outliers, so then they’re like, oh sh-, and either they quit, or they’re like, okay, how do I get to do this?
So there is a kind of natural dynamic to it. There’s an adoption curve, same as anything else, and you have to play to that. You can’t force people to do things. I mean, you can force people to do things, but what happens is, when you remove the pressure, those people just go back to what they were doing before. It doesn’t cause permanent behavior change. What causes permanent behavior change is people working out for themselves that they want to do it. It’s like an addiction: you can’t tell people to stop having the addiction. People have to first acknowledge that they want to change. You can’t force people to change.
I think it’s beer o’clock, but I’m here for a bit longer. So, let’s chat. Thanks again!
The video for this talk can be found here.