Previously, we created a DigitalOcean account and an SSH key so that we can connect securely. If you have not done so already, go back to the first part and complete those steps.
Sean Kenny, Android developer at Spotify, talks about how they've reduced crash rates by building a veritable fortress of tests. Sean shares their mobile testing methodology and process, covering their testing levels (unit tests with Robolectric, test automation, and manual testing) and crash reporting, as well as how they've improved over time to create a more stable and robust application.
This video was recorded at the New York Android Developers meetup at Spotify in NYC.
Companies like Netflix, Google and Amazon are leading the way with Continuous Delivery. But most of us don’t work in a company like those. We don’t have a simian army. We don’t have a canary testing process. Working out where to start can feel overwhelming.
Docker Software Engineer Victor Vieux gives an overview of the new features in the Docker Engine and Docker Hub. New features for the Engine include the ability to pause and unpause a container, various networking strategies, .dockerignore, and much more. For the Hub, there are many new features including organizations and groups and official repositories. Victor also goes over what’s coming in the future for the Engine.
One of our key missions on the search team at Shutterstock is to constantly improve the reliability and speed of our search system. To do this well, we need to be able to measure many aspects of our system’s health. In this post we’ll go into some of the key metrics that we use at Shutterstock to measure the overall health of our search system.
The image above shows our search team’s main health dashboard. Anytime we get an alert, a single glance at this dashboard can usually point us toward which part of the system is failing.
On a high level, the health metrics for our search system focus on its ability to respond to search requests, and its ability to index new content. Each of these capabilities is handled by several different systems working together, and requires a handful of core metrics to monitor its end-to-end functionality.
One of our key metrics is the rate of traffic that the search service is currently receiving. Since our search service serves traffic from multiple sites, we also have other dashboards that break down those metrics further for each site. In addition to the total number of requests we see, we also measure the rate of memcache hits and misses, the error rate, and the number of searches returning zero results.
One of the most critical metrics we focus on is our search service latency. This varies greatly depending on the type of query, number of results, and type of sort order being used, so this metric is also broken down into more detail in other dashboards. For the most part we aim to maintain response times of 300ms or less for 95% of our queries. Our search service runs a number of different processes before running a query on our Solr pool (language identification, spellcheck, translation, etc.), so this latency represents the sum total of all those processes.
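To make the "300ms or less for 95% of our queries" target concrete, here is a minimal sketch of how such a check could be computed from raw latency samples. It is not our production code: the function names, the nearest-rank percentile method, and the sample values are all assumptions made for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_latency_target(latencies_ms, target_ms=300, pct=95):
    """True when the pct-th percentile latency is at or under the target."""
    return percentile(latencies_ms, pct) <= target_ms


# Made-up latency samples in milliseconds, for illustration only.
latencies_ms = [120, 95, 310, 180, 240, 150, 90, 275, 200, 130]
print(percentile(latencies_ms, 95))
print(meets_latency_target(latencies_ms))
```

With only ten samples a single slow query decides the result, so a real dashboard would compute this over a much larger rolling window.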
In addition to search service latency, we also track latency on our Solr cluster itself. Our Solr pool will only see queries that did not have a hit in memcache, so the queries that run there may be a little slower on average.
When something in the search service fails or times out, we also track the rate of each type of error that the search service may return. At any time there’s always a steady stream of garbage traffic from bots generating queries that may error out, so there’s a small but consistent stream of failed queries. If a search service node is restarted we may also see a blip in HTTP 502 errors, although that’s a problem we’re trying to address by improving our load balancer’s responsiveness in taking nodes out of the pool before they’re about to go down.
A big part of the overall health of our system also includes making sure that we’re serving up new content in a timely manner. Another graph on our dashboard tracks the volume and burndown of items in our message queues, which serve as our pipeline for ingesting new images, videos, and other assets into our Solr index. This ensures that content is making it into our indexing pipeline, where all the data needed to make it searchable is processed. If the indexing system stops being able to process data, then that will usually cause the burndown rate of each queue to come to a halt.
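As a rough illustration of what "burndown coming to a halt" means, the sketch below derives a drain rate from periodic queue-depth samples and flags a stall. The sample format, the function names, and the stall rule are assumptions for this example, not a description of our actual monitoring code.

```python
def burndown_rates(samples):
    """Items drained per second between consecutive (timestamp_seconds, queue_depth) samples.

    Samples must be ordered oldest first. A negative rate means the queue grew.
    """
    rates = []
    for (t0, depth0), (t1, depth1) in zip(samples, samples[1:]):
        rates.append((depth0 - depth1) / (t1 - t0))
    return rates


def is_stalled(samples, min_rate=0.0):
    """Stalled: work is still pending and no interval drained faster than min_rate."""
    if len(samples) < 2:
        return False
    still_pending = samples[-1][1] > 0
    return still_pending and all(rate <= min_rate for rate in burndown_rates(samples))


# Made-up samples: (seconds since start, items waiting in the queue).
healthy = [(0, 1000), (60, 700), (120, 400)]
stuck = [(0, 500), (60, 500), (120, 520)]
print(burndown_rates(healthy))
print(is_stalled(healthy), is_stalled(stuck))
```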
There are other ways in which our indexing pipeline may fail too, so we also have another metric that measures the amount of content that is making it through our indexing system, getting into Solr, and showing up in the actual output of Solr queries. Each document that goes into Solr receives a timestamp when it is indexed. One of our monitoring scripts then polls Solr at regular intervals to see how many documents were added or modified in a recent window of time. This helps us serve our contributors well by making sure that their new content is being made available to customers in a timely manner.
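A polling check along these lines could be sketched as follows. It relies on Solr's standard select handler, date-math range queries, and the `numFound` field of the JSON response; the host, the core name `assets`, the field name `indexed_at`, and the 15-minute window are hypothetical placeholders rather than our real configuration.

```python
import json
from urllib.parse import urlencode


def recent_docs_url(base_url, core, field="indexed_at", window="15MINUTES"):
    """Build a Solr select URL that counts documents indexed within the recent window.

    rows=0 asks Solr for the match count only, not the documents themselves.
    """
    params = {"q": f"{field}:[NOW-{window} TO NOW]", "rows": 0, "wt": "json"}
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


def count_from_response(body):
    """Extract the number of matching documents from a Solr JSON response body."""
    return json.loads(body)["response"]["numFound"]


# Hypothetical host and core; a monitoring script would fetch this URL on a schedule.
url = recent_docs_url("http://localhost:8983", "assets")
print(url)

# Canned response body standing in for a real HTTP call.
sample_body = '{"response": {"numFound": 42, "start": 0, "docs": []}}'
print(count_from_response(sample_body))
```

A script built on this would alert when the count stays at zero for longer than new content normally takes to arrive.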
Behind the scenes we also have a whole host of other dashboards that break out the health and performance of each system covered in this dashboard, as well as metrics for other services in our search ecosystem. When we’re deploying new features or troubleshooting issues, having metrics like these helps us quickly determine the impact and guides us toward a fast resolution.
This article first appeared on Shutterbits.
Etsy engineers deploy 40+ times per day to Etsy.com. How does a team of 175+ committers maintain uptime for 60+ million unique monthly visitors?
An increasing number of applications are built on APIs, both for internal services that communicate with each other and for external consumers. These critical APIs must be monitored. In this talk, Garland Kan (Engineer, Algorithms.io) explains how engineers can use Frisby.JS as an API testing and monitoring tool.
Over the years, bitly has built a number of large scale systems to handle and analyze billions of clicks each month. Distributed systems can often be challenging to build and operate, but they can offer significant benefits in terms of availability, cost effectiveness, and capacity.
In this talk, Sean O'Connor (Lead Application Developer, Bitly) covers the DevOps best practices, processes, approaches, and tools his team uses at Bitly to do 20-30 deploys each day to over 400 servers without causing any major disruptions to service.
Hi all, we’ve got another awesome DevOps best practices interview, this time with the folks at Shutterstock. Lead architect Nathan Milford and Cloud Solutions architect Sebastian Weigand share their insight into the best practices that they employ.
Selenium and the Page Object Model serve as building blocks for our testing suite at Axial. But we’re missing the glue that connects it all together: pytest. Pytest is a popular alternative to Python’s built-in unittest framework, offering many highly useful features unavailable in unittest. Our favorite features include:
In this talk, Pascal-Louis Perez of Square NYC will share lessons learned at Square and provide a detailed layering of the best practices that they employ. This talk will cover non-controversial topics such as Test Driven Development, but from new angles. We'll also cover emerging practices like continuous deployment, and softer areas such as engineering management practices geared towards safety. This talk was recorded at the Continuous Delivery NYC meetup at Etsy Labs.
Jez Humble: Thanks for having me, it’s a pleasure to be here. I brought all my clothes to New York, and I’m wearing them all because it’s quite cold. Although, I hear it’s actually warmer than it was last week.
In this talk, Chris Angove, Chapter Lead at Spotify, talks about deployment tools and techniques at Spotify, specifically issues surrounding the use of virtualenv within Debian packages: (http://labs.spotify.com/2013/10/10/packaging-in-your-packaging-dh-virtualenv/)
Monday, Jan 27th
Released 12 PM, Jan 27
Everyone is putting “DevOps” on their LinkedIn profile, and everyone is trying to hire them. In this post, Jez Humble from ThoughtWorks will argue that this is not a recruitment problem but an organizational failure. If you want to learn some of the best DevOps practices that your organization can employ, this is the talk for you.
Released 2 PM, Jan 27
At Axial they recognized the need to make their testing efforts more reusable and they accomplished this by doing what their corpus does best: build. They built Axium, a test automation suite that is easily executed, understood, maintained and configured. This article is part one of an insightful three part series.
Released 5 PM, Jan 27
Great Q & A in this article as Daniel Doubrovkine, Head of Engineering at Artsy, talks about some of the best DevOps practices at Artsy.
As the amount of unstructured data has greatly exceeded a single computer's ability to process it, data has become increasingly isolated from the compute elements. The resulting haul from stores of record (e.g., SAN, NAS, S3) to transient compute (e.g., Hadoop, EC2) creates needless mechanical work and human labor. Is there a better way?
The past few years have seen a phenomenon of software organizations abandoning traditional release cycles in favor of daily or even hourly deployments. The emergence of a new, rapid software development workflow has raised questions regarding the role of test and QA in a product’s life cycle. When there is no QA phase, can there still be QA? In this talk, Noah Sussman answers that question and more as he delivers an Intro to Selenium and continuous deployment.
In Part 1 we learned how to use Selenium IDE to automate the testing of common actions within the browser. In Part 2 we’ll take a programmatic approach to browser automation using the Selenium Webdriver API and the Page Object Model.
Integrating Selenium Webdriver API calls into your unit tests is easy. If we want to use Webdriver to automate a user searching “Axial Network” on Google, all we have to do is…
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()  # Open a Firefox browser
driver.get('http://www.google.com')  # Go to page
# find_element(By.ID, ...) replaces find_element_by_id, which Selenium 4 removed
driver.find_element(By.ID, 'gbqfq').send_keys('Axial Network')  # Type search term
driver.find_element(By.ID, 'gbqfb').click()  # Click search button
While this example uses Python to perform actions in Firefox, it’s also possible to use Java, C# or Ruby to perform actions in Chrome, Safari or IE. Selenium IDE test cases can be converted to Webdriver code by going to File > Export Test Case As… > Python 2 (WebDriver). The example here shows the result of converting our test case from Part 1, which tests weather.com’s ability to get the weather in New York.
Introducing the POM
How can we piece together our test cases in a way that produces a flexible and maintainable testing suite? What do we mean by “flexible and maintainable”?
1. One that requires minimal changes when application code changes
2. One that is reusable, in that testing logic can be shared across multiple test cases
3. One that can be run on multiple browsers
This is where the Page Object Model (POM) comes in. The POM is an approach that focuses on modeling all the attributes and actions that comprise each page, then writing test cases that perform those actions and verify results. This approach makes a lot of sense given we are trying to simulate user actions, which can be described in the form of “go to that page and do this, go to this page and do that, and so on”.
As an example, let’s model a login page…
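The original walkthrough is not reproduced here, so what follows is only a minimal sketch of what a login page object might look like under the POM. The URL, the element ids, and the class and method names are hypothetical. The locators are written as plain `("id", ...)` tuples, which is the value Selenium's `By.ID` holds, so the class works with any driver object that exposes `get` and `find_element(by, value)`.

```python
class LoginPage:
    """Page object: everything a test needs to know about the login page lives here."""

    URL = "https://example.com/login"  # hypothetical address
    # Hypothetical locators; ("id", ...) matches Selenium's (By.ID, ...) form.
    USERNAME_FIELD = ("id", "username")
    PASSWORD_FIELD = ("id", "password")
    SUBMIT_BUTTON = ("id", "submit")

    def __init__(self, driver):
        self.driver = driver

    def open(self):
        """Navigate to the login page and return self so calls can be chained."""
        self.driver.get(self.URL)
        return self

    def login(self, username, password):
        """Fill in the credentials and submit the form."""
        self.driver.find_element(*self.USERNAME_FIELD).send_keys(username)
        self.driver.find_element(*self.PASSWORD_FIELD).send_keys(password)
        self.driver.find_element(*self.SUBMIT_BUTTON).click()
```

A test case would then read `LoginPage(driver).open().login("alice", "s3cret")`. If the page's markup changes, only the locators in this class need updating, which is the "minimal changes when application code changes" property described above.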
The challenge with testing is no different than most: allocating resources. It’s hard to sacrifice resources for testing at the cost of new features and still feel a sense of accomplishment – especially when such a sacrifice needs to be made every time you release. At Axial we release every day and have come to realize that such sacrifices need to be made consistently in order to maintain a top quality product. We recognized the need to make our testing efforts more reusable. We accomplished this by doing what the corpus does best: build. From this we built Axium, a test automation suite that is easily executed, understood, maintained and configured. This is the first of a series of posts that shows how we built Axium and contains the following independent parts:
1. Selenium – For test automation
Everyone is putting “DevOps” on their LinkedIn profile, and everyone is trying to hire them. In this talk, Jez Humble from ThoughtWorks will argue that this is not a recruitment problem but an organizational failure. If you want to learn some of the best DevOps practices that your organization can employ, this is the talk for you.
In this talk, John Allspaw, Senior Vice President of Technical Operations at Etsy, talks about the development and deployment process at Etsy. He traces it back from its roots at Flickr to the present day, and also goes over the rationale behind many of its guiding principles. This talk was recorded at the Continuous Delivery NYC meetup at Etsy.
In this talk, Bethany Erskine, DevOps Engineer at Paperless Post, presents an in-depth look into why Paperless Post chose Sensu, and how they monitor their services and collect system metrics to send to Graphite. Subtopics include how they planned for and executed the migration, the mistakes they made along the way, how they knew when to scale, and how they did scale. Bethany also covers how Paperless Post is making their Sensu setup redundant and highly available, and how they're monitoring and collecting metrics about Sensu and integrating their internal tools with it. This talk was recorded at the NYC Devops meetup at Meetup HQ.
In this talk, Camille Fournier from Rent The Runway gives an introduction to Apache ZooKeeper. She explains why it's useful and how you should use it once you have it running. Camille goes over the high-level purpose of ZooKeeper and covers some of the basic use cases and operational concerns. One of the requirements for running Storm or a Hadoop cluster is to have a reliable ZooKeeper setup. When you’re running a service distributed across a large cluster of machines, even tasks like reading configuration information, which are simple on single-machine systems, can be hard to implement reliably. This talk was recorded at the NYC Storm User Group meetup at WebMD Health.
In this talk, "Scaling Deployment," Daniel Schauenberg from Etsy talks about the development and deployment infrastructure that they use at Etsy. This talk was recorded at the Continuous Delivery NYC meetup at Etsy Labs. At Etsy they have over 100 engineers deploying more than 60 times a day. This culture of continuously deploying small change sets enables them to build and release robust features, all while serving over a billion page views per month. In order to make sure they can keep up this pace, they have development and deployment infrastructure in place that makes it comfortable and simple to make changes. So simple that as an engineer at Etsy you deploy the site on your first day - even if you're a dog.
In this talk, we'll hear from Sam Helman, Software Engineer at MongoDB (formerly 10gen), on how MongoDB is integrating Go into their new and existing cloud tools. Some of the tools leveraging Go include the backup capabilities in MongoDB Management Service and a continuous integration tool. They see using Go as an opportunity to experiment with new technologies and create a better product for end users. This talk was recorded at the MongoDB User Group meetup at MongoDB.
Organized by SF Bay Area Large-Scale Production Engineering, this is the first talk from their Dynamic Scaling meetup last week at Yahoo! URL's cafe. The talk is from Coburn Watson, Manager of Cloud Performance at Netflix. It's called "Dynamically Scaling Netflix in the Cloud."