Bare Metal Machine Management
Self Service Bare Metal Machine Management
Since Spotify’s inception, the way we provision and manage machines has evolved dramatically. Today we have a platform that is capable of provisioning bare metal machines on demand as well as managing their complete lifecycle. However, this was not always the case. In this post I am going to tell you how we decoupled our job broker from the Configuration Management Database (CMDB) and created a unified interface for bare metal machine management, and share some lessons learned along the way.
From days to minutes
Over the years Spotify has reduced the amount of time it takes to provision machines for engineers from weeks to days, and now from hours to minutes. To get to this point we have transitioned from a completely manual process, to a tool-aided workflow, and now to a set of RESTful web services that interact with the underlying hardware.
When work started on the latest generation of our provisioning platform, there was an immediate need for engineers to be able to reinstall existing servers due to a planned OS migration. However, there was one caveat: reinstalls had to be entirely self service so the various engineering teams could autonomously upgrade their machines without the assistance of folks in Site Reliability Engineering (SRE). Due to these requirements, the project was split into two phases.
The first phase was to remove the barrier between engineers and SRE when engineers need to interact with machines they own, and the second was to rewrite how we perform provisioning to make it on demand.
Since the team first and foremost wanted to give engineers the ability to do in-place reinstalls, we had to rethink how we interact with servers in the datacenter. Central to any bare metal management platform are the Intelligent Platform Management Interface (IPMI) and the Preboot eXecution Environment (PXE). IPMI provides an out of band mechanism for us to interact with the underlying hardware, and PXE provides us with a way to boot via the network. However, we do not just use vanilla PXE; instead we chainload iPXE onto the Network Interface Controller (NIC) via PXE. This lets us boot over HTTP and dynamically control whether the machine should boot from the network or local disk based on the current CMDB state of a machine.
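To make the state-driven boot decision concrete, here is a minimal sketch of a function that picks the iPXE script to serve to a machine. It is an illustration under assumptions, not our actual service: the function name, the installer URL, and the choice of which state triggers a network install are hypothetical, and the state names are borrowed from the footnote at the end of this post. The only iPXE commands used are `chain`, which fetches and boots the next stage, and `exit`, which leaves iPXE so the firmware can try its next boot device.

```python
# Hypothetical mapping: only machines being (re)installed boot the installer.
NETWORK_INSTALL_STATES = {"installing"}


def ipxe_script(state: str, installer_url: str) -> str:
    """Return the iPXE script to serve for a machine in the given CMDB state."""
    if state in NETWORK_INSTALL_STATES:
        # `chain` downloads the next boot stage over HTTP and executes it.
        return f"#!ipxe\nchain {installer_url}\n"
    # `exit` leaves iPXE; the firmware then falls through to its next
    # configured boot device, which is assumed here to be the local disk.
    return "#!ipxe\nexit\n"


if __name__ == "__main__":
    # Example URL is a placeholder, not a real endpoint.
    print(ipxe_script("installing", "http://boot.example/installer.ipxe"))
    print(ipxe_script("in use", "http://boot.example/installer.ipxe"))
```

Because the script is generated per request, flipping a machine’s state in the CMDB is enough to change what it does on its next boot; nothing has to be reconfigured on the machine itself.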
These tasks were previously initiated from within our CMDB, which in turn distributed various IPMI commands to a message queue. However, these tasks were only accessible while provisioning a new machine and took a rather naive approach that assumed the hardware was always reliable. Although we could have added features to the CMDB to perform new tasks, doing so didn’t align well with the CMDB’s core objective of being Spotify’s “source of truth” production database. This led us to break out the job queue into its own service that is now independent of the CMDB, giving us more loosely coupled services.
In parallel with breaking out the job queue into its own service, we began writing a RESTful frontend service that can be used as the basis to provide users with a web interface. Its purpose is to provide an aggregation layer for the discrete job queues and CMDB, while providing a central place to perform authentication and authorization.
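As a rough sketch of what such an aggregation layer does, the snippet below checks ownership against a CMDB lookup before handing a job to a queue. Everything here is hypothetical: the `Machine` and `Frontend` types, the team-based ownership rule, and the in-memory dict and list that stand in for the real CMDB and job queue.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    hostname: str
    owner_team: str


@dataclass
class Frontend:
    cmdb: dict  # hostname -> Machine; stand-in for a CMDB lookup
    queue: list = field(default_factory=list)  # stand-in for the job queue

    def submit(self, user_teams: set, hostname: str, action: str) -> dict:
        """Authorize the caller, then enqueue a job for the machine."""
        machine = self.cmdb.get(hostname)
        if machine is None:
            raise LookupError(f"unknown machine: {hostname}")
        # Assumed rule: a user may act only on machines owned by their team.
        if machine.owner_team not in user_teams:
            raise PermissionError(f"not an owner of {hostname}")
        job = {"hostname": hostname, "action": action}
        self.queue.append(job)
        return job


if __name__ == "__main__":
    frontend = Frontend(cmdb={"host1": Machine("host1", "team-a")})
    print(frontend.submit({"team-a"}, "host1", "reinstall"))
```

Keeping the ownership check in one place means the job queue never needs to know who is allowed to do what; it only sees requests that have already been authorized.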
With the foundation of these new services operational, we were able to give engineers not just the ability to reinstall their machines but also the ability to trigger out of band restarts and recycles. This enables the engineering teams at Spotify to fully manage the life cycle of a bare metal machine without blocking on a human in SRE to look at their ticket. Since teams already have full autonomy and end-to-end control of their services, this system is a natural progression towards extending that autonomy one step further.
Over the course of many years at Spotify, little scripts have grown up into services with APIs, automating tasks that were once manual. With these two new services, Spotify now has a single interface for managing its bare metal fleet of servers, empowering engineers to create, destroy, or reinstall a server without the assistance of SRE.
During the implementation of these services, we were able to fix some issues that were plaguing the old system by adding smarter retry logic and better job tracking, giving us clearer indications of whether a job was successful or not.
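The general shape of retry logic combined with job tracking can be sketched as follows. This is an assumption-laden illustration rather than our implementation: the `Job` fields, the status names, and the exponential backoff policy are all hypothetical choices made for the example.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    status: str = "pending"  # pending -> running -> succeeded | failed
    attempts: int = 0
    errors: list = field(default_factory=list)


def run_with_retries(job, action, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run `action`, retrying with exponential backoff and recording each try."""
    job.status = "running"
    for attempt in range(1, max_attempts + 1):
        job.attempts = attempt
        try:
            action()
        except Exception as exc:
            # Keep every error so a failed job explains why it failed.
            job.errors.append(str(exc))
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
        else:
            job.status = "succeeded"
            return job
    job.status = "failed"
    return job
```

Recording the attempt count and every error on the job itself is what gives a clear answer to “did this succeed, and if not, why?”. The `sleep` parameter is injectable only so the backoff can be tested without waiting.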
Now that we have a service that is exposed directly to engineers, we have been working on adding more reliability into the system. We have discovered numerous bugs with flaky iDRAC firmware, IPMI libraries, and BIOS settings that have driven us to be even better at monitoring, so we have visibility into when and why failures occur. We have also found that monitoring the availability of the iDRACs on the network plays an important role in increasing reliability, because when you have thousands of servers, cables will become loose. We also automatically open a JIRA ticket any time a failure is detected from a user-initiated request. This helps keep the engineers in the loop and gives us reporting on what is breaking.
All of this work thus far has proved to be successful for Spotify. Usage of these services has blown past our initial expectations and engineers continue to express how happy they are to be able to manage their own machines.
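The ticketing rule can be expressed as a small hook that runs when a job finishes. The sketch below is hypothetical throughout: the job dictionary keys and the `open_ticket` callback are invented for illustration, and the callback stands in for whatever client actually talks to the ticketing system.

```python
def handle_job_result(job: dict, open_ticket) -> bool:
    """Open a ticket for a failed, user-initiated job.

    Returns True if a ticket was opened, False otherwise.
    """
    if job["status"] != "failed" or not job.get("user_initiated", False):
        return False
    open_ticket(
        summary=f"{job['action']} failed on {job['hostname']}",
        description="\n".join(job.get("errors", [])),
    )
    return True
```

Passing the ticket client in as a callback keeps the rule itself easy to test and independent of any particular ticketing API.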
 Spotify’s CMDB provides information about hardware and network addressing.
 State is used to track the lifecycle of a machine, i.e. available, installing, or in use.
 “Recycle” means the machine is returned to the pool of available hardware, to one day be repurposed with a new role.