Andy Dirnberger

iHeartRadio ingests hundreds of thousands of products each month. Historically, when a new product delivery was received, a user would manually initiate the ingestion process by entering its file path into a form on a web page, triggering the ingestion application to parse the delivery and update the database. Downstream systems would constantly poll the database, run at regularly scheduled intervals, or be triggered manually. This process, roughly visualized below, was reasonable, with new content arriving in the catalog within a few days of receipt.

Linear ingestion flow

Ingestion v2


As new content providers were added, new distribution formats needed to be accommodated, and more and more code was added to the application. Eventually we developed a new version of the application, this time introducing an XSLT stylesheet for each provider. These stylesheets transformed the providers’ formats into a single, canonical format. This simplified things, as the ingestion application then only needed to know how to parse one XML format.
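Applying a per-provider stylesheet with lxml looks roughly like this (a minimal sketch; the file names are illustrative, not our actual stylesheets):

from lxml import etree

# Compile the provider-specific stylesheet once, then apply it to each
# delivery to produce the canonical format the ingester understands.
transform = etree.XSLT(etree.parse('stylesheets/provider_a.xsl'))
delivery = etree.parse('deliveries/provider_a/album.xml')
canonical = transform(delivery)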


Over time, though, provider-specific logic found its way into the application to handle cases that couldn’t be handled by XSLT. Also, one provider delivered its content in a format that couldn’t easily be handled by XSLT at all. This meant that both versions of the application were used to ingest new content: changes targeted at all providers needed to be made to two different applications, and both applications needed to be tested against every change.

Ingestion v3


We made another pass at creating a provider-agnostic version of the ingestion application. This time, however, the goal was to include other types of content; the first two iterations had focused solely on music. Data models and their business logic were pushed out of the application and into configuration files. This would allow new instances of the application to be spun up with configuration files that described different content types.
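To give a flavor of the approach (this is purely illustrative, not our actual configuration format), a content type might have been described along these lines:

# Hypothetical configuration describing a content type as data
# rather than as a model class.
CONTENT_TYPE = {
    'name': 'album',
    'fields': {
        'title': {'type': 'string', 'required': True},
        'artist': {'type': 'string', 'required': True},
        'tracks': {'type': 'list', 'item_type': 'track'},
    },
}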


To help with the additional workload, the application was designed to distribute its work. As this version was written in Python (the previous two were both written in Java), Celery, backed by RabbitMQ, was used to distribute tasks across multiple workers.
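The setup looked roughly like this (a sketch; the broker URL and task body are illustrative):

from celery import Celery

app = Celery('ingestion', broker='amqp://guest@localhost//')

@app.task
def ingest(delivery_path):
    # Parse the delivery at delivery_path and update the database.
    ...

Each call to ingest.delay(path) places a task on RabbitMQ for any available worker to pick up.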


Unfortunately, the simplicity of the application came at the cost of code that was very difficult to debug. It was also difficult to efficiently debug a data model when the model was defined in configuration rather than in code. Problems were hard to diagnose, and it quickly became clear that adding more content types would only make this worse.


In addition to the debugging problems, there was one issue this version failed to address, something that had been plaguing us from the first version on: when a delivery failed, it needed to be put through the entire process again.

Ingestion Pipeline


We set out to build the fourth version of the ingestion application. This time, however, we decided to split it up into smaller, single-purpose applications. Each application would be connected to the next through a message queue.


With our new approach, applications can be run in succession or in parallel. Applications can be added or removed at any time without affecting the entire system. The flow of products through the ingestion system takes on a very different shape.

Distributed ingestion flow

By logging each outgoing message we gain visibility into the state of the system and can better monitor its health and performance. We send all of our logs (both the messages between systems and all application-level logs) to Logstash. This enables us to easily get messages into Elasticsearch, either in their original format or with keys remapped. This, coupled with Kibana for dashboards, allows us to gain insight into how the system is performing and gives stakeholders insight into the products being ingested.
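The idea can be sketched in a few lines (the logger name and fields are illustrative): each outgoing message is written as one JSON document per line, which a Logstash input with a JSON codec can forward to Elasticsearch without extra parsing.

import json
import logging

logger = logging.getLogger('ingestion.messages')

def log_outgoing(queue_name, message):
    # One JSON document per line; Logstash can ingest this as-is.
    logger.info(json.dumps({'queue': queue_name, 'message': message}))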


We can also recover from errors much more easily. For errors from which we know we can recover, an application can place its incoming message back into its incoming queue so that it can be processed again later. (This could also be done by not acknowledging the message and allowing it to return to the queue, but we want to know about the error and be able to fail if the message has been retried too many times.)
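In practice this pattern can be as simple as tracking a retry count on the message itself. Here is a minimal sketch (the message fields, limit, and producer interface are illustrative, not Henson’s actual API):

MAX_RETRIES = 5

async def maybe_retry(producer, message, error):
    # Give up loudly once the message has been retried too many times.
    retries = message.get('retries', 0)
    if retries >= MAX_RETRIES:
        raise RuntimeError('giving up after {} retries'.format(retries)) from error
    # Otherwise, put it back on the incoming queue to be processed later.
    message['retries'] = retries + 1
    await producer.send(message)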

Brought to you by the letter H

Henson

To accomplish this, we built a framework known as Henson. Henson allows us to hook up a consumer (usually an instance of our Henson-AMQP plugin) to a callback function. Any message received from the consumer is passed to the callback function. To help keep each application’s callback simple, Henson also supports processing the message before giving it to the callback through a series of callbacks (e.g., message schema validation, setting timestamps) and processing the results received from the callback (e.g., more timestamps, sending the message on through a producer).


Henson allows us to reduce each application to just the amount of code required to implement its core functionality. The rest can be handled through code contained in shared libraries, registered as the appropriate processor.


The boilerplate for one of our services can be as simple as:

from henson import Application
from henson_amqp import AMQP

from ingestion import send_message, validate_schema

from .callback import callback

# Create the application, then attach an AMQP consumer to feed it messages.
app = Application('iheartradio', callback=callback)
app.consumer = AMQP(app)

# Validate each incoming message before the callback runs, and send the
# results on to the next queue afterward.
app.message_preprocessor(validate_schema)
app.result_postprocessor(send_message)


All we need to do when creating a new service is to implement callback.

async def callback(application, message):
    """Return a list of results after printing the message.

    Args:
        application (henson.base.Application): The application
            instance that received the message.
        message (dict): The incoming message.

    Returns:
        List[dict]: The results of processing the message.
    """
    print('Message received:', message)
    return [message]

Once this is done, we can then run the application with

$ henson run service_name

We decided to use asyncio’s coroutines and event loop to allow multiple messages to be processed while waiting for traditionally blocking actions to complete. This is especially important in our content enrichment jobs, many of which poll third-party APIs.
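An enrichment callback might look something like this (a sketch; the endpoint and message fields are illustrative, not one of our actual jobs):

import aiohttp

async def callback(application, message):
    async with aiohttp.ClientSession() as session:
        # While this request is in flight, the event loop is free to
        # work on other messages.
        url = 'https://api.example.com/artists/{}'.format(message['artist_id'])
        async with session.get(url) as response:
            message['artist'] = await response.json()
    return [message]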


In addition to Henson itself, we’re also developing several plugins covering message queues, databases, and logging.

Unknown author

Nick Elprin, founder of Domino Data Lab, talks about how to deploy predictive models into production, specifically in the context of a corporate enterprise use case. Nick demonstrates an easy way to “operationalize” your predictive models by exposing them as low-latency web services that can be consumed by production applications. In the context of a real-world use case this translates into more subtle requirements for hosting predictive models, including zero-downtime upgrades and retraining/redeploying against new data. Nick also focuses on the best practices for writing code that will make your predictive models easier to deploy.

aothman

Scores are a way for domain experts to communicate the quality of a complex, multi-faceted object to a broader audience. Scores are ubiquitous; everything from NFL quarterbacks to the security threat risk of software has a score. Scoring also has commercial potential: beyond the obvious applications (e.g., credit scoring), in the past twelve months both Klout (social media reputation scoring) and Walkscore (neighborhood walkability assessment) have been acquired.

HiScore is a Python package that provides a new way for domain experts to quickly create and improve scoring functions: by using reference sets, sets of representative objects that are assigned scores. HiScore is currently used by a major environmental non-profit as well as IES, a startup that assesses the safety and sustainability of fracking wells.

HiScore relies on being able to interpolate through the reference set in an understandable and justifiable way. In technical terms, HiScore needs a good solution to the multivariate monotone scattered data interpolation problem. Monotone scattered data interpolation turns out to be trivial in one dimension and devilishly hard in many others. We discuss several failed approaches and false starts before finally arriving at the quasi-Kriging algorithmic foundation of HiScore. We conclude with applications, including the intuitive creation of complex scores with dozens of attributes.

The theoretical basis of HiScore is joint work with Ken Judd (Stanford).


GitHub repo here.

This video was recorded at the SF Bay Area Machine Learning meetup at Thumbtack in SF.

Eric Schles

In this video Eric Schles, developer evangelist at Syncano, gives a compare-and-contrast talk on the Django and Flask frameworks. Eric walks through some of Flask’s strengths (low overhead, easy to set up, good for prototyping) and some of its weaknesses (poor built-in debugging tools, not built to scale well, not strongly object-oriented). On the Django side, Eric points to its strengths: object orientation, the ability to deliver production-level code under a deadline, and the benefits of the admin panel. However, Eric cautions viewers about the high overhead of setup and warns beginners about its challenges.

Victor Levy

Victor Levy (Senior Consultant, Princeton Consultants) talks about COSMA (CSX Onboard Systems Management Agent). COSMA is a Python application hosted on locomotives, designed to monitor, check out, and upgrade safety-related locomotive systems. In this illuminating talk, Victor discusses the challenges of Python programming for train systems and how COSMA works.

Robert Clarke

In this lightning talk, Robert Clarke (CEO, Kind Robotics) explains how he uses Python scripts to power drones. The scripts can maneuver drones, track motion sensors, and execute complex missions.

Silas Ray

Silas Ray and his team at The New York Times have been developing test automation tools. Major problems they've had to solve include poor result visibility and unresponsive results. The team built the open-source tools Sneeze and Pocket Change to solve these problems. In this talk, Silas explains how The New York Times has integrated Sneeze and Pocket Change into its testing suite and demos code to help you get started.

Daniel Krasner

In this talk, Daniel Krasner covers rapid development of high performance scalable text processing solutions for tasks such as classification, semantic analysis, topic modeling and general machine learning. He demonstrates how Python modules, in particular the Rosetta Python library, can be used to process, clean, tokenize, extract features, and build statistical models with large volumes of text data. The Rosetta library focuses on creating small and simple modules (each with command line interfaces) that use very little memory and are parallelized with the multiprocessing package. Daniel also touches on LDA topic modeling and different implementations thereof (Vowpal Wabbit and Gensim). The talk is part presentation, and part “real life” example tutorial. This talk was recorded at the NYC Machine Learning meetup at Pivotal Labs.

Matt Story

no-slurping


A few weeks ago, a well-intentioned Python programmer asked a straightforward question in a LinkedIn group for professional Python programmers:
What’s the best way to read file in Python?

Invariably a few programmers jumped in and told our well-intentioned programmer to just read the whole thing into memory:
f = open('/path/to/file', 'r+')
contents = f.read()

Just to mix things up, someone followed up to demonstrate the exact same technique using ‘with’ (a great improvement, as it ensures the file is properly closed in all cases):
with open('/path/to/file', 'r+') as f:
    contents = f.read()
    # do more stuff


Either implementation boils down to the use of a technique we call “slurping,” and it’s by far the most common way you’ll encounter files being read in the wild. It also happens to be nearly always the wrong way to read a file, for two reasons:

  1. It’s quite memory inefficient.

  2. It’s slower than processing data as it is read, because it defers any processing until after all of the data has been read into memory.



A Better Way: Filter


A UNIX filter is a program that reads from stdin and writes to stdout. Filters are usually written in such a way that you can either read from stdin or read from one or more files passed on the command line. There are many examples of filters: grep, sed, awk, cut, cat, wc, and sh, just to name a few of the most commonly used ones.
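In Python, the same convention takes only a few lines using the standard library’s fileinput module; a minimal sketch:

import fileinput

# Reads the files named on the command line, or stdin when none are
# given (the same convention grep and cat follow). Only one line is
# held in memory at a time.
for line in fileinput.input():
    print(line, end='')  # replace with real per-line processing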

Laurent Gautier

We were lucky to attend the Bay Area R users group last week, where we recorded Laurent Gautier's talk on the RPy2 bridge, which allows one to use Python as the glue language for developing applications while using R as the statistics and data analysis engine. He also demonstrated how a web application could be developed around an existing R script.
