Ramesh Johari

A/B testing is a hallmark of Internet services: from e-commerce sites to social networks to marketplaces, nearly all online services use randomized experiments as a mechanism to make better business decisions. Such tests are generally analyzed using classical frequentist statistical measures: p-values and confidence intervals.

Despite their ubiquity, these reported values are computed under the assumption that the experimenter will not continuously monitor the test; in other words, there should be no repeated "peeking" at the results that affects the decision of whether to continue. On the other hand, one of the greatest benefits of advances in information technology, computational power, and visualization is precisely that experimenters can watch experiments in progress, with greater granularity and insight over time than ever before.
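The peeking problem is easy to demonstrate with a quick simulation. The sketch below (a toy illustration, not Optimizely's method) runs repeated A/A tests, where there is no true effect, and checks a standard z-statistic at several interim looks; declaring success at the first "significant" peek inflates the false positive rate well above the nominal 5%.

```python
import random

def peeking_false_positive_rate(n_tests=1000, n_obs=500, peeks=10,
                                z_crit=1.96, seed=7):
    """Simulate A/A tests (no true effect) and report how often a
    'significant' z-statistic appears at ANY of several interim peeks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        total, significant = 0.0, False
        for i in range(1, n_obs + 1):
            total += rng.gauss(0, 1)        # unit-variance noise, zero mean
            if i % (n_obs // peeks) == 0:   # an interim "peek" at the data
                z = total / i ** 0.5        # z-statistic at this sample size
                if abs(z) > z_crit:
                    significant = True      # would have stopped here
        if significant:
            hits += 1
    return hits / n_tests
```

Running with `peeks=1` recovers a rate near the nominal 5%, while ten looks push it several times higher, which is exactly why classical p-values break down under continuous monitoring.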

What You Will Learn:
Based on some of Ramesh's work at Optimizely, you'll learn how their optimization platform addresses continuous monitoring of experiments.

Prerequisites:
Basic statistics would be helpful.

Where To Learn More:
- Nontechnical blog post @ Optimizely.com
- Technical post (PDF) from Optimizely
- Full paper on arxiv via Arxiv.org

These slides are from a talk given at the SF Data Engineering meetup.

Silviu Calinoiu

This talk shows how to build an ETL pipeline using Google Cloud Dataflow/Apache Beam that ingests textual data into a BigQuery table. Google engineer Silviu Calinoiu gives a live coding demo and discusses concepts as he codes. You don't need any previous background with big data frameworks, although people familiar with Spark or Flink will recognize some similar concepts. Because of the way the framework operates, the same code can easily scale from GB-sized files to TB-sized files.
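To give a flavor of what such a pipeline looks like, here is a minimal sketch of a parsing step it might use; the schema (word/count columns), the bucket path, and the pipeline wiring are hypothetical, not taken from the talk.

```python
# Hypothetical BigQuery schema: word (STRING), count (INTEGER).
def line_to_rows(line):
    """Turn one line of raw text into BigQuery-style row dicts.
    In Beam, this would run inside beam.FlatMap(line_to_rows)."""
    counts = {}
    for word in line.lower().split():
        token = word.strip(".,!?\"'():;")
        if token:
            counts[token] = counts.get(token, 0) + 1
    return [{"word": w, "count": c} for w, c in sorted(counts.items())]

# Sketch of the surrounding Beam pipeline (not executed here):
#   p | beam.io.ReadFromText("gs://my-bucket/input*.txt")
#     | beam.FlatMap(line_to_rows)
#     | beam.io.WriteToBigQuery("project:dataset.word_counts", ...)
```

Because the transform is a pure function over lines, the same code runs unchanged whether the runner shards a gigabyte or a terabyte of input.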

This talk was given as a joint event from SF Data Engineering and SF Data Science.

Pete Soderling

We just posted the final schedule for DataEngConf San Francisco, April 7-8, 2016, and added even more great talks & workshops.

We're lucky to have two talks on Google's TensorFlow platform, including one by a TensorFlow committer! You can also participate in two free workshops we'll be running throughout the event - one on SparkSQL & one on Scikit-Learn.

Talks from these top companies:



DataEngConf is the first engineering conference that tackles real-world issues with data processing architectures and covers essential concepts of data science from an engineer's perspective.

Hear real-world war stories from data engineering & data science heroes at companies like Google, Airbnb, Slack, Stripe, Netflix, Clover Health, Segment, Lyft and many more.

Full info & tickets: http://www.dataengconf.com

Peter Bakas

Peter Bakas from Netflix discusses Keystone, their new data pipeline. Hear in detail how they deploy, operate, and scale Kafka, Samza, Docker, and Apache Mesos in AWS to process 8 million events and 17 GB of data per second at peak!
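As a back-of-envelope check (our arithmetic, not a figure from the talk), those two peak numbers together imply an average event size on the order of 2 KB:

```python
events_per_s = 8_000_000
bytes_per_s = 17 * 1024**3                    # 17 GB/s, read as gibibytes
avg_event_size = bytes_per_s / events_per_s   # roughly 2.3 KB per event
```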

This talk was given as a joint event from SF Data Engineering and SF Data Science. The speaker, Peter Bakas, is Director of Engineering, Real-Time Data Infrastructure at Netflix.

Asim Jalis

Big Data applications need to ingest streaming data and analyze it. HBase is great at ingesting streaming data but not so good at analytics. On the other hand, HDFS is great at analytics but not at ingesting streaming data. Frequently applications ingest data into HBase and then move it to HDFS for analytics.

What if you could use a single system for both use cases? This could dramatically simplify your data pipeline architecture.

Enter Apache Kudu. Kudu is a storage system that sits between HDFS and HBase: it is good at ingesting streaming data and also good at analyzing it using Spark, MapReduce, and SQL.
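The appeal of a single system can be seen in a toy model (purely illustrative; this is not how Kudu is implemented): one table that accepts row-at-a-time writes for ingest while also serving whole-column reads for analytics.

```python
class HybridTable:
    """Toy store: accepts row-at-a-time inserts (streaming ingest)
    and answers column scans (analytics) from the same data."""
    def __init__(self, columns):
        self._cols = {c: [] for c in columns}

    def insert(self, row):      # HBase-style point write
        for c, values in self._cols.items():
            values.append(row.get(c))

    def scan(self, column):     # HDFS/columnar-style full scan
        return list(self._cols[column])

t = HybridTable(["ts", "value"])
t.insert({"ts": 1, "value": 10})
t.insert({"ts": 2, "value": 32})
total = sum(t.scan("value"))
```

With both access patterns on one table, there is no second pipeline stage to copy freshly ingested data into an analytics store.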

This talk was given as a joint event from SF Data Engineering and SF Data Science. The speaker, Asim Jalis, is Lead Instructor of the Data Engineering Immersive at Galvanize SF.

Pete Soderling

After the success of our last event in NYC, we decided to bring DataEngConf to San Francisco, April 7-8, 2016!


DataEngConf is the first engineering conference that tackles real-world issues with data processing architectures and covers essential concepts of data science from an engineer's perspective.

Hear real-world war stories from data engineering & data science heroes at companies like Google, Airbnb, Slack, Stripe, Netflix, Clover Health, Yammer, Lyft and many more.

Use code "site20" for 20% off regularly priced tickets through 3/24.

Full info & tickets: http://www.dataengconf.com

Andy Dirnberger

iHeartRadio ingests hundreds of thousands of products each month. Historically, as a new product delivery was received, a user would manually initiate the ingestion process by entering its file path into a form on a web page, triggering the ingestion application to parse the delivery and update the database. Downstream systems would constantly poll the database, run at regularly scheduled intervals, or be triggered manually. This process, roughly visualized below, was reasonable, with new content arriving in the catalog within a few days of its receipt.

Linear ingestion flow


Adam Denenberg

Here at iHeartRadio we have made a significant investment in Scala and Akka for our microservice backend. We have also recently moved much of our infrastructure to AWS, which gives us far more freedom and flexibility in how we manage our infrastructure and deployments.


One of the really exciting technologies coming out of AWS is Lambda. Lambda allows you to listen for various "events" in AWS, such as file creation in S3, stream events from Kinesis, or messages from SQS, and then invoke your custom code to react to those events. Additionally, the applications you deploy to react to these events require no infrastructure and are completely auto-scaled by Amazon (on what appears to be a cluster of containers). Currently Lambda supports writing these applications in Python, Node, and Java.
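A typical S3-triggered function is only a few lines. The sketch below is a hypothetical Python handler (the event shape follows S3's documented notification format; the actual processing step is left as a comment, since it depends on the application):

```python
import urllib.parse

def handler(event, context):
    """Hypothetical AWS Lambda entry point for S3 'ObjectCreated' events.
    Returns the bucket/key pairs it would hand off for processing."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in the notification payload
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # react to the event here, e.g. kick off ingestion of the new file
        processed.append((bucket, key))
    return {"processed": processed}
```

Because Amazon invokes the handler directly, there is no server to provision: you upload the function, wire it to the S3 bucket notification, and scaling is handled for you.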


Chris Wiggins


Interested in learning more about data engineering and data science? Don't miss our two-day DataEngConf with top engineers in San Francisco, April 2016.

For our inaugural DataEngConf 2015 we were excited to have Chris Wiggins talk about the importance of data science in a modern organization.

Chris covered the importance of bridging data science and data engineering in a company, and spoke about the interactions of the data team at The New York Times.

So what is data science? Chris explained data science as the intersection of machine learning with the decades-old academic fields of statistics and computer science, applied to some particular domain of expertise.

Another way of explaining data science: knowledge of machine learning that enables one to find the right tool for the job, the ability to listen to people and figure out how to reframe their problems as machine learning tasks, and the ability to translate what you've learned into something actionable.


Unknown author


The data science team at iHeartRadio has been developing collaborative filtering models to drive a wide variety of features, including recommendations and radio personalization. Collaborative filtering is a popular approach in recommendation systems that makes predictive suggestions to users based on the behavior of other users in a service. For example, it’s often used to recommend new artists to users based on what they and similar users are listening to. In this blog post we discuss our initial experiences applying these models to increase user engagement through targeted marketing campaigns.


One of the most successful approaches to collaborative filtering in recent years has been matrix factorization, which decomposes all the activity on a service into compact representations of its users and what they interact with, e.g. artists. At iHeartRadio we use matrix factorization models to represent our users and artists with ~100 numbers each (called their latent vectors) instead of processing the trillions of possible interactions between them. This simplifies our user activity by many orders of magnitude, and allows us to quickly match artists to users by applying a simple scoring function to the user’s and artist’s latent vectors.
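The scoring function mentioned above is typically just a dot product between latent vectors. Here is a minimal sketch with made-up 3-dimensional vectors and artist names (real models use ~100 dimensions):

```python
def score(user_vec, item_vec):
    """Relevance score = dot product of the two latent vectors."""
    return sum(u * v for u, v in zip(user_vec, item_vec))

# Toy latent vectors; in production these come from matrix factorization.
user = [0.9, 0.1, 0.3]
artists = {"artist_a": [0.8, 0.0, 0.2],
           "artist_b": [0.1, 0.9, 0.0]}

# Rank artists for this user by score, highest first.
ranked = sorted(artists, key=lambda a: score(user, artists[a]), reverse=True)
```

Because scoring is a single dot product per candidate, matching an artist catalog against a user is fast enough to run online, which is what makes the compact latent representation so useful.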

