Amir Najmi

Scalable web technology has greatly reduced the marginal cost of serving users, so an individual business today may support a very large user base. With so much data, one might imagine that it is easy to obtain statistical significance in live experiments. However, this is not always the case. Often, the very business models enabled by the web require answers for which our data is information-poor.

In this talk, Amir Najmi from Google will use a simple mathematical framework to discuss how experiment sizing interacts with the business model of some large-scale online services.

Amir Najmi is Principal Quantitative Analyst at Google. He received a PhD in Electrical Engineering from Stanford University under Robert Gray and Richard Olshen. Amir works on statistical modeling and prediction methodology for large-scale high-dimensional data. He is interested in a critical understanding of mathematical models, and the role of human insight in machine learning.

This talk was given at the SF Data Engineering meetup in May 2016.

Greg Dingle

Tech businesses know how they're doing by the numbers on a screen, and the weakest link in the chain of analysis is usually the part in front of the keyboard. People are not designed to think about abstract quantities, and decision scientists have spent decades describing exactly how people go wrong. You can overcome your biases only by being aware of them. Greg Dingle will walk you through some common biases, examples, and corrective measures.

Greg Dingle's bio: “My first love was science. I was happily ensconced in a PhD program in evolutionary psychology when Y Combinator came calling. I moved to SF, lived the startup life for two years, and then Facebook bought my two-person company. I rode that rocketship for 7 years and wrote lots of code, ending up specializing in building tools for data analysis: query tools, visualization tools, and workflow tools. This past March, I quit Facebook and joined a young startup, ParseHub, as a co-founder. We make web scraping easy.”

This talk was given at the SF Data Science Meetup at Galvanize in May, 2016.

Jeroen Janssens

Data scientists love to create exciting data visualizations and insightful models. However, before they get to that point, usually much effort goes into obtaining, scrubbing, and exploring the required data.

In this talk, Jeroen Janssens, from YPlan, talks about the *nix command line. Although it was invented decades ago, it remains a powerful environment for many data science tasks. It provides a read-eval-print loop (REPL) that is often much more convenient for exploratory data analysis than the edit-compile-run-debug cycle associated with scripts or even programs. Even if you're already comfortable processing data with, for example, R or Python, being able to also leverage the power of the command line can make any data scientist more efficient.
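To give a flavor of how scripting languages and the command line complement each other, here is a hypothetical Python filter that slots into a Unix pipeline (the script name, log file, and behavior are illustrative assumptions, not from the talk):

```python
# Hypothetical sketch: a small Python filter usable inside a Unix pipeline,
# e.g.  cat access.log | python top_counts.py
# Reads lines from stdin, counts occurrences, prints the 3 most common.
import sys
from collections import Counter

def top_counts(lines, n=3):
    """Count stripped, non-empty lines and return the n most common."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common(n)

if __name__ == "__main__":
    for value, count in top_counts(sys.stdin):
        print(f"{count}\t{value}")
```

Because it reads stdin and writes stdout, the script composes with standard tools like `grep`, `sort`, and `head` in the same REPL-style exploration the talk describes.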

This talk was recorded at the NY Open Statistical Programming meetup at Knewton.



Pete Soderling

In response to a recent post from MongoHQ entitled “You don’t have big data,” I would generally agree with many of the author’s points.

However, regardless of whether you call it big data, small data, hot data or cold data, we can all agree that *more* data is here to stay - and that’s due to many different factors.

Perhaps primarily, as the article mentions, this is due to the decreasing cost of storage over time. Other factors include access to open APIs, the sheer volume of ever-increasing consumer activity online, as well as a plethora of other incentives that are developing (mostly) behind the scenes as companies “share” data with each other. (You know they do this, right?)

But one of the most important things I’ve learned over the past couple of years is that it’s crucial for forward-thinking companies to design more robust data pipelines to collect, aggregate and process their ever-increasing volumes of data. The main reason is to tee up the data in a consistent way for the seemingly magical, quant-like operations that infer relationships in the data that would otherwise surely have gone unnoticed - ingeniously described in the referenced article as correctly “determining the nature of needles from a needle-stack.”

But this raises the question - what are the characteristics of a well-designed data pipeline? Can’t you just throw all your data in Hadoop and call it a day?

As many engineers are discovering - the answer is a resounding "no!" We've rounded up four examples from smart engineers at Stripe, Tapad, Etsy & Square that show aspects of some real-world data pipelines you'll actually see in the wild.


Rosaria Silipo

Open source tools usually delegate their support to community forums. How reliable is that strategy? In this talk, Rosaria Silipo answers that question, along with another: who says that open source software does not have support? She measures the efficiency, from 2007 to 2012, of the community forum of KNIME, an open source data analytics platform. Commonly used techniques in social media analysis, such as web crawling, web analytics, text mining, and network analytics, are used to investigate the forum's characteristics, and each is described in detail during the presentation. This talk was recorded at the SF Data Mining meetup at inPowered.



Russell Jurney

In this talk, Russell Jurney (author of Agile Data) shows how to rapidly prototype analytics applications on the Hadoop stack, returning to agility in light of the ever-deepening analytics stack. The presentation uses Hadoop, Pig, NoSQL stores, and lightweight web frameworks to rapidly connect end users to real insights. This talk was recorded at the SF Data Mining meetup at Trulia.



Alexis Lê-Quôc

Imagine you are tasked with building a platform to monitor the performance of 500,000 servers in real time. How would you design it? What tools would you choose? (Cassandra? Storm? Spark? HBase? ...) What technical challenges would you expect? As a monitoring company, Datadog receives tens of billions of telemetry data points every day and is working to change the way operations teams understand and troubleshoot their infrastructure and applications. In this talk, Alexis Lê-Quôc from Datadog talks about how they built their (Python-based) low-latency, real-time analytics pipeline. This talk was recorded at the NYC Data Engineering meetup at The Huffington Post.
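As a rough sketch of one ingredient such a pipeline needs, here is a minimal fixed-width time-bucket aggregator in Python. This is an illustration only, not Datadog's implementation; all names and parameters are invented:

```python
# Hypothetical building block of a real-time metrics pipeline: aggregate
# (timestamp, value) telemetry points into fixed-width time buckets so that
# queries over a window read a handful of buckets instead of raw points.
from collections import defaultdict

class BucketAggregator:
    """Aggregates (timestamp, value) points into fixed-width time buckets."""

    def __init__(self, bucket_seconds=10):
        self.bucket_seconds = bucket_seconds
        self.sums = defaultdict(float)   # bucket index -> sum of values
        self.counts = defaultdict(int)   # bucket index -> number of points

    def add(self, timestamp, value):
        """Record one data point in its time bucket."""
        bucket = int(timestamp) // self.bucket_seconds
        self.sums[bucket] += value
        self.counts[bucket] += 1

    def average(self, timestamp):
        """Average of all points in the bucket containing `timestamp`."""
        bucket = int(timestamp) // self.bucket_seconds
        if self.counts[bucket] == 0:
            return None
        return self.sums[bucket] / self.counts[bucket]
```

A production system would add eviction of old buckets, other statistics (min, max, percentiles), and a distributed store behind the in-memory maps, but the bucketing idea is the same.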



John Myles White

In this talk, "Streaming Data Analysis and Online Learning," John Myles White of Facebook surveys some basic methods for analyzing data in a streaming manner. He focuses on using stochastic gradient descent (SGD) to fit models to data sets that arrive in small chunks, discussing some basic implementation issues and demonstrating the effectiveness of SGD for problems like linear and logistic regression as well as matrix factorization. He also describes how these methods allow ML systems to adapt to user data in real time. This talk was recorded at the New York Open Statistical Programming meetup at Knewton.
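To make the chunked-SGD idea concrete, here is a minimal sketch of logistic regression fit by SGD over small chunks of data. The learning rate, chunk size, and synthetic data are my own assumptions for illustration, not details from the talk:

```python
# Illustrative sketch: fit logistic regression by SGD on data that arrives
# in small chunks, updating the weights in place as each chunk appears.
import math
import random

def sgd_logistic_step(w, chunk, lr=0.1):
    """One SGD pass over a chunk of (features, label) pairs; label in {0, 1}."""
    for x, y in chunk:
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))      # predicted probability of y = 1
        for i, xi in enumerate(x):
            w[i] += lr * (y - p) * xi       # gradient step on the log-likelihood
    return w

# Simulated stream: y = 1 when x0 + x1 > 1, delivered in chunks of 20 points.
random.seed(0)
w = [0.0, 0.0, 0.0]                         # two feature weights + bias term
for _ in range(200):
    chunk = []
    for _ in range(20):
        x0, x1 = random.random(), random.random()
        chunk.append(([x0, x1, 1.0], 1 if x0 + x1 > 1 else 0))
    w = sgd_logistic_step(w, chunk)
```

The key property is that each chunk is seen once and then discarded, so memory use stays constant no matter how long the stream runs.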



Adam Ilardi

(Original post with video of talk here)

Adam Ilardi: Hi, I’m Adam Ilardi.  I work here at eBay.  I’m an applied researcher.  Why did I choose eBay?  It’s a pretty cool company.  They sell the craziest stuff you’ll ever believe.  There are denim jean jackets with Nic Cage on the back, and this kind of stuff is all over the place.  So it’s definitely cool.

The New York office is brand new.  It’s less than a year old.  What does the New York office do?  Well, we own the homepage of eBay, so the brand-new feed is developed right over there.  You might even see one of the guys.  He’s hiding.  Okay.  And also, all the merchandising for eBay is going to be run out of the New York office.  So that’s billions of dollars’ worth of eBay business run right out of here.  It’s a major investment eBay has made in New York, which is really cool.

So why you’re here is to find out how we use Scala and Hadoop, and given all the data we have, the two pair very nicely together, as you will see.  All right, so let’s get started.  Okay, these are some things we’ll cover: polymorphic function values, higher-kinded types, the co-Kleisli star operator, and some use of macros.


Dag Liodden

This talk by Dag Liodden, Co-Founder and CTO of Tapad, was recorded at the Aerospike facilities. He joined the Wikibon community to discuss the tools and technology that Tapad uses to make real-time decisions for ad placement.


