Chris Wiggins


Interested in learning more about data engineering and data science? Don't miss our two-day DataEngConf with top engineers in San Francisco, April 2016.

For our inaugural DataEngConf 2015 we were excited to have Chris Wiggins talk about the importance of data science in a modern organization.

Chris covered the importance of bridging data science and data engineering in a company, and spoke about how the data team at The New York Times works with the rest of the organization.

So what is data science? Chris described data science as the intersection of machine learning with the decades-old academic fields of statistics and computer science, applied to a particular domain of expertise.

Another way of explaining data science is as three skills: enough knowledge of machine learning to find the right tool for the job, the ability to listen to people and reframe their problems as machine learning tasks, and the ability to translate what you've learned into something actionable.


Wes McKinney


In this talk we're excited to have Wes McKinney give a demo and discuss the roadmap of Ibis, a new data analytics framework.

While Python is a de facto language for modern data engineering and data science, Python development has been confined to local data processing, which limits its users to smaller data sets. Historically, to address bigger data workloads, Python developers have had to extract samples or aggregates, forcing compromises in data fidelity, adding ETL costs, and ultimately leading to a loss of productivity and addressable use cases.

Ibis, a new open source data analytics framework for Python developers, has the goal of enabling the Python data ecosystem (NumPy, pandas, etc.) to operate efficiently at Hadoop scale. To enable high performance Python at scale without the age-old JVM interoperability problems, Ibis takes advantage of unique synergies between Python and Impala, the leading open source MPP analytical query engine. In this talk, Ibis creator Wes McKinney, who was also the creator of pandas, will demo the current capabilities of Ibis as well as explain its roadmap.


Noel Cody


Spotify currently runs over 100 production-level Cassandra clusters. We use Cassandra across user-facing features, in our internal monitoring and analytics stack, paired with Storm for real-time processing, you name it. With scale come questions. “If I change my consistency level from ONE to QUORUM, how much performance am I sacrificing? What about a change to my data model or I/O pattern? I want to write data in really wide columns, is that ok?”
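
One part of the ONE versus QUORUM question can be worked out before any load test: how many replicas have to acknowledge each request. The following is a minimal sketch, assuming the standard Cassandra rule that a quorum is a majority of replicas (floor(RF / 2) + 1). The function names are illustrative and are not part of any Cassandra driver, and the actual latency cost still has to be measured.

```python
def replicas_required(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must respond before a request succeeds."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    level = consistency_level.upper()
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency_level}")


def reads_see_latest_write(read_level: str, write_level: str,
                           replication_factor: int) -> bool:
    """True when every read replica set overlaps every write replica set."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, moving from ONE to QUORUM means waiting on two replicas instead of one; how much that costs in latency is exactly what a load test measures.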

Rules of thumb lead to basic answers, but we can do better. These questions are testable, and the best answers come from pre-launch load-testing and capacity planning. Any system with a strict SLA can and should simulate production traffic in staging before launch.



Anomaly detection in healthcare data is an enabling technology for the detection of overpayment and fraud. In this talk, we demonstrate how to use PageRank with Hadoop and SociaLite (a distributed query language for large-scale graph analysis) to identify anomalies in healthcare payment information. We demonstrate a variant of PageRank applied to graph data generated from the Medicare-B dataset for anomaly detection, and show real anomalies discovered in the dataset.
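
For readers unfamiliar with PageRank, the basic computation can be sketched in a few lines. This is plain, single-machine PageRank by power iteration, not the variant used in the talk and not SociaLite or Hadoop code. The graph format (a dict of adjacency lists) and the default parameter values are assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=200):
    """Plain PageRank; graph maps each node to a list of nodes it links to."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for source in nodes:
            targets = graph.get(source, [])
            if targets:
                # Each node splits its damped rank equally among its targets.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy graph: three nodes all point at "hub".
toy_graph = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
toy_ranks = pagerank(toy_graph)
```

A node whose score is far out of line with otherwise similar nodes is a candidate for closer inspection; how that comparison is made for payment data is specific to the talk.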


Ofer Mendelevitch is Director of Data Science at Hortonworks, where he is responsible for professional services involving data science with Hadoop. Prior to joining Hortonworks, Ofer served as Entrepreneur in Residence at XSeed Capital, where he developed an investment strategy around big data. Before XSeed, Ofer served as VP of Engineering at Nor1, and before that he was Director of Engineering at Yahoo!, where he led multiple engineering and data science teams responsible for R&D of large-scale computational advertising projects including CTR prediction (with Hadoop), a new front-end ad-serving system, and sales tools.


In this talk, Adam Gibson from skymind.io presents the ND4J framework with an iScala notebook. Combined with Spark's DataFrames, this is making real data science viable in Scala. ND4J is "NumPy for Java." It works with multiple architectures (or backends) that allow for run-time-neutral scientific computing as well as chip-specific optimizations, all while writing the same code. Algorithm developers and scientific engineers can write code for a Spark, Hadoop, or Flink cluster while keeping the underlying computations platform-agnostic. A modern runtime for the JVM with the capability to work with GPUs lets engineers leverage the best parts of the production ecosystem without having to pick which scientific library to use.

This video was recorded at the SF Bay Area Machine Learning meetup.

Joe Doliner

As companies continue to become more data-driven, data pipelines have gotten much more complicated and we need new tools and workflows for managing them. In this talk, Joe Doliner, co-founder of Pachyderm, looks at some of the current data pipelining challenges and how he envisions them being solved in the future.


This talk was recorded at the SF Data Engineering Meetup at New Relic in San Francisco.


Xiangrui Meng, a committer on Apache Spark, talks about how to make machine learning easy and scalable with Spark MLlib. Xiangrui has been actively involved in the development of Spark MLlib and the new DataFrame API. MLlib is an Apache Spark component that focuses on large-scale machine learning (ML). With 50+ organizations and 110+ individuals contributing, MLlib is one of the most active open-source projects on ML. In this talk, Xiangrui shares his experience in developing MLlib. The talk covers both the higher-level APIs (ML pipelines) that make MLlib easy to use and the lower-level optimizations that make MLlib scale to massive datasets.

ML workflows often involve a sequence of processing and learning stages. Realistic workflows are often even more complex, including cross-validation to choose parameters and combining multiple data sources. Inspired by scikit-learn, we proposed simple APIs to help users quickly assemble and tune ML pipelines. Under the hood, the pipeline API integrates with Spark SQL’s DataFrames and uses their data sources, flexible column operations, rich data types, and execution plan optimization to create efficient and scalable implementations.
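
The idea of chaining stages can be illustrated without any library. The following is a toy sketch of the fit/transform pattern that scikit-learn-style pipelines follow. It is not MLlib's or scikit-learn's API: the class names are hypothetical, the stages operate on plain lists of numbers, and tuning steps such as cross-validation are left out.

```python
class MeanCenter:
    """Stage that subtracts the mean learned during fit."""

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean_ for v in values]


class MaxAbsScale:
    """Stage that divides by the largest absolute value seen during fit."""

    def fit(self, values):
        # Fall back to 1.0 so all-zero input does not divide by zero.
        self.scale_ = max(abs(v) for v in values) or 1.0
        return self

    def transform(self, values):
        return [v / self.scale_ for v in values]


class Pipeline:
    """Runs stages in order; each stage is fit on the previous stage's output."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        for stage in self.stages:
            values = stage.fit(values).transform(values)
        return self

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


pipeline = Pipeline([MeanCenter(), MaxAbsScale()]).fit([1.0, 2.0, 3.0])
scaled = pipeline.transform([1.0, 2.0, 3.0])
```

Because the whole chain is one object, it can be fit on training data and then applied unchanged to new data, which is what makes parameter tuning over the full workflow practical.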


Many factors affect a parallel implementation of an ML algorithm, e.g., the optimization algorithm, platform limitations, communication patterns, data locality, numerical stability and performance, and fault tolerance. Different implementations of the same ML algorithm can perform dramatically differently. Xiangrui shares lessons learned from optimizing the alternating least squares (ALS) implementation in MLlib.
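
For context, the core ALS update is small; the engineering difficulty the talk addresses lies in distributing it. The sketch below is a single-machine, rank-1 simplification written for illustration, not MLlib's implementation. The function name, the `(user, item, value)` input format, and the default regularization are assumptions.

```python
def als_rank1(ratings, n_users, n_items, reg=0.1, iterations=20):
    """Rank-1 ALS; ratings is a list of (user, item, value) observed entries."""
    if reg <= 0:
        raise ValueError("reg must be positive")
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Hold item factors fixed; each user factor has a closed-form
        # regularized least-squares solution.
        num = [0.0] * n_users
        den = [reg] * n_users
        for user, item, value in ratings:
            num[user] += value * v[item]
            den[user] += v[item] ** 2
        u = [a / b for a, b in zip(num, den)]
        # Hold user factors fixed; solve for each item factor the same way.
        num = [0.0] * n_items
        den = [reg] * n_items
        for user, item, value in ratings:
            num[item] += value * u[user]
            den[item] += u[user] ** 2
        v = [a / b for a, b in zip(num, den)]
    return u, v


# Hypothetical data: every entry of the rank-1 matrix [[3, 4], [6, 8]].
observed = [(0, 0, 3.0), (0, 1, 4.0), (1, 0, 6.0), (1, 1, 8.0)]
user_factors, item_factors = als_rank1(observed, 2, 2, reg=1e-9)
```

Each half-step only needs the ratings for one user or one item plus the opposite side's factors, which is why the communication pattern and data layout matter so much once the factors are spread across machines.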

This talk was recorded at the NYC ML Meetup at Pivotal Labs in NYC.

Paul Dix

Over the past 4 months, Paul Dix and his team completely rewrote InfluxDB: from Go to Go. In this talk, he gives a quick overview of InfluxDB and shows how it's useful for metrics, analytics, and sensor data.

Paul also dives into the history of the project and why they chose to rewrite their previous Go implementation into the implementation they have now. He shows pain points with their legacy codebase and gives examples of how rewriting the code from scratch gave them the ability to do things they couldn't have done otherwise.

Paul closes out with some comparisons on usability, readability, and performance of the previous version against the new rewritten version.


This video was recorded at the GoSF meetup at Chain in SF.
