Obtaining, Scrubbing, and Exploring Data at the Command Line

Data scientists love to create exciting data visualizations and insightful models. However, before they get to that point, much effort usually goes into obtaining, scrubbing, and exploring the required data.

In this talk, Jeroen Janssens, from YPlan, talks about the *nix command line. Although it was invented decades ago, it remains a powerful environment for many data science tasks. It provides a read-eval-print loop (REPL) that is often much more convenient for exploratory data analysis than the edit-compile-run-debug cycle associated with scripts or even programs. Even if you're already comfortable processing data with, for example, R or Python, being able to also leverage the power of the command line can make any data scientist more efficient.

This talk was recorded at the NY Open Statistical Programming meetup at Knewton.


In this presentation Jeroen looks at the following subjects:

1. Essential concepts of the *nix command line.

2. Setting up an efficient environment.

3. Filters such as cut, grep, sed, and awk.

4. Scraping websites using curl, scrape, xml2json, and jq.

5. Managing your data science workflow using drake.

6. Parallelizing and distributing data-intensive pipelines and turning one-liners and existing code into reusable command-line tools.
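Drake, mentioned in topic 5, is a "make for data" tool: you declare which outputs depend on which inputs, and drake reruns only the steps whose inputs changed. A hypothetical Drakefile sketch (the filenames are invented for illustration):

```
; Drakefile: each step lists output <- input, followed by indented shell commands.

sorted.csv <- raw.csv
    sort $INPUT > $OUTPUT

top10.csv <- sorted.csv
    head -n 10 $INPUT > $OUTPUT
```

Running `drake` in the directory containing this file builds `top10.csv`, skipping any step whose output is already newer than its input.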
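To give a flavor of the filters in topic 3, here is a minimal sketch (not taken from the talk) that applies cut, grep, sed, and awk to a small, made-up CSV file of name, city, and age records:

```shell
# Create a tiny sample data set (hypothetical records):
printf 'alice,nyc,34\nbob,sf,28\ncarol,nyc,41\n' > people.csv

# cut: select the city and age columns
cut -d, -f2,3 people.csv

# grep: keep only the NYC residents
grep ',nyc,' people.csv

# sed: rewrite "nyc" to "new_york"
sed 's/nyc/new_york/' people.csv

# awk: compute the average age of NYC residents
awk -F, '$2 == "nyc" { sum += $3; n++ } END { print sum / n }' people.csv
# prints: 37.5
```

Each filter reads from a file or standard input and writes to standard output, which is what makes them easy to chain into pipelines.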
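For topic 4, the typical pattern is to fetch data with curl and filter the resulting JSON with jq (scrape and xml2json fill the gap for HTML and XML sources). A minimal offline sketch, with made-up data and a placeholder URL:

```shell
# In practice you would pipe a live API response into jq, e.g.:
#   curl -s 'https://api.example.com/items' | jq '.'
# (the URL above is a placeholder, not one used in the talk)

# Here, a hardcoded JSON array stands in for the curl output.
# Select every item with age over 30 and print its name:
echo '[{"name":"alice","age":34},{"name":"bob","age":28}]' |
jq -r '.[] | select(.age > 30) | .name'
# prints: alice
```

The `-r` flag makes jq emit raw strings instead of quoted JSON, which is usually what you want when feeding the result to other command-line tools.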
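Topic 6 covers two ideas: fanning work out with a tool such as GNU parallel (for example, `parallel gzip ::: *.log` compresses every log file on all available cores), and promoting a one-liner to a reusable command-line tool. A hypothetical example of the latter, saved as an executable file named `top-words`:

```shell
#!/usr/bin/env bash
# top-words: a one-liner turned into a tool (illustrative, not from the talk).
# Prints the N most frequent words on standard input (default N=10).
tr '[:upper:]' '[:lower:]' |   # normalize case
tr -cs '[:alpha:]' '\n' |      # put one word per line
sort | uniq -c | sort -rn |    # count occurrences, most frequent first
head -n "${1:-10}"             # keep the top N
```

After `chmod +x top-words`, it composes like any other filter: `cat book.txt | ./top-words 5`.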

The main goal of this presentation is to give you an understanding of why, when, and how you can use the command line for your next data science project.