Data Engineering

You’ll study every stage of data processing in detail, take apart the tools needed to work with each of them, and learn how to build ETL systems and design data warehouses.

Start your journey in Data Engineering now

Take the DE course and you’ll be able to:

  • Store and process vast amounts of data.
  • Master Hadoop, HDFS, MapReduce, Apache Airflow, Apache Spark and SparkSQL.
  • Build your own Data Platforms that can scale.
  • Master a profession that will still be relevant in 5, 10 and 15 years.
  • Increase your skills and income level.

Course Program

Introduction to Data Engineering

Learn everything you wanted to know about the Data Engineer profession: goals, directions, tasks, responsibilities and team function. Compare Data Engineer vs Big Data Engineer. Become familiar with the technologies you’ll be working with during the course. Understand what tasks a particular Big Data technology solves.

Python for Data Engineering

Learn to work with different data structures: string, list, tuple, set, dictionary. Start loading data from external sources using Python. Learn the specifics of working with Python modules: absolute and relative imports.
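As a small illustration of how these structures work together, the sketch below parses one text record. The record and its field layout are invented for the example and are not taken from the course materials.

```python
# Hypothetical sample record: id, name, then a list of visited cities.
raw_line = "42,alice,berlin,berlin,paris"

fields = raw_line.split(",")            # string -> list of strings
user_id = int(fields[0])
name = fields[1]
cities_visited = fields[2:]             # list: keeps order and duplicates
unique_cities = set(cities_visited)     # set: drops duplicates
key = (user_id, name)                   # tuple: immutable, so usable as a dict key
profile = {key: sorted(unique_cities)}  # dictionary: maps key -> value

print(profile)
```

Sorting the set before storing it gives a stable order, since a set itself does not guarantee one.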

SQL for Data Engineering

Learn what SQL is used for in Big Data. Learn how to combine datasets using SQL: JOIN, UNION, EXCEPT. Start using SQL for analytical queries: analytical functions, data grouping, window functions. Understand how to write fast-executing SQL.
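To give a feel for these operations, here is a minimal sketch that runs JOIN, UNION, EXCEPT and one window function against an in-memory SQLite database from Python. The tables and rows are made up for illustration, and the window function assumes SQLite 3.25 or newer; the course itself may use a different database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, for illustration only.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
    INSERT INTO orders VALUES (1, 1, 50), (2, 1, 70), (3, 2, 20);
""")

# JOIN: pair each order with its customer's name.
joined = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()

# UNION: distinct ids that appear in either table.
all_ids = conn.execute("""
    SELECT id FROM customers
    UNION
    SELECT customer_id FROM orders
    ORDER BY id
""").fetchall()

# EXCEPT: customers that have no orders.
without_orders = conn.execute("""
    SELECT id FROM customers
    EXCEPT
    SELECT customer_id FROM orders
""").fetchall()

# Window function: running total of order amounts per customer.
running = conn.execute("""
    SELECT customer_id, amount,
           SUM(amount) OVER (PARTITION BY customer_id ORDER BY id) AS running_total
    FROM orders
    ORDER BY id
""").fetchall()

print(joined, all_ids, without_orders, running, sep="\n")
```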

Analytical Databases

Identify the differences between OLTP and OLAP systems. Understand the technical implementation of a database management system designed for analytics. Learn how to describe a database structure using an ER model for future database design (Crow’s foot notation).

Designing Data Warehouses

Learn the purpose of data warehouses and the main approaches to designing and building them. Master the skill of “data mapping”. Compare examples of existing data warehouses.

Transferring data between systems

Design an ETL solution. Understand how to transfer data between systems. Learn how to extract data from external sources, then transform and clean it.
Learn to create, run, and monitor ETL processes using Apache Airflow. Start describing ETL processes as a Directed Acyclic Graph (DAG). Write your own Airflow operator to access an API. Connect to external data sources using Apache Airflow.
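The idea of describing ETL as a directed acyclic graph can be sketched in plain Python. The example below does not use the Airflow API; it relies on the standard-library `graphlib` module (Python 3.9+) to order three hypothetical tasks by their dependencies, which is roughly the job a scheduler performs before running a pipeline. The task names and data are invented.

```python
from graphlib import TopologicalSorter

def extract(ctx):
    # Pretend these strings came from an external source.
    ctx["raw"] = [" 1", "2 ", "oops", "3"]

def transform(ctx):
    # Strip whitespace and drop values that are not integers.
    stripped = (s.strip() for s in ctx["raw"])
    ctx["clean"] = [int(v) for v in stripped if v.isdigit()]

def load(ctx):
    # Stand-in for writing to a warehouse table.
    ctx["warehouse"] = list(ctx["clean"])

tasks = {"extract": extract, "transform": transform, "load": load}

# Each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# static_order() raises CycleError if the graph is not acyclic.
order = list(TopologicalSorter(dependencies).static_order())

context = {}
for name in order:
    tasks[name](context)

print(order, context["warehouse"])
```

Keeping the dependency graph separate from the task code is the design choice worth noticing: the same functions can be rearranged or extended by editing only the graph.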

Distributed Computing

Understand the concept of distributed systems and computing. Learn what tasks they solve and what ready-made solutions already exist. Distinguish distributed systems from conventional ones, and examine their advantages and disadvantages. Understand what the properties and limitations of distributed systems described by the CAP theorem mean for your work. Find out what you should pay attention to when building distributed systems and what you can sacrifice for your particular task.

Distributed File System (HDFS)

Learn how to work with the Hadoop distributed file system. Become familiar with the range of tasks that can be performed. Learn about the internal architecture of HDFS and how it is implemented. Learn how to manage files, upload and download data, and administer clusters with HDFS.

Big Data architectures

Master MapReduce, a technology for parallel computing over large data sets on computer clusters. Explore the problems that are solved with MapReduce. Learn how to analyze large amounts of data using MapReduce.
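The MapReduce model can be tried out on a single machine before moving to a cluster. The sketch below counts words in two made-up documents with explicit map, shuffle and reduce phases. It is a single-process illustration of the programming model, not the Hadoop API, and all function names are hypothetical.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input; on a cluster each document could live on a different node.
documents = ["big data big clusters", "data pipelines move data"]

def map_phase(doc):
    # Emit a (key, value) pair for every word.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine the values of one key into a single result.
    return key, reduce(lambda a, b: a + b, values)

mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(counts)
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the calls can in principle run in parallel on different machines, which is what makes the model suitable for clusters.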

Distributed in-memory computing (Apache Spark)

Get an overview of Apache Spark and identify how it differs from MapReduce. Understand why Apache Spark is considered a flagship technology in the Big Data world. Learn what tasks Apache Spark solves. Use Apache Spark to organize big data.

Working with structured data with SparkSQL

Begin your introduction to SparkSQL, the Apache Spark module for working with structured data. Learn how to load data into Spark. Explore how Spark works with external data sources. Perform transformations on structured data using SparkSQL.
Start exporting data from Spark. Learn how to perform analytics on structured data in Spark.

Optimize task execution in Apache Spark

Understand how to write efficient code and accelerate big data processing in Apache Spark. Learn how to identify major Spark performance issues and fix them. Organize the data in an Apache Spark cluster.

Data streams in Apache Spark

Understand the differences between processing streaming data and static data. Learn how to process streams of data with Spark Streaming. Break down an example program for analyzing streaming data.
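One way to see the difference from static data is that a stream is consumed one event at a time, without ever holding the whole data set. The sketch below sums values in 10-second tumbling windows over a generator of made-up events. It is plain Python rather than the Spark Streaming API, and it assumes events arrive in timestamp order.

```python
def event_stream():
    # Hypothetical events as (timestamp_seconds, value), already in time order.
    for ts, value in [(0, 5), (3, 7), (9, 1), (10, 4), (14, 6), (21, 2)]:
        yield ts, value

def tumbling_window_sums(events, window_seconds):
    """Yield (window_start, total) each time a window closes."""
    current_window = None
    total = 0
    for ts, value in events:
        window = ts // window_seconds
        if current_window is not None and window != current_window:
            # A new window started, so emit the finished one.
            yield current_window * window_seconds, total
            total = 0
        current_window = window
        total += value
    if current_window is not None:
        # Flush the last, still-open window when the stream ends.
        yield current_window * window_seconds, total

results = list(tumbling_window_sums(event_stream(), 10))
print(results)
```

Only the running total of the current window is kept in memory, so the same function would work on a stream that never ends, apart from the final flush.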

Summary

Combine all the knowledge you have gained. Create a data platform. Review the full cycle of project preparation and implementation. Begin preparation for the course project.