LSDE: Large Scale Data Engineering 2016
Lecture Block 3: the Hadoop Ecosystem: YARN, HDFS and many tools

PDF slides: The Hadoop Ecosystem

In this lecture we start discussing the evolution of Hadoop from an all-in-one MapReduce system into a whole ecosystem of many tools. This shift was embodied by the second major version of Hadoop, which separated resource management and job scheduling from MapReduce into a new component: YARN. Within this ecosystem we distinguish (1) basic services (HDFS, YARN, HCatalog), (2) generic data processing frameworks (MapReduce, Tez, Spark), (3) data querying services (Pig, Hive, Impala, Drill, PrestoDB and Spark SQL), (4) graph processing tools (Giraph, Spark GraphX), and (5) machine learning tools (Mahout, MLlib and Okapi).

The main part of the lecture, in preparation for the practicum, is dedicated to Apache Pig. The name was chosen because pigs are animals that eat any kind of food (and the Pig system ingests many different kinds of input data formats). Pig is a high-level front-end that lets users write scripts in Pig Latin, from which the system automatically generates and runs a series of MapReduce jobs that execute the work. The lecture discusses the main Pig operators for loading and storing data (LOAD, STORE), operators for working with data (FILTER, FOREACH, GROUP, JOIN, ORDER BY, LIMIT), as well as operators handy for debugging (DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE).
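To give a feel for how these operators combine, here is a small Pig Latin sketch; the input file name and field layout are invented for illustration and do not come from the lecture:

```pig
-- Load a hypothetical CSV of page visits (file name and schema are illustrative).
visits  = LOAD 'visits.csv' USING PigStorage(',')
          AS (user:chararray, url:chararray, time:long);

-- Keep only recent visits, then count visits per user.
recent  = FILTER visits BY time >= 20160101;
grouped = GROUP recent BY user;
counts  = FOREACH grouped GENERATE group AS user, COUNT(recent) AS n;

-- Ten most active users, written back to HDFS.
ordered = ORDER counts BY n DESC;
top10   = LIMIT ordered 10;
STORE top10 INTO 'top_users';

-- While developing, the debugging operators help inspect intermediate relations:
-- DUMP counts;  DESCRIBE counts;  EXPLAIN top10;
```

Each statement defines a new relation; Pig only generates and launches the underlying MapReduce jobs when a STORE (or DUMP) forces execution.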

The final part of the lecture shifts focus to another tool in the Hadoop ecosystem, namely Apache Giraph. This is a framework for iterative graph analytics algorithms on Hadoop: it avoids the HDFS materialization that MapReduce would cause between iterations by keeping the graph state in the memory of the cluster nodes while the computation runs. Its computational model is based on the concept of "Think like a Vertex": the user writes a Java method that describes how a single vertex reacts to the messages sent to it, combines these with its own state to compute a new state, and possibly sends messages to its neighbours over outgoing edges. Giraph operates in "supersteps": each vertex receives all messages sent to it in the previous superstep, does its computation and sending, and then waits until all other vertices have done the same. Giraph then moves to the next superstep, and this iteration continues for a fixed number of supersteps or until no vertex sends any message in a superstep. This relatively simple computational model is quite powerful: Google's PageRank, for instance, can be implemented in just a few lines of code. Similar to MapReduce, Giraph embeds this small user function in a framework that runs it on a cluster over potentially huge datasets.
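The superstep model above can be simulated in a few lines. The following is a minimal Python sketch (not Giraph's actual Java API) of vertex-centric PageRank on a toy graph: in each superstep every vertex reads the messages from the previous superstep, updates its rank, and sends its rank share to its out-neighbours; the swap of message buffers plays the role of Giraph's superstep barrier.

```python
def pagerank_supersteps(edges, num_vertices, supersteps=30, d=0.85):
    """Synchronous vertex-centric PageRank simulation (illustrative only)."""
    # Build the out-neighbour list for each vertex.
    out = {v: [] for v in range(num_vertices)}
    for src, dst in edges:
        out[src].append(dst)

    value = {v: 1.0 / num_vertices for v in range(num_vertices)}
    inbox = {v: [] for v in range(num_vertices)}  # messages from previous superstep

    for _ in range(supersteps):
        next_inbox = {v: [] for v in range(num_vertices)}
        for v in range(num_vertices):
            # Each vertex first processes its incoming messages...
            if inbox[v]:
                value[v] = (1 - d) / num_vertices + d * sum(inbox[v])
            # ...then sends its rank share along its outgoing edges.
            for nbr in out[v]:
                next_inbox[nbr].append(value[v] / len(out[v]))
        inbox = next_inbox  # barrier: all vertices done, next superstep begins
    return value
```

This toy version assumes every vertex has at least one outgoing edge (real implementations must redistribute the rank of dangling vertices); the real Giraph user code would instead subclass a vertex computation class in Java, but the per-superstep logic is the same.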

Technical Literature
Please read these two papers:
General Background Material

Introduction To Apache Pig at WHUG, by Adam Kawa