LSDE: Large Scale Data Engineering 2021
Lecture: The Spark Framework

PDF slides: From Hadoop to Spark

In this lecture we discuss first the Hadoop framework and then the Spark framework. Hadoop is the open-source implementation of MapReduce, the system that started the big data era at Google. Spark is by now the most popular system for doing data science computations on clusters of machines.

In this lecture we first motivate MapReduce by looking at the difficulties previously experienced (in supercomputing) when developing parallel programs, in terms of programming expertise and debugging effort. Cluster computing ("the cluster is the computer") broke through thanks to the introduction of high-level programming frameworks such as MapReduce, which take many of the parallel programming difficulties (such as synchronization, error handling and termination) away from developers.

We first take a look at cluster hardware architectures (a single node, a rack of e.g. 80 nodes, a cluster of e.g. 40 racks) and analyze the basic costs of accessing data from (i) the local CPU cache, (ii) local memory, (iii) another computer in the rack, (iv) another computer in the cluster, (v) a computer in a different cluster, and (vi) a computer in a different datacenter. This shows that disk and network latencies can be very high (and do not improve over time), and though bandwidth does improve with newer hardware, it remains precious. This motivates cluster programs to communicate less, and if at all, in bursts; and to preferably read data from the local node, or failing that, at least sequentially and at a large granularity (idea: minimize bandwidth usage and amortize latency using large transfers).
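
As a back-of-the-envelope illustration of that last point, the sketch below compares moving 64MB over the network in many small requests versus one large transfer. The latency and bandwidth numbers are rough assumptions chosen only for illustration, not measurements of any particular cluster.

    // Why large sequential transfers amortize latency (illustrative numbers only).
    object TransferCost {
      val networkLatency = 0.5e-3     // assumed ~0.5 ms round trip within a cluster
      val bandwidth      = 125.0e6    // assumed ~1 Gbit/s link = 125 MB/s

      def transferTime(bytes: Double): Double = networkLatency + bytes / bandwidth

      def main(args: Array[String]): Unit = {
        val total     = 64.0 * 1024 * 1024                     // move 64 MB in total
        val manySmall = (total / 4096) * transferTime(4096)    // 16384 requests of 4 KB each
        val oneLarge  = transferTime(total)                    // one single 64 MB transfer
        println(f"16384 x 4KB requests: $manySmall%.2f s")     // latency dominates (~8.7 s)
        println(f"1 x 64MB transfer:    $oneLarge%.2f s")      // bandwidth dominates (~0.5 s)
      }
    }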

We then describe MapReduce, specifically its open-source implementation Hadoop. The framework automatically spreads computation over a cluster, following the above rules: it tries to move computation to the data (minimizing bandwidth usage) and when it communicates, it does so in large transfers. Hadoop relies on a distributed file system named HDFS (Hadoop Distributed File System) that stores files in large blocks of 64MB and replicates these blocks on three machines (by default). To keep replication simple, HDFS files cannot be modified, only appended to. A special master machine runs the namenode, which keeps track of all files, their blocks, and on which machines (called datanodes) these are stored. HDFS in principle stores all blocks of an HDFS file together on the same three datanodes, and the datanode that originally wrote the data is always one of them. However, if a datanode goes down, the namenode will detect this and replicate its blocks on other datanodes to compensate. The datanodes (tend to) run Linux, and store all HDFS data as regular Linux files. But in Linux one cannot see the HDFS filesystem directly: one has to use a special HDFS utility program to browse through it (cd, ls/list, copy files) and see the data. So, Hadoop (and HDFS) is a cluster software layer on top of Linux.
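
The command-line utility for this is the hdfs dfs shell (with subcommands such as -ls, -put and -cat). As a minimal sketch, the same browsing can also be done programmatically from Scala through Hadoop's FileSystem API; the path /user/alice below is a hypothetical example, and hadoop-client is assumed to be on the classpath.

    // Sketch: browsing HDFS programmatically via Hadoop's FileSystem API.
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object ListHdfs {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()           // picks up fs.defaultFS from core-site.xml
        val fs   = FileSystem.get(conf)          // handle that talks to the namenode
        fs.listStatus(new Path("/user/alice"))   // equivalent of 'ls' in the HDFS shell
          .foreach(s => println(s"${s.getPath}  ${s.getLen} bytes"))
      }
    }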

The user is expected to write a Map() and a Reduce() function, both of which map (key,value) pairs into other (key,value) pairs. Both the Map() and Reduce() functions can emit zero or more result pairs for each input pair. The reducer receives as input (key,value*): the second parameter is the full list of all values that were emitted by the mappers for that key, so each key is handled by exactly one Reduce() call. Optionally, a Combine() function may be placed between Map() and Reduce(), which can reduce the amount of communication between mappers and reducers. Its input and output parameters have the same format as the input of the Reduce() function.
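
As a minimal sketch of this programming model (plain Scala functions illustrating the signatures, not the actual Hadoop API), the classic word-count example looks as follows.

    // Word count expressed in the MapReduce model (conceptual sketch, not the Hadoop API).
    object WordCountModel {
      // Map(): one input pair (line offset, line text) -> zero or more (word, 1) pairs
      def map(offset: Long, line: String): Seq[(String, Int)] =
        line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

      // Reduce(): a key plus the full list of values emitted for it -> result pair(s)
      def reduce(word: String, counts: Seq[Int]): Seq[(String, Int)] =
        Seq((word, counts.sum))

      // Combine(): same parameter format as Reduce(); pre-aggregates counts on the mapper side
      def combine(word: String, counts: Seq[Int]): Seq[(String, Int)] =
        Seq((word, counts.sum))
    }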

The MapReduce framework optimizes data movement by asking the namenode for the location(s) of the HDFS input files and assigning files (or parts of files, called splits) to mappers on machines that have this data on their local disk. Because the reducers get data from all mappers, as determined by the key, MapReduce must perform network communication for the reduce phase. If provided, it uses the Combine() function to reduce the data volume that needs to be sent.

Spark has by now replaced Hadoop as the most popular system for data science on clusters. Spark can run on a single machine or laptop, but can also make use of a Hadoop cluster, where it fires off jobs that the Hadoop cluster manager (YARN) schedules. In the Amazon cloud, we can use Spark via EMR (Elastic MapReduce), which is Amazon's Hadoop with pre-installed Spark. The original makers of Spark founded a company (Databricks) that offers Spark as a cloud service, in which case it runs natively in the cloud, without Hadoop. Whereas Hadoop clusters always provide a distributed filesystem (HDFS), this form of Spark-in-the-cloud depends on cloud storage services such as S3 (on AWS) or Azure Blob Storage.

The central concept in Spark is the RDD: the Resilient Distributed Dataset. This concept is Spark's answer to fault tolerance in distributed computation. As Spark wants to avoid writing to disk, the question is how fault tolerance is achieved if a node dies during a computation and the input was not written to a replicated disk. RDDs are the answer: an RDD is either a persistent dataset or a declarative recipe for how to create a dataset from other RDDs. This recipe allows Spark to recompute lost (parts of) RDDs of failed jobs on the fly.
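
A small Spark (Scala) sketch of this idea is shown below; the input path and the ERROR filter are hypothetical examples. The two transformations only build a recipe, the actions at the end trigger the actual computation, and a lost cached partition can be rebuilt from the same recipe after a node failure.

    // Sketch: an RDD as a recipe (lineage) rather than a materialized, replicated dataset.
    import org.apache.spark.{SparkConf, SparkContext}

    object LineageExample {
      def main(args: Array[String]): Unit = {
        // the master (YARN, local[*], ...) is normally supplied by spark-submit
        val sc = new SparkContext(new SparkConf().setAppName("lineage-example"))

        val lines  = sc.textFile("hdfs:///data/events.log")     // recipe: read blocks from HDFS
        val errors = lines.filter(_.contains("ERROR")).cache()  // recipe: filter, keep in memory

        println(errors.count())                              // action: now the work happens
        println(errors.filter(_.contains("disk")).count())   // reuses the cached RDD
        sc.stop()
      }
    }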

Spark is a programming framework that offers Map and Reduce as two of its operators (similar to the previously popular MapReduce on Hadoop), but it offers 20+ more computational operators, e.g. also joins between datasets. As such, it is a generalization of MapReduce that turns it more into a distributed database system. Also notable is its focus on functional programming, in particular Scala. Functional programs have no side effects and are therefore much easier to parallelize than traditional programs. This is what Spark does: automatically parallelize (and optimize) functional programs over a cluster infrastructure. One feature of functional programming that is often used in Spark are lambdas: anonymous, inline function definitions. The functions you would write in Java for Map and Reduce in Hadoop MapReduce can be provided, much shorter, as inline lambdas in Scala. Typically, such a lambda is a parameter to one of Spark's high-level operators; e.g. a join operator expects a boolean function that defines when two items from the two RDDs being joined match.
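
For example (a minimal sketch, assuming a SparkContext named sc as in the previous sketch and a hypothetical input path), the word count that required separate Map and Reduce classes in Hadoop shrinks to a few lambdas passed to Spark operators:

    // Word count in Spark: the Map and Reduce logic becomes inline lambdas.
    val counts = sc.textFile("hdfs:///data/books")
      .flatMap(line => line.split("\\s+"))   // lambda playing the role of Map()
      .map(word => (word, 1))
      .reduceByKey(_ + _)                    // lambda playing the role of Combine()/Reduce()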

Newer versions of Spark introduced DataFrames. A DataFrame is a conceptual layer on top of RDDs: essentially, a DataFrame is a table with column names and types. Thanks to this extra information, Spark can execute queries on DataFrames more efficiently than equivalent scripts on plain RDDs, because it can optimize these queries. Spark's query optimizer is called Catalyst; both Spark SQL and all scripts on DataFrames use it. Spark performance was further improved by just-in-time (JIT) compilation of queries into Java code (the Tungsten engine) and by better support for columnar storage.
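
A minimal DataFrame sketch in Scala is shown below; the Parquet file and the column names are hypothetical examples. Because the schema (column names and types) is known, Catalyst can optimize the filter and aggregation before anything is executed, which explain() makes visible.

    // Sketch: a query on a DataFrame, optimized by Catalyst before execution.
    import org.apache.spark.sql.SparkSession

    object DataFrameExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("df-example").getOrCreate()
        import spark.implicits._

        val visits = spark.read.parquet("hdfs:///data/visits.parquet")  // columnar input
        val perCountry = visits
          .filter($"status" === 200)
          .groupBy($"country")
          .count()

        perCountry.explain()   // prints the (optimized) physical plan chosen by Catalyst
        perCountry.show(10)
        spark.stop()
      }
    }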

Spark is a distributed Big Data processing environment for Scala, Java or Python programs, but it also has multiple higher-level components, notably Spark SQL (a distributed columnar database), GraphX (vertex-centric graph processing), and MLlib (a machine learning library).

MapReduce, but also Spark, has been criticized for low absolute performance (the former more so than the latter). One of the early critics was Michael Stonebraker, Turing Award winner for his work on relational databases, who argued that these frameworks ignored many lessons from database systems. In their evolution, big data infrastructures have indeed adopted many database techniques: Parquet files adopt columnar compressed storage (popular in analytical databases), DataFrames reinforce the notion of a database schema, Spark SQL has a query optimizer, and e.g. the recent Delta Lake project adds transactions to cloud storage. A related criticism concerns the efficiency of distributed frameworks as such. Concretely, the COST paper by Frank McSherry (referenced below) argues that when presenting a new parallel system or algorithm, one should always compare with an optimized sequential (non-parallel) baseline and compute how many cores/computers one would need to equal its performance (this is the metric COST proposes). While Spark is an improvement over MapReduce, and its performance has improved over the years, there is still a gap with what could be achieved. In our practicum, this is evidenced by the relatively slow performance that even optimized assignment 1c solutions achieve compared to native C/C++ implementations of the marketing query.

“You can have a second computer once you’ve shown you know how to use the first one.” –Paul Barham

Technical Literature

For technical background material, here are a couple of papers:

These papers walk through the evolution of Big Data infrastructures. We start with Spark's predecessor, MapReduce (a Google-proprietary system whose open-source clone Hadoop became popular and started the Big Data era). Then there is the first Spark paper, which talks about the importance of caching in memory and about RDDs; followed by a more recent Spark paper that introduces DataFrames, Spark SQL with its Catalyst optimizer, and the JIT-compiling Tungsten query engine (see: next lecture on SQL). The latest paper introduces Delta Lake: a storage layer on top of data lakes (a data lake is basically a repository of files) which organizes these files into a database schema, stores them in a more efficient data format, and allows transactional updates. Finally, there is Frank McSherry's critique of the often lacking efficiency of distributed computation frameworks such as MapReduce and Spark.

Related Presentations

Spark 1, 2, 3 -- still the most popular data science framework in 2020.

Did you know Databricks has a significant R&D presence in Amsterdam? Various LSDE students have done MSc projects in the area of big data frameworks via CWI and ended up as R&D engineers at Databricks...

Extra Material