PDF slides: Spark + MLlib (based on slides by Matei Zaharia & Xiangrui Meng)
In this lecture we focus on the Spark system, which is by far the hottest technology entering the Hadoop ecosystem (though we should mention that there is a very similar framework from Europe, Apache Flink, which is arguably better on some fronts, in particular streaming data). Where MapReduce first writes all data that Mappers send to Reducers to disk, Spark attempts to avoid doing so, keeping as much data in RAM as possible. But, since Spark does not provide a distributed filesystem or a cluster resource manager, it effectively depends on HDFS and YARN, making Spark a member of the Hadoop ecosystem. In the Amazon cloud, we can use Spark via EMR (Elastic MapReduce), which is Amazon's Hadoop with pre-installed Spark. However, it is also possible to use Spark without Hadoop, though one then needs another shared filesystem and cluster resource manager instead of HDFS and YARN. The makers of Spark (the company Databricks) operate such a Hadoop-less Spark in the Amazon cloud as a service, where the shared storage is provided by Amazon S3 (a similar service now also exists in the Microsoft Azure cloud).
The central concept in Spark is the RDD: Resilient Distributed Dataset. This concept is Spark's answer to fault tolerance in distributed computation. As Spark wants to avoid writing to disk, the question is how fault tolerance is achieved if a node dies during a computation whose intermediate results were not written to a replicated disk. RDDs are the answer: they are either persistent datasets (on HDFS) or declarative recipes of how to create a dataset from other RDDs. This recipe (the lineage of the RDD) allows Spark to recompute lost (parts of) RDDs of failed jobs on-the-fly.
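To make this concrete, the following minimal sketch (in Scala; the HDFS path, the tab-separated layout, and the "ERROR" field are made-up examples) builds an RDD as a recipe of transformations on an HDFS file. Nothing is computed until the count() action runs; if a node fails, Spark recomputes the lost partitions by replaying this recipe.

    // 'sc' is the SparkContext provided by Spark.
    val lines  = sc.textFile("hdfs:///data/events.log")          // persistent dataset on HDFS
    val parsed = lines.map(line => line.split('\t'))              // recipe: parse each line
    val errors = parsed.filter(fields => fields(2) == "ERROR")    // recipe: keep error records

    // 'errors' is so far only a recipe; the action below triggers the actual
    // distributed computation, re-deriving any lost partition from HDFS if needed.
    println(errors.count())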
Spark is a programming framework that offers Map and Reduce, but also 20+ other communication patterns, e.g. joins between datasets. As such, it is a generalization of MapReduce. Also notable is its focus on functional programming, in particular Scala. Functional programming languages have no side-effects and therefore are much easier to parallelize than traditional programming languages. This is what Spark does: automatically optimize functional programs over a cluster infrastructure. One feature of functional programming that is heavily used in Spark is the lambda: an anonymous function defined inline. The functions you would write for Map and Reduce in Java under MapReduce can be provided much more concisely as inline lambdas in Scala. Typically, such a lambda is a parameter to one of the high-level operators of Spark, e.g. a filter operator expects a boolean function that decides which items of an RDD to keep, and a reduce operator expects a function that combines two values into one.
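As an illustration, the classic word count becomes a few lines of Scala in which lambdas are passed to Spark's operators (a minimal sketch; the input path is a made-up example):

    val counts = sc.textFile("hdfs:///data/books")
      .flatMap(line => line.split(" "))      // lambda: split each line into words
      .map(word => (word, 1))                // lambda: emit (word, 1) pairs, like a Mapper
      .reduceByKey((a, b) => a + b)          // lambda: sum the counts per word, like a Reducer

    counts.take(10).foreach(println)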
Newer versions of Spark introduced DataFrames. A DataFrame is a conceptual layer on top of RDDs. Essentially, a DataFrame is a table, with column names and types. Thanks to this extra information, Spark can execute queries on DataFrames more efficiently than queries on RDDs, because it can optimize these queries. The query optimizer in Spark is called Catalyst; both Spark SQL and all scripts on DataFrames use it. Spark performance was further improved by compiling query plans at runtime into Java code, and by better support for columnar storage.
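A minimal DataFrame sketch (the CSV file and the column names playCount and artist are assumptions for illustration): because the data is a table with named, typed columns, Catalyst can for instance push the filter down and prune unused columns; explain() prints the plan it produced.

    import spark.implicits._   // 'spark' is the SparkSession

    val songs = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/songs.csv")         // a DataFrame: a table with named, typed columns

    val popular = songs
      .filter($"playCount" > 1000)           // Catalyst can push this filter down
      .groupBy($"artist")
      .count()

    popular.explain()                        // show the optimized query plan
    popular.show(10)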
Spark is a distributed Big Data processing environment using Scala, Java or Python programs, but it also has multiple higher-level components, notably Spark SQL (distributed columnar database, will be discussed later), GraphX (vertex-centric graph processing), and MLLIB (machine learning library). The latter two are summarized as follows.
GraphX is a Spark API for iterative graph analytics algorithms on Hadoop. Avoiding materialization on disk (something that MapReduce would do) is a strength of Spark, and it pays off especially in iterative algorithms. The computational model of GraphX is based on the "Think like a Vertex" concept originally proposed in the Pregel system by Google (not open-source). Beyond Spark GraphX, other popular think-like-a-vertex systems are Apache Giraph and Flink Gelly. In all these systems, the user writes a method that describes how one vertex reacts to the messages that come in for it: it uses these and its own state to compute its new state, and possibly sends output messages (to its neighbours over edges). This kind of computation operates in "supersteps": the vertices receive all messages sent to them in the previous superstep, do their computation and sending, and then wait until all other vertices have done so. Then the system moves to the next superstep, and this keeps iterating until a fixed number of iterations has been reached or until no vertex sends any message anymore in a superstep. This relatively simple computational model turns out to be quite powerful: Google's PageRank can be implemented in just a few lines of code.
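The sketch below, condensed from the example in the GraphX programming guide, shows roughly what such a PageRank looks like on GraphX's Pregel operator; the input graph 'graph' and the number of supersteps 'numIter' are assumed to be given.

    import org.apache.spark.graphx._

    // Put the out-degree on each vertex, weight each edge by 1/out-degree of its
    // source, and start every vertex with rank 1.0.
    val rankGraph: Graph[Double, Double] = graph
      .outerJoinVertices(graph.outDegrees) { (_, _, deg) => deg.getOrElse(0) }
      .mapTriplets(e => 1.0 / e.srcAttr)
      .mapVertices((_, _) => 1.0)

    // "Think like a vertex": merge incoming messages, update the rank, and send
    // new contributions over the outgoing edges, for numIter supersteps.
    val ranks = Pregel(rankGraph, 0.0, numIter)(
      (_, _, msgSum) => 0.15 + 0.85 * msgSum,                                // vertex program
      triplet => Iterator((triplet.dstId, triplet.srcAttr * triplet.attr)),  // send messages
      (a, b) => a + b)                                                       // merge messages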
MapReduce and Spark have been criticized for their low absolute performance (the former more so than the latter). Concretely, the COST paper (referenced below) argues that when presenting a new parallel system or algorithm, one should always compare with an optimized sequential (non-parallel) baseline and compute how many cores/computers one would need to equal its performance (this is the metric COST proposes). While Spark is an improvement over MapReduce, and its use of JIT-compiled query plans (see: SQL on Big Data lecture) has improved its absolute performance further, there is still a gap with what could be achieved in terms of performance. In terms of our practicum 1, this is evidenced by the relatively slow performance that even optimized assignment 1c solutions achieve compared to the native C/C++ implementations of the marketing query.
“You can have a second computer once you’ve shown you know how to use the first one.” –Paul Barham
Spark offers its MLLIB library that provides a set of ready-to-run machine learning algorithms that work on RDDs (and now also on DataFrames), and thus scale on a cluster. We focus on its Alternating Least Squares (ALS) implementation for collaborative filtering. In collaborative filtering, we typically have a very partial set of datapoints that describe users and their interests (e.g. songs). The purpose of collaborative filtering is to derive what all users think of all possible interests, specifically to recommend the most interesting items (songs) to users. ALS models this as a matrix problem where for each user and for each song we keep a vector of numbers: each element in this vector corresponds to a latent factor (characteristic). If we knew these factors, then multiplying the user vectors with the song vectors would give the matrix of all user/song interests. The ALS method iteratively approximates the optimal vector values, using a (possibly small) set of known interests as training set.
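A minimal MLLIB ALS sketch (the ratings file, its comma-separated layout, and the chosen parameters of 10 latent factors and 10 iterations are illustrative assumptions):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Known (user, song, interest) triples form the sparse training set.
    val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
      val Array(user, song, score) = line.split(',')
      Rating(user.toInt, song.toInt, score.toDouble)
    }

    // Learn 10 latent factors per user and per song; ALS alternates between
    // solving for the user vectors and for the song vectors (10 iterations).
    val model = ALS.train(ratings, 10, 10, 0.01)

    // The predicted interest of a user in a song is the dot product of their
    // factor vectors; recommend the 5 songs with the highest predicted interest.
    model.recommendProducts(42, 5).foreach(println)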
MLLIB contains many different machine learning algorithms that can easily be connected to data through the Spark concept of RDDs. The output of the learning (the model) is also an RDD and can subsequently be used to score new users (i.e. make recommendations). Besides tight integration with Spark processing pipelines (and e.g. Spark SQL), another advantage of MLLIB is that its algorithms are implemented in a scalable, distributed fashion. Therefore, one can use MLLIB on a cluster to address large problems that would not fit on a single machine.