LSDE: Large Scale Data Engineering 2018
Data Analysis Projects

In this assignment, the goal is to gain experience with a real Big Data challenge by attempting to solve it on a large Hadoop cluster. Computation for all projects will be performed on the SurfSara Hadoop cluster. We will make all datasets mentioned below available there.

Access to the cluster is via ssh login.hathi.surfsara.nl.

I have prepared a hathi-client github repo with all the software needed to get you going on your laptop and to run and monitor Spark jobs on the cluster. It comes with a MUST-READ README and an example Scala project that is built using the sbt tool. Windows users have no other option than to work from the SurfSara Linux VM that we prepared with all this software (password: osboxes.org).

As we saw in class, the VU eduroam network (and possibly other wifi networks as well) blocks some of the Kerberos ports, in which case kinit lsdeXX will hang and nothing works. Use a VPN, a different wifi network, or phone tethering to get around this.

File Formats

Datasets typically come in some non-obvious format. Data is typically dirty, and you usually need domain-expert input to understand what is in it, e.g. how to interpret normal but especially outlier values. This is a fact of life for the data scientist. One of the first tasks when starting a project is therefore to run some initial scripts on the data and visualize/summarize the results to get a feel for what it is. This implies that you often need to delve immediately into finding out how the data is represented and what it looks like (size, structure, value distributions). To even contemplate doing this, you must start writing some scripts and tools that read the data. This should be done quickly, and we expect you to have such visualizations and summaries by the time of the planning presentation (assignment 1a).
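
For instance, a first-look script along the following lines is usually enough to get a rough picture. This is only a minimal sketch assuming a CSV sample copied to your laptop; the file path and column name are hypothetical and should be replaced with your own data:

# first_look.py -- minimal sketch of an initial data inspection (hypothetical path/column)
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("first-look").getOrCreate()

# read a small sample of the dataset that you copied to your laptop (hypothetical path)
df = spark.read.csv("sample/part-00000.csv", header=True, inferSchema=True)

df.printSchema()                              # structure: column names and inferred types
print("rows in sample: %d" % df.count())      # size of the sample
df.describe().show()                          # min/max/mean/stddev of the numeric columns

# value distribution of one (hypothetical) column, useful for spotting outliers
df.groupBy("some_column").count().orderBy("count", ascending=False).show(20)

spark.stop()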

Think first, before deploying big iron

Rather than diving head first into Big Data tools, stop and think first.

Try to understand as many of the properties of the data as you can. Take samples and visualize them (on your laptop).

Ask yourself all kinds of questions about the veracity of the data. Is it what you think it is?

Start small and write pieces of code that you can test locally (with Spark on your laptop).

Do back-of-the-envelope calculations, extrapolating from your small-scale experiments. How much time will it take to compute on all your data? Are there smarter ways? What is the best representation of the data (maybe change or reduce it)? Which tools to use for which step?
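
For example, such a back-of-the-envelope calculation can be as simple as the sketch below; all numbers are made up and should be replaced with your own measurements from a local run on a sample:

# envelope.py -- back-of-the-envelope extrapolation (all numbers are hypothetical)
sample_bytes   = 2 * 1024**3   # size of the sample you tested locally: 2 GB
sample_seconds = 180           # measured runtime of your script on that sample
total_bytes    = 4 * 1024**4   # size of the full dataset on HDFS: 4 TB
executors      = 16            # number of executors you plan to request

# naive linear extrapolation, assuming the job scales with input size and parallelizes well
estimated_seconds = (float(total_bytes) / sample_bytes) * sample_seconds / executors
print("estimated cluster runtime: %.1f hours" % (estimated_seconds / 3600.0))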

Keep Calm

A word of warning.

Do not expect anything to work out-of-the-box.

This is normal. The project will be a sequence of attempts to look at the data and run tools, typically answered only by the next error message.

We feel for you, but this is normal and is the current state of Big Data (still).

Also, do not wait too long before seeking help on the Slack channel (but Google your question first).

Spark and TensorFlow

It is generally recommended to use Spark. It comes with a number of interesting built-in modules that you may want to use, notably SparkSQL, but also MLlib, Spark Streaming, or GraphX.
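
To give an idea of SparkSQL: registering a DataFrame as a temporary view lets you mix SQL with the DataFrame API. The sketch below assumes a Spark 2.x PySpark environment and a hypothetical Parquet dataset and column; adapt it to your own data:

# sparksql_example.py -- minimal SparkSQL sketch (hypothetical dataset and column names)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-example").getOrCreate()

# load a (hypothetical) Parquet dataset from HDFS and expose it to SQL
events = spark.read.parquet("hdfs:///user/lsdeXX/events.parquet")
events.createOrReplaceTempView("events")

# run an aggregation in SQL; the result is again a DataFrame you can process further
top = spark.sql("SELECT category, COUNT(*) AS cnt FROM events GROUP BY category ORDER BY cnt DESC LIMIT 10")
top.show()

spark.stop()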

For those interested in deep learning on the cluster, there is also a TensorFlow installation that can run on the datanodes, provided in
hdfs:///user/pboncz/Python.zip
The cluster only has CPUs (and not particularly fast ones). For heavy deep learning, it would be better to rent GPUs in AWS. Groups may approach me, and if it fits the project well, we can find a solution to work in the cloud.

Below you will find some indications of how TensorFlow can run on the cluster CPUs, using the MNIST example that TensorFlowOnSpark ships with (the example data is in hdfs:///user/pboncz/mnist). See also: https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN.

# set environment variables (if not already done)
export PYSPARK_PYTHON=./Python/bin/python
export LIB_NATIVE=/usr/hdp/2.3.4.0-3485/lib/native:/usr/lib/jvm/jre/lib/amd64/server

hdfs dfs -rm -skipTrash -r -f mnist/tfr

# save images and labels as TFRecords
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_NATIVE \
--conf spark.yarn.appMasterEnv.LD_LIBRARY_PATH=$LIB_NATIVE \
--conf spark.executorEnv.LD_PRELOAD=/lib64/librt.so.1 \
--conf spark.yarn.appMasterEnv.LD_PRELOAD=/lib64/librt.so.1 \
--num-executors 4 \
--executor-memory 4G \
--archives hdfs:///user/pboncz/Python.zip#Python,hdfs:///user/pboncz/mnist/mnist.zip#mnist \
--jars hdfs:///user/pboncz/tensorflow-hadoop-1.0-SNAPSHOT.jar \
hdfs:///user/pboncz/TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
--output mnist/tfr \
--format tfr

hdfs dfs -rm -skipTrash -r -f mnist_model

# run distributed MNIST training (using QueueRunners)

spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_NATIVE \
--conf spark.yarn.appMasterEnv.LD_LIBRARY_PATH=$LIB_NATIVE \
--conf spark.executorEnv.LD_PRELOAD=/lib64/librt.so.1 \
--conf spark.yarn.appMasterEnv.LD_PRELOAD=/lib64/librt.so.1 \
--conf spark.executorEnv.HADOOP_HDFS_HOME=/usr/hdp/2.3.4.0-3485/hadoop/hdfs \
--conf spark.yarn.appMasterEnv.HADOOP_HDFS_HOME=/usr/hdp/2.3.4.0-3485/hadoop/hdfs \
--num-executors 4 \
--executor-memory 7G \
--py-files hdfs:///user/pboncz/TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--archives hdfs:///user/pboncz/Python.zip#Python \
hdfs:///user/pboncz/TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py \
--images mnist/tfr/train \
--format tfr \
--mode train \
--model mnist_model

hdfs dfs -rm -skipTrash -r -f predictions

# run distributed MNIST inference (using QueueRunners)
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_NATIVE \
--conf spark.yarn.appMasterEnv.LD_LIBRARY_PATH=$LIB_NATIVE \
--conf spark.executorEnv.LD_PRELOAD=/lib64/librt.so.1 \
--conf spark.yarn.appMasterEnv.LD_PRELOAD=/lib64/librt.so.1 \
--conf spark.executorEnv.HADOOP_HDFS_HOME=/usr/hdp/2.3.4.0-3485/hadoop/hdfs \
--conf spark.yarn.appMasterEnv.HADOOP_HDFS_HOME=/usr/hdp/2.3.4.0-3485/hadoop/hdfs \
--num-executors 4 \
--executor-memory 7G \
--py-files hdfs:///user/pboncz/TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--archives hdfs:///user/pboncz/Python.zip#Python \
hdfs:///user/pboncz/TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py \
--images mnist/tfr/test \
--mode inference \
--model mnist_model \
--output predictions

Upon request, I also added the osmnx geospatial library (which includes geopandas) to Python.zip. For osmnx to work you also need to ship the native libraries along with spark-submit using the --files option. With no spaces after the commas, that would be hdfs:///user/pboncz/libexpat.so.1,hdfs:///user/pboncz/libspatialindex.so.4,hdfs:///user/pboncz/libspatialindex_c.so. You also need to set the environment variable SPATIALINDEX_C_LIBRARY="libspatialindex_c.so" for both the executors and the driver, using the respective spark-submit options.

spark-submit \
--master yarn \
--deploy-mode cluster \
--files hdfs:///user/pboncz/libexpat.so.1,hdfs:///user/pboncz/libspatialindex.so.4,hdfs:///user/pboncz/libspatialindex_c.so \
--conf spark.executorEnv.LD_LIBRARY_PATH='.' \
--conf spark.yarn.appMasterEnv.LD_LIBRARY_PATH='.' \
--conf spark.executorEnv.SPATIALINDEX_C_LIBRARY=libspatialindex_c.so \
--conf spark.yarn.appMasterEnv.SPATIALINDEX_C_LIBRARY=libspatialindex_c.so \
--archives hdfs:///user/pboncz/Python.zip#Python \
YOUR_VERY_OWN_PYTHON_SCRIPT_THAT_IMPORTS_osmnx.py 
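
As a minimal sketch of what such a script could contain, the following only checks that osmnx can be imported inside the executors; replace the body of the function with your actual geospatial processing:

# osmnx_check.py -- sketch: verify that osmnx imports on the Spark executors (placeholder logic)
from pyspark import SparkContext

sc = SparkContext(appName="osmnx-check")

def osmnx_version(_):
    # import inside the function so the import happens on the executors, not only on the driver
    import osmnx as ox
    return ox.__version__

# run the import in a few tasks and report the distinct versions seen
print(sc.parallelize(range(4), 4).map(osmnx_version).distinct().collect())

sc.stop()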

Interestingly, the login node of hathi has a different Linux version installed. If you test there locally with Python, without Spark (mkdir mypython; cd mypython; unzip /home/pboncz/Python.zip; cd ..), you will find that the login node does not have the tcl/tk libraries (the datanodes *do* have these). Therefore, I have put these libraries in /home/pboncz for convenience, so the following should work:

export SPATIALINDEX_C_LIBRARY=/home/pboncz/libspatialindex_c.so
export LD_LIBRARY_PATH=/home/pboncz
mypython/bin/python -c "import osmnx"

The installed Python packages (pip list) in the self-contained Python 2.7 archive Python.zip are currently:

backports.functools-lru-cache (1.4)
backports.weakref (1.0.post1)
bleach (1.5.0)
certifi (2017.7.27.1)
chardet (3.0.4)
click (6.7)
click-plugins (1.0.3)
cligj (0.4.0)
cycler (0.10.0)
decorator (4.1.2)
descartes (1.1.0)
enum34 (1.1.6)
Fiona (1.7.9.post1)
funcsigs (1.0.2)
geopandas (0.3.0)
h5py (2.7.1)
html5lib (0.9999999)
idna (2.6)
Keras (2.0.8)
Markdown (2.6.9)
matplotlib (2.1.0)
mock (2.0.0)
munch (2.2.0)
networkx (2.0)
numpy (1.13.3)
osmnx (0.6)
pandas (0.20.3)
pbr (3.1.1)
pip (9.0.1)
protobuf (3.4.0)
pyparsing (2.2.0)
pyproj (1.9.5.1)
python-dateutil (2.6.1)
pytz (2017.2)
PyYAML (3.12)
requests (2.18.4)
Rtree (0.8.3)
scipy (0.19.1)
setuptools (36.5.0)
Shapely (1.6.1)
six (1.11.0)
subprocess32 (3.2.7)
tensorflow (1.3.0)
tensorflow-tensorboard (0.1.8)
tensorflowonspark (1.0.4)
urllib3 (1.22)
Werkzeug (0.12.2)
wheel (0.30.0)