LSDE: Large Scale Data Engineering 2017
Data Analysis Projects

In this assignment the goal is to gain experience with a real Big Data challenge by attempting to solve it on a large Hadoop cluster. Computation for all projects will be performed on the SurfSARA Hadoop cluster. We will make all datasets mentioned below available there.

Access to the cluster is via ssh

File Formats

Datasets typically come in some non-obvious format. This is a fact of life for the data scientist. One of the first tasks when starting a project is to run some initial scripts on the data to get a feel for what it is. This means you need to delve immediately into finding out how the data is represented and what it looks like (size, structure, value distributions). To even contemplate doing this, you must start writing some scripts and tools that read the data.
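As an illustration of such an initial script, here is a minimal sketch that profiles the first lines of a text file: line count, average line size, and how many fields each line has. It assumes a tab-delimited text format, which may well not match your dataset; the function name `profile_lines` and its parameters are hypothetical and only serve the example.

```python
import collections
import io


def profile_lines(lines, delimiter="\t", max_lines=100_000):
    """Summarise the first max_lines lines of a delimited text stream.

    Assumption: the data is line-oriented text with a single-character
    delimiter. Adapt this to whatever format your dataset really has.
    """
    field_counts = collections.Counter()  # number of fields -> number of lines
    total_bytes = 0
    n = 0
    for line in lines:
        if n >= max_lines:
            break
        n += 1
        total_bytes += len(line.encode("utf-8"))
        fields = line.rstrip("\n").split(delimiter)
        field_counts[len(fields)] += 1
    return {
        "lines": n,
        "avg_bytes_per_line": total_bytes / n if n else 0.0,
        "field_counts": dict(field_counts),
    }


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real file; with real data you would
    # pass an open file object instead.
    sample = io.StringIO("a\tb\tc\n1\t2\t3\n4\t5\n")
    print(profile_lines(sample))
```

A line with an unexpected number of fields (the last one in the toy input) is often the first hint that the format is less regular than it seems.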


The Hadoop ecosystem equips you with a toolkit of many systems. Some of these tools need to be installed by system administrators, and in the case of SurfSARA this almost certainly means that a tool that is not installed yet will take too long to get installed. However, there are also plenty of tools that can be installed in user space (i.e. by you). Consequently, at least the following tools should be available:

Hence, a second decision to make in your tentative project is which Hadoop tools to use, and why.

Think first, before deploying big iron

Rather than diving in head first with Big Data tools, stop and think first.

Try to understand as much as possible about the properties of the data. Take samples and visualize them (on your laptop).
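One way to take a fixed-size sample from data too large to load into memory is reservoir sampling, which needs only a single pass. The sketch below is one possible approach, not a prescribed method; the name `reservoir_sample` and the `seed` parameter are hypothetical and chosen for the example.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Uses reservoir sampling (Algorithm R): one pass, O(k) memory, and no
    need to know the total number of items in advance.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    # Stand-in for iterating over the lines of a large file.
    records = (f"record-{n}" for n in range(1_000_000))
    print(reservoir_sample(records, k=5, seed=42))
```

The resulting sample is small enough to copy to your laptop and plot with whatever tool you prefer.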

Ask yourself all kinds of questions on the veracity of the data. Is it what you think it is?

Start small and write pieces of code that you can test locally.

Do back-of-the-envelope calculations, extrapolating from your small-scale experiments. How much time will it take to run the computation on all your data? Are there smarter ways? What is the best representation of the data (maybe change or reduce it)? What tools should you use for which step?
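Such a back-of-the-envelope extrapolation can be sketched in a few lines. The estimate below assumes that run time scales linearly with data size and that parallel workers give a constant fraction of ideal speedup; both are simplifications, and all numbers in the usage example are made up for illustration. The function name `estimate_runtime_seconds` is hypothetical.

```python
def estimate_runtime_seconds(sample_bytes, sample_seconds, total_bytes,
                             workers=1, efficiency=1.0):
    """Extrapolate the run time on the full dataset from a small experiment.

    Assumptions (label them in your report too):
      - processing cost is linear in the number of bytes;
      - `workers` parallel tasks achieve `efficiency` (0..1] of ideal speedup.
    """
    if sample_seconds <= 0 or workers <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid measurement or parallelism parameters")
    throughput = sample_bytes / sample_seconds  # bytes per second, one worker
    return total_bytes / (throughput * workers * efficiency)


if __name__ == "__main__":
    # Made-up numbers: 100 MB processed locally in 20 s, 1 TB in total.
    single = estimate_runtime_seconds(100e6, 20, 1e12)
    parallel = estimate_runtime_seconds(100e6, 20, 1e12,
                                        workers=100, efficiency=0.5)
    print(f"single worker: {single / 3600:.1f} h")
    print(f"100 workers at 50% efficiency: {parallel / 60:.1f} min")
```

If the single-worker estimate comes out in days rather than hours, that is a signal to reconsider the representation or reduce the data before reaching for the cluster.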

Keep Calm

A word of warning.

Do not expect anything to work out-of-the-box.

This is normal. The project will be a sequence of attempts to look at data and run tools, each typically answered only by the next error message.

We feel for you, but this is normal and is the current state of Big Data (still).

Also, do not wait too long before seeking help (but google your question first). We are there by email and Skype on