LSDE: Large Scale Data Engineering 2016
Data Analysis Projects

In this assignment the goal is to gain experience with a real Big Data challenge by attempting to solve it on a large Hadoop cluster. Computation for all projects will be performed on the SurfSara Hadoop cluster. We will make all datasets mentioned below available there.

Access to the cluster is via ssh login.hathi.surfsara.nl.

File Formats

Datasets typically come in some non-obvious format. This is a fact of life for the data scientist. One of the first tasks when starting a project is to run some initial scripts on the data, to get a feel for what it is. This implies that you need to delve immediately into finding out how the data is represented and what it looks like (size, structure, value distributions). To even contemplate doing this, you must start writing some scripts and tools that read the data.
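As an illustration of such an initial script, here is a minimal sketch in Python. It assumes the data is delimited text, which may not hold for your dataset. The function name profile_sample, its parameters and the three-line sample are all hypothetical and not part of any course material. The sketch counts rows, records how many fields each row has (structure), and lists the most frequent values per column (value distributions).

```python
import csv
import io
from collections import Counter


def profile_sample(lines, delimiter=",", max_rows=10000):
    """Summarise a sample of delimited text.

    Returns the number of rows read, how many rows have each field count,
    and the three most common values in every column.
    """
    field_counts = Counter()  # number of fields -> number of rows with that many
    column_values = {}        # column index -> Counter of the values seen there
    rows = 0
    for record in csv.reader(lines, delimiter=delimiter):
        if rows >= max_rows:
            break  # only look at a sample, never the full dataset
        rows += 1
        field_counts[len(record)] += 1
        for i, value in enumerate(record):
            column_values.setdefault(i, Counter())[value] += 1
    return {
        "rows": rows,
        "field_counts": dict(field_counts),
        "top_values": {i: c.most_common(3) for i, c in column_values.items()},
    }


# Hypothetical sample standing in for the first lines of a real file.
sample = io.StringIO("nl,2015,3\nnl,2016,5\nde,2016\n")
report = profile_sample(sample)
print(report["rows"], report["field_counts"])
```

Rows with differing field counts, as in the last sample line, are usually the first hint that the format is less regular than it looks. Any iterable of text lines can be passed in, so the same function can be run on a small extract of a dataset before anything is attempted at cluster scale.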

Tools

The Hadoop ecosystem equips you with a toolkit of many systems. Some of these tools need to be installed by system administrators, and in the case of SurfSARA this almost certainly means that a tool which is not installed yet will take too long to get installed. However, there are also plenty of tools that can be installed in user space (i.e. by you). Consequently, at least the following tools should be available:

Hence, a second decision to make in your tentative project is which Hadoop tools to use, and why.

Keep Calm

A word of warning.

Do not expect anything to work out-of-the-box.

This is normal. The project will be a sequence of attempts to look at data and run tools, each typically answered only by the next error message.

We feel for you, but this is normal and is the state of Big Data in 2016 (still).

Also, do not wait too long before seeking help. We are available by email and Skype at lsde_course@outlook.com