Large Scale Data Engineering is a new MSc course for VU University taught by Peter Boncz and Hannes Mühleisen from the Database Architectures research group of CWI, specifically for the Amsterdam Data Science initiative.
Goals & Scope
The goal of the course is to gain insight into and experience with algorithms and infrastructures for managing big data.
This course confronts the students with a number of data management tasks, where the challenge is that the sheer size
of the data causes naive solutions, and/or solutions that work only on a single machine, to stop being practical.
Solving such tasks requires the computer scientist to have insight into the main factors that underlie algorithm
performance (access pattern, hardware latency/bandwidth), as well as to possess certain skills and experience in
managing large-scale computing infrastructure.
The course has now completed successfully. Here is a small (biased ;-) sample of the results of the second assignment, where students had to analyze datasets using cluster technology:
- World Map of Faces, which used MapReduce to crawl a 100-million-picture archive from Flickr, run face detection on the images, and aggregate the results into map overlays at various zoom levels. Created by Till Dohmen and Florian Gomelo.
- Secret New York City Nightlife, which analyzed a 27GB New York taxi trips dataset for nightly patterns. The result is a New York map showing those locations which are much more often taxi destinations at night than during the day, and which do not map to known nightclubs. Created by David Mueller and Alessandro Zonta.
There are two lectures per week, on Monday 11:00-12:45 and Thursday 09:15-10:45 in room P647.
More information about the lectures is in the schedule.
This course may take a significant amount of time and effort, and requires significant practical work.
The practicals are done outside lecture hours, at the discretion of the students.
In the first assignment the students can work either on
their own laptops via a prepared VM, or in the cloud using
an Amazon EC2 Micro Instance.
The second assignment will be done on the
SurfSARA Hadoop cluster (90 machines, 720 cores, 1.2PB storage).
The students must work in groups consisting of two students; the VU BlackBoard is used for registering student groups and for reporting grades.
For this course, each group must deliver the following:
- a program which runs your Assignment1 data loader (reorg) and the benchmark queries (with correct results) within the deadline on our test hardware (a VM running on a dedicated machine, single-core CPU, 1GB RAM). There is an online competition between the student groups for maximum speed. The evaluation set in this competition consists of 50 queries, and we take the median query time as the score. You are asked to use git as your code repository, since the competition software automatically checks out git commits, benchmarks your software and adjusts the leaderboard.
- a presentation (of ~20 minutes) during one of the lectures summarizing two research papers on the topic of Assignment2. This topic must be chosen from the list of pre-defined topics, and no two groups can choose the same topic. Topics are handed out first-come-first-served, with groups choosing in the order of their final ranking in the Assignment1 speed competition. A draft of the presentation should be emailed to email@example.com 24 hours before the presentation for early feedback and tips.
- a written report (PDF) for Assignment2, which contains a summary of the presentation (as a "related work" section) and a description of the Hadoop project and its results, together with all code created for that project in a tar file. Both should be submitted via BlackBoard.
The final grade (1-10) is a weighted combination of these three tasks (30%, 20% and 40%, respectively), with the remaining 10% based on your individual observed participation during the lectures.
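As an illustration of the arithmetic described above, here is a minimal Python sketch of the two calculations: the competition score as the median of the per-query times, and the final grade as a weighted sum. The function names, the dictionary keys and the example numbers are made up for this illustration. The sketch also assumes that query times are measured in seconds and that every component is graded on the same 1-10 scale; how the speed ranking translates into the grade for the program, and how the final grade is rounded, is not specified here.

```python
from statistics import median

# Weights as listed above: program 30%, presentation 20%, report 40%,
# lecture participation 10%. The key names are hypothetical.
WEIGHTS = {
    "program": 0.30,
    "presentation": 0.20,
    "report": 0.40,
    "participation": 0.10,
}


def competition_score(query_times):
    """Return the median of the per-query times (assumed to be in seconds)."""
    if not query_times:
        raise ValueError("need at least one query time")
    return median(query_times)


def final_grade(grades):
    """Weighted sum of the component grades, each assumed to be on the 1-10 scale."""
    return sum(WEIGHTS[name] * grades[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Five made-up timings; the real evaluation set has 50 queries.
    print(competition_score([0.8, 1.2, 0.5, 2.0, 0.9]))  # prints 0.9

    # Made-up component grades for one student.
    example = {"program": 8, "presentation": 7, "report": 9, "participation": 6}
    print(round(final_grade(example), 2))
```

Because the score is a median rather than a mean, a single very slow query does not dominate the result; at least half of the queries have to be fast for the score to be good.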
You can email questions to firstname.lastname@example.org, or contact
that account via Skype (chat, audio or video).
The lecture slides for LSDE2015 are based on those used in the
Extreme Computing course, and were graciously
provided by Dr. Stratis Viglas of the
University of Edinburgh.