LSDE: Large Scale Data Engineering 2016

Large Scale Data Engineering is an MSc course taught by Peter Boncz, VU University professor holding the special chair in Large-Scale Analytical Data Management, and Hannes Mühleisen from the Database Architectures research group of CWI. It was developed specifically for the Amsterdam Data Science initiative.

Goals & Scope

The goal of the course is to gain insight into and experience with algorithms and infrastructures for managing big data.

This course confronts the students with a number of data management tasks where the challenge is that the mere size of the data makes naive solutions, and/or solutions that work only on a single machine, impractical. Solving such tasks requires the computer scientist to have insight into the main factors that underlie algorithm performance (access pattern, hardware latency/bandwidth), as well as certain skills and experience in managing large-scale computing infrastructure. This is the focus of the first assignment.

The course further gives an overview of the infrastructures currently at the disposal of a broad public for large-scale data analysis, in particular cloud computing infrastructures and the Hadoop software ecosystem for managing data on large clusters. In the second assignment, students perform a Big Data analysis project on a large cluster to gain experience with these technologies. They also present literature related to their project in a talk, and write a final report.

Course Structure

There are two lectures per week, on Monday 11:00-12:45 in room C147 (from Oct on: C669) and Wednesday 09:15-10:45 in room M639. More information about the lectures is in the schedule. This course may take a significant amount of time and effort, and requires substantial practical work. The practicals are done outside lecture hours, at a time of the students' own choosing.

In the first assignment the students work on their own laptops via a prepared VM.

The second assignment is done on the SurfSARA Hadoop cluster (90 machines, 720 cores, 1.2PB storage).

Books

The books below give background information on the hardware and software aspects, respectively, of Big Data infrastructures and technologies:

Tasks and Grading

The students must work in groups consisting of two students; the VU BlackBoard is used for registering student groups and for reporting grades.

For this course, each group must deliver the following:
  • a program which runs your assignment1 data loader (reorg) and the benchmark queries (with correct results) within the deadline on our test hardware (a VM running on a dedicated machine, single-core CPU, 1GB RAM). There is an online competition between the student groups for maximum speed. The evaluation set in this competition consists of 10 queries, and we take the median query time as the score. You are asked to use git as your code repository, since the competition software automatically checks out git commits, benchmarks your software and adjusts the leaderboard.
  • a presentation (of ~15 minutes) during one of the lectures summarizing two research papers on the topic of Assignment2. This topic must be chosen from the list of pre-defined topics, and no two groups can choose the same topic. Topics are chosen first-come-first-served, in the order of the final ranking in the Assignment1 speed competition. A draft of the presentation should be emailed to lsde_course@outlook.com at least 24 hours before the presentation for early feedback and tips.
  • a written report (PDF) for Assignment2, which contains a summary of the presentation (as a "related work" section) and a description of the Hadoop project and its results, plus all code created for that project in a tar file. This should be submitted via BlackBoard.
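
The scoring rule of the Assignment1 competition (median over the 10 query times) can be sketched as follows. This is only an illustration, not the actual competition software: the function name and the timings are hypothetical, and it assumes that for an even number of queries the median is the mean of the two middle values, which the course page does not specify.

```python
import statistics

NUM_QUERIES = 10  # size of the evaluation set, as stated above


def competition_score(query_times_ms):
    """Return the median per-query time (assumed scoring rule)."""
    if len(query_times_ms) != NUM_QUERIES:
        raise ValueError("expected one timing per evaluation query")
    # For an even count, statistics.median averages the two middle values
    # (assumption: the real leaderboard may define the median differently).
    return statistics.median(query_times_ms)


# Hypothetical timings in milliseconds, for illustration only.
example_times_ms = [800, 1200, 500, 3000, 900, 1100, 700, 2400, 1000, 600]
score_ms = competition_score(example_times_ms)
```

Because the score is a median rather than a mean, one or two very slow queries (such as the 3000 ms outlier above) do not dominate the result.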

The final grade (1-10) is a weighted combination of these three tasks (30%, 20%, and 40%, respectively), with the remaining 10% based on your individual observed participation during the lectures.
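
As a worked example, the weighting can be written out as below. This is a minimal sketch: the component names and the sample grades are hypothetical, each component is assumed to be graded on the same 1-10 scale, and any rounding applied to the final grade is not specified here.

```python
# Weights as stated above, in the order the tasks are listed.
WEIGHTS = {
    "assignment1_program": 0.30,
    "presentation": 0.20,
    "assignment2_report": 0.40,
    "participation": 0.10,
}


def final_grade(component_grades):
    """Weighted sum of the component grades (assumed to be on a 1-10 scale)."""
    return sum(WEIGHTS[name] * component_grades[name] for name in WEIGHTS)


# Hypothetical grades, for illustration only.
example_grades = {
    "assignment1_program": 8.0,
    "presentation": 7.0,
    "assignment2_report": 9.0,
    "participation": 6.0,
}
grade = final_grade(example_grades)  # 2.4 + 1.4 + 3.6 + 0.6, i.e. about 8.0
```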

Getting Help

You can email questions to lsde_course@outlook.com, or contact that account on Skype via chat, audio, or video.

Acknowledgements

The lecture slides for LSDE2016 are based on those used in the Extreme Computing course, and were graciously provided by Dr. Stratis Viglas of the University of Edinburgh.