Large Scale Data Engineering is an MSc course taught by Peter Boncz, who holds the special chair in Large-Scale Analytical Data Management at VU University, and Hannes Mühleisen of the Database Architectures research group at CWI. It was developed specifically for the Amsterdam Data Science initiative.
The goal of the course is to gain insight into and experience with algorithms and infrastructures for managing big data.
This course confronts students with data management tasks in which the sheer size of the data makes naive solutions, and solutions that only work on a single machine, impractical. Solving such tasks requires insight into the main factors that underlie algorithm performance (access pattern, hardware latency/bandwidth), as well as skills and experience in managing large-scale computing infrastructure. This is the focus of the first assignment.
The course also gives an overview of the infrastructures currently available to a broad public for large-scale data analysis, in particular cloud computing infrastructures and the Hadoop software ecosystem for managing data on large clusters. In the second assignment, students carry out a Big Data analysis project on a large cluster to gain experience with these technologies. They also present literature related to their project in a talk and write a final report.
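To make the performance factors named above concrete, here is a minimal back-of-the-envelope sketch in Python. It is not part of the course material. The dataset size, record size, disk bandwidth, and disk latency are all assumed round numbers chosen for illustration, and the two helper functions are hypothetical names.

```python
# Back-of-the-envelope cost model. Every hardware number below is an
# assumption chosen for illustration, not a measurement.

def sequential_scan_seconds(total_bytes, bandwidth_bytes_per_s):
    """Time to read the data once, front to back, at full bandwidth."""
    return total_bytes / bandwidth_bytes_per_s


def random_access_seconds(num_lookups, latency_s):
    """Time for independent lookups that each pay the full access latency."""
    return num_lookups * latency_s


TB = 10**12
data_bytes = 1 * TB             # assumed dataset size: 1 TB
record_bytes = 1000             # assumed record size: 1 kB
disk_bandwidth = 100 * 10**6    # assumed sequential read rate: 100 MB/s
disk_latency = 0.010            # assumed cost of one random access: 10 ms

num_records = data_bytes // record_bytes
scan = sequential_scan_seconds(data_bytes, disk_bandwidth)
lookups = random_access_seconds(num_records, disk_latency)

print(f"sequential scan of all records: {scan / 3600:.1f} hours")
print(f"one random access per record:   {lookups / 86400:.1f} days")
```

Under these assumptions, touching every record through random accesses is about a thousand times slower than a single sequential scan, which is why the access pattern matters as much as the raw data size.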
There are two lectures per week, on Monday 11:00-12:45 in room C147 (from Oct on: C669) and Wednesday 09:15-10:45 in room M639. More information about the lectures is in the schedule. This course may take a significant amount of time and effort, and requires significant practical work. The practicals are done outside lecture hours, at the discretion of the students.
In the first assignment the students work on their own laptops via a prepared VM.
The second assignment is done on the SurfSARA Hadoop cluster (90 machines, 720 cores, 1.2 PB storage).
The books below give background information on the hardware and software aspects, respectively, of Big Data Infrastructures and Technologies:
Students must work in groups of two; the VU BlackBoard is used for registering groups and for reporting grades.
For this course, each group must deliver the following:
The final grade (1-10) is a weighted combination of these tasks (30%, 20%, and 40%, respectively); the remaining 10% is based on your individual observed participation during the lectures.
You can email questions to lsde_course@outlook.com, or reach that account on Skype by chat, audio, or video.
The lecture slides for LSDE2016 are based on those used in the Extreme Computing course, and were graciously provided by Dr. Stratis Viglas of the University of Edinburgh.
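As a worked illustration of this weighting, the sketch below computes a final grade in Python. The function name, the argument names `task1` to `task3`, and the example scores are hypothetical, and it assumes that every component, including participation, is scored on the same 1-10 scale.

```python
# Weights from the course description: three tasks plus participation.
WEIGHTS = (0.30, 0.20, 0.40, 0.10)


def final_grade(task1, task2, task3, participation):
    """Weighted sum of the four components, each assumed to be on a 1-10 scale."""
    scores = (task1, task2, task3, participation)
    return sum(w * s for w, s in zip(WEIGHTS, scores))


# Hypothetical scores, for illustration only.
example = final_grade(task1=8, task2=7, task3=9, participation=6)
print(f"final grade: {example:.1f}")
```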