LSDE: Large Scale Data Engineering 2018

Large Scale Data Engineering is an MSc course taught by Peter Boncz, VU University professor in the Large-Scale Analytical Data Management special chair, and Hannes Mühleisen of the Database Architectures research group at CWI, developed specifically for the Amsterdam Data Science initiative.

Goals & Scope

The goal of the course is to gain insight into and experience with algorithms and infrastructures for managing big data.

This course confronts students with data management tasks where the challenge is that the sheer size of the data causes naive solutions, and/or solutions that work only on a single machine, to stop being practical. Solving such tasks requires the computer scientist to have insight into the main factors that underlie algorithm performance (access pattern, hardware latency/bandwidth), as well as skills and experience in managing large-scale computing infrastructure. This is the focus of the first assignment.

The course further gives an overview of the infrastructures currently at the disposal of a broad public for large-scale data analysis: in particular, cloud computing infrastructures and the Hadoop software ecosystem for managing data on large clusters. In the second assignment, students perform a Big Data analysis project on a large cluster to gain experience with these technologies. This project is also training in practical problem solving and critical thinking. Students also present literature related to their project in a talk, and write a final report.

Course Structure

There are two lectures per week, on Wednesday 09:15-10:45 and Friday 09:15-10:45 in room WN-M623. More information about the lectures is in the schedule. This course may take a significant amount of time and effort, and requires significant practical work. It requires programming skills: in its current shape, LSDE students use at least C or Java and Scala, plus whatever tools, programming or scripting languages they choose in Assignment 2. The practicals are done outside lecture hours, at the discretion of the students.

In the first assignment the students can work either on their own laptops via a prepared VM, or in the cloud using an Amazon EC2 Micro Instance in their free tier (free as in $0). It consists of three parts (1a, 1b, 1c). The first two parts (1a, 1b) are done individually. Then, practicum groups consisting of exactly three students are formed. In these groups, assignment 1c is done as well as the Big Data project (a.k.a. assignment 2). For the Big Data heavy lifting in the second assignment, the SurfSARA Hadoop cluster (90 machines, 720 cores, 1.2PB storage) is available, and each group will get an account there for the duration of the course.

From Assignment 1c on, the students will work in groups consisting of three students; the VU Canvas is used for registering student groups and for reporting grades.

Books

The books below give background information on the hardware and software aspects, respectively, of Big Data infrastructures and technologies:

Tasks and Grading

For this course, each student (or, from Assignment 1c on, each group) must deliver the following:

  • assignment 1a: a program which runs the assignment 1 benchmark queries (with correct results) within the deadline on our test hardware (a VM running on a dedicated machine, single-core CPU, 1GB RAM). There is an online competition between the students for maximum speed. The evaluation set in this competition consists of 10 queries, and we take the median query time as the score. For this you need a GitHub account, so you can get access to a private code repository there. The competition software automatically checks out the versions you push to git, benchmarks your software, and adjusts the leaderboard.
  • assignment 1b: an improved query program, as well as possibly a "reorg" program that it invokes before running the benchmark queries. The reorg program must complete within the benchmark timeout period (20 minutes, which also applies to assignment 1a), but the reorg running time is not counted in the score. Its purpose is to reorganize the data such that the queries can be computed more quickly.
  • assignment 1c: performs the same task as 1b, but now the reorg and queries are implemented using Spark. The test hardware for this is more powerful (a VM running on a dedicated machine, quad-core CPU, 4GB RAM). The online competition is still in place, though now the competition is between the practicum groups. That is, assignment 1c is not individual anymore, but performed in a group of three students.
  • assignment 2a: a short but cool planning presentation on the Big Data project. The goal of this short presentation is to (1) visualize findings of the quick-scan of the datasets done by each group, (2) define the research question, (3) present related scientific literature and (4) seek feedback on the intended project approach (both content-wise and technologically). All group members must be present and participate in the presentation (unexplained absence will lead to individual point deduction).
  • assignment 2b: a short but cool result presentation on the Big Data project. The goal of this short presentation is to share with all students the results of your project. The projects are likely not fully finished by that time, so these may still be preliminary. The focus should now be on presenting the data hurdles and technical hurdles encountered (and their solutions), on the findings, as well as on (plans for) the visualization website. All group members must be present and participate in the presentation (unexplained absence will lead to individual point deduction).
  • assignment 2c: all code created for the project (in a tar file), a final report in article style (PDF) and a web visualization (in static HTML, JavaScript allowed) for Assignment 2. The report should cover problem motivation, definition, research questions, methodology and technology chosen, and a detailed description of the approach taken and its results, all in scientific writing style. At the end of the report, a table must be included that lists which group member worked on which part of the project (report, code parts, visualization) and to what extent. The report and code should be submitted via Canvas; the visualization can be put as a tar file in a readable directory on the HDFS filesystem of the cluster (as it tends to include data and thus tends to be big).
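The exact benchmark data and queries of assignment 1 are not described on this page, but the idea behind the reorg step in assignments 1b and 1c can be sketched in general terms: spend (unscored) preprocessing time once so that each timed query becomes a cheap lookup rather than a full scan. A minimal illustration in Python (chosen for brevity; the key/value records below are hypothetical, not the actual assignment data):

```python
import bisect

# Hypothetical raw data: unsorted (key, value) records.
raw_records = [(42, "c"), (7, "a"), (99, "d"), (7, "b")]

# --- reorg phase (runs once, not counted in the query score) ---
reorganized = sorted(raw_records)       # order records by key
keys = [k for k, _ in reorganized]      # separate key column for binary search

# --- query phase (timed) ---
def lookup(key):
    """Return all values stored under `key` via binary search, O(log n)."""
    lo = bisect.bisect_left(keys, key)
    hi = bisect.bisect_right(keys, key)
    return [v for _, v in reorganized[lo:hi]]

print(lookup(7))   # ['a', 'b']
```

A real solution would persist the reorganized layout to disk during the reorg phase (for instance, as sorted column files) and read it back in the query phase, since the query program runs as a separate invocation.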
The final grade (1-10) is a weighted average (10%, 20%, 0%, 10%, 10%, 50%) of these respective tasks. We use plagiarism-detection software in assignment 1 (detected plagiarism will result in a grade of 1 for all involved for that assignment). At the discretion of the teacher, a 0.5 bonus point may be awarded to students who actively (and constructively and usefully) participate in class and the Slack channel (though the final grade is capped at 10). The incentive for 1c is not grade points but freedom to choose a Big Data project: topics are chosen in the first lecture after the deadline using FCFS (First Come First Served) in 1c leaderboard order.
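To make the weighting concrete, a small worked example (the per-assignment grades used here are hypothetical):

```python
# Final-grade computation from the stated weights (10%, 20%, 0%, 10%, 10%, 50%).
weights = {"1a": 0.10, "1b": 0.20, "1c": 0.00, "2a": 0.10, "2b": 0.10, "2c": 0.50}
grades  = {"1a": 8.0, "1b": 7.0, "1c": 9.0, "2a": 8.0, "2b": 7.5, "2c": 8.5}  # hypothetical

final = sum(weights[a] * grades[a] for a in weights)   # 0.8 + 1.4 + 0 + 0.8 + 0.75 + 4.25 = 8.0
final_with_bonus = min(final + 0.5, 10.0)              # optional participation bonus, capped at 10
print(final, final_with_bonus)
```

Note that 1c carries 0% weight, consistent with the text above: its reward is the order in which groups pick project topics, not grade points.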

Getting Help

The primary communication channel is Slack. Consider asking your question in the #general Slack channel: your fellow students will see it and may answer the question before I can. Active students in class and those who correctly and promptly answer questions on Slack may get a 0.5 bonus point on their overall grade. You can also message Peter Boncz directly on Slack, of course. Please also get Skype and install the Skype app in Slack, so we can talk over an audio or video connection to lsde@outlook.com (which is the official LSDE email address).

Acknowledgements

The lecture slides for LSDE are based on those used in the Extreme Computing course, and were graciously provided by Dr. Stratis Viglas of the University of Edinburgh.