LSDE: Large Scale Data Engineering 2022

Large Scale Data Engineering is an MSc course by Peter Boncz, VU University professor holding the special chair in Large-Scale Analytical Data Management and member of the Database Architectures research group at CWI, developed specifically for the Amsterdam Data Science initiative.

Goals & Scope

The goal of the course is to gain insight into and experience with algorithms and infrastructures for managing big data.

This course confronts students with data management tasks where the challenge is that the sheer size of the data causes naive solutions, and/or solutions that work only on a single machine, to stop being practical. Solving such tasks requires the computer scientist to have insight into the main factors that underlie algorithm performance (access pattern, hardware latency/bandwidth), as well as certain skills and experience in managing large-scale computing infrastructure. This is the focus of the first assignment.
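
To make the access-pattern point concrete, here is a small, purely illustrative micro-benchmark (not part of any assignment): it visits the same elements of the same array once in sequential order and once in random order, so only the access pattern changes the running time.

    # Illustrative micro-benchmark (not part of any assignment): the data and
    # the amount of work are identical in both runs; only the access pattern
    # (sequential vs. random) differs.
    import random
    import time

    N = 20_000_000
    data = list(range(N))

    ordered = list(range(N))
    shuffled = list(ordered)
    random.shuffle(shuffled)

    def visit(indices):
        # Sum the elements of `data` in the order given by `indices`.
        total = 0
        for i in indices:
            total += data[i]
        return total

    t0 = time.perf_counter()
    visit(ordered)
    t_seq = time.perf_counter() - t0

    t0 = time.perf_counter()
    visit(shuffled)
    t_rnd = time.perf_counter() - t0

    print(f"sequential scan: {t_seq:.2f}s, random access: {t_rnd:.2f}s")

The sequential run benefits from caches and hardware prefetching; the random run pays a latency penalty on essentially every access, even though the total work is the same.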

The course further gives an overview of the infrastructures currently available to a broad public for large-scale data analysis, in particular cloud computing infrastructures and large-scale data management frameworks for compute clusters, such as Spark. In the second assignment, students perform a Big Data analysis project on a large cluster to gain experience with these technologies. This project is also training in practical problem solving as well as critical thinking. Students also present literature related to their project in a talk, and write a final report.
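
As a flavour of what such a framework looks like in practice, the sketch below shows a minimal PySpark job; the dataset path and the column name are made up for illustration and do not correspond to any actual assignment.

    # Minimal PySpark sketch; the S3 path and the "timestamp" column are
    # hypothetical. The same dataframe code runs on a laptop or on a cluster,
    # with Spark distributing the scan and the aggregation over the executors.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("lsde-sketch").getOrCreate()

    events = spark.read.parquet("s3://some-bucket/events/")  # hypothetical input

    daily_counts = (
        events
        .groupBy(F.to_date("timestamp").alias("day"))  # bucket events per day
        .count()
        .orderBy("day")
    )
    daily_counts.show()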

Course Structure

There are two lectures per week, on Tuesday 11:00-12:45 and Friday 11:00-12:45. More information about the lectures is in the schedule. This course takes a significant amount of time and effort, and requires substantial practical work. It requires programming skills. In its current form, LSDE students use at least C and Python, plus whatever tools, programming or scripting languages they choose in assignment 2. The practicals are done outside lecture hours, at the discretion of the students.

In the first assignment, students can work either on their own laptops via the prepared VMs or in the cloud using an Amazon EC2 t2.micro instance within the AWS free tier (free as in $0). The assignment consists of three parts (1a, 1b, 1c). The first two parts (1a, 1b) are done individually. Then, practicum groups of exactly three students are formed; in these groups, assignment 1c is done, as well as the Big Data project (a.k.a. assignment 2). For the Big Data heavy lifting in the second assignment, LSDE has received a $9000 grant from Amazon Web Services (AWS).

From Assignment 1c on, students work in groups of three; VU Canvas is used for registering student groups and for reporting grades.

Tasks and Grading

For this course, each student (or, from Assignment 1c on, each group) must deliver the following:

  • assignment 1a: a program which runs the assignment 1 benchmark queries (with correct results) within the deadline on our test hardware (a VM running on a dedicated machine, single-core CPU, 1GB RAM). There is an online competition between the students for maximum speed. For this you need a GitHub account, so you can get access to a private code repository there. The competition software automatically checks out the versions you push to Git, benchmarks your software and adjusts the Leaderboard (see the menu on the right).
  • assignment 1b: an improved query program, as well as possibly a "reorg" program that is invoked before the benchmark queries run. The reorg program must complete within the benchmark timeout period (the same timeout as used for assignment 1a), but the reorg running time is not counted in the score. Its purpose is to reorganize the data in such a way that the queries can be computed more quickly (a generic sketch of this idea is shown after this list).
  • assignment 1c: performs the same task as 1b, but now the reorg and queries are implemented using Spark. The test hardware for this is more powerful (a VM running on a dedicated machine, quad-core CPU, 4GB RAM). The online competition is still in place, though now the competition is between the practicum groups. That is, assignment 1c is not individual anymore, but performed in a group of three students.
  • assignment 2a: a project plan for the Big Data project. Every group writes a short project plan (PDF, max 5 pages A4, 11 point font), describing the general scope and approach of the Big Data project. This includes the envisioned data products, some ideas about their visualization, and a description of how the data products are to be created. It starts with a description of the input data sets and their relevant properties and statistics. Existing techniques, tools and algorithms that can be deployed on the raw input data need to be identified, and a pipeline of tools that will process the input data into the data product needs to be described. For all of this, we also need a time plan as well as a detailed and motivated budget plan for AWS. This project plan is peer-reviewed among the student groups.
  • assignment 2b: a short but cool result presentation on the Big Data project. The goal of this short presentation is to share with all students the final results of your project. All group members must be present and participate in the presentation (unexplained absence will lead to individual point deduction).
  • assignment 2c: all code created for the project (in a tar file), a final report in article style (PDF) and a web visualization (in static HTML, JavaScript allowed) for assignment 2. The report should cover problem motivation, definition, research questions, methodology and technology chosen, and a detailed description of the approach taken and its results, all in scientific writing style. At the end of the report, a table must be included that lists which group member worked on which part of the project (report, code parts, visualization) and to what extent. The report and code should be submitted via Canvas; the visualization can be put as a tar file in a readable directory on the HDFS filesystem of the cluster, or on S3 if we work in the cloud (as it tends to include data and thus tends to be big).
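
As a purely hypothetical illustration of the reorg idea behind assignments 1b/1c (the real benchmark data and queries are different and not described here), the sketch below spends time once reordering the data so that a later range query becomes a binary search instead of a full scan:

    # Hypothetical illustration of the "reorg" idea: the reorg phase sorts the
    # data once (its running time is not counted in the score), after which
    # each range query is answered in O(log n) with binary search instead of
    # a full scan over the raw data.
    import numpy as np

    def reorg(values: np.ndarray) -> np.ndarray:
        # One-time reorganization: sort by the attribute the queries filter on.
        return np.sort(values)

    def count_in_range(sorted_values: np.ndarray, lo: int, hi: int) -> int:
        # Query phase: two binary searches bound the matching slice.
        left = np.searchsorted(sorted_values, lo, side="left")
        right = np.searchsorted(sorted_values, hi, side="right")
        return int(right - left)

    raw = np.random.randint(0, 1_000_000, size=10_000_000)  # stand-in data
    sorted_vals = reorg(raw)
    print(count_in_range(sorted_vals, 1_000, 2_000))
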
The final grade (1-10) is weighted (10%, 20%, 0%, 10%, 10%, 50%) over these respective tasks. We use software for plagiarism detection in assignment 1 (detected plagiarism will result in a grade of 1 for all involved for that assignment). At the discretion of the teacher, a 0.5 bonus point may be awarded to students who actively (and constructively and usefully) participate in class and on the Slack channel (though the final grade is capped at 10). The incentive for 1c is not grade points but the freedom to choose a Big Data project: topics are claimed in the first lecture after the 1c deadline on a first-come-first-served (FCFS) basis, in 1c leaderboard order.

Getting Help

The primary communication channel is Slack. Consider asking your question in the #general Slack channel; your fellow students will see it and may answer before I can. Active students in class and those who correctly and promptly answer questions on Slack may get a 0.5 bonus point on their overall grade. You can also message Peter Boncz or the TAs directly on Slack, of course.

When posting questions, please (1) state what the problem is, (2) describe the system you are running on (e.g. operating system, virtual machine type), (3) show the full error message you get (when possible, copy the error text instead of posting a screenshot), and (4) explain what steps you have tried so far to troubleshoot the problem.

Acknowledgements

The lecture slides for LSDE are based on those used in the Extreme Computing course, and were graciously provided by Dr. Stratis Viglas of the University of Edinburgh.

During the first five iterations of this course, Hannes Mühleisen played an important role in bringing this course to life, contributing valuable aspects to, among other things, the practical assignments.