LSDE: Large Scale Data Engineering 2022
Practical Information
Getting a Virtual Machine

We provide ready-made Linux virtual machines (VMs). On x86-based systems, use the VirtualBox system and the ldbc-vm.ova image (18 GB). Once downloaded, this machine can be imported into VirtualBox and started. On systems with the Apple M1 chip, use the UTM virtualization app and the image in lsde-m1.zip (12 GB, grows to 20+ GB when uncompressed).
The settings with 1GB RAM and 1 core are for assignments 1a and 1b. For assignment 1c, please change the memory size to 4GB and the number of cores to 2+ (in VirtualBox, go to Settings => System, set 4096MB there and 2+ CPUs on the Processor tab; in UTM, edit the machine's settings). The VM contains the sample datasets in the folder /opt/lsde and the required dependencies (gcc, Spark, etc.). To log in, use the username "ubuntu" and the password "ubuntu".
To connect to the VM from Linux or macOS, run ssh -p 2224 ubuntu@A.B.C.D (for the VirtualBox/x86 VM) or ssh -p 22 ubuntu@A.B.C.D (for the UTM/M1 VM). On Windows, the equivalent of ssh is the PuTTY application; save the session there so you can reuse it.

The virtual machines contain the data sets in binary format but they are also available separately: SF100, SF3000.

Those using Amazon in assignments 1a and 1b should register for the free tier (credit-card info is required, but all of this should cost $0), then go to EC2 in the management console and launch a "Community AMI" instance. Searching for "lsde" in the Frankfurt data center (use the top-right menu bar to switch data centers) you will find the lsde2022 image, which can be launched on a micro instance for free. Doing so, you will need to create an SSH key pair and launch the instance. When it is running, note down its IP address and SSH into it using the username "ubuntu" and your key, e.g. ssh -i mykey.pem ubuntu@A.B.C.D in a Linux or macOS terminal, or with the Windows PuTTY client (in that case you first need to use the puttygen utility to convert the PEM key from Amazon into a PuTTY Private Key file; include that key file under the 'SSH' tab and its 'Auth' subtab in the PuTTY connection setup window).

Warning: the free-tier t* instances in EC2 have burstable CPUs so the CPU can only run at full speed for a few hours per day.

Working with the Example Code

Once you give us your GitHub account name, you will be invited into a private GitHub repository (owned by us) that you can use and that contains a naive solution. The naive solution is a copy of the lsde2022-assignment1 repository, which you can clone in the VM. Once you have access to your private GitHub repo, please clone that one and work from there.

Here are the instructions to clone and build the code (assuming you have an SSH connection set up):

git clone git@github.com:lsde-course/lsde2022-yourname.git
cd lsde2022-yourname
make
Compiling the provided code results in the cruncher executable, which can be run with a data directory, a test-query file and an output file as parameters, e.g.:

./cruncher /opt/lsde/dataset-sf100 queries-test.csv out.csv

The example solution goes through the persons table sequentially, and for each person P accesses the interests table, counting how many of the artists A2, A3, A4 are liked; this count is the score. For all P who do not like A1 and have score > 0, it visits all friends F that P knows (a sequential access in the knows table). For each F, it checks whether F has the same location as P, whether F already likes A1, and whether F also knows P (the mutuality check). All these checks involve random access, but especially the last one is expensive, as it involves a random lookup of all of F's friends (to see whether P is among them) in the large knows table. If all this succeeds, (score, P, F) is added to a result table.
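For illustration, here is a simplified sketch of that nested loop in C. The struct fields follow the binary format described below; the names person_count, A1..A4 and the helpers likes(), knows() and add_result() are hypothetical, and the real cruncher.c differs in detail (for instance, it also applies the D1..D2 date range from the query):

/* Simplified sketch, not the actual cruncher.c: likes(), knows() and
   add_result() are hypothetical helpers, and the date check is left out. */
for (unsigned int p = 0; p < person_count; p++) {
    struct Person *pp = &person_map[p];
    int score = 0, likes_a1 = 0;
    /* sequential pass over P's interests: count A2, A3, A4 and note A1 */
    for (unsigned int i = 0; i < pp->interest_n; i++) {
        unsigned short a = interest_map[pp->interests_first + i];
        if (a == A1) likes_a1 = 1;
        if (a == A2 || a == A3 || a == A4) score++;
    }
    if (likes_a1 || score == 0) continue;
    /* sequential pass over P's slice of knows.bin */
    for (unsigned int k = 0; k < pp->knows_n; k++) {
        unsigned int f = knows_map[pp->knows_first + k];  /* offset into person.bin */
        struct Person *pf = &person_map[f];
        if (pf->location != pp->location) continue;   /* random access into person.bin */
        if (!likes(pf, A1)) continue;                 /* random access into interest.bin */
        if (!knows(pf, p)) continue;                  /* expensive: scans all of F's friends */
        add_result(score, pp->person_id, pf->person_id);
    }
}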

This cruncher program reads a binary representation of the CSV files (.bin files) and a query set and produces the required output in a rather straightforward (read: slow) manner. Afterwards, you may inspect the out.csv file with the results.

The queries-test.csv file contains several test query sets, with the structure Query.id|A1|A2|A3|A4|D1|D2 as explained above:

1|2811|1990|5183|1749|2015-04-09|2015-05-09
2|11254|143|2807|2799|2015-04-30|2015-05-30
3|7517|568|2820|6945|2015-07-30|2015-08-29
4|782|2008|1022|8|2015-08-27|2015-09-26
5|206|2837|808|7509|2015-04-23|2015-05-23
...

The cruncher program will create the out.csv file, with the structure Query.id|Score|P|F:

1|2|23749085|23166928
1|2|50823520|7696632439459
1|2|3298594899508|16492734279256
1|2|5497600779230|12094670330460
1|2|6597123132851|17592239440354
...

You can verify that the program created the correct output by comparing it with the file queries-test-output-sf100.csv or queries-test-output-sf3000.csv, depending on which dataset you used:

diff out.csv queries-test-output-sf100.csv

If this outputs nothing, all is well.

You must run your cruncher program inside the VM in order to mimic the resource-constrained conditions in which we will evaluate your solution. You can develop inside the VM (using a basic editor like vim; other editors can be installed with sudo apt-get install). An alternative way of working is to clone your repository on your laptop, develop there, push to a git branch from your laptop, then pull that branch on the VM, make cruncher there and run it. We suggest using a branch here, as the leaderboard starts testing as soon as you push to the master branch. So if you are developing and testing, please push to some development branch. Once you want your code evaluated on the leaderboard, check out master on the VM, merge your development branch and push master.

Visual Studio Code: Remote SSH

The most comfortable way of working is probably to install Visual Studio Code (a.k.a. VS Code) and the Remote SSH extension pack. You then connect to the VM running on the same laptop via SSH. When connecting to the VirtualBox/x86 machine, use the ubuntu@localhost:2224 target. For the UTM/M1 virtual machine, use the ubuntu@localhost target.

Working with Binary Data Files (.bin)

Lesson: textual data such as CSV, XML or JSON is very inefficient to store (and also to query). Note that the CSV files are not needed for running the programs, as they use the binary (.bin) files, described here.

We used a loader program to produce a binary representation of the CSV data. The loader throws away the unused columns and adds offset columns to "person" that point into the "likes" and "knows" arrays, so the graph becomes easier to navigate. The "knows" data in this binary form also identifies persons by their offset rather than by their person-ID, for the same reason. This data format is considerably more compact than the original CSV files, which is why we chose to distribute the sf3000 dataset inside the VM in this binary form and not as CSV files. In essence, you have to use these binary data files, but you can take a look at the sf100 CSVs to get an idea of what the data looks like. The smaller dataset (LDBC scale factor 100) is intended for fast experiments and correctness testing; the larger one (sf3000) will be used to measure your program's performance.

The binary representation of the datasets consists of three files, person.bin, interest.bin and knows.bin. These are direct representations of the in-memory layout of C arrays of the respective types, hence they can be memory-mapped directly (see cruncher.c for details).

  • person.bin: Array of struct Person (see utils.h for its exact specification)
    • person_id: Person.id from CSV
    • birthday: Birthday in short representation (MMDD)
    • location: Location id
    • knows_first: Array offset into knows.bin
    • knows_n: Number of entries in knows.bin for this person
    • interests_first: Array offset into interest.bin
    • interest_n: Number of entries in interest.bin for this person
  • interest.bin : Array of unsigned short containing interest ids
  • knows.bin : Array of unsigned int, contains array offset into person.bin
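To make this concrete, here is a minimal, hypothetical sketch of how person.bin can be memory-mapped as a plain C array (this is essentially what the mmapr() helper in cruncher.c does for you; struct Person comes from utils.h, and error handling is omitted for brevity):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include "utils.h"                       /* struct Person */

struct Person *map_persons(const char *path, size_t *n_persons) {
    int fd = open(path, O_RDONLY);       /* error handling omitted */
    struct stat st;
    fstat(fd, &st);
    struct Person *persons = (struct Person *)
        mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                           /* the mapping stays valid after close */
    *n_persons = st.st_size / sizeof(struct Person);
    return persons;                      /* persons[i].knows_first, persons[i].interest_n, ... */
}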

Your cruncher program works on this binary data format, rather than the CSV files.

Your reorg program (assignment 1b only) should also work on this binary data, and produce some new binary data files, of your own design, in the same data directory (do not overwrite the existing binary files). You should then also modify cruncher to use these new files. To read binary files, the easiest is to use the mmapr() memory-mapping call that you already find in cruncher.c: it just gives you the binary file as an array without any further I/O hassle. In order to write new files, do not use memory-mapped I/O: please use C standard I/O (fopen/fwrite/fclose) or the C++ equivalent. The current codebase is C, but it also compiles as C++.
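As a hypothetical example of the writing side (the record layout, file name and function here are made up for illustration, not prescribed by the assignment), a reorg program could dump an array of its own structs with plain fwrite, so that cruncher can later memory-map the new file:

#include <stdio.h>
#include <stdlib.h>

typedef struct {                 /* example record layout of your own design */
    unsigned int person_offset;
    unsigned short location;
} MyRecord;

void write_records(const char *data_dir, const MyRecord *records, size_t n) {
    char path[1024];
    snprintf(path, sizeof(path), "%s/myindex.bin", data_dir);   /* new file in the data directory */
    FILE *f = fopen(path, "wb");
    if (!f) { perror(path); exit(1); }
    if (fwrite(records, sizeof(MyRecord), n, f) != n) {          /* raw dump of the array */
        perror("fwrite");
        exit(1);
    }
    fclose(f);
}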

Debugging your programs

Coding in C/C++ can be frustrating (here is a crash course), certainly when you have a bug. It is very likely that when this happens, some offset is wrong (e.g. you needed knows_map[offset] rather than offset) and the program crashes with a SEGV. We recommend to ssh/PuTTY into the VM and use gdb to debug. In the Makefile, you would need to replace the -O3 flag by -g and recompile from clean. Do not forget to revert the Makefile when you push to your master branch to trigger leaderboard testing (because -O3 produces much faster code).

You can run a program in the debugger with gdb --args ./cruncher plus the rest of the arguments (and similarly for ./reorg). Here is a gdb walkthrough. In gdb the most important commands are:
  • r = run program,
  • b linenumber = break at line number,
  • c = continue,
  • n = execute one statement,
  • p variable = print variable
So you could just do p person_map[offset] and you would effectively be looking at the contents of the person.bin file (but in a much more readable form than with some hexadecimal dump tool).

Competing on the Leaderboard

There is an automatic testing infrastructure (see Leaderboard in the menu on the right) that takes your code from your git URL, compiles it, runs it (first the reorganiser, if present) and then runs three benchmark queries. For the benchmark queries, the median query time is recorded. There is a timeout: if everything together takes longer than 10 minutes, the run is a failure. Further, the query results need to be correct. We provide you with 10 query parameters and correct answers that are representative of (but not equal to) the secret parameter sets that we use for performance evaluation and correctness testing on the Leaderboard.

The benchmarking process works as follows. You will be assigned a private github repository to work in. The testing infrastructure is polling that repository, and once a new version is pushed, it will:

  1. Pull your code from git
  2. Run make
  3. If no binary named cruncher is produced, abort
  4. If a binary named reorg is produced, run it with a single parameter: the folder name where the binary data of the sf3000 dataset is stored. This is your opportunity to reorganize the binary data.
  5. Run the test queries using the binary named "cruncher" with three parameters as described above
  6. Compare the output with the reference output. If there are discrepancies, abort
  7. Run the benchmark queries and record the time
  8. Record the median runtime and post it on the leaderboard

Please note that we run these tests in a virtual machine that is almost identical to the Quickstart VM detailed above. So 1 GB of main memory and a single CPU is all that's available there. Beware that the test environment runs on a relatively slow disk, comparable to a magnetic (spinning) disk. So if you run your VM on a laptop with an SSD, your machine could be much faster than the benchmark environment.

For assignment 1b, please uncomment reorg in the Makefile as a default make target, so that the leaderboard machine actually compiles and runs it.

Assignment 1c: doing this in Spark

An example implementation is available in the lsde2022-assignment1c repository for you to study. However, you should work in the private repository (lsde2021_groupXX) that we created for your group.

Reminder: edit the settings of the virtual machine in VirtualBox/UTM so that the VM has 4GB RAM and 2 cores. This is the configuration the leaderboard will be testing with in assignment 1c. In assignment 1c we work on a laptop, rather than on AWS, because instances with 4GB and 2 cores are not available in the free tier.

In this last part of assignment 1 (1c), rather than low-level C/C++ programming, we raise the abstraction level to DataFrames in Spark. This part of the assignment is intended to get you started with Spark, which will be useful for assignment 2 in general. But it is also a reality check, because as you will see, the raw performance of Spark on CSV files is much lower than that of a hand-optimized C program. Spark's resource usage is also much higher than that of the hand-coded program: the dataset at scale factor 100 (30x smaller than the one tested in 1a/1b) already makes the VM sweat, even though it has 4x the RAM.

There are plenty of resources on the web to get some basic overview of working with DataFrames in Spark. One specific pointer is the Databricks page that explains by example the Python API for DataFrames.

We provide a Jupyter notebook for experimenting. In this notebook you find two Scala functions: reorg() and cruncher(). These functions have the obvious parameters, and for the moment reorg() is empty. It is your task to make reorg() reorganize the data. The minimal things to do are:
  1. make reorg() eliminate non-required knows
  2. make reorg() write back any files that you use in Parquet format
  3. adapt cruncher() to make use of the files (i.e. DataFrames) you create in reorg()

On the virtual machines, just run jupyter notebook.