LSDE: Large Scale Data Engineering 2021
Practical Information
Getting a Virtual Machine

We provide ready-made Linux virtual machines (VMs). On x86-based systems, use the VirtualBox system and the ldbc-vm.ova image (link removed). Once downloaded, this machine can be imported into VirtualBox and started. On systems with the Apple M1 chip, use the UTM virtualization app and the image in lsde-m1.zip (link removed).
The settings with 1GB RAM and 1 core are for assignments 1a and 1b. For assignment 1c, please change the memory size to 3.5GB and the number of cores to 2 or more (in VirtualBox, go to Settings/Instellingen => System/Systeem, set 3584MB there and 2+ CPUs on the Processor tab; in UTM, edit the machine's settings). The VM contains the sample datasets in the folder /opt/lsde and the required dependencies (gcc, Spark, etc.). To log in, use the username "ubuntu" and the password "ubuntu".
To connect to the VM, either get the IP address of the machine and run ssh ubuntu@A.B.C.D, or set up a port-forwarding rule from the internal IP address of the machine and the SSH port (e.g. 10.0.2.15:22) to the host machine (localhost:2224) and run ssh -p 2224 ubuntu@localhost. On Windows, the equivalent of ssh is PuTTY; save the session there for reuse.

The virtual machines contain the data sets in binary format but they are also available separately: SF100, SF3000.

Those using Amazon in assignments 1a and 1b should register for the free tier (credit card info required, but all of this should cost $0), then in the management console go to EC2 and launch a "Community AMI" instance. Searching for "lsde" in the Ohio data-center (use the top-right menu bar to switch data-centers) you will find the lsde2021 image, which can be launched on a "micro instance" for free. Doing so, you will need to create an ssh key pair and launch the instance. When it is running, note down the IP address and SSH into it using the username "ubuntu" and your key, e.g. ssh -i mykey.pem ubuntu@1.2.3.4 in a Linux or Mac OS X terminal, or a Windows PuTTY client (in that case, first use the puttygen utility to convert the PEM key from Amazon into a PuTTY Private Key file; include that key file under the 'SSH' tab and its 'Auth' subtab in the PuTTY connection setup window).

Warning: the free-tier t* instances in EC2 have burstable CPUs so the CPU can only run at full speed for a few hours per day.

Working with the Example Code

Once you give us your github account name, you will be invited into a private github repository (owned by us) that you can use and that contains a naive solution. The naive solution is a copy of the lsde2021-assignment1 repository, which you can clone in the VM. Once you have access to your private github repo, please clone that one and work from there.

Inside the repository you can run make to compile; subsequently, you can run the cruncher program with the data directory, the test-query file and an output file as parameters, e.g.:

./cruncher /opt/lsde/dataset-sf100-bidirectional queries-test.csv out.csv

This cruncher program reads a binary representation of the CSV files (.bin files) and a query set and produces the required output in a rather straightforward (read: slow) manner. Afterwards, you may inspect the out.csv file with the results.

The queries-test.csv file contains the test queries, one per line, with the structure Query.id|A1|A2|A3|A4|D1|D2 as explained above (a small C parsing sketch follows these example lines):

1|1989|1990|5183|1749|2015-04-09|2015-05-09
2|2788|568|2820|6945|2015-07-30|2015-08-29
3|775|2008|1022|8|2015-08-27|2015-09-26
4|2788|1989|1023|7380|2015-09-24|2015-10-24
5|139|2837|808|7509|2015-04-23|2015-05-23
...
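
For those who want to handle these query parameters in their own C code, a line like the ones above could be parsed roughly as follows. This is a sketch only: parse_query_line is an invented name for this illustration, and the provided cruncher.c may parse the file differently.

#include <stdio.h>

/* Illustrative parser for one line of queries-test.csv
   (Query.id|A1|A2|A3|A4|D1|D2); the provided cruncher.c may do this differently. */
int parse_query_line(const char *line)
{
    int query_id, a1, a2, a3, a4;
    char d1[11], d2[11];                       /* YYYY-MM-DD plus terminating zero */
    if (sscanf(line, "%d|%d|%d|%d|%d|%10[0-9-]|%10[0-9-]",
               &query_id, &a1, &a2, &a3, &a4, d1, d2) != 7)
        return -1;                             /* malformed line */
    printf("query %d: artists %d %d %d %d, window %s..%s\n",
           query_id, a1, a2, a3, a4, d1, d2);
    return 0;
}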

The cruncher program will create the out.csv file, with the following structure

Query.id|TotalScore|P1|P2|P3

and content:
1|4|3298555005280|1099531721546|8796113134938
2|4|9895651605584|1099558292668|5497605077977
3|4|13194158914249|13194158885086|15393182153412
4|4|4398066818896|2199043561421|3298555180256
5|4|12094690422868|6597132305763|10995179134501
5|4|12094690422868|6597132305763|13194202231225
...

You can verify that the program created the correct output by comparing it with queries-test-output-sf100.csv or queries-test-output-sf3000.csv, depending on which dataset you used:

diff out.csv queries-test-output-sf100.csv

If this outputs nothing, all is well.

You must run your cruncher program inside the VM in order to mimic the resource-constrained conditions we will evaluate your solution in. You can either develop inside the VM (using a basic editor like vim; other editors can be installed with sudo apt-get install), or clone your repository on your laptop, develop there, push to a git branch, pull that branch on the VM, and build and run the cruncher there. We suggest using a branch here, as the leaderboard starts testing as soon as you push to the master branch. So while you are developing and testing, please push to some development branch. Once you want it evaluated on the leaderboard, check out master on the VM, merge your development branch and push master.

If you don't want to bother with branches and want to develop outside the VM, you can also simply scp your source file(s) to the VM after changing them.

The example solution goes through the persons P1 sequentially, filtering them by their birthday and their interest in A1. Then, it uses nested loops to find persons P2 and P3, who are both required to have a score of at least 2 (based on interest in artists A2, A3, A4), are required to live in the same city as P1, and whose friendships P1-P2-P3 need to form a triangle. This is checked by navigating from P1 to P2, from P2 to P3, and from P3 to P4 (with a final check that P4 = P1). While navigating from P1 to P2 is still sequential access in knows.bin, the other two navigational steps cause random access in person.bin and knows.bin. Ultimately, if all checks pass for a given triangle, the tuple (TotalScore, P1, P2, P3) is added to a result table, where TotalScore is P2.score + P3.score.
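
In rough pseudo-C, the naive search has the following shape. This is only a sketch: the helpers birthday_in_range(), likes_artist(), score() and add_result() are invented for this illustration, and the real code in cruncher.c is organized differently; the mapped arrays person_map and knows_map are described in the next section.

/* Sketch of the naive nested-loop triangle search (illustrative only). */
for (unsigned p1 = 0; p1 < nr_persons; p1++) {
    struct Person *P1 = &person_map[p1];
    if (!birthday_in_range(P1->birthday) || !likes_artist(P1, A1))
        continue;
    for (unsigned i = 0; i < P1->knows_n; i++) {              /* P1 -> P2 */
        unsigned p2 = knows_map[P1->knows_first + i];
        struct Person *P2 = &person_map[p2];
        unsigned score2 = score(P2, A2, A3, A4);              /* 0..3 */
        if (P2->location != P1->location || score2 < 2)
            continue;
        for (unsigned j = 0; j < P2->knows_n; j++) {          /* P2 -> P3 */
            unsigned p3 = knows_map[P2->knows_first + j];
            struct Person *P3 = &person_map[p3];
            unsigned score3 = score(P3, A2, A3, A4);
            if (P3->location != P1->location || score3 < 2)
                continue;
            for (unsigned k = 0; k < P3->knows_n; k++)        /* P3 -> P4 */
                if (knows_map[P3->knows_first + k] == p1)     /* P4 == P1: triangle */
                    add_result(score2 + score3, p1, p2, p3);
        }
    }
}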

Working with Binary Data Files (.bin)

Lesson: textual data such as CSV, XML or JSON is very inefficient to store (and also to query). Note that the CSV files are not needed for running the programs, as they use the binary (.bin) files, described here.

We used a loader program to produce a binary representation of the CSV data. The loader throws away the unused columns and adds to "person" offset columns that point into the "likes" and "knows" arrays, so the graph becomes easier to navigate. The "knows" data in this binary form also identifies persons by their offset rather than by their person-ID, for the same reason. This data format is considerably more compact than the original CSV files; therefore we chose to distribute the sf3000 dataset inside the VM in this binary form and not as CSV files. In essence, you have to use these binary data files, but you can take a look at the sf100 CSV to get an idea of what the data looks like. The smaller dataset (LDBC scale factor 100) is intended for fast experiments and correctness testing; the larger one (sf3000) will be used to measure your program's performance.

The binary representation of the datasets consists of three files: person.bin, interest.bin and knows.bin. These are direct representations of the in-memory layout of C arrays of the respective types, hence they can be memory-mapped directly (see cruncher.c for details; an illustrative sketch appears below).

  • person.bin: Array of struct Person (see utils.h for its exact specification)
    • person_id: Person.id from CSV
    • birthday: Birthday in short representation (MMDD)
    • location: Location id
    • knows_first: Array offset into knows.bin
    • knows_n: Number of entries in knows.bin for this person
    • interests_first: Array offset into interest.bin
    • interest_n: Number of entries in interest.bin for this person
  • interest.bin: Array of unsigned short containing interest ids
  • knows.bin: Array of unsigned int containing array offsets into person.bin

Your cruncher program works on this binary data format, rather than the CSV files.
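
To make the layout concrete, the sketch below shows what such a Person record and a single navigation step could look like. It is illustrative only: utils.h contains the authoritative struct definition and the exact field types and order there may differ, and friend_of() is an invented helper.

/* Illustrative sketch of the Person record; see utils.h for the real definition. */
struct Person {
    unsigned long long person_id;        /* Person.id from the CSV                       */
    unsigned short     birthday;         /* short MMDD form, e.g. 409                    */
    unsigned short     location;         /* location id                                  */
    unsigned int       knows_first;      /* first entry for this person in knows.bin     */
    unsigned int       knows_n;          /* number of knows entries                      */
    unsigned int       interests_first;  /* first entry for this person in interest.bin  */
    unsigned int       interest_n;       /* number of interest entries                   */
};

/* Because knows.bin stores person offsets (not ids), following one friendship
   edge is just two array lookups into the memory-mapped files. */
unsigned int friend_of(const struct Person *person_map, const unsigned int *knows_map,
                       unsigned int p, unsigned int k)       /* requires k < knows_n */
{
    return knows_map[person_map[p].knows_first + k];
}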

Your reorg program (assignment 1b only) should also work on this binary data and produce some new binary data files, of your own design, in the same data directory (do not overwrite the existing binary files). You should then also modify cruncher to use these new files. To read binary files, the easiest way is to use the mmapr() memory-mapping call that you already find in cruncher.c: it just gives you the binary file as an array without any further I/O hassle. To write new files, do not use memory-mapped I/O: please use C standard I/O (fopen/fwrite/fclose) or the C++ equivalent. The current codebase is C, but it also compiles as C++.
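
To illustrate the write side, a reorg could derive a new column and write it with standard I/O along these lines. This is a sketch under assumptions: person_map and nr_persons are supposed to have been obtained by mmapr()-ing person.bin as in cruncher.c, struct Person comes from utils.h, and the file name birthday.bin is made up for this example.

#include <stdio.h>
#include "utils.h"                    /* struct Person, as provided in the repository */

/* Sketch: write the birthday of every person as a separate binary column file.
   The name "birthday.bin" is invented; pick your own file layout in assignment 1b. */
int write_birthday_column(const char *dir, const struct Person *person_map,
                          unsigned int nr_persons)
{
    char path[1024];
    snprintf(path, sizeof(path), "%s/birthday.bin", dir);
    FILE *f = fopen(path, "wb");
    if (!f) { perror("fopen"); return -1; }
    for (unsigned int i = 0; i < nr_persons; i++) {
        unsigned short bday = person_map[i].birthday;
        if (fwrite(&bday, sizeof(bday), 1, f) != 1) { fclose(f); return -1; }
    }
    return fclose(f);
}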

Debugging your programs

Coding in C/C++ can be frustrating (here is a crash course), especially when you have a bug. It is very likely that when this happens, some offset is wrong (e.g. you needed knows_map[offset] rather than offset) and the program crashes with a SEGV. We recommend to ssh/putty into the VM and use gdb to debug. In the Makefile, you would need to replace the -O3 flag with -g and recompile from clean. Do not forget to revert the Makefile when you push to your master branch to trigger leaderboard testing (because -O3 produces much faster code).
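
While hunting such an offset bug, it can also help to temporarily guard the suspect lookups with assertions, roughly as sketched below. checked_person() and nr_persons are invented names for this illustration; remove the checks again before benchmarking, as they cost time in the hot loop.

#include <assert.h>
#include "utils.h"                 /* struct Person, as provided in the repository */

/* Temporary guard around a suspect lookup: fails immediately at the check
   instead of crashing with a SEGV somewhere later in the program. */
struct Person *checked_person(struct Person *person_map, unsigned int nr_persons,
                              unsigned int offset)
{
    assert(offset < nr_persons);   /* offset must be a valid index into person.bin */
    return &person_map[offset];
}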

You can run a program in the debugger with gdb --args ./cruncher and the rest of the arguments (and similarly for ./reorg). Here is a gdb walkthrough. In gdb the most important commands are:
  • r = run program,
  • b linenumber = break at line number,
  • c = continue,
  • n = execute one statement,
  • p variable = print variable
So you could just do p person_map[offset] and you would effectively be looking at the contents of the person.bin file (but in a much more readable form than with some hexadecimal dump tool).

Competing on the Leaderboard

There is an automatic testing infrastructure (see Leaderboard in the menu on the right) that takes your code from your git URL, compiles it, runs it (first the reorganiser, if present), and then runs three benchmark queries. For the benchmark queries, the median query time is recorded. There is a timeout: if everything together takes longer than 10 minutes, the run is a failure. Further, the query results need to be correct. We provide you with 10 query parameters and correct answers that are representative of (but not equal to) the secret parameter sets that we use for performance evaluation and correctness testing in the Leaderboard.

The benchmarking process works as follows. You will be assigned a private github repository to work in. The testing infrastructure is polling that repository, and once a new version is pushed, it will:

  1. Pull your code from git
  2. Run make
  3. If no binary named cruncher is produced, abort
  4. If a binary named reorg is produced, it is run with a single parameter: the folder name where the binary data of the sf3000 dataset is stored. This is your opportunity to reorganize the binary data.
  5. Run the ten test queries using the binary named "cruncher" with three parameters, as described above
  6. Compare the output with the reference output. If there are discrepancies, abort
  7. Run the benchmark queries and record the time
  8. Record the median runtime and post it on the leaderboard

Please note that we run these tests in a virtual machine that is almost identical to the VM detailed above. So 1 GB of main memory and a single CPU is all that's available there. Beware that the test environment runs on a magnetic (spinning) disk; if you run your VM on a laptop with an SSD, your machine could be much faster than the benchmark environment. The Amazon EC2 micro instances use a network filesystem (EBS), where data might get cached in a RAM cache of the Amazon storage layer, and could therefore also be faster than the benchmarking environment.

For assignment 1b, please uncomment reorg in the Makefile as a default make target, so that the leaderboard machine actually compiles and runs it.

Assignment 1c: doing this in Spark

An example implementation is available in the lsde2021-assignment1c repository for you to study. However, you should work in the private repository (lsde2021_groupXX) that we created for your group.

Reminder: edit the settings of the virtual machine in VirtualBox/UTM so that the VM has 4GB RAM and 2 cores. This is the configuration the leaderboard will test with in assignment 1c. In assignment 1c we work on a laptop rather than on AWS, because instances with 4GB and 2 cores are not available in the free tier.

In this last part of assignment 1 (1c), rather than low-level C/C++ programming, we raise the abstraction level to DataFrames in Spark. This part of the assignment is intended to get you started with Spark, which is useful for assignment 2 in general. But it is also a reality check, because as you will see, the raw performance of Spark on CSV files is much lower than that of a hand-optimized C program. Spark's resource usage is also much higher than that of the hand-coded program: the dataset at scale factor 100 (30x smaller than the one tested in 1a/1b) already makes the VM sweat, even though it has 4x the RAM.

There are plenty of resources on the web to get some basic overview of working with DataFrames in Spark. One specific pointer is the Databricks page that explains by example the Python API for DataFrames.

We provide a Jupyter notebook for experimenting. In this notebook you will find two Scala functions: reorg() and cruncher(). These functions have the obvious parameters, and for the moment reorg() is empty. It is your task to make reorg() reorganize the data. The minimal things to do are:
  1. make reorg() eliminate non-required knows
  2. make reorg() write back any files that you use in Parquet format
  3. adapt cruncher() to make use of the files (i.e. DataFrames) you create in reorg()

On the virtual machines, run pip3 install pyspark, then run jupyter notebook.