We provide ready-made Linux virtual machines (VMs).
On x86-based systems, use VirtualBox with the ldbc-vm.ova image (18 GB).
Once downloaded, this machine can be imported into VirtualBox and started.
On systems with the Apple M1 chip, use the UTM virtualization app and the image in lsde-m1.zip (12 GB, grows to 20+ GB when uncompressed).
The default settings of 1 GB RAM and 1 core are intended for assignments 1a and 1b.
For assignment 1c, please increase the memory to 4 GB and the number of cores to 2 or more (in VirtualBox, go to Settings/Instellingen => System/Systeem, set the memory to 4096 MB, and select 2+ CPUs on the Processor tab; in UTM, edit the machine's settings).
The VM contains the sample datasets in the folder /opt/lsde
and the required dependencies (gcc, Spark, etc.).
To log in, use the username "ubuntu" and the password "ubuntu".
To connect to the VM on Linux and macOS, run
ssh -p 2224 ubuntu@A.B.C.D (for the VirtualBox/x86 VM)
ssh -p 22 ubuntu@A.B.C.D (for the UTM/M1 VM).
On Windows, use the PuTTY application instead of ssh (and save the session for later reuse).
The virtual machines contain the datasets in binary format, but they are also available separately: SF100, SF3000.
Those using Amazon in assignments 1a and 1b should register for the free tier (credit card info required, but all this should cost $0), then in the management console go to EC2 and launch a "Community AMI" instance. Searching for "lsde" in the Frankfurt data center (click the top-right menu bar to switch data centers), you will find the lsde2022 image, which can be launched on a micro instance for free.
In the process, you will need to create an SSH key pair and launch the instance. When it is running, note down its IP address and SSH into it using username "ubuntu" and your key, e.g. ssh -i mykey.pem ubuntu@A.B.C.D
in a Linux or macOS terminal, or with a Windows PuTTY client (in that case, first use the puttygen utility to convert the PEM key from Amazon into a PuTTY Private Key file; include that key file under the 'SSH' tab and its 'Auth' subtab in the PuTTY connection setup window).
Warning: the free-tier t* instances in EC2 have burstable CPUs, so the CPU can only run at full speed for a few hours per day.
Once you give us your github account name, you will be invited to a private github repository (owned by us) that contains a naive solution for you to start from. This naive solution is a copy of the lsde2022-assignment1 repository, which you can clone in the VM. Once you have access to your private github repository, please clone that one and work from there.
Here are the instructions to clone and build the code (assuming you have an SSH connection set up):
git clone git@github.com:lsde-course/lsde2022-yourname.git
cd lsde2022-yourname
make
Compiling the provided code results in the cruncher executable, which can be run with the data directory, the test queries and an output file as parameters, e.g.:
./cruncher /opt/lsde/dataset-sf100 queries-test.csv out.csv
The example solution goes through the persons table sequentially and, for each person P, accesses the interests table, counting how many of the artists A2, A3, A4 are liked; this count is the score. For every P who does not like A1 and has score > 0, it visits all friends F that P knows (a sequential access in the knows table). For each F, it checks whether F has the same location as P, whether F already likes A1, and whether F also knows P (the mutuality check). All these checks involve random access, but the last check in particular is expensive, as it requires a random lookup of all of F's friends (to see whether P is among them) in the large knows table. If all checks succeed, (score, P, F) is added to a result table.
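To make this concrete, here is a minimal C sketch of that loop for a single query. It is an illustration only, not the literal cruncher.c code: the struct Person fields and the array names follow the binary-format description further below, and the helper names (likes, naive_query) are made up for this sketch.

/* Sketch of the naive evaluation loop for one query (A1..A4).
 * person_map, knows_map and interest_map stand for the memory-mapped
 * person.bin, knows.bin and interest.bin arrays. */
#include <stddef.h>
#include "utils.h"   /* assumed to define struct Person */

static int likes(const struct Person *p, const unsigned short *interest_map,
                 unsigned short artist) {
    for (unsigned int i = 0; i < p->interest_n; i++)
        if (interest_map[p->interests_first + i] == artist) return 1;
    return 0;
}

static void naive_query(const struct Person *person_map, size_t nr_persons,
                        const unsigned int *knows_map,
                        const unsigned short *interest_map,
                        unsigned short a1, unsigned short a2,
                        unsigned short a3, unsigned short a4) {
    for (size_t po = 0; po < nr_persons; po++) {
        const struct Person *p = &person_map[po];
        /* score: how many of A2, A3, A4 does P like? */
        int score = likes(p, interest_map, a2) + likes(p, interest_map, a3)
                  + likes(p, interest_map, a4);
        if (score == 0 || likes(p, interest_map, a1)) continue; /* P must not like A1 */

        /* sequential scan over P's friends in knows.bin */
        for (unsigned int k = 0; k < p->knows_n; k++) {
            const struct Person *f = &person_map[knows_map[p->knows_first + k]];
            if (f->location != p->location) continue;        /* same location */
            if (!likes(f, interest_map, a1)) continue;       /* F already likes A1 */
            /* mutuality check: random lookup over all of F's friends */
            for (unsigned int j = 0; j < f->knows_n; j++) {
                if (knows_map[f->knows_first + j] == po) {
                    /* emit (score, p->person_id, f->person_id) to the result table */
                    break;
                }
            }
        }
    }
}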
This cruncher program reads a binary representation of the CSV files (.bin files) and a query set and produces the required output in a rather straightforward (read: slow) manner.
Afterwards, you may inspect the out.csv file with the results.
The queries-test.csv file contains several test query sets, with the structure Query.id|A1|A2|A3|A4|D1|D2, as explained above:
1|2811|1990|5183|1749|2015-04-09|2015-05-09
2|11254|143|2807|2799|2015-04-30|2015-05-30
3|7517|568|2820|6945|2015-07-30|2015-08-29
4|782|2008|1022|8|2015-08-27|2015-09-26
5|206|2837|808|7509|2015-04-23|2015-05-23
...
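As an illustration, one way such a line could be parsed in C is sketched below; cruncher.c contains its own parsing code, so this only serves to clarify the format (the function name and buffer sizes are hypothetical).

/* Sketch: parsing one line of queries-test.csv (Query.id|A1|A2|A3|A4|D1|D2). */
#include <stdio.h>

int parse_query_line(const char *line, unsigned *qid, unsigned artist[4],
                     char d1[11], char d2[11]) {
    /* returns 1 on success; dates are kept as 10-character strings */
    return sscanf(line, "%u|%u|%u|%u|%u|%10[^|]|%10s",
                  qid, &artist[0], &artist[1], &artist[2], &artist[3], d1, d2) == 7;
}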
The cruncher program will create the out.csv file, with the structure Query.id|Score|P|F:
1|2|23749085|23166928
1|2|50823520|7696632439459
1|2|3298594899508|16492734279256
1|2|5497600779230|12094670330460
1|2|6597123132851|17592239440354
...
You can verify that the program created the correct output by comparing it with queries-test-output-sf100.csv or queries-test-output-sf3000.csv, depending on which dataset you used:
diff out.csv queries-test-output-sf100.csv
If this outputs nothing, all is well.
You must run your cruncher program inside the VM in order to mimic the resource-constrained conditions under which we will evaluate your solution. You can develop inside the VM (using a basic editor like vim; you can install other editors with sudo apt-get install). An alternative way of working is to clone your repository on your laptop, develop there, push to a git branch from your laptop, pull that branch on the VM, and make and run cruncher there. We suggest using a branch here because the leaderboard starts testing as soon as you push to the master branch. So while you are developing and testing, please push to a development branch; once you want a version evaluated on the leaderboard, check out master on the VM, merge your development branch, and push master.
The most comfortable way of working is probably to install Visual Studio Code (a.k.a. VS Code) together with its Remote SSH extension pack. You then connect to the VM running on the same laptop via SSH. When connecting to the VirtualBox/x86 machine, use the ubuntu@localhost:2224 target; for the UTM/M1 virtual machine, use the ubuntu@localhost target.
Lesson: textual data such as CSV, XML or JSON is very inefficient to store (and also to query). Note that the CSV files are not needed for running the programs, as they use the binary (.bin) files described below.
We used a loader program to produce a binary representation of the CSV data. The loader throws away the unused columns and adds offset columns to "person" that point into the "likes" and "knows" arrays, so the graph becomes easier to navigate. The "knows" data in this binary form also identifies persons by their offset rather than by their person-ID, for the same reason. This data format is considerably more compact than the original CSV files; therefore, we chose to distribute the sf3000 dataset inside the VM in this binary form and not as CSV files. In essence, you have to use these binary data files, but you can take a look at the sf100 CSVs to get an idea of what the data looks like. The smaller dataset (LDBC scale factor 100) is intended for fast experiments and correctness testing; the larger one (sf3000) will be used to measure your program's performance.
The binary representation of the datasets consists of three files: person.bin, interest.bin and knows.bin. These are direct representations of the in-memory layout of C arrays of the respective types, hence they can be memory-mapped directly (see cruncher.c for details).
person.bin: an array of struct Person (see utils.h for its exact specification), with the fields:
- person_id: Person.id from the CSV
- birthday: birthday in short representation (MMDD)
- location: location id
- knows_first: array offset into knows.bin
- knows_n: number of entries in knows.bin for this person
- interests_first: array offset into interest.bin
- interest_n: number of entries in interest.bin for this person
interest.bin: an array of unsigned short values containing interest ids
knows.bin: an array of unsigned int values, each containing an array offset into person.bin
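For illustration, a minimal sketch of mapping person.bin into memory is shown below. It assumes struct Person from utils.h and the dataset path used earlier; the mmapr() helper in cruncher.c presumably wraps similar open/fstat/mmap calls, so use that in your own code.

/* Minimal sketch: memory-mapping person.bin as a C array of struct Person. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include "utils.h"                      /* defines struct Person */

int main(void) {
    const char *path = "/opt/lsde/dataset-sf100/person.bin";
    int fd = open(path, O_RDONLY);
    struct stat sb;
    if (fd < 0 || fstat(fd, &sb) < 0) { perror(path); return 1; }

    /* the file is the raw in-memory layout of a struct Person array,
       so mapping it yields a directly usable pointer */
    const struct Person *person = mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (person == MAP_FAILED) { perror("mmap"); return 1; }

    size_t nr_persons = sb.st_size / sizeof(struct Person);
    printf("%zu persons; first person_id = %lu, knows %u others\n",
           nr_persons,
           (unsigned long) person[0].person_id,
           (unsigned) person[0].knows_n);

    munmap((void *) person, sb.st_size);
    close(fd);
    return 0;
}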
Your cruncher program works on this binary data format, rather than the CSV files.
Your reorg program (assignment 1b only) should also work on this binary data, and produce new binary data files, of your own design, in the same data directory (do not overwrite the existing binary files). You should then also modify cruncher to use these new files. To read binary files, the easiest approach is the mmapr() memory-mapping call that you already find in cruncher.c: it simply gives you the binary file as an array without any further I/O hassle. To write new files, do not use memory-mapped I/O: please use C standard I/O (fopen/fwrite/fclose) or the C++ equivalent. The current codebase is C, but it also compiles as C++.
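As a sketch of what writing such a derived file could look like: the example below extracts the location column into a new file. The file name location.bin, the function name, and this particular reorganization are purely illustrative, not a prescribed design; reorg receives the data directory as its single argument, and struct Person comes from utils.h.

/* Sketch: writing a new binary file from reorg (assignment 1b) using
 * standard I/O. person_map is assumed to be the person.bin array you
 * obtained via mmapr(), and nr_persons its length. */
#include <stdio.h>
#include <stddef.h>
#include "utils.h"

static int write_locations(const char *dir, const struct Person *person_map,
                           size_t nr_persons) {
    char path[1024];
    snprintf(path, sizeof(path), "%s/location.bin", dir);
    FILE *f = fopen(path, "wb");            /* standard I/O, not mmap */
    if (!f) return -1;
    for (size_t i = 0; i < nr_persons; i++) {
        unsigned short loc = (unsigned short) person_map[i].location;
        fwrite(&loc, sizeof(loc), 1, f);     /* one value per person */
    }
    fclose(f);
    return 0;
}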
Coding in C/C++ can be frustrating (here is a crash course), especially when you have a bug. It is very likely that when this happens, some offset is wrong (e.g. you needed knows_map[offset] rather than offset) and the program crashes with a SEGV. We recommend ssh-ing (or using PuTTY) into the VM and using gdb to debug. In the Makefile, you would need to replace the -O3 flag by -g and recompile from a clean build. Do not forget to revert the Makefile when you push to your master branch to trigger leaderboard testing (because -O3 produces much faster code).
You can run a program in the debugger with gdb --args ./cruncher followed by the rest of the arguments (and similarly for ./reorg). Here is a gdb walkthrough. The most important gdb commands are run, break, backtrace (bt), print, next, step and continue.
There is an automatic testing infrastructure (see Leaderboard in the menu on the right) that automatically takes your code from your git URL, compiles it, and runs it (first the reorganiser, if present), then three benchmark queries. For the benchmark queries, the median query time is recorded. There is a timeout: if everything takes longer than 10 minutes, the run is a failure. Furthermore, the query results need to be correct. We provide you with 10 query parameters and correct answers that are representative of (but not equal to) the secret parameter sets that we use for performance evaluation and correctness testing in the Leaderboard.
The benchmarking process works as follows. You will be assigned a private github repository to work in. The testing infrastructure polls that repository, and once a new version is pushed, it will:
1. run make;
2. if no cruncher executable is produced, abort;
3. if a reorg executable is produced, run it with a single parameter: the folder name where the binary data of the sf3000 dataset is stored (this is your opportunity to reorganize the binary data);
4. run cruncher on the benchmark queries.
Please note that we run these tests in a virtual machine which is almost identical to the Quickstart VM detailed above, so 1 GB of main memory and a single CPU is all that is available there. Beware that the test environment runs on a relatively slow disk, comparable to a magnetic (spinning) disk; if you run your VM on a laptop with an SSD drive, your machine may be much faster than the benchmark environment.
For assignment 1b, please uncomment reorg in the Makefile as a default make target, so that the leaderboard machine actually compiles and runs it.
An example implementation is available in the lsde2022-assignment1c repository for you to study. However, you should work in the private repository (lsde2022_groupXX) that we created for your group.
Reminder: edit the settings of the virtual machine in VirtualBox/UTM so that the VM has 4 GB RAM and 2 cores. This is the configuration the leaderboard will be testing with in assignment 1c. In assignment 1c we work on a laptop rather than on AWS, because instances with 4 GB and 2 cores are not available in the free tier.
In this last part of assignment 1 (1c), rather than low-level C/C++ programming, we raise the abstraction level to DataFrames in Spark. This part of the assignment is intended to get you started with Spark, which will also be useful for assignment 2. But it is also a reality check, because, as you will see, the raw performance of Spark on CSV files is much lower than that of a hand-optimized C program. Spark's resource usage is also much higher than that of the hand-coded program: the dataset at scale factor 100 (30x smaller than the one tested in 1a/1b) already makes the VM sweat, even though it has 4x the RAM.
There are plenty of resources on the web to get a basic overview of working with DataFrames in Spark. One specific pointer is the Databricks page that explains the Python API for DataFrames by example.
We provide a Jupyter notebook for experimenting; on the virtual machines, just run jupyter notebook to start it. In this notebook you will find two Scala functions, reorg() and cruncher(). These functions have the obvious parameters, and for the moment reorg() is empty. It is your task to make reorg() reorganize the data.