LSDE: Large Scale Data Engineering 2018
Practical Information
Quick Start

We provide a ready-made Linux virtual machine for the VirtualBox system for download (9 GB). Once downloaded, this machine can be imported into VirtualBox and started. The VM contains the datasets (in the folder /opt/lsde) and sample code (in /home/admin/leaderboard-demo). To log in, use the username "admin" and the password "randomaccess". The virtual machine is also configured to open up SSH access on port 2224.

Those using Amazon should register for the free tier (credit card info required, but all of this should cost $0), then go to EC2 in the management console and launch a "Community AMI" instance. Searching for "lsde" in the Ohio data center (use the top-right menu bar to switch data centers), you will find the lsde-2018 image, which can be launched on a micro instance for free. Doing so, you will need to create an SSH key pair and launch the instance. When it is running, note down its IP address and SSH into it using the username "fedora" and your key, e.g. ssh -i mykey.pem fedora@1.2.3.4 in a Linux or Mac OS X terminal, or with the putty client on Windows (in that case, first use the puttygen utility to convert the PEM key from Amazon into a Putty Private Key file, and include that key file under the 'ssh' tab and its 'Auth' subtab in the putty connection setup window).

You can run the cruncher program with the data directory, a query file and an output file as parameters, e.g.:

./cruncher /opt/lsde/dataset-sf100 queries-test.csv out.csv

Afterwards, you may inspect the out.csv file with the results.

Code and Data

There is a GitHub repository that contains example code to get you started. The cruncher program reads a binary representation of the CSV files and a query set, and produces the required output in a rather straightforward (read: slow) manner. Each group will get a private GitHub repository that starts from this content (you must email the instructors your GitHub user names and mention your group number to get access). Please work in this private repository once you get access.

The example solution goes through the persons P sequentially, counting how many of the artists A2, A3, A4 each P likes as the score. For every P with score > 0, it visits all persons F known to P. For each F (note that touching F has a random access pattern), it checks whether F lives in the same location, whether F already likes A1, and whether F also knows P (because friends are people who mutually know each other in this social network). If all these checks succeed, (score, P, F) is added to a result table.

There is also a loader program, which converts the CSV files into the binary representation. There are two versions of the dataset available online, "sf100" (about 220 MB) and "sf3000" (5.3 GB). We have already run the loader on both of these, producing .bin files. The binary files are much smaller, and are already included in the VM images we provide for practical1. The smaller dataset (LDBC scale factor 100) is for fast experiments; the larger one (sf3000) will be used to measure your program's performance.

The queries-test.csv file contains several test query sets, with the structure Query.id|A1|A2|A3|A4|D1|D2 as explained above:

1|2811|1990|5183|1749|2015-04-09|2015-05-09
2|11254|143|2807|2799|2015-04-30|2015-05-30
3|7517|568|2820|6945|2015-07-30|2015-08-29
4|782|2008|1022|8|2015-08-27|2015-09-26
5|206|2837|808|7509|2015-04-23|2015-05-23
...

The cruncher program will create the out.csv file, with the structure Query.id|Score|P|F:

1|2|23749085|23166928
1|2|50823520|7696632439459
1|2|3298594899508|16492734279256
1|2|5497600779230|12094670330460
1|2|6597123132851|17592239440354
...

You can verify that the program created the correct output by comparing it with the file queries-test-output-sf100.csv or queries-test-output-sf3000.csv, depending on which dataset you used:

diff out.csv queries-test-output-sf100.csv

If this outputs nothing, all is well.

Benchmarking

There is an automatic testing infrastructure that takes your code from your git URL, compiles it, and runs it: first the reorganiser (if present), then the ten test queries, then the ten benchmark queries. For the benchmark queries, the median query time is recorded. There is a timeout: if everything takes longer than one hour, the run is a failure. Furthermore, the query results need to be correct. We provide you with ten query parameter sets and correct answers that are representative of (but not equal to) the secret ten query parameter sets that we use for performance evaluation and correctness testing.

The benchmarking process works as follows. You will be assigned a private GitHub repository to work in. The testing infrastructure polls that repository, and once a new version is pushed, it will:

  1. pull your code from git
  2. run make
  3. if no binary named cruncher is produced, abort
  4. if a binary named reorg is produced, run it with a single parameter: the folder name where the binary data of the sf3000 dataset is stored. This is your opportunity to reorganize the binary data. (This step is skipped in assignment 1a!)
  5. run the ten test queries using the cruncher binary with the three parameters described above
  6. compare the output with the reference output; if there are discrepancies, abort
  7. run the ten benchmark queries and record the time
  8. record the median runtime and post it on the leaderboard

Please note that we run these tests in a virtual machine that is almost identical to the Quick Start VM described above, so 1 GB of main memory and a single CPU is all that is available there. Beware that the test environment runs on a magnetic (spinning) disk: if you run your VM on a laptop with an SSD, your machine could be much faster than the benchmark environment. The Amazon EC2 micro instances use a network filesystem, where data may get cached in a RAM cache of Amazon's EBS storage layer, and could therefore also be faster than the benchmarking environment.

Binary Dataset structure

The example loader program produces a binary representation of the CSV data that throws away the unused columns and adds offset columns to "person" that point into the "likes" and "knows" arrays, so the graph becomes easier to navigate. In this binary form, the "knows" data also identifies persons by their offset rather than by their person-ID, for the same reason. This dataset is considerably smaller than the original CSV files, which is why we chose to distribute the sf3000 dataset inside the VM in this binary form and not as CSV files.

Your reorg program should work on this binary data format rather than on the CSV files. It is in fact much easier to deal with, and we do not even provide the CSV files for the sf3000 data size (they would be very large and make the VM quite big). Therefore, we explain its structure here.

The binary representation of the datasets consists of three files, person.bin, interest.bin and knows.bin. These are direct representations of the in-memory layout of C arrays of the respective types, hence they can be memory-mapped directly (see cruncher.c for details).

  • person.bin : Array of struct Person (see utils.h for its exact specification)
    • person_id: Person.id from CSV
    • birthday: Birthday in short representation (MMDD)
    • location: Location id
    • knows_first: Array offset into knows.bin
    • knows_n: Number of entries in knows.bin for this person
    • interests_first: Array offset into interest.bin
    • interest_n: Number of entries in interest.bin for this person
  • interest.bin : Array of unsigned short containing interest ids
  • knows.bin : Array of unsigned int, containing array offsets into person.bin