We provide a ready-made Linux virtual machine for the VirtualBox system for download (9 GB). Once downloaded, this machine can be imported into VirtualBox and started. The VM contains the datasets in the folder /opt/lsde and sample code in /home/admin/leaderboard-demo. To log in, use the username "admin" and the password "randomaccess". The virtual machine is also configured to open up SSH access on port 2224.
You can run the cruncher program with the data directory, the test queries and an output file as parameters, e.g.:
./cruncher /opt/lsde/dataset-sf100 queries-test.csv out.csv
Afterwards, you may inspect the out.csv file with the results.
There is a GitHub repository that contains example code to get you started. The cruncher program reads a binary representation of the CSV files and a query set and produces the required output in a rather straightforward (read: slow) manner.
The example solution goes through the persons P sequentially, counting how many of the artists A2, A3, A4 are liked as the score. For each P with score > 0, it visits all persons F known to P. For each F (note that touching F has a random access pattern), it checks whether F has the same location, whether F already likes A1, and whether F also knows P (because friends are people who mutually know each other in this social network). If all of this succeeds, (score, P, F) is added to a result table.
There is also a loader program, which converts the CSV files into the binary representation. There are two versions of the dataset available online: "sf100" (about 220 MB) and "sf3000" (5.3 GB). We have already run the loader program on both of these, producing .bin files. The smaller dataset is for fast experiments; the larger one will be used to measure your program's performance.
The queries-test.csv file contains several test query sets, with the structure Query.id|A1|A2|A3|A4|D1|D2 as explained above:
1|2811|1990|5183|1749|2015-04-09|2015-05-09
2|11254|143|2807|2799|2015-04-30|2015-05-30
3|7517|568|2820|6945|2015-07-30|2015-08-29
4|782|2008|1022|8|2015-08-27|2015-09-26
5|206|2837|808|7509|2015-04-23|2015-05-23
...
The cruncher program will create the out.csv file, with the structure Query.id|Score|P|F:
1|2|23749085|23166928
1|2|50823520|7696632439459
1|2|3298594899508|16492734279256
1|2|5497600779230|12094670330460
1|2|6597123132851|17592239440354
...
You can verify that the program created the correct output by comparing it with the file queries-test-output-sf100.csv or queries-test-output-sf3000.csv, depending on which dataset you used:
diff out.csv queries-test-output-sf100.csv
If this outputs nothing, all is well.
There is an automatic testing infrastructure that takes your code from your git URL, compiles it and runs it: first the reorganiser (if present), then the ten test queries, then ten benchmark queries. For the benchmark queries, the median query time is recorded. There is a timeout: if everything together takes longer than one hour, the run is a failure. Further, the query results need to be correct. We provide you with 10 query parameter sets and their correct answers, which are representative of (but not equal to) the secret 10 query parameter sets that we use for performance evaluation and correctness testing.
The benchmarking process works as follows. You register your public Git repository URL at the leaderboard. It will then start polling that repository. Once something is checked in, it will run:
make
If no cruncher executable is produced, the run is aborted. If a reorg executable is produced, it is run with a single parameter: the folder name where the binary data of the sf3000 dataset is stored. This is your opportunity to reorganize the binary data.
Please note that we run these tests in a virtual machine that is almost identical to the Quickstart VM detailed above, so 1 GB of main memory and a single CPU is all that is available there. Beware that the test environment runs on a magnetic (spinning) disk. So if you run your VM on a laptop with an SSD drive, your machine could be much faster than the benchmark environment. The Amazon EC2 micro instances use an emulated disk, which might get cached in a RAM cache of the EBS Amazon storage layer, and could therefore potentially also be faster than the benchmarking environment.
The example loader program produces a binary representation of the CSV data that throws away the unused columns and adds to "person" offset columns into the "likes" and "knows" arrays, so the graph becomes easier to navigate. The "knows" data in this binary form also identifies persons by their offset rather than by their person ID, for the same reason. This dataset is considerably smaller than the original CSV files, which is why we chose to distribute the sf3000 dataset inside the VM in this binary form and not as CSV files.
Your reorg program could work on this binary data format rather than the CSV files. Therefore, we explain its structure here.
The binary representation of the datasets consists of three files: person.bin, interest.bin and knows.bin. These are direct representations of the in-memory layout of C arrays of the respective types, hence they can be memory-mapped directly (see cruncher.c for details).
person.bin : array of struct Person (see utils.h for its exact specification)
  person_id       : Person.id from CSV
  birthday        : Birthday in short representation (MMDD)
  location        : Location id
  knows_first     : Array offset into knows.bin
  knows_n         : Number of entries in knows.bin for this person
  interests_first : Array offset into interest.bin
  interest_n      : Number of entries in interest.bin for this person
interest.bin : array of unsigned short, containing interest ids
knows.bin : array of unsigned int, each containing an array offset into person.bin