LSDE 2018 - Large Scale Data Engineering

LSDE2015 · LSDE2015-2016 · LSDE2016 · LSDE2017 · LSDE2018 · VU Canvas

LSDE: Large Scale Data Engineering 2018

Your Task

Scenario: Birthday Cross-Selling

The marketeers of a social network have been data mining the musical preferences of their users. They have built statistical models which predict given an interest in say artists A2 and A3, that the person would also like A1 (i.e. rules of the form: A2 and A3 => A1). Now, they are commercially exploiting this knowledge by selling targeted ads to the management of artists who, in turn, want to sell concert tickets to the public but in the process also want to expand their artists' fan-base.

The ad is a suggestion for people who already are interested in A1 to buy concert tickets of artist A1 (with a discount!) as a birthday present for a friend ("who we know will love it" - the social network says) who lives in the same city, who is not yet interested in A1 yet, but is interested in other artists A2, A3 and A4 that the data mining model predicts to be correlated with A1.

Dataset: LDBC Social Network

The dataset of this assignment is a slightly modified subgraph generated by the LDBC social network benchmark, which models a social network much like Facebook. This is a synthetically generated dataset, but it contains complex patterns and correlations that are also found in real life. For instance, people from a certain country often have a name that is typical for that country. Also, people who grew up in the same location or have similar interests are much more likely to know each other than random persons. The original dataset contains a lot of information on where the persons studied or work, as well as their wall posts, pictures, and likes etc. However, for this task, we only operate on users, their location, their "knows" graph and their interests.

Dataset

Every user account in the social network is represented as a line in the person.csv file:

Person.id|firstName|lastName|gender|birthday|creationDate|locationIP|browserUsed|locatedIn
420|K.|Reddy|female|1981-09-08|2010-02-01T04:10:19.506+0000|14.102.233.188|Safari|152
7489|Shweta|Rao|male|1985-11-07|2010-03-27T18:13:25.167+0000|61.11.117.174|Internet Explorer|177
9037|Meera|Khan|female|1984-03-17|2010-03-14T07:06:54.725+0000|59.176.223.143|Chrome|178
12860|Aditya|Sharma|female|1982-08-20|2010-02-19T12:51:10.863+0000|103.1.115.239|Firefox|271
[...]

The last column (locatedIn) denotes cities; these are encoded as numbers here. Note that the actual data files do not contain the field names in the first line.

The social graph is formed by persons being vertices and edges entries in the "knows.csv" file, where each edge is represented by two Person IDs (knows.csv):

Person.id|Person.id|
420|12860
420|8150883
420|1099518059227
420|2199027829087
...

From the first entry together with the information from person.csv, we see that person "K. Reddy" knows "Aditya Sharma". Please note that the connection is uni-directional (Twitter "Follower" semantics) Thus, if P knows F, P may or may not know F.

Finally, the social network knows for each person a number of favorite artists (musical interests). The interests are encoded in a separate "interest.csv" file as combinations of (personID,tagID), where tagID is a number and represents an artist:

Person.id|Tag.id|
420|1
7489|3
7489|138
7489|142
...

There are 16080 different artists. As in real life, the popularity of the artists is quite skewed.

Please be aware that the connection structure of this graph is highly complex and correlated. The graph is one huge connected component, and everyone knows everyone within 4 steps. While in a large network as Facebook (1.3 billion users) the average amount of friends is above 400, in smaller networks like here it is lower. The dataset you operate just has 8.9M persons (LDBC "scale factor" sf3000), who know on average 149 persons (and on average have 23 interests).

Query Program

You must implement, hence, a program that gets as parameters:

a date range [D1..D2]
target artist A1, and
related artists A2, A3 and A4.

For all persons P who have their birthday on or in between D1..D2 (e.g. next month), who do not like A1 yet, we give a score of 1 for liking any of the artists A2, A3 and A4 and 0 if not (the final score, the sum, hence is a number between 0 and 3). Further, we look for friends F of P (people who know each other mutually) who live in the same city and already like A1. The answer of the query is a table (score, P, F) containing only scores > 0 and ordered descending on score, and for equal scores ascending on P (a person ID) and if again equal ascending on F (also a personID).

You can implement this query program in any way you like, but it better be efficient. We will test the query multiple times with different birthdate and artist parameters. After all, this is repetitive business for the social network marketeers, as time passes by and other people will have their birthday, and as they sell the idea of such an ad campaign to the managers of different artists.

In assignment 1a the competition is about writing a version of the query program that actually finishes withing the benchmarking timeout on the large dataset. You will see that this isnot so easy. The way to achieve this is to restrict the random access pattern that the naive implementation performs, seperating the accesses in stages, such that in each stage only one part of the data structures is accessed randomly. Specifically, the check for bi-directional friendship, which randomly accesses the largest array (knows.bin) should be the last stage, because in the last stage there are only few candidate persons/friends in play left.

Reorg Program

Rather than running on the textual CSV, you can prepare some better data representation of your choice. For this purpose, before the queries are run, a "reorg" program that you can provide, is executed. In assignment 1b, the reorg program should reduce the size of the data, where the most important hint is that only friendships between people from the same location are relevant. There are potentially more things that you might do in reorg that will help the queries getting faster (which you may also try). In assignment 1c the data structures are now CSV files, which as you will see make the data size much bigger. Therefore, you should in assignment 1c convert these datastes to a better format (Parquet) and generally try to make Spark do the optimizations that you did manually before in assignment 1b.

Peter Boncz · Hannes Mühleisen