LSDE 2016 - Large Scale Data Engineering

LSDE2015 · LSDE2015-2016 · LSDE2016 · VU BlackBoard

LSDE: Large Scale Data Engineering 2016

Your Task

Scenario: Birthday Cross-Selling

The marketeers of a social network have been data mining the musical preferences of their users. They have built statistical models which predict given an interest in say artists A2 and A3, that the person would also like A1 (i.e. rules of the form: A2 and A3 => A1). Now, they are commercially exploiting this knowledge by selling targeted ads to the management of artists who, in turn, want to sell concert tickets to the public but in the process also want to expand their artists' fan-base.

The ad is a suggestion for people who already are interested in A1 to buy concert tickets of artist A1 (with a discount!) as a birthday present for a friend ("who we know will love it" - the social network says) who lives in the same city, who is not yet interested in A1 yet, but is interested in other artists A2, A3 and A4 that the data mining model predicts to be correlated with A1.

Dataset: LDBC Social Network

The dataset of this assignment is a slightly modified subgraph generated by the LDBC social network benchmark, which models a social network much like Facebook. This is a synthetically generated dataset, but it contains complex patterns and correlations that are also found in real life. For instance, people from a certain country often have a name that is typical for that country. Also, people who grew up in the same location or have similar interests are much more likely to know each other than random persons. The original dataset contains a lot of information on where the persons studied or work, as well as their wall posts, pictures, and likes etc. However, for this task, we only operate on users, their location, their "knows" graph and their interests.

Dataset

Every user account in the social network is represented as a line in the person.csv file:

Person.id|firstName|lastName|gender|birthday|creationDate|locationIP|browserUsed|locatedIn
420|K.|Reddy|female|1981-09-08|2010-02-01T04:10:19.506+0000|14.102.233.188|Safari|152
7489|Shweta|Rao|male|1985-11-07|2010-03-27T18:13:25.167+0000|61.11.117.174|Internet Explorer|177
9037|Meera|Khan|female|1984-03-17|2010-03-14T07:06:54.725+0000|59.176.223.143|Chrome|178
12860|Aditya|Sharma|female|1982-08-20|2010-02-19T12:51:10.863+0000|103.1.115.239|Firefox|271
[...]

The last column (locatedIn) denotes cities; these are encoded as numbers here. Note that the actual data files do not contain the field names in the first line.

The social graph is formed by persons being vertices and edges entries in the "knows.csv" file, where each edge is represented by two Person IDs (knows.csv):

Person.id|Person.id|
420|12860
420|8150883
420|1099518059227
420|2199027829087
...

From the first entry together with the information from person.csv, we see that person "K. Reddy" knows "Aditya Sharma". Please note that the connection is uni-directional (Twitter "Follower" semantics) Thus, if P knows F, P may or may not know F.

Finally, the social network knows for each person a number of favorite artists (musical interests). The interests are encoded in a separate "interest.csv" file as combinations of (personID,tagID), where tagID is a number and represents an artist:

Person.id|Tag.id|
420|1
7489|3
7489|138
7489|142
...

There are 16080 different artists. As in real life, the popularity of the artists is quite skewed.

Please be aware that the connection structure of this graph is highly complex and correlated. The graph is one huge connected component, and everyone knows everyone within 4 steps. While in a large network as Facebook (1.3 billion users) the average amount of friends is above 400, in smaller networks like here it is lower. The dataset you operate just has 8.9M persons (LDBC "scale factor" sf3000), who know on average 149 persons (and on average have 23 interests).

Query Program

You must implement, hence, a program that gets as parameters:

a date range [D1..D2]
target artist A1, and
related artists A2, A3 and A4.

For all persons P who have their birthday on or in between D1..D2 (e.g. next month), who do not like A1 yet, we give a score of 1 for liking any of the artists A2, A3 and A4 and 0 if not (the final score, the sum, hence is a number between 0 and 3). Further, we look for friends F of P (people who know each other mutually) who live in the same city and already like A1. The answer of the query is a table (score, P, F) containing only scores > 0 and ordered descending on score, and for equal scores ascending on P (a person ID) and if again equal ascending on F (also a personID).

You can implement this query program in any way you like (see below for details), but it better be efficient. We will test the query multiple times with different birthdate and artist parameters. After all, this is repetitive business for the social network marketeers, as time passes by and other people will have their birthday, and as they sell the idea of such an ad campaign to the managers of different artists.

Reorg Program

Rather than running on the textual CSV, you can prepare some better data representation of your choice. For this purpose, before the queries are run, a "reorg" program that you can provide, is executed. In the example code we provide, this data reorganization translates the CSV files into binary data. You should think about what data you need (and don't need, maybe some parts can even be omitted) and the shape in which you would best store the data, to allow the queries to run fast.

Peter Boncz · Hannes Mühleisen