The marketeers of a social network have been data mining the musical preferences of their users. They have built statistical models which predict given an interest in say artists A2 and A3, that the person would also like A1 (i.e. rules of the form: A2 and A3 => A1). Now, they are commercially exploiting this knowledge by selling targeted ads to the management of artists who, in turn, want to sell concert tickets to the public but in the process also want to expand their artists' fan-base.
The ad is a suggestion for people who already are interested in A1 to buy concert tickets of artist A1 (with a discount!) as a birthday present for a friend ("who we know will love it" - the social network says) who lives in the same city, who is not yet interested in A1 yet, but is interested in other artists A2, A3 and A4 that the data mining model predicts to be correlated with A1.
The dataset of this assignment is a slightly modified subgraph generated by the LDBC social network benchmark, which models a social network much like Facebook. This is a synthetically generated dataset, but it contains complex patterns and correlations that are also found in real life. For instance, people from a certain country often have a name that is typical for that country. Also, people who grew up in the same location or have similar interests are much more likely to know each other than random persons. The original dataset contains a lot of information on where the persons studied or work, as well as their wall posts, pictures, and likes etc. However, for this task, we only operate on users, their location, their "knows" graph and their interests.
Every user account in the social network is represented as a line in the person.csv file:
Person.id|firstName|lastName|gender|birthday|creationDate|locationIP|browserUsed|locatedIn 420|K.|Reddy|female|1981-09-08|2010-02-01T04:10:19.506+0000|14.102.233.188|Safari|152 7489|Shweta|Rao|male|1985-11-07|2010-03-27T18:13:25.167+0000|61.11.117.174|Internet Explorer|177 9037|Meera|Khan|female|1984-03-17|2010-03-14T07:06:54.725+0000|59.176.223.143|Chrome|178 12860|Aditya|Sharma|female|1982-08-20|2010-02-19T12:51:10.863+0000|103.1.115.239|Firefox|271 [...]
The last column (locatedIn) denotes cities; these are encoded as numbers here. Note that the actual data files do not contain the field names in the first line.
The social graph is formed by persons being vertices and edges entries in the "knows.csv" file, where each edge is represented by two Person IDs (knows.csv):
Person.id|Person.id| 420|12860 420|8150883 420|1099518059227 420|2199027829087 ...
From the first entry together with the information from person.csv, we see that person "K. Reddy" knows "Aditya Sharma". Please note that the connection is uni-directional (Twitter "Follower" semantics) Thus, if P knows F, P may or may not know F.
Finally, the social network knows for each person a number of favorite artists (musical interests). The interests are encoded in a separate "interest.csv" file as combinations of (personID,tagID), where tagID is a number and represents an artist:
Person.id|Tag.id| 420|1 7489|3 7489|138 7489|142 ...
There are 16080 different artists. As in real life, the popularity of the artists is quite skewed.
Please be aware that the connection structure of this graph is highly complex and correlated. The graph is one huge connected component, and everyone knows everyone within 4 steps. While in a large network as Facebook (1.3 billion users) the average amount of friends is above 400, in smaller networks like here it is lower. The dataset you operate just has 8.9M persons (LDBC "scale factor" sf3000), who know on average 149 persons (and on average have 23 interests).
You must implement, hence, a program that gets as parameters:
You can implement this query program in any way you like (see below for details), but it better be efficient. We will test the query multiple times with different birthdate and artist parameters. After all, this is repetitive business for the social network marketeers, as time passes by and other people will have their birthday, and as they sell the idea of such an ad campaign to the managers of different artists.
Rather than running on the textual CSV, you can prepare some better data representation of your choice. For this purpose, before the queries are run, a "reorg" program that you can provide, is executed. In the example code we provide, this data reorganization translates the CSV files into binary data. You should think about what data you need (and don't need, maybe some parts can even be omitted) and the shape in which you would best store the data, to allow the queries to run fast.