LSDE 2016 - Large Scale Data Engineering


LSDE: Large Scale Data Engineering 2016

Dataset. Wikipedia publishes page view statistics for its projects. We have collected a ~820GB collection of this data from 2014 (will be updated to 2016). Tip: The page names in those files are recorded before redirects etc. are resolved. It might be a good idea to use the Wikipedia database dumps to resolve those first. Also, normalizing accesses by the sum of clicks on the observed pages might help to reduce skew.
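The two tips above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: each input line is assumed to have the four space-separated fields `project title count bytes`, the `redirects` dictionary is assumed to have been extracted from the Wikipedia database dumps beforehand, and the function names `resolve` and `normalized_views` are made up for this example.

```python
from collections import defaultdict


def resolve(title, redirects, max_hops=10):
    """Follow redirects until a non-redirect title is reached.

    `redirects` maps a redirecting title to its target (assumed to come
    from the database dumps). Cycles and long chains are cut off.
    """
    seen = set()
    while title in redirects and title not in seen and len(seen) < max_hops:
        seen.add(title)
        title = redirects[title]
    return title


def normalized_views(lines, redirects):
    """Sum views per resolved page and divide by the total over all pages."""
    totals = defaultdict(int)
    for line in lines:
        parts = line.split(" ")
        if len(parts) != 4:
            continue  # skip malformed lines (assumed 4-field format)
        project, title, count, _size = parts
        totals[(project, resolve(title, redirects))] += int(count)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    # Each page's share of all observed clicks; shares sum to 1.
    return {key: views / grand_total for key, views in totals.items()}
```

Dividing by the total over the observed pages turns raw counts into shares, so hours or days with very different overall traffic become comparable.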

W1: Trending News Topics. Map Wikipedia pages to mentions in news articles. Find "trending" pages. Who is following whom: do things trend first on Wikipedia and then in newspapers, or the other way around? Build an interactive visualization that lets users explore trending topics and shows associated articles in generated timelines.
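One simple way to approach the "who follows whom" question is to compare, per topic, the first date on which it trends in each source. The sketch below assumes such first-trending dates have already been computed; the function name `lead_lag` and both input dictionaries are hypothetical and not taken from the project.

```python
from datetime import date


def lead_lag(wiki_first, news_first):
    """Days by which Wikipedia leads the news, per topic.

    Positive: the topic trended on Wikipedia first.
    Negative: it trended in the news first.
    Only topics present in both inputs are compared.
    """
    shared = wiki_first.keys() & news_first.keys()
    return {t: (news_first[t] - wiki_first[t]).days for t in shared}
```

For example, a topic first trending on Wikipedia on 1 March and in the news on 3 March gets a value of 2.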

Summary. The project derived trending news topics from the 2014 Volkskrant archive on a single computer. It then used the cluster to process the Wikipedia data with Spark, creating lists of the Top-100 viewed pages (disambiguating redirects in the process) for different time granularities (week, month, quarter, year). In a separate step, trending pages were distilled from those lists.
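The Top-N-per-period step can be illustrated in plain Python; this is a single-machine stand-in for the Spark job, not the project's code. The record layout `(day, title, views)` and the function names are assumptions. The summary does not say how trending pages were distilled, so `newly_trending` uses a made-up criterion: a page counts as trending in a period if it is in that period's top list but was not in the previous one.

```python
from collections import Counter, defaultdict
from datetime import date


def bucket(day, granularity):
    """Map a date to a sortable period label for the given granularity."""
    if granularity == "week":
        iso = day.isocalendar()  # (ISO year, ISO week, weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    if granularity == "month":
        return f"{day.year}-{day.month:02d}"
    if granularity == "quarter":
        return f"{day.year}-Q{(day.month - 1) // 3 + 1}"
    if granularity == "year":
        return str(day.year)
    raise ValueError(f"unknown granularity: {granularity}")


def top_pages(records, granularity, n=100):
    """Return {period: [titles of the n most viewed pages]}."""
    counts = defaultdict(Counter)
    for day, title, views in records:
        counts[bucket(day, granularity)][title] += views
    return {p: [t for t, _ in c.most_common(n)] for p, c in counts.items()}


def newly_trending(top_by_period):
    """Hypothetical criterion: in this period's top list, not in the previous one."""
    periods = sorted(top_by_period)
    result = {}
    for prev, cur in zip(periods, periods[1:]):
        before = set(top_by_period[prev])
        result[cur] = [t for t in top_by_period[cur] if t not in before]
    return result
```

In Spark the same shape would typically be a keyed aggregation per (period, page) followed by a per-period ranking; the period labels are zero-padded so that plain string sorting gives chronological order within one granularity.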

Data curiosity: **
Related work: *
Technical difficulties: mastered **
Visualization coolness: *