LSDE: Large Scale Data Engineering 2016
Data Analysis Projects

In this assignment the goal is to get experience with a real Big Data challenge by attempting to solve this on a large Hadoop cluster. Computation for all projects will be performed on the SurfSara Hadoop cluster. We will make all datasets mentioned below available there.

Access to the cluster is via ssh login.hathi.surfsara.nl.

Datasets and Topics

In the following, we describe the datasets and project topics teams can work on. Some datasets are used by several projects.

ADS-B Plane Tracking

Commercial airplanes periodically send out radio messages containing their position details (plane identifier, flight number, latitude, longitude, height, speed, ...). These ADS-B messages are picked up by enthusiasts and collected in systems such as the OpenSky network or Flightradar24. We have obtained ~200 GB of compressed ADS-B messages from September 2015.

  • P2: Flight Routes. Determine standard flight routes between pairs of airports. Detect and visualize flights (and airlines) that diverge from them. Also determine and visualize how much CO2 could be saved by flying directly.
  • P3: Flight Visualization. Generate an interactive flight path animation (GIF?) of all flights based on their accurate location data. Speed up time. Reduce the number of flights if necessary through stratified sampling of diverse flight routes. Non-interactive Example.
  • P4: Rescue Choppers. Investigate the use and coverage of rescue and medical helicopters. Where are such helicopters located? Which areas are effectively covered by these helicopters, and which are not? Can you relate these areas to population density and maybe suggest shifts in coverage? How long does the average flight take and how much does this cost?
  • P5: Unscheduled Private Jets. Classify and visualize private jet flight patterns. Where are they flying to, which airports are unusual, etc.? Would you be able to detect "rendition flights"?
  • P6: Cause for Emergency. Detect flights that made an emergency stop. Use access to the historical Twitter archive to determine the cause of the emergency.
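For P2, the potential CO2 saving can be estimated by comparing the length of the actually flown track against the direct great-circle distance between the two airports. A minimal sketch in plain Python, assuming a track is a list of (lat, lon) fixes and using a hypothetical flat per-km emission factor (the real figure depends on aircraft type and load):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    R = 6371.0  # mean Earth radius in km
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

def flown_distance_km(track):
    """Sum of leg distances over a list of (lat, lon) positions."""
    return sum(haversine_km(*p, *q) for p, q in zip(track, track[1:]))

def co2_savings_kg(track, kg_co2_per_km=90.0):
    """Excess emissions versus flying the direct great-circle route.
    kg_co2_per_km is a placeholder value, not a validated emission factor."""
    direct = haversine_km(*track[0], *track[-1])
    return (flown_distance_km(track) - direct) * kg_co2_per_km
```

Any detour yields a positive saving; a perfectly direct flight yields zero.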

Dataset HDFS storage location: /user/hannesm/lsde/opensky

AIS Ship Tracking

Commercial ships periodically send out messages containing their position details (ship identifier, latitude, longitude, speed, ...). These AIS messages are collected and visualized in systems such as MarineTraffic. We have ~26 GB of compressed AIS messages (TXT/AIVDM format) over a period of two weeks.

  • S1: Shipping Safety. Find almost-collisions in the English Channel and visualize them on an interactive map. Do so by reconstructing ship paths and computing distances in the latitude, longitude, and time dimensions. Take ship information into account, e.g. by looking up ship details via the IMO number and the maneuverability that results from the ship type.
  • S2: Running for Oil. Identify oil tankers (e.g. using the IMO number), group them by company and country, and identify their trips, trip speed, or even specific loitering. Try to correlate oil transportation and travel speed with the oil price. Is there more or less traffic when oil prices are high or low? Can we predict future oil prices from movement on the ocean? Are ships delaying discharge (loitering) while prices are rising, and accelerating when prices are dropping?
  • S3: Suspicious Outage. Reconstruct ship trajectories with a focus on incomplete trajectories that, given the reception range, would be expected to be complete: ship trajectories where for some reason the AIS transmitter had been turned off during parts of the trip. Find incidents involving such ships (e.g. using the IMO number), group them to find the most suspicious ones, and visualize their routes.
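For S1, almost-collisions can be detected by interpolating two ships' reconstructed paths to common timestamps and checking their separation. A minimal sketch, assuming each track is a time-sorted list of (timestamp, lat, lon) fixes and using a fast equirectangular distance approximation (adequate for the short separations that matter here):

```python
from math import radians, cos, sqrt

EARTH_RADIUS_KM = 6371.0

def equirect_km(lat1, lon1, lat2, lon2):
    """Flat-Earth distance approximation; good enough at sub-km scales."""
    x = radians(lon2 - lon1) * cos(radians((lat1 + lat2) / 2))
    y = radians(lat2 - lat1)
    return EARTH_RADIUS_KM * sqrt(x * x + y * y)

def position_at(track, t):
    """Linearly interpolate (lat, lon) at time t from sorted (t, lat, lon) fixes."""
    for (t0, la0, lo0), (t1, la1, lo1) in zip(track, track[1:]):
        if t0 <= t <= t1:
            f = (t - t0) / (t1 - t0)
            return la0 + f * (la1 - la0), lo0 + f * (lo1 - lo0)
    return None  # t outside the track's time range

def near_misses(track_a, track_b, threshold_km=0.5, step_s=10):
    """Timestamps at which the interpolated separation drops below threshold_km."""
    t = max(track_a[0][0], track_b[0][0])
    t_end = min(track_a[-1][0], track_b[-1][0])
    hits = []
    while t <= t_end:
        pa, pb = position_at(track_a, t), position_at(track_b, t)
        if pa and pb and equirect_km(*pa, *pb) < threshold_km:
            hits.append(t)
        t += step_s
    return hits
```

The threshold would in practice depend on ship size and maneuverability, which is where the IMO lookup comes in.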

Dataset HDFS storage location: /user/hannesm/lsde/ais

NOAA Climate Measurements

The US National Oceanic and Atmospheric Administration (NOAA) publishes the Integrated Surface Data collection. It contains weather station measurements from stations around the world over the last decades. We have mirrored the ~205 GB of compressed weather measurements provided on the FTP server; format documentation is also available there. Further, this year we added the Historical Land-Cover Change and Land-Use Conversions Global Dataset, which tracks land use in the USA in different snapshots, starting from the 18th century.

  • N3: Urbanisation vs Climate Change. Rising temperatures at a measurement station over the years can be ascribed to general global warming, but might also be related to the local effect of increasing urbanization around the weather station. Using the land-use USA dataset, detect if urbanization at measurement stations is correlated with temperature increase and attempt to compare the contribution of urbanization to the overall picture of global warming.
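For N3, a first cut at "is urbanization correlated with temperature increase" is a per-station Pearson correlation between, say, the urban land-use fraction around the station and its mean temperature across the snapshot years. A minimal sketch of the correlation itself (how to pair land-use snapshots with station readings is left open):

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5
```

A strongly positive r at many stations would support the urbanization hypothesis, but disentangling it from the global warming trend still requires comparing against nearby stations with stable land use.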

Dataset HDFS storage location: /user/hannesm/lsde/noaa and /user/hannesm/lsde/landuse

Wikipedia Clicks

Wikipedia publishes page view statistics for their projects. We have collected ~820 GB of this data from 2014 (it will be updated to 2016). Tip: the page names mentioned in those files are recorded before redirects etc. are performed; it might be a good idea to use the Wikipedia database dumps to resolve those first. Also, normalizing accesses by the sum of clicks on the observed pages might help to reduce skew.
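The normalization tip can be sketched as follows: divide each page's views by the total views in the same hour, so that traffic spikes affecting all of Wikipedia do not dominate per-page signals. A minimal sketch, assuming the pageview records have already been parsed into (hour, page, views) triples:

```python
from collections import defaultdict

def normalize_pageviews(records):
    """records: iterable of (hour, page, views).
    Returns {(hour, page): share of that hour's total traffic}."""
    records = list(records)
    totals = defaultdict(int)
    for hour, _page, views in records:
        totals[hour] += views
    return {(hour, page): views / totals[hour]
            for hour, page, views in records}
```

On the real dataset this would be a per-hour aggregate followed by a join, but the logic is the same.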

  • W2: Election Prediction. Correlate clicks on Wikipedia pages with movements in the US polls. Focus on the debates during the US primaries. Which Wikipedia topics spiked around (just before and after) these debates? Can we use these insights, together with the actual activity on Wikipedia, to make predictions about the November presidential elections?
  • W4: DDOS Detection. Find Distributed Denial Of Service (DDOS) attacks on Wikipedia. This should include devising criteria to distinguish DDOS attacks from trending topics.

Dataset HDFS storage location: /user/hannesm/lsde/wikistats

Flickr Images

Flickr hosts a staggering number of pictures, a portion of which is publicly available. A subset of this public collection is listed by means of textual descriptions, from which you can crawl the pictures and analyze them in various ways.

  • F1: FaceJoin. Crawl the Flickr picture archive and use image recognition software (e.g. OpenFace) to identify faces and match them to a list of faces with known identity: a join on faces. The list of known faces could for instance be the FBI's most wanted list, but may also be chosen differently (e.g. famous people from Wikipedia, or missing children). The visualization would show a ranked list of Flickr matches per known portrait.
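The "join on faces" boils down to ranking crawled face embeddings by similarity to each known face. A minimal sketch, assuming the face recognizer (e.g. OpenFace) has already turned each face into a plain numeric vector, and using cosine similarity for the match score:

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def rank_matches(crawled, known, top_k=5):
    """crawled: {photo_id: embedding}; known: {person: embedding}.
    Returns {person: top_k (similarity, photo_id) pairs, best first}."""
    out = {}
    for person, ref in known.items():
        scored = sorted(((cosine(ref, emb), pid) for pid, emb in crawled.items()),
                        reverse=True)
        out[person] = scored[:top_k]
    return out
```

At cluster scale this becomes a broadcast join: the (small) list of known embeddings is shipped to every worker and scored against the crawled partition locally.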

Dataset HDFS storage location: /user/hannesm/lsde/flickr

Spark SQL

The new 2.0 version of Spark SQL, which recently came out, added Just-In-Time query compilation, together with a number of other improvements.

  • D1: Database Benchmark. We would be interested in a performance comparison between the previous (1.6) and the new (2.0) release on the SF3000 or SF10000 size of the TPC-H benchmark. You might also include other SQL-on-Hadoop systems, as listed here, in the comparison.
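Whatever systems end up in the comparison, a fair benchmark needs a consistent timing harness: run each query several times on each system and report a robust statistic such as the median, so one slow outlier run does not distort the result. A minimal sketch, where `run_query` is an assumed placeholder for whatever executes one query on the system under test (e.g. a wrapper around `spark.sql(...).collect()`):

```python
import time
import statistics

def benchmark(run_query, queries, repeats=3):
    """run_query: callable executing one query string on the system under test.
    queries: {query_name: query_string}.
    Returns {query_name: median wall-clock seconds over `repeats` runs}."""
    results = {}
    for name, q in queries.items():
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(q)
            times.append(time.perf_counter() - start)
        results[name] = statistics.median(times)
    return results
```

Remember to also do (and discard) a warm-up run per system, since JIT compilation and caching make the first execution unrepresentative.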