LSDE: Large Scale Data Engineering 2017
Data Analysis Projects

In this assignment the goal is to get experience with a real Big Data challenge by attempting to solve it on a large Hadoop cluster. Computation for all projects will be performed on the SurfSara Hadoop cluster; we will make all datasets mentioned below available there.

Access to the cluster is via ssh login.hathi.surfsara.nl.

Datasets and Topics

In the following, we describe the datasets and project topics teams can work on. Some datasets are used by several projects.

ADS-B Plane Tracking

Commercial airplanes periodically send out radio messages containing their position details (plane identifier, flight number, latitude, longitude, height, speed, ...). These ADS-B messages are picked up by enthusiasts and collected in systems such as the OpenSky network or Flightradar24. We have obtained ~200 GB of ADS-B messages from September 2015 in a compressed format.

  • P3: Flight Visualization. Generate an interactive flight-path animation (GIF?) of all flights based on their accurate location data. Speed up time. Reduce the number of flights if necessary through stratified sampling of diverse flight routes (see the sketch below). Non-interactive Example.
  • P7: Airline Timeliness. Reverse-engineer a flight schedule from the observed flight movements and compare it with official flight info for that time period. Assess the delay distribution relative to airlines/flights/days/times and try to assess which airlines try to mitigate delays (e.g. by flying faster or more directly), and which do not care.
  • P8: Unscheduled Flights. Detect flights and relate them to official flight numbers and well-known airlines. Investigate the remaining flights, showing flight patterns, and try to assess their purpose.

Dataset HDFS storage location: /user/hannesm/lsde/opensky2
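
For P3, a minimal PySpark sketch of the stratified-sampling step might look as follows; the column layout, field names and output path are assumptions, not the actual dump format:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("adsb-sample").getOrCreate()

    # Hypothetical column layout -- verify against the actual dump format.
    msgs = (spark.read.csv("/user/hannesm/lsde/opensky2", inferSchema=True)
            .toDF("icao24", "callsign", "time", "lat", "lon", "alt", "vel"))

    # One row per flight: earliest/latest position by timestamp, rounded to
    # whole degrees, so that flights on similar routes share a route key.
    flights = (msgs.groupBy("icao24", "callsign")
               .agg(F.min(F.struct("time", "lat", "lon")).alias("f"),
                    F.max(F.struct("time", "lat", "lon")).alias("l"))
               .withColumn("route", F.concat_ws("_",
                    F.round("f.lat").cast("int"), F.round("f.lon").cast("int"),
                    F.round("l.lat").cast("int"), F.round("l.lon").cast("int"))))

    # sampleBy keeps a fixed fraction per route, so rare routes stay represented.
    routes = [r["route"] for r in flights.select("route").distinct().collect()]
    sample = flights.stat.sampleBy("route", {r: 0.05 for r in routes}, seed=42)
    sample.write.parquet("/user/yourname/flights_sample")  # hypothetical output path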

AIS Ship Tracking

Commercial ships periodically send out messages containing their position details (ship identifier, latitude, longitude, speed, ...). These AIS messages are collected and visualized in systems such as MarineTraffic. We have ~26 GB of compressed AIS messages (TXT/AIVDM format) over a period of two weeks.

  • S2: Running for Oil. Identify oil tankers (using e.g. the IMO number), group them by company and country, and identify their trips and trip speed or even specific loitering. Try to correlate oil transportation and travel speed with the oil price. Is there more or less traffic when oil prices are high or low? Can we predict future oil prices from movement on the ocean? Are ships delaying discharge (loitering) while prices are rising, and accelerating when prices are dropping?
  • S3: Suspicious Outage. Reconstruct ship trajectories with a focus on incomplete trajectories that, given the reception range, would be expected to be complete. In other words, ship trajectories where for some reason the AIS message sender had been turned off during parts of the trip. Find incidents with such ships (e.g. using the IMO number), group them to find the most suspicious ones, and visualize their routes (see the sketch below).

Dataset HDFS storage location: /user/hannesm/lsde/ais2
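
For S3, the transmission-gap detection could be sketched as below, assuming the AIVDM messages have already been decoded into (mmsi, ts, lat, lon) rows (the decoding itself is out of scope here, and the gap threshold is a guess):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("ais-gaps").getOrCreate()

    # Hypothetical decoded table: one row per position report, ts in unix seconds.
    pos = spark.read.parquet("/user/yourname/ais_decoded")

    w = Window.partitionBy("mmsi").orderBy("ts")
    gaps = (pos
            .withColumn("prev_ts", F.lag("ts").over(w))
            .withColumn("gap_s", F.col("ts") - F.col("prev_ts"))
            # Position reports normally arrive every few seconds to minutes;
            # treat anything over an hour as a candidate outage (a guessed threshold).
            .filter(F.col("gap_s") > 3600))

    # Ships with the most unexplained gaps are the most suspicious candidates.
    gaps.groupBy("mmsi").count().orderBy(F.desc("count")).show(20)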

NOAA Climate Measurements

The US National Oceanic and Atmospheric Administration (NOAA) publishes the Integrated Surface Data collection. It contains weather station measurements from stations around the world for the past decades. We have mirrored the ~205 GB of compressed weather measurements provided on the FTP server, where format documentation is also available. Further, this year we added the Historical Land-Cover Change and Land-Use Conversions Global Dataset that keeps track of land use in the USA in different snapshots, starting from the 18th century.

  • N4: Visualize and Extrapolate Urban Sprawl. We would like you to visualize the effect of increasing urbanization around major population centers over time, as well as model this sprawl and make predictions into the future (at particular time points); see the sketch below.

Dataset HDFS storage location: /user/hannesm/lsde/noaa and /user/hannesm/lsde/landuse
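
For N4, a minimal sketch of the extrapolation step, assuming the landuse snapshots have been reduced to per-cell urban fractions per year (the input layout is hypothetical; check the actual dataset format):

    import numpy as np

    def extrapolate(years, fracs, target_year):
        """Fit a straight line through historical urban fractions and predict."""
        slope, intercept = np.polyfit(years, fracs, deg=1)
        return float(np.clip(slope * target_year + intercept, 0.0, 1.0))

    # Made-up snapshot values for a single grid cell:
    years = np.array([1900.0, 1950.0, 1990.0, 2010.0])
    fracs = np.array([0.05, 0.15, 0.40, 0.55])
    print(extrapolate(years, fracs, 2050))  # naive linear forecast, capped to [0, 1]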

Wikipedia Clicks

Wikipedia publishes page view statistics for their projects. We have collected ~820 GB of these statistics from 2014 (to be updated to 2016). Tip: the page names mentioned in those files are recorded before redirects etc. are performed; it might be a good idea to use the Wikipedia database dumps to resolve those first. Also, normalizing accesses by the sum of clicks on the observed pages might help to reduce skew (see the sketch below).

  • W3: Attention Curves. Cluster topics with similar attention curves. Given a topic, show other topics with similar attention curves over time.

Dataset HDFS storage location: /user/hannesm/lsde/wikistats
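
For the normalization tip above, a PySpark sketch: the "project page count bytes" line format is the standard pagecounts layout, but extracting the hour from the file name assumes the usual pagecounts-YYYYMMDD-HH0000 naming, so verify against the actual files:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("wiki-attention").getOrCreate()

    raw = (spark.read.text("/user/hannesm/lsde/wikistats")
           .withColumn("file", F.input_file_name())
           .withColumn("hour", F.regexp_extract("file", r"pagecounts-(\d{8}-\d{2})", 1)))

    parts = F.split(F.col("value"), " ")
    clicks = (raw.select("hour",
                         parts.getItem(0).alias("project"),
                         parts.getItem(1).alias("page"),
                         parts.getItem(2).cast("long").alias("views"))
              .filter(F.col("project") == "en"))  # English Wikipedia only

    # Normalize by the total clicks in each hour to reduce time-of-day skew.
    totals = clicks.groupBy("hour").agg(F.sum("views").alias("total"))
    norm = (clicks.join(totals, "hour")
            .withColumn("share", F.col("views") / F.col("total")))
    norm.write.parquet("/user/yourname/attention")  # per-page hourly attention shares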

Flickr Images

Flickr hosts a staggering number of pictures, of which a portion is publicly available. A subset of these public pictures is listed in a textual description, from which you can crawl the pictures and analyze them in various ways.

  • F2: Deep Locations. Crawl the flickr picture archive to get a large amount of pictures with GPS annotations. Now try to train a deep learning model to learn the GPS location of a picture. It could be a good idea to remove persons from the pictures automatically, but you must figure this out for yourselves. Create a visualization that shows the accuracy of your approach and/or allows uploading a picture and predicts its GPS location (see the sketch below).

Dataset HDFS storage location: /user/hannesm/lsde/flickr
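
For F2, one known framing (cf. Google's PlaNet work) casts GPS prediction as classification over a grid of geographic cells. A toy Keras sketch of that idea, where the grid resolution, input size and architecture are all assumptions:

    from tensorflow.keras import layers, models

    N_CELLS = 36 * 18  # hypothetical 10-degree lat/lon grid

    def cell_of(lat, lon):
        """Map a GPS coordinate to its (hypothetical) 10-degree grid cell index."""
        return int((lat + 90) // 10) * 36 + int((lon + 180) // 10)

    # Toy CNN: classify a 128x128 RGB picture into one of the grid cells.
    model = models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=(128, 128, 3)),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(N_CELLS, activation="softmax"),  # one class per grid cell
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(images, [cell_of(lat, lon) for lat, lon in gps_labels], ...)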

Bitcoin Blockchain

Bitcoin is an internet currency based on so-called blockchain technology. A blockchain is a shared ledger (transaction log) consisting of blocks that can be written only after computationally intensive mining. The current bitcoin blockchain is ~130 GB (compressed), and we have downloaded it for you to analyze.

  • B1: Bitcoin Communities. The transactions of bitcoins form a huge graph that can be analyzed using techniques from graph analytics. We are generally interested in learning about the shape of this graph, and specifically would like to see community detection performed (see the sketch below). One goal of that would be to identify bitcoin account numbers likely belonging to the same entity.
  • B2: BlockChain Evolution. The transactions of the bitcoin blockchain form a temporal graph, as each transaction has a timestamp. We would like to learn about the evolution of blockchain usage over time (users, account balances, transactions, mined blocks, etc.). An intermediate step will be splitting the blockchain into multiple time-based graph snapshots.
  • B3: Bitcoin Extortion. Bitcoin has been used by criminals as a financial channel for extortion schemes, e.g. related to ransomware. We would like you to detect patterns in known ransomware attacks, and then use this knowledge to find and characterize other possible ransomware attacks in bitcoin history.

Dataset HDFS storage location: /user/hannesm/lsde/bitcoin. As for software, a useful pointer might be https://github.com/ZuInnoTe/hadoopcryptoledger/
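
For B1, once the transactions are parsed into an edge table (e.g. with hadoopcryptoledger; that step is omitted here and the paths are hypothetical), community detection could be sketched with GraphFrames:

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame  # add with --packages graphframes:graphframes:...

    spark = SparkSession.builder.appName("btc-communities").getOrCreate()
    # connectedComponents() requires a checkpoint directory.
    spark.sparkContext.setCheckpointDir("/user/yourname/checkpoints")

    # Hypothetical pre-parsed edge table: one row per (src, dst, value) transfer.
    edges = spark.read.parquet("/user/yourname/btc_edges").toDF("src", "dst", "value")
    vertices = edges.select("src").union(edges.select("dst")).distinct().toDF("id")

    g = GraphFrame(vertices, edges)
    components = g.connectedComponents()         # coarse clusters
    communities = g.labelPropagation(maxIter=5)  # cheap community detection
    communities.groupBy("label").count().orderBy("count", ascending=False).show(20)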

Point Clouds

The Actueel Hoogtebestand Nederland (AHN) is a so-called point cloud, consisting of 3D points measured using LIDAR technology. LIDAR stands for Light Detection and Ranging - a remote sensing method that uses light in the form of a pulsed laser to measure ranges to the Earth, in the case of AHN taken from an airplane. We have 2 TB of point cloud data for you to analyze.

You may find the following Spark package useful: https://github.com/IGNF/spark-iqmulus, though it works only with Spark 1.6.2 (which you would have to install yourself, as the preinstalled version now is 2.1.1).

  • C1: 3D Kadaster. We have downloaded XML descriptions containing all Dutch addresses and 2D building-plan polygons. We would like you to turn these 2D plans into 3D models, using the point cloud dataset.
  • C2: Flooding Analysis. We would like you to summarize the Dutch terrain into areas of equal height at a relatively coarse granularity, also detecting the topology of the dike systems within it (see the sketch below). Subsequently, we would like you to run flooding models that simulate the effects of a subset of these dikes being breached.
  • C3: Wind-aware Bike Routing. Cyclists plan their trips with multiple criteria in mind. Here, we would like to find minimum-effort (in biker energy) routes from A to B in Amsterdam. Given the 3D shapes of the point-cloud model, we would like you to run simulations that calculate the wind speed in every street of Amsterdam, allowing you to derive effort weights for every street segment (given a particular wind direction and force).
  • C4: Illegal Buildings. Train a deep learning model using the point-cloud dataset and the Kadaster information on buildings as ground truth. Use this model to detect houses that are not on the Kadaster map, or that have larger dimensions than declared.

Dataset HDFS storage location: /user/hannesm/lsde/ahn2 (point cloud - check the web viewer) and /user/pboncz/geodata (house surface polygons from Kadaster). To process the point cloud data with Spark, you might try using the old spark installation on the cluster (/opt/spark-1.6.1-bin-hadoop2.6) with the code in https://github.com/IGNF/spark-iqmulus.
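
For C2, the terrain-summarization step could be sketched as below, assuming the LAS points have already been extracted into a DataFrame with x, y, z columns (e.g. via spark-iqmulus); the cell size is an arbitrary choice:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("ahn-grid").getOrCreate()

    # Hypothetical extracted points: one row per LIDAR point with x, y, z columns.
    points = spark.read.parquet("/user/yourname/ahn_xyz")

    CELL = 100.0  # grid cell size in meters (RD coordinates); an arbitrary choice
    grid = (points
            .withColumn("gx", F.floor(F.col("x") / CELL))
            .withColumn("gy", F.floor(F.col("y") / CELL))
            .groupBy("gx", "gy")
            .agg(F.avg("z").alias("mean_h"),
                 F.min("z").alias("min_h"),   # low cells hint at water and polders
                 F.max("z").alias("max_h")))  # high, narrow ridges hint at dikes
    grid.write.parquet("/user/yourname/height_grid")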

LandSat

LandSat (1-8) and Sentinel (1-2) are US and EU remote sensing satellites orbiting the poles of the earth, providing free imagery for every spot on the planet: LandSat-8 at 30 m granularity and Sentinel-2 at 10 m (hence coarser than the DigitalGlobe QuickBird imagery at 0.65 m used by Google Earth). However, this data is free, and especially in the case of LandSat it has historic information spanning four decades. We have a large quantity of LandSat-8 data on Europe, as well as limited LandSat-4-7 snapshots from 1985, 1990, 1994, 2000, 2005, 2010 and 2015. You might need to download some more data.

  • L1: Earth Imaging Time-Travel Animations. Given the long LandSat history in capturing images of the world, we would like you to construct movies for every segment of the earth (at multiple granularities), showing how it evolves over time. This task includes stitching together imagery of different "scenes" as well as effort to clean the imagery of clouds.
  • L2: Deep Learning Landuse. Given the LandSat data and its history, as well as the NOAA landuse information, we would like you to train a deep learning model that, given an earth image, derives likely landuse information (including polygons for these).
  • L3: Deep Learning to Detect Clouds. Whereas cloud detection in earth imaging is currently largely performed with hand-written special-purpose techniques, we would like you to try learning how to distinguish clouds from real earth features using deep learning techniques (see the sketch below). This model should then be used to produce clean(er) earth images.

Dataset HDFS storage location: /user/pboncz/landsat-s3 (Landsat-8 full Europe, July 2017) and /user/pboncz/landsat-history with archive footage of Landsat 4, 5, 7 and 8 of Western Europe (Denmark, Germany, Switzerland, France, Belgium, Netherlands, UK) taken around July in the years 1985, 1990, 1994, 2000, 2005 and 2010. Note that both landsat-*/ directories have a metadata.tgz file with text files and thumbnails (for some of the scenes).
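
For L3, a naive brightness-threshold baseline that could bootstrap training labels; the band file name, scaling and threshold are assumptions, and real scenes should be handled via the Landsat QA documentation:

    import numpy as np
    import rasterio

    # Hypothetical file name for the blue band of one Landsat-8 scene.
    with rasterio.open("LC08_scene_B2.TIF") as src:
        blue = src.read(1).astype(np.float32)

    # Clouds are bright in the visible bands; flag the brightest valid pixels.
    threshold = np.percentile(blue[blue > 0], 95)  # the percentile is a guess
    cloud_mask = blue > threshold
    print("cloudy fraction: %.1f%%" % (100.0 * cloud_mask.mean()))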

Big SQL

Spark SQL 2.x added Just-In-Time query compilation, together with a number of other improvements. Other popular Big SQL systems are Hive and Presto; however, the question is how mature these relatively young systems are.

  • D1: Database Benchmark. We would be interested in a performance comparison at the SF1000 or SF3000 size of the TPC-H benchmark. Recently, an alternative version of the benchmark was introduced under the name JCC-H, which adds severe join skew. It would be interesting to test the performance of various SQL-on-Hadoop systems, as listed here, on these two benchmarks (see the sketch below).
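
For D1, a sketch of timing TPC-H Q1 on Spark SQL; the lineitem path and storage format are assumptions, and the date predicate is precomputed (1998-12-01 minus 90 days) to avoid interval-syntax differences between engines:

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tpch-q1").getOrCreate()
    # Hypothetical location of a generated SF1000 lineitem table.
    spark.read.parquet("/user/yourname/tpch/lineitem").createOrReplaceTempView("lineitem")

    q1 = """
    SELECT l_returnflag, l_linestatus,
           SUM(l_quantity)                                       AS sum_qty,
           SUM(l_extendedprice)                                  AS sum_base_price,
           SUM(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
           SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
           AVG(l_quantity) AS avg_qty, AVG(l_extendedprice) AS avg_price,
           AVG(l_discount) AS avg_disc, COUNT(*) AS count_order
    FROM lineitem
    WHERE l_shipdate <= '1998-09-02'
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus
    """
    start = time.time()
    spark.sql(q1).collect()  # force full execution
    print("Q1 took %.1f s" % (time.time() - start))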