LSDE: Large Scale Data Engineering 2018
Data Analysis Projects

In this assignment, the goal is to gain experience with a real Big Data challenge by attempting to solve it on a large Hadoop cluster. Computation for all projects will be performed on the SurfSara Hadoop cluster. We will make all datasets mentioned below available there.

Access to the cluster is via ssh login.hathi.surfsara.nl.

Datasets and Topics

In the following, we describe the datasets and project topics teams can work on. Some datasets are used by several projects.

New York Taxis

Through the US Freedom of Information Law, records of all taxi rides in New York City were made public. The ~20 GB dataset contains ~170M rows on trips and paid fares. It includes time and place of passenger pickup and dropoff, trip time and distance, a driver and car identifier, and information about how much was paid for the trip.

  • T1: Taxi Business. Find lucrative pickup places for taxi drivers. Also find commercially successful drivers who earn much more than others per hour driven (corrected for time of day). Visualize the results appropriately (see the sketch after this list).
  • T3: Secret Nightlife. Find non-obvious nightlife places in New York City by looking for places outside the popular nightlife areas that are frequently used as the destination or starting point of taxi trips in the late or very early hours. Correlate this with the day of the week.

Dataset HDFS storage location: /user/hannesm/lsde/taxi
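
For T1, a minimal PySpark sketch along the following lines could be a starting point. The column names (pickup_datetime, pickup_longitude, pickup_latitude, total_amount) are assumptions and need to be checked against the actual CSV layout of the dataset.

    # Hedged sketch for T1: revenue per pickup cell and hour.
    # Column names are assumptions -- check the actual CSV header first.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("taxi-pickups").getOrCreate()

    trips = spark.read.csv("/user/hannesm/lsde/taxi", header=True, inferSchema=True)

    # Bucket pickups into a coarse lat/lon grid and aggregate earnings per cell and hour.
    cells = (trips
             .withColumn("cell_lat", F.round(F.col("pickup_latitude"), 3))
             .withColumn("cell_lon", F.round(F.col("pickup_longitude"), 3))
             .withColumn("hour", F.hour(F.col("pickup_datetime")))
             .groupBy("cell_lat", "cell_lon", "hour")
             .agg(F.count("*").alias("pickups"),
                  F.avg("total_amount").alias("avg_fare")))

    cells.orderBy(F.desc("avg_fare")).show(20)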

SciHub

SciHub is a controversial project aiming to make all scientific publications freely available. We have obtained an archive of 2.4 TB with millions of scientific papers, in PDF format. This archive could be studied with Big Data methods in manifold ways.

  • R1: Scientific Plagiarism. Extract the plaintext and author info from all PDF files, and start to analyze it (a text-extraction sketch follows after this list). The goal is to compare all papers for overlapping paragraphs, and subsequently rank and classify the transgressions based on size, frequency, and authorship. Provide a visualization that allows quantifying plagiarism along various dimensions, such as geography, year and topic; preferably one that also allows zooming into annotated examples of serious plagiarism instances.
  • R2: Web of SciHub. Extract the plaintext and author info from all PDF files, and start to analyze it. Your goal is to create a DBLP-like overview of this web of science, where papers are linked to each other by citation. Thus, we are interested in creating a scientific citation database in which all papers are present, including publication date and venue, keywords learned from the fulltext, and citations learned from the fulltext. From this data, the h-index and citation distributions over time could be calculated for each author.

Dataset HDFS storage location: /user/hannesm/lsde/scihub
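
A first hurdle for both R1 and R2 is getting plain text out of millions of PDFs. A hedged sketch, assuming the PDFs sit as individual files on HDFS (if they are packed into archives, an unpacking step comes first) and that PyPDF2 (one of several possible extractors) is installed on the worker nodes:

    # Hedged sketch for R1/R2: extract plain text from PDFs in parallel.
    # The "*.pdf" layout is an assumption about how the archive is stored.
    import io
    from pyspark.sql import SparkSession
    from PyPDF2 import PdfFileReader

    spark = SparkSession.builder.appName("scihub-text").getOrCreate()
    sc = spark.sparkContext

    def pdf_to_text(path_and_bytes):
        path, data = path_and_bytes
        try:
            reader = PdfFileReader(io.BytesIO(data), strict=False)
            text = " ".join(reader.getPage(i).extractText()
                            for i in range(reader.getNumPages()))
            return [(path, text)]
        except Exception:
            return []  # skip unparsable PDFs

    texts = (sc.binaryFiles("/user/hannesm/lsde/scihub/*.pdf")
               .flatMap(pdf_to_text))
    texts.map(lambda kv: (kv[0], len(kv[1]))).take(5)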

ADS-B Plane Tracking

Commercial airplanes periodically send out radio messages containing their position details (plane identifier, flight number, latitude, longitude, height, speed, ...). These ADS-B messages are picked up by enthusiasts and collected in systems such as the OpenSky network or Flightradar24. We have obtained ~200 GB of compressed ADS-B messages from September 2015.

  • P7: Airline Timeliness. Reverse engineer a flight schedule from the flight movements and compare it with any official flight info for that time period (a flight-segmentation sketch follows after this list). Assess the delay distribution relative to airlines/flights/days/times and try to determine which airlines try to mitigate delays (e.g. by flying faster or more directly), and which do not care.
  • P8: Unscheduled Flights. Detect flights and relate them to official flight numbers and well-known airlines. Investigate the remaining flights, showing flight patterns, and try to assess their purpose. Try to connect callsigns with additional outside information, in order to learn more about these flights.

Dataset HDFS storage location: /user/hannesm/lsde/opensky2
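
For P7/P8, a hedged sketch for segmenting position reports into individual flights. It assumes the raw ADS-B messages have already been decoded into rows with (icao, callsign, ts, lat, lon), where ts is epoch seconds; the decoding step itself (e.g. with a library such as pyModeS) depends on the archive's actual message format and is not shown, and the input path is hypothetical:

    # Hedged sketch for P7/P8: group position reports per aircraft and cut a
    # new flight whenever there is a gap of more than 30 minutes.
    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("adsb-flights").getOrCreate()
    msgs = spark.read.parquet("/user/yourname/adsb_decoded")  # hypothetical path

    w = Window.partitionBy("icao").orderBy("ts")
    flights = (msgs
        .withColumn("gap", F.col("ts") - F.lag("ts").over(w))
        .withColumn("new_flight", (F.col("gap").isNull() | (F.col("gap") > 1800)).cast("int"))
        .withColumn("flight_id", F.sum("new_flight").over(
            w.rowsBetween(Window.unboundedPreceding, 0))))

    flights.groupBy("icao", "flight_id").agg(
        F.min("ts").alias("dep_ts"), F.max("ts").alias("arr_ts")).show(10)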

NOAA Climate Measurements

The US National Oceanic and Atmospheric Administration (NOAA) publishes the Integrated Surface Data collection. It contains weather station measurements from stations around the world over the last decades. We have mirrored the ~205 GB of compressed weather measurements provided on the FTP server. Format documentation is also available there. Further, this year we added the Historical Land-Cover Change and Land-Use Conversions Global Dataset, which tracks land use in the USA in several snapshots, starting from the 18th century.

  • N2: Twin Weather. Find surprising weather "twins" of places, considering weather at a monthly granularity. Visualize the results with a map showing far-apart places (and times) with a similar climate (see the sketch after this list).
  • N4: Visualize and Extrapolate Urban Sprawl. We would like you to visualize the effect of increasing urbanization around major population centers over time, as well as model this sprawl and predict how it continues into the future (at particular time points).

Dataset HDFS storage location: /user/hannesm/lsde/noaa and /user/hannesm/lsde/landuse
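
For N2, a starting point is aggregating ISD records into monthly mean temperatures per station. A minimal sketch; the fixed character positions below are taken from the ISD format description and should be verified against the documentation mirrored on the FTP server:

    # Hedged sketch for N2: monthly mean air temperature per station.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("noaa-monthly").getOrCreate()
    sc = spark.sparkContext

    def parse(line):
        if len(line) < 93:
            return []
        station = line[4:10] + line[10:15]   # USAF + WBAN station id
        yearmonth = line[15:21]              # YYYYMM from the observation date
        temp = line[87:92]                   # air temperature * 10, "+9999" = missing
        if temp == "+9999":
            return []
        return [((station, yearmonth), (int(temp) / 10.0, 1))]

    lines = sc.textFile("/user/hannesm/lsde/noaa")
    monthly = (lines.flatMap(parse)
               .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
               .mapValues(lambda s: s[0] / s[1]))   # mean temperature per month
    monthly.take(5)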

Flickr Images

Flickr hosts a staggering number of pictures, a portion of which is publicly available. A subset of this public Flickr collection is listed by means of textual descriptions, from which you can crawl the pictures and analyze them in various ways.

  • F1: FaceJoin. Crawl the flickr picture archive and use image recognition software (e.g. OpenFace) to identify faces and match them to a list of faces with known identity: a join on faces (see the sketch after this list). The list of known faces could for instance be the FBI's most wanted list, but may also be chosen differently (e.g. famous people from Wikipedia, or missing children). The visualization would show a ranked list of flickr matches per known portrait.
  • F2: Deep Locations. Crawl the flickr picture archive to get a large number of pictures with GPS annotations. Now try to train a deep learning model to learn the GPS location of a picture. It could be a good idea to remove persons from the pictures automatically, but you must figure this out for yourselves. Create a visualization that shows the accuracy of your approach and/or allows uploading a picture and predicts its GPS location.

Dataset HDFS storage location: /user/hannesm/lsde/flickr
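
A hedged sketch for F1, assuming the listing is a text file with one photo URL per line and using the face_recognition package as a stand-in for OpenFace; both the listing format and the portrait file name are assumptions you would replace, and the libraries must be installed on the workers:

    # Hedged sketch for F1: download listed pictures and match faces against
    # a known portrait.
    import io
    import requests
    import numpy as np
    import face_recognition
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("flickr-faces").getOrCreate()
    sc = spark.sparkContext

    # Encodings of a known portrait, computed once on the driver (hypothetical file).
    known = face_recognition.face_encodings(
        face_recognition.load_image_file("wanted_portrait.jpg"))

    def match(url):
        try:
            img = face_recognition.load_image_file(
                io.BytesIO(requests.get(url, timeout=10).content))
            encodings = face_recognition.face_encodings(img)
            # distance below 0.6 is the package's usual "same person" threshold
            hits = [float(np.min(face_recognition.face_distance(known, e)))
                    for e in encodings]
            return [(url, min(hits))] if hits else []
        except Exception:
            return []

    urls = sc.textFile("/user/hannesm/lsde/flickr")   # assumed: one URL per line
    urls.flatMap(match).filter(lambda kv: kv[1] < 0.6).take(10)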

Bitcoin Blockchain

Bitcoin is the internet currency based on so-called blockchain technology. A blockchain is a shared ledger (transaction log) consisting of blocks that can be appended only after computationally intensive mining. The current bitcoin blockchain is 130 GB (compressed), and we downloaded it for you to analyze.

  • B1: Bitcoin Communities. The transactions of bitcoins form a huge graph that can be analyzed using techniques from graph analytics. We are generally interested in learning about the shape of this graph, and specifically would like to see community detection performed. One goal of that would be to identify bitcoin addresses likely belonging to the same entity (see the sketch after this list).
  • B3: Bitcoin Criminality. Bitcoin has been used by criminals as a financial channel for extortion schemes, e.g. related to ransomware or black markets (Silk Road). We would like you to detect patterns in known criminal activity, and then use this knowledge to find and characterize other possible criminal activities in bitcoin history.
  • B4: Blockchain Structure Evolution. The transactions of the bitcoin blockchain form a temporal graph, as each transaction has a timestamp. We would like to learn how the structure of the blockchain transaction graph (clustering coefficient, TPR, bridge ratio, diameter, conductance, size) evolves over time. An intermediate step is splitting the blockchain into multiple time-based graph snapshots, in order to capture these statistics over different periods.

Dataset HDFS storage location: /user/hannesm/lsde/bitcoin

As for software, a useful pointer might be https://github.com/ZuInnoTe/hadoopcryptoledger/
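
For B1, one common heuristic is common-input ownership: addresses whose coins are spent together as inputs of one transaction likely belong to the same entity. A hedged sketch, assuming the blockchain has already been parsed (e.g. with hadoopcryptoledger) into (tx_id, input_address) rows at a hypothetical path, and that the graphframes package is available on the cluster:

    # Hedged sketch for B1: cluster addresses via the common-input-ownership
    # heuristic, then find clusters as connected components.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from graphframes import GraphFrame

    spark = SparkSession.builder.appName("btc-communities").getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/btc_checkpoints")

    inputs = spark.read.parquet("/user/yourname/tx_inputs")  # hypothetical path

    # Link every input address of a transaction to that transaction's first input.
    first = inputs.groupBy("tx_id").agg(F.min("input_address").alias("src"))
    edges = (inputs.join(first, "tx_id")
                   .select(F.col("src"), F.col("input_address").alias("dst"))
                   .where("src != dst").distinct())
    vertices = inputs.select(F.col("input_address").alias("id")).distinct()

    clusters = GraphFrame(vertices, edges).connectedComponents()
    clusters.groupBy("component").count().orderBy(F.desc("count")).show(10)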

Point Clouds

The Actueel Hoogtebestand Nederland (AHN) is a so-called point cloud, consisting of 3D points measured using LIDAR technology. LIDAR stands for Light Detection and Ranging, a remote sensing method that uses light in the form of a pulsed laser to measure ranges to the Earth; in the case of AHN the measurements are taken from an airplane. We have 2 TB of point cloud data for you to analyze.

You may find the following Spark package useful: https://github.com/IGNF/spark-iqmulus, though it works only with Spark 1.6.2 (which you would have to install yourself, as the preinstalled version now is 2.1.1).

  • C1: 3D Kadaster. We have downloaded XML descriptions containing all Dutch addresses and 2D building plan polygons. We would like you to turn these 2D plans into 3D models, using the point cloud dataset.
  • C2: Flooding Analysis. We would like you to summarize the Dutch terrain into areas of equal height at a relatively coarse granularity, also detecting the topology of dike systems within it (a height-grid sketch follows after this list). Subsequently, we would like you to run flooding models that simulate the effects of a subset of these dikes being breached.
  • C3: Wind-aware Bike Routing. Cyclists plan their trips with multiple criteria in mind. Here, we would like to find minimum-effort (in biker energy) routes from A to B in Amsterdam. Given the 3D shapes of the point-cloud model, we would like you to run simulations to calculate the wind speed in every street of Amsterdam, and to use this to derive effort weights for every street segment (given a particular wind direction and force).
  • C4: Illegal Buildings. Train a deep learning model using the point-cloud dataset and the Kadaster information on buildings as ground truth. Use this model to detect houses that are not on the Kadaster map, or have larger dimensions than declared.

Dataset HDFS storage location: /user/hannesm/lsde/ahn2 (point cloud - check the web viewer) and /user/pboncz/geodata (house surface polygons from Kadaster). To process the point cloud data with Spark, you might try using the old spark installation on the cluster (/opt/spark-1.6.1-bin-hadoop2.6) with the code in https://github.com/IGNF/spark-iqmulus.
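
For C2, a hedged sketch that reduces (x, y, z) points to a coarse height grid. It assumes the LAS tiles have already been converted into a DataFrame with x/y/z columns at a hypothetical path (e.g. via spark-iqmulus or a per-tile LAS reader); that conversion is the first real hurdle of the project:

    # Hedged sketch for C2: summarize the point cloud into a coarse height grid.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("ahn-grid").getOrCreate()

    points = spark.read.parquet("/user/yourname/ahn_xyz")  # hypothetical path

    cell = 100.0  # grid cell size in metres (AHN coordinates are in metres)
    grid = (points
            .withColumn("gx", F.floor(F.col("x") / cell))
            .withColumn("gy", F.floor(F.col("y") / cell))
            .groupBy("gx", "gy")
            .agg(F.expr("percentile_approx(z, 0.5)").alias("median_height"),
                 F.min("z").alias("min_height")))

    grid.write.parquet("/user/yourname/ahn_height_grid")  # hypothetical output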

LandSat

LandSat (1-8) and Sentinel (1-2) are US and EU remote sensing satellites orbiting the poles of the earth, providing free imagery for every spot on the planet. LandSat-8 has 30m granularity and Sentinel-2 10m (hence, coarser than the DigitalGlobe QuickBird at 0.65m used by Google Earth). However, this data is free, and especially in the case of LandSat it has a historical record spanning four decades. We have a large quantity of LandSat-8 data on Europe, as well as limited LandSat-4-7 snapshots from 1985, 1990, 1994, 2000, 2005, 2010 and 2015. You might need to download some more data.

  • L1: Earth Imaging Time-Travel Animations. Given the long LandSat history in capturing images of the world, we would like you to construct movies for every segment of the earth (at multiple granularities), showing how it evolves over time. This task includes stitching together imagery of different "scenes", as well as the effort to clean the imagery from clouds.
  • L2: Deep Learning Land Use. Given the LandSat data and its history, as well as the NOAA land-use information, we would like you to train a deep learning model that, given an earth image, derives likely land-use information (including polygons for these areas); a band-arithmetic sketch follows after this list.
  • L3: Deep Learning to detect Clouds. Whereas cloud detection in earth imaging is currently largely performed with hand-written special-purpose techniques, we would like you to try learning how to distinguish clouds from real earth features using deep learning techniques. This model should then be used to produce clean(er) earth images.

Dataset HDFS storage location: /user/pboncz/landsat-s3 (LandSat-8 full Europe, July 2017) and /user/pboncz/landsat-history with archive footage of LandSat 4, 5, 7 and 8 of Western Europe (Denmark, Germany, Switzerland, France, Belgium, Netherlands, UK) taken around July in the years 1985, 1990, 1994, 2000, 2005 and 2010. Note that both landsat-*/ directories have a metadata.tgz file with text files and thumbnails (for some of the scenes).
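
As a small illustration of working with the imagery (relevant to L2 and L3), the sketch below computes NDVI for a single LandSat-8 scene with rasterio. The band file names follow the usual LandSat-8 naming convention and are assumptions to verify against the scenes on HDFS; the files would likely need to be copied out of HDFS to local disk first.

    # Hedged sketch: NDVI from the red (B4) and near-infrared (B5) bands.
    import numpy as np
    import rasterio

    with rasterio.open("LC08_scene_B4.TIF") as red_band:   # hypothetical file name
        red = red_band.read(1).astype("float32")
    with rasterio.open("LC08_scene_B5.TIF") as nir_band:   # hypothetical file name
        nir = nir_band.read(1).astype("float32")

    # NDVI = (NIR - Red) / (NIR + Red); vegetation scores high, clouds/water low.
    denom = nir + red
    ndvi = np.where(denom > 0, (nir - red) / denom, 0.0)
    print(ndvi.min(), ndvi.max())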