LSDE: Large Scale Data Engineering 2021
Data Analysis Projects

In this assignment the goal is to get experience with creating Big Data analysis pipelines. Computation for all projects will be performed on Databricks via Amazon Web Services.

Datasets and Topics

In the following, we describe the datasets and project topics teams can work on. Some datasets are used by several projects.

AIS Ship Tracking

Commercial ships periodically send out messages containing their position details (ship identifier, latitude, longitude, speed, ...). These AIS messages are collected and visualized in systems such as MarineTraffic. We have 2.6 TB of AIS messages (TXT/AIVDM format) from the years 2016 and 2017.

  • S3: Suspicious Outage. Reconstruct ship trajectories with a focus on incomplete trajectories that, given the reception range, would be expected to be complete. In other words, ship trajectories where for some reason the AIS message sender had been turned off during parts of the trip. Find incidents with such ships (e.g. using the IMO number), group them to find the most suspicious ones and visualize their routes (a starting sketch for the gap detection follows below).
  • S4: Meetings at Sea. Try to identify situations in which ships were close to each other during an extended period of time, possibly but not necessarily in the same place. Close enough to make a transfer of cargo a possibility. Augment the encounters with additional information you can find on the properties of these ships (names, type, port, function, etc.), e.g. using the IMO number, but also other sources you might find.

Dataset storage location: dbfs:/mnt/lsde/ais/
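
For S3, a minimal PySpark sketch of the gap-detection step, assuming the raw AIVDM sentences have already been decoded (for example with the open-source pyais library) into rows of (mmsi, ts, lat, lon); the tiny DataFrame below is only a stand-in for that decoded data, and the 6-hour threshold is an arbitrary choice.

    from pyspark.sql import SparkSession, functions as F, Window

    spark = SparkSession.builder.getOrCreate()

    # Stand-in for the decoded AIS position reports (mmsi, unix ts, lat, lon).
    pos = spark.createDataFrame(
        [(244660000, 1470009600, 52.4, 4.8),
         (244660000, 1470013200, 52.5, 4.9),
         (244660000, 1470063200, 53.1, 5.6)],   # long silence before this report
        ["mmsi", "ts", "lat", "lon"])

    # Time gap between consecutive position reports of the same ship (MMSI).
    w = Window.partitionBy("mmsi").orderBy("ts")
    gaps = (pos.withColumn("prev_ts", F.lag("ts").over(w))
               .withColumn("gap_s", F.col("ts") - F.col("prev_ts"))
               .filter(F.col("gap_s") > 6 * 3600))    # silent for more than 6 hours

    # Ships ranked by the number of suspicious outages.
    gaps.groupBy("mmsi").count().orderBy(F.desc("count")).show()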

WikiMedia Statistics

Wikipedia publishes hourly page view statistics for their projects. This data is available in this shape from 2015 onwards. The popularity of topics in Wikipedia can give an indication of the interest of people over time and space (the latter specifically in non-English language domains).

  • M4: DDOS Detection. Find Distributed Denial of Service (DDOS) attacks on Wikipedia. This should include devising criteria to distinguish DDOS attacks from trending topics. Summarize these attacks over time and cluster them by theme.
  • M5: COVID-19 Attention. Analyze the pageview statistics of Wikipedia over the past 4 years, and compare the previous access patterns with the months of the pandemic (a parsing sketch follows below). Split this out over various language domains that can be related to countries (e.g. nl, de, fr, it, se, es). We are interested to learn what topics are on the minds of Wikipedia users over the months, hence prominent topics; and specifically topics whose attention is significantly altered (upwards or downwards) during peaks of COVID-19. Specifically, you could try to correlate temporal changes in attention in certain countries to the COVID stringency in that country. Consider various forms of visualizing these results over topic (clouds?), time and space.

Dataset storage location: you have to download this yourself
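
For M5, a sketch of parsing the hourly pageview dumps with PySpark, assuming the standard space-separated layout (domain code, page title, view count, response size) and a file naming scheme like pageviews-YYYYMMDD-HHMMSS.gz; the input path is hypothetical and should point to wherever you downloaded the dumps.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical download location of the hourly dump files.
    raw = spark.read.text("dbfs:/mnt/yourgroup/pageviews/pageviews-2020*.gz")

    parts = F.split(F.col("value"), " ")
    views = raw.select(
        parts.getItem(0).alias("domain"),
        parts.getItem(1).alias("page"),
        parts.getItem(2).cast("long").alias("views"),
        F.regexp_extract(F.input_file_name(), r"pageviews-(\d{8})-", 1).alias("day"))

    # Daily page views per language domain, e.g. the nl, de and it Wikipedias.
    (views.filter(F.col("domain").isin("nl", "de", "it"))
          .groupBy("domain", "day")
          .agg(F.sum("views").alias("daily_views"))
          .orderBy("domain", "day")
          .show(20))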

SciHub

SciHub is a controversial project aiming to make all scientific publications freely available. We have obtained an archive of 2.4 TB with millions of scientific papers in PDF format. This archive can be studied with Big Data methods in manifold ways.

  • R1: Scientific Plagiarism. Extract the plaintext and author info from all PDF files, and start to analyze it (a text-extraction sketch follows below). The goal is to compare all papers for overlapping paragraphs, and subsequently rank and classify the transgressions based on size, frequency, and authorship. Provide a visualization that allows plagiarism to be quantified along various dimensions, such as geography, year and topic; preferably it should also be possible to zoom in on annotated examples of serious plagiarism.

Dataset storage location: dbfs:/mnt/lsde/scihub/
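
For R1, a sketch of the text-extraction step: the PDFs are read as binary files and converted to plain text with the open-source pypdf library, then split into paragraphs that can later be compared (e.g. with MinHash) for overlap. The assumption that the archive is reachable as *.pdf files directly under the mount is ours; check the actual layout first.

    import io
    from pypdf import PdfReader
    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.getOrCreate()

    def pdf_to_text(content):
        # Extract plain text from one PDF given as bytes; skip unreadable files.
        try:
            reader = PdfReader(io.BytesIO(content))
            return "\n\n".join((page.extract_text() or "") for page in reader.pages)
        except Exception:
            return ""

    pdf_text = F.udf(pdf_to_text, T.StringType())

    pdfs = (spark.read.format("binaryFile")
                 .option("pathGlobFilter", "*.pdf")         # assumed layout
                 .load("dbfs:/mnt/lsde/scihub/")
                 .select("path", pdf_text("content").alias("text")))

    # Paragraph-level records; these are the units to compare across papers.
    paragraphs = (pdfs.select("path", F.explode(F.split("text", "\n\n")).alias("par"))
                      .filter(F.length("par") > 200))
    paragraphs.show(5, truncate=80)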

ADS-B Plane Tracking

Commercial airplanes periodically send out radio messages containing their position details (plane identifier, flight number, latitude, longitude, height, speed, ...). These ADS-B messages are picked up by enthusiasts and collected in systems such as the OpenSky network or Flightradar24. We have obtained 700 GB of ADS-B messages from September 2016 and 2017 in a compressed format.

  • P9: Air Cargo. Extract flight trajectories from the OpenSky datasets, and compare air cargo flights with passenger flights: do the flight paths (speed, route, height, time, etc.) differ? Try to understand the main cargo routes and flight frequencies (a starting sketch follows below). Compare the 2016 and 2017 datasets to observe any changes in the market between these samples. Try to find (yourself) and integrate outside sources to get to know more details on the cargo flights (what are the planes transporting, what are the biggest operators).

Dataset storage location: dbfs:/mnt/lsde/opensky/
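
For P9, a sketch of comparing cargo and passenger flights once the ADS-B messages have been decoded into position reports (the tiny DataFrame stands in for the decoded OpenSky data); the callsign prefixes used to label cargo operators are only illustrative.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Stand-in for decoded position reports: callsign, unix ts, altitude (m).
    msgs = spark.createDataFrame(
        [("GTI8052", 1473336000, 10500.0), ("GTI8052", 1473339600, 10700.0),
         ("KLM605",  1473336000, 11000.0), ("KLM605",  1473343200, 11200.0)],
        ["callsign", "ts", "altitude_m"])

    cargo_prefixes = ["GTI", "FDX", "CLX", "BOX"]   # illustrative cargo callsign prefixes

    flights = (msgs.groupBy("callsign")
                   .agg(F.avg("altitude_m").alias("avg_alt_m"),
                        ((F.max("ts") - F.min("ts")) / 3600.0).alias("duration_h"))
                   .withColumn("is_cargo",
                               F.substring("callsign", 1, 3).isin(cargo_prefixes)))

    # Do cargo flights cruise at different altitudes or fly longer legs?
    flights.groupBy("is_cargo").agg(F.avg("avg_alt_m"), F.avg("duration_h")).show()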

Flickr Images

Flickr hosts a staggering amount of pictures, of which a portion is publicly available. A subset of this public Flickr data is available as a textual listing, from which you can crawl the pictures and analyze them in various ways. This data is also popular in image processing research and is hosted by AWS as part of its open data sets under the name Multimedia Commons. This AWS availability means you don't have to download the pictures to S3 yourself anymore, but the original Flickr dataset listing has some more information (e.g. GPS coordinates) that can be useful.

  • F1: FaceJoin (reloaded). Crawl the Flickr picture archive and use image recognition software to identify faces and match them to a list of faces with known identity: a join on faces (a matching sketch follows below). The list of known faces could for instance be the FBI's most wanted list but may also be chosen differently (e.g. famous people from Wikipedia, or missing children). The visualization would show a ranked list of Flickr matches per known portrait.
  • F5: Brand Popularity. Download a large subset of images from Flickr, where these pictures have GPS coordinates. Create a data product that consists of a cleaned-up Flickr dataset where the non-loadable images have been removed, as well as the pictures without GPS location. Find logos in these pictures and create a visualization that allows the popularity of logos to be analyzed across regions and across time.
  • F6: Characteristic Faces (reloaded). Analyze a large subset of images from Flickr, where these pictures have GPS coordinates. Extract faces and facial expressions from these pictures tagged by location. The goal is to summarize the "face of the world" at different levels of spatial granularity (think: world, continent, country, city) by creating a morphed face for each place in the world at each granularity. The existing Characteristic Faces project has a nice approach that you may follow; however, due to the way data was sampled, many regions are underrepresented (having few pictures to build the model from). Another direction for improvement is not to pick a single face per region, but a few different characteristic faces per region. This Characteristic Faces project thus should try to find faces that are not the average, but 'typical' for a region. The idea is to cluster faces for one region, and then pick the average face of the cluster that least resembles clusters in neighbouring regions as the representation of that region. See https://github.com/oarriaga/face_classification?utm_source=mybridge&utm_medium=blog&utm_campaign=read_more
  • F7: Image Transformation (reloaded). Take the data output by the Image Annotation showcase project of LSDE. For the top-100 of any query result, create transformed images with techniques like pix2pix, CycleGAN, .... Specifically, find faces and turn all faces into cartoons. Additionally, find buildings and re-texture their appearance. Create a visualization that augments the Image Annotation showcase with these toggles to transform the output of the queries.

Dataset storage location: dbfs:/mnt/lsde/flickr/ and dbfs:/mnt/lsde/multimedia-commons (which links to s3://multimedia-commons)
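
For F1, a small sketch of the matching step using the open-source face_recognition library; the file names are hypothetical, and in the real pipeline the gallery of known faces would be broadcast and match_faces applied to every crawled picture.

    import face_recognition

    def match_faces(known_encodings, image_path, threshold=0.6):
        # Return (gallery index, distance) for every face in the image that is
        # close enough to one of the known faces.
        image = face_recognition.load_image_file(image_path)
        matches = []
        for encoding in face_recognition.face_encodings(image):
            distances = face_recognition.face_distance(known_encodings, encoding)
            best = distances.argmin()
            if distances[best] < threshold:
                matches.append((int(best), float(distances[best])))
        return matches

    # Build the gallery of known faces once (hypothetical file names).
    wanted = face_recognition.load_image_file("wanted_person.jpg")
    gallery = face_recognition.face_encodings(wanted)
    print(match_faces(gallery, "flickr_photo.jpg"))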

Blockchain

Bitcoin is the biggest digital currency based on so-called blockchain technology. A blockchain is a shared ledger (transaction log) consisting of blocks. The bitcoin blockchain can be written only after computationally-intensive mining. The size of the current bitcoin blockchain is 350 GB (compressed), and we downloaded it for you to analyze.

  • B1: Bitcoin Communities. The transactions of bitcoins form a huge graph; the wallet and transaction entities even form a temporal graph, as these entities have certain timestamp properties. There are various techniques for identifying addresses that belong together; one is to look at change return addresses. This graph can then be further analyzed using techniques from graph analytics (a community-detection sketch follows below). We are generally interested in learning about the shape of this graph, and specifically would like to see community detection performed. The main goal of the project is to identify bitcoin wallets (account numbers) likely belonging to the same person and their interactions. By gathering extra data from the web (do this yourself!), specific bitcoin wallets of known persons/organizations can be found. The visualization could focus on these starting points to highlight or summarize parts of the graph.
  • B5: Bitcoin Snapshots. The transactions of the bitcoin blockchain form a temporal graph as each transaction has a timestamp. We would like to learn about the evolution over time of blockchain usage parameters (users, account balances, transactions, mined blocks, etc). The goal of the project is to extract a graph of wallets and the transactions between them from the raw blockchain data at multiple timepoints (temporal graph snapshots). You could also create a property graph with valid-time properties attached to nodes and edges, from which such multiple snapshots can be derived. The visualization could show certain statistics of these various snapshots to give an overview of how the bitcoin blockchain has evolved.

Dataset storage location: dbfs:/mnt/lsde/bitcoin/. As for software, a useful pointer might be https://github.com/ZuInnoTe/hadoopcryptoledger/
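
For B1, a sketch of the community-detection step with GraphFrames, assuming wallets and the transfers between them have already been extracted from the raw blockchain (e.g. with the hadoopcryptoledger library linked above); the two tiny DataFrames stand in for that extracted graph.

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.getOrCreate()

    # Stand-ins for the extracted wallet graph: vertices need an 'id' column,
    # edges need 'src' and 'dst'.
    wallets = spark.createDataFrame([("w1",), ("w2",), ("w3",)], ["id"])
    transfers = spark.createDataFrame(
        [("w1", "w2", 0.5), ("w2", "w3", 0.2)], ["src", "dst", "btc"])

    g = GraphFrame(wallets, transfers)

    # Label propagation as a cheap first community detection; wallets that end
    # up with the same label are candidates for "belonging together".
    communities = g.labelPropagation(maxIter=5)
    communities.groupBy("label").count().orderBy("count", ascending=False).show()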

Point Clouds

The Actueel Hoogtebestand Nederland (AHN) is a so-called point cloud, consisting of 3D points measured using LIDAR technology. LIDAR stands for Light Detection and Ranging - a remote sensing method that uses light in the form of a pulsed laser to measure ranges to the Earth, in the case of AHN taken from an airplane. We have 2.5 TB of point cloud data for you to analyze.

  • C1: 3D Kadaster (reloaded). We have downloaded XML descriptions containing all Dutch addresses and 2D building plan polygons. We would like you to turn these 2D plans into 3D models of buildings. A previous LSDE showcase achieved some success, however with partial coverage of the country and coarse-grained 3D models (a box); we would like good data coverage and (more) detailed 3D models.
  • C5: Detect Wind Turbines. Train a deep learning model using the point-cloud dataset that detects wind turbines (a point-filtering sketch follows below). You must search yourself for datasets with ground truth (i.e. locations of wind turbines) or annotate these yourself. A stretch goal is to recognize the type of wind turbine (to assess capacity).
  • C6: Tree Count. Train a deep learning model using the point-cloud dataset that detects trees. A basic goal would be to count trees in The Netherlands, and summarize them by size. The main goal though would be to detect the tree species as well. A stretch goal would be to go back to the older ahn2 dataset (ask for help once you get there), and try to measure where trees have appeared and disappeared, and summarize whether the number of trees is going down, or up, per region (e.g. municipality, province).

Dataset storage location: dbfs:/mnt/lsde/ahn3/ (point cloud - check the web viewer) and dbfs:/mnt/lsde/geodata (house surface polygons from Kadaster).
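
For C5/C6, a sketch of a first filtering step on one AHN3 tile with the open-source laspy library (which needs the lazrs or laszip extra for compressed .laz files); the tile name is hypothetical and the 30 m height threshold is an arbitrary choice for tall objects such as turbines or tree tops.

    import numpy as np
    import laspy

    # Hypothetical tile name; /dbfs/... is the FUSE view of dbfs:/ on Databricks.
    tile = laspy.read("/dbfs/mnt/lsde/ahn3/C_25GN1.LAZ")

    z = np.asarray(tile.z)
    high = z > (z.min() + 30.0)    # points more than ~30 m above the lowest point
    print(f"{high.sum()} of {len(z)} points are candidate turbine/tree-top points")

    # The x/y coordinates of these candidates can then be clustered
    # (e.g. with DBSCAN) into individual objects for the deep learning step.
    xy = np.column_stack((np.asarray(tile.x)[high], np.asarray(tile.y)[high]))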

LandSat-8 and Sentinel-2

LandSat-8 and Sentinel-2 are US and EU remote sensing satellites orbiting the poles of the earth, providing free imagery for every spot on the planet: LandSat-8 at 30m granularity and Sentinel-2 at 10m (hence coarser than the DigitalGlobe Quickbird imagery at 0.65m used by Google Earth). However, this data is free, and especially in the case of LandSat has historic information that spans four decades (note that the Sentinel data is requester-pays, so not fully free to use here, but you can do so).

  • L5: Lost Highway. Detect highways from Sentinel imagery (clouds removed), with OpenStreetMap (OSM) data as ground truth. Use this model to detect unknown (new) or no longer existing highways. Create a geographic visualization that shows tiled satellite imagery with the detected roads, as well as the OSM road network, and highlights their discrepancies.
  • L6: Boosted Resolution Maps. Sentinel-2 has better resolution than LandSat-8, but is still way off from Google Maps. Maybe we can create open enhanced resolution images using deep learning. Hence: create a tiled visualization of the world using Sentinel-2 images (clouds removed). Enhance this imagery by using deep learning techniques to increase their resolution (see e.g.: "single image super-resolution with deep neural networks"). You could combine this with Google Maps visualization to show the relative success of this endeavour.
  • L7: Deforestation in the Rainforest. The world's natural jungles are increasingly being torn down by humans. It typically starts with building a road, then junctions of these appear, while more and more trees disappear. Detect and visualize this process over the years in the Amazon (and possibly other rainforests in the world); a vegetation-index sketch follows below. Then create a predictive model from this data and predict what the state of the rainforest will be in the future.

The LandSat-8 data is in dbfs:/mnt/lsde/landsat-pds; Sentinel-2 data is "requester pays" in dbfs:/mnt/lsde/sentinel-s2-l1c (which links to s3://sentinel-s2-l1c) and dbfs:/mnt/lsde/sentinel-s2-l2a (which links to s3://sentinel-s2-l2a). The metadata is in dbfs:/mnt/lsde/sentinel-inventory (which links to s3://sentinel-inventory). OpenStreetMap is in dbfs:/mnt/lsde/osm-pds
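
For L7, a sketch of computing NDVI (a standard vegetation index) for one Sentinel-2 tile with rasterio; deforestation shows up as NDVI dropping between years. The band naming (B04 = red, B08 = near infrared) follows the Sentinel-2 convention, but the tile path is hypothetical and the bucket is requester-pays.

    import numpy as np
    import rasterio

    def ndvi(red_path, nir_path):
        # NDVI = (NIR - red) / (NIR + red), computed per pixel.
        with rasterio.open(red_path) as r, rasterio.open(nir_path) as n:
            red = r.read(1).astype("float32")
            nir = n.read(1).astype("float32")
        return (nir - red) / np.maximum(nir + red, 1e-6)   # avoid division by zero

    tile = "/dbfs/mnt/lsde/sentinel-s2-l1c/tiles/21/L/YG/2019/8/1/0"   # hypothetical tile
    index = ndvi(f"{tile}/B04.jp2", f"{tile}/B08.jp2")
    print("mean NDVI:", float(index.mean()))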

Big SQL

Data management is rapidly moving to the cloud, replacing upfront costs with operational costs that have low barriers to entry. Cloud providers are now offering SQL cloud services, also in the area of data warehousing. However, it is often unclear what the performance, price/performance and elasticity trade-offs of these various solutions are.

  • D4: Cloud Database Benchmark. We would be interested in a performance comparison at the SF1000 or SF3000 size of the TPC-H benchmark. It would be interesting to test the performance of various cloud database systems, specifically Spark SQL and Redshift, and services such as Athena and Snowflake, and compare them in terms of performance per dollar (a timing sketch follows below).
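
For D4, a sketch of timing one TPC-H query (Q1, abbreviated here) on Spark SQL; it assumes the SF1000 lineitem table has already been generated and registered. Roughly the same query text can be submitted to Redshift, Athena or Snowflake, so that runtimes can be converted into performance per dollar.

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # TPC-H Q1, abbreviated: assumes a registered 'lineitem' table.
    Q1 = """
    SELECT l_returnflag, l_linestatus,
           SUM(l_quantity)                          AS sum_qty,
           SUM(l_extendedprice * (1 - l_discount))  AS sum_disc_price,
           AVG(l_quantity)                          AS avg_qty,
           COUNT(*)                                 AS count_order
    FROM   lineitem
    WHERE  l_shipdate <= DATE '1998-12-01' - INTERVAL 90 DAYS
    GROUP  BY l_returnflag, l_linestatus
    ORDER  BY l_returnflag, l_linestatus
    """

    start = time.time()
    spark.sql(Q1).collect()
    print(f"TPC-H Q1 took {time.time() - start:.1f} seconds")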

Common Crawl

The Common Crawl corpus contains petabytes of web pages collected since 2008. It contains raw web page data, extracted metadata and text extractions. Common Crawl currently stores the crawl data using the Web ARChive (WARC) format. There are also WAT files, which store computed metadata for the data stored in the WARC. Finally, there are WET files, which store extracted plaintext from the data stored in the WARC. There is also a serverless Athena SQL service that indexes the WARC files.

  • W1: Covid-19 on the Web. Take a relatively small subset of the 2020 Common Crawl by looking only at web pages in The Netherlands (or maybe the UK) and focus on mentions of Covid-19 (or its synonyms, or related topics); a filtering sketch follows below. Extract interesting attributes including the importance of the web page (e.g. the number of incoming links in that same crawl, or better, the PageRank of that page), keyword or statement extraction from the page, as well as sentiment analysis of the text surrounding the mention of Covid-19 on that page. Do this for multiple crawls (e.g. Feb, Mar, May, Jul, Aug). Create a visualization that shows the thinking, sub-themes and opinions shaping the discourse around the pandemic over time.
  • W2: Non-scientific Impact of DB and AI research. Take the last few years of database (DB) and AI research papers from the DBLP computer science repository. For each paper, try to find references to it (mentions) on web pages in the Common Crawl (restrict to possibly one crawl, and possibly only take a sample), and augment the data with a count of these references (stretch goal: possibly split by the year in which the referencing page was likely written). Create a search interface for papers by author name or title, where ranking is done on that citation count. Stretch goal: try to stay under the Google Scholar radar and extract the scientific citation count of the papers (citations in the bibliography of other papers) as well. The DBLP dataset can be downloaded from: https://dblp.uni-trier.de/xml/

The Common Crawl data is in dbfs:/mnt/lsde/commoncrawl (which links to s3://commoncrawl). This data is seriously large, in terms of petabytes, which will break your budget if care is not taken. Therefore, carefully selecting and restricting the data to access is important.
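
For W1, a sketch of scanning one WET segment with the open-source warcio library and keeping Dutch pages that mention Covid-19; the segment path is hypothetical (list real paths via the crawl index or the Athena table first), and /dbfs/... is the FUSE view of the mount.

    from warcio.archiveiterator import ArchiveIterator

    # Hypothetical segment path; pick real WET paths from the crawl index first.
    wet_path = "/dbfs/mnt/lsde/commoncrawl/crawl-data/CC-MAIN-2020-16/.../x.warc.wet.gz"

    hits = []
    with open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":          # WET plaintext records
                continue
            url = record.rec_headers.get_header("WARC-Target-URI") or ""
            if ".nl/" not in url:
                continue
            text = record.content_stream().read().decode("utf-8", errors="ignore")
            if "covid" in text.lower() or "corona" in text.lower():
                hits.append((url, len(text)))

    print(f"{len(hits)} matching .nl pages in this segment")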

TROPOMI

This data set consists of observations from the Sentinel-5 Precursor (Sentinel-5P) satellite of the European Commission’s Copernicus Earth Observation Programme. Sentinel-5P is a polar orbiting satellite that completes 14 orbits of the Earth a day. It carries the TROPOspheric Monitoring Instrument (TROPOMI), a spectrometer that senses ultraviolet (UV), visible (VIS), near infrared (NIR) and short wave infrared (SWIR) light to monitor ozone, methane, formaldehyde, aerosol, carbon monoxide, nitrogen dioxide and sulphur dioxide in the atmosphere. This satellite is managed in The Netherlands by SRON and KNMI.

  • I1: Covid-19 Pollution Decrease. Covid-19 impact on air pollution: create a visualization (e.g. videos) of the various levels of (polluting) gases that TROPOMI measures and visualize the global impact of the Covid-19 pandemic on these emissions geographically and over time at various heights of the troposphere (a data-loading sketch follows below). While a previous LSDE showcase produced a nice visualization, their results can probably be improved -- at the very least the analysis should be extended up until the current time.

The TROPOMI data is available in dbfs:/mnt/lsde/meeo-s5p (which links to s3://meeo-s5p).
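
For I1, a sketch of opening one Sentinel-5P NO2 product with xarray; the file name is hypothetical and the group/variable names follow the standard L2 NO2 product layout, so verify them against the files in the mount.

    import xarray as xr

    # Hypothetical product file; list the actual NetCDF files in the mount first.
    path = "/dbfs/mnt/lsde/meeo-s5p/S5P_OFFL_L2__NO2____20200401.nc"
    ds = xr.open_dataset(path, group="PRODUCT")

    # Tropospheric NO2 column density for this orbit (standard L2 variable name).
    no2 = ds["nitrogendioxide_tropospheric_column"]
    print(float(no2.mean()))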

Reddit

A collection of Reddit posts and comments from 2005-2021, in total 1.3 TB of compressed ndjson files. It contains quite specialized discussion topics as well as potentially controversial and shady materials. This dataset only contains the texts of the posts and the post structure.

  • RE1: Heated Discussions. What are the most heated discussions around COVID on Reddit? Rank the discussions. Summarize and visualize the discussion topics as well as the discussion structure and contents. Techniques like sentiment analysis and text summarization will be useful here (a sentiment-scoring sketch follows below).
  • RE2: Expert Finding. Find the top-20 expert users for the 1000 most popular subreddits. Some users will be experts in multiple subreddits. Automatically rank, summarize and visualize the topic of the subreddits, and summarize and visualize the expertise of the experts.

The Reddit data is available in dbfs:/mnt/lsde/reddit
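
For RE1, a sketch of loading the ndjson dumps and scoring COVID-related comments with the VADER sentiment model from nltk; the field names 'body' and 'subreddit' follow the usual Reddit dump schema, so verify them against the files, and the vader_lexicon needs to be downloaded on the workers.

    from pyspark.sql import SparkSession, functions as F, types as T
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    spark = SparkSession.builder.getOrCreate()

    comments = (spark.read.json("dbfs:/mnt/lsde/reddit/")
                     .filter(F.lower(F.col("body")).contains("covid")))

    def sentiment(text):
        # Compound VADER score in [-1, 1]; requires nltk.download('vader_lexicon').
        return float(SentimentIntensityAnalyzer().polarity_scores(text)["compound"])

    sentiment_udf = F.udf(sentiment, T.DoubleType())

    (comments.withColumn("sentiment", sentiment_udf("body"))
             .groupBy("subreddit")
             .agg(F.avg("sentiment").alias("avg_sentiment"), F.count("*").alias("n"))
             .orderBy("avg_sentiment")
             .show(20))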

CrossRef

A collection of academic papers indexed by Crossref.org.

  • CR1: Scientific Impact. Compute the h-index of authors (a sketch follows below) and try to characterize the citation quality of authors by analyzing who is citing them: are people citing themselves or a close inner circle to boost their scores? If so, when did this start? Also, characterize the completeness of the Crossref data by comparing the h-indexes to those of Google Scholar. One way to characterize the impact of a scientific paper is to count the number of papers that cite it, and also the number of papers that cite those papers, etc. (recursively). Using a similar metric, what are the most impactful scientific papers of the last 15 years?
  • CR2: Scientific Communities. A scientific community is a group of researchers who work in a specific area who regularly publish at the same venues (for example, the database research community publishes at VLDB, SIGMOD, ICDE, EDBT, TODS, TKDE, and VLDBJ). In general, papers are more likely to cite other papers in their respective communities. Based on the citation data, which communities can be identified? Which are the most closed communities (which only cite a few papers outside their venues)? Which communities cite each other most frequently?

The CrossRef data is available in dbfs:/mnt/lsde/crossref
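
For CR1, a sketch of the h-index computation itself, assuming per-paper citation counts have already been joined to authors from the Crossref dump; the tiny DataFrame below stands in for that intermediate result.

    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.getOrCreate()

    # Stand-in for (author, paper, citation count) extracted from Crossref.
    paper_cites = spark.createDataFrame(
        [("alice", "doi:10.1/a", 25), ("alice", "doi:10.1/b", 8),
         ("alice", "doi:10.1/c", 3),  ("bob",   "doi:10.1/d", 1)],
        ["author", "doi", "citations"])

    def h_index(citations):
        # Largest h such that the author has h papers with at least h citations.
        cites = sorted(citations, reverse=True)
        return sum(1 for i, c in enumerate(cites) if c >= i + 1)

    h_udf = F.udf(h_index, T.IntegerType())

    (paper_cites.groupBy("author")
                .agg(F.collect_list("citations").alias("cites"))
                .withColumn("h_index", h_udf("cites"))
                .show())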

DotaLeague

A json archive of matches played on the gaming service OpenDota (450 GB uncompressed).

  • DL1: Build Popularity. The key decisions that players make during an OpenDota game are what heroes to select and what items/skills to pick later in the game. What are popular builds and how did these change over time? Can these changes be connected to the game’s online metaverse (published on Twitch, Twitter, YouTube, etc.)?

The DotaLeague data is available in dbfs:/mnt/lsde/yasp-dota

GAB

A json archive of the alt-right social network GAB, over the period of 2016-2019.

  • G1: Alt-Right Sentiment. Sentiment analysis can reveal subjective opinions and feelings based on text. How did the major topics and the posts’ sentiment change over time in the GAB network? Does the connectivity of the social graph correlate with the sentiment? Are the changes correlated to real-world events or are they mainly driven by the narrative of the network?

The GAB data is available in dbfs:/mnt/lsde/gab