LSDE: Large Scale Data Engineering 2022
Data Analysis Projects

In this assignment, the goal is to gain experience in creating Big Data analysis pipelines. Computation for all projects will be performed on Databricks via Amazon Web Services.

Datasets and Topics

In the following, we describe the datasets and project topics teams can work on. Some datasets are used by several projects.

These are the projects for 2022.

SciHub

Sci-Hub is a controversial project that aims to make all scientific publications freely available. We have obtained a 2.4TB archive containing millions of scientific papers in PDF format. This archive can be studied with Big Data methods in many ways.

  • R1: Scientific Plagiarism. Extract the plaintext and author information from all PDF files (see the extraction sketch after this list) and start to analyze it. The goal is to compare all papers for overlapping paragraphs, and subsequently rank and classify the transgressions by size, frequency, and authorship. Provide a visualization that makes it possible to quantify plagiarism along various dimensions, such as geography, year and topic; preferably it should also allow zooming into annotated examples of serious plagiarism instances.
  • R2: Scientific Topic Trends. Extract the plaintext and author information from all PDF files, and start to analyze it. You may use DOI information to get basic metadata from the web and determine the scientific area in which the paper was published, and when, and then do text mining to extract topic trends in the various research areas, their (sub-)disciplines and research topics. Coming up with such a topic hierarchy is also part of the task, but you may use external metadata that you might find on the web. Also think of a way to nicely visualize these topic trends per area over time.
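Both SciHub projects start from the same step: turning millions of PDFs into plaintext at scale. Below is a minimal sketch of that step in a Databricks notebook (where spark is predefined); it assumes the PyMuPDF library (imported as fitz) is installed on the cluster, and the output path is a placeholder for your own group's storage.

    # Minimal sketch: extract plaintext from the SciHub PDFs with Spark.
    # Assumes PyMuPDF ("pymupdf", imported as fitz) is installed on the
    # cluster; the exact directory layout under the mount may differ.
    import fitz  # PyMuPDF
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def pdf_to_text(content):
        """Best-effort plaintext extraction from one PDF given as raw bytes."""
        try:
            with fitz.open(stream=content, filetype="pdf") as doc:
                return "\n".join(page.get_text() for page in doc)
        except Exception:
            return None  # unreadable/corrupt PDFs are filtered out below

    pdf_to_text_udf = udf(pdf_to_text, StringType())

    pdfs = (spark.read.format("binaryFile")
            .option("pathGlobFilter", "*.pdf")
            .option("recursiveFileLookup", "true")
            .load("dbfs:/mnt/lsde/datasets/papers/"))

    texts = (pdfs.select("path", pdf_to_text_udf("content").alias("text"))
                 .where("text IS NOT NULL"))

    # Persist the extracted text once, so downstream jobs (overlap detection,
    # topic mining) never have to touch the PDFs again. Placeholder path:
    texts.write.mode("overwrite").parquet("dbfs:/mnt/<your-group-storage>/scihub_text.parquet")

Start with a small sample of the archive before running this over the full 2.4TB.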

Dataset storage location: dbfs:/mnt/lsde/datasets/papers/

Flickr Images

Flickr hosts a staggering number of pictures, of which a portion is publicly available. A subset of these public pictures is listed in a textual description, from which you can crawl the pictures and analyze them in various ways. This data is also popular in image-processing research and is hosted by AWS as part of its open data sets under the name Multimedia Commons. This AWS availability means you don't have to download the pictures to S3 anymore, but the original flickr dataset listing has some additional information (e.g. GPS coordinates) that can be useful.

  • F1: FaceJoin (reloaded). Crawl the flickr picture archive and use image-recognition software to identify faces and match them to a list of faces with known identities: a join on faces (see the sketch after this list). The list of known faces could for instance be the FBI's most-wanted list, but it may also be chosen differently (e.g. famous people from Wikipedia, or missing children). The visualization would show a ranked list of flickr matches per known portrait.
  • F2: Deep Locations. Crawl the flickr picture archive to get a large number of pictures with GPS annotations. Now try to train a deep learning model to predict the GPS location of a picture. It could be a good idea to remove persons from the pictures automatically, but you must figure this out for yourselves. Create a visualization that shows the accuracy of your approach and/or allows uploading a picture and predicts its GPS location.
  • F5: Brand Popularity. Download a large subset of images from flickr that have GPS coordinates. Create a data product that consists of a cleaned-up flickr dataset from which the non-loadable images have been removed, as well as the pictures without a GPS location. Find logos in these pictures and create a visualization that allows analyzing the popularity of logos across regions and across time.
  • F6: Characteristic Faces (reloaded). Analyze a large subset of images from flickr that have GPS coordinates. Extract faces and facial expressions from these pictures, tagged by location. The goal is to summarize the "face of the world" at different levels of spatial granularity (think: world, continent, country, city) by creating a morphed face for each place in the world at each granularity. The existing Characteristic Faces showcase has a nice approach that you may follow; however, due to the way its data was sampled, many regions are underrepresented (having few pictures to build the model from). Another direction for improvement is not to pick a single face per region, but a few different characteristic faces per region. This Characteristic Face project should thus try to find faces that are not the average, but 'typical' for a region. The idea is to cluster faces for one region, and then pick the average face of the cluster that least resembles the clusters in neighbouring regions as the representation of that region. See https://github.com/oarriaga/face_classification?utm_source=mybridge&utm_medium=blog&utm_campaign=read_more
  • F7: Image Transformation (reloaded). Take the data output by the Image Annotation showcase project of LSDE. For the top-100 of any query result, create transformed images with techniques like pix2pix, CycleGAN, etc. Specifically, find faces and turn all faces into cartoons. Additionally, find buildings and re-texture their appearance. Create a visualization that augments the Image Annotation showcase with these toggles to transform the output of the queries.
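As a starting point for F1, the sketch below matches face encodings found in crawled flickr pictures against a small gallery of known faces. It assumes the face_recognition library is installed; the gallery directory and the 0.6 distance threshold are illustrative assumptions, not part of the assignment.

    # Minimal sketch for F1: match faces in crawled flickr images against a
    # gallery of known identities. Paths and the 0.6 cut-off are assumptions.
    import os
    import face_recognition

    # Build one encoding per known identity (e.g. one portrait per person).
    gallery_dir = "/dbfs/mnt/<your-group-storage>/known_faces"  # placeholder
    known_names, known_encodings = [], []
    for fname in os.listdir(gallery_dir):
        image = face_recognition.load_image_file(os.path.join(gallery_dir, fname))
        encodings = face_recognition.face_encodings(image)
        if encodings:
            known_names.append(os.path.splitext(fname)[0])
            known_encodings.append(encodings[0])

    def match_faces(picture_path):
        """Return (identity, distance) pairs for every face found in one picture."""
        image = face_recognition.load_image_file(picture_path)
        matches = []
        for encoding in face_recognition.face_encodings(image):
            distances = face_recognition.face_distance(known_encodings, encoding)
            best = distances.argmin()
            if distances[best] < 0.6:  # commonly used cut-off, tune it
                matches.append((known_names[best], float(distances[best])))
        return matches

In a full pipeline, match_faces would be applied per partition (or as a pandas UDF) over the crawled image paths rather than in a driver-side loop.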

Dataset storage location: dbfs:/mnt/lsde/datasets/flickr/ holds all the metadata, and dbfs:/mnt/lsde/datasets/multimedia-commons (which links to s3://multimedia-commons) holds the images (they are also online at their original URLs at flickr.com). Note that the AWS picture names are derived in a botched way from the hash of the original URLs (the string omits bytes that happened to be 0 in the hash value), but with some tinkering it is possible to relate them back.

Point Clouds

The Actueel Hoogtebestand Nederland (AHN) is a so-called point cloud, consisting of 3D points measured using LIDAR technology. LIDAR stands for Light Detection and Ranging: a remote sensing method that uses light in the form of a pulsed laser to measure ranges to the Earth, in the case of AHN taken from an airplane. We have 2.5TB of point-cloud data for you to analyze.

  • C1: 3D Kadaster (reloaded). We have downloaded XML descriptions containing all Dutch addresses and 2D building-plan polygons. We would like you to turn these 2D plans into 3D models of buildings. A previous LSDE showcase achieved some success, but with only partial coverage of the country and coarse-grained 3D models (a box per building); we would like good data coverage and (more) detailed 3D models.
  • C4: Illegal Buildings (reloaded). Train a deep learning model using the point-cloud dataset and the kadaster information on buildings as ground truth. Use this model to detect houses that are not on the kadaster map, or have larger dimensions than declared. An existing showcase shows it can be done, though still with significant detection errors; you might be able to improve its robustness.
  • C5: Detect Wind Turbines. Train a deep learning model using the point-cloud dataset that detects wind turbines. You must search for datasets with ground truth yourself (i.e. locations of wind turbines) or annotate these yourself. A sub-goal is to recognize the type of wind turbine (to assess capacity), and this might even go down to historic windmills. A stretch goal is to do so on both the AHN2 and AHN3 datasets and also show how wind turbine construction has progressed over the last few years.
  • C6: Forests of Trees. Train a deep learning model using the point-cloud dataset that detects trees. A basic goal would be to count the trees in The Netherlands and summarize them by size. Another goal could be to visualize forest areas and display how many trees are in them. An absolute stretch goal would be to detect the tree species as well. This could be done on both the AHN2 and AHN3 datasets, so that the dynamics of trees and forests in The Netherlands can be visualized (how does this change over time?).
  • C7: The Sinking Netherlands. We would like you to summarize the Dutch terrain into areas of equal height at a relatively coarse granularity, and embed this in a height map (a minimal aggregation sketch follows this list). This should be done for both the AHN2 and AHN3 datasets, and you should detect and visualize areas of The Netherlands that are sinking. Of special interest are built-up areas that are sinking, as well as some flooding visualization (which areas are below sea level, possibly visualized with a dial where one can see the differences at different levels of sea-level rise in the future).
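For the AHN topics (C7 in particular, but the other point-cloud projects start similarly), the first step is usually to reduce billions of LIDAR points to a coarse raster. A minimal sketch, assuming the laspy library is installed (plus a LAZ backend such as lazrs if the tiles are compressed) and that a tile has been copied to local scratch space; the 100 m cell size is an illustrative choice.

    # Minimal sketch: turn one AHN tile into a coarse height grid (C7).
    import numpy as np
    import laspy

    CELL = 100.0  # grid cell size in metres (AHN uses the metric RD projection)

    def tile_to_grid(path):
        """Return (cell_x, cell_y, mean_height, point_count) rows for one tile."""
        las = laspy.read(path)
        x, y, z = np.asarray(las.x), np.asarray(las.y), np.asarray(las.z)
        cx = np.floor(x / CELL).astype(np.int64)
        cy = np.floor(y / CELL).astype(np.int64)
        # Aggregate per cell: mean height and number of points.
        keys, inverse, counts = np.unique(
            np.stack([cx, cy], axis=1), axis=0,
            return_inverse=True, return_counts=True)
        means = np.bincount(inverse, weights=z) / counts
        return [(int(kx), int(ky), float(m), int(c))
                for (kx, ky), m, c in zip(keys, means, counts)]

Each Spark task would call tile_to_grid on a few tiles and union the per-tile grids into one DataFrame; diffing the AHN2 and AHN3 grids then highlights sinking areas.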

Dataset storage location: dbfs:/mnt/lsde/datasets/ahn3/ and dbfs:/mnt/lsde/datasets/ahn2/ (point clouds - check the web viewer) and dbfs:/mnt/lsde/datasets/geodata (house surface polygons from the Kadaster).

AIS Ship Tracking

Commercial ships periodically send out messages containing their position details (ship identifier, latitude, longitude, speed, ...). These AIS messages are collected and visualized in systems such as MarineTraffic. We have 2.6 TB of AIS messages (TXT/AIVDM format) from the years 2016 and 2017.

  • S2: Running for Oil. Identify oil tankers (e.g. using the IMO number), group them by company and country, and identify their trips, trip speed, or even specific loitering behaviour. Try to correlate oil transportation and travel speed with the oil price. Is there more or less traffic when oil prices are high or low? Can we predict future oil prices from movement on the ocean? Do ships delay discharge (loiter) while prices are rising, and accelerate when prices are dropping?
  • S3: Suspicious Outage. Reconstruct ship trajectories with a focus on incomplete trajectories that, given the reception range, would be expected to be complete. In other words, ship trajectories where for some reason the AIS message sender had been turned off during parts of the trip. Find incidents with such ships (e.g. using the IMO number), group them to find the most suspicious ones and visualize their routes.
  • S4: Meetings at Sea. Try to identify situations in which ships were close to each other for an extended period of time, possibly but not necessarily staying in the same place: close enough to make a transfer of cargo a possibility (see the sketch after this list). Augment the encounters with additional information you can find on the properties of these ships (names, type, port, function, etc., e.g. using the IMO number), but also with other sources you might find.
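A common way to find candidate encounters for S4 without an all-pairs comparison is to bucket decoded positions by time window and coarse spatial cell and only join within a bucket. A minimal sketch, assuming the raw AIVDM lines have already been decoded (e.g. with a library such as pyais) into a DataFrame positions with columns mmsi, ts (timestamp), lat and lon; those names, the 10-minute window and the 0.01-degree cell size are placeholders.

    # Minimal sketch for S4: candidate ship encounters via space/time bucketing.
    from pyspark.sql import functions as F

    bucketed = (positions
        .withColumn("t_bucket", F.window("ts", "10 minutes").start)
        .withColumn("lat_cell", F.floor(F.col("lat") / 0.01))
        .withColumn("lon_cell", F.floor(F.col("lon") / 0.01)))

    a, b = bucketed.alias("a"), bucketed.alias("b")

    # Join only within the same time window and spatial cell, keeping each
    # unordered pair of distinct ships once.
    candidates = (a.join(b,
            (F.col("a.t_bucket") == F.col("b.t_bucket")) &
            (F.col("a.lat_cell") == F.col("b.lat_cell")) &
            (F.col("a.lon_cell") == F.col("b.lon_cell")) &
            (F.col("a.mmsi") < F.col("b.mmsi")))
        .select(F.col("a.mmsi").alias("mmsi_a"),
                F.col("b.mmsi").alias("mmsi_b"),
                F.col("a.t_bucket").alias("t_bucket")))

    # Pairs that stay close across many time windows are the interesting
    # "meetings at sea" candidates.
    meetings = (candidates
        .groupBy("mmsi_a", "mmsi_b")
        .agg(F.countDistinct("t_bucket").alias("windows_together"))
        .orderBy(F.col("windows_together").desc()))

Cell-boundary effects can be handled by also joining with neighbouring cells, and the surviving candidates should be re-checked with an exact distance computation.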

Dataset storage location: dbfs:/mnt/lsde/datasets/ais/

Common Crawl

The Common Crawl corpus contains petabytes of web pages collected since 2008. It contains raw web page data, extracted metadata and text extractions. Common Crawl currently stores the crawl data using the Web ARChive (WARC) format. There are also WAT files, which store computed metadata for the data stored in the WARC files. Finally, there are WET files, which store extracted plaintext from the data stored in the WARC files. There is also a serverless Athena SQL service that indexes the WARC files.

  • W1: Covid-19 on the Web. Take a relatively small subset of the Common Crawl by looking only at web pages from The Netherlands (or maybe the UK) and focus on mentions of Covid-19 (or its synonyms, or related topics); see the WET-filtering sketch after this list. Extract interesting attributes, including the importance of the web page (e.g. the number of incoming links in that same crawl, or better, the PageRank of that page), keyword or statement extraction, as well as sentiment analysis, from the text surrounding the mention of Covid-19 on that page. Do this for multiple crawls (e.g. from Feb 2020 until May 2022). Create a visualization that shows the thinking, sub-themes and opinions shaping the discourse around the pandemic over time.
  • W2: Non-scientific Impact of DB research. Take the last 20 years of database research papers (PVLDB, SIGMOD, ICDE) from the DBLP computer science repository. For each paper, try to find references to it (mentions of it) on web pages in the Common Crawl (restrict this to possibly one crawl, and possibly only take a sample), and augment the data with a count of these references (stretch goal: possibly split by the year in which the referencing page was likely written). Create a search interface for papers by author name or title, where ranking is done on that citation count. Stretch goal: try to stay under the Google Scholar radar and extract the scientific citation count of the papers (citations in the bibliography of other papers) as well. The DBLP dataset can be downloaded from: https://dblp.uni-trier.de/xml/
  • W3: Database Systems Ranking. Take comparable samples from different years from Common Crawl, looking for mentions of database systems (e.g. PostgreSQL, MySQL, Oracle, DuckDB, ClickHouse, Spark SQL, etc.) in the HTML pages you find. You must think carefully about how to sample, and which pages to get, because the web crawls are huge. From this information, construct a timeline of rankings of popular database systems. A stretch goal would be to perform sentiment analysis on the texts surrounding these mentions, in order to gauge public opinion on these database systems.
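For W1, one way to start is to stream a single WET file and keep only .nl pages that mention Covid-19. A minimal sketch, assuming the warcio library is installed, that the /dbfs fuse mount can be read like a local file, and using a placeholder file name (real paths come from the crawl's wet.paths listing).

    # Minimal sketch for W1: filter one WET (extracted plaintext) file for
    # Dutch pages mentioning Covid-19. The file name is a placeholder.
    import re
    from warcio.archiveiterator import ArchiveIterator

    COVID = re.compile(r"\b(covid[- ]?19|corona(virus)?|sars[- ]cov[- ]2)\b", re.I)

    hits = []
    with open("/dbfs/mnt/lsde/datasets/commoncrawl/<some-wet-file>.warc.wet.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":   # WET text records
                continue
            url = record.rec_headers.get_header("WARC-Target-URI") or ""
            if ".nl/" not in url:
                continue
            text = record.content_stream().read().decode("utf-8", errors="ignore")
            if COVID.search(text):
                hits.append((url, len(text)))

    print(len(hits), "Dutch pages mentioning Covid-19 in this WET file")

In the full pipeline you would list the WET paths of each monthly crawl and distribute this per-file filter over the cluster rather than looping on the driver.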

The Common Crawl data is in dbfs:/mnt/lsde/datasets/commoncrawl (which links to s3://commoncrawl). This data is seriously large (petabytes) and will break your budget if care is not taken. Therefore, carefully selecting and restricting the data you access is important.

Reddit

A collection of Reddit posts and comments from 2005-2021, in total 1.3TB of compressed ndjson files. It contains quite specialized discussion topics as well as potentially controversial and shady material. This dataset only contains the texts of the posts and the post structure.

  • RE1: Heated Discussions. What are the most heated discussions around COVID in Reddit? Rank the discussions. Summarize and visualize the discussion topic as well as the discussion structure and contents. Techniques like sentiment analysis and text summarization will be useful here.
  • RE2: Expert Finding. Find the top-20 expert users for the 1000 most popular subreddits. Some users will be experts in multiple subreddits. Automatically rank, summarize and visualize the topic of the subreddits, and summarize and visualize the expertise of the experts.
  • RE3: Ukraine War. Characterize the occurrence of text on the conflict in Ukraine from 2014 onwards. We are interested in the number of posts on this topic, the number of posters, and the sentiment of the posts (sentiment analysis on the text; see the sketch after this list). You may also summarize the discussions on the topic in word clouds. It would also be interesting to learn which are the most relevant subreddits for this topic over time, as well as the most prolific posters over time, and to classify these based on their point of view (i.e. pro-Ukraine vs pro-Russia).
  • RE4: Hate Speech. Scan Reddit for hate speech, and classify this hate speech in some way. The goal would be to understand and visualize what type of hate speech is most common, and in which subreddits it occurs, and to learn how this evolves over time (you may focus on the last X years, e.g. X=10). We would also be interested to learn about the users most involved in hate speech over time.
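For RE3 (and similarly RE1), a first pass could count matching posts per month and attach a simple sentiment score. A minimal sketch, assuming the ndjson records expose body and created_utc fields, that the files are in a compression format Spark can read directly, and that NLTK's vader_lexicon has been downloaded on the cluster nodes; the keyword pattern is only a starting point.

    # Minimal sketch for RE3: monthly volume and average sentiment of
    # Ukraine-related Reddit comments. Field names and the keyword pattern
    # are assumptions -- check the actual dump schema first.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType
    from nltk.sentiment import SentimentIntensityAnalyzer

    # In practice, restrict this to a subset of the dump files first.
    comments = spark.read.json("dbfs:/mnt/lsde/datasets/reddit/")

    ukraine = (comments
        .where(F.lower(F.col("body")).rlike("ukraine|donbas|crimea"))
        .withColumn("month", F.date_trunc("month",
                    F.col("created_utc").cast("long").cast("timestamp"))))

    @F.udf(DoubleType())
    def vader_compound(text):
        if not text:
            return 0.0
        # One analyzer per call is slow but keeps the sketch simple.
        return float(SentimentIntensityAnalyzer().polarity_scores(text)["compound"])

    trend = (ukraine
        .withColumn("sentiment", vader_compound("body"))
        .groupBy("month")
        .agg(F.count("*").alias("posts"),
             F.avg("sentiment").alias("avg_sentiment"))
        .orderBy("month"))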

The Reddit data is available in dbfs:/mnt/lsde/datasets/reddit

CrossRef

A collection of metadata records (including citation links) for academic papers indexed by Crossref.org.

  • CR1: Scientific Impact. Compute the h-index of authors (a minimal sketch follows this list) and try to characterize the citation quality of authors by analyzing who is citing them: are people citing themselves or a close inner circle to boost their scores? If so, when did this start? Also, characterize the completeness of the Crossref data by comparing the h-indexes to those of Google Scholar. One way to characterize the impact of a scientific paper is to count the number of papers that cite it, and also the number of papers that cite those papers, etc. (recursively). Using a similar metric, what are the most impactful scientific papers of the last 15 years?
  • CR2: Scientific Communities. A scientific community is a group of researchers who work in a specific area who regularly publish at the same venues (for example, the database research community publishes at VLDB, SIGMOD, ICDE, EDBT, TODS, TKDE, and VLDBJ). In general, papers are more likely to cite other papers in their respective communities. Based on the citation data, which communities can be identified? Which are the most closed communities (which only cite a few papers outside their venues)? Which communities cite each other most frequently?
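For CR1, once the Crossref records have been flattened into a citation count per (author, paper), the h-index itself is a small aggregation. A minimal sketch, assuming a DataFrame paper_citations with columns author and citations (placeholder names for whatever you derive from the Crossref dump).

    # Minimal sketch for CR1: h-index per author from per-paper citation counts.
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Rank each author's papers by citation count, most cited first.
    w = Window.partitionBy("author").orderBy(F.col("citations").desc())
    ranked = paper_citations.withColumn("rank", F.row_number().over(w))

    # h-index = the largest rank r such that the r-th most cited paper
    # still has at least r citations.
    h_index = (ranked
        .where(F.col("citations") >= F.col("rank"))
        .groupBy("author")
        .agg(F.max("rank").alias("h_index")))

Authors without any cited papers drop out of this result and implicitly have an h-index of 0.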

The CrossRef data is available in dbfs:/mnt/lsde/datasets/crossref

DotaLeague

A JSON archive of Dota 2 matches collected by the OpenDota service (450GB uncompressed).

  • DL1: Build Popularity. The key decisions that players make during a match are which heroes to select and which items/skills to pick later in the game. What are popular builds and how did these change over time (see the sketch below)? Can these changes be connected to the game's online metaverse (published on Twitch, Twitter, YouTube, etc.)?
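A first cut at DL1 could simply count hero picks per month. A minimal sketch, assuming each match record carries a start_time (unix seconds) and a players array whose elements include a hero_id; these fields appear in OpenDota match exports, but verify against the actual dump schema.

    # Minimal sketch for DL1: hero pick counts per month.
    from pyspark.sql import functions as F

    matches = spark.read.json("dbfs:/mnt/lsde/datasets/opendota/")

    picks = (matches
        .withColumn("month", F.date_trunc("month",
                    F.col("start_time").cast("long").cast("timestamp")))
        .select("month", F.explode("players.hero_id").alias("hero_id")))

    popularity = (picks
        .groupBy("month", "hero_id")
        .count()
        .orderBy("month", F.col("count").desc()))

Item and skill builds would come from richer per-player fields in the same records, aggregated in the same way.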

The DotaLeague data is available in dbfs:/mnt/lsde/datasets/opendota