LSDE 2022 - Large Scale Data Engineering


LSDE: Large Scale Data Engineering 2022

Data Analysis Projects: Your Task

Assignment

Each group chooses one of the topics on the projects page on a first-come, first-served (FCFS) basis, in the order of the final leaderboard ranking of practical 1c.

Group 18 (0.66) - RE4: Hate Speech
Group 02 (0.80) - C7: The Sinking Netherlands
Group 20 (0.83) - S2: Running for Oil
Group 21 (0.83) - F1: FaceJoin (reloaded)
Group 13 (1.13) - F5: Brand Popularity
Group 16 (1.35) - W1: Covid-19 on the Web
Group 05 (1.38) - RE3: Ukraine War
Group 09 (1.41) - F7: Image Transformation (reloaded)
Group 06 (1.63) - RE1: Heated Discussions
Group 24 (1.90) - RE2: Expert Finding
Group 15 (14.15) - CR1: Scientific Impact
Group 22 (18.72) - C6: Forests of Trees
Group 08 (42.25) - F6: Characteristic Faces (reloaded)
Group 03 (47.17) - S3: Suspicious Outage
Group 07 (47.25) - DL1: Build Popularity
Group 12 (49.36) - S4: Meetings at Sea
Group 19 (50.59) - C1: 3D Kadaster (reloaded)
Group 25 (51.04) - R1: Scientific Plagiarism
Group 01 (52.20) - F2: Deep Locations
Group 14 (52.81) - R2: Scientific Topic Trends
Group 10 (54.05) - W3: Database Systems Ranking
Group 17 (55.07) - C4: Illegal Buildings (reloaded)
Group 11 (55.37) - CR2: Scientific Communities
Group 04 (57.55) - C5: Detect wind turbines
Group 23 (timeout) - W2: Non-scientific Impact of DB research

For each topic you are asked to identify at least two related papers. These may be taken from the technical reading material of this course, which typically covers data processing systems. Often, however, you should also look at other scientific papers, e.g. papers on the data science task you are addressing.

Typically, your project will start with a rather massive ETL phase that transforms your raw input data into a data product, for instance a set of cleaned Parquet files that contain only relevant data. It is critical to define what this data product will be, using experimentation, literature reading, finding and implementing algorithms, and creative thinking. On top of this data product you will run some data analysis, and you will visualize the results of this analysis and/or the data product itself.
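As an illustration of such an ETL step, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the column names, the sample rows, the cleaning rules and the helper names `clean_rows` and `write_parquet` are hypothetical, and a real project would most likely run this logic at scale (for instance in Spark) rather than on an in-memory string. Writing Parquet here assumes the `pyarrow` package is installed.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw input: column names and rows are made up for illustration.
RAW_CSV = """user_id,timestamp,text,lang
17,2022-09-01T10:00:00,hello world,en
,2022-09-01T10:05:00,missing user,en
23,2022-09-01T10:07:00,hallo wereld,nl
42,not-a-date,bad timestamp,en
"""


def clean_rows(raw_text, lang="en"):
    """Drop invalid rows and keep only the columns the analysis needs."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Rule 1 (assumed): a row needs a user id and the requested language.
        if not row["user_id"] or row["lang"] != lang:
            continue
        # Rule 2 (assumed): the timestamp must be valid ISO 8601.
        try:
            datetime.fromisoformat(row["timestamp"])
        except ValueError:
            continue
        # Projection: the 'lang' column is no longer needed after filtering.
        cleaned.append({
            "user_id": int(row["user_id"]),
            "timestamp": row["timestamp"],
            "text": row["text"],
        })
    return cleaned


def write_parquet(rows, path):
    """Write the cleaned rows as one Parquet file (assumes pyarrow is installed)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(rows), path)


rows = clean_rows(RAW_CSV)
print(len(rows), "of 4 raw rows kept")
```

Calling `write_parquet(rows, "cleaned.parquet")` would then store the data product. Keeping the cleaning rules in a small, separately testable function makes it easier to report in the project how many rows each rule discards.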

You will be asked to write a project plan and review plans of others (assignment 2a). The final plan grade is group-wide, but the quality of the individual review can lead to lower or higher individual grades.

You will be asked to hold a final 5-minute group presentation (assignment 2b). This grade is group-wide, though individual non-participation in the presentation can lead to individual point reduction.

The final result of the project is code (tar.gz), a written project report, and a simple web-based visualization (assignment 2c), of which the report is the most important. In the report you must state what has been done by which project member. In the past this information has in certain cases led to different individual 2c grades, and this may happen again in 2022.

Project Plan and Review

The project plan (2-5 page PDF) should contain:

a short motivation and description of the project
an initial data investigation, showing an understanding of the data's size, formats, tools to read it, and data distributions
scientific papers and algorithms/techniques (or packages) relevant to your project
a description of the final goals: what is your data product, and how will it be visualized (in a static HTML website *only*)
a cloud compute budget showing planned use of hardware, compute and storage resources, plus an estimated calculation of project cost
a proposed timeline of your project
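The cost estimate in the cloud compute budget is simple arithmetic: node count times hours times hourly price for compute, and volume times duration times monthly price for storage. The sketch below shows one way to lay it out. All cluster sizes, durations and prices are hypothetical placeholders, not actual cloud prices; take the real numbers from your provider's price list.

```python
from dataclasses import dataclass


@dataclass
class ComputeItem:
    name: str
    nodes: int
    hours: float
    price_per_node_hour: float  # hypothetical price, not a real quote


@dataclass
class StorageItem:
    name: str
    gigabytes: float
    months: float
    price_per_gb_month: float  # hypothetical price, not a real quote


def compute_cost(item):
    """Cost of running a cluster: nodes x hours x price per node-hour."""
    return item.nodes * item.hours * item.price_per_node_hour


def storage_cost(item):
    """Cost of keeping data around: GB x months x price per GB-month."""
    return item.gigabytes * item.months * item.price_per_gb_month


def total_cost(compute_items, storage_items):
    """Sum of all planned compute and storage costs."""
    return (sum(compute_cost(c) for c in compute_items)
            + sum(storage_cost(s) for s in storage_items))


# Made-up plan: a large ETL run, a smaller analysis run, and stored output.
compute_plan = [
    ComputeItem("ETL", nodes=8, hours=10, price_per_node_hour=0.40),
    ComputeItem("analysis", nodes=4, hours=5, price_per_node_hour=0.40),
]
storage_plan = [
    StorageItem("data product", gigabytes=500, months=2, price_per_gb_month=0.023),
]

for c in compute_plan:
    print(f"{c.name}: {compute_cost(c):.2f}")
for s in storage_plan:
    print(f"{s.name}: {storage_cost(s):.2f}")
print(f"total: {total_cost(compute_plan, storage_plan):.2f}")
```

Listing each phase as its own item makes it easy to see which part of the plan dominates the cost and to revise the estimate after a small-scale trial run.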

The project plan will be written by the whole team. Each of you will also be asked to individually review one project plan of another group. Each group will thus receive three reviews and can use them to improve its project plan. The final, resubmitted project plan (date to be set, one week later) will be graded group-wide, but individual scores may still vary depending on the quality of the reviews written by individual members.

We will use FeedbackFruits for this assignment and its peer review process.

Presentation

For the final presentation (see the 'Lecture Schedule' page of the website for the schedule):

you must have submitted your presentation to Canvas (slides in PDF format)
all group members must be physically present and participate in the presentation (5 minutes) and answer questions (~3 minutes)

If you know you cannot make it, please communicate this to Peter Boncz days in advance.

Please read the description of what we expect of the presentations in the Canvas assignment corresponding to it. In all cases, slides should:

not contain too much info,
use a large font,
contain no more than 5 bullets per slide, and
use figures wherever possible.
Ah. One more thing: please practice your talk beforehand.

There will be a tight schedule, and we will mercilessly hard-stop when time is up and let the next group begin. Otherwise we will not be able to let all scheduled groups present.

Reporting

The final project report for assignment 2 is due at the end of October. The project report should be a paper of 5-12 pages, formatted in two columns like a scientific paper (latex-style, latex-example), with the following information and structure:

Introduction: what is this project about and why is it interesting?
Related Work: meaningfully summarize related literature to explain what this project builds on, regarding both the technical architecture and components and the data science question
Research Questions: which questions is this project trying to answer, and/or hypotheses to investigate? These should cover both the project topic, as well as the technological side (aptness of the tools for the job, scalability of the solution).
Project Setup: describe the input data and the output data product, plus the pipeline that you created to produce it. Show pertinent evaluation information, describing data characteristics, computation time, data distributions and relevant results. Describe and motivate the visualization.
Conclusions: revisit the research questions and hypotheses and try to answer them. Any new questions? Insights into the usability of the employed technology for particular tasks?
Bibliography

Apart from the report and your code, we also ask you to create a web-based visualization of your results. This visualization can be basic; due to the technical restrictions of this site (and to keep things simple), we request that you deliver it as static HTML/JavaScript web pages (no server-side code).

Though the visualization may be basic, the cooler it is, the better. We will keep these visualizations online, so they also serve as visible proof that you passed this course and mastered large-scale data engineering techniques.
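One possible way to meet the static-pages requirement is to have the analysis pipeline generate a single HTML file with the results embedded as JSON and a few lines of JavaScript to draw them. The sketch below is only an illustration under assumptions: the result records, the bar-chart layout and the function name `render_page` are hypothetical, and a real project might prefer a charting library.

```python
import html
import json

# Hypothetical analysis result: labels and counts are made up for illustration.
RESULTS = [
    {"label": "2019", "count": 120},
    {"label": "2020", "count": 340},
    {"label": "2021", "count": 275},
]

# Static page template; the JavaScript draws one horizontal bar per record.
TEMPLATE = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>__TITLE__</title></head>
<body>
<h1>__TITLE__</h1>
<div id="chart"></div>
<script>
const data = __DATA__;
const max = Math.max(...data.map(d => d.count));
const chart = document.getElementById("chart");
for (const d of data) {
  const bar = document.createElement("div");
  bar.textContent = d.label + ": " + d.count;
  bar.style.background = "steelblue";
  bar.style.color = "white";
  bar.style.margin = "2px 0";
  bar.style.width = (100 * d.count / max) + "%";
  chart.appendChild(bar);
}
</script>
</body>
</html>
"""


def render_page(results, title):
    """Return a self-contained HTML page with the results embedded as JSON."""
    # Escape '<' so that text inside the data cannot close the <script> tag.
    data_json = json.dumps(results).replace("<", "\\u003c")
    page = TEMPLATE.replace("__TITLE__", html.escape(title))
    return page.replace("__DATA__", data_json)


page = render_page(RESULTS, "Records per year")
print(len(page), "characters of HTML generated")
```

Saving `page` to a file such as `index.html` gives a page that opens directly in a browser and can be hosted without any server-side code.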

Peter Boncz - Large Scale Data Engineering