Each group chooses one of the topics on the projects page, on a first-come, first-served (FCFS) basis, ordered by the final leaderboard ranking of practical 1c.
For each topic you are asked to identify at least two related papers. These papers may be taken from the technical reading material in this course, which is typically related to data processing systems. Often, however, you should also look at other scientific papers, e.g. papers relating to the data science task you are addressing.
Typically, your project will start with a rather massive ETL phase that transforms your raw input data into a data product, for instance a set of cleaned Parquet files that contain only relevant data. It is critical to define, through experimentation, literature reading, algorithm finding/implementation and creative thinking, what this data product will be. On top of this data product, you will run some data analysis, and you will visualize the results of this analysis and/or your data product.
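As an illustration of the ETL step described above, here is a minimal sketch of cleaning raw records and writing them out as Parquet. The column names (`station`, `date`, `temperature`), the sample data and the helper names `clean_rows` and `write_parquet` are all hypothetical; your real schema and cleaning rules depend on your topic. The cleaning part uses only the Python standard library, while the Parquet writer assumes the third-party `pyarrow` package is installed. At scale you would likely express the same logic in a distributed engine rather than in plain Python.

```python
import csv
import io


def clean_rows(raw_csv_text):
    """Keep only the relevant columns and drop rows with missing or invalid values."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_csv_text)):
        try:
            temperature = float(row["temperature"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a parseable temperature
        station = (row.get("station") or "").strip()
        date = (row.get("date") or "").strip()
        if not station or not date:
            continue  # skip rows that lack a station or a date
        # Irrelevant columns (here: 'comment') are simply not copied over.
        cleaned.append({"station": station, "date": date, "temperature": temperature})
    return cleaned


def write_parquet(rows, path):
    """Write cleaned rows to a Parquet file (assumes pyarrow is installed)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(rows), path)


# Hypothetical raw input with one valid-looking schema and some dirty rows.
RAW = """station,date,temperature,comment
AMS,2021-10-01,12.5,ok
AMS,2021-10-02,,sensor down
,2021-10-03,11.0,no station
RTM,2021-10-01,13.1,ok
"""

rows = clean_rows(RAW)
# write_parquet(rows, "cleaned.parquet")  # uncomment if pyarrow is available
```

Keeping the cleaning logic separate from the output format makes it easier to test the rules on small samples before running them on the full input.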
You will be asked to write a project plan and review plans of others (assignment 2a). The final plan grade is group-wide, but the quality of the individual review can lead to lower or higher individual grades.
You will be asked to hold a final 5-minute group presentation (assignment 2b). This grade is group-wide, though individual non-participation in the presentation can lead to individual point reduction.
The final result of the project consists of code (tar.gz), a written project report, and a simple web-based visualization (assignment 2c), of which the report is the most important. In the report you must state what has been done by which project member. In the past this information has, in certain cases, led to different individual 2c grades, and this may happen again in 2021.
The project plan (2-5 page PDF) should contain:
The project plan will be written by the whole team. Each of you will also be asked to individually review one project plan of another group. Each group will thus receive three reviews and can then use these to improve its project plan. The final, resubmitted project plan (date 2b set, to one week later) will be graded group-wide, but individual scores may still vary due to the quality of the reviews written by individual members.
We will use FeedbackFruits for this assignment and its peer review process.
For the final presentation (see the website page 'Lecture Schedule' for the schedule):
If you know you cannot make it, please communicate this to Peter Boncz days in advance.
Please read the description of what we expect of the presentations in the Canvas assignment corresponding to it. In all cases, slides should:
There will be a tight schedule, and we will mercilessly hard-stop when time is up and let the next group begin. Otherwise we will not be able to let all scheduled groups present.
The final project report for assignment 2 is due at the end of October (Extended to Nov 5). The project report should be a paper of length 5-12 pages formatted two-column like a scientific paper (latex-style, latex-example) which contains the following information & structure:
Apart from the report and your code, we also ask you to create a web-based visualization of your results. This visualization can be basic, and due to the technical restrictions of this site (and to keep things simple) we request you to do this in the form of server-static HTML/JavaScript web pages.
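One way to fit the server-static constraint is to precompute all results offline and export them as a JSON file that a static HTML/JavaScript page loads. The sketch below assumes hypothetical records with `category` and `value` fields and a hypothetical output file name; it is only one possible approach, not a required format.

```python
import json
from collections import defaultdict


def summarize(rows):
    """Compute the mean of the hypothetical 'value' field per 'category'."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [sum, count]
    for row in rows:
        entry = totals[row["category"]]
        entry[0] += row["value"]
        entry[1] += 1
    return [
        {"category": category, "mean": total / count}
        for category, (total, count) in sorted(totals.items())
    ]


def export_json(summary, path):
    """Write the summary to a JSON file that a static web page can fetch."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(summary, f, indent=2)


# Hypothetical analysis output.
rows = [
    {"category": "a", "value": 1.0},
    {"category": "a", "value": 3.0},
    {"category": "b", "value": 5.0},
]

summary = summarize(rows)
# export_json(summary, "results.json")  # hypothetical file name
```

Because the page only reads a precomputed file, no server-side code is needed once the analysis has run.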
Though the visualization can be basic, the cooler it is, the better. We will keep these online, so this is also visible proof that you have passed this course and mastered large-scale data engineering techniques.