LSDE: Large Scale Data Engineering 2016
Data Analysis Projects: Your Task
Assignment

Each group chooses one of the topics below on a first-come, first-served (FCFS) basis, in order of the final leaderboard ranking of practical1. The outcome is as follows, in ranking order of Assignment1:

  1. Group 09 (0.03) - P4: Rescue Choppers.
  2. Group 02 (0.03) - W2: Election Prediction.
  3. Group 01 (0.03) - P2: Flight Routes.
  4. Group 11 (0.04) - P6: Cause for Emergency.
  5. Group 08 (0.1) - N3: Urbanisation vs Climate Change.
  6. Group 04 (10.09) - S2: Running for Oil.
  7. Group 05 (91.76) - F1: FaceJoin.
  8. Group 03 (157.68) - P3: Flight Visualization.
  9. Group 10 (160.01) - S3: Suspicious Outage.
  10. Group 13 (166.41) - S1: Shipping Safety.
  11. Group 14 (169.15) - P5: Private Jet.
  12. Group 06 (190.41) - D1: Database Benchmark.

For each topic you are asked to identify two related papers. These papers may be taken from the technical reading material in this course; those are typically related to particular systems that work in Hadoop. You could also choose different scientific papers, e.g. papers relating to the data science task you are addressing. In any case, please announce (by email to lsde_course@outlook.com) before your presentation which related papers you have chosen. Given that the presentations are on October 10 and 12, please send this email before Friday October 7.

You will be asked to give a presentation on your project in progress. The presentation should cover both how you are addressing your project (and possible early findings) and the technologies used. You may also summarize relevant points from the two related scientific papers you chose.

The final result of the project is code (tar.gz), a written project report, and a simple web-based visualization.

Presentation

Please send your presentation in advance to lsde_course@outlook.com -- one full day in advance! There are two reasons for this requirement:

  • In order not to lose time in switching the projector, you must present from our laptop, hence we must have it in advance.
  • If you provide the slides of your presentation a day in advance, we will give feedback and the opportunity to fine-tune.

On the day you present (see website page 'Lecture Schedule' for the schedule):

  • we must have received (the last version of) your presentation at least 30 minutes before the start of the class by email
  • both group members must be physically present at the scheduled time (start of class + 15min * #preceding_groups) in order to get a grade > 1.
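
The scheduled time in the last bullet is simple arithmetic: the start of class plus 15 minutes for every group that presents before you. A minimal sketch of that calculation follows; the function name and the 13:30 class start are hypothetical, so check the 'Lecture Schedule' page for the real times.

```python
from datetime import datetime, timedelta

SLOT_MINUTES = 15  # slot length per group, taken from the formula above


def presentation_start(class_start: datetime, preceding_groups: int) -> datetime:
    """Return class start + 15 min * number of preceding groups."""
    if preceding_groups < 0:
        raise ValueError("preceding_groups must be non-negative")
    return class_start + timedelta(minutes=SLOT_MINUTES * preceding_groups)


# Hypothetical example: class starts at 13:30 and three groups present before you.
class_start = datetime(2016, 10, 10, 13, 30)
print(presentation_start(class_start, 3).strftime("%H:%M"))  # 14:15
```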

If you know you cannot make it, please arrange a swap with a group that presents on the other day. If you cannot make it on both occasions, contact us well in advance.

In the presentation, we would like you to explain to us and the other students how you are solving your Assignment2 data analysis problem and why.

A suggested setup of the presentation would be as follows:

  • what is the task in assignment2: question(s) of interest and characteristics of the data you work with. Please show us results of basic and early explorations on the dataset that you (should) have done.
  • what kind of analysis do you want to do (algorithms, data structures). Explain relevant previous approaches to this, possibly from a related paper.
  • what (cluster) tools did you choose to use, and why? Explain relevant features of the tool you chose to use, possibly from a related paper.
  • current state of the project, your results so far (insert cool visualizations here)
  • conclusion

Remember that all of this needs to fit in 12 minutes, to allow for some questions. Most people would prepare 5-8 slides. Of course, it depends on what is on the slides and how much you tell per slide. But, in all cases, slides should:

  • not contain too much info,
  • use a large font,
  • contain no more than 5 bullets per slide, and
  • use figures wherever possible.

Ah. One more thing: please practice your talk beforehand.

There will be a tight schedule, and we will mercilessly hard-stop you after 15 minutes and let the next group begin. Otherwise we will not be able to let all scheduled groups present.

Reporting

The final project report for assignment 2 is due October 23. The project report should be a paper of length 5-12 pages formatted two-column like a scientific paper (latex-style, latex-example, word-example) which contains the following information & structure:

  • Introduction: what is this project about and why is it interesting?
  • Related Work: meaningfully summarize related literature to explain what this project builds on. In particular, discuss the two research papers chosen for this project (and discussed in your presentation).
  • Research Questions: which questions is this project trying to answer, and/or hypotheses to investigate? These should cover both the project topic, as well as the technological side (aptness of the tools for the job, scalability of the solution).
  • Project Setup: the steps taken during the project.
  • Experiments: a description of the experiments and their results.
  • Conclusions: revisit the research questions and hypotheses and try to answer them. Any new questions? Insights into the usability of the employed technology for particular tasks?
  • Bibliography

If using LaTeX for formatting your paper, please use BibTeX for the bibliography.

Apart from the report and your code, we also ask you to create a web-based visualization of your results. This visualization can be basic, and due to the technical restrictions of this site (and to keep things simple) we ask you to do this in the form of server-static HTML/JavaScript web pages.
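
One possible way to meet the server-static requirement is to inline your (small, aggregated) results into the page when you build it, so the browser needs nothing but the HTML file. The sketch below is only an illustration under that assumption: the function name, the `label`/`value` fields, and the sample rows are hypothetical, and a real visualization would draw a chart rather than a plain list.

```python
import html
import json


def render_static_page(title: str, rows: list[dict]) -> str:
    """Build a self-contained HTML page with the data embedded as JSON."""
    # Escape "</" so the JSON cannot terminate the <script> element early.
    payload = json.dumps(rows).replace("</", "<\\/")
    return f"""<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>{html.escape(title)}</title></head>
<body>
<h1>{html.escape(title)}</h1>
<ul id="results"></ul>
<script>
  // Data is inlined at build time; no server-side code runs on page load.
  const rows = {payload};
  const list = document.getElementById("results");
  for (const row of rows) {{
    const item = document.createElement("li");
    item.textContent = row.label + ": " + row.value;
    list.appendChild(item);
  }}
</script>
</body>
</html>
"""


if __name__ == "__main__":
    # Hypothetical sample rows; replace with the output of your own analysis.
    sample = [{"label": "example-a", "value": 1}, {"label": "example-b", "value": 2}]
    with open("index.html", "w", encoding="utf-8") as out:
        out.write(render_static_page("Project results", sample))
```

Running the script writes an `index.html` that can be opened directly in a browser or copied to any static web host.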

Though the visualization can be basic, the cooler it is, the better. We will keep these online, so this is also visible proof that you passed this course and mastered large-scale data engineering techniques.