LSDE 2022 - Large Scale Data Engineering



Dataset. Through a Freedom of Information Law (FOIL) request, records of all taxi rides in New York City were made public. Since then, NYC has published all taxi and other ride data. In these projects, though, we focus on yellow cabs in the years from 2009 up to July 2016 (after which the geographical level of detail was decreased).

T5: Nightlife Trends. Find non-obvious nightlife spots in New York City by looking for places outside the popular nightlife areas that are frequently the destination or starting point of taxi trips in the late or very early hours. Correlate this with the day of the week.

Summary. When the data is in fact not very big and can be processed on a single machine, it is generally a good idea to keep the pipeline simple and run it on a single machine. This team observed this to be the case for the NYC taxi data, which they analyzed using a hand-crafted multi-step neighbour clustering algorithm to identify relatively isolated hotspots of nightly taxi pick-up and drop-off locations.
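To make the idea of a relatively isolated nightly hotspot concrete, here is a minimal sketch in plain Python. It is not the team's multi-step clustering algorithm, whose details are not given here. The night window (22:00 to 04:59), the grid cell size, the isolation rule (a busy cell whose eight neighbouring cells together are comparatively quiet), and all function names and sample trips are assumptions chosen for illustration.

```python
import math
from collections import Counter
from datetime import datetime

NIGHT_START_HOUR = 22   # assumed start of the night window
NIGHT_END_HOUR = 5      # assumed end of the night window (exclusive)
CELL_SIZE_DEG = 0.002   # assumed grid resolution in degrees


def is_night(ts):
    """True for timestamps inside the assumed late/early-hours window."""
    return ts.hour >= NIGHT_START_HOUR or ts.hour < NIGHT_END_HOUR


def to_cell(lon, lat):
    """Map a coordinate to an integer grid cell."""
    return (math.floor(lon / CELL_SIZE_DEG), math.floor(lat / CELL_SIZE_DEG))


def night_counts(trips):
    """Count night-time trip endpoints per (cell, weekday); Monday is 0."""
    counts = Counter()
    for ts, lon, lat in trips:
        if is_night(ts):
            counts[(to_cell(lon, lat), ts.weekday())] += 1
    return counts


def isolated_hotspots(cell_totals, min_count, max_neighbour_share):
    """Return busy cells whose 8 neighbours are comparatively quiet."""
    result = []
    for (x, y), n in cell_totals.items():
        if n < min_count:
            continue
        neighbours = sum(
            cell_totals.get((x + dx, y + dy), 0)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)
        )
        if neighbours <= max_neighbour_share * n:
            result.append((x, y))
    return sorted(result)


# Made-up sample endpoints: (timestamp, longitude, latitude).
trips = [
    (datetime(2015, 6, 6, 23, 30), -73.9857, 40.7484),
    (datetime(2015, 6, 6, 23, 45), -73.9857, 40.7484),
    (datetime(2015, 6, 6, 14, 0), -73.9857, 40.7484),   # daytime, ignored
    (datetime(2015, 6, 7, 2, 15), -73.9501, 40.6801),
]

counts = night_counts(trips)

# Collapse the weekday dimension to get one total per cell.
cell_totals = Counter()
for (cell, _weekday), n in counts.items():
    cell_totals[cell] += n

hotspots = isolated_hotspots(cell_totals, min_count=2, max_neighbour_share=0.5)
```

Keeping the weekday in the count key is what allows the day-of-week correlation asked for in the task. Note that this sketch attributes a trip at 02:15 to its calendar day; shifting early-morning trips to the previous evening would be a reasonable alternative.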

The project was run on a single machine with a relatively large amount of memory on the AWS cloud. A mix of Apache Spark and Pandas was used for processing the data. The taxi data was further integrated with additional data sources on New York nightlife establishments.
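One way such an integration could work is to look up which known establishments lie within a short distance of a detected hotspot. The sketch below is only an illustration in plain Python, not the team's Spark and Pandas pipeline. The venue names, coordinates, the 150 m radius, and the function names are all hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def venues_near(hotspot, venues, radius_m):
    """Names of venues within radius_m of a hotspot given as (lon, lat)."""
    lon, lat = hotspot
    return [
        name
        for name, vlon, vlat in venues
        if haversine_m(lon, lat, vlon, vlat) <= radius_m
    ]


# Hypothetical venue records: (name, longitude, latitude).
venues = [
    ("Venue A", -73.9860, 40.7486),
    ("Venue B", -73.9500, 40.6800),
]

hotspot = (-73.9857, 40.7484)  # made-up hotspot centre
nearby = venues_near(hotspot, venues, radius_m=150)
```

A hotspot for which this lookup returns an empty list would be a candidate for closer inspection, since no known establishment explains the night-time traffic there.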

Data curiosity: ****
Related work: ***
Technical difficulties mastered: ***
Visualization coolness: ****