LSDE: Large Scale Data Engineering 2021

Dataset. Through the US Freedom of Information Law, records of all taxi rides in New York City were made public. Since then, NYC has made all taxi and other ride data public. In these projects, though, we focus on yellow cabs in the years from 2009 until July 2016 (after which the geographical level of detail was decreased).

T1: Taxi Business. Find lucrative pickup places for taxi drivers. Also find commercially successful drivers, who earn much more than others per hour driven (corrected for time-of-day). Visualize the results appropriately.
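One possible reading of "corrected for time-of-day" is sketched below in plain Python: each driver's actual earnings are divided by what an average driver would have earned while driving during the same hours of the day. The record layout, the field names, and the toy numbers are assumptions made purely for illustration; they are not taken from the dataset or from the project's implementation. The sketch also counts only time with a passenger on board and assumes positive ride durations, whereas a real analysis would have to account for cruising time as well.

```python
from collections import defaultdict

# Hypothetical ride records: (driver_id, pickup_hour, hours_driven, earnings_usd).
# These values are invented for illustration only.
rides = [
    ("A", 8, 0.5, 20.0),
    ("A", 18, 0.5, 30.0),
    ("B", 8, 0.5, 10.0),
    ("B", 18, 0.5, 20.0),
]


def hourly_baseline(rides):
    """Average earnings per hour driven, over all drivers, for each pickup hour."""
    earned = defaultdict(float)
    driven = defaultdict(float)
    for _, hour, duration, earnings in rides:
        earned[hour] += earnings
        driven[hour] += duration
    return {hour: earned[hour] / driven[hour] for hour in earned}


def corrected_score(rides):
    """Actual earnings divided by the earnings an average driver would have
    made in the same hours of the day. A score above 1 means the driver
    outperformed the time-of-day baseline."""
    baseline = hourly_baseline(rides)
    actual = defaultdict(float)
    expected = defaultdict(float)
    for driver, hour, duration, earnings in rides:
        actual[driver] += earnings
        expected[driver] += baseline[hour] * duration
    return {driver: actual[driver] / expected[driver] for driver in actual}


scores = corrected_score(rides)
```

With the toy data, both drivers work the same hours, so the difference in their scores reflects earnings alone rather than a favourable shift; a driver who only works lucrative evening hours would not score above 1 merely for that reason.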

Summary. An investigation of the well-known NYC taxi dataset, whose main scientific question is which taxi driver strategies are lucrative. The technical approach leveraged efficient partitioned Delta tables (i.e. Parquet-stored tables in Databricks) and Spark SQL. The main conclusions are that commercial success depends on quickly finding new customers, which in turn means a preference for cruising in busy, central areas and picking up many short rides. Short rides increase the relative importance of tips and leave the cab in a still-busy area, so the time to the next customer is short. This implies a deliberate choice not to target the airports, as they are far from the center and rides there are longer and fewer. Furthermore, a successful taxi driver should always be out when the weather is good, and preferably not skip work on Thursday to Saturday evenings.

Data curiosity: ****
Writing: ****
Technical difficulties mastered: ***
Visualization coolness: ****


Taxi Business -- Kalle Jansen, Futong Han and Kai Zhang (paper)