LSDE: Large Scale Data Engineering 2021

Dataset. Data management is rapidly moving to the cloud, converting upfront costs into operational costs and lowering the barriers to entry. Cloud providers now offer SQL cloud services, including data warehousing. However, the performance, price/performance, and elasticity tradeoffs of these solutions are often unclear.

D1: Database Benchmark. We are interested in a performance comparison of the TPC-H benchmark at scale factor SF1000 or SF3000. Recently, a variant of the benchmark was introduced under the name JCC-H, which adds severe join skew. It would be interesting to test the performance of various SQL-on-Hadoop systems, as listed here, on these two benchmarks.

Summary. The project runs the TPC-H and JCC-H benchmarks at the 1TB size on both Pig and Spark SQL. To study performance in more detail, the authors developed a neat query visualizer that shows not only the query plan but also the distribution of performance metrics across the machines in the cluster involved in the computation. The paper could have gone a bit deeper in its discussion of query optimization under skew, but this is still a nicely executed project.

Data Curiosity: ***
Paper Writing: ***
Technical difficulties mastered: ****
Visualization coolness: ****


Big SQL Performance Visualization -- Tim van Elsloo and Thomas van der Ham (paper)