LSDE 2017 - Large Scale Data Engineering


LSDE: Large Scale Data Engineering 2017

Dataset. Spark SQL 2.x added Just-In-Time query compilation, together with a number of other improvements. Other popular Big SQL systems are Hive and Presto; however, the question is how mature these relatively young systems are.

D1: Database Benchmark. We would be interested in a performance comparison on the TPC-H benchmark at scale factor SF1000 or SF3000 (roughly 1TB or 3TB of raw data). Recently, an alternative version of the benchmark was introduced under the name JCC-H, which adds severe join skew. It would be interesting to test the performance of various SQL-on-Hadoop systems, as listed here, on these two benchmarks.
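To make concrete why join skew matters for a distributed SQL engine, here is a minimal sketch of hash-partitioning join keys across workers. It does not reproduce the JCC-H data generator: the row counts, the worker count, and the `hot_fraction` parameter are assumptions chosen for illustration only, and all function names are hypothetical.

```python
import random


def partition_loads(keys, num_workers):
    """Count the rows each worker receives when rows are hash-partitioned on the join key."""
    loads = [0] * num_workers
    for key in keys:
        # Keys are integers here, so a modulo stands in for the hash function.
        loads[key % num_workers] += 1
    return loads


def imbalance(loads):
    """Ratio of the busiest worker's load to the mean load; 1.0 is perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))


def uniform_keys(n_rows, n_keys, rng):
    """Join keys drawn uniformly, as a stand-in for an unskewed foreign-key column."""
    return [rng.randrange(n_keys) for _ in range(n_rows)]


def skewed_keys(n_rows, n_keys, hot_fraction, rng):
    """Join keys where a fraction `hot_fraction` of all rows references the single key 0 (assumed skew model)."""
    return [0 if rng.random() < hot_fraction else rng.randrange(n_keys)
            for _ in range(n_rows)]


rng = random.Random(42)  # fixed seed so the run is repeatable
NUM_WORKERS = 8          # assumed cluster size
N_ROWS = 100_000
N_KEYS = 10_000

uniform_loads = partition_loads(uniform_keys(N_ROWS, N_KEYS, rng), NUM_WORKERS)
skewed_loads = partition_loads(skewed_keys(N_ROWS, N_KEYS, 0.5, rng), NUM_WORKERS)

print("uniform imbalance:", round(imbalance(uniform_loads), 2))
print("skewed imbalance: ", round(imbalance(skewed_loads), 2))
```

Under this assumed skew model, the worker that owns the hot key receives several times the average load, so a hash-partitioned join finishes only when that one straggler does. This is the kind of effect a skewed benchmark is meant to expose.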

Summary. The project executes the TPC-H and JCC-H benchmarks at the 1TB scale on both Pig and Spark SQL. To better study the performance, a neat query visualizer was developed that shows not only the query plan, but also the distribution of performance metrics across the machines in the cluster involved in the computation. The paper could have gone a bit deeper in its discussion of query optimization under skew, but this is still a nicely executed project.

Data Curiosity: ***
Paper Writing: ***
Technical difficulties mastered: ****
Visualization coolness: ****