PDF slides: SQL on Big Data
Database systems we focus here on are for analysis; what traditionally is called OLAP (On Line Analytical Processing), in SQL terms. In OLAP workloads, users run a few complex read-only queries that search through huge tables. These queries may take a long time. OLAP can be contrasted with OLTP (On Line Transactional Processing), which is about sustaining a high-throughput of single-record read and update queries, e.g. many thousands per second. OLTP systems also have their Big Data equivalents in the form of NoSQL systems, such as HBase and Cassandra. However, for data science, where we analyze data, we focus on analytical database systems, so noSQL systems are out of scope for this module.
Modern analytical database systems have undergone a transformation in the past decade. The idea that you use the same technology for OLAP and OLTP is gone. Analytical database systems have moved from row-wise to columnar storage and are called "column stores". Column stores typically use data compression; data is compressed per column and this is more effective since the value distribution is more regular than if values from different columns are mixed (as happens in row-stores). Besides general-purpose compression (ZIP, bzip, etc) we discussed a number of database compression schemes that exploit knowledge of the table column datatypes and distributions, among others RLE (Run-Length Encoding), BitMap storage and Differential Encoding and Dictionary Encoding. Compression helps improve performance because the data becomes smaller and therefore less memory, disk and network traffic is needed when executing queries, often improving performance. Sometimes it is even possible to answer queries directly on the compressed data, without decompressing it; in such cases compression is no longer a trade-off between less data movement for more CPU computation (decompression) but a win-win (less data movement and less computation).
Other storage tricks are "zone-maps" which keep simple statistics (Min,Max) for large tuple ranges, which allow to avoid reading (skipping) a zone in the table if the WHERE condition ask for values that outside the [Min,Max] range of a zone. Parallel database systems that run on multiple machines often let the database user (data architect) decide on table distribution (which machine gets which rows?), and partitioning to split the table on each machine further in smaller partitions. Distribution (DISTRIBUTE BY) is typically done by "hashing" on a key column (a hash function is a random but deterministic function) to ensure that data gets spread evenly. Partitioning (PARTITION BY) is often done on a time-related column. One goal of partitioning is to speed up queries by skipping partitions that do not match a WHERE condition. A second goal is data lifecycle management, so one can keep the last X days of data in X partitions, by each day dropping the oldest partition (containing the data of X days ago) and starting a new empty partition in which new data gets inserted. Distribution and partitioning are typically applied both, independently.
Modern analytical database systems also overhaul the SQL query engine, using either vectorization or just-in-time compilation (into fast machine code) to use much less CPU resources per processed tuple than OLTP row-stores. Finally, high-end analytical database systems are parallel systems that run on a cluster and try to involve all cores of all nodes in queries, to make the system faster and more scalable.
Modern big data formats such as ORC and Parquet have adopted columnar data storage: even though they store all columns together in a single file (to make sure all columns are on the same machine in Hadoop on HDFS), inside that file the data is placed column after column in huge chunks of rows (rowgroups). Applications that only read a few columns can thus skip over the unused columns (resulting in less I/O). These new data formats also apply columnar compression and sometimes also zone-maps (i.e. MinMax stats). So, the more advanced SQL-on-Hadoop systems try to adopt all modern ideas on analytical database systems (e.g. Hive uses vectorization, Impala and Spark SQL use JIT).
For technical background material, there are some papers:
Lecture by Andy Pavlo (CMU) covering quite a bit of this material, with slightly broader scope in explaining query execution in database systems.
Here is a 2020 Spark Summit talk, where Reynold Xin announces Spark switching from their JIT engine to a new vectorized Delta Engine (Photon):
Now that I am self-indulging, another one: from long ago, a Google Tech Talk by yours truly on columnar compression and vectorized execution: