LSDE: Large Scale Data Engineering 2021
Lecture Block 6: Analytical Databases at Cluster Scale

PDF slides: Cloud Database Systems

Cloud computing has proven very attractive to organizations for their IT needs, including the data analytics needs that are the topic of the LSDE course. One reason for this popularity is, in economic terms, the change from CapEx to OpEx: instead of paying upfront for IT infrastructure (capital expenditure), you pay only for the IT operations you use, when you use them (operational expenditure). Another reason is elasticity: cloud IT infrastructure can be scaled up and down quickly, so your system is neither idle nor overloaded. Rather than having to buy IT for a peak load that almost never occurs (over-provisioning), you potentially pay much less on average, and still have the flexibility to scale up further than you ever thought necessary, should the need arise. A third reason is organizational: IT in large organizations is managed by a separate department, and an organizational unit that wants new or changed IT does not always get what it wants immediately. With the cloud, IT users can contract this IT directly, without having to go through a different department. This ties in to yet another advantage: IT personnel shortages, or the difficulty of finding IT skills. Generally speaking, one can expect that IT skills (e.g. networking, security) at cloud companies are at a higher level than smaller companies, whose business is not IT, can obtain on-premise.

Moving data systems to the cloud has become quite popular. This holds for all kinds of data systems: traditional transactional database workloads, analytical workloads (data marts and even data warehouses), and Big Data style data lakes.

Technically speaking, the biggest difference for data systems in the cloud is that storage and compute are de-coupled from each other. That is, in the cloud, data typically resides in some remote blob store (e.g., S3). Any virtual machine that you power up is initially empty -- it may have a local disk, but this disk is not persistent and is empty after a machine restart (ephemeral storage). This is a big change for database systems in general, but even for those designed for e.g. Hadoop: in Hadoop, data is stored in HDFS, and the Hadoop scheduler (YARN) is aware of HDFS and schedules (map) tasks with locality, i.e. where the data is. This locality is lost in the cloud; it is the price for an elastic system.
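The consequence of this decoupling is that a cloud query engine cannot count on local data; instead it fetches only the byte ranges it needs from the blob store (S3 supports this via HTTP Range requests). A minimal sketch of this idea, with a toy in-memory "blob store" standing in for S3 (all names here are illustrative, not any real API):

```python
# Toy blob store: object name -> immutable bytes (stands in for S3).
blob_store = {}

def put_object(key: str, data: bytes):
    blob_store[key] = data

def get_range(key: str, start: int, length: int) -> bytes:
    # Corresponds to an HTTP GET with a Range header against a blob store.
    return blob_store[key][start:start + length]

# Write a tiny columnar "file": two fixed-width columns back to back,
# plus a footer recording where each column lives (offset, length).
col_a = b"AAAAAAAA"        # 8 bytes
col_b = b"BBBBBBBBBBBB"    # 12 bytes
footer = {"a": (0, 8), "b": (8, 12)}
put_object("table.part0", col_a + col_b)

# A query that touches only column "b" fetches just that byte range,
# instead of shipping the whole object to the compute node:
off, ln = footer["b"]
data = get_range("table.part0", off, ln)
print(data)  # b'BBBBBBBBBBBB'
```

The same pattern (column offsets in a footer, range reads per column) is what makes formats like Parquet efficient over remote storage.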

In this lecture, we look at some of the important analytical cloud database systems: Redshift, Snowflake and BigQuery. AWS Redshift is a port of the shared-nothing (on-premise) high-performance parallel data warehousing system ParAccel. It is a pretty fast system, using compressed columnar storage (in its own proprietary data format) and JIT query compilation (to C++). Amazon made ParAccel almost "no knobs" and easily controllable as a service, and rebranded it as Redshift. However, the ParAccel assumption that there is local data storage still makes Redshift a less elastic database service than it could be. AWS added the Spectrum connector to Redshift to do part of the query processing (fetching and filtering data from S3) in a more elastic way.
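To get a feel for what "compressed columnar storage" buys, here is a hedged sketch of run-length encoding (RLE), one of the classic column compression schemes used in columnar engines such as Redshift; the function names are illustrative, not Redshift's actual internals:

```python
def rle_encode(column):
    """Compress a sorted/low-cardinality column into (value, run_length) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((v, 1))               # start a new run
    return runs

def rle_decode(runs):
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

col = ["NL", "NL", "NL", "DE", "DE", "FR"]
runs = rle_encode(col)
print(runs)  # [('NL', 3), ('DE', 2), ('FR', 1)]
assert rle_decode(runs) == col
```

On sorted columns, such runs can be very long, so a scan reads far fewer bytes, and a filter can even be evaluated once per run instead of once per value.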

A cloud-native and more recent system is Snowflake. It stores data in S3 and allows users to start multiple virtual data warehouses that share this data. Snowflake uses the local disks of the compute nodes for data caching. One popular feature of Snowflake is that it can store and query large amounts of JSON data efficiently. JSON is a rather verbose format, but Snowflake stores the regular parts of JSON files under the hood in a compressed column store. The other remarkable feature is data sharing. Snowflake customers can publish data and allow other Snowflake customers to execute queries on it, for a price; as such it can become a kind of app store for data and queries (Snowflake takes a cut of the query fees).
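The idea of pulling the "regular parts" of JSON into columns is often called shredding. The toy sketch below illustrates the principle (it is not Snowflake's actual algorithm): scalar paths that appear in every document become typed columns, while the leftover irregular fields stay in a raw "variant" column.

```python
import json

docs = [
    '{"id": 1, "user": {"name": "ada"},   "tags": ["a"]}',
    '{"id": 2, "user": {"name": "bob"},   "tags": []}',
    '{"id": 3, "user": {"name": "carol"}}',
]

def shred(docs):
    """Pull regular scalar paths (id, user.name) into columns;
    keep whatever remains per document as a raw 'variant' column."""
    cols = {"id": [], "user.name": [], "variant": []}
    for d in docs:
        obj = json.loads(d)
        cols["id"].append(obj.pop("id", None))
        cols["user.name"].append(obj.get("user", {}).get("name"))
        obj.pop("user", None)
        cols["variant"].append(obj)   # irregular leftovers, e.g. "tags"
    return cols

cols = shred(docs)
print(cols["user.name"])  # ['ada', 'bob', 'carol']
```

The extracted columns can then be compressed and scanned like any relational column, which is why querying `user.name` over billions of JSON documents need not touch the verbose raw text at all.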

A third system of interest is Google BigQuery. This is the cloud version of Google's (initially internal) Dremel system for SQL analytics. What is remarkable is that BigQuery is serverless: while in Snowflake and Redshift users have to configure and start up database clusters (i.e. groups of VMs that can be scaled up or down), there is no such notion in BigQuery. Google just executes your queries on machines it runs itself (another example of a serverless system is AWS Athena, where Amazon runs the Presto database system, originally designed for Hadoop, on your behalf). In these serverless systems the abstraction level is raised further, and IT users no longer even know what IT is running for them. They just pay per query; typically per GB of data accessed.
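Pay-per-query billing makes query cost a simple function of bytes scanned. A back-of-the-envelope sketch, where the $5-per-TB rate is an assumption for illustration (it matches BigQuery's long-advertised on-demand price, but check current pricing before relying on it):

```python
PRICE_PER_TB = 5.00  # USD per TB scanned -- assumed rate, for illustration only

def query_cost(bytes_scanned: int) -> float:
    """Cost of one on-demand query, billed purely by data scanned."""
    tb = bytes_scanned / 10**12
    return tb * PRICE_PER_TB

# A query that scans a 200 GB column costs about a dollar:
print(round(query_cost(200 * 10**9), 2))
```

Note the incentive this creates: because you pay per byte scanned, columnar layouts and partition pruning (scanning only the columns and partitions a query needs) directly reduce your bill, not just your latency.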

Technical Literature

For technical background material, there are some papers:

  • Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. The market leader in this space is Amazon. Classical transactional database systems (think MySQL, Oracle, etc.) have been moved to the cloud very successfully by Amazon via its Aurora platform, where data storage and logging are now done using cloud storage, and all kinds of data partitioning and elastic scaling capabilities have been created around these older systems.
  • Amazon Redshift and the Case for Simpler Data Warehouses. However, analytical database systems (i.e. those used for data lakes, data marts or data warehouses) require different designs than transactional database systems. Among cloud data analysis systems, the first mover was Redshift, which is based on the columnar parallel database system ParAccel.
  • The Snowflake Elastic Data Warehouse. Unlike Redshift, with its roots in the on-premise parallel database system ParAccel, Snowflake is a recent data warehousing system developed specifically for cloud storage and cloud elasticity. The system is truly elastic and makes clever use of local (ephemeral) storage in cloud VMs for caching. As Snowflake is becoming a very popular platform, creating a data economy among its users by allowing them to sell and query data is one of its interesting growth strategies.
  • Choosing A Cloud DBMS: Architectures and Tradeoffs. A scientific paper that compares various cloud database offerings, and connects various dots in the LSDE course.
  • Understanding and Benchmarking the Impact of GDPR on Database Systems. Moving away from the technical aspects of the cloud, it is good to pay attention to the legal and ethical drawbacks of cloud data storage. In our EU context, a big change has been the adoption of the GDPR, with which data operators considering the use of third-party cloud services must comply. While this paper is still technical (eventually focusing on benchmarking the performance effects of GDPR compliance), it does a good job of summarizing the consequences of the GDPR for the operation of databases.
  • Cloud Programming Simplified: A Berkeley View on Serverless Computing. This last one is a long vision document. Most cloud data services still require you to start (groups of) VMs to run your data systems. Google BigQuery and AWS Athena are two examples of serverless systems where this is not required: the service provider runs the machines, and you only pay per query. Serverless computing thus further raises the IT abstraction level and is being actively pursued by cloud providers.

In the video/presentation section we provide material on database systems in the cloud ("platform as a service"):

Aurora runs MySQL and PostgreSQL in a scalable manner in the AWS cloud. The core idea is that MySQL and PostgreSQL on the compute node only do logging, and do not write any dirty pages to storage. Rather, a decoupled fleet of log-processing servers performs continuous database recovery from the log, reconstructing the latest data pages -- replicated 6-fold. This decoupling reduces the need for synchronous writes over the network, makes Aurora very scalable, and even lets it outperform normal, locally stored MySQL and PostgreSQL systems in terms of speed.
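The 6-fold replication works as a quorum, per the Aurora paper: each log record goes to 6 storage replicas (2 per availability zone), a write is acknowledged once 4 of 6 have it, and reads need 3 of 6, so any read quorum overlaps any write quorum. A toy sketch of this write path (illustrative code, not Aurora's):

```python
V, W, R = 6, 4, 3   # Aurora's parameters: 6 copies, write quorum 4, read quorum 3
assert W + R > V    # every read quorum intersects every write quorum
assert 2 * W > V    # every two write quorums intersect (no conflicting commits)

class Replica:
    """A storage replica that may be temporarily unavailable."""
    def __init__(self):
        self.log = []
        self.up = True
    def store(self, rec):
        if self.up:
            self.log.append(rec)
            return True
        return False

def write_record(replicas, rec):
    """A log write succeeds once W replicas acknowledge it."""
    acks = sum(1 for r in replicas if r.store(rec))
    return acks >= W

replicas = [Replica() for _ in range(V)]
replicas[0].up = False   # one replica down: write still succeeds
replicas[1].up = False   # even a second failure is tolerated (4 acks remain)
print(write_record(replicas, "log-entry-1"))  # True
```

This is why Aurora can lose an entire availability zone (2 replicas) plus one more node without losing write availability, while the compute node never waits for all 6 copies.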

Snowflake, as presented by co-founder Marcin Zukowski (also co-founder of VectorWise), a CWI PhD graduate:

From Google's side, there is Google BigQuery, which also runs on a parallel column store, in this case Google's own Dremel system:

Exploring BigData with Google BigQuery from Dharmesh Vaya

A final contestant in this fast-growing market is Microsoft Azure Synapse Analytics, which used to be called SQL Data Warehouse. This was Microsoft's high-end parallel cluster version of SQL Server. It is now combined with Spark and other machine-learning tools, which also allows the parallel SQL Server to fire off parallel Spark tasks and to communicate results between the two efficiently (the scientific papers on this hybrid system refer to it by the name PolyBase).