PDF slides: Cloud Database Systems
Cloud computing has proven to be very attractive to organizations for their IT needs, including data analytics needs such as those covered in the LSDE course. One reason for this popularity is, in economic terms, the change from CapEx to OpEx: instead of paying upfront for IT infrastructure (capital expenditure), you pay only for the IT operations you actually use, when you use them (operational expenditure). Another reason is elasticity: cloud IT infrastructure can be scaled up and down quickly, so your system is neither idle nor overloaded. Rather than having to buy IT for a peak load that almost never occurs (over-provisioning), you potentially pay much less on average, and still retain the flexibility to scale up even further than you ever thought necessary, should the need arise. A third reason is organizational: IT in large organizations is managed by a separate department, and an organizational unit that wants new or changed IT does not always get what it wants immediately. With the cloud, IT users can contract this IT directly, without having to go through a different department. This ties in with yet another advantage: IT personnel shortages, or the difficulty of finding IT skills. Generally speaking, one can expect the IT skills (e.g., networking, security) at cloud companies to be at a higher level than smaller companies, whose business is not IT, can obtain on-premise.
Moving data systems to the cloud has become quite popular. This holds for all kinds of data systems: traditional transactional database workloads, analytical workloads (data marts and even data warehouses), and also Big-Data-style data lakes.
Technically speaking, in terms of data systems, the biggest difference in the cloud is that storage and compute are de-coupled from each other. That is, in the cloud, data typically resides in some remote blob store (e.g., S3). Any virtual machine that you power up is initially empty: it may have a local disk, but this disk is not persistent and is empty again after a machine restart (ephemeral storage). This is a big change for database systems in general, but even more so for those designed for Hadoop. In Hadoop, data is stored in HDFS, and the Hadoop scheduler (YARN) is aware of HDFS and schedules (map) tasks with locality, i.e., where the data is. This locality is lost in the cloud; it is the price paid for an elastic system.
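To make the de-coupling concrete, here is a minimal sketch using the boto3 AWS SDK (the bucket and object names are hypothetical) of how a compute node in the cloud obtains data: the VM holds no persistent data of its own, so anything it needs must be fetched from the blob store and at best cached on its ephemeral disk.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical names; a real deployment would use its own bucket layout.
    BUCKET = "my-data-lake"
    KEY = "warehouse/orders/part-0001.parquet"

    # Download the object onto the VM's local (ephemeral) disk. This copy is
    # only a cache: after an instance restart it is gone, and the
    # authoritative data still lives in S3.
    s3.download_file(BUCKET, KEY, "/tmp/part-0001.parquet")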
In this lecture, we look at some of the important analytical cloud database systems: Redshift, Snowflake and BigQuery. AWS Redshift is a port of the shared-nothing (on-premise) high-performance parallel data warehousing system ParAccel. It is a pretty fast system, using compressed columnar storage (in its own proprietary data format) and JIT query compilation (to C++). Amazon made ParAccel almost "no knobs" and easily controllable as a service, and re-baptized it Redshift. However, the ParAccel assumption that data is stored locally still makes Redshift a less elastic database service than it could be. AWS added the Spectrum connector to Redshift to do part of the query processing (fetching and filtering data from S3) in a more elastic way.
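The sketch below illustrates the Spectrum idea (cluster endpoint, schema and table names are hypothetical). Redshift speaks the PostgreSQL wire protocol, so a standard driver such as psycopg2 can be used; the scan and filter over the external S3 table are handled by the Spectrum layer, while the join and aggregation run on the Redshift cluster itself.

    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",  # hypothetical endpoint
        port=5439, dbname="dev", user="awsuser", password="...",
    )
    cur = conn.cursor()

    # 'spectrum.sales' is assumed to be an external table defined over
    # Parquet files in S3; 'dim_region' is a regular, locally stored table.
    cur.execute("""
        SELECT d.region, SUM(s.amount)
        FROM spectrum.sales s                       -- external table in S3
        JOIN dim_region d ON d.id = s.region_id     -- local Redshift table
        WHERE s.sale_date >= '2020-01-01'
        GROUP BY d.region;
    """)
    print(cur.fetchall())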
A cloud-native and more recent system is Snowflake. It stores data in S3 and allows users to start multiple virtual data warehouses that share this data; Snowflake uses the local disks of the compute nodes only for data caching. One popular feature of Snowflake is that it can store and query large amounts of JSON data efficiently. JSON is a rather verbose format, but Snowflake stores the regular parts of JSON documents under the hood in a compressed column store. The other remarkable feature is data sharing: Snowflake customers can publish data and allow other Snowflake customers to execute queries on it, for a price. As such, Snowflake can act as a kind of app store for data and queries (Snowflake takes a cut of the query fees).
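A minimal sketch of the JSON feature, using the snowflake-connector-python library (the account, warehouse, table and JSON field names are hypothetical): documents go into a VARIANT column, individual fields are addressed with Snowflake's colon/path syntax, and the regular parts are stored column-wise underneath.

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",   # hypothetical credentials
        warehouse="ANALYTICS_WH", database="DEMO", schema="PUBLIC",
    )
    cur = conn.cursor()

    # One VARIANT column holds an arbitrary JSON document per row.
    cur.execute("CREATE TABLE IF NOT EXISTS events (payload VARIANT)")

    # Fields inside the JSON are referenced by path and cast to SQL types.
    cur.execute("""
        SELECT payload:user.country::string AS country,
               COUNT(*) AS n
        FROM events
        WHERE payload:type::string = 'click'
        GROUP BY country
        ORDER BY n DESC
    """)
    print(cur.fetchall())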
A third system of interest is Google BigQuery. This is the cloud version of Google's (initially internal) Dremel system for SQL analytics. What is remarkable is that BigQuery is serverless: while in Snowflake and Redshift users have to configure and start up database clusters (i.e., groups of VMs that can be scaled up or down), there is no such notion in BigQuery. Google simply executes your queries on machines it runs itself (another example of a serverless system is AWS Athena, which is the database system Presto, originally designed for Hadoop, run for you by Amazon). In these serverless systems, the abstraction level is raised further, and IT users no longer even know what IT is running for them. They just pay by the query, typically by the GB of data accessed.
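A minimal sketch of the serverless model, using the google-cloud-bigquery client library and one of Google's public sample datasets: there is no cluster to configure or start, you submit SQL and are billed by the amount of data scanned.

    from google.cloud import bigquery

    client = bigquery.Client()   # project and credentials come from the environment

    job = client.query("""
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 10
    """)

    for row in job.result():     # blocks until the query finishes
        print(row.name, row.total)

    # The cost driver in the serverless model: how much data the query touched.
    print("bytes processed:", job.total_bytes_processed)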
For technical background material, there are some papers:
In the video/presentation section we provide material on database systems in the cloud ("platform as a service"):
Aurora runs MySQL and PostgreSQL in a scalable manner in the AWS cloud. The core idea is that MySQL and PostgreSQL on the compute node only do logging, and do not write any dirty pages to storage. Rather, a decoupled fleet of log-processing servers performs continuous database recovery from the log, reconstructing the latest data pages, replicated six-fold. This decoupling reduces the need for synchronous writes over the network, makes Aurora very scalable, and even lets it outperform normal, locally stored MySQL and PostgreSQL systems in terms of speed.
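The following toy model (not Aurora's actual interfaces, just an illustration of the "the log is the database" idea) shows the division of labor: the compute node only appends redo records, and a separate storage tier materializes up-to-date pages by replaying that log on demand.

    from collections import defaultdict

    class StorageNode:
        """Toy stand-in for one of the (in Aurora: six) replicas that consume the redo log."""
        def __init__(self):
            self.log = []                    # durable redo records
            self.pages = defaultdict(dict)   # page_id -> materialized contents

        def append(self, record):
            # The only synchronous work on the write path: persist the log record.
            self.log.append(record)

        def read_page(self, page_id):
            # "Continuous recovery": replay the log to produce the latest page image.
            for page, key, value in self.log:
                if page == page_id:
                    self.pages[page_id][key] = value
            return self.pages[page_id]

    # The "compute node" ships redo records, never dirty pages.
    replicas = [StorageNode() for _ in range(6)]
    for r in replicas:
        r.append(("page_1", "balance", 100))
        r.append(("page_1", "balance", 80))

    print(replicas[0].read_page("page_1"))   # {'balance': 80}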
Snowflake, as presented by co-founder Marcin Zukowski (also co-founder of VectorWise), a CWI PhD graduate:
From the side of Google, there is Google BigQuery, which also runs on a parallel column store, this time Google's own Dremel system:
A final contestant in this fast-growing market is Microsoft Azure Synapse Analytics, which used to be called SQL Data Warehouse. This was Microsoft's high-end parallel cluster version of SQL Server. It is now combined with Spark and other machine-learning tools, which allows the parallel SQL Server to fire off parallel Spark tasks and to communicate results between the two efficiently (the scientific papers on this hybrid system refer to it by the name PolyBase).