PDF slides: noSQL Systems
In this lecture we (finally) look at noSQL systems. These are systems that scale well on interactive workloads, i.e. they are able to sustain a very high-throughput workload of simple lookup or update queries. These systems outperform classical relational database systems running on a single computer by making use of clusters, typically by weakening consistency guarantees.
Workloads consisting of very many concurrent users doing simple lookups or updates are called transactional workloads: On-Line Transaction Processing (OLTP), as opposed to using database systems for complex analysis, a.k.a. On-Line Analytical Processing (OLAP), which we covered in the previous lecture on large-scale SQL. Historically, relational database management systems (RDBMSs) have been the dominant technology for transactional workloads. An RDBMS provides serializability (= the full illusion that each concurrent user works alone on the database), and the guarantees it provides have been dubbed ACID: Atomicity, Consistency, Isolation and Durability. Such relational transaction processing works best on a single server, but this has inherent scalability limits.
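To make the ACID notion concrete, here is a minimal single-server sketch using Python's built-in sqlite3 module (the accounts table and the transfer amounts are illustrative, not from the lecture): either both updates of the transfer commit together, or on failure neither does.

```python
import sqlite3

# Single-server ACID sketch: a money transfer as one transaction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()

try:
    # Both updates belong to one transaction: they apply together or not at all (Atomicity).
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
    conn.commit()      # Durability: once committed, the change survives
except sqlite3.Error:
    conn.rollback()    # on any failure, return to the previous consistent state

print(conn.execute("SELECT * FROM accounts").fetchall())
```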
When doing transactional workloads at internet scale, one must try to move RDBMSs towards a distributed architecture, i.e. make them work in a compute cluster. When managing data in a cluster, data typically gets (i) partitioned (sharded, for scalability), (ii) replicated (for availability) and (iii) cached (for performance, i.e. latency). Distributed replication/caching makes network communication necessary on each update, at least if consistency must be maintained. In the introduction of the course, we already saw via the CAP theorem that the properties Consistency, Availability and Partition-tolerance are unattainable together in a distributed system. Due to CAP effects, distributed database systems that provide ACID properties hit either performance or availability limits, because distributed ACID needs expensive protocols like 2PC (two-phase commit), sketched below. Something new was needed, and from this situation arose noSQL systems.
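To illustrate why distributed ACID is expensive, below is a minimal, simplified sketch of a 2PC coordinator (the Participant class and its methods are hypothetical placeholders, not a real library API): every distributed transaction costs two network round trips to all participants, and locks are held until the second phase completes.

```python
# Simplified two-phase commit (2PC) sketch; not a production implementation.

class Participant:
    def prepare(self, txn) -> bool:
        # A real participant would write txn to its local log and lock the
        # affected rows here; this placeholder simply votes "yes".
        return True

    def commit(self, txn): ...
    def abort(self, txn): ...

def two_phase_commit(coordinator_log, participants, txn):
    # Phase 1 (voting): one network round trip per participant.
    votes = [p.prepare(txn) for p in participants]
    decision = all(votes)
    coordinator_log.append((txn, "commit" if decision else "abort"))
    # Phase 2 (completion): a second round trip; locks are held until now.
    for p in participants:
        (p.commit if decision else p.abort)(txn)
    return decision
```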
noSQL systems are typically used to keep user state (say, the OLTP database) for many users, for which key-value lookups and (key, value) updates are sufficient. Their role in Big Data pipelines is thus often in the entry stage, capturing events as they happen -- these events are logged, accumulated and periodically stored e.g. on HDFS for further analysis. However, as explained in the Lambda Architecture (see the streams lecture), noSQL systems are also often used to serve out precomputed views, where this computation is some kind of Big Data analysis pipeline. Hence, they also act as the final stage, serving out e.g. recommendations, statistics, preferences and suggestions -- as such closing the loop of Big Data architectures.
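A minimal sketch of these two roles, assuming a local Redis instance and the redis-py client (the key names and the event fields are illustrative):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

# Entry stage: capture an event as it happens, appended to the user's event log.
event = {"user": 42, "action": "click", "item": 1337}
r.rpush("events:user:42", json.dumps(event))

# Serving stage: look up a precomputed view (e.g. recommendations written
# earlier by a batch analysis pipeline).
recs = r.get("recommendations:user:42")
if recs is not None:
    print(json.loads(recs))
```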
noSQL systems are so named not because they do not offer SQL as a query interface (some in fact do), but because they do not provide the serializability and ACID properties that typical SQL systems provide. As a counterpart to ACID, noSQL proponents have proposed BASE as an application development methodology. BASE stands for Basically Available Soft-State services with Eventual consistency. This means that applications must take into account that answers coming from the noSQL system may be stale (out of date), and that corrections to these answers might have to be made by the application code (using soft state). One may argue that calling this Lecture Block "transactional databases" is a misnomer, because noSQL systems typically do not offer transactional guarantees (i.e. ACID). However, the workload of handling many concurrent small lookup/update/insert requests is known as OLTP, i.e. a "transactional" workload.
For instance, a webmail application may send an email message and move it from the OutBox folder to the Sent folder using noSQL data updates. However, immediately after making these modifications, a noSQL query listing the contents of the OutBox and the Sent box may still show the old contents. This could make the email user think the message was not sent yet, and trigger the user to send it again. Sending emails twice due to an inconsistent GUI would embarrass the sender and be perceived as a very bad user experience. In order not to confuse users, under BASE the application developer must therefore perform compensating actions in the application (here: the GUI code): explicitly filter the just-sent message out of the OutBox listing and add it (if not present yet) to the Sent listing, using application logic written purely to hide this possible data inconsistency. The fact that the webmail GUI remembers that it just sent the message, and can use this knowledge to compensate for the stale noSQL answers, is its "soft state" (see the sketch below). Needless to say, such BASE application development, where one has to distrust the consistency of the data layer, makes application development harder. This is the price noSQL systems such as Amazon SimpleDB (Dynamo), REDIS and MongoDB make you pay for scaling to very many concurrent users while always providing low-latency lookup/update performance.
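A minimal sketch of such compensation logic, assuming a hypothetical key-value store client with put/delete/list operations (the key layout is illustrative):

```python
# BASE-style compensation in the webmail GUI: reads may be stale, so the GUI
# keeps soft state about messages it just moved and corrects listings itself.

recently_sent = set()   # soft state kept by the GUI, not by the database

def send_message(store, user, msg_id):
    store.delete(f"outbox:{user}:{msg_id}")     # noSQL update: remove from OutBox
    store.put(f"sent:{user}:{msg_id}", msg_id)  # noSQL update: add to Sent
    recently_sent.add(msg_id)                   # remember what we just did

def list_outbox(store, user):
    msgs = store.list(f"outbox:{user}:")        # may still return stale contents
    # Compensating action: filter out the message we know was already sent.
    return [m for m in msgs if m not in recently_sent]

def list_sent(store, user):
    msgs = store.list(f"sent:{user}:")
    # Compensating action: add the just-sent message if the store lags behind.
    return msgs + [m for m in recently_sent if m not in msgs]
```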
Certain noSQL systems have introduced the concept of Entity Groups, which allow co-locating data from multiple tables on the same user key. All updates that only touch items belonging to one customer thus become local updates (reducing communication). Google Megastore and its follow-up Spanner provide full ACID properties, but thanks to entity groups they know that many types of transactions do not require waiting for expensive 2PC and Paxos protocols, so the system remains scalable while also being highly distributed. As such, these later contributions by Google to this field (its earlier BigTable is a classical noSQL system that sacrifices consistency) employ expensive protocols only when needed, showing that ACID properties are achievable in a cluster at high scalability and low latency after all.
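A minimal sketch of the placement idea behind entity groups (the shard/write API and the table names are hypothetical): because all of a customer's rows are placed by the user key rather than by each table's own key, a multi-table update for one customer lands on a single node and needs no 2PC.

```python
# Entity-group placement sketch: co-locate all tables of one customer on one shard.

NUM_SHARDS = 8

def shard_for(user_key: int) -> int:
    # Every table uses the *user* key for placement, not its own primary key.
    return hash(user_key) % NUM_SHARDS

def update_customer(shards, user_key, order, invoice):
    node = shards[shard_for(user_key)]
    # Both writes go to the same node: a local transaction, no 2PC required.
    node.write(("orders", user_key), order)
    node.write(("invoices", user_key), invoice)
```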
Jepsen is a project that tests distributed systems for consistent behavior. The testing methodology includes injecting network faults and machine failures under heavy read-write loads. Jepsen has tested a number of noSQL systems that claimed consistency, often with disastrous results:
For technical background material, there are some papers: