LSDE: Large Scale Data Engineering 2021
Lecture Block 1: Cloud Computing: Cluster Computing as a Commodity Good

PDF slides:

In this lecture we first define Big Data in terms of data management problems with the three V's: (1) Volume: the data is large; (2) Variety: the data is often not clean and tabular but messy (text, even images); (3) Velocity: new data keeps arriving continuously. We also discuss the Power Laws that appear in many Big Data problems: the mass of the data is in the long tail, and this long tail cannot be ignored, as it may represent the majority of all data points. Power Law distributions are typical in social networks; for example, the number of Twitter followers per account follows a power law.
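To make the long-tail claim concrete, here is a small Python sketch. The exponent (1, i.e. counts proportional to 1/rank) and the population size are illustrative assumptions, not figures from the lecture; the point is only that the accounts outside the top 1% still hold a large share of the total follower mass:

```python
# Illustrative sketch: follower counts modelled as Zipf-distributed,
# count(rank) proportional to 1/rank. The exponent and population size
# are assumptions for illustration, not numbers from the lecture.

def tail_mass_share(num_accounts: int, top_fraction: float) -> float:
    """Share of total followers held by accounts OUTSIDE the top fraction."""
    weights = [1.0 / rank for rank in range(1, num_accounts + 1)]
    total = sum(weights)
    head = sum(weights[: int(num_accounts * top_fraction)])
    return (total - head) / total

share = tail_mass_share(num_accounts=100_000, top_fraction=0.01)
print(f"Tail (bottom 99%) holds {share:.0%} of all followers")  # -> 38%
```

Even with 100,000 accounts, the bottom 99% together hold well over a third of all followers, so a system that ignores the tail loses a large fraction of the data.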

We explain Eric Brewer's CAP theorem: a distributed software system cannot achieve all three of Consistency (reads always reflect the latest updates), Availability (the system is always up), and Partition tolerance (the system keeps working when communication between datacenters is lost). We also describe concepts such as replication and sharding, and their consequences in terms of read speed, update speed and consistency.
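The consistency consequences of replication can be illustrated with a toy quorum scheme (a generic textbook sketch, not any particular system's protocol): with N replicas, writing to W of them and reading from R of them guarantees that a read observes the latest write whenever R + W > N, because the read set and write set must overlap in at least one replica.

```python
# Toy quorum replication: N replicas, each storing a (version, value)
# pair. A write reaches only W replicas; a read asks R replicas and
# trusts the highest version seen. With R + W > N the two sets must
# overlap, so every read sees the latest write.

N, W, R = 3, 2, 2
replicas = [(0, None)] * N  # (version, value) per replica
version = 0

def write(value):
    global version
    version += 1
    for i in range(W):            # the write reaches only W replicas
        replicas[i] = (version, value)

def read():
    # worst case: read the R replicas with the least overlap with writes
    answers = [replicas[i] for i in range(N - R, N)]
    return max(answers)[1]        # value with the highest version wins

write("v1")
write("v2")
print(read())  # -> v2: the overlap forces one up-to-date replica
```

Lowering W speeds up updates and lowering R speeds up reads, but as soon as R + W <= N a read may miss the latest write: exactly the consistency trade-off the lecture describes.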

As we work on large computer infrastructures, we distinguish between three related areas: (1) Super Computing, where performance is king and programmability a side issue; (2) Cluster Computing, which is about quickly getting things done on large clusters of unreliable machines; and (3) Cloud Computing, where computation is performed in large clusters operated by a third party and sold as a commodity service, billed by actual use and seamlessly allowing users to get more or less of it as their needs change (elasticity). In the practicum, where we use Spark (cluster computing) on Amazon Web Services (cloud computing), we hence combine (2) and (3). Super computing is out of scope for this course. In cloud computing, we further distinguish between IaaS: Infrastructure-as-a-Service (virtual machines for rent, e.g. Amazon EC2), PaaS: Platform-as-a-Service (e.g. a database system for rent, such as Amazon Redshift) and SaaS: Software-as-a-Service (an application for rent, e.g. Microsoft Office 365 or Salesforce).

The last part of the lecture describes Amazon Web Services (AWS) in some detail to prepare you for the practicum. Services to remember are: S3 blob storage, virtually unlimited storage of large objects or "blobs" (binary large objects), presumably implemented by spreading the data with replication over the hundreds of thousands of machines Amazon owns; EBS, virtual disks (network-attached block storage, with snapshots kept in S3) that do a lot of caching to avoid network traffic; and the EC2 service, which allows you to power up virtual machines in the cloud, with many options to choose from in terms of CPU, RAM, and local disks (aka "ephemeral storage", which is empty when the machine starts up).
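The lecture only speculates about how S3 spreads blobs with replication over so many machines. The sketch below shows one standard technique that fits that description, rendezvous (highest-random-weight) hashing; it is a hypothetical illustration, not Amazon's actual implementation, and the machine names and replica count are made up:

```python
import hashlib

# Hypothetical placement scheme (NOT Amazon's actual implementation):
# each blob is stored on the REPLICAS machines that hash lowest when
# combined with the blob's key (rendezvous hashing). Placement is
# deterministic, so any node can recompute where a blob lives without
# consulting a central directory.

MACHINES = [f"machine-{i}" for i in range(100)]   # made-up fleet
REPLICAS = 3                                      # assumed replica count

def score(machine: str, key: str) -> int:
    digest = hashlib.sha256(f"{machine}/{key}".encode()).hexdigest()
    return int(digest, 16)

def placement(key: str) -> list[str]:
    return sorted(MACHINES, key=lambda m: score(m, key))[:REPLICAS]

print(placement("photos/cat.jpg"))  # same 3 machines on every call
```

A nice property of this scheme is that adding or removing one machine only moves the blobs that hashed onto that machine, which matters at the scale the lecture describes.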

Background Material

A lightweight entry into more material, specifically on Amazon Web Services, is this YouTube video, recorded in 2011 at the STRATA conference by Amazon CTO Werner Vogels -- a VU PhD graduate!

Admittedly an old video, but one of the better ones IMHO. Werner Vogels typically gives the keynote at the AWS re:Invent conferences, so there are plenty of alternatives.

The AWS cloud services have greatly expanded from a handful in 2008: there are now hundreds of them, which creates a problem of its own: how do you find out which services exist, and which one best fits your problem? Below is a (relatively) up-to-date overview of the most commonly used services.

Getting Started on AWS - AWSome Day 2018 from Amazon Web Services

Recognizing that software architecture for the cloud is very different from classical (on-premise) architecture, and that there are now so many services to choose from, Amazon has developed its own philosophy, which it calls the "AWS Well-Architected Framework". There is even an online tool (https://aws.amazon.com/well-architected-tool/) to design, document, review and evolve your cloud architecture.

Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AWS Summit from Amazon Web Services

Literature

The book below gives background information on the hardware aspects of datacenter computing (it is admittedly old, but still correct, and it is free):

A recent study of the prices and capabilities of AWS instances over the years. It proposes a workload model in terms of CPU, network and I/O footprint, and from that attempts to compute the optimal cloud machine configuration to use:
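The core idea of such a study can be sketched as a tiny optimizer: given a workload's resource footprint, pick the cheapest instance type whose resources cover it. The instance names, specs and hourly prices below are made-up placeholders, not real AWS figures:

```python
# Toy cost model in the spirit of the study: choose the cheapest
# instance whose resources cover the workload footprint. Instance
# specs and hourly prices are MADE-UP placeholders, not AWS data.

INSTANCES = {
    # name: (vcpus, ram_gb, network_gbit, price_per_hour)
    "small":   (2,    8,  1, 0.10),
    "compute": (16,  32, 10, 0.70),
    "memory":  (8,  128, 10, 0.90),
    "big":     (32, 256, 25, 2.50),
}

def cheapest(vcpus: int, ram_gb: int, network_gbit: float) -> str:
    candidates = [
        (price, name)
        for name, (c, r, n, price) in INSTANCES.items()
        if c >= vcpus and r >= ram_gb and n >= network_gbit
    ]
    if not candidates:
        raise ValueError("no instance type fits this workload")
    return min(candidates)[1]   # lowest hourly price wins

print(cheapest(vcpus=8, ram_gb=16, network_gbit=5))  # -> compute
```

The real study is of course richer (it models I/O footprint and tracks prices over the years), but the shape of the optimization is the same: constraints from the workload, minimization over the price list.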