LSDE 2016 - Large Scale Data Engineering


LSDE: Large Scale Data Engineering 2016

Lecture Block 1: Cloud Computing: Cluster Computing as a Commodity Good

PDF slides:

In this lecture we first define Big Data in terms of data management problems with the three V's: (1) Volume: the data is large, (2) Variety: the data is often not clean and tabular, but messy (text, even images), (3) Velocity: new data keeps arriving continuously. We also discuss power laws, which show up in many Big Data problems: much of the mass of the data is in the long tail, so the long tail cannot be ignored, as it may represent the majority of all data points. Power-law distributions are typical in social networks, e.g. the number of followers per Twitter account follows a power law.
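To make the long-tail argument concrete, here is a minimal sketch that assumes an idealized Zipf-like distribution in which the item at rank r has weight 1 / r^alpha. The function name `tail_share` and the parameter values are illustrative assumptions, not taken from the lecture. It computes which fraction of the total mass lies outside the top-ranked items.

```python
def tail_share(n_items: int, alpha: float, head_size: int) -> float:
    """Fraction of the total mass held by items ranked below the top `head_size`.

    Assumes an idealized power law: the item at rank r has weight 1 / r**alpha.
    """
    weights = [1.0 / rank ** alpha for rank in range(1, n_items + 1)]
    total = sum(weights)
    head = sum(weights[:head_size])
    return (total - head) / total


if __name__ == "__main__":
    # Hypothetical setting: one million items, exponent 1, "head" = top 100 items.
    share = tail_share(n_items=1_000_000, alpha=1.0, head_size=100)
    print(f"share of mass outside the top 100: {share:.2f}")
```

With these assumed parameters, well over half of the mass sits outside the top 100 items, which is why a system that only handles the popular head would miss most of the data.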

We explain Eric Brewer's CAP theorem: a global software system cannot guarantee all three of Consistency (reads always reflect the latest updates), Availability (the system is always up), and Partition tolerance (the system keeps working when communication between datacenters is lost). We also describe concepts such as replication and sharding, and their consequences in terms of read speed, update speed and consistency.
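The trade-offs between replication, sharding and consistency can be illustrated with a toy in-memory key-value store. This is only a sketch under simplifying assumptions: the class `ShardedStore` and its methods are hypothetical names, everything runs in one process, and "replicas" are plain dictionaries. Sharding spreads keys over shards by hash; replication keeps several copies of each shard. A synchronous write touches every replica (slower updates, consistent reads), while an asynchronous write updates only the primary and leaves the other replicas stale until they are synced.

```python
import zlib


class ShardedStore:
    """Toy key-value store with hash sharding and per-shard replication."""

    def __init__(self, n_shards: int, n_replicas: int):
        # shards[s][r] is replica r of shard s; replica 0 acts as the primary.
        self.shards = [[{} for _ in range(n_replicas)] for _ in range(n_shards)]
        self.pending = set()  # keys whose secondary replicas may be stale

    def _shard_of(self, key: str) -> int:
        # Deterministic hash so the same key always maps to the same shard.
        return zlib.crc32(key.encode("utf-8")) % len(self.shards)

    def put(self, key: str, value, synchronous: bool = True) -> None:
        replicas = self.shards[self._shard_of(key)]
        replicas[0][key] = value
        if synchronous:
            # Update cost grows with the number of replicas.
            for replica in replicas[1:]:
                replica[key] = value
        else:
            # Fast write, but secondaries lag behind until sync() runs.
            self.pending.add(key)

    def get(self, key: str, replica: int = 0):
        # Reads can be served by any replica, which spreads the read load.
        return self.shards[self._shard_of(key)][replica].get(key)

    def sync(self) -> None:
        """Propagate the primary's current value of each pending key."""
        for key in self.pending:
            replicas = self.shards[self._shard_of(key)]
            for replica in replicas[1:]:
                replica[key] = replicas[0][key]
        self.pending.clear()


if __name__ == "__main__":
    store = ShardedStore(n_shards=4, n_replicas=3)
    store.put("alice", 1)                      # synchronous write
    store.put("alice", 2, synchronous=False)   # asynchronous write
    print(store.get("alice", replica=0), store.get("alice", replica=1))
    store.sync()
    print(store.get("alice", replica=1))
```

Between the asynchronous write and the call to `sync()`, a read from a secondary replica returns the old value: the store stays available and fast, at the price of consistency.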

As we work on large computer infrastructures, we distinguish between three related areas: (1) Super Computing, where performance is king and programmability a side issue, (2) Cluster Computing, which is about quickly getting things done on large clusters of unreliable machines, and (3) Cloud Computing, where computation is performed in large clusters operated by a third party and sold as a commodity service to users based on their actual use, so that users can seamlessly get more or less of it as their needs change (elasticity). In the practicum, where we use Spark (cluster computing) on Amazon Web Services (cloud computing), we hence combine (2) and (3). Super computing is out of scope for this course. In cloud computing, we further distinguish between IaaS: Infrastructure-as-a-Service (virtual machines for rent, e.g. Amazon EC2), PaaS: Platform-as-a-Service (database system for rent, e.g. Amazon Redshift) and SaaS: Software-as-a-Service (application for rent, e.g. Microsoft Office 365 or Salesforce).
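The economic appeal of elasticity can be shown with a small back-of-the-envelope calculation. The sketch below compares owning enough machines for the peak load with renting only what each hour needs. All numbers (load profile, capacity per machine, hourly price) and function names are hypothetical, chosen only for illustration; real cloud pricing is more involved.

```python
import math


def machines_needed(load: float, capacity_per_machine: float) -> int:
    """Smallest number of machines that can serve the given load."""
    return math.ceil(load / capacity_per_machine)


def fixed_cost(hourly_load, capacity_per_machine, price_per_machine_hour):
    """Cost when provisioning for the peak and keeping all machines on."""
    peak = max(machines_needed(l, capacity_per_machine) for l in hourly_load)
    return peak * len(hourly_load) * price_per_machine_hour


def elastic_cost(hourly_load, capacity_per_machine, price_per_machine_hour):
    """Cost when paying each hour only for the machines that hour needs."""
    machine_hours = sum(machines_needed(l, capacity_per_machine) for l in hourly_load)
    return machine_hours * price_per_machine_hour


if __name__ == "__main__":
    # Hypothetical day: 20 quiet hours and a 4-hour peak at ten times the load.
    load = [100] * 20 + [1000] * 4
    print("fixed:  ", fixed_cost(load, capacity_per_machine=100, price_per_machine_hour=1.0))
    print("elastic:", elastic_cost(load, capacity_per_machine=100, price_per_machine_hour=1.0))
```

Under these assumed numbers the fixed setup pays for 240 machine-hours while the elastic one pays for 60, because the peak capacity sits idle for most of the day.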

The last part of the lecture describes Amazon Web Services (AWS) in some detail to prepare you for the practicum. Services to remember are: S3 blob storage, which offers practically unlimited storage of large files or "blobs" (binary large objects), presumably implemented by spreading storage with replication over the hundreds of thousands of machines Amazon owns; EBS, which provides virtual disks that can hold filesystems and that store their data remotely, on networked storage rather than on the machine itself (snapshots of a volume are kept on S3), while doing a lot of caching to avoid network traffic; and EC2, the service that lets you power up virtual machines in the cloud. We also cover some of the EC2 options to choose from in terms of CPU, RAM, and local disks, also known as "ephemeral storage", which are empty when the machine starts up.

Technical Literature

This book on the hardware architecture of clusters is aimed at readers who want a deeper technical understanding of cloud hardware and software.

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines

General Background Material

A light way into more material, specifically on Amazon Web Services, is this YouTube video of a talk given in 2011 at the Strata conference by Amazon CTO Werner Vogels, a VU PhD graduate!

Finally, after having attended the lecture, the technical architects among you might take a look at the following Amazon promo presentation, "How to Scale Your Next Idea on AWS: a Love Story", which highlights the role of many of their services in constructing resilient, scalable and secure web applications. We hope it will make more sense to you than it would have before the lecture, although in the lecture we do not really explain all the Amazon services mentioned there.

Introduction to Amazon Web Services - How to Scale your Next Idea on AWS : A Love Story - Jinesh Varia (Updated Jan 2014) from Amazon Web Services

Peter Boncz · Hannes Mühleisen