In this assignment the goal is to get experience with a real Big Data challenge by attempting to solve it in the cloud, on Amazon Web Services (AWS). All datasets are available in S3 buckets.
Access to the clusters is via lsde-2021.cloud.databricks.com. Here, we run Databricks Spark notebooks on AWS clusters. It is also possible to use AWS services directly (e.g. EMR, ECS, Sagemaker, Athena, Redshift). For this, contact Peter Boncz.
Datasets typically come in some non-obvious format. Data is typically dirty, and you typically need domain-expert input to understand what is in it, e.g. how to interpret normal and especially outlier values. This is a fact of life for the data scientist. One of the first tasks when starting a project is therefore to run some initial scripts on the data and visualize/summarize the results to get a feel for what it is. This means you often need to delve immediately into finding out how the data is represented and what it looks like (size, structure, value distributions). To even contemplate doing this, you must start writing some scripts and tools that read the data. Do this quickly: we expect you to have such visualizations and summaries ready by the time of the planning presentation (assignment 2a).
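For example, a first exploration notebook might look roughly like the sketch below (PySpark on Databricks). The S3 path, file format, and column name are hypothetical placeholders; substitute whatever your dataset actually contains.

```python
# A minimal first-look sketch in PySpark. The bucket path, file format and
# column name are hypothetical placeholders for your own dataset.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import matplotlib.pyplot as plt

spark = SparkSession.builder.appName("initial-exploration").getOrCreate()

# Read one small slice of the raw data; adjust the format (json/csv/parquet) to match.
df = spark.read.json("s3://lsde-example-bucket/raw/part-00000.json")

# Size and structure.
df.printSchema()
print("rows in this slice:", df.count())

# Simple value distributions (the column name is a placeholder).
df.describe().show()
df.groupBy("some_category_column").count().orderBy(F.desc("count")).show(20)

# Pull a small sample to the driver and plot it locally.
sample_pdf = df.sample(fraction=0.01, seed=42).limit(10_000).toPandas()
sample_pdf.hist(figsize=(12, 8))
plt.show()
```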
Rather than diving in head-first with Big Data tools, stop and think first.
Try to understand as many properties of the data as you can. Take samples and visualize them (on your laptop).
Ask yourself all kinds of questions about the veracity of the data. Is it what you think it is?
Start small and write pieces of code that you can test locally (on Spark on your laptop). Make sure your code is robust to errors: it should log them, but not crash. Try to make resumable pipelines that, for example, prepare the data in chunks and let you restart generation for parts of the data.
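The sketch below illustrates one way to structure such a pipeline in PySpark. The input/output locations, chunk naming, and cleaning step are hypothetical placeholders, and the resumability check shown here (checking whether a chunk's output already exists) is just one simple option; a manifest file or Delta table would also work.

```python
# A sketch of a resumable, chunked pipeline (paths and names are hypothetical).
# Finished chunks are skipped on restart, and per-chunk failures are logged
# instead of crashing the whole job.
import logging
from pyspark.sql import SparkSession

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

spark = SparkSession.builder.appName("resumable-pipeline").getOrCreate()

IN_PREFIX = "s3://lsde-example-bucket/raw"        # hypothetical input location
OUT_PREFIX = "s3://lsde-example-bucket/prepared"  # hypothetical output location
CHUNKS = [f"chunk={i:04d}" for i in range(100)]   # e.g. one chunk per day or file group

def already_done(chunk: str) -> bool:
    # Cheap resumability check: does readable output for this chunk already exist?
    try:
        spark.read.parquet(f"{OUT_PREFIX}/{chunk}")
        return True
    except Exception:
        return False

for chunk in CHUNKS:
    if already_done(chunk):
        log.info("skipping %s (already processed)", chunk)
        continue
    try:
        df = spark.read.json(f"{IN_PREFIX}/{chunk}")
        cleaned = df.dropna(how="all")            # placeholder for real cleaning logic
        cleaned.write.mode("overwrite").parquet(f"{OUT_PREFIX}/{chunk}")
        log.info("finished %s", chunk)
    except Exception as exc:
        # Log and continue: one bad chunk should not kill the whole run.
        log.error("failed on %s: %s", chunk, exc)
```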
Do back-of-the-envelope calculations, extrapolating from your small-scale experiments. How much time will it take to compute on all your data? Are there smarter ways? What is the best representation of the data (maybe change or reduce it)? Which tools should you use for which step? What is the cost of each computational step on AWS? What are the storage requirements and their associated costs?
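Such an estimate can be as simple as the following sketch; every number in it is a made-up example, and the per-node hourly rate and S3 price should be checked against current AWS/Databricks pricing before you rely on them.

```python
# Back-of-the-envelope extrapolation (all numbers are made-up examples):
# measure a small sample locally, then scale up and price the full run.
sample_gb = 2.0          # size of the sample processed locally
sample_minutes = 15.0    # measured wall-clock time for that sample
total_gb = 500.0         # size of the full dataset

n_workers = 8            # planned cluster size
hourly_rate = 0.40       # assumed $/hour per worker node (verify current pricing)

# Naive linear scaling, divided over the workers (ignores overheads and skew).
total_hours_single = (total_gb / sample_gb) * sample_minutes / 60.0
cluster_hours = total_hours_single / n_workers
compute_cost = cluster_hours * n_workers * hourly_rate

# S3 storage for intermediate results, at roughly $0.023 per GB-month.
storage_cost = total_gb * 0.023

print(f"estimated cluster time: {cluster_hours:.1f} h, "
      f"compute ~${compute_cost:.0f}, storage ~${storage_cost:.0f}/month")
```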
As a general guideline, the projected AWS cost for the whole project should stay below $400; aiming to keep it within $250 is even better.
A word of warning.
Do not expect anything to work out-of-the-box.
This is normal. The project will be a sequence of attempts to look at the data and run tools, typically answered only by the next error message.
We feel for you, but this is (still) the current state of Big Data.
Also, do not wait too long before seeking help on the Slack channel (but google your question first).
Finally, here is an inspiring quote from the commencement speech of John Carmack, which captures the essence of LSDE's second assignment:
Today, you can do class projects that have capabilities that nobody on Earth could have done a few decades ago: train a neural net, crawl the web for data, deploy on the cloud — these are amazing things.