LSDE: Large Scale Data Engineering 2022

Dataset. Flickr hosts a staggering number of pictures, a portion of which is publicly available. A subset of these public Flickr pictures is listed by means of textual descriptions, from which you can crawl the pictures and analyze them in various ways. This data is also popular in image processing research and is hosted by AWS as part of its open data sets under the name Multimedia Commons. Because of this AWS availability, you no longer have to download the pictures to S3 yourself, but the original Flickr dataset listing has some extra information (e.g. GPS coordinates) that can be useful.

F4: Image Interpretation. Crawl the Flickr picture archive. Create a data product that consists of a cleaned-up Flickr dataset from which the non-loadable images have been removed. Further annotate the images using image-to-text classification and scene classification. Create a visualization that is an image retrieval engine by keyword (and possibly other features).
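The cleaning step can start with a cheap structural check before any full decode. The sketch below is only an illustration and not necessarily how the students did it: it assumes the crawled files are JPEGs held in memory as bytes, and the names `looks_like_complete_jpeg` and `drop_unloadable` are hypothetical. A payload that passes can still fail in a real image decoder, so this is a first-pass filter only.

```python
JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"  # JPEG end-of-image marker


def looks_like_complete_jpeg(data: bytes) -> bool:
    """Heuristic check: payload starts with SOI and ends with EOI.

    Assumption: images are JPEGs. This catches empty downloads,
    HTML error pages and truncated files, but not every corrupt image.
    """
    if len(data) < 4:
        return False
    return data.startswith(JPEG_SOI) and data.endswith(JPEG_EOI)


def drop_unloadable(records):
    """Keep only (image_id, payload) pairs whose payload passes the check."""
    return [
        (image_id, data)
        for image_id, data in records
        if looks_like_complete_jpeg(data)
    ]
```

For example, a crawler response that contains an HTML "not found" page instead of image bytes is rejected, because it does not begin with the start-of-image marker.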

Summary. Using the InceptionV3 neural network model, the students generated an image-search-by-keyword index over the full collection of images, published as a static HTML website. All images were downloaded using the Databricks Runtime (i.e. Spark), packaged into large Parquet files, and subsequently processed using PyTorch on a GPU instance. All computation was done on AWS EC2 in conjunction with Databricks notebooks.
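A search-by-keyword index of this kind is typically an inverted index from label words to image identifiers. The following is a minimal sketch under stated assumptions, not the students' actual implementation: it assumes the classifier output is already available as a mapping from image id to a list of label strings, and the function names `build_keyword_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_keyword_index(annotations):
    """Map each lower-cased label word to a sorted list of image ids.

    `annotations` is assumed to be {image_id: [label, ...]}; multi-word
    labels such as "golden retriever" are indexed word by word.
    """
    index = defaultdict(set)
    for image_id, labels in annotations.items():
        for label in labels:
            for token in label.lower().split():
                index[token].add(image_id)
    return {token: sorted(ids) for token, ids in index.items()}


def search(index, query):
    """Return the image ids that match every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)
```

Because the resulting index is a plain dictionary of lists, it could be serialized to JSON and queried from client-side script, which would fit a static HTML site; whether the project did this is not stated in the summary.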

Data Curiosity: ****
Paper Writing: ****
Technical difficulties mastered: ****
Visualization coolness: ****


Image Annotation -- Abhinav Shankar, Corneliu Suficu and Leonard Herold (paper)