Dataset. Flickr hosts a staggering number of pictures, a portion of which is publicly available. A subset of these public pictures is listed in a textual description (the dataset listing), from which you can crawl the pictures and analyze them in various ways. This data is also popular in image processing research and is hosted by AWS as part of its open data sets under the name Multimedia Commons. Because of this AWS availability, you no longer have to download the pictures to S3 yourself, but the original Flickr dataset listing contains some additional information (e.g. GPS coordinates) that can be useful.
F1: FaceJoin (reloaded). Crawl the Flickr picture archive and use image recognition software to identify faces and match them to a list of faces with known identity: a join on faces. The list of known faces could be, for instance, the FBI's most wanted list, but may also be chosen differently (e.g. famous people from Wikipedia, or missing children). The visualization would show a ranked list of Flickr matches per known portrait.
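To make the "join on faces" idea concrete, here is a minimal sketch of the ranking step only. It assumes that some face recognition model has already turned each known portrait and each detected face into a fixed-length embedding vector; the names `face_join`, `known_faces`, `detected_faces`, `top_k` and `min_score`, the cosine similarity measure, and the toy vectors are all hypothetical illustrations, not the method used by the project described below.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def face_join(known_faces, detected_faces, top_k=3, min_score=0.5):
    """Return {identity: [(score, photo_id), ...]} ranked by descending score.

    known_faces:    {identity: embedding}             (hypothetical input)
    detected_faces: iterable of (photo_id, embedding) (hypothetical input)
    """
    detected = list(detected_faces)
    ranked = {}
    for identity, known_vec in known_faces.items():
        # Score every detected face against this known portrait.
        scored = (
            (cosine_similarity(known_vec, vec), photo_id)
            for photo_id, vec in detected
        )
        # Drop weak matches, then keep only the top_k best ones.
        candidates = [pair for pair in scored if pair[0] >= min_score]
        ranked[identity] = heapq.nlargest(top_k, candidates)
    return ranked


if __name__ == "__main__":
    # Toy 3-dimensional embeddings; real face embeddings are much longer.
    known = {"person_a": [1.0, 0.0, 0.0], "person_b": [0.0, 1.0, 0.0]}
    photos = [
        ("photo_1", [0.9, 0.1, 0.0]),
        ("photo_2", [0.1, 0.9, 0.1]),
        ("photo_3", [0.0, 0.0, 1.0]),
    ]
    for identity, matches in face_join(known, photos).items():
        print(identity, [(photo_id, round(score, 3)) for score, photo_id in matches])
```

This nested loop compares every known portrait with every detected face, which is fine for a toy example. At the scale of millions of images, one would more likely stream the detected faces once and keep a small heap per identity, or use an approximate nearest-neighbour index.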
Summary. Rather than using Spark on the SurfSara Hadoop cluster, this project was executed on AWS with a $450 budget (which we got at a heavy discount), using a GPU instance to train a model that recognizes the faces of 20 famous people. This model was then used to search for these people in 20M Flickr images.
As we can see, deep learning can achieve some amazing results, though it works well for some faces (e.g. Bush, Chavez, Saddam, Putin, Lula) but not for others. Given the limited time available for hyperparameter tuning, this is still an interesting result. Regrettably, we do not get an analysis of the factors that may have led to success or failure.
Data Curiosity: **
Paper Writing: ****
Technical difficulties mastered: ****
Visualization coolness: ***