Dataset. The US National Oceanic and Atmospheric Administration (NOAA) publishes the Integrated Surface Data collection, which contains measurements from weather stations around the world over recent decades. We have mirrored the ~205 GB of compressed weather measurements provided on the FTP server; documentation of the data format is also available there. In addition, this year we added the Historical Land-Cover Change and Land-Use Conversions Global Dataset, which tracks land use in the USA across a series of snapshots starting in the 18th century.
N1: Global Warming. Measure and visualize global warming by comparing average, median, maximum, and minimum temperatures over various date ranges across all weather stations.
Summary: The NOAA data was parsed, filtered (excluding measurements marked as erroneous), and then grouped and aggregated by geographic location and time using Spark with Python. The report describes some interesting technical difficulties, namely the non-scalability of median aggregates and poor HDFS performance on small files (the latter was circumvented). It also contains a nice analysis of the geographic coverage of weather stations over time.
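To make the filter, group, and aggregate steps concrete, here is a minimal sketch in plain Python rather than Spark. It is not the project's actual code. The record layout (station, year, temperature in tenths of a degree Celsius, quality flag), the missing-value sentinel 9999, and the set of flags treated as erroneous are all assumptions made for illustration; the ISD format documentation is the authority on the real encoding. The helper merge_partials is hypothetical and only illustrates why mean, min, and max scale easily (their partial results can be combined across partitions), whereas an exact median needs all values of a group in one place.

```python
from collections import defaultdict
from statistics import mean, median

# Assumed encoding, for illustration only (check the ISD format documentation).
MISSING = 9999                  # assumed sentinel for a missing temperature
ERRONEOUS_FLAGS = {"3", "7"}    # assumed quality flags meaning "erroneous"


def aggregate_by_station_year(records):
    """Drop bad readings, group by (station, year), and aggregate each group.

    `records` is an iterable of hypothetical pre-parsed tuples:
    (station_id, year, temp_in_tenths_of_celsius, quality_flag).
    """
    groups = defaultdict(list)
    for station, year, temp_tenths, flag in records:
        if temp_tenths == MISSING or flag in ERRONEOUS_FLAGS:
            continue  # filtered out: missing or marked erroneous
        groups[(station, year)].append(temp_tenths / 10.0)

    return {
        key: {
            "mean": mean(temps),
            "median": median(temps),  # needs every value of the group at once
            "min": min(temps),
            "max": max(temps),
        }
        for key, temps in groups.items()
    }


def merge_partials(a, b):
    """Combine two (count, sum, min, max) partial aggregates.

    Mean, min, and max can be derived from such partials, so they can be
    computed per partition and merged. No fixed-size partial like this
    exists for an exact median.
    """
    return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))


sample = [
    ("A", 1990, 100, "1"),
    ("A", 1990, 140, "1"),
    ("A", 1990, 300, "3"),    # dropped: flagged erroneous (assumed flag)
    ("A", 1990, 9999, "1"),   # dropped: missing value (assumed sentinel)
    ("A", 1990, 180, "5"),
    ("B", 1991, -50, "1"),
]
result = aggregate_by_station_year(sample)
```

In a distributed setting the same shape would map onto a filter followed by a group-by with aggregate functions; swapping the exact median for an approximate quantile is one common way around the scalability problem, though the summary does not say which approach the report took.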
Data curiosity: ***
Related work: ***
Technical difficulties mastered: ***
Visualization coolness: **