Performed by Group 4
The hyperlinks between Web pages form a giant graph. Analysis of said graph can yield both hidden gems and depressing truths. The Common Crawl periodically crawls a significant part of the public Web and makes the data freely available. Your task is to filter the Common Crawl data to only include Dutch websites, and to construct and analyze the resulting link graph.
This analysis uses Apache Pig. SurfSara provides a library, examples, and a tutorial for reading the Common Crawl data with Pig.
The Common Crawl data is stored on HDFS at /data/public/common-crawl/crawl-data/CC-TEST-2014-10 (test set) and /data/public/common-crawl/crawl-data/CC-MAIN-2014-10 (full set).
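
The filtering and graph-construction steps described above can be sketched on a small scale. The following is an illustrative Python sketch, not the Pig script used for the analysis, and it rests on two assumptions that the task description does not state: that a "Dutch website" is one whose hostname falls under the .nl top-level domain, and that the link graph is built at host level rather than page level. The function names (host_of, is_dutch, build_link_graph, in_degrees) and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercased hostname of a URL, or None if it has none."""
    try:
        host = urlparse(url).hostname
    except ValueError:
        # urlparse rejects some malformed URLs (e.g. broken IPv6 brackets).
        return None
    return host.lower() if host else None


def is_dutch(host):
    # Assumption: "Dutch" means the host is under the .nl top-level domain.
    return host is not None and host.endswith(".nl")


def build_link_graph(page_links):
    """Build a host-level link graph from (source_url, target_url) pairs.

    Only edges whose source and target are both Dutch hosts are kept,
    and links within a single host are dropped. The result maps
    (source_host, target_host) to the number of links observed.
    """
    edges = Counter()
    for src, dst in page_links:
        s, d = host_of(src), host_of(dst)
        if is_dutch(s) and is_dutch(d) and s != d:
            edges[(s, d)] += 1
    return edges


def in_degrees(edges):
    """Count, per host, how many distinct hosts link to it."""
    degrees = Counter()
    for (_, target) in edges:
        degrees[target] += 1
    return degrees


if __name__ == "__main__":
    # Hypothetical sample links, standing in for links extracted from the crawl.
    sample = [
        ("http://www.example.nl/a", "http://other.nl/b"),
        ("http://www.example.nl/c", "http://other.nl/"),
        ("http://www.example.nl/a", "http://example.com/"),
        ("http://foo.nl/", "http://other.nl/x"),
    ]
    graph = build_link_graph(sample)
    for host, degree in in_degrees(graph).most_common():
        print(host, degree)
```

In Pig, the same logic would roughly correspond to a FILTER on the extracted hostnames followed by a GROUP on the (source, target) pair. A TLD check is only a rough proxy for "Dutch": it misses Dutch sites hosted under other domains such as .com.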