Project E: Structure of the Dutch Web

Performed by Group 4

The hyperlinks between Web pages form a giant graph. Analysis of said graph can yield both hidden gems and depressing truths. The Common Crawl periodically crawls a significant part of the public Web and makes the data freely available. Your task is to filter the Common Crawl data to only include Dutch websites, and to construct and analyze the resulting link graph.
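As a rough illustration of the filtering and graph construction steps, here is a minimal Python sketch. It is not the project's Pig pipeline. It assumes that "Dutch" means a hostname under the .nl top-level domain (the task description does not define this), and that the links have already been extracted as (source URL, target URL) pairs. The function names and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL, or "" if it cannot be parsed."""
    try:
        return urlsplit(url).hostname or ""
    except ValueError:  # malformed URLs, e.g. a broken IPv6 literal
        return ""


def is_dutch(url):
    """Assumption: a page counts as Dutch when its hostname ends in .nl."""
    return host_of(url).endswith(".nl")


def build_host_graph(links):
    """Count host-to-host edges, keeping only links between two Dutch hosts."""
    edges = Counter()
    for source, target in links:
        if is_dutch(source) and is_dutch(target):
            src, dst = host_of(source), host_of(target)
            if src != dst:  # ignore links that stay within one host
                edges[(src, dst)] += 1
    return edges


def in_degrees(edges):
    """For each host, count the distinct Dutch hosts that link to it."""
    degree = Counter()
    for _, dst in edges:
        degree[dst] += 1
    return degree


# Hypothetical sample links, for illustration only.
sample_links = [
    ("http://www.example.nl/a", "http://nieuws.example.org/x"),
    ("http://www.example.nl/a", "http://shop.voorbeeld.nl/"),
    ("http://blog.test.nl/", "http://shop.voorbeeld.nl/p"),
    ("http://shop.voorbeeld.nl/", "http://shop.voorbeeld.nl/about"),
    ("http://www.example.com/", "http://www.example.nl/"),
]
graph = build_host_graph(sample_links)
popularity = in_degrees(graph)
```

Aggregating to the host level keeps the graph small. A page-level graph would use full URLs as nodes instead. Other definitions of "Dutch" (for example, page language or server location) would need a different filter.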

Idea
Technology

The analysis uses Apache Pig. SurfSara provides a library, examples, and a tutorial for reading the Common Crawl data with Pig.

Data

The Common Crawl data is stored on HDFS at /data/public/common-crawl/crawl-data/CC-TEST-2014-10 (test set) and /data/public/common-crawl/crawl-data/CC-MAIN-2014-10 (full set).

Literature