<div class="title">Project E: Structure of the Dutch Web</div>
<p><b>Performed by Group 4</b></p>
<p>
The hyperlinks between Web pages form a giant graph. Analysis of said graph can yield both hidden gems and depressing truths. The <a href="http://commoncrawl.org/">Common Crawl</a> periodically crawls a significant part of the public Web and makes the data freely available. Your task is to filter the Common Crawl data to only include Dutch websites, and to construct and analyze the resulting link graph.
</p>

<div class="post_title">Idea</div>
<ul>
<li>Process the Common Crawl dataset, keep only pages from NL, and extract their links</li>
<li>Transform the set of links into a graph structure</li>
<li>Calculate the in- and out-degree for all pages</li>
<li><i>Stretch goals: Calculate the diameter and the strongly connected components of the graph, and visualize the result</i></li>
</ul>
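<p>
The degree step above can be sketched in a few lines of Python before it is scaled up on the cluster. This is only a minimal illustration, not the project's method: it assumes that a page counts as Dutch when its host name ends in <code>.nl</code>, that an edge is kept only when both endpoints are Dutch, and that duplicate links count once. The names <code>is_dutch</code>, <code>degree_counts</code> and <code>sample_links</code> are hypothetical.
</p>

```python
from collections import Counter
from urllib.parse import urlsplit


def is_dutch(url):
    """Assumption: a page is 'Dutch' when its host name ends in .nl."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".nl")


def degree_counts(links):
    """Return (in_degree, out_degree) Counters over distinct Dutch-to-Dutch edges."""
    # A set removes duplicate links, so each (source, target) pair counts once.
    edges = {(src, dst) for src, dst in links if is_dutch(src) and is_dutch(dst)}
    out_degree = Counter(src for src, _ in edges)
    in_degree = Counter(dst for _, dst in edges)
    return in_degree, out_degree


# Hypothetical (source, target) pairs standing in for links extracted from the crawl.
sample_links = [
    ("http://a.nl/", "http://b.nl/"),
    ("http://a.nl/", "http://c.nl/page"),
    ("http://b.nl/", "http://c.nl/page"),
    ("http://a.nl/", "http://b.nl/"),             # duplicate link, counted once
    ("http://c.nl/page", "http://example.com/"),  # target is not Dutch, dropped
]

in_degree, out_degree = degree_counts(sample_links)
print(out_degree["http://a.nl/"], in_degree["http://c.nl/page"])
```

<p>
On the full data set the same logic would be expressed as a filter followed by two group-and-count operations, which is the shape of computation that Pig is designed for.
</p>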

<div class="post_title">Technology</div>
<p>
This analysis uses <a href="https://pig.apache.org/">Apache Pig</a>. SURFsara provides <a href="https://github.com/norvigaward/warcexamples">a library, examples and a tutorial</a> for reading the Common Crawl data with Pig.
</p>

<div class="post_title">Data</div>
<p>
The <a href="http://commoncrawl.org/">Common Crawl</a> data is stored on HDFS at <code>/data/public/common-crawl/crawl-data/CC-TEST-2014-10</code> (test set) and <code>/data/public/common-crawl/crawl-data/CC-MAIN-2014-10</code> (full set).
</p>

<div class="post_title">Literature</div>
<ul>
<li><a href="http://event.cwi.nl/lsde/papers/GraphStructureRevisited.pdf">Graph Structure in the Web — Revisited</a></li>
<li><a href="http://event.cwi.nl/lsde/papers/piglatin.pdf">Pig Latin: A Not-So-Foreign Language for Data Processing</a></li>
</ul>