Dataset. The Enron corporation was brought down by a massive accounting fraud scandal. As part of the court case, the email archives of the involved employees were made public. We have obtained a ~50 GB dataset containing a large number of messages.
E2: ENRON hierarchy. Derive Enron's organisational structure from email metadata (senders/receivers) and visualize the derived organization graph.
Summary: Because the ENRON dataset is only 1.4 GB without attachments, this project was carried out on a laptop; the visualization is done with D3.js. A new hierarchy-building algorithm is presented that combines (cosine) similarity between two persons, computed from the emails they exchange, with (different types of) graph centrality in the communication network. The algorithm is evaluated against a manually created ground truth (the actual known ENRON organizational structure) and compared to an earlier algorithm (Zhou, ICMLA 2005) that was proposed before this ground truth existed. The new approach performs better than the earlier algorithm, but it should be noted that recall is still rather low (only 14% of all pairwise hierarchy relationships are derived). The resulting (imperfect) hierarchy is shown below; it is clickable, and each person links to a visualization showing a force-directed graph of that person's six nearest neighbours.
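To make the ingredients named in the summary more concrete, here is a minimal Python sketch of cosine similarity between two persons' communication profiles, one simple centrality measure, and pairwise recall against a ground truth. It is not the project's actual algorithm: the summary does not say how similarity and centrality are combined, so the rule in `guess_superiors` (pick the most similar contact with strictly higher degree centrality) is purely an illustrative assumption, as are all function names and the toy data.

```python
import math
from collections import Counter, defaultdict


def build_profiles(emails):
    """Map each person to a Counter of messages exchanged with each contact.

    `emails` is an iterable of (sender, [receivers]) pairs (assumed format).
    """
    profiles = defaultdict(Counter)
    for sender, receivers in emails:
        for receiver in receivers:
            if receiver != sender:
                profiles[sender][receiver] += 1
                profiles[receiver][sender] += 1
    return profiles


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def degree_centrality(profiles):
    """Fraction of the other people each person has exchanged mail with."""
    n = len(profiles)
    if n < 2:
        return {person: 0.0 for person in profiles}
    return {person: len(contacts) / (n - 1) for person, contacts in profiles.items()}


def guess_superiors(profiles):
    """Hypothetical rule, NOT the project's method: a person's superior is the
    most similar contact whose degree centrality is strictly higher."""
    centrality = degree_centrality(profiles)
    superiors = {}
    for person, contacts in profiles.items():
        candidates = [c for c in contacts if centrality[c] > centrality[person]]
        if candidates:
            superiors[person] = max(
                candidates,
                key=lambda c: (cosine_similarity(profiles[person], profiles[c]), c),
            )
    return superiors


def pairwise_recall(predicted, truth):
    """Share of ground-truth (superior, subordinate) pairs that were derived."""
    if not truth:
        return 0.0
    return len(predicted & truth) / len(truth)


# Toy data with made-up names, for illustration only.
emails = [
    ("boss", ["a", "b", "c"]),
    ("a", ["boss", "b"]),
    ("b", ["boss"]),
    ("c", ["boss"]),
]
profiles = build_profiles(emails)
superiors = guess_superiors(profiles)
predicted_pairs = {(sup, sub) for sub, sup in superiors.items()}
truth_pairs = {("boss", "a"), ("boss", "b"), ("boss", "c"), ("boss", "d")}
recall = pairwise_recall(predicted_pairs, truth_pairs)
```

Recall computed this way counts only ground-truth pairs that were recovered, which is why a hierarchy can look plausible while still deriving a small share of the known relationships.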
Data curiosity: ***
Related work: ****
Technical difficulties mastered: ***
Visualization coolness: **