Sunday, July 4, 2010

Visualizing Email Communications using NodeXL

Email has become an integral part of communication in both the business and personal spheres. Given its centrality, it is surprising how few tools are generally available for analyzing it outside specialist areas such as Early Case Assessment tools for litigation, with Xobni being a notable exception at the individual level. However, the rise of social network analysis, and the tools that support it, may change that. Graph theory is remarkably neutral as to whether it is applied to Facebook Friend networks or email communications within a Sales and Marketing division.

In a previous post, we reported on using Gephi – an open source tool for graphing social networks – to visualize email communications. In this post, we look at using NodeXL for the same purpose. We used the same email data set as before – the ‘Godfather Sample’ – in which an original email data set was processed to extract the metadata (e.g. sender, recipient, date sent, subject) and subsequently anonymized using fictional names.

NodeXL is a free and open source template for Microsoft Excel 2007 that provides a range of basic network analysis and visualization features intended for use on modest-sized networks of several thousand nodes/vertices. It is targeted at non-programmers and builds upon the familiar concepts and features within Excel. Information about the network, e.g. node data and edge lists, is all contained within worksheets.


Data can be loaded simply by cutting and pasting an edge list from another Excel worksheet, but there is also a wide range of other options, including the ability to import network data from Twitter (Search and User networks), YouTube and Flickr, and from files in GraphML, Pajek and UCINET Full Matrix DL file formats. There is also an option to import directly from an Email PST file, which we will discuss in a following post. In addition to the basics of an edge list, attribute information can be associated with each edge and node. In our “Godfather” email sample, we added a weighting for communication strength (i.e. the number of emails between the two individuals) to each edge and the affiliation with the Corleone family to each node.
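To make the idea of a weighted edge list concrete, here is a minimal Python sketch of how one might be prepared outside Excel before pasting it into the worksheet. The sample records are invented for illustration and are not taken from our data set; we also assume that emails in either direction between two people count towards the same edge.

```python
from collections import Counter

# Hypothetical, anonymized metadata records: one (sender, recipient) per email.
emails = [
    ("Vito Corleone", "Connie"),
    ("Connie", "Vito Corleone"),
    ("Vito Corleone", "Hyman Roth"),
    ("Vito Corleone", "Connie"),
]

def weighted_edge_list(messages):
    """Count emails per unordered pair, so A->B and B->A share one edge."""
    counts = Counter()
    for sender, recipient in messages:
        if sender == recipient:
            continue  # ignore self-addressed mail
        pair = tuple(sorted((sender, recipient)))
        counts[pair] += 1
    # One row per edge: vertex 1, vertex 2, weight (number of emails).
    return [(a, b, weight) for (a, b), weight in sorted(counts.items())]

edges = weighted_edge_list(emails)
for row in edges:
    # Tab-separated rows paste cleanly into adjacent spreadsheet columns.
    print(*row, sep="\t")
```

If the direction of communication matters, the `sorted` call can be dropped so that each (sender, recipient) ordering keeps its own count.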

Once an edge list has been added, the vertex/node list is automatically created and a variety of graphical representations can be produced, depending on the layout option selected and on how data attributes are mapped to the visual properties of nodes and edges. Fruchterman-Reingold is the default layout, but Harel-Koren Fast Multiscale as well as Grid, Polar, Sugiyama and Sine Wave options are also available. For example, in the graph shown below, nodes were color coded and sized with respect to the individual’s connections with the Corleone family: blue for Corleone family members, green for Corleone allies, orange for Corleone enemies and pink for individuals with no known associations with the family.
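The two steps described above, deriving the vertex list from the edge list and mapping an attribute to a visual property, can be sketched in a few lines of Python. This is an illustration of the idea rather than of NodeXL's internals; the edge rows and affiliation labels are hypothetical, and only the color scheme comes from our graph.

```python
# Hypothetical edge list rows: (vertex 1, vertex 2, weight).
edges = [
    ("Vito Corleone", "Connie", 12),
    ("Vito Corleone", "Hyman Roth", 5),
    ("Ritchie Martin", "Connie", 7),
]

# Hypothetical affiliation attribute; anyone missing is treated as "none".
affiliation = {
    "Vito Corleone": "family",
    "Connie": "family",
    "Hyman Roth": "ally",
}

# Color scheme used in the post.
COLORS = {"family": "blue", "ally": "green", "enemy": "orange", "none": "pink"}

def vertex_list(edge_rows):
    """Collect each distinct endpoint once, in order of first appearance."""
    seen = {}
    for a, b, _weight in edge_rows:
        seen.setdefault(a, None)
        seen.setdefault(b, None)
    return list(seen)

vertices = vertex_list(edges)
vertex_colors = {v: COLORS[affiliation.get(v, "none")] for v in vertices}

for name, color in vertex_colors.items():
    print(f"{name}: {color}")
```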



The width of the edges/links was then set to vary in relation to the degree of communication between the two nodes i.e. the number of emails sent between the two individuals concerned.
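One simple way to turn email counts into edge widths is a linear rescaling, sketched below in Python. The width range of 1 to 10 is an assumption chosen for illustration, and NodeXL's own mapping options may behave differently.

```python
def edge_widths(weights, min_width=1.0, max_width=10.0):
    """Linearly rescale email counts to the [min_width, max_width] range."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        # All edges carry the same traffic, so draw them all at minimum width.
        return [min_width for _ in weights]
    scale = (max_width - min_width) / (hi - lo)
    return [min_width + (w - lo) * scale for w in weights]

# Hypothetical email counts for three edges.
widths = edge_widths([1, 10, 100])
print(widths)
```

With heavily skewed counts, a logarithmic scale may give a more readable picture than a linear one, since a few very busy links would otherwise dwarf the rest.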


Labels can be added to both nodes and links showing either information about the node/link or its attributes, as required.






Different graph layout options are available, which may be used to generate alternative perspectives and/or easier-to-view graphs.

Harel-Koren Layout


Circle Layout


Because even a small network can generate a complex, dense graph, NodeXL has a wide range of options for filtering and hiding parts of the graph, the better to elucidate others. The visibility of an edge/vertex, for example, can be linked to a particular attribute, e.g. degree of closeness. We found the dynamic filters particularly useful for rapidly focusing on areas of interest without altering the properties of the graph itself. For example, in the following screenshot we are showing only those links where the number of emails between the parties is greater than 40. This allows us to focus on individuals who have been emailing each other more frequently than the average.
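The effect of such a filter is easy to reproduce outside NodeXL. The Python sketch below keeps only edges above a threshold and leaves the underlying edge list untouched, which mirrors the non-destructive behaviour we liked in the dynamic filters. The edge rows are hypothetical.

```python
def filter_edges(edge_rows, min_emails):
    """Return only edges whose weight exceeds min_emails; the input is unchanged."""
    return [row for row in edge_rows if row[2] > min_emails]

# Hypothetical edge rows: (vertex 1, vertex 2, number of emails).
edges = [("A", "B", 55), ("A", "C", 40), ("B", "C", 12)]

visible = filter_edges(edges, 40)
# Only vertices that still have at least one visible edge need to be drawn.
visible_vertices = sorted({v for a, b, _ in visible for v in (a, b)})

print(visible)
print(visible_vertices)
```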


In addition to graphical display, NodeXL can be used to calculate key network metrics, including: Degree (the number of links on a node, reflecting the number of relationships an individual has with other members of the network), with In-Degree and Out-Degree options for directed graphs; Betweenness Centrality (the extent to which a node lies on the paths between other nodes in the network, reflecting how many people an individual connects indirectly); Closeness Centrality (a measure of how near a node is to all other nodes in the network, reflecting the ability of an individual to access information through the "grapevine" of network members); and Eigenvector Centrality (a measure of the importance of an individual in the network that takes into account the importance of the individuals they are connected to). In an analysis of email communications, these can be used to identify the degree of connectedness between individuals and their relative importance in the communication flow.
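For readers who want to see what the simpler metrics involve, here is a small Python sketch of Degree, In-Degree, Out-Degree and Closeness Centrality on a toy directed edge list. The edges are hypothetical, and closeness is computed using one common definition (the inverse of the average shortest-path distance to reachable nodes, ignoring direction); the exact formula NodeXL uses may differ.

```python
from collections import defaultdict, deque

# Hypothetical directed edges: (sender, recipient).
edges = [
    ("Vito", "Connie"),
    ("Connie", "Vito"),
    ("Vito", "Hyman"),
    ("Hyman", "Ritchie"),
]

def degree_metrics(directed_edges):
    """Return (in_degree, out_degree, neighbours) for a directed edge list."""
    in_degree, out_degree = defaultdict(int), defaultdict(int)
    neighbours = defaultdict(set)  # undirected view: who is linked to whom
    for sender, recipient in directed_edges:
        out_degree[sender] += 1
        in_degree[recipient] += 1
        neighbours[sender].add(recipient)
        neighbours[recipient].add(sender)
    return dict(in_degree), dict(out_degree), dict(neighbours)

def closeness(neighbours, start):
    """Inverse of the average shortest-path distance from start (undirected)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:  # breadth-first search gives shortest hop counts
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

in_degree, out_degree, neighbours = degree_metrics(edges)
degree = {person: len(linked) for person, linked in neighbours.items()}

print(degree)
print(closeness(neighbours, "Vito"))
```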

For example, in our Godfather sample, we have sized the nodes in the graph below by their Degree Centrality. While Vito Corleone is, as expected, shown to be highly connected, Ritchie Martin, an individual not thought to have business associations with the Corleone family, is shown to be more connected than supposed.

Nodes Sized by Degree Centrality


When we look at the same data from the perspective of betweenness, we see that Vito, Connie and Ritchie all have a high degree of indirect connections.

Nodes Sized by Betweenness Centrality
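Betweenness is the least intuitive of these metrics to compute by hand. The Python sketch below uses Brandes' algorithm, a standard method for unweighted graphs; we have not checked which algorithm NodeXL itself uses, and the small network is hypothetical. Each node's score is the number of shortest paths between other pairs of nodes that pass through it.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph.

    adj maps each node to the set of its neighbours.
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        # and recording each node's predecessors on those paths.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}

# Hypothetical chain: Connie - Vito - Hyman - Ritchie.
network = {
    "Connie": {"Vito"},
    "Vito": {"Connie", "Hyman"},
    "Hyman": {"Vito", "Ritchie"},
    "Ritchie": {"Hyman"},
}
scores = betweenness(network)
print(scores)
```

In this chain the two people in the middle each sit on two of the shortest paths between other pairs, while the two at the ends sit on none.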


And the Eigenvector Centrality measure confirms Vito Corleone's significance in the network, as well as that of Connie, two "allies" (Hyman Roth and Salvatore Tessio) and two individuals, including Ritchie Martin.

Nodes Sized by Eigenvector Centrality
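Eigenvector Centrality rewards a node for being connected to other well-connected nodes. A common way to approximate it is power iteration, sketched below in Python on a hypothetical network. This is a generic textbook approach, not necessarily the routine NodeXL uses; adding each node's own score at every step is a standard trick that stabilises the iteration without changing the final ranking.

```python
import math

def eigenvector_centrality(adj, iterations=100, tol=1e-9):
    """Approximate eigenvector centrality by power iteration.

    adj maps each node to the set of its neighbours (undirected graph).
    """
    score = {v: 1.0 for v in adj}
    for _ in range(iterations):
        # New score = own score + sum of neighbours' scores.
        new = {v: score[v] + sum(score[w] for w in adj[v]) for v in adj}
        # Normalise so the scores stay bounded between iterations.
        norm = math.sqrt(sum(x * x for x in new.values()))
        new = {v: x / norm for v, x in new.items()}
        if max(abs(new[v] - score[v]) for v in adj) < tol:
            return new
        score = new
    return score

# Hypothetical network in which Vito is linked to everyone else.
network = {
    "Vito": {"Connie", "Hyman", "Ritchie"},
    "Connie": {"Vito"},
    "Hyman": {"Vito"},
    "Ritchie": {"Vito"},
}
centrality = eigenvector_centrality(network)
for name, value in sorted(centrality.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.3f}")
```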


Last but not least, it is also possible to use NodeXL to visualize clusters of nodes to show or identify subgroups within a network. Clusters can be added manually or generated automatically. Manually creating clusters requires first assigning nodes to an attribute or group membership and then determining the color and shape of the nodes for each subgroup/cluster. In our Godfather example, we used “Family” affiliation to create clusters within the network, but equally one could use organization/company, country, language, date etc.
"Family Affiliation" Clusters Coded by Node Color

Selected Cluster (Corleone Affiliates)
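Manual clustering of this kind amounts to grouping nodes by an attribute value and giving each group its own appearance. A minimal Python sketch follows; the attribute values, the "Unknown" label and the color palette are all hypothetical choices made for illustration.

```python
from collections import defaultdict

def clusters_by_attribute(attributes, default="Unknown"):
    """Group node names by attribute value; missing values go to `default`."""
    groups = defaultdict(list)
    for node, value in attributes.items():
        groups[value or default].append(node)
    return {label: sorted(members) for label, members in groups.items()}

# Hypothetical "Family" affiliation for each node.
family = {
    "Vito": "Corleone family",
    "Connie": "Corleone family",
    "Hyman": "Ally",
    "Ritchie": None,
}

clusters = clusters_by_attribute(family)

# Give each cluster a color from a small, arbitrary palette.
palette = ["blue", "green", "orange", "pink"]
cluster_colors = {label: palette[i % len(palette)]
                  for i, label in enumerate(sorted(clusters))}

for label, members in clusters.items():
    print(label, cluster_colors[label], members)
```

The same function works unchanged for any other attribute, such as company or country, since it only looks at the values supplied.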

NodeXL will also generate clusters automatically using a clustering algorithm developed specifically for large scale social network analysis which works by aggregating closely interconnected groups of nodes. The results for the Godfather sample are shown below. We did not find the automated clustering helpful but this is probably a reflection of the relatively small size of the sample.

In the next post, we will look at importing email data directly into NodeXL and compare approaches based on analyzing processed vs unprocessed email data.

Larger Email Network Visualization

To download NodeXL, go to http://nodexl.codeplex.com//. We would also recommend working through the NodeXL tutorial, which can be downloaded from: http://casci.umd.edu/images/4/46/NodeXL_tutorial_draft.pdf


A high-level overview of social network analysis and the basic concepts behind graph metrics can be found on Wikipedia, e.g. http://en.wikipedia.org/wiki/Social_network and http://en.wikipedia.org/wiki/Betweenness_centrality#Eigenvector_centrality
