Hackfest 2013: Text Mining and Visualization for Literary History

[This is the second in a series of posts on the TM & V hackfests. See the first at Stefan Sinclair’s blog.]

At the beginning of May, as those of us from Northern Alberta were at last starting to recover from our final late-April dump of snow, a group of intrepid scholars, literary researchers, and programmers gathered in sunny southern Ontario to participate in a Hackfest focused on data mining and visualization for literary history. Many of us were affiliated in various ways with the Orlando, CWRC, INKE, or Voyant Tools projects. In addition to enjoying the beautiful weather and pristine locale, we ate some fantastic food, played pool, snooker, and frisbee, chatted about our absent children, and built some great visualization tool prototypes.

At the beginning of the ‘fest, we split into three groups. While we all drew on the Orlando Project’s extensive textbase of XML-encoded literary history, we worked separately on: 1) machine learning with WEKA and named entity extraction; 2) generating RDF and visualizing triples from Orlando; and 3) producing non-RDF visualizations of the Orlando Project’s textbase (which Stefan Sinclair has written about here).

I was personally involved with the RDF visualization team, which included John Simpson, Nicole Mardis, and Mark Turcato, with Susan Brown present when we needed someone to bounce ideas off of. Our team goal was to come up with an interactive prototype tool that would map relationships between individuals and groups in a web-deliverable format. In the past, I’ve worked a lot with a similar network visualization tool called OrlandoVision (which you can read more about here), which maps the connections between individuals and groups in a conventional node-edge graph, where people are nodes and the edges that connect them are tags. OrlandoVision is an interactive tool that maps the entire Orlando textbase (over 1,300 biocritical entries in a 129 MB XML file with over 1.8 million tags!), but it can be filtered by keywords, authors, dates, and tags so that the graph stays readable and reflects the user’s research interests. While we had a great deal of positive feedback on the tool, it cannot run on the web, which has made it hard to deliver to the everyday student or literary researcher. Our goal was thus to come up with a tool similar to OrlandoVision, incorporating the extensive feedback we’ve received, but easily accessible and usable from the web. Our team decided to go with RDF for this particular tool because it is the emerging web standard and because of its native graph-based format.

I personally spent much of my time building regular expressions that accurately extracted organizational and textual relationships from the Orlando dataset. While we had done a lot of pre-work extracting other types of relationships (particularly interpersonal relationships), when we showed up at the Hackfest, we realized that organizational and textual relationships were not represented in our extracted data to the degree that they were present in the Orlando textbase. This was an important part of the process because we really wanted to be able to represent organizations and texts as nodes, not just individuals. In all the feedback we received about OrlandoVision, one of the key changes that students and literary scholars wanted was the ability to represent organizations as nodes, with individuals as the relational edges, and the possibility of representing texts as nodes, with other types of intertextual edges. As a result of that feedback, this was one of the main features we wanted to incorporate in our RDF visualization tool. Once we had extracted accurate data, our team spent a fair bit of time modifying the tool to convert RDF to JSON, which was then used to produce force-directed graphs on the screen. The graphs we produced represented people as nodes, and also organizations as nodes (with individuals as the edges), and we made sure to leave open the possibility of having texts as nodes in the future, with other types of intertextual edges.
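To make the RDF-to-JSON step concrete, here is a minimal sketch of how triples might be reshaped so that organizations become nodes and the individuals who link them become edges. It is not the code we wrote at Hackfest: the triples, the `memberOf` predicate, and the `triples_to_graph` helper are all hypothetical, and the nodes/links layout is simply an assumed shape of the kind force-directed layout libraries commonly consume.

```python
import json
from collections import defaultdict
from itertools import combinations

# Hypothetical (person, predicate, organization) triples. The names and the
# "memberOf" predicate are illustrative placeholders, not Orlando's vocabulary.
TRIPLES = [
    ("Person A", "memberOf", "Society X"),
    ("Person A", "memberOf", "Society Y"),
    ("Person B", "memberOf", "Society Y"),
    ("Person B", "memberOf", "Society Z"),
]


def triples_to_graph(triples, predicate="memberOf"):
    """Build a nodes/links dict: organizations are nodes, shared members are edges."""
    # Group the organizations each person belongs to.
    orgs_by_person = defaultdict(set)
    for subject, pred, obj in triples:
        if pred == predicate:
            orgs_by_person[subject].add(obj)

    # One node per organization, sorted so that indices are stable.
    org_names = sorted({org for orgs in orgs_by_person.values() for org in orgs})
    index = {name: i for i, name in enumerate(org_names)}
    nodes = [{"id": name, "type": "organization"} for name in org_names]

    # A person who belongs to two organizations becomes an edge between them.
    links = []
    for person in sorted(orgs_by_person):
        for a, b in combinations(sorted(orgs_by_person[person]), 2):
            links.append({"source": index[a], "target": index[b], "label": person})

    return {"nodes": nodes, "links": links}


graph = triples_to_graph(TRIPLES)
graph_json = json.dumps(graph, indent=2)  # JSON text to hand to the browser
```

Swapping the roles (people as nodes, shared organizations or tags as edges) would only require grouping the triples by object instead of by subject.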

[Image: screenshot, 2013-07-08 at 15.55.22]

The above image is an initial representation of organizations as nodes with humans as the relational edges. These all have hoverticks for grabbing info about links and nodes. Although we weren’t able to build this capacity in the time we had at Hackfest, in the future, this particular tool will support zoom features, animation over time, and Google Maps views of the geographic locations associated with places. Selecting a node and having it call a function, or hovering over a node to call a function, is also supported. Hovering over a node would cover all edges to support degrees of difference.

[Image: screenshot, 2013-07-08 at 15.54.57]

In this second image, individuals are the nodes, and links are the connections between individuals, each represented by a particular tag and coloured according to that tag. We used the obvious male/female attributes to colour certain nodes in red and blue, so the genders of authors with entries are easy to pinpoint. Some links are coloured yellow, indicating family and social relationships. We made sure it would be easy to colour by a wide variety of criteria, with buttons to switch between the colouring criteria you would like, such as a blue scale or a date range. Weighting of nodes is a step for the future.
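The switchable colouring described above can be sketched as a lookup table of colouring functions, one per criterion. This is only an illustration under stated assumptions: the node fields (`gender`, `birth_year`), the choice of which gender maps to red or blue, and the blue-scale formula are all hypothetical, not what the prototype actually does.

```python
# Hypothetical node records; the field names are illustrative assumptions.
NODES = [
    {"id": "Author A", "gender": "female", "birth_year": 1800},
    {"id": "Author B", "gender": "male", "birth_year": 1850},
    {"id": "Author C", "gender": None, "birth_year": 1900},
]

# Which gender gets which colour is an assumption made for this sketch.
GENDER_COLOURS = {"female": "red", "male": "blue"}


def colour_by_gender(node, nodes):
    """Red or blue by gender; grey when the attribute is missing."""
    return GENDER_COLOURS.get(node["gender"], "grey")


def colour_by_date(node, nodes):
    """Blue scale: earlier birth years are lighter, later ones darker."""
    years = [n["birth_year"] for n in nodes]
    low, high = min(years), max(years)
    fraction = 0.0 if high == low else (node["birth_year"] - low) / (high - low)
    shade = round(200 * (1 - fraction))  # red/green channels fade as years increase
    return f"#{shade:02x}{shade:02x}ff"


# Each button in the interface would select one key from this table.
CRITERIA = {"gender": colour_by_gender, "date": colour_by_date}


def colour_nodes(nodes, criterion):
    """Return a mapping of node id to colour for the chosen criterion."""
    colour = CRITERIA[criterion]
    return {n["id"]: colour(n, nodes) for n in nodes}


by_gender = colour_nodes(NODES, "gender")
by_date = colour_nodes(NODES, "date")
```

Adding a new colouring criterion then means writing one more function and registering it in `CRITERIA`, without touching the drawing code.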

It was exciting to get to the end of three days of hard, intense work and actually have something to show for it! While we have much work to do going forward, Hackfest was an excellent venue for getting started. Working in a digital age has meant that often it is not necessary to collaborate in one physical location, but when it is possible, these in-person meetings have the potential to kick-start great collaborative work.