Text Mining and Visualization for Digital Literary History

This research project investigates how literary historical analysis can be radically extended by text mining and visualization, using the experimental Orlando Project as our test bed. The proposed research is at the crossroads of several fields, including literary criticism, textual history, digital text encoding, computer-assisted text analysis, visualization interfaces, data mining, and high-performance computing.

Orlando: Women’s Writing in the British Isles from the Beginnings to the Present is recognized as the most extensive and detailed resource in its field and as a model for innovative scholarly resources. Composed of 1,200 critical biographies plus contextual and bibliographical materials, it is extensively encoded using an interpretive Extensible Markup Language (XML) tagset with more than 250 tags covering everything from cultural influences and relations with publishers to the use of genre or dialect. The content and the markup together provide a unique representation of a complex set of interrelations among people, texts, and contexts. These interrelations and their development through time are at the heart of literary inquiry. Because the relations are embedded in the markup, and hence processable by computer, they offer the opportunity to develop new forms of inquiry into, and representations of, literary history.
The extensive tagging makes Orlando a unique resource for experimentation with data mining, machine learning, and visualization techniques to investigate the impact that interpretive markup has on the data mining and the visualization of results. Building on preliminary work on text mining and the use of high-performance computing for literary scholarship, we will test and extend existing techniques to develop new approaches to literary history using computers.
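As a rough illustration of what "processable by computer" can mean here, the following Python sketch contrasts what a program can recover from an interpretively tagged passage with what it sees once the markup is stripped. The element and attribute names (entry, publisher_relation, influence, name, author) and the sample text are invented for this example and are not taken from the Orlando tagset; the sketch only assumes that interpretive tags wrap spans of prose and carry a normalized name.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: every tag, attribute, and name below is invented
# for illustration and is NOT drawn from the Orlando tagset or its content.
SAMPLE = """
<entry author="Author A">
  <p>Her first novel was issued by <publisher_relation name="Publisher X">a
  London house</publisher_relation> and drew on
  <influence name="Author B">an earlier writer's work</influence>.</p>
</entry>
""".strip()


def extract_relations(xml_text):
    """Return (author, relation_type, target) triples from tagged spans."""
    root = ET.fromstring(xml_text)
    author = root.get("author")
    triples = []
    # Walk every element in document order; any element carrying a
    # "name" attribute is treated as an interpretive relation tag.
    for elem in root.iter():
        target = elem.get("name")
        if target is not None:
            triples.append((author, elem.tag, target))
    return triples


def strip_markup(xml_text):
    """Return only the prose, as an unencoded baseline would see it."""
    root = ET.fromstring(xml_text)
    # Join all text nodes, then collapse runs of whitespace.
    return " ".join("".join(root.itertext()).split())


relations = extract_relations(SAMPLE)
plain = strip_markup(SAMPLE)

for triple in relations:
    print(triple)
print(plain)
```

In this toy case the tagged version yields explicit relations between named entities, while the stripped prose mentions only "a London house" and "an earlier writer's work", so the same links would have to be inferred. Comparing mining results across such inputs is the kind of contrast the second research question below addresses.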
This research project will ask:

1) What methods and algorithms are most appropriate to mining collections of scholarly texts for literary historical data?

2) How does interpretive XML encoding influence the outcomes of text mining and machine learning when compared with input from text with no encoding or only structural encoding?

3) What forms of visualization will be most useful to literary scholars using text mining tools?
We will design and test prototypes based on this inquiry. We hypothesize that

1) data mining can help identify patterns, sequences, and connections of interest to literary historians;

2) semantic markup significantly enhances the results of such mining; and

3) such methods require new kinds of interfaces and visualization.

By engaging in this multi-pronged inquiry in a coordinated fashion, this interdisciplinary team will provide novel insights into the value and potential of the mining and visualization of interpretive encoding for literary historical inquiry.
This research has the potential to significantly advance our understanding of how to develop innovative tools and interfaces for online resources, not only for scholars but also for a general public that is increasingly reliant on online tools to help them make sense of an avalanche of digital text.

PI: Susan Brown
Co-applicants: Michael Bauer, Denilson Barbosa, Isobel Grundy, Geoffrey Rockwell, Stan Ruecker, Stéfan Sinclair