Cultural Text Mining

In the past decade, more and more data have become available in digital form, either because they were born digital or as a result of large-scale digitization efforts. Many of these data are of interest to humanities researchers. However, collections are often too large to analyze with traditional humanities methods ("close reading"). They can instead be mined automatically for interesting trends and interdependencies ("distant reading") using sophisticated text mining and natural language processing techniques. The aim is not to replace traditional humanities methods of analysis, but to get the best of both worlds by combining close and distant reading and providing a toolbox for the analysis of big data in the humanities.

While fairly shallow methods such as topic modeling have been readily adopted by many digital humanists, methods that allow a deep semantic analysis of text are still rarely used in the field. From a computational linguistics point of view, humanities data pose many challenges for standard NLP tools. Texts can come from a plethora of different domains, are often written in older language varieties, and may contain digitization errors. Typically, it is also unrealistic to annotate sufficient data for standard supervised machine learning methods. In this course, I will discuss the opportunities and challenges of "cultural text mining" and provide an overview of different techniques.


Caroline Sporleder is a professor of Digital Humanities at Trier University.