7 — Semantic Text Analytics

Sub-project 7: Semantic Text Analytics and e-Humanities Research

PhD candidate: Tom Kenter


This sub-project investigates semantic text analytics for e-humanities. It will develop, test and apply algorithms that support interactive search and exploration of the textual data collections that form the primary sources for the research in sub-projects 1 through 6. The main focus of this sub-project is to develop methods for mining attitudes and experiences around key issues in the textual data. We seek to organize information around entities: we identify entities in textual sources, find the contexts in which they are discussed, and profile them in terms of issues and related entities. These findings will then be used to mine attitudes and experiences towards key issues.


This sub-project will be informed by studies of the way humanities scholars handle information. These studies identify four phases of scholarly information handling: exploration, search, analysis (with contextualization), and aggregation. These existing models will be validated against the actual research practices of scholars within the project, monitored via several workshops and hands-on sessions (see workflow table); the insights gained will inform the algorithmic developments that form the core of the sub-project.


The first algorithmic phase of the sub-project is devoted to entities. Studies have shown (Bron et al., 2011a) that entities play a key role in all phases of the information seeking process of humanities scholars. With corpora spanning many decades, two issues are key: entity normalization (‘which real-world entity does this expression refer to?’) and entity tracking (‘how does the information around an entity evolve over time?’). The aim is to develop semi-supervised methods for entity normalization, using the robust supervised methods developed by Meij et al. (2012) as our starting point.
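To make the normalization task concrete, the following is a minimal sketch of entity linking by candidate generation plus context matching. The knowledge base, entity names, and the context-overlap scoring rule are all invented for illustration; a real system would draw candidates from a large knowledge base and replace the overlap score with a supervised ranker as in Meij et al. (2012).

```python
# Hypothetical toy knowledge base: canonical entity -> known surface forms
# and a few context words typical of that entity. All entries are invented.
KB = {
    "Willem Drees (politician)": {
        "surface": {"drees", "willem drees"},
        "context": {"minister", "cabinet", "pension", "labour"},
    },
    "Drees (village)": {
        "surface": {"drees"},
        "context": {"village", "province", "river"},
    },
}

def normalize(mention, context_words):
    """Link a surface mention to the KB entry whose surface forms match
    and whose context profile overlaps most with the mention's context."""
    candidates = [e for e, v in KB.items() if mention.lower() in v["surface"]]
    if not candidates:
        return None
    # Score candidates by context-word overlap (a crude stand-in for a
    # supervised ranking model).
    return max(candidates,
               key=lambda e: len(KB[e]["context"] & set(context_words)))

print(normalize("Drees", ["the", "minister", "announced", "a", "pension"]))
# → Willem Drees (politician)
```

The same mention resolves to the village entry when the surrounding words point that way, which is exactly the ambiguity that decades-spanning corpora make pervasive.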


In the next phase, entities will be profiled—brief summaries will be generated of the main issues associated with an entity. When a fixed set of issues is available for tracking, this task will apply existing (and high-performing) solutions developed by the University of Amsterdam. If a closed set of issues is not available, the profiling task lends itself to a graph-based approach, where, iteratively, nodes representing entities and key issues are identified and the strength of their associations is re-estimated; the algorithms proposed in Jijkoun et al. (2010) are our starting point.
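The iterative graph-based idea can be sketched as follows. Starting from raw entity–issue co-occurrence counts, association strengths and issue weights are alternately re-estimated until they stabilize. The data, the damping-free update rule, and the normalization scheme are illustrative assumptions, not the algorithm of Jijkoun et al. (2010).

```python
import itertools

# Invented entity -> {issue: co-occurrence count} data for illustration.
cooc = {
    "Shell": {"oil crisis": 8, "employment": 2},
    "government": {"oil crisis": 5, "employment": 7},
}

def profile(cooc, iterations=10):
    """Iteratively re-estimate entity-issue association strengths."""
    issues = set(itertools.chain.from_iterable(cooc.values()))
    weight = {i: 1.0 for i in issues}          # current issue importance
    assoc = {e: dict(c) for e, c in cooc.items()}
    for _ in range(iterations):
        # Re-estimate entity->issue associations, weighted by current
        # issue importance, normalized per entity.
        for e, c in cooc.items():
            total = sum(cnt * weight[i] for i, cnt in c.items())
            assoc[e] = {i: cnt * weight[i] / total for i, cnt in c.items()}
        # Re-estimate issue importance from the associations.
        raw = {i: sum(a.get(i, 0.0) for a in assoc.values()) for i in issues}
        norm = sum(raw.values())
        weight = {i: r / norm for i, r in raw.items()}
    return assoc

for entity, strengths in profile(cooc).items():
    print(entity, max(strengths, key=strengths.get))
```

An entity profile is then simply its strongest-associated issues, which can be rendered as the brief summary the phase calls for.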


While the focus of the project is a corpus of historical newspapers provided by the National Library of the Netherlands, none of the news items in this corpus exists in isolation; external sources may provide valuable insights into the reception of issues. In its third algorithmic phase, this sub-project will study and develop contextualization algorithms that provide cross-links between articles in various corpora that discuss the same events or developments. Starting from recent work carried out at the University of Amsterdam (Bron et al., 2011b) as a baseline, this sub-project will bring a strong semantic focus to the task and use semantic features (entities, relations, events) to improve the precision of the baseline approach. Building on existing and ongoing working collaborations, sources that will be included for contextualization are the collections of the Netherlands Institute of Sound and Vision (Beeld en Geluid), the Dutch Photography Museum (Nederlands Fotomuseum), Wikipedia, and domain-specific periodicals pertaining to the various sub-projects.
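One simple way such semantic features can drive cross-linking is to connect articles whose extracted entity sets overlap strongly. The sketch below uses Jaccard similarity over invented entity sets; the article identifiers, entities, and threshold are illustrative assumptions, and the baseline of Bron et al. (2011b) would contribute the non-semantic signal this is meant to refine.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Invented examples: document id -> set of normalized entities.
newspaper = {"art-1": {"Willem Drees", "AOW", "The Hague"},
             "art-2": {"Shell", "Rotterdam"}}
archive   = {"clip-9": {"Willem Drees", "AOW", "pension reform"},
             "clip-3": {"Elfstedentocht", "Friesland"}}

def cross_links(src, tgt, threshold=0.3):
    """Yield (source_id, target_id, score) for sufficiently similar pairs."""
    for sid, s_ents in src.items():
        for tid, t_ents in tgt.items():
            score = jaccard(s_ents, t_ents)
            if score >= threshold:
                yield sid, tid, round(score, 2)

print(list(cross_links(newspaper, archive)))
# → [('art-1', 'clip-9', 0.5)]
```

Relations and events extracted from the text would slot into the same scheme as additional feature sets, tightening precision beyond what surface-word overlap allows.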


The most challenging and promising phase of this project will be to develop methods to find attitudes and experiences around the entities and related issues identified in the previous phases. The fourth algorithmic phase of the sub-project will use a bootstrapping method to generate a topic-specific attitude and experience lexicon from a general-purpose polarity lexicon. This dedicated lexicon will consist of pairs of a syntactic clue and a target: the target is an aspect of the entity or issue at hand, and the syntactic clue expresses an attitude or experience. This will yield a groundbreaking tool for humanities scholars to focus on the perception of a given entity or issue. The research challenge in this phase will be to generate the lexicon with as little interaction and human supervision as possible.
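The bootstrapping loop can be sketched in a few lines: trusted clues nominate targets, and targets in turn admit new clues, for a fixed number of rounds. The seed words, the observed (clue, target) pairs, and the admission rule are all invented for illustration; a real implementation would score candidates before admitting them rather than accept every co-occurrence.

```python
from collections import defaultdict

seed_clues = {"praised", "condemned"}          # general-purpose polarity seeds
# Invented (syntactic clue, target aspect) observations from a corpus.
observations = [
    ("praised", "pension plan"), ("welcomed", "pension plan"),
    ("condemned", "strike"), ("denounced", "strike"),
    ("welcomed", "reform"), ("scheduled", "meeting"),
]

def bootstrap(seed, pairs, rounds=3):
    """Grow a clue set from seeds via shared targets; return clue -> targets."""
    clues = set(seed)
    for _ in range(rounds):
        # Targets currently explained by at least one trusted clue.
        targets = {t for c, t in pairs if c in clues}
        # Admit any new clue observed with a trusted target.
        new = {c for c, t in pairs if t in targets} - clues
        if not new:
            break
        clues |= new
    lexicon = defaultdict(set)
    for c, t in pairs:
        if c in clues:
            lexicon[c].add(t)
    return dict(lexicon)

print(sorted(bootstrap(seed_clues, observations)))
# → ['condemned', 'denounced', 'praised', 'welcomed']
```

Note that "scheduled" is never admitted because "meeting" is not reached from the seeds; keeping such neutral clues out with minimal supervision is precisely the research challenge this phase identifies.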


The technical foundation for the development will be provided by xTAS (formerly Fietstas), a text analytics platform developed at the University of Amsterdam that has been applied and tested in computational humanities projects such as Political Mashup, WAHSP, BILAND, DutchSemCor, Bridge, Seed and Infiniti. All algorithms explored in the four phases of this sub-project will be available to researchers involved in the other sub-projects from the very start of the project, and will be developed and refined into generic, user-ready tools for the humanities on the basis of hands-on experience and feedback.