Journalists, through their words, provide an invaluable service, sharing information about global events to which many of us would not otherwise have access. They send missives directly from event sites, recording what’s happening during protests, summits, speeches, and violent actions. For political scientists, these articles offer a rich mine of data about these events. Jill Irvine, Presidential Professor of International and Area Studies at the University of Oklahoma, Christan Grant, Assistant Professor in the School of Computer Science at the University of Oklahoma, Andrew Halterman, Ph.D. candidate in Political Science at MIT, and the rest of their team want to make this data accessible. Enter the Temporally Extended, Regular, Reproducible International Event Records (TERRIER) dataset, which extracts event data from roughly 300 million news articles and puts it into a form usable by researchers.
In determining the kinds of actions that qualify as an event, the team uses a framework established within political science: events consist of actors, actions, and targets. The process of coding events consists of three steps. First, software goes through every sentence in the corpus of text and parses the grammar of the sentence. A second piece of software finds “candidate spans” (i.e., words that look like they belong to an actor or action), then checks a dictionary to see if they match a known actor or action. If a match occurs, the software assigns each actor or action to a predefined category, such as “military” actors and “protest” actions. For instance, the term “George W. Bush,” during the timespan of his presidency, would result in the actor category of “USA government.” Importantly, the dictionaries are coded to account for the different ways people might describe actions such as protests, through words like “demonstrated,” “chanted,” or “carried placards.”
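The dictionary lookup described above can be illustrated with a minimal Python sketch. It is not the TERRIER team's actual software: the names `ActorEntry`, `ACTOR_DICTIONARY`, `ACTION_DICTIONARY`, `code_actor`, and `code_action` are hypothetical, the dictionaries are toy examples built only from the terms mentioned in the text, and the candidate spans are assumed to have already been found by the grammar-parsing step.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ActorEntry:
    """One dictionary entry: a category that applies within a date range."""
    category: str
    start: Optional[date] = None  # None means "no lower bound"
    end: Optional[date] = None    # None means "no upper bound"

    def active_on(self, day: date) -> bool:
        after_start = self.start is None or self.start <= day
        before_end = self.end is None or day <= self.end
        return after_start and before_end


# Toy dictionaries (assumed for illustration, not the project's real ones).
ACTOR_DICTIONARY = {
    # Dates are the start and end of the Bush presidency.
    "george w. bush": [
        ActorEntry("USA government", date(2001, 1, 20), date(2009, 1, 20)),
    ],
}

ACTION_DICTIONARY = {
    "demonstrated": "protest",
    "chanted": "protest",
    "carried placards": "protest",
}


def code_actor(span: str, day: date) -> Optional[str]:
    """Return the actor category for a candidate span on a given date."""
    for entry in ACTOR_DICTIONARY.get(span.lower(), []):
        if entry.active_on(day):
            return entry.category
    return None  # no match: the span is not a known actor on that date


def code_action(span: str) -> Optional[str]:
    """Return the action category for a candidate span, if it is known."""
    return ACTION_DICTIONARY.get(span.lower())


if __name__ == "__main__":
    print(code_actor("George W. Bush", date(2005, 6, 1)))
    print(code_action("carried placards"))
```

Attaching a date range to each actor entry is one simple way to capture the point that the same name maps to a category only during a particular timespan; several surface phrases mapping to one action category mirrors how the dictionaries account for different ways of describing a protest.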
The project not only produces data for researchers; it also aims to improve the tools available to other researchers to produce their own datasets. To accomplish this, Grant built several open-source tools to speed the natural language processing (NLP) of the large corpus of documents. The NLP step is the first and most time-intensive stage of the event coding process, but without the grammatical information it provides, the event coder can’t process the sentences. Once that step is complete, the events still need to be extracted and geolocated, which are also time-intensive tasks. The scale of the news corpus, hundreds of millions of documents, complicated the effort. According to Grant, the first projected timeline for extraction and geolocation on a single machine was many years; thus, it became clear that the team needed more resources. Enter Jetstream, which gave the team the scale, parallel computing, and storage capabilities it needed in order to expedite the process. Grant built a distributed container system; Jetstream provided the storage and structure to launch the pipeline and process these many documents. Grant speaks highly of the Jetstream team, calling them “responsive, helpful, and accommodating to researchers,” and a “lifesaver” for the project.
The TERRIER team’s work will serve future researchers in at least two ways: the complete event coding pipeline software is available to other researchers in NLP and political science, and the dataset is available to political scientists. Thus far, the data tool has been used to gain a better sense of the causes and dynamics related to conflict and levels of violence. Now that the process works, the team is excited to see what it will look like in application. They’re currently working on a paper on Syria investigating the relationship between the locations of the 2011 protests and where the government exerted violence once the civil war began. Researchers can potentially use protests reported in news media as a way of tracking prewar anti-regime mobilization and thereby measuring how mobilization affects later violence. More broadly, the data will contribute to the growing set of applied models using text-derived event data to provide early warning of civil conflict.