(This article originally appeared on the HathiTrust website: http://go.iu.edu/2aSh.)
HathiTrust Research Center (HTRC) has selected five projects for the fourth round of Advanced Collaborative Support (ACS) projects. The application pool this year was of especially high quality, and it was difficult to narrow to just five. The selected projects demonstrate a range of research approaches. They are likely to produce results that will benefit the larger HathiTrust and HTRC communities.
Awardees will be provided dedicated HTRC staff time to support their research using texts in the HathiTrust Digital Library for a period of up to one year. Proposals were reviewed on their feasibility, research methodology, and compatibility with HTRC staff resources, as well as the availability of requested data and potential to benefit a wide community of scholars.
The new projects are:
Building large-scale collections of genre fiction
Laure Thompson and David Mimno (Cornell)
This project will develop methods for automatically constructing large-scale collections of genre fiction from HathiTrust. Even, and especially, in digital libraries as large as HathiTrust, it can prove challenging to understand whether the library contains suitable representations of a chosen genre. The researchers plan to focus on collections of speculative fiction novels as a case study, but they intend their solutions to be generalizable. They will identify robust methods for correlating author-title pairs to matching volume sets in HathiTrust. Using these methods in conjunction with lists of novels that were curated by hand, they will build their collections and investigate which works are (over)represented and which are missing. They expect their project will enable scholars to better understand the suitability of studying genre fiction through HathiTrust and highlight underserved author and genre groups. Moreover, the project will result in collections of genre fiction which can be readily reused and reorganized for different lines of humanistic inquiry.
Mapping scientific names to the HathiTrust Digital Library
Matthew J. Yoder and Dmitry Mozzherin (University of Illinois)
This project will create an index of all the scientific names of the Earth’s species found within the HathiTrust corpus. The index, which will likely measure in the hundreds of millions to billions of entries, will consist of a simple link between the scientific name and the volume and page location of that name within HathiTrust. The index will assist in identifying volumes that may be medically relevant, for example by identifying all of the volumes containing the scientific name for the mosquito that carries illnesses such as Zika virus (‘Aedes aegypti’). The index will also allow volumes to be grouped into clusters based on which scientific names they contain to show which taxa (e.g., “mammals”) are most common. This team of researchers has completed similar work across the data of the Biodiversity Heritage Library. Their ACS project will allow them to do cross-corpora comparisons.
Supporting The Conglomerate Era Project
Dan Sinykin (Emory)
This project furthers the researcher’s investigation into how the conglomeration of the publishing industry changed literature. The results will be included in the researcher’s in-progress book titled The Conglomerate Era: A Computational History of Literature in the Age of the Agent. The project explores a set of publisher-based corpora to see whether there are distinctions in what is published by large publishing houses versus independent presses. It will make use of predictive modeling to further the researcher’s existing work to build a computational model of genre that aids in identifying latent patterns in the publishers’ editorial practices. The project will use methods such as genre detection through unsupervised modeling; stylistic differentiation through text classification and supervised learning via logistic regression; and social network analysis with metadata to determine latent literary connections, especially with regard to gender and race of the author.
Deriving Basic Illustration Metadata
Stephen Krewson (Yale)
This project aims to identify all pictorial elements in educational texts from 1800-1850 to explore the interplay between progressive education and print media in the early nineteenth century. The resulting research will characterize the extent to which wood engravings and other reprographic materials were shared among educational publishers. The researcher will extract specific features from page images, such as illustration location, using advances in machine learning. The project intends to use the process developed to identify pictorial elements to motivate a new metadata field that describes the location and type of illustrations on the page. An ultimate goal of the project is to move toward “machine-read” texts where the data generated by classifiers and dimensionality reduction techniques are bundled as metadata with the corresponding volumes and made available for future research. (“Machine-read” is a term borrowed from researcher Ben Schmidt.)
Semantic Phasor Embeddings
Molly Des Jardin, Scott Enderle, and Katie Rawson (University of Pennsylvania)
This project intends to explore a novel way of abstracting and representing textual data that could aid in new ways of discovering and deduplicating items in HathiTrust, detecting and analyzing genre, or analyzing narrative analogies. The project team will investigate the utility of a certain kind of mathematical representation of text documents, called semantic phasor embeddings, that combines a mathematical structure called phasors with data from standard word embeddings (strings of numbers that represent an item). If successful, the vectors could represent documents with a tunable degree of granularity, which could provide an opportunity to share vectors representing copyright-protected works without concerns about wholesale text reproduction. The vectors would also carry valuable information about the global ordinal structure of the volumes, so that the items could be queried, clustered, and visualized in a robust way that recognizes similarity not just in the content of the items, but also in their structure.