Customizable, automated data management for research

High performance systems Jan 8, 2020

Dimitar Nikolov, Principal Research Software Engineer

As part of Research Technologies at Indiana University, data experts Dimitar Nikolov and Esen Tuna developed a customizable framework for research data management based on open-source tools.

Citing the rise of computational and data-intensive methods in academic labs, the team built a framework for automating the transfer of data between storage systems and for archiving metadata. The framework uses Apache Airflow, an open-source tool originally developed at Airbnb, to execute regularly repeating data management and analysis tasks. Airflow enables the specification of data policies throughout the full lifecycle of data: from the moment it is collected or generated, through many steps of processing, to its archival or deposit in a repository. Airflow can schedule complex sets of dependent tasks while scaling to the volume of data and the number of people involved in its handling. Its scheduling and task execution work behind the scenes, allowing researchers to use their established workflows without interruption. Airflow can be deployed centrally for multiple units or per lab or department, and can adapt to the variety of computational resources that are part of everyday lab research. For example, a research group might use cloud storage, high-performance storage, and document sharing services, and adopt conventions that enable the automated policies defined in Airflow to work alongside researchers.
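Conceptually, Airflow models a pipeline as a directed acyclic graph (DAG) of tasks and runs each task only after the tasks it depends on have completed. A minimal standard-library sketch of that idea follows; the task names are hypothetical lifecycle stages, and this illustrates the dependency-ordered execution concept rather than the Airflow API itself.

```python
from graphlib import TopologicalSorter

# Hypothetical lifecycle tasks: collect -> process -> archive -> deposit.
# Each entry maps a task to the set of tasks it depends on,
# analogous to declaring upstream dependencies in an Airflow DAG.
dependencies = {
    "process": {"collect"},
    "archive": {"process"},
    "deposit_to_repository": {"archive"},
}

def run(task_name):
    """Stand-in for real work (a transfer, a checksum, a tagging step)."""
    return f"ran {task_name}"

# Resolve a valid execution order and run each task in turn,
# mimicking what a workflow scheduler does on each scheduled interval.
order = list(TopologicalSorter(dependencies).static_order())
results = [run(task) for task in order]
print(order)
```

In Airflow proper, the same structure is declared once, and the scheduler then triggers the whole graph on a recurring schedule (say, nightly), retrying or halting downstream tasks when an upstream step fails.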

Esen Tuna, Manager, Research Data Services

Nikolov and Tuna’s work on metadata management focuses on the Scholarly Data Archive (SDA). Metadata, that is, tags or descriptors attached to data, can be specified for an individual file or for an entire archive, and is useful for searching through the data intelligently. Even so, scholarly work often omits metadata from the archiving process. Nikolov and Tuna’s archival system can also ensure data is archived within scheduled windows (daily or weekly, for instance) and updates the SDA only with new material rather than re-uploading the entire dataset. To integrate data management into existing workflows, Nikolov and Tuna diagnose each lab's archiving needs and configure the archiving process around the tools researchers already use, such as command-line scripts and Python programs.
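The incremental-update idea, archiving only files that are new or changed since the last scheduled run, can be sketched with the standard library. This is an illustrative assumption about how such a step might work, not the team's actual implementation; the real transfer to the SDA is not shown.

```python
import hashlib
import json
from pathlib import Path

def file_digest(path: Path) -> str:
    """Content hash used to detect changes between archive runs."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_files(data_dir: Path, manifest_path: Path) -> list[Path]:
    """Return only files that are new or modified since the last run.

    A manifest of content hashes from the previous run is compared
    against the current state of the directory, so a scheduled task
    hands the transfer step a delta instead of the whole dataset.
    """
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new = {
        str(p): file_digest(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }
    manifest_path.write_text(json.dumps(new, indent=2))
    return [Path(p) for p, digest in new.items() if old.get(p) != digest]
```

A daily or weekly scheduled task could call `changed_files` and pass the result to the upload step; the same manifest could also carry per-file metadata tags so descriptors are archived alongside the data.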

Because Apache Airflow is maintained and updated by the open-source community, integrating it into an academic environment saves individual projects from developing their own archiving and data management tools. Still, there are limits to this strategy, particularly for projects with larger teams and higher data volumes, where inconsistencies, omissions, and duplication in the storage of data become concerns. For such projects, the team proposes shared electronic lab notebooks (ELNs) or science gateways. Adopting either solution, however, requires changes to existing workflows and carries a risk of lock-in.
