
Exploring gene expression through high performance computing


Research and discovery High performance systems Apr 14, 2020

Organisms are made up of thousands of genes, but not all genes are turned on (that is, expressed) at the same time. Instead, gene expression is controlled, or ‘regulated’, by a complement of molecular signals. Transcriptional profiling captures gene expression in a particular cell or tissue type at a specific time, reflecting the spatial and temporal functions of genes. A starfish, for instance, can turn on specific genes in order to regenerate a lost limb, and a male clownfish can alter his gene expression to become female when group hierarchy demands it.

R. Taylor Raborn, Arizona State University

R. Taylor Raborn, now a Research Scientist at Arizona State University’s Biodesign Institute, is interested in the genomic signals (sequences known as promoters) that function like dimmer switches to turn genes on and off. To determine where these promoters are, Raborn begins by experimentally locating the transcription start sites (TSSs) of individual genes. Each TSS provides a specific landmark for where the promoter must lie.

Common methods for mapping TSSs are time-consuming, expensive, and complex. As a postdoctoral fellow in Volker Brendel’s group here at Indiana University, Raborn collaborated with Prof. Gabriel Zentner and his student Robert Policastro to devise a new, faster, more straightforward, reproducible method of locating TSSs.

Called STRIPE-seq (an acronym that coincidentally alludes to Hoosier basketball teams’ iconic warm-up pants), their method streamlines the process, shortening the timeline from five days to roughly five hours. The method itself is experimental, but once a researcher generates and sequences a STRIPE-seq library, the computational work begins. Even with the improvements made possible by STRIPE-seq, the variability of TSS distributions at promoters and the presence of noise in datasets can make it difficult to identify gene promoters without sophisticated computational approaches.

This wouldn’t have been possible without the great resources that UITS provided for us. I don’t know if there’s any high performance computing center in the country that is as easy to work with and that does as much for the people that work in the university as UITS. ... It really democratizes the use of resources.

R. Taylor Raborn

While TSS data were being generated by labs around the world, Raborn noticed that there was no software designed to analyze all of these forms of TSS data in a systematic, reproducible way. Raborn and Brendel therefore created a software package called TSRchitect to process these data and locate TSSs and promoters. TSRchitect works with STRIPE-seq data as well as a variety of other TSS profiling data types, including CAGE and RAMPAGE, and can process even large amounts of data on resources like Indiana University’s high-memory supercomputer Carbonate. In addition to locating promoters, TSRchitect imports annotation files, allowing the user to associate identified promoters with genes and other genomic features. TSRchitect is available on the Bioconductor repository of bioinformatics software.
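To give a rough sense of what locating promoters from noisy TSS data can involve, here is a minimal Python sketch. It is not TSRchitect's algorithm or API (TSRchitect is an R/Bioconductor package); it only illustrates one simple idea under stated assumptions: discard positions supported by very few reads, then merge nearby TSSs into candidate promoter regions. The function name cluster_tss, the thresholds min_count and max_gap, and the example coordinates are all hypothetical.

```python
from collections import Counter


def cluster_tss(positions, min_count=2, max_gap=20):
    """Group TSS positions (one strand, one chromosome) into candidate regions.

    positions: integer coordinates, one per sequenced read 5' end.
    min_count: positions with fewer supporting reads are treated as noise
               (hypothetical threshold, chosen only for illustration).
    max_gap:   retained TSSs at most this many bases apart are merged into
               the same region (hypothetical value).
    """
    counts = Counter(positions)
    # Drop weakly supported positions, then walk the rest in genomic order.
    kept = sorted(p for p, n in counts.items() if n >= min_count)

    regions = []
    for pos in kept:
        if regions and pos - regions[-1]["end"] <= max_gap:
            # Close enough to the previous TSS: extend the current region.
            regions[-1]["end"] = pos
            regions[-1]["reads"] += counts[pos]
        else:
            # Too far away: start a new candidate region.
            regions.append({"start": pos, "end": pos, "reads": counts[pos]})
    return regions


# Made-up read positions: two clusters plus one isolated read at 250.
reads = [100, 100, 101, 101, 101, 105, 105, 250, 400, 400, 402, 402, 402]
regions = cluster_tss(reads)
for region in regions:
    print(region)
```

With these made-up inputs, the single read at position 250 is filtered out as noise and the remaining positions collapse into two regions. Real datasets need more careful treatment of strand, sequencing depth, and replicate agreement than this sketch attempts.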

The team built a Singularity container to host TSRchitect and their upstream analysis pipeline, called GoSTRIPES. The container provides a computing environment that bundles the software and all its dependencies, so it can be deployed and run on Carbonate (or any other sufficiently powerful Linux machine). By disseminating their software and workflow in this way, they dramatically lowered the barrier to entry for other researchers who want to use their methods. Anyone can use the container without spending time installing new software. The authors want to make it as easy as possible for other labs to use their software, a big win for labs in which time and resources are at a premium.
