The Indiana University School of Medicine Center for Medical Genomics (CMG) and Research Technologies (RT) are partnering to give research groups seamless access to pipeline-processed sequencing data without the usual barriers of complicated cyberinfrastructure.
The Center for Medical Genomics (CMG) at Indiana University School of Medicine is a state-of-the-art technology center providing medical scientists with affordable access to high-quality high-throughput genomics services. Founded in 2000 by Dr. Howard Edenberg with initial support from the Indiana 21st Century Research and Technology Fund and additional funding from the Indiana Genomics Initiative (INGEN), CMG has been led by director Dr. Yunlong Liu since 2016. Under Dr. Liu’s leadership and with significant investment from the IU School of Medicine Precision Medicine program, the center significantly expanded its sequencing capacity to meet rapidly growing demand. On average, CMG supports 120 researchers every year spanning a wide spectrum of research areas including genetic disorders, oncology, neurodegenerative disorders, cellular and molecular metabolism, and diabetes and metabolic diseases.
In 2020, despite a 3-month shutdown owing to the pandemic, CMG received over 7,000 samples for quality assessment and processed/sequenced 4,324 samples. All of this translates to massive amounts of data (more than 100 TB of raw sequencing data every year), data processing challenges, and significant difficulties providing processed data back to the research community. In spring 2020, CMG entered a partnership with the RT Scalable Compute Archive (IU SCA) to develop an integrated system consisting of a secure archive, pipeline processing capabilities, and user-friendly data distribution functionality. The new service, CMG-SCA (data access restricted to CMG clients), has served close to 200 users from nearly 40 labs and research groups in its first 300 days of production operations.
Speaking about the cyberinfrastructure challenges faced by the CMG prior to this partnership, Dr. Yunlong Liu said, “We were completely overwhelmed by the amount of data generated by our mighty sequencers. Storing data and sharing it with users emerged as a major obstacle for our operation.” There were several hundred terabytes of data on the Slate file system associated with CMG. “The solutions of the SCA allows us to enhance our data management practices by streamlining the data processing, data dissemination, and data backup with just a few clicks. In addition, the users can also easily share their data with partners within the system, which enables collaboration,” added Dr. Liu. Maks Luthra, program manager at CMG, said, “The SCA team has developed a system for us that is not only efficient but also highly customized for our needs. We are very impressed with how the system has evolved with the engagement and support of the SCA team.”
The CMG-SCA automatically archives raw sequencing data coming off the sequencers at CMG onto the Scholarly Data Archive (SDA) tape library. Information about a run is immediately registered on the web portal so a CMG administrator can monitor data archival. An integrated pipeline converts the raw BCL files to FASTQ-formatted data products; an uploader function also allows the CMG lead bioinformatician to upload data products produced outside of the SCA system (for example, on the IU Big Red 3 supercomputer). Data products are grouped into projects, and each project is accessible to one or more users or groups.
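The project-based access described above can be illustrated with a minimal Python sketch. It is not the CMG-SCA implementation: the `Project` class, the `accessible_products` function, and all project, user, group, and file names are hypothetical, chosen only to show how grouping data products by project might determine what a given user can see.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Project:
    """Hypothetical project: a set of data products plus who may access them."""
    name: str
    data_products: Set[str] = field(default_factory=set)
    users: Set[str] = field(default_factory=set)
    groups: Set[str] = field(default_factory=set)


def accessible_products(
    projects: Iterable[Project], user: str, user_groups: Iterable[str]
) -> List[str]:
    """Return, sorted, every data product the user can reach via any project."""
    memberships = set(user_groups)
    visible: Set[str] = set()
    for project in projects:
        # Access comes either from a direct user grant or from group membership.
        if user in project.users or project.groups & memberships:
            visible |= project.data_products
    return sorted(visible)


# Illustrative data only; these names do not come from the real system.
projects = [
    Project("lab_a_rnaseq", {"run001.fastq.gz"}, users={"alice"}),
    Project("shared_exome", {"run002.fastq.gz"}, groups={"lab_b"}),
]

print(accessible_products(projects, "alice", ["lab_b"]))
print(accessible_products(projects, "bob", []))
```

In this sketch a user who holds a direct grant on one project and belongs to a group attached to another sees the data products of both, while a user with no grants sees nothing.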
“The large data volumes associated with sequencing data presented a unique challenge when compared to similar scientific data archives we’ve developed,” said Dr. Michael Young, software development lead in the SCA team. “Fortunately the cyberinfrastructure here at IU, including the SDA tape storage and Slate Project and DCWAN network storage, made transferring and processing these large datasets feasible.” Charles Brandt, CMG-SCA developer, added, “Simplifying complex workflows into a user friendly system is satisfying work. CMG has been very patient in explaining these processes and working with us to optimize the software to yield an ideal user experience.”
“It has been fantastic working with CMG. It’s been an opportunity to assist IU’s flagship sequencing core, optimize use of RT storage resources, and for Mike Young and myself to circle back to our roots where we worked together on a similar project (ODI-PPA) 9 years ago,” said Arvind Gopu, Manager of SCA.
Future plans include the development of additional pipelines which integrate production of BAM-formatted data products from the FASTQ files, migration to IU’s Carbonate supercomputer for data management and processing needs, and the potential for development of other genome sequence analysis routines.