New TGen technology reduces storage needs and costs for genomic data

07.07.2010
G-SQZ provides scientists with compact format for genomic data processing

A new computer data compression technique called Genomic SQueeZ (G-SQZ), developed by the Translational Genomics Research Institute (TGen), will allow genetic researchers and others to store, analyze and share massive volumes of data in less space and at lower cost.

The encoding method underlying G-SQZ, which was created specifically for genomic sequencing data, and the use of its software are described in a paper published today in the journal Bioinformatics.

Tests show that G-SQZ can compress data by as much as 80 percent while maintaining the relative order of the data and allowing for selective content access. This could save researchers and others millions of dollars worldwide.

TGen plans to make the G-SQZ program freely available for research and academic use, and to explore commercial opportunities in genomic data storage and processing. TGen has filed a patent application for the G-SQZ technology.

"Data storage and processing costs are becoming a large factor in research planning as high-throughput genomic sequencing studies continue to generate increasing amounts of data. G-SQZ has the potential to save individual institutes hundreds of thousands of dollars per year in storage costs," said Dr. Waibhav Tembe, the paper's lead author and TGen's Senior Computational Scientist, who led the development of the G-SQZ algorithm and its software.

Enormous computing power is required to conduct today's cutting-edge analysis of large volumes of genomic sequencing data. This data is critical in studying the genes that are a part of the 3-billion-letter DNA sequence, the entire genome of one person. Such analysis is enabling researchers to identify those genomic components that either prevent or contribute to diseases, such as cancer, diabetes and Alzheimer's, and to discover treatments tailored to individual patients that can prolong their lives and improve their quality of life.

Today's genomic sequence analysis requires analyzing terabytes of data. Large sequencing centers are planning or have installed petabyte-scale storage. One terabyte is about 1 trillion bytes of data. One petabyte is 1,000 terabytes.

Benefits shared with other institutes

Dr. Edward Suh, TGen's Chief Information Officer, described G-SQZ as a significant breakthrough in storing and analyzing ever-increasing volumes of genomic sequencing data.

"As a non-profit research institute dedicated to advancing science for the public good, we at TGen are proud to be able to share aspects of this technology with other non-profit research institutes, especially in these times of tightened budgets," said Dr. Suh, who also is a Senior Investigator at TGen and co-author of the paper.

James Lowey, TGen's Director of High-Performance Biocomputing and the third co-author of the paper, said reducing storage costs for genomic technology has the potential to eventually lead to a chain reaction of lower health costs for medical institutions and, ultimately, for patients.

"When you reduce the need for storage, you also are reducing your overhead costs, such as electricity and space, and that can save money," Lowey said.

The software is available for download from http://public.tgen.org/sqz.

Technology springs from Next-Gen research

Dr. Tembe's motivation for G-SQZ came from the challenges involved in storing, processing, parsing and transferring enormous volumes of Next-Generation Sequencing data, which is stored primarily in plain text formats.

"Generating this data is one thing. It is quite another to store, query and manage it in an efficient manner, minimizing data-analysis bottlenecks and expediting the discovery process," Dr. Tembe said.

The G-SQZ approach is a novel application of Huffman coding of information, an idea first developed in the 1950s, which uses shorter codes for the most frequently occurring pieces of information.
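To illustrate the general idea of Huffman coding, the following is a minimal Python sketch of the textbook algorithm. It is not the G-SQZ implementation; the function names (huffman_codes, encode, decode) and the sample read are assumptions made for illustration only.

```python
import heapq
from collections import Counter


def huffman_codes(symbols):
    """Return a dict mapping each distinct symbol to a prefix-free bit string."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; their codes grow by one bit.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(symbols, codes):
    """Concatenate the code of every symbol into one bit string."""
    return "".join(codes[s] for s in symbols)


def decode(bits, codes):
    """Recover the symbol list; works because no code is a prefix of another."""
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


# Hypothetical read in which "A" is far more common than "T".
read = "AAAAAAAACCCGGT"
codes = huffman_codes(read)
bits = encode(read, codes)
```

In this sample, the frequent base "A" receives a shorter code than the rare base "T", so the encoded read needs fewer bits than a fixed two-bits-per-base layout.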

Dr. Tembe's solution is specific to genomic sequencing data. In addition to analyzing the frequency of the ACGT letters that make up DNA, G-SQZ also can encode the annotation information, including the data's quality, as well as erroneous entries, such as unidentified bases.
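One way to picture this is to treat each base together with its quality score as a single symbol, so that an unidentified base is simply one more symbol in the frequency table. The release does not spell out the exact scheme, so the Python sketch below is an assumption for illustration: the function names, the sample read and the use of "N" with the quality character "!" for an unidentified base are all hypothetical.

```python
import math
from collections import Counter


def pair_symbols(bases, qualities):
    """Pair every base with its quality character, e.g. ("A", "I")."""
    if len(bases) != len(qualities):
        raise ValueError("bases and qualities must have the same length")
    return list(zip(bases, qualities))


def entropy_bits_per_symbol(symbols):
    """Shannon entropy of the symbol distribution, in bits per symbol.

    This is a lower bound on the average code length that any
    symbol-by-symbol code, including a Huffman code, can reach.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical read: "N" marks an unidentified base with the lowest quality.
bases = "ACGTNACGTA"
qualities = "IIII!IIIII"

pairs = pair_symbols(bases, qualities)
frequencies = Counter(pairs)
bound = entropy_bits_per_symbol(pairs)
```

A plain text file spends one 8-bit character on each base and another on each quality value. The frequency table over pairs shows how skewed the real distribution is, which is what a frequency-based code can exploit.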

The indexing system used in G-SQZ allows access at regular intervals, such as every millionth data point, so all the information need not be decoded from the start.
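The idea of interval-based indexing can be sketched in Python as follows. This is only an illustration under stated assumptions, not the G-SQZ file format: the fixed prefix code in CODES, the function names and the interval of 16 symbols are hypothetical, and bits are kept in a text string for readability rather than packed into bytes.

```python
# Hypothetical prefix-free code table; a real encoder would derive it from data.
CODES = {"A": "0", "C": "10", "G": "110", "T": "1110", "N": "1111"}
DECODE = {code: symbol for symbol, code in CODES.items()}


def encode_with_index(seq, interval):
    """Encode seq and record the bit offset of every interval-th symbol."""
    parts, index, offset = [], [], 0
    for i, symbol in enumerate(seq):
        if i % interval == 0:
            index.append(offset)  # where symbol number i starts in the bit stream
        code = CODES[symbol]
        parts.append(code)
        offset += len(code)
    return "".join(parts), index


def decode_from(bits, index, interval, start, count):
    """Decode count symbols beginning at symbol position start.

    Decoding begins at the nearest index point at or before start,
    so the bits in front of that point are never read.
    """
    block = start // interval
    pos = index[block]
    skip = start - block * interval  # symbols to discard after the index point
    out, current = [], ""
    while pos < len(bits) and len(out) < skip + count:
        current += bits[pos]
        pos += 1
        if current in DECODE:
            out.append(DECODE[current])
            current = ""
    return "".join(out[skip:])


seq = "ACGTNACGTTGCAAN" * 10
bits, index = encode_with_index(seq, interval=16)
window = decode_from(bits, index, interval=16, start=40, count=10)
```

Here the request for symbols 40 through 49 starts decoding at the index point for symbol 32 rather than at the beginning of the stream. A larger interval means a smaller index but more symbols to skip per lookup.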

"It's not enough to compress the information. The compressed representation should allow quick retrieval and querying," Dr. Tembe said. "To that end, G-SQZ has been designed as an efficient practical approach, rather than a theoretically optimal compression algorithm."

Even faster advancements on the horizon

Dr. Tembe is moving ahead with improving his current design to accommodate what he calls "parallel computing."

Because G-SQZ compression keeps the data ordered and indexed, the squeezed data can be split into smaller "chunks," allowing multiple computer processors to decode and analyze different parts of the same file simultaneously, he said. For example, if a file is indexed at 1,000 places, it can be fed into a supercomputer, allowing 1,000 processors to analyze the data at the same time, speeding up the results. Analysis tools using parallel programming approaches can take advantage of the G-SQZ encoding format.
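The chunked approach described above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the G-SQZ parallel implementation, which the release says is still being tested: the code table, the function names and the chunk size are hypothetical. Threads are used to keep the example short; CPU-bound decoding in Python would more likely use separate processes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical prefix-free code table shared by encoder and decoder.
CODES = {"A": "0", "C": "10", "G": "110", "T": "1110", "N": "1111"}
DECODE = {code: symbol for symbol, code in CODES.items()}


def encode_chunks(seq, chunk_size):
    """Encode seq and return the bit string plus (start, end) bit bounds per chunk."""
    parts, bounds, offset = [], [], 0
    for start in range(0, len(seq), chunk_size):
        chunk_bits = "".join(CODES[s] for s in seq[start:start + chunk_size])
        parts.append(chunk_bits)
        bounds.append((offset, offset + len(chunk_bits)))
        offset += len(chunk_bits)
    return "".join(parts), bounds


def decode_bits(bits):
    """Decode one self-contained chunk of bits back into bases."""
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in DECODE:
            out.append(DECODE[current])
            current = ""
    return "".join(out)


def parallel_base_counts(bits, bounds, workers=4):
    """Decode every chunk independently and merge the per-chunk base counts."""
    def count_chunk(bound):
        lo, hi = bound
        return Counter(decode_bits(bits[lo:hi]))

    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_chunk, bounds):
            total.update(partial)
    return total


seq = "ACGTNACGTTGCAAN" * 100
bits, bounds = encode_chunks(seq, chunk_size=100)
counts = parallel_base_counts(bits, bounds)
```

Because each chunk starts at a known bit offset, no worker depends on another worker's output, and the per-chunk results can be merged in any order.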

"While indexed and compressed representation is ready, the parallel computing functionality is undergoing a testing phase," Dr. Tembe said. "But this is where it is headed. Sequencing hundreds of billions of bases per run is now a reality. The real impact of G-SQZ lies in the storage, transfer and processing of genomic sequencing data, where substantial room for improvement still exists."

About TGen

The Translational Genomics Research Institute (TGen) is a Phoenix, Arizona-based non-profit organization dedicated to conducting groundbreaking research with life-changing results. Research at TGen is focused on helping patients with diseases such as cancer, neurological disorders and diabetes. TGen is on the cutting edge of translational research where investigators are able to unravel the genetic components of common and complex diseases. Working with collaborators in the scientific and medical communities, TGen believes it can make a substantial contribution to the efficiency and effectiveness of the translational process. TGen is affiliated with the Van Andel Research Institute in Grand Rapids, Michigan. For more information, visit: www.tgen.org.

Press Contact:
Steve Yozwiak
TGen Senior Science Writer
602-343-8704
syozwiak@tgen.org

Steve Yozwiak | EurekAlert!
Further information:
http://www.tgen.org
