Big data

As we gather vast numbers of genome sequences, new approaches are transforming information into insight
In 2003, medical research changed forever. This was the year the Human Genome Project concluded, having determined, after US$3 billion and 13 years, the first entire sequence of human DNA. From there, the sequencing cost per genome began a 15-year free fall to around US$1,000 today. Meanwhile, the number of genomes sequenced has been equally dizzying in its ascent.

When Garvan acquired the Illumina HiSeq X Ten Sequencing System in 2014, it enabled us to move to the leading edge of medical science and the consequent transformation of healthcare. The clinical and research value of whole genome sequencing has since been amply demonstrated, such as in diagnosing rare genetic disease, guiding personalised treatment in cancer and enabling cohort studies to compare genomes en masse. Indeed, personal genome sequences will soon become an intrinsic part of medical research and health management.

While this has major implications for scientific discovery and the economy, it also poses an increasingly common challenge: as the sequences accumulate, so do the computing demands for analysing genomic data and its relationship to clinical information.

“Historically in medical research, generating the data has always been the bottleneck. But now science is changing,” says Dr Warren Kaplan, Chief of Informatics at Garvan’s Kinghorn Centre for Clinical Genomics. “We’re able to generate data much faster, and progressively the work is happening on the analytical side of things rather than in the labs. If we’re going to continue to excel as a research institute we have to grow the pedigree of our data engineers and data.”

Big data describes the deluge of information that is a hallmark of our times. But how big is big? Let’s start with something small: one byte of storage, about the space needed to hold a single character of text. If we imagine a byte as a grain of sand, a kilobyte would be a pinch, a megabyte would make a small turret on a sandcastle,
while a gigabyte would be a whole sandcastle. Increasingly, however, the go-to unit for data is the petabyte, which satisfies the proverbial ‘long walk on the beach’. The Garvan Institute’s genomics program has a footprint of about three and a half petabytes at the National Computational Infrastructure facility in Canberra, which includes more than 15,000 whole genome sequences and grows by 50 genomes a day. Yet this is only a fraction of an unimaginably huge universe of data across the globe, which expands by 2.5 quintillion bytes every day, driven chiefly by the daily activities of casual data-makers the world over through search engines, social media, telecommunications, navigation and many other means. It is, in Dr Kaplan’s words, “an almost infinite scale of computation”. (A rough calculation putting these figures side by side appears at the end of this section.)

The six billion bases in a genome sequence thus become six billion data points. Such detail allows for ongoing analysis as knowledge and needs evolve. “It’s almost exactly like the internet,” says Dr Kaplan. “No one wants to know everything about the whole internet at any given time. Today you might want to understand how Jamie Oliver makes a certain recipe, but by next Wednesday you want to know how a new model of car compares to the next one. Where all this sequencing data becomes really valuable is in enabling us to do Google-like queries across the genome, and we can start asking questions that are unprecedented.”

The ultimate potential of such information is only realised when it is integrated with parallel data sets, says Garvan’s Executive Director, Professor John Mattick: “The genetic data is not very useful without the clinical information.” Professor Mattick describes a multifaceted data landscape as an ‘ecology’ in which “we will have thousands, and soon
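To give a concrete sense of these numbers, here is a minimal back-of-the-envelope sketch in Python based only on the figures quoted above (about 3.5 petabytes holding more than 15,000 whole genomes, growth of 50 genomes a day, and 2.5 quintillion bytes generated worldwide daily). The per-genome average it derives is an illustrative approximation, not a measured value.

# Back-of-the-envelope arithmetic using only the figures quoted in this article.
# The per-genome average is an illustrative approximation, not a measured value.

PETABYTE = 10**15                      # bytes in a petabyte (decimal convention)

garvan_store_bytes = 3.5 * PETABYTE    # genomics footprint at NCI in Canberra
genomes_stored = 15_000                # whole genome sequences held
genomes_per_day = 50                   # daily growth quoted above
global_bytes_per_day = 2.5 * 10**18    # 2.5 quintillion bytes per day worldwide

# Rough average storage per whole genome (sequence plus associated files)
bytes_per_genome = garvan_store_bytes / genomes_stored
print(f"~{bytes_per_genome / 10**9:.0f} GB per genome on average")

# Daily growth of the genomics store at 50 genomes a day
daily_growth_bytes = genomes_per_day * bytes_per_genome
print(f"~{daily_growth_bytes / 10**12:.1f} TB added per day")

# That growth as a fraction of the data the world produces each day
print(f"~{daily_growth_bytes / global_bytes_per_day:.1e} of global daily data")

On these figures, each genome averages a little over 200 gigabytes of stored data, and the roughly 12 terabytes added each day is still only a few millionths of what the world generates daily.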