Genome Sequencing Technology and Algorithms
For a listing of related Artech House titles, turn to the back of this book.
Genome Sequencing Technology and Algorithms

Sun Kim, Haixu Tang, and Elaine R. Mardis, Editors
artechhouse.com
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the U.S. Library of Congress.

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library.
Cover design by Igor Valdman
ISBN 13: 978-1-59693-094-0
© 2008 ARTECH HOUSE, INC. 685 Canton Street Norwood, MA 02062 All rights reserved. Printed and bound in the United States of America. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher. All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Artech House cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.
10 9 8 7 6 5 4 3 2 1
List of Contributors

Chapter 1: Elaine R. Mardis, Washington University
Chapter 2: Baback Gharizadeh, Roxana Jalili, and Mostafa Ronaghi, Stanford University
Chapter 3: David Okou and Michael E. Zwick, Emory University School of Medicine
Chapter 4: Jay Shendure, University of Washington; Gregory J. Porreca and George M. Church, Harvard Medical School
Chapter 5: Lewis J. Frey and Joyce A. Mitchell, University of Utah; Victor Maojo, Universidad Politecnica de Madrid
Chapter 6: Sun Kim and Haixu Tang, Indiana University
Chapter 7: Sun Kim and Haixu Tang, Indiana University
Chapter 8: Jiacheng Chen and Steven Skiena, Stony Brook University
Chapter 9: Haixu Tang and Sun Kim, Indiana University
Chapter 10: Paola Bonizzoni and Gianluca Della Vedova, Università di Milano-Bicocca; Riccardo Dondi, University of Bergamo; Jing Li, Case Western Reserve University
Chapter 11: Benjamin J. Raphael, Brown University; Stas Volik and Colin C. Collins, University of California at San Francisco
Chapter 12: Curt Balch and Kenneth P. Nephew, Indiana University; Tim H.-M. Huang, The Ohio State University
Chapter 13: Aleksandar Milosavljevic and Cristian Coarfa, Baylor College of Medicine
Contents

Part I  The New DNA Sequencing Technology

1  An Overview of New DNA Sequencing Technology
  1.1  An Overview
    1.1.1  Background
    1.1.2  Rationale for Technology Development Toward Massively Parallel Scale DNA Sequencing
    1.1.3  Goals of Massively Parallel Sequencing Approaches
  1.2  Massively Parallel Sequencing by Synthesis: Pyrosequencing
    1.2.1  Principle of the Method
    1.2.2  Pyrosequencing in a Microtiter Plate Format
    1.2.3  The 454 GS-20 Sequencer
    1.2.4  Novel Applications Enabled by Massively Parallel Pyrosequencing
  1.3  Massively Parallel Sequencing by Other Approaches
    1.3.1  Sequencing by Synthesis with Reversible Terminators
    1.3.2  Ligation-Based Sequencing
    1.3.3  Sequencing by Hybridization
  1.4  Survey of Future Massively Parallel Sequencing Methods
    1.4.1  Sequencing Within a Zero-Mode Waveguide
    1.4.2  Nanopore Sequencing Approaches
  References

2  Array-Based Pyrosequencing Technology
  2.1  Introduction
  2.2  Pyrosequencing Chemistry
  2.3  Array-Based Pyrosequencing
  2.4  454 Sequencing Chemistry
  2.5  Applications of 454 Sequencing Technology
    2.5.1  Whole-Genome Sequencing
    2.5.2  Ultrabroad Sequencing
    2.5.3  Ultradeep Amplicon Sequencing
  2.6  Advantages and Challenges
  2.7  Future of Pyrosequencing
  References

3  The Role of Resequencing Arrays in Revolutionizing DNA Sequencing
  3.1  Introduction
  3.2  DNA Sequencing by Hybridization with Resequencing Arrays
  3.3  Resequencing Array Experimental Protocols
  3.4  Analyzing Resequencing Array Data with ABACUS
  3.5  Review of RA Applications
    3.5.1  Human Resequencing
    3.5.2  Mitochondrial DNA Resequencing
    3.5.3  Microbial Pathogen Resequencing
  3.6  Further Challenges
  References

4  Polony Sequencing
  4.1  Introduction
  4.2  Overview
  4.3  Construction of Sequencing Libraries
  4.4  Template Amplification with Emulsion PCR
  4.5  Sequencing
  4.6  Future Directions
  References

5  Genome Sequencing: A Complex Path to Personalized Medicine
  5.1  Introduction
  5.2  Personalized Medicine
  5.3  Heterogeneous Data Sources
  5.4  Information Modeling
  5.5  Ontologies and Terminologies
  5.6  Applications
  5.7  Conclusion
  References

Part II  Genome Sequencing and Fragment Assembly

6  Overview of Genome Assembly Techniques
  6.1  Genome Sequencing by Shotgun-Sequencing Strategy
    6.1.1  A Procedure for Whole-Genome Shotgun (WGS) Sequencing
  6.2  Trimming Vector and Low-Quality Sequences
    6.2.1  The Trimming Vector and Low-Quality Sequences Problem
  6.3  Fragment Assembly
    6.3.1  The Fragment Assembly Problem
  6.4  Assembly Validation
    6.4.1  The Assembly Validation Problem
  6.5  Scaffold Generation
    6.5.1  The Scaffold Generation Problem
    6.5.2  Bambus
    6.5.3  GigAssembler
  6.6  Finishing
  6.7  Three Strategies for Whole-Genome Sequencing
  6.8  Discussion
    6.8.1  A Thought on an Exploratory Genome Sequencing Framework
  Acknowledgments
  References

7  Fragment Assembly Algorithms
  7.1  TIGR Assembler
    7.1.1  Merging Fragments with Assemblies
    7.1.2  Building a Consensus Sequence
    7.1.3  Handling Repetitive Sequences
  7.2  Phrap
  7.3  CAP3
    7.3.1  Automatic Clipping of 5' and 3' Poor-Quality Regions
    7.3.2  Computation and Evaluation of Overlaps
    7.3.3  Use of Mate-Pair Constraints in Construction of Contigs
  7.4  Celera Assembler
    7.4.1  Kececioglu and Myers Approach
    7.4.2  The Design Principle of the Celera Whole-Genome Assembler
    7.4.3  Overlapper
    7.4.4  Unitigger
    7.4.5  Scaffolder
  7.5  Arachne
    7.5.1  Contig Assembly
    7.5.2  Detecting Repeat Contigs and Repeat Supercontigs
  7.6  EULER
    7.6.1  Idury-Waterman Algorithm
    7.6.2  An Overview of EULER
    7.6.3  Error Correction and Data Corruption
    7.6.4  Eulerian Superpath
    7.6.5  Use of Mate-Pair Information
  7.7  Other Approaches to Fragment Assembly
    7.7.1  A Genetic Algorithm Approach
    7.7.2  A Structured Pattern-Matching Approach
  7.8  Incompleteness of the Survey
  Acknowledgments
  References

8  Assembly for Double-Ended Short-Read Sequencing Technologies
  8.1  Introduction
  8.2  Short-Read Sequencing Technologies
  8.3  Assembly for Short-Read Sequencing
    8.3.1  Algorithmic Methods
    8.3.2  Simulation Results
  8.4  Developing a Short-Read-Pair Assembler
    8.4.1  Analysis
  References

Part III  Beyond Conventional Genome Sequencing

9  Genome Characterization in the Post–Human Genome Project Era
  9.1  Genome Resequencing and Comparative Assembly
  9.2  Genotyping Versus Haplotyping
  9.3  Large-Scale Genome Variations
  9.4  Epigenomics: Genetic Variations Beyond Genome Sequences
  9.5  Conclusion
  References

10  The Haplotyping Problem: An Overview of Computational Models and Solutions
  10.1  Introduction
  10.2  Preliminary Definitions
  10.3  Inferring Haplotypes in a Population
    10.3.1  The Inference Problem: A General Rule
    10.3.2  The Pure Parsimony Haplotyping Problem
    10.3.3  The Inference Problem by the Coalescent Model
    10.3.4  Xor-Genotyping
    10.3.5  Incomplete Data
  10.4  Inferring Haplotypes in Pedigrees
  10.5  Inferring Haplotypes from Fragments
  10.6  A Glimpse over Statistical Methods
  10.7  Discussion
  Acknowledgments
  References

11  Analysis of Genomic Alterations in Cancer
  11.1  Introduction
    11.1.1  Measurement of Copy Number Changes by Array Hybridization
    11.1.2  Measurement of Genome Rearrangements by End Sequence Profiling
  11.2  Analysis of ESP Data
  11.3  Combination of Techniques
  11.4  Future Directions
  References

12  High-Throughput Assessments of Epigenomics in Human Disease
  12.1  Introduction
  12.2  Epigenetic Phenomena That Regulate Gene Expression
    12.2.1  Methylation of Deoxycytosine
    12.2.2  Histone Modifications and Nucleosome Remodeling
    12.2.3  Small Inhibitory RNA Molecules
  12.3  Epigenetics and Disease
    12.3.1  Epigenetics and Developmental and Neurological Diseases
    12.3.2  Epigenetics and Cancer
  12.4  High-Throughput Analyses of Epigenetic Phenomena
    12.4.1  Gel-Based Approaches
    12.4.2  Microarrays
    12.4.3  Cloning/Sequencing
    12.4.4  Mass Spectrometry
  12.5  Conclusions
  Acknowledgments
  References

13  Comparative Sequencing, Assembly, and Anchoring
  13.1  Comparing an Assembled Genome with Another Assembled Genome
  13.2  Mutual Comparison of Genome Fragments
  13.3  Comparing an Assembled Genome with Genome Fragments
    13.3.1  Applications Using Read Anchoring
    13.3.2  Applications Employing Anchoring of Paired Ends
    13.3.3  Applications Utilizing Mapping of Clone Reads
  13.4  Anchoring by Seed-and-Extend Versus Positional Hashing Methods
  13.5  The UD-CSD Benchmark for Anchoring
  13.6  Conclusions
  References

About the Authors

Index
Part I The New DNA Sequencing Technology
1 An Overview of New DNA Sequencing Technology

Elaine R. Mardis
1.1 An Overview

1.1.1 Background
The dideoxynucleotide termination DNA sequencing technology invented by Fred Sanger and colleagues, published in 1977, formed the basis for DNA sequencing from its inception through 2004 [1]. Originally based on radioactive labeling, the method was automated by the use of fluorescent labeling coupled with excitation and detection on dedicated instruments, with fragment separation by slab gel [2] and ultimately by capillary gel electrophoresis. A variety of molecular biology, chemistry, and enzymology-based improvements have brought Sanger’s approach to its current state of the art. By virtue of economies of scale, high-throughput automation, and reaction optimization, large sequencing centers have decreased the cost of a fluorescent Sanger sequencing reaction to around $0.30. However, it is likely that only incremental cost decreases will continue to be achieved for Sanger sequencing in its current manifestation. This fact, coupled with the ever-increasing need for DNA sequencing in a variety of biomedical (and other) studies, has resulted in a rapid phase of technology development of so-called next-generation, or massively parallel, sequencing technologies that will revolutionize DNA sequencing as we now know it. Along with this revolution will come a significant and potentially unanticipated impact on sequencing-supportive infrastructures, namely, the computational hardware and software required to process and interpret these data.

1.1.2 Rationale for Technology Development Toward Massively Parallel Scale DNA Sequencing
It is perhaps interesting to evaluate the scientific underpinnings that have led to the recent revolution in DNA sequencing technology. With the completion of the reference human genome [3, 4], human geneticists and others began to question the nature and extent of genome-wide interindividual genomic variation. This concept of “strain-to-reference” comparison was not a novel one; certainly, microbiologists had been studying the genomic differences between reference and pathogenic (clinical) strains of viruses and bacteria for many years, largely enabled by the ever-increasing availability of genome sequences for these organisms. Transitioning this concept to larger and more complex genomes is simply a matter of increasing the scale of comparison, since the human genome is approximately 1,000-fold larger than that of the average bacterium.

It is also appropriate to note that much more focused strain-to-reference comparisons of human sequences have been pursued for many years in many studies, using PCR-based resequencing approaches. Here, PCR with genome-unique primers is utilized to selectively amplify the same region from many individual genomes, and each resulting product is sequenced. A comparison of all sequences to the human reference can subsequently highlight common and rare mutations that may predispose to the disease state, predict outcome, or aid in the identification of specific treatments [5–13].

A first stab at understanding human diversity at the single-nucleotide level was embodied initially in 1999–2000 by the SNP Consortium efforts [14] and then scaled up considerably in 2002–2004 by the human “HapMap” project [15]. The latter project was significant not only in its accomplishments (over 1 million verified common human SNPs across four major human populations) but also in that it formed the basis for the development of high-throughput technologies for single-nucleotide, genome-wide genotyping that could interrogate the many common human SNPs identified. For example, these approaches now enable the typing of more than 500,000 SNPs across the human genome for around $400 per sample.

At the RNA level, DNA sequencing (long used for sequencing the ends of cloned mRNAs to produce expressed sequence tags, or ESTs) was also being implemented to quantitate genome-wide gene expression levels for a given tissue or experimental condition by sequencing small RNA “tags” using an approach termed SAGE (serial analysis of gene expression) [16]. Mapping SAGE tag sequences back to a reference genome or transcriptome enabled expressed genes and their corresponding quantitative expression levels to be ascertained (see [17–22] for examples). One drawback of this powerful technique, especially compared to microarray technology, was the significant expense incurred to sequence the number of SAGE tags required to yield a meaningful result, a number that scaled with genome complexity.

With genome-wide genotyping technology in hand and providing ever-increasing numbers of SNPs per genotype, interest began to grow in going beyond common single-nucleotide variations (those SNPs found at 5% or greater frequency in the population, by definition) to characterize the range of variation in multiple-base insertions and deletions, as well as inversions, rearrangements, and translocations. One approach to this scale of characterization was reported by Eichler and colleagues [23], and was a variation on an earlier approach that had utilized BAC (bacterial artificial chromosome) clone end sequencing to identify genome rearrangements [24]. In the Eichler approach, genomic DNA from a single CEPH individual (who had also been genotyped in the HapMap project) was used to generate a fosmid library. Next, the sequences obtained from the ends of about 1 million fosmid clones in this library were mapped back to the human reference sequence, and the end mapping locations were interpreted as described next. Since fosmids package their recombinant genomes within a relatively tight size range of 35,000–50,000 bases, the expected separation between the end sequences of a given insert lies in that range by definition (much more so than for BACs, by contrast). Using this approach, one can identify fosmid ends that do not both map to the same chromosomal pseudomolecule in the human reference, indicating that a translocation or other rearrangement has occurred. Another possibility is a fosmid end placement distance that is smaller than the expected range (the sampled genome carries an inserted sequence relative to the reference genome) or larger than expected (the sampled genome is deleted relative to the reference genome); a simple classifier for this logic is sketched at the end of this section. Typically, one would then select these clones and sequence them in their entirety to better characterize the nature of the rearrangement, deletion, or insertion, as was described in the Eichler manuscript. Again, although this is a very powerful method for elucidating genomic variation beyond SNP variation, significant sequencing costs are involved (around $1 million per genome).

Beyond this sequencing-intensive approach, many groups [25–28] are now reporting analyses of copy number variation in individual human genomes, an offshoot of genome-wide SNP profiling on high-density microarray substrates that enables comparative genome hybridization and signal intensity analysis to ascertain regions of the genome exhibiting greater than two copies, in the case of amplification, or one or zero copies in the case of loss of heterozygosity or deletion.
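The fosmid end-mapping interpretation described above reduces to a simple per-clone decision: compare where the two end reads land and how far apart they are. The sketch below illustrates that logic; the record fields and the classify() interface are hypothetical, and the 35,000–50,000-base window is the fosmid insert range quoted in the text. It is a minimal illustration of the idea, not the pipeline used in the Eichler study.

```python
# Minimal sketch of fosmid end-pair classification, assuming both end reads
# of each clone have already been mapped to the reference. Field names and
# the classify() interface are hypothetical; the thresholds follow the
# fosmid insert size range quoted in the text (35,000-50,000 bases).

from dataclasses import dataclass

MIN_INSERT = 35_000   # smallest expected fosmid insert (bases)
MAX_INSERT = 50_000   # largest expected fosmid insert (bases)

@dataclass
class EndPair:
    clone_id: str
    chrom1: str   # chromosome where end 1 maps
    pos1: int
    chrom2: str   # chromosome where end 2 maps
    pos2: int

def classify(pair: EndPair) -> str:
    """Interpret the mapping geometry of one clone's two end reads."""
    if pair.chrom1 != pair.chrom2:
        # Ends land on different pseudomolecules: a translocation or
        # other rearrangement.
        return "rearrangement"
    span = abs(pair.pos2 - pair.pos1)
    if span < MIN_INSERT:
        # Ends map closer together than the physical insert allows: the
        # sampled genome carries extra (inserted) sequence the reference lacks.
        return "insertion_in_sample"
    if span > MAX_INSERT:
        # Ends map farther apart than the physical insert: reference sequence
        # is missing from the sample (a deletion).
        return "deletion_in_sample"
    return "concordant"

if __name__ == "__main__":
    demo = EndPair("fosmid_001", "chr7", 1_000_000, "chr7", 1_080_000)
    print(demo.clone_id, classify(demo))  # fosmid_001 deletion_in_sample
```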
1.1.3 Goals of Massively Parallel Sequencing Approaches
Ultimately, one major aim of massively parallel DNA sequencing instrumentation is to enable the large-scale sequencing of many human genomes in an accelerated timeframe (about 1 week per genome) at a cost approximating that of a high-end diagnostic assay (about $1,000 per genome). At first pass, this will largely happen in research laboratories such as genome centers or large pharmaceutical companies, and will involve disease-focused patient collections. Obviously, generating the data is only one part of the equation, but it must be done first. The secondary, more challenging goal is the analysis of these data on an individual genome basis and then across all genomes in a given disease cohort, in correlation with other genomic and clinical data types. In addition to known genes, the interpretation of nongenic sequence will become increasingly informative as the overall knowledge of the functions of these regions of the human genome is elucidated. It is likely, however, that with the technology at hand, basic research will not wait. What this means for human health is yet to be seen, but likely we will begin to distill out of these basic research activities the genomic hallmarks of disease, which can then be further coalesced into sequence-based diagnostic assays. These diagnostics will almost certainly not involve whole-genome resequencing but rather will focus on the genomic hallmarks or biomarkers, perhaps using modified-scale instrumentation derived from today’s massively parallel instruments or perhaps using single-molecule detection methods that are presently under development.
1.2 Massively Parallel Sequencing by Synthesis: Pyrosequencing

1.2.1 Principle of the Method
Conceptually and in practice, pyrosequencing is a completely novel approach, distinct from the standard Sanger dideoxynucleotide method; it was initially reported by Nyren and colleagues [29] and later modified [30–34]. Figure 1 exemplifies the method in its current form. Basically, upon nucleotide incorporation by the polymerase, the released pyrophosphate is converted to ATP by action of the enzyme sulfurylase, providing firefly luciferase with the necessary energy source to convert luciferin to oxy-luciferin plus light. Since each template is queried with a single nucleotide species in each step of pyrosequencing, detection of emitted light from this reaction can be directly correlated to the number of nucleotides incorporated, up to a point of nonlinear response, which is typically reached beyond six nucleotides of the same identity (depending upon the sensitivity of the detector).
1.2.2 Pyrosequencing in a Microtiter Plate Format
The pyrosequencing concept was initially limited by the buildup of unincorporated nucleotides and residual ATP between base additions. The addition of the apyrase enzyme to the pyrosequencing reaction cocktail overcame this limitation through its enzymatic action [33], and the pyrosequencing method was then translated into a commercially available microtiter-plate-format instrument. Still, the microtiter plate-based assay has a limited read-length of around 10–16 nucleotides, but it has found widespread use for low-to-medium-throughput genotyping.

1.2.3 The 454 GS-20 Sequencer
In 2004, pyrosequencing transitioned onto a massively parallel DNA sequencing instrument that achieved commercial release by 454 Life Sciences, Inc. [35]. The instrument embodies several novel approaches to DNA sequencing that tremendously streamline and parallelize the process from library construction through sequence detection, as described below.

In this system, a genomic library is made by fragmenting the genomic DNA to about 500 bp, repairing the ends, and ligating two 454-specific linkers to each genomic fragment. These fragments are coupled to Sepharose beads carrying covalently linked oligonucleotides complementary to the fragment library’s ligated linkers. This bead/DNA mixture is emulsified in an oil suspension containing aqueous PCR reactants, enabling the amplification of millions of unique fragment-bead combinations in a large-batch PCR format. The Sepharose beads that contain amplified DNA are prepared for the sequencing reaction by denaturation of the unattached strand and annealing of a sequencing primer, and then are pipetted into a PicoTiterPlate (PTP) device. The PTP is composed of hundreds of thousands of fused fiber-optic strands, the ends of which are hollowed out to a diameter sufficient to contain a single Sepharose bead. Smaller magnetic beads, to which pyrosequencing (sulfurylase and luciferase) enzymes are covalently attached, are added into the PTP. The PTP fits into a flow-cell device that positions it against a high-sensitivity CCD camera in the 454 GS-20 sequencing instrument.

Pyrosequencing follows, whereby sequential flows of each dNTP, separated by an imaging step and a wash step, take place. At each well address in the PTP, the incorporation of one or more nucleotides into the synthesized strand on each bead is captured by the CCD camera, which records positional information about each well address throughout the sequencing process. A post-run bioinformatic pipeline processes the raw pyrosequencing data into approximately 200,000 sequencing reads of about 100 bp each. Recent improvements to the 454 system have enabled increased read-lengths, averaging around 400,000 reads of 250-bp read-length per 7-hour run.
1.2.4 Novel Applications Enabled by Massively Parallel Pyrosequencing
The advent of massively parallel sequencing, ushered in by the availability of 454 pyrosequencing, has enabled genome scientists to pursue applications for DNA sequencing at levels that heretofore were often not possible due to cost and timeline. These include SAGE profiling [36], cDNA sequencing [V. Magrini, personal communication, 2006], metagenomics [37], nucleosome positioning [38], and others. The increasing efficiency of the instrument, coupled with the availability of paired-end read sequencing, suggests that the instrument will continue to inspire ever more varied and ingenious 454-based approaches.
1.3 Massively Parallel Sequencing by Other Approaches

1.3.1 Sequencing by Synthesis with Reversible Terminators
In addition to the 454 pyrosequencing chemistry, several companies and academic groups are working to develop massively parallel sequencing instrumentation that uses reversible dye terminator chemistry for DNA sequencing. To date, only one has achieved commercial availability: the platform offered by Solexa, Ltd. (now a wholly owned subsidiary of Illumina, Inc.). Although this platform has an initial library construction procedure similar to that outlined for 454, once linker-ligated genomic fragments are produced, they are amplified in situ following hybridization to a complementary oligo that is covalently linked to a glass slide (“flow cell”) surface. The amplified fragments, or clusters, are denatured, annealed with a sequencing primer, and placed onto the sequencing instrument for sequencing by synthesis (SBS) using 3’-blocked fluorescent-labeled deoxynucleotides. Each synthesis step includes single-nucleotide incorporation, washing to remove unincorporated nucleotides, imaging of the entire flow cell, deblocking of the 3’-OH ends of each synthesized strand, and washing. Using this approach, at present a 4-day run on the Solexa instrument results in around 50 to 60 million sequences of 40 to 50 base pairs each, for a total of around 1 Gb of sequence (following data quality filtering).

1.3.2 Ligation-Based Sequencing
An important alternative to polymerase-based methods for massively parallel sequencing is an approach that depends upon the high specificity of DNA ligase to mediate the sequencing of genomic fragments. This approach builds on previous genotyping methods, such as the ligase chain reaction (LCR) [39–41] and the oligonucleotide ligation assay (OLA) [42], that rely on the specificity of DNA ligase to join the DNA backbone. Ligation-based sequencing is covered in great detail later in this book, and will form the core technology for a next-generation sequencing instrument scheduled to be introduced in 2007 by Applied Biosystems [H. Fiske, personal communication, 2007].

1.3.3 Sequencing by Hybridization
The same technology that has enabled genome-wide surveys of gene expression by the hybridization of single-stranded cDNA copies of messenger RNA species [43] has more recently been utilized for a variant of genome resequencing typically referred to as “sequencing by hybridization.” Although the approach was described early in the brief history of genome sequencing [44–47], methods to generate arrays of oligonucleotides of sufficient depth to address even single human chromosomes were not technologically feasible at that time, and so the method was not applicable to most genomes of interest. More recently, sequencing by hybridization has taken the form of whole-genome genotyping of SNPs, using a variety of commercially available oligonucleotide-based approaches, such as those of Illumina, Affymetrix, and NimbleGen [48–50]. Since the technology to create ever-increasing densities of oligonucleotides on solid supports is rapidly improving and costs are falling, sequencing by hybridization offers an attractive first pass at characterizing genomes in advance of sequencing. Methods to extract additional information from genome-wide SNP-typing microarrays are now enhancing the data value from these experiments by offering information about large-scale amplifications and deletions, as well as loss of heterozygosity (LOH). Additional historical and practical aspects of sequencing by hybridization are covered later in this book.
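At its computational core, sequencing by hybridization asks whether a target can be reconstructed from the set of short oligonucleotides (k-mers) to which it hybridizes. The toy sketch below illustrates that core by walking an Eulerian path in the de Bruijn graph of a k-mer spectrum; the error-free spectrum and unique reconstruction assumed here are simplifications that real hybridization data do not satisfy.

```python
# Toy illustration of the computational core of sequencing by hybridization:
# rebuilding a target sequence from its set of overlapping k-mers. Assumes an
# error-free spectrum and a unique reconstruction -- simplifications that real
# hybridization data violate.

from collections import defaultdict

def reconstruct(kmers: list[str]) -> str:
    """Walk an Eulerian path in the de Bruijn graph of the k-mer spectrum."""
    graph = defaultdict(list)   # (k-1)-mer prefix -> list of (k-1)-mer suffixes
    indeg = defaultdict(int)
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        indeg[suffix] += 1

    # For a linear (noncircular) target, the start node has one more
    # outgoing edge than incoming edges.
    start = next(n for n in list(graph) if len(graph[n]) > indeg[n])

    # Hierholzer's algorithm: trace the edge path, then spell the sequence.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])

if __name__ == "__main__":
    target = "ATGGCGTGCA"
    k = 4
    spectrum = [target[i:i + k] for i in range(len(target) - k + 1)]
    assert reconstruct(spectrum) == target
```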
1.4 Survey of Future Massively Parallel Sequencing Methods

Several DNA sequencing approaches of interest that will likely not be available in the near term bear mentioning in this introduction because of the incredible potential they may offer for genome sequencing at throughputs one to two orders of magnitude higher than the methods discussed in this book, as well as at high accuracy and high sequence contiguity. Concomitantly, they are the highest-risk approaches presently being pursued in genome sequencing technology development.

1.4.1 Sequencing Within a Zero-Mode Waveguide
The basis for DNA sequencing in a zero-mode waveguide was first spelled out in a seminal paper by Levene et al. [51]. This paper described a sequencing reaction and detection environment, the zero-mode waveguide, that more closely correlates with the observation volumes necessary for many single-molecule detection technologies while enabling substrate concentrations to be held
optimal for the enzymological assay being performed. As described, zero-mode waveguides would be formed by electron beam lithography, followed by reactive ion etching in a metal film deposited on a microscope coverslip (in other words, the thin 1 × 1-inch glass typically placed over samples on a microscope slide). Since each coverslip could potentially contain millions of waveguides, the resulting assay would have the potential of massive parallelism. For direct observations of single molecules, enzymes (such as DNA polymerase) could be adsorbed onto the bottom of the waveguides and provided with the necessary substrates and reactants, while being monitored from below by a microscope objective that both provides the necessary illumination and collects the emitted light. Since the waveguide provides a limited illumination volume, it is much more likely that the signal of a nucleotide in the active site of the enzyme will be detected than that of freely diffusing labeled nucleotides; although the latter provide background fluorescence due to weakly illuminated molecules, that background is essentially constant and therefore readily subtracted. Using fluorescence-coupled spectroscopy, the authors were able to observe the incorporation of coumarin-dCTP into an M13 single-stranded DNA template by immobilized mutant T7 polymerase [51].

This work has set the stage for an attempt to commercialize the zero-mode waveguide technology for DNA sequencing (and perhaps other applications) by Pacific Biosciences (Sunnyvale, California). The potential advantages of this approach for DNA sequencing include very long read-lengths of unlabeled single-template molecules with very high accuracy, in a highly multiplexed fashion. Although technically challenging, this is one approach that could potentially revolutionize DNA sequencing and its routine application in biology as well as in diagnostic and prognostic medicine.

1.4.2 Nanopore Sequencing Approaches
The Coulter counter [52] was no doubt the inspiration for the use of nanopores to sequence DNA, since it works to separate electrolyte-suspended particles by drawing them through a channel between two reservoirs. As a particle enters the channel, it increases the electrical impedance of the channel and therefore causes a drop in current when a voltage is passed across the channel. Current research on nanopore-based sequencing utilizes one of two types of nanopores: protein-based or synthetic. Much of the initial nanopore DNA sequencing work was performed using the bacterial protein α-hemolysin [53], a transmembrane protein that inserts into the lipid bilayer. However, these studies have also largely demonstrated that α-hemolysin pores are limited by size, variation, and stability. Hence, synthetic nanopores are currently being widely investigated for DNA-based analysis, using a wide variety of detection approaches [54–56]. The promise of nanopores includes single-molecule sequencing with
very long read-lengths, potentially without the requirement for labeling the DNA. Thousands to millions of nanopores could be contained in a single device, allowing very high sequencing capacity at very low cost. The drawbacks of nanopores for sequencing so far include low sensitivity for detecting signals that can distinguish the identity of individual nucleotides, and difficulty forcing DNA through the pores in a uniformly single-stranded fashion, due to its tendency to form hairpins and other structures. If these limitations can be overcome, however, the application of nanopores to very inexpensive, very long-read sequencing would undoubtedly usher in a revolution in DNA sequencing.
References

[1] Sanger, F., S. Nicklen, and A. R. Coulson, “DNA Sequencing with Chain-Terminating Inhibitors,” Proc. Natl. Acad. Sci. USA, Vol. 74, No. 12, 1977, pp. 5463–5467.
[2] Smith, L. M., et al., “Fluorescence Detection in Automated DNA Sequence Analysis,” Nature, Vol. 321, No. 6071, 1986, pp. 674–679.
[3] Lander, E. S., et al., “Initial Sequencing and Analysis of the Human Genome,” Nature, Vol. 409, No. 6822, 2001, pp. 860–921.
[4] International Human Genome Sequencing Consortium, “Finishing the Euchromatic Sequence of the Human Genome,” Nature, Vol. 431, No. 7011, 2004, pp. 931–945.
[5] Akey, J. M., et al., “Population History and Natural Selection Shape Patterns of Genetic Variation in 132 Genes,” PLoS Biol., Vol. 2, No. 10, 2004, p. e286.
[6] Livingston, R. J., et al., “Pattern of Sequence Variation Across 213 Environmental Response Genes,” Genome Res., Vol. 14, No. 10A, 2004, pp. 1821–1831.
[7] Wilson, R. K., et al., “Mutational Profiling in the Human Genome,” Cold Spring Harb. Symp. Quant. Biol., Vol. 68, 2003, pp. 23–29.
[8] Fullerton, S. M., et al., “Apolipoprotein E Variation at the Sequence Haplotype Level: Implications for the Origin and Maintenance of a Major Human Polymorphism,” Am. J. Hum. Genet., Vol. 67, No. 4, 2000, pp. 881–900.
[9] Nickerson, D. A., et al., “Sequence Diversity and Large-Scale Typing of SNPs in the Human Apolipoprotein E Gene,” Genome Res., Vol. 10, No. 10, 2000, pp. 1532–1545.
[10] Rieder, M. J., and D. A. Nickerson, “Hypertension and Single Nucleotide Polymorphisms,” Curr. Hypertens. Rep., Vol. 2, No. 1, 2000, pp. 44–49.
[11] Zhu, X., et al., “Localization of a Small Genomic Region Associated with Elevated ACE,” Am. J. Hum. Genet., Vol. 67, No. 5, 2000, pp. 1144–1153.
[12] Levine, R. L., et al., “Activating Mutation in the Tyrosine Kinase JAK2 in Polycythemia Vera, Essential Thrombocythemia, and Myeloid Metaplasia with Myelofibrosis,” Cancer Cell, Vol. 7, No. 4, 2005, pp. 387–397.
[13] Thomas, R. K., et al., “Detection of Oncogenic Mutations in the EGFR Gene in Lung Adenocarcinoma with Differential Sensitivity to EGFR Tyrosine Kinase Inhibitors,” Cold Spring Harb. Symp. Quant. Biol., Vol. 70, 2005, pp. 73–81.
[14] Sachidanandam, R., et al., “A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms,” Nature, Vol. 409, No. 6822, 2001, pp. 928–933.
[15] The International HapMap Consortium, “A Haplotype Map of the Human Genome,” Nature, Vol. 437, No. 7063, 2005, pp. 1299–1320.
[16] Velculescu, V. E., B. Vogelstein, and K. W. Kinzler, “Analysing Uncharted Transcriptomes with SAGE,” Trends Genet., Vol. 16, No. 10, 2000, pp. 423–425.
[17] Funaguma, S., et al., “SAGE Analysis of Early Oogenesis in the Silkworm, Bombyx Mori,” Insect Biochem. Mol. Biol., Vol. 37, No. 2, 2007, pp. 147–154.
[18] McIntosh, S., et al., “SAGE of the Developing Wheat Caryopsis,” Plant Biotechnol. J., Vol. 5, No. 1, 2007, pp. 69–83.
[19] Gibbings, J. G., et al., “Global Transcript Analysis of Rice Leaf and Seed Using SAGE Technology,” Plant Biotechnol. J., Vol. 1, No. 4, 2003, pp. 271–285.
[20] Gowda, M., et al., “Deep and Comparative Analysis of the Mycelium and Appressorium Transcriptomes of Magnaporthe Grisea Using MPSS, RL-SAGE, and Oligoarray Methods,” BMC Genomics, Vol. 7, 2006, p. 310.
[21] Berthier, D., et al., “Bovine Transcriptome Analysis by SAGE Technology During an Experimental Trypanosoma Congolense Infection,” Ann. NY Acad. Sci., Vol. 1081, 2006, pp. 286–299.
[22] Rosinski-Chupin, I., et al., “SAGE Analysis of Mosquito Salivary Gland Transcriptomes During Plasmodium Invasion,” Cell Microbiol., Vol. 9, No. 3, 2007, pp. 708–724.
[23] Tuzun, E., et al., “Fine-Scale Structural Variation of the Human Genome,” Nat. Genet., Vol. 37, No. 7, 2005, pp. 727–732.
[24] Volik, S., et al., “End-Sequence Profiling: Sequence-Based Analysis of Aberrant Genomes,” Proc. Natl. Acad. Sci. USA, Vol. 100, No. 13, 2003, pp. 7696–7701.
[25] Pfeifer, D., et al., “Genome-Wide Analysis of DNA Copy Number Changes and LOH in CLL Using High-Density SNP Arrays,” Blood, Vol. 109, No. 3, 2007, pp. 1202–1210.
[26] Komura, D., et al., “Genome-Wide Detection of Human Copy Number Variations Using High-Density DNA Oligonucleotide Arrays,” Genome Res., Vol. 16, No. 12, 2006, pp. 1575–1584.
[27] Redon, R., et al., “Global Variation in Copy Number in the Human Genome,” Nature, Vol. 444, No. 7118, 2006, pp. 444–454.
[28] Kotliarov, Y., et al., “High-Resolution Global Genomic Survey of 178 Gliomas Reveals Novel Regions of Copy Number Alteration and Allelic Imbalances,” Cancer Res., Vol. 66, No. 19, 2006, pp. 9428–9436.
[29] Nyren, P., B. Pettersson, and M. Uhlen, “Solid Phase DNA Minisequencing by an Enzymatic Luminometric Inorganic Pyrophosphate Detection Assay,” Anal. Biochem., Vol. 208, No. 1, 1993, pp. 171–175.
[30] Ahmadian, A., et al., “Single-Nucleotide Polymorphism Analysis by Pyrosequencing,” Anal. Biochem., Vol. 280, No. 1, 2000, pp. 103–110.
[31] Ahmadian, A., et al., “Analysis of the p53 Tumor Suppressor Gene by Pyrosequencing,” Biotechniques, Vol. 28, No. 1, 2000, pp. 140–144, 146–147.
[32] Ronaghi, M., et al., “PCR-Introduced Loop Structure as Primer in DNA Sequencing,” Biotechniques, Vol. 25, No. 5, 1998, pp. 876–878, 880–882, 884.
[33] Ronaghi, M., M. Uhlen, and P. Nyren, “A Sequencing Method Based on Real-Time Pyrophosphate,” Science, Vol. 281, No. 5375, 1998, pp. 363, 365.
[34] Ronaghi, M., et al., “Real-Time DNA Sequencing Using Detection of Pyrophosphate Release,” Anal. Biochem., Vol. 242, No. 1, 1996, pp. 84–89.
[35] Margulies, M., et al., “Genome Sequencing in Microfabricated High-Density Picolitre Reactors,” Nature, Vol. 437, No. 7057, 2005, pp. 376–380.
[36] Bainbridge, M. N., et al., “Analysis of the Prostate Cancer Cell Line LNCaP Transcriptome Using a Sequencing-by-Synthesis Approach,” BMC Genomics, Vol. 7, 2006, p. 246.
[37] Turnbaugh, P. J., et al., “An Obesity-Associated Gut Microbiome with Increased Capacity for Energy Harvest,” Nature, Vol. 444, No. 7122, 2006, pp. 1027–1031.
[38] Johnson, S. M., et al., “Flexibility and Constraint in the Nucleosome Core Landscape of Caenorhabditis Elegans Chromatin,” Genome Res., Vol. 16, No. 12, 2006, pp. 1505–1516.
[39] Wiedmann, M., et al., “Ligase Chain Reaction (LCR)—Overview and Applications,” PCR Methods Appl., Vol. 3, No. 4, 1994, pp. S51–S64.
[40] Wu, D. Y., and R. B. Wallace, “The Ligation Amplification Reaction (LAR)—Amplification of Specific DNA Sequences Using Sequential Rounds of Template-Dependent Ligation,” Genomics, Vol. 4, No. 4, 1989, pp. 560–569.
[41] Barany, F., “The Ligase Chain Reaction in a PCR World,” PCR Methods Appl., Vol. 1, No. 1, 1991, pp. 5–16.
[42] Nickerson, D. A., et al., “Automated DNA Diagnostics Using an ELISA-Based Oligonucleotide Ligation Assay,” Proc. Natl. Acad. Sci. USA, Vol. 87, No. 22, 1990, pp. 8923–8927.
[43] Brown, P. O., and D. Botstein, “Exploring the New World of the Genome with DNA Microarrays,” Nat. Genet., Vol. 21, No. 1, Suppl., 1999, pp. 33–37.
[44] Drmanac, R., et al., “DNA Sequence Determination by Hybridization: A Strategy for Efficient Large-Scale Sequencing,” Science, Vol. 260, No. 5114, 1993, pp. 1649–1652.
[45] Lipshutz, R. J., “Likelihood DNA Sequencing by Hybridization,” J. Biomol. Struct. Dyn., Vol. 11, No. 3, 1993, pp. 637–653.
[46] Gunderson, K. L., et al., “Mutation Detection by Ligation to Complete N-Mer DNA Arrays,” Genome Res., Vol. 8, No. 11, 1998, pp. 1142–1153.
[47] Broude, N. E., et al., “Enhanced DNA Sequencing by Hybridization,” Proc. Natl. Acad. Sci. USA, Vol. 91, No. 8, 1994, pp. 3072–3076.
[48] Lipshutz, R. J., et al., “Using Oligonucleotide Probe Arrays to Access Genetic Diversity,” Biotechniques, Vol. 19, No. 3, 1995, pp. 442–447.
[49] Wang, D. G., et al., “Large-Scale Identification, Mapping, and Genotyping of Single-Nucleotide Polymorphisms in the Human Genome,” Science, Vol. 280, No. 5366, 1998, pp. 1077–1082.
[50] Albert, T. J., et al., “Light-Directed 5'→3' Synthesis of Complex Oligonucleotide Microarrays,” Nucleic Acids Res., Vol. 31, No. 7, 2003, p. e35.
[51] Levene, M. J., et al., “Zero-Mode Waveguides for Single-Molecule Analysis at High Concentrations,” Science, Vol. 299, No. 5607, 2003, pp. 682–686.
[52] DeBlois, R. W., “Counting and Sizing of Submicron Particles by Resistive Pulse Technique,” Rev. Sci. Instrum., Vol. 41, 1970, p. 909.
[53] Song, L., et al., “Structure of Staphylococcal Alpha-Hemolysin: A Heptameric Transmembrane Pore,” Science, Vol. 274, No. 5294, 1996, pp. 1859–1866.
[54] Heng, J. B., et al., “The Electromechanics of DNA in a Synthetic Nanopore,” Biophys. J., Vol. 90, No. 3, 2006, pp. 1098–1106.
[55] Meller, A., et al., “Rapid Nanopore Discrimination Between Single Polynucleotide Molecules,” Proc. Natl. Acad. Sci. USA, Vol. 97, No. 3, 2000, pp. 1079–1084.
[56] Fologea, D., et al., “Detecting Single Stranded DNA with a Solid State Nanopore,” Nano Lett., Vol. 5, No. 10, 2005, pp. 1905–1909.
2 Array-Based Pyrosequencing Technology

Baback Gharizadeh, Roxana Jalili, and Mostafa Ronaghi
With the completed draft of the human genome, we are entering a new era in the biological sciences, with DNA sequencing as one of the main catalysts. In the DNA sequencing field, pyrosequencing has emerged as a technology for de novo high-throughput whole-genome sequencing. The method is based on the principle of sequencing-by-synthesis and on pyrophosphate detection through a series of enzymatic reactions that generate luminescent sequence peak signals. Pyrosequencing is being used for applications ranging from single-nucleotide analysis to whole-genome sequencing, and the method is commercially available for low-throughput sequencing from Biotage and for high-throughput sequencing from 454 Life Sciences (currently owned by Roche). In this chapter we describe the methodologies and applications, and discuss current developments that could decrease the cost of sequencing by another two orders of magnitude within the next 5 years.
2.1 Introduction

DNA sequencing is a key tool for determining the sequence of nucleic acids and is applied in the biosciences from single-nucleotide polymorphism (SNP) genotyping to whole-genome sequencing. Owing to its widespread applications, a broad range of disciplines has benefited from DNA sequencing, from molecular biology, medicine, diagnostics, genetics, biotechnology, pharmacology, and forensics to archeology and anthropology. Moreover, DNA sequencing is promoting new discoveries that are revolutionizing the conceptual foundations of many fields.
High-throughput and affordable DNA-sequencing techniques will undoubtedly continue the revolution initiated by the Human Genome Project. Recent impressive advances in DNA-sequencing technologies have accelerated the detailed analysis of genomes from many organisms, and we have seen numerous reports of complete or draft versions of the genome sequences of several well-studied, multicellular organisms. The Human Genome Project was made achievable by a reduction in DNA sequencing cost of three orders of magnitude; a further cost reduction of two to three orders of magnitude would launch a new era of DNA sequencing applications, from short DNA reads to whole-genome sequencing.

The chain-termination sequencing method, also known as Sanger sequencing, developed by Frederick Sanger and colleagues [1], has been the most widely used sequencing method since its advent in 1977, and it is still extensively in use. Remarkable advances in chemistry, automation, and data acquisition have made the Sanger sequencing method a simple and elegant technique, central to almost all past and current genome-sequencing projects of any significant scale. Despite these grand advantages, the method has limitations that can be complemented by other techniques.

Among the current state-of-the-art DNA-sequencing techniques, pyrosequencing [2, 3] has emerged and is being used for a wide variety of applications. When introduced in 1997, the method was restricted to SNP genotyping [4] and short reads [5, 6], but it is now being used for broader applications. The original pyrosequencing method is based on conventional PCR and a four-enzyme system, sequencing 96 samples at a time (http://www.biotage.com). Later on, this method was developed further into a high-throughput microfluidics format by 454 Life Sciences (http://www.454.com), which was recently acquired by Roche. Several groups are working on further development of this technology for different applications, which will be discussed in this chapter.
2.2 Pyrosequencing Chemistry

Pyrosequencing technology is based on the sequencing-by-synthesis principle and employs a cascade of four enzymes to accurately determine nucleic acid sequences during DNA synthesis. In pyrosequencing, the sequencing primer is hybridized to a single-stranded, biotin-labeled DNA template (post-PCR alkali treated) and mixed with the enzymes DNA polymerase, ATP sulfurylase, luciferase, and apyrase, and the substrates adenosine 5’ phosphosulfate (APS), luciferin, and deoxynucleotide triphosphates (dNTPs). The four dNTPs are dispensed to the reaction mixture iteratively, in cycles. After each nucleotide dispensation, DNA polymerase catalyzes the incorporation of the complementary dNTP into the template strand. Each nucleotide incorporation event is followed by release of inorganic pyrophosphate (PPi) in a quantity equimolar to the amount of incorporated nucleotide. ATP sulfurylase quantitatively converts PPi to ATP in the presence of APS. The generated ATP drives the luciferase-mediated conversion of luciferin to oxyluciferin, producing visible light in amounts proportional to the amount of ATP. The light from the luciferase-catalyzed reaction is then detected by a photon-detection device such as a charge-coupled device (CCD) camera, CMOS image sensor, or photomultiplier tube.

The generated light is observed as a peak signal in the pyrogram or flowgram (corresponding to the electropherogram in dideoxy sequencing). Each signal peak is proportional to the number of nucleotides incorporated (e.g., incorporation of three dCTP nucleotides generates a threefold-higher peak). Apyrase is a nucleotide-degrading enzyme that continuously degrades ATP and unincorporated dNTPs in the reaction mixture. There is a certain time interval between nucleotide dispensations to allow complete degradation; for this reason, dNTP addition is performed one at a time. During this synthesis process, the DNA strand is extended by complementary nucleotides, and the DNA sequence is demonstrated by the signal peaks in a pyrogram on a computer monitor. Base-calling is performed with integrated software with features for related SNP and sequencing analysis.

Pyrosequencing was earlier limited to sequencing of short stretches of DNA, due to inhibition of apyrase. The natural dATP was a substrate for luciferase, resulting in false sequence signals, and was therefore substituted by dATP-α-S [2]. Higher concentrations of this nucleotide, however, had an inhibitory effect on apyrase catalytic activity, causing nonsynchronized extension. dATP-α-S consists of two isomers, Sp and Rp. The Rp isomer is not incorporated into the DNA template, as it is not a substrate for DNA polymerase, and its presence in the sequencing reaction simply inhibited apyrase activity. By introducing the pure Sp isomer of dATP-α-S, substantially longer reads were achieved [7]. This improvement had a major impact on pyrosequencing read-length, allowed sequencing of over one hundred bases, and opened many avenues for numerous applications.
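Because each peak height is read out as a multiple of the single-incorporation signal, base-calling from a pyrogram reduces, in idealized form, to rounding normalized peak intensities. The sketch below is a minimal illustration of that idea, assuming noise-free signals already normalized so that one incorporation gives an intensity of 1.0; real base-callers must also model noise, signal decay, and the homopolymer nonlinearity discussed later in this chapter.

```python
# Idealized pyrogram base-calling: each dispensed nucleotide yields a signal
# proportional to how many bases of that type were incorporated. Assumes
# intensities are noise-free and normalized (single incorporation = 1.0);
# production base-callers must also handle noise, signal decay, and
# homopolymer nonlinearity.

def call_bases(dispensation_order: str, intensities: list[float]) -> str:
    """Translate one pass of peak intensities into a called sequence."""
    sequence = []
    for nucleotide, signal in zip(dispensation_order, intensities):
        count = round(signal)   # peak height ~ number of incorporations
        sequence.append(nucleotide * count)
    return "".join(sequence)

if __name__ == "__main__":
    # Two cycles of the dispensation order A, C, G, T.
    flows = "ACGTACGT"
    signals = [1.1, 0.0, 2.0, 0.1,     # A, then GG
               0.9, 3.1, 0.0, 1.0]     # A, CCC, T
    print(call_bases(flows, signals))  # -> AGGACCCT
```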
2.3 Array-Based Pyrosequencing

Pyrosequencing has been developed into a massively parallel microfluidic sequencing platform [8] by 454 Life Sciences Corporation (Branford, Connecticut). The first generation of the 454 sequencing platform (GS20) is capable of sequencing 100 bases on average and generates between 20 and 50 megabases (depending on the protocol) of raw DNA sequence in less than 5 hours. The company has recently released the GS FLX platform, which enables longer reads of 200–300 bases, generating over 100 megabases per run. Based on the same platform, read-length distributions around 500 nucleotides have also been achieved. This read-length will open the doors for direct shotgun sequencing of complex genomes, circumventing laborious library construction for de novo genome sequencing.
2.4 454 Sequencing Chemistry

The 454 sequencing method uses a three-enzyme pyrosequencing chemistry in a microfluidic format (apyrase is not used except in the washing buffer). The detection enzymes are immobilized on beads, and the substrates luciferin, APS, and nucleotides are flowed into the system. For sequencing of large fragments and whole genomes, 454 sequencing consists of four steps:

1. Library preparation, where whole-genomic DNA or large fragments (usually over 2 kb) are nebulized by high-pressure nitrogen gas into small fragments, preferably with the majority in the 300–800 bp range. The fragmented DNA is then blunt-ended and polished by DNA end repair for adapter ligation. There are two 44-mer double-stranded adapters, blunt-ended on one side, each consisting of a 20-base PCR primer, a 20-base sequencing primer, and a 4-base key sequence (for initial sequence quality monitoring); one of the adapters is biotin-labeled. The adapters provide sequences for both amplification and sequencing of the immobilized DNA fragments. After ligation and nick repair, the non-biotin-labeled fragments are separated, made single-stranded by alkali treatment, and used as the DNA library.

2. The second step is emulsion PCR, where the ssDNA library is hybridized to complementary strands immobilized on Sepharose capture beads and distributed into a water-in-oil emulsion with the PCR reaction mixture, which is then mixed by a high-speed shaker to form emulsions. The theoretical distribution ratio of beads to ssDNA is 1:1 for clonal amplification. After the PCR reaction, the microreactors are broken and the beads are captured by filtration. The biotin-labeled, amplicon-positive beads are then enriched using streptavidin magnetic beads and made single-stranded.

3. The sequencing step, where the ssDNA-positive beads (23 µm in diameter) are incubated with DNA polymerase and single-strand binding protein (SSB) and then distributed into the wells of an optical faceplate called the PicoTiterPlate, which contains 1.6 million wells (each well is 44 µm in diameter and has the capacity for one bead). After adding the DNA beads and enzyme beads (ATP sulfurylase and luciferase), the packing beads are layered onto the wells and the plate is centrifuged for bead deposition. The packing beads ensure that the DNA beads are kept in place during DNA sequencing and minimize DNA bead loss during the liquid flow. The PicoTiterPlate is placed into the instrument for sequencing.

4. Generated sequence signals are recorded as images for data analysis. The signal processing, base-calling, assembly, mapping, and variant detection are performed by integrated software.

For amplicon sequencing there is no need for library preparation and adapter ligation. Instead, 19-mer sequences are added to the PCR primers, which function as both amplification and sequencing primers. Amplicons can be sequenced unidirectionally or bidirectionally.
2.5 Applications of 454 Sequencing Technology

The 454 sequencing technology is the only commercially available technology that can be used for high-throughput de novo sequencing, generating read-lengths as long as 200–300 bases, with a 500-base sequencing platform in the pipeline. Its applications can generally be categorized as whole-genome sequencing, broad sequencing, and deep sequencing.

2.5.1 Whole-Genome Sequencing

454 sequencing has been used widely for whole-genome sequencing, generally of short genomes, such as de novo bacterial [9–11] and BAC/PAC/cosmid/fosmid [12, 13] sequencing, basically because of their ease of assembly. Due to the high number of sequence repeat regions in mammalian genomes, the 100-base GS 20 has not been as suitable as Sanger sequencing for complex genomes. The longer-read 454 platform, FLX (and the coming generation producing 500 bases), along with paired-end technology [14] and its upcoming improvements, is making this technique suitable for sequencing complex genomes. 454 sequencing is now being used in evolutionary anthropology by the Max Planck Institute to sequence the Neanderthal genome [15, 16], one of the genomes closest to the human genome. Sequencing the ancient Neanderthal genome is especially challenging, as the DNA is highly fragmented, and has only now been enabled by pyrosequencing. This sequencing project is currently taking place at the 454 Sequencing Center. Moreover, the whole human genome of a Nobel Prize–winning scientist is also being sequenced at the same center.
2.5.2 Ultrabroad Sequencing
454 sequencing is currently being utilized for many different applications, such as large pools of cDNA [17–19], small RNAs and micro-RNAs [20–23], ditag/SAGE libraries [14, 24], other amplicon pools [25], paleogenomics, and metagenomics [26–29]. Moreover, the use of DNA tags or barcodes [30] can be a great advantage in reducing the cost of sequencing, since it allows en masse sequencing of pooled amplicons from different samples. Using bioinformatic tools, one can easily sort the amplicon sequences by their identification tags or barcodes.
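The sorting step just mentioned is straightforward in principle: each read is assigned to the sample whose barcode matches its leading bases. The sketch below illustrates this, with invented barcode sequences, a fixed 4-base tag length, and exact matching as simplifying assumptions; real demultiplexing pipelines also tolerate sequencing errors in the barcode (e.g., by edit distance).

```python
# Minimal sketch of barcode demultiplexing: assign pooled reads to samples
# by matching each read's leading bases against known barcodes. The barcode
# table, 4-base tag length, and exact matching are illustrative assumptions;
# production pipelines also allow for sequencing errors in the tag.

from collections import defaultdict

BARCODES = {"ACGT": "sample_1", "TGCA": "sample_2"}  # hypothetical tags

def demultiplex(reads: list[str]) -> dict[str, list[str]]:
    bins = defaultdict(list)
    for read in reads:
        tag = read[:4]                      # barcode occupies the first bases
        sample = BARCODES.get(tag, "unassigned")
        bins[sample].append(read[4:])       # strip the tag before analysis
    return dict(bins)

if __name__ == "__main__":
    pooled = ["ACGTGGATTACA", "TGCATTTACGGA", "AAAACCCGGG"]
    for sample, seqs in demultiplex(pooled).items():
        print(sample, seqs)
```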
2.5.3 Ultradeep Amplicon Sequencing
Ultradeep sequencing allows the detection of low-frequency mutations [30–32]. This is generally not possible with conventional Sanger sequencing, which can only detect mutations present at a frequency of about 20% or higher. Screening for low-frequency mutations is important for many studies, such as drug or antibiotic resistance, as well as for finding novel mutations in heterogeneous samples.
2.6 Advantages and Challenges

The 454 sequencing technology is the only commercially available high-throughput sequencing platform for de novo sequencing at a lower cost than Sanger sequencing. It relies on clonal amplification, which allows sequencing of unclonable regions. Moreover, GC content is generally not an issue in pyrosequencing [33]. It also has the ability to detect low-frequency mutations, which cannot be detected by Sanger sequencing [32]; detection of low-frequency mutations is a valuable tool for clinical cancer and drug-resistance research.

454 sequencing has been limited to an average read-length of 100 bases; however, with new developments in chemistry and software algorithms, the read-length is being expanded to 500 bases. This will facilitate sequencing of highly repetitive regions and complex genomes, with the ability to assemble the “left-out” contigs.

Homopolymers are the main challenge affecting the accuracy of the 454 sequencing technology. The problem is caused chiefly by dATP-α-S when incorporated in homopolymeric Ts (runs of more than 4 to 5). Homopolymer stretches (mainly homopolymeric T) can reduce synchronized extension, causing nonuniform sequence peak heights, affecting the read-length, and possibly causing sequence errors [7]. This problem could be resolved by engineering more efficient DNA polymerases or a luciferase that does not recognize natural dATP as a substrate.
2.7 Future of Pyrosequencing

While pyrosequencing chemistry will be improved to extend the read-length, progress in detection, microfluidics, and base-calling will be critical for further advancement of this technology for different applications at low cost. In addition, upstream processes should be integrated to reduce the cost, shorten the time for sample preparation, and minimize human intervention for accurate pyrosequencing. Currently the cost of the machine is about $500,000, and sequencing a mammalian-sized genome would cost more than $2 million. The costs of instrumentation and reagents should decrease tenfold and a hundredfold, respectively.

Our group has been working on developing methodologies that could dramatically reduce instrumentation cost. More specifically, we have developed a new detection scheme based on a CMOS image sensor [34], which uses standard semiconductor manufacturing techniques and offers higher sensitivity than a CCD camera, operating at room temperature on a 2-V battery. Furthermore, analog-to-digital conversion and digital signal processing units have been integrated with the image sensor to maximize the efficiency of detection. We have integrated this CMOS system with the fluidics to successfully pyrosequence mixtures of PCR products. As CMOS can be customized and offers high sensitivity, there is room for further miniaturization. A 9-megapixel CMOS has now been designed that would enable sequencing of 9 million reaction wells simultaneously. Sequencing 500 nucleotides with an overall system efficiency of 60% would potentially produce ∼1× coverage of a mammalian genome in a single run. As this integrated system is very compact, a superscalar version of such a device containing 20 chips can be envisioned for even higher-throughput settings.

For a lower-throughput version of pyrosequencing, Advanced Liquid Logic, Inc. (http://www.liquid-logic.com), has implemented digital microfluidics based on electrowetting to perform pyrosequencing on a flat chip. Sample preparation, including PCR and DNA immobilization, can potentially be performed on the sequencing chip. The chip is fully programmable, enabling various sequencing applications on the same platform. This device could serve diagnostics, point-of-care, and biodefence applications.

In conclusion, we believe that pyrosequencing technology is scalable and can be further miniaturized. A cost of $10,000 per de novo mammalian genome could be achievable within the next 5 to 7 years.
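As a quick sanity check on the ∼1× coverage estimate above, the arithmetic below multiplies the figures quoted in the text (9 million wells, 500-nucleotide reads, 60% overall efficiency); the 3-Gb genome size is our assumption for a typical mammalian genome.

```python
# Back-of-the-envelope check of the coverage estimate quoted in the text.
# The 3-Gb mammalian genome size is an assumption; wells, read-length, and
# efficiency are the figures given above.

WELLS = 9_000_000            # reaction wells on the 9-megapixel CMOS sensor
READ_LENGTH = 500            # nucleotides per well
EFFICIENCY = 0.60            # fraction of wells yielding usable sequence
GENOME_SIZE = 3_000_000_000  # assumed mammalian genome size (bases)

bases_per_run = WELLS * READ_LENGTH * EFFICIENCY
coverage = bases_per_run / GENOME_SIZE
print(f"{bases_per_run / 1e9:.1f} Gb per run ~ {coverage:.1f}x coverage")
# -> 2.7 Gb per run ~ 0.9x coverage, consistent with the ~1x claim.
```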
References
[1] Sanger, F., S. Nicklen, and A. R. Coulson, "DNA Sequencing with Chain-Terminating Inhibitors," Proc. Natl. Acad. Sci. USA, Vol. 74, 1977, pp. 5463–5467.
[2] Ronaghi, M., et al., "Real-Time DNA Sequencing Using Detection of Pyrophosphate Release," Analytical Biochemistry, Vol. 242, 1996, pp. 84–89.
[3] Ronaghi, M., M. Uhlen, and P. Nyren, "A Sequencing Method Based on Real-Time Pyrophosphate," Science, Vol. 281, 1998, pp. 363, 365.
[4] Ahmadian, A., et al., "Single-Nucleotide Polymorphism Analysis by Pyrosequencing," Analytical Biochemistry, Vol. 280, 2000, pp. 103–110.
[5] Gharizadeh, B., et al., "Typing of Human Papillomavirus by Pyrosequencing," Lab Invest., Vol. 81, 2001, pp. 673–679.
[6] Nordstrom, T., et al., "Method Enabling Fast Partial Sequencing of cDNA Clones," Analytical Biochemistry, Vol. 292, 2001, pp. 266–271.
[7] Gharizadeh, B., et al., "Long-Read Pyrosequencing Using Pure 2′-Deoxyadenosine-5′-O′-(1-Thiotriphosphate) Sp-Isomer," Analytical Biochemistry, Vol. 301, 2002, pp. 82–90.
[8] Margulies, M., et al., "Genome Sequencing in Microfabricated High-Density Picolitre Reactors," Nature, Vol. 437, 2005, pp. 376–380.
[9] Goldberg, S. M., et al., "A Sanger/Pyrosequencing Hybrid Approach for the Generation of High-Quality Draft Assemblies of Marine Microbial Genomes," Proc. Natl. Acad. Sci. USA, Vol. 103, 2006, pp. 11240–11245.
[10] Hofreuter, D., et al., "Unique Features of a Highly Pathogenic Campylobacter Jejuni Strain," Infect. Immun., Vol. 74, 2006, pp. 4694–4707.
[11] Oh, J. D., et al., "The Complete Genome Sequence of a Chronic Atrophic Gastritis Helicobacter Pylori Strain: Evolution During Disease Progression," Proc. Natl. Acad. Sci. USA, Vol. 103, 2006, pp. 9999–10004.
[12] Moore, M. J., et al., "Rapid and Accurate Pyrosequencing of Angiosperm Plastid Genomes," BMC Plant Biol., Vol. 6, 2006, p. 17.
[13] Wicker, T., et al., "454 Sequencing Put to the Test Using the Complex Genome of Barley," BMC Genomics, Vol. 7, 2006, p. 275.
[14] Ng, P., et al., "Multiplex Sequencing of Paired-End Ditags (MS-PET): A Strategy for the Ultra-High-Throughput Analysis of Transcriptomes and Genomes," Nucleic Acids Res., Vol. 34, 2006, p. e84.
[15] Green, R. E., et al., "Analysis of One Million Base Pairs of Neanderthal DNA," Nature, Vol. 444, 2006, pp. 330–336.
[16] Noonan, J. P., et al., "Sequencing and Analysis of Neanderthal Genomic DNA," Science, Vol. 314, 2006, pp. 1113–1118.
[17] Cheung, F., et al., "Sequencing Medicago Truncatula Expressed Sequenced Tags Using 454 Life Sciences Technology," BMC Genomics, Vol. 7, 2006, p. 272.
[18] Emrich, S. J., et al., "Gene Discovery and Annotation Using LCM-454 Transcriptome Sequencing," Genome Res., Vol. 17, 2007, pp. 69–73.
[19] Weber, A. P., et al., "Sampling the Arabidopsis Transcriptome with Massively-Parallel Pyrosequencing," Plant Physiol., 2007.
[20] Axtell, M. J., et al., "A Two-Hit Trigger for siRNA Biogenesis in Plants," Cell, Vol. 127, 2006, pp. 565–577.
[21] Girard, A., et al., "A Germline-Specific Class of Small RNAs Binds Mammalian Piwi Proteins," Nature, Vol. 442, 2006, pp. 199–202.
[22] Lu, C., et al., "MicroRNAs and Other Small RNAs Enriched in the Arabidopsis RNA-Dependent RNA Polymerase-2 Mutant," Genome Res., Vol. 16, 2006, pp. 1276–1288.
[23] Pak, J., and A. Fire, "Distinct Populations of Primary and Secondary Effectors During RNAi in C. Elegans," Science, Vol. 315, 2007, pp. 241–244.
[24] Nielsen, K. L., A. L. Hogh, and J. Emmersen, "DeepSAGE—Digital Transcriptomics with High Sensitivity, Simple Experimental Protocol and Multiplexing of Samples," Nucleic Acids Res., Vol. 34, 2006, p. e133.
[25] Albert, I., et al., "Translational and Rotational Settings of H2A.Z Nucleosomes Across the Saccharomyces Cerevisiae Genome," Nature, Vol. 446, 2007, pp. 572–576.
[26] Edwards, R. A., et al., "Using Pyrosequencing to Shed Light on Deep Mine Microbial Ecology," BMC Genomics, Vol. 7, 2006, p. 57.
[27] Krause, L., et al., "Finding Novel Genes in Bacterial Communities Isolated from the Environment," Bioinformatics, Vol. 22, 2006, pp. e281–e289.
[28] Sogin, M. L., et al., "Microbial Diversity in the Deep Sea and the Underexplored 'Rare Biosphere,'" Proc. Natl. Acad. Sci. USA, Vol. 103, 2006, pp. 12115–12120.
[29] Turnbaugh, P. J., et al., "An Obesity-Associated Gut Microbiome with Increased Capacity for Energy Harvest," Nature, Vol. 444, 2006, pp. 1027–1031.
[30] Binladen, J., et al., "The Use of Coded PCR Primers Enables High-Throughput Sequencing of Multiple Homolog Amplification Products by 454 Parallel Sequencing," PLoS ONE, Vol. 2, 2007, p. e197.
[31] Thomas, R. K., et al., "Sensitive Mutation Detection in Heterogeneous Cancer Specimens by Massively Parallel Picoliter Reactor Sequencing," Nat. Med., Vol. 12, 2006, pp. 852–855.
[32] Wang, C., et al., "Characterization of Mutation Spectra with Ultra-Deep Pyrosequencing: Application to HIV-1 Drug Resistance," Genome Res., Vol. 17, No. 8, August 2007, pp. 1195–1201.
[33] Ronaghi, M., et al., "Analyses of Secondary Structures in DNA by Pyrosequencing," Analytical Biochemistry, Vol. 267, 1999, pp. 65–71.
[34] Eltoukhy, H., K. Salama, and A. El Gamal, "A 0.18-µm CMOS Luminescence Detection Lab-on-Chip," IEEE Journal of Solid-State Circuits, Vol. 41, 2006, pp. 651–662.
3 The Role of Resequencing Arrays in Revolutionizing DNA Sequencing David Okou and Michael E. Zwick
3.1 Introduction
Genomics technologies that enable rapid, accurate, and cost-effective DNA sequencing promise to revolutionize genetic research. At the same time, these technologies will usher in a new era of patient care, with an increasing focus on predictive health and individualized genomic medicine. The Human Genome Project's (HGP) resounding success [1] and its numerous innovations [2], combined with the pressure of direct competition from a commercial company [3], have fostered an environment in which ever-greater quantities of DNA are being sequenced at an ever-decreasing cost [4]. Yet all scientific revolutions tend to change the research landscape in unanticipated ways, and the HGP is no exception. While remarkably efficient, the costs, infrastructure, and personnel required to maintain a traditional, large-scale industrial DNA-sequencing center that uses gel electrophoresis and Sanger-sequencing chemistry may not be scalable to the degree necessary to meet future research and medical application needs [4]. Consequently, the focus of genomics is shifting to a number of exciting new technologies that offer drastic cost reductions, while at the same time dramatically increasing data production [4]. A characteristic these technologies share, and one not often factored into cost-reduction estimates, is that their space, personnel, and infrastructure requirements are a fraction of those
necessary for the traditional industrial model. Therefore, the prospect of a genome-sequencing center in every lab is increasingly within reach. Resequencing arrays (RAs) are one example of a next generation DNA sequencing technology that can enable individual laboratories to generate vast quantities of genome sequence rapidly and in a cost-effective fashion. This chapter first provides an overview of how RAs determine a DNA sequence. A discussion of basic experimental protocols and methods of RA data analysis follows. Finally, the chapter concludes with a review of several diverse studies that demonstrate how RAs can be employed for DNA sequencing.
3.2 DNA Sequencing by Hybridization with Resequencing Arrays
The term "resequencing" refers to the act of sequencing a gene, genomic region, or genome in multiple different individuals in order to identify genetic variation. Resequencing arrays accomplish this task by hybridizing target DNA from a single individual to a set of custom-designed oligonucleotide probes located on a solid surface. When target DNA is hybridized to a RA, interactions with complementary probe oligonucleotides act either to query the identity of a given base (haploid) or to determine a genotype (diploid), depending on the target DNA source. With a high-quality genome reference sequence in hand, RAs can be custom-designed to resequence the unique genomic sequences of any organism. The great power of this approach arises from the fact that high-density oligonucleotide (oligo) RAs possess a very large number of oligos and can be processed by a single individual within a single laboratory [5, 6]. RAs query a given base by using overlapping oligonucleotide probes, tiled at a 1-base-pair (bp) resolution. The oligonucleotide probes, referred to as features, are typically 25 base pairs long. Both the forward and reverse strands are interrogated, so sequencing a single base requires a total of eight features. A set of four features contains oligonucleotides identical to the forward reference strand, except at position 13 (the base to be queried), where there is either A, C, G, or T. The remaining four features are similarly designed for the complementary strand (Figure 3.1). When a labeled DNA sample, called a target, is hybridized to these eight features on the array, the two features complementary to the reference sequence (forward and reverse complement) will yield the highest signal. If, however, the target DNA contains a variant base at position 13, the two features complementary to that variant base will yield the highest signal (Figure 3.2). Given eight features for each base, interrogation of an L-length duplex strand would require 8L oligonucleotide probes.
Figure 3.1 A description of the concept and design of resequencing arrays.
Figure 3.2 Overview of the basic resequencing array protocol.
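The eight-feature scheme just described is easy to state in code. The following Python sketch is purely illustrative (the reference sequence and coordinates are toy values, not a vendor design tool): for a queried base, it emits the four forward-strand 25-mers with A, C, G, or T substituted at position 13, plus their reverse complements.

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def probe_features(reference, center):
    """Return the eight 25-mer features querying the base at 0-based `center`."""
    window = reference[center - 12 : center + 13]  # 25-mer; queried base at index 12
    features = []
    for base in "ACGT":
        forward = window[:12] + base + window[13:]  # substitute position 13 (1-based)
        features.append(forward)
        features.append(revcomp(forward))           # matching reverse-strand feature
    return features

reference = "GATTACAGATTACAGATTACAGATTACA"  # toy reference sequence
for probe in probe_features(reference, 14):
    print(probe)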
Two companies currently manufacture high-density oligonucleotide RAs: Affymetrix, Inc., and NimbleGen Systems, Inc. Affymetrix's GeneChip technology uses a series of masks and photolithography to manufacture microarrays containing oligos with a length of 25 bp [7]. The current commercially available RAs with an 8-micron feature size contain about 2.4 million distinct oligos. These RAs can resequence approximately 300 kb per chip (2.4 million 8-micron features/eight features per queried base). Because Affymetrix's technology synthesizes oligos on entire wafers that are subsequently broken into individual RAs, a typical order of 45 chips can resequence up to 13.5 megabases (Mb). The small feature size and concomitant high density of Affymetrix RAs are appealing strengths of their platform. NimbleGen Systems, Inc., on the other hand, uses maskless photolithography to manufacture microarrays that contain about 385,000 16-micron features [8]. These RAs can resequence approximately 48 kb per chip. There are two advantages to the NimbleGen Systems approach: they can manufacture single arrays, and they can put oligos of different lengths on a single chip. Oligos of different lengths on a single RA enable more precise matching of melting temperatures between target and probe DNA. Regardless of the RA manufacturer, decreasing probe feature size and increasing microarray density constitute a very fertile area for research and development, and in all likelihood, further progress in these areas will continue to enhance RA capabilities.
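As a quick check on the per-chip capacities just quoted, the following Python fragment reproduces the arithmetic (eight features per queried base; feature counts are those given in the text):

platforms = {
    "Affymetrix (8-micron features)": 2_400_000,
    "NimbleGen (16-micron features)": 385_000,
}
for name, features in platforms.items():
    # Eight features (four per strand) are spent on each queried base.
    print(f"{name}: ~{features / 8 / 1000:.0f} kb resequenced per chip")
# -> ~300 kb and ~48 kb per chip; a typical 45-chip Affymetrix order thus
#    covers 45 * 300 kb = 13.5 Mb.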
3.3 Resequencing Array Experimental Protocols
RAs generate the highest quality data when target DNA sequences that correspond to the complementary probe oligos are unique [5, 6]. Common bioinformatics procedures that identify repetitive sequences underlie RA design, regardless of the targeted organism. Unique DNA sequences can be readily identified and obtained from publicly available DNA sequence repositories using tools like the UCSC Genome Bioinformatics site [9]. Genome repeat masking (e.g., with RepeatMasker, available at http://www.repeatmasker.org/) is another good alternative. RepeatMasker identifies and masks common repetitive sequences and simple sequence repeats [10]. The unique sequences that remain are then candidates for RA probe selection. In order to use RA space most efficiently, a typical design requires that unique sequences be longer than 50 bp to be tiled on the microarray, while repetitive sequences must be longer than 50 bp to be excluded. However, these guidelines can be relaxed or tightened depending on the specific RA application. This approach is employed because, during RA synthesis, the first and last 12 bases of each unique sequence fragment are not covered by full-length (25-bp) oligos on the chip; hence they do not provide useful base-calling information. These oligos do, however, take up space on the RA, thereby reducing the total number of bases that can be resequenced. Once the desired unique sequences have been identified, they are placed in a flat text multi-FASTA file (each unique fragment is in a separate FASTA text block) that is provided to the chip manufacturer. The RAs are then manufactured as discussed previously. The greatest technical challenge posed by resequencing with RAs lies in isolating target DNA. RAs can resequence either haploid or diploid target DNA. In both cases, the methods of target DNA isolation are identical. Target DNA from relatively simple microbial genomes can be obtained either by using long PCR (LPCR) of selected genomic regions [6] or, even more simply, via whole-genome amplification (WGA) of the entire genome. Larger, more complex eukaryotic genomes present a more significant challenge. LPCR is currently the preferred method of target DNA production and has been used in a project that generated 160 Mb of human genomic sequence.
LPCR target DNA products are quantified and pooled by individual for a given chip design. A modern robotic infrastructure can further automate this process, but scaling it up while simultaneously amplifying the majority of the target genome sequences (>90%) is a significant obstacle. How to improve methods of target DNA production from complex eukaryotic genomes is an active area of research. Clone-based methodologies are a promising approach that may be more scalable [11, 12], although these methods still require substantial effort and quality control in the course of isolating target DNA products. Isolated target DNA requires processing prior to RA hybridization. The first step is to fragment the target DNA to products that are 20 to 200 bases long. This is typically performed enzymatically using DNase I and should be followed by gel confirmation of digestion. This step in the experimental protocol is critical, since target DNA overdigestion and underdigestion both lead to poor RA performance. After digestion, the target DNA fragments are labeled and then hybridized to the RA overnight. The chip is then washed and scanned, which generates a high-definition image of the RA. At this point, the chip image is ready to be analyzed.
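The design bookkeeping described in this section can be summarized in a few lines of code. The sketch below uses the illustrative thresholds from the text (it is not a vendor tool): unique fragments longer than 50 bp are tiled, and because the first and last 12 bases of each fragment lack full-length 25-mer coverage, the callable span of a tiled fragment is its length minus 24.

def plan_tiling(fragments, min_len=50):
    """fragments: dict mapping name -> unique (repeat-masked) sequence."""
    callable_total = 0
    for name, seq in fragments.items():
        if len(seq) <= min_len:
            continue                    # too short to be worth tiling
        callable_bases = len(seq) - 24  # trim 12 uncallable bases from each end
        callable_total += callable_bases
        print(f"{name}: {len(seq)} bp tiled, {callable_bases} bp callable")
    return callable_total

fragments = {"frag1": "A" * 1000, "frag2": "C" * 40}  # toy unique fragments
print(f"total callable: {plan_tiling(fragments)} bp")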
3.4 Analyzing Resequencing Array Data with ABACUS
Determining the DNA sequence from a RA requires analysis of the chip image for each queried base. Initial studies using high-density oligonucleotide microarrays for single nucleotide polymorphism (SNP) discovery and genotyping relied on manual examination and visual inspection of the chip images [13–20]. While these initial studies held out great promise, they also suffered from major shortcomings. The reliance on human inspection of each chip was inherently not scalable, was subject to variation among individual viewers, and tended to bias the resulting base (haploid target) and genotype (diploid target) calls. For example, human inspectors tended to identify common SNPs more often than rare SNPs. Equally problematic, the initial large-scale microarray experiments reported SNP false-positive detection rates between 12% and 45% [19, 21]. While these numbers certainly seemed high, because each chip scans a large number of sites, these initial experiments suggested that microarray-based variation detection and genotyping in fact achieves an accuracy between 99.93% and 99.99% [19–21]. The paradox of high false-positive SNP detection rates paired with seemingly high RA accuracy exposes a significant problem that faces any resequencing technology [5]. A simple thought experiment can clarify this situation. If we propose a hypothetical technology that can determine the sequence of bases with an accuracy of 99.9%, then we would expect to observe 10 errors for every 10,000 bases resequenced. Naturally occurring rates of genetic variation in populations of virtually any organism one can examine are relatively low; thus, most resequenced bases are monomorphic. For example, in humans, the average level of variation is about eight differences per 10,000 bases [22]. Hence, if we resequence 10,000 bases with our 99.9% accurate resequencing technology, we would expect to identify eight true variants and make ten errors. This corresponds to an SNP false-positive detection rate of about 55%, a result that demonstrates the need for all resequencing technologies to have very high accuracy in order to detect the generally rare but quite real genetic variants found in genomes.
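The thought experiment reduces to two lines of arithmetic; the Python sketch below simply restates it (the 99.9% accuracy and the human polymorphism rate of about eight variable sites per 10 kb are the figures from the text):

bases = 10_000
error_rate = 1 - 0.999        # hypothetical 99.9% per-base accuracy
variant_rate = 8 / 10_000     # true human variants per base [22]

false_positives = bases * error_rate    # 10 expected sequencing errors
true_positives = bases * variant_rate   # 8 expected real variants
fp_rate = false_positives / (false_positives + true_positives)
print(f"SNP false-positive detection rate: {fp_rate:.1%}")  # -> 55.6%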
The conundrum of high RA error rates combined with the reliance on human visual inspection hinted that RAs might prove inadequate as an effective resequencing technology. To address these challenges, researchers developed ABACUS (Adaptive Background genotype calling scheme) [5]. ABACUS is an objective statistical framework designed to distinguish base (haploid target)/genotype (diploid target) calls that can be made with extraordinary accuracy from those that are less reliable. Because RAs are extraordinarily accurate on the vast majority of queried bases, significantly increasing RA data accuracy becomes possible. Furthermore, the analysis is automated and does not require any human visual inspection of the chip images. ABACUS assigns a quality score to every base/genotype call that corresponds to the level of support at each resequenced base. This approach is similar to the one undertaken for traditional DNA sequencing [23, 24]. To assess data quality, both replicate experiments and accuracy estimates are performed. RA data has been shown to be highly reproducible in replicate experiments consisting of independent amplification of identical samples followed by hybridization to distinct microarrays of the same design [5]. In an autosomal (diploid target) replicate experiment, 813,295 out of 813,295 genotypes were called identically (including 351 heterozygotes); at X-linked loci in males (haploid target), 841,236 out of 841,236 sites were called identically. If repeatability could be equated to accuracy, then this level of repeatability in haploid and diploid genotype calls would correspond to a phred score of at least 54 (assuming a binomial error probability of P; in relating P-values to phred scores, note that phred = −10 log10 P) [23, 24]. Although repeatability can certainly suggest accuracy, it is also possible that systematic but repeatable errors might remain undetected. Therefore, using independent resequencing/genotyping technologies, accuracy was also assessed via two separate means. For haploid target data, 6× shotgun resequencing of a single individual sample yielded 17,423 base calls that were identical to the ABACUS calls. To assess the accuracy at segregating sites in diploids (nonsegregating sites identical to the reference are exceedingly likely to be correct, since overall polymorphism rates are so low), 1,938 genotypes were obtained at 108 segregating sites.
The accuracy of diploid genotypes at segregating autosomal sites was confirmed for 1,515 of 1,515 homozygous calls, and for 420 out of 423 (99.29%) heterozygotes [5]. These results indicated that genotyping accuracy at segregating sites was greater than 99.8% (and this of course does not take into account the nonsegregating sites also likely to be called correctly). A subset of these SNPs was investigated to assess the accuracy of SNP detection: 108 out of 108 SNPs were experimentally confirmed, and an additional 371 SNPs were confirmed electronically. This indicated that RAs can be used for both detection and genotyping of variation simultaneously, and that the base-calling accuracy reaches or exceeds most other widely available stand-alone technologies. Furthermore, because ABACUS provides a quality score for individual bases (haploid targets)/genotypes (diploid targets), investigators can focus attention on those sites providing accurate information. Haploid quality scores, which are those relevant to resequencing X-chromosome loci in human males and microbial genomes, for example, are higher on average than those for diploid data (Figure 3.3). RAs can also detect insertion/deletion (indel) variation in haploid targets. Currently, ABACUS is not designed to detect indel variation in diploid targets. This initial application of ABACUS demonstrated greater than 99.9999% accuracy on more than 80% of the genotype calls in a single experiment using Affymetrix RAs [5].
Figure 3.3 Quality score (log10) distribution for haploid and diploid resequenced bases [5].
This work established that RAs could be an effective resequencing platform. Improving the overall call rate above 80% for single experiments then became a priority. After scanning a RA, the first step in data analysis is determining the position of features on the chip. This process is called grid alignment. Grid alignment is followed by the extraction of the raw data from each RA image that is subsequently used for base calling by ABACUS. When ABACUS was developed, grid alignment required manual intervention by a laboratory technician, who would determine the positions of the four corners of a grid by eye. The single grid placed by the technician encompassed the entire RA image, effectively assuming a "globally perfect" grid alignment. Once the grid was placed over the RA image, the intensity data could then be extracted for analysis by ABACUS. This process, in addition to being laborious, led to varying degrees of misalignment on portions of (and in some cases, entire) RAs. Data extracted from RAs with misaligned grids raised error rates. Increasing the quality score threshold could eliminate errors, but at the cost of reduced rates of base/genotype calling. The problems associated with manual grid alignment were eventually solved by software that performs automated grid alignment on RA image files [D. J. Cutler and M. Zwick, personal communication, 2006]. The software makes no "globally perfect" assumption, but instead assumes a locally perfect alignment. Micromisalignments, when undetected, can correspond to very weak heterozygotes when resequencing diploid targets. These errors are minimized with a locally perfect method of grid alignment. Figure 3.4 provides an example of the way local grid alignment can improve data quality. At the same time, automated grid alignment allowed for a reduction of the quality score threshold, increasing call rates to more than 90% of sites. Currently, automated grid alignment followed by ABACUS base calling requires less than 1 minute per microarray on a standard desktop computer. All of these tools have been implemented in a free, open-source software package called RATools, which is publicly available at http://www.dpgp.org/RA/ra.htm. RATools works equally well with RAs manufactured by Affymetrix or NimbleGen Systems, Inc.
Figure 3.4 Local grid alignment (right) extracts RA data more accurately than global grid alignment (left).
3.5 Review of RA Applications
Resequencing arrays have been used successfully in a variety of systems and projects. As this new technology advances, an increasingly diverse collection of RA projects has ensued. Here we review a few select applications of RA technology.
3.5.1 Human Resequencing
High-throughput RAs were applied to an experiment encompassing 32 autosomal and eight X-linked genomic regions, each consisting of approximately 50 kb of unique sequence spanning a 100-kb region, in 40 humans [M. Zwick, personal communication, 2007]. The regions resequenced included unique coding and noncoding sequences. Long PCR was used to generate target DNA for RA hybridization. Using thresholds identical to those described previously, approximately 1.6 Mb of DNA sequence was determined for each of the 40 individuals. In total, 6,040 SNPs were identified, corresponding to human genetic variation levels determined by other sequencing technologies. The SNPs that were discovered consisted of both common and rare variants. A pronounced excess of rare variants, relative to the number predicted by the neutral theory, was observed in the genomic regions resequenced. These data demonstrated that RAs can be a viable resequencing platform that is capable of identifying both common and rare SNPs in large-scale resequencing projects focused on complex eukaryotic genomes.
3.5.2 Mitochondrial DNA Resequencing
Mitochondrial DNA mutations are commonly found in cancers, but until recently, strategies for using the mitochondrial genome as a cancer-screening tool were limited by the lack of a high-throughput platform for mutation detection. Maitra et al. [25] used a RA to rapidly sequence the human mitochondrial genome in order to detect mutations linked to cancers. The first generation of the MitoChip (v1.0, released in 2004) could sequence more than 29 kb of double-stranded DNA in a single assay. Both strands of the entire human mitochondrial coding sequence (15,451 bp) were arrayed on the MitoChip; both strands of an additional 12,935 bp (84% of coding DNA) were also arrayed in duplicate. Using ABACUS and a total threshold quality score of 30 [5], the Affymetrix GeneChip DNA Analysis software successfully assigned base calls at
96.0% of nucleotide positions [25]. Replicate experiments demonstrated more than 99.99% reproducibility of base calls both within and between chips using array-based sequencing (Table 3.1). There were no significant differences in the percentage of genotype calls among mitochondrial DNA extracted from lymphocytes, cell lines, primary tumors, or body fluids. The second-generation MitoChip (v2.0, released in 2006) has a smaller feature size, tiles more sequence (50 kb, compared to 32 kb), and includes redundant tiling features, enabling researchers to detect potential insertion-deletion mutations and nearly three dozen variations in the noncoding D-loop region. These regions were not directly detectable using the original MitoChip [26]. Using RAs to resequence mitochondrial genomes accomplished three key objectives. First, the number of polymerase chain reactions (PCRs) required to sequence the entire mitochondrial DNA was reduced to three (from 20 to 40 individual reactions). Second, the amount of starting template needed was reduced to just 10 nanograms (from the 100 nanograms cited in [25]). And finally, sequencing the entire mitochondrial genome on a single chip eliminates the need to visually inspect traditional chromatograms. The ability of array-based mitochondrial genome resequencing to detect mitochondrial mutations in samples of body fluids obtained from cancer patients attests to its promise as a tool for the early detection of cancer and other disorders in human clinical samples.
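As a quick consistency check on the v1.0 figures above and in Table 3.1 (all numbers taken from the text), the tiled totals can be verified in a few lines of Python:

coding = 15_451      # both strands of the mitochondrial coding sequence, bp
duplicated = 12_935  # portion additionally tiled in duplicate, bp

print(coding + duplicated)   # -> 28,386 bp tiled per chip (Table 3.1)
print(26 * duplicated)       # -> 336,310 within-chip duplicate bp analyzed
print(f"{8 / 311_814:.4%}")  # -> ~0.0026% within-chip discordance (~0.0025% in Table 3.1)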
Table 3.1 Summary of Replicate Experiments

Total samples analyzed in replicate: 13
Total samples analyzed for replicate experiments: 26

Within-chip reproducibility
  Mitochondrial base pairs tiled in duplicate per chip: 12,935 bp
  Total within-chip duplicate base pairs analyzed: 26 × 12,935 = 336,310 bp
  Total within-chip duplicate base pairs called: 311,814 bp (92.71%)
  Discordant calls within chips: 8 bp (0.0025%)

Between-chip reproducibility
  Total mitochondrial base pairs tiled per chip: 28,386 bp
  Total base pairs analyzed in one set of 13 chips: 13 × 28,386 = 369,018 bp
  Total base pairs assigned in first set of chips: 350,010 bp (94.84%)
  Total base pairs assigned in second set of chips: 355,086 bp (96.22%)
  Total base pairs assigned in both sets: 345,094 bp (93.51%)
  Discordant base calls between chips: 10 bp (0.0028%)
3.5.3 Microbial Pathogen Resequencing
RAs can be a rapid, cost-effective, and efficient platform for resequencing microbial and viral genomes or genomic regions. Three recent studies have made use of RAs to resequence genomic regions from multiple strain isolates (Bacillus anthracis), viral genomes (SARS virus), and portions of microbial and viral genomes from human clinical samples, as discussed later.
3.5.3.1 Anthrax Resequencing (Bacterial Pathogens)
After the pathogen Bacillus anthracis was employed in the 2001 bioterror attacks in the United States, RAs were used to characterize genetic variations from multiple B. anthracis strains [6]. B. anthracis has two characteristics that make it an ideal system for RA-based resequencing. First, the B. anthracis genome consists of a main circular chromosome and two plasmids (pXO1, pXO2) and is approximately 5.2 Mb in size. The relatively small size of microbial genomes allows for rapid target DNA isolation. While target DNA was generated using LPCR [6], subsequent work demonstrated that WGA on higher-density RAs works equally well [M. Zwick, personal communication, 2007]. Second, the ABACUS base-calling model for anthrax is haploid (the same as for the X chromosome in human males), and haploid base calls on RAs have higher quality scores. A geographically diverse panel of 56 B. anthracis strains was resequenced using Affymetrix RAs, and base calls were determined with the RATools software package [6]. Each RA was capable of resequencing 29,212 bp, or about 0.5%, of the B. anthracis genome. Long PCR sample preparation and chip processing were conducted for 118 RAs. Analysis of these 118 RAs with the ABACUS software package showed that 115 were successful (97.5%). Experimental failure was declared when less than 60% of the total possible bases achieved quality scores exceeding the ABACUS user-defined threshold. For that study, the total threshold was set at 31 with a strand minimum of −2 [5], as determined by analysis of a replication experiment. The successful RAs (115) called 92.6% of the possible bases (3,109,539 bp out of a total possible 3,359,380 bp). Bases with quality scores that failed to meet the minimum threshold stemmed from two primary causes. Amplicon failure, which typically arises from LPCR failure, accounted for 1.1% of the uncalled bases. The remaining uncalled bases were distributed nonrandomly across the RAs and were composed of oligonucleotide probes with a significantly higher purine composition (P < 10^−22). A similar pattern emerged in an analysis focused on guanine-rich probes (P < 10^−9). These analyses demonstrated that both purine-rich and guanine-rich oligonucleotide probes were significantly more likely to fail at generating quality scores exceeding the experimental threshold.
RA data quality was assessed by two methods [6]. The first consisted of a replicate experiment in which 51 samples were independently hybridized on 102 RAs. Table 3.2 shows the results of this experiment. If repeatability could be related to accuracy, then this level of repeatability would correspond to a phred score of at least 61. Because repeated systematic errors would not be detected by a replicate experiment, the RA sequence data was also compared to shotgun sequence data from the identical strains (Table 3.2, part B). A total of 15 discrepancies were observed. Under the conservative assumption that all observed errors arose from the RA data, this level of accuracy would correspond to a phred score of at least 44. However, two lines of evidence suggested that such a conservative assumption was incorrect. First, at all discrepant sites, the RA called the base corresponding to the reference sequence, while the shotgun sequence called a SNP. Second, a closer examination of the sequence from a later assembly agreed with the RA base call at 14 of the 15 discrepant sites. The sole remaining discrepant site came from a single shotgun-sequencing read with a phred score of 7. Clearly, the shotgun coverage lacked sufficient depth at this site, and the RA base call seemed far more likely to be correct. Assuming that this is indeed the case, the observed level of sequencing accuracy corresponds to a phred score of 56. These data demonstrate that RA data quality from a single experiment matches, and in some cases may even exceed, the level obtained by multiple DNA sequencing reads generated by conventional DNA sequencing technologies. The study found no evidence of plasmid exchange or recombination altering the patterns of DNA sequence variation among B. anthracis strains in the regions that were resequenced. An excess of rare SNPs was observed. This pattern of B. anthracis genetic variation is consistent with expectations for a bacterial species that has undergone a rapid, historically recent expansion from a single clone.
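The phred scores quoted in this discussion follow directly from the discrepancy rates in Table 3.2 via the definition given earlier in the chapter, phred = −10 log10 P. A minimal check in Python:

from math import log10

def phred(p):
    """Phred-scaled quality for an error probability p."""
    return -10 * log10(p)

print(f"{phred(7.4e-7):.0f}")       # replication discrepancy rate -> ~61
print(f"{phred(3.8e-5):.0f}")       # all 15 discrepancies blamed on the RA -> ~44
print(f"{phred(1 / 398_452):.0f}")  # only the single unresolved site -> ~56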
Table 3.2 Assessing B. anthracis RA Data Quality

A. Replication Experiment
  Total number of bases called in Replicate 1: 1,383,229
  Total number of bases called in Replicate 2: 1,373,905
  Total number of bases called in both replicates: 1,349,177
  Total number of bases called differently: 1
  Replication experiment discrepancy rate: 7.4E−07

B. Accuracy Estimation Experiment
  Total number of bases called identically: 398,452
  Total number of bases called differently: 15
  Accuracy experiment discrepancy rate: 3.8E−05
This study demonstrated that high-throughput resequencing with RAs of multiple closely related strains could assist with studies aimed at the following goals: identifying a specific strain in environmental and clinical samples; determining forensic attribution and phylogenetic relationships of strains; and uncovering the genetic basis of phenotypic variation in traits such as mammalian virulence and antibiotic resistance.
3.5.3.2 SARS Resequencing (Viral Pathogens)
In April 2003 the Severe Acute Respiratory Syndrome (SARS) outbreak was attributed to the SARS coronavirus (SARS-CoV), a single-stranded RNA virus. Many point mutations among SARS-CoV genome sequences correlated closely with the SARS epidemic. These mutations can alter virulence and the therapeutic response of viral pathogens. To rapidly detect genetic variants in newly confirmed SARS cases and to further provide important clues to the geographic origins of infection, Wong et al. [27] used high-throughput RAs (NimbleGen Systems, Inc., Madison, Wisconsin) to resequence and track the evolution of the SARS coronavirus. With the aid of the ABACUS algorithm, the SARS-CoV resequencing chip repeatedly called greater than 99% of the bases and yielded highly accurate sequence (greater than 99.99%). Traditional ABI capillary sequencing (ACS) was used for independent confirmation to determine discordant calls (Table 3.3). In the table, ambiguous calls refer to those bases lacking sufficient information for high-confidence base assignment. The call rate is the percentage of genome sequence with high-confidence base calls. Accuracy is represented by the percentage of correctly called bases (as determined by ACS) over the total number of bases called (excluding any ambiguous calls). The rapid sequence turnaround time and high accuracy of the RA, coupled with its reasonable cost, make it an ideal platform for the global monitoring of any small-genome pathogen and its entire infected population.
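These call-rate and accuracy definitions are easy to mirror in code. In the Python sketch below, the SARS-CoV genome length (~29,700 bases) is an assumption for illustration, and the ambiguous and discordant counts mirror the SIN2500 row of Table 3.3:

genome = 29_700      # assumed SARS-CoV genome length, bases
ambiguous = 495      # no-calls ("Ns") lacking high-confidence support
discordant = 3       # called bases that disagree with capillary sequencing

called = genome - ambiguous
call_rate = called / genome
accuracy = (called - discordant) / called
print(f"call rate {call_rate:.2%}, accuracy {accuracy:.3%}")
# -> approximately the 98.33% call rate and 99.989% accuracy reported for SIN2500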
3.5.3.3 Infectious Agent Detection
With mounting concern over the threats posed by bioterrorism and the spread of novel infectious diseases, the rapid diagnosis of infections in human clinical samples has become an increasingly high priority. RAs are one possible technology that could efficiently and rapidly detect these infectious agents. Furthermore, compared to other current state-of-the-art technologies, which require multiple diagnostic tests to identify the infectious agent, RAs could (at least in principle) disclose a wide variety of infectious agents in a single assay. This capability was demonstrated by Lin et al. in 2006. They used a respiratory pathogen RA to survey pathogen nucleic acid sequences from clinical samples [28], and ABACUS allowed them to obtain unambiguous, reproducible, and accurate results. This application of RAs and ABACUS not only resulted in the simultaneous detection and differentiation of common circulating respiratory pathogens at clinically relevant sensitivity levels, but also identified coinfections and unusual pathogens that are rarely encountered or typically unexpected. In addition, these researchers demonstrated that RAs are able to discriminate and make strain-level identifications of several coinfectants in a complex mixture of samples. With the future expected to bring increases in density and decreases in cost, RA technology holds out great promise for the surveillance of broad-spectrum respiratory pathogens.
Table 3.3 Call Rate and Accuracy of SARS Resequencing Array

Array Sequence   Discordant Calls   Ambiguous Calls (Ns)   Call Rate   Accuracy
SIN2500          3                  495                    98.33%      99.989%
SIN2677          4                  179                    99.40%      99.986%
SIN2679          0                  138                    99.53%      100%
SIN2748          2                  223                    99.22%      99.993%
Vero isolate 1   1                  230                    99.02%      99.997%
Vero isolate 2   1                  328                    98.89%      99.997%
Vero isolate 3   4                  183                    99.38%      99.986%
Vero isolate 4   0                  198                    99.33%      100%
Vero isolate 5   0                  218                    99.36%      100%
Vero isolate 6   0                  210                    99.29%      100%
Vero isolate 7   0                  307                    98.96%      100%
Vero isolate 8   0                  220                    99.26%      100%
Tissue 1-1       1                  227                    99.24%      99.993%
Tissue 1-2       1                  982                    96.69%      99.997%
Tissue 1-3       2                  120                    99.60%      99.997%
Tissue 2         0                  278                    99.07%      100%
3.6 Further Challenges
With the current explosion of interest in next generation DNA sequencing technologies, the idea of a genome-sequencing center in every lab seems increasingly feasible. High-density oligonucleotide resequencing arrays are one possible technology well suited to fulfilling this vision. RAs can generate enormous quantities of very high-quality DNA sequence with minimal personnel and infrastructure
requirements. And yet, while RAs already are a powerful tool applicable to a variety of whole-genome analyses, significant challenges remain to be addressed. The most important technical challenge for RAs lies in the isolation of target DNA from large, complex eukaryotic genomes. For organisms with relatively small genomes, whole-genome amplification followed by labeling and hybridization is a viable strategy. While LPCR can be used to resequence large genomic regions, scaling this approach to obtain high call rates on regions as big as an entire eukaryotic chromosome arm is probably not within reach of a single laboratory. Clone-based methods have shown some promise, but they are still somewhat labor-intensive and relatively expensive. As RA feature density increases, this problem will only become more pronounced; therefore, this will remain an important area of research. Because RAs can only compare target DNA to complementary probes based upon a reference sequence, they are not well suited to identifying novel sequences or sequence rearrangements that may occur in the experimental sample. Furthermore, existing algorithms do not allow RAs to detect copy number variation (CNV) in diploid targets. The CNV class of variation, though rarer than the SNP, has nonetheless proved to be surprisingly frequent [29, 30]. While the number of features possible on a RA has grown rapidly, there are undoubtedly practical limits to the absolute number of oligos that can be synthesized on a single chip. Despite this limitation, for many genetic studies, particularly in human or mouse genetics, resequencing large genomic regions, rather than entire genomes, may be sufficient to identify specific mutations. Coupled with an appropriate target DNA isolation procedure, RAs could be very efficient at performing these types of studies. RAs also present a number of computational and bioinformatic challenges that will likely be addressed as this technology develops. Although ABACUS does perform extraordinarily well and has become the industry standard for analyzing RAs, algorithmic improvements are certainly within the realm of possibility. For instance, ABACUS is a purely statistical model that does not account for probe sequence composition. Other algorithms, like model-P [31], are based on probe sequences and have demonstrated improved base calling in diploid samples, making this an active area of research. The need to annotate and analyze the vast quantities of sequence data produced by such next generation sequencing technologies as RAs poses another challenge. Integrating these data with other existing genome annotations in current databases, like Ensembl [32] (http://www.ensembl.org/) or the UCSC Genome Bioinformatics site (http://www.genome.ucsc.edu), will be vital to maximizing their utility. In the final analysis, all next generation sequencing technologies will face a range of unique and common challenges as the technologies develop. RAs may ultimately be combined with other sequencing technologies to speed specific
research applications, while at the same time helping to fulfill the promised future of personalized medicine. Yet all the next generation sequencing technologies, which will generate ever-greater quantities of DNA sequence data while decreasing personnel and infrastructure requirements, will undoubtedly pose significant threats to the traditional industrial model of DNA sequence production. Envisioning a scientific landscape radically altered by a future where every laboratory is a genome-sequencing center promises to foment still other revolutions, the shape of which we can only begin to imagine.
References
[1] The International Human Genome Sequencing Consortium, "Initial Sequencing and Analysis of the Human Genome," Nature, Vol. 409, 2001, pp. 860–921.
[2] Collins, F. S., M. Morgan, and A. Patrinos, "The Human Genome Project: Lessons from Large-Scale Biology," Science, Vol. 300, 2003, pp. 286–290.
[3] Venter, J. C., et al., "The Sequence of the Human Genome," Science, Vol. 291, 2001, pp. 1304–1351.
[4] Shendure, J., et al., "Advanced Sequencing Technologies: Methods and Goals," Nat. Rev. Genet., Vol. 5, 2004, pp. 335–344.
[5] Cutler, D. J., et al., "High-Throughput Variation Detection and Genotyping Using Microarrays," Genome Res., Vol. 11, 2001, pp. 1913–1925.
[6] Zwick, M. E., et al., "Microarray-Based Resequencing of Multiple Bacillus Anthracis Isolates," Genome Biol., Vol. 6, 2005, p. R10.
[7] Lipshutz, R. J., et al., "High Density Synthetic Oligonucleotide Arrays," Nat. Genet., Vol. 21, 1999, pp. 20–24.
[8] Singh-Gasson, S., et al., "Maskless Fabrication of Light-Directed Oligonucleotide Microarrays Using a Digital Micromirror Array," Nat. Biotechnol., Vol. 17, 1999, pp. 974–978.
[9] Karolchik, D., et al., "The UCSC Table Browser Data Retrieval Tool," Nucleic Acids Res., Vol. 32, 2004, pp. D493–D496.
[10] Smit, A. F. A., and P. Green, "RepeatMasker," http://ftp.genome.washington.edu/RM/RepeatMasker.html.
[11] Raymond, C. K., E. H. Sims, and M. V. Olson, "Linker-Mediated Recombinational Subcloning of Large DNA Fragments Using Yeast," Genome Res., Vol. 12, 2002, pp. 190–197.
[12] Raymond, C. K., et al., "Targeted, Haplotype-Resolved Resequencing of Long Segments of the Human Genome," Genomics, Vol. 86, 2005, pp. 759–766.
[13] Hacia, J. G., and F. S. Collins, "Mutational Analysis Using Oligonucleotide Microarrays," J. Med. Genet., Vol. 36, 1999, pp. 730–736.
[14] Hacia, J. G., et al., "Oligonucleotide Microarray Based Detection of Repetitive Sequence Changes," Hum. Mutat., Vol. 16, 2000, pp. 354–363.
[15] Hacia, J. G., et al., "Determination of Ancestral Alleles for Human Single-Nucleotide Polymorphisms Using High-Density Oligonucleotide Arrays," Nat. Genet., Vol. 22, 1999, pp. 164–167.
[16] Hacia, J. G., et al., "Evolutionary Sequence Comparisons Using High-Density Oligonucleotide Arrays," Nat. Genet., Vol. 18, 1998, pp. 155–158.
[17] Hacia, J. G., et al., "Strategies for Mutational Analysis of the Large Multiexon ATM Gene Using High-Density Oligonucleotide Arrays," Genome Res., Vol. 8, 1998, pp. 1245–1258.
[18] Chee, M., et al., "Accessing Genetic Information with High-Density DNA Arrays," Science, Vol. 274, 1996, pp. 610–614.
[19] Halushka, M. K., et al., "Patterns of Single-Nucleotide Polymorphisms in Candidate Genes for Blood-Pressure Homeostasis," Nat. Genet., Vol. 22, 1999, pp. 239–247.
[20] Wang, D. G., et al., "Large-Scale Identification, Mapping, and Genotyping of Single-Nucleotide Polymorphisms in the Human Genome," Science, Vol. 280, 1998, pp. 1077–1082.
[21] Cargill, M., et al., "Characterization of Single-Nucleotide Polymorphisms in Coding Regions of Human Genes," Nat. Genet., Vol. 22, 1999, pp. 231–238.
[22] Sachidanandam, R., et al., "A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms," Nature, Vol. 409, 2001, pp. 928–933.
[23] Ewing, B., and P. Green, "Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities," Genome Res., Vol. 8, 1998, pp. 186–194.
[24] Ewing, B., et al., "Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment," Genome Res., Vol. 8, 1998, pp. 175–185.
[25] Maitra, A., et al., "The Human MitoChip: A High-Throughput Sequencing Microarray for Mitochondrial Mutation Detection," Genome Res., Vol. 14, 2004, pp. 812–819.
[26] Zhou, S., et al., "An Oligonucleotide Microarray for High-Throughput Sequencing of the Mitochondrial Genome," J. Mol. Diagn., Vol. 8, 2006, pp. 476–482.
[27] Wong, C. W., et al., "Tracking the Evolution of the SARS Coronavirus Using High-Throughput, High-Density Resequencing Arrays," Genome Res., Vol. 14, 2004, pp. 398–405.
[28] Lin, B., et al., "Broad-Spectrum Respiratory Tract Pathogen Identification Using Resequencing DNA Microarrays," Genome Res., Vol. 16, 2006, pp. 527–535.
[29] Sebat, J., et al., "Large-Scale Copy Number Polymorphism in the Human Genome," Science, Vol. 305, 2004, pp. 525–528.
[30] Sharp, A. J., et al., “Segmental Duplications and Copy-Number Variation in the Human Genome,” Am. J. Hum. Genet., Vol. 77, 2005, pp. 78–88. [31] Zhan, Y., and D. Kulp, “Model-P: A Basecalling Method for Resequencing Microarrays of Diploid Samples,” Bioinformatics, Vol. 21, Suppl. 2, 2005, pp. ii182–ii189. [32] Birney, E., et al., “Ensembl 2006,” Nucleic Acids Res., Vol. 34, 2006, pp. D556–D561.
4 Polony Sequencing Jay Shendure, Gregory J. Porreca, and George M. Church
4.1 Introduction
Conventional technology and industrial capacity for sequencing DNA polymers is many orders of magnitude ahead of the equivalents for direct RNA and protein sequencing. This is ironic, given that initial development of a DNA-sequencing method was the most difficult of the three, and took the longest [1]. The first attempts at DNA sequencing followed the precedent set by protein and RNA sequencing: detailed analysis of degradation products. However, the length and consequent complexity of DNA polymers proved to be significantly problematic, and a real breakthrough was not achieved until the development of the Maxam-Gilbert and Sanger methods in the late 1970s [2, 3]. The former proceeds by base-specific chemical degradation, and the latter by base-specific termination of polymerase-driven synthesis. For both approaches, however, it was the use of electrophoretic gels capable of resolving single base-pair length differences that proved critical to enabling long, accurate reads. Through the 1980s and 1990s, the Human Genome Project motivated a wide range of innovations in biochemistry and automation that brought the sequencing field to its present state. Highly optimized machines performing the Sanger method generate up to one megabase of sequence per day, at a cost of approximately $1 per raw kilobase. Although this low cost and high data-rate are the result of incremental improvements approximating a Moore's Law of DNA sequencing, it is generally recognized that most avenues for optimization have been exhausted. As we have discussed previously in a review of emerging sequencing technologies [4], there remains a long list of applications for DNA
sequencing that will be increasingly realistic and worthwhile, provided the cost of sequencing continues to drop. These applications provide a strong incentive for investing significantly in the development of alternative sequencing technologies. Methodologies under development can be broadly grouped into a small number of categories [4], namely: (1) microelectrophoresis, (2) nanopore sequencing, (3) resequencing by hybridization to microarrays, and (4) cyclic-array methods. We recently described a cyclic-array method for DNA sequencing [5], involving sequencing of PCR colonies tethered to beads (bead polonies) via a novel ligation-based chemistry. In this chapter, we will briefly review how our method works and discuss some of the existing and potential applications.
4.2 Overview
The term "polony," in the context of molecular biology, is an abbreviation of "POLymerase colONY" [6]. To quote from [7]: "Polonies are discrete clonal amplifications of a single DNA molecule, grown on a solid-phase surface. This approach greatly improves the signal-to-noise ratio. Polonies can be generated using several techniques that include solid-phase PCR in polyacrylamide gels, bridge PCR, rolling-circle amplification, BEAMing (beads, emulsions, amplification and magnetics) based cloning on beads and massively parallel signature sequencing (MPSS) to generate clonal bead arrays." The workflow for our implementation of polony sequencing can be divided into three steps: (1) library construction, (2) template amplification, and (3) sequencing. To briefly summarize our use of the technology as it was implemented in [5]: Genomic DNA (gDNA), isolated from an E. coli strain that was the product of an experimental evolution study, was converted by a purely in vitro protocol into a sequencing library. In the library, each ∼135 bp template contained two 17- to 18-bp tags of unique sequence (derived from the E. coli genome being sequenced), flanked and separated by universal adaptor sequences. Each set of tags on the same template molecule was "paired," in that the two tags were known to have derived from locations on the genome separated by a defined distance distribution. This mixture of templates was then amplified in parallel on 1-micron beads by emulsion PCR [8]. This resulted in a population of "clonal" beads, where each bead bears thousands of copies of the same library template, while a different bead will have copies of a different template, and so forth. Millions of beads were immobilized on a two-dimensional array and subjected to automated cycles of multiplex sequencing via a novel ligation-based chemistry. Sequencing data were acquired by rapid four-color imaging via epifluorescence microscopy. Although the reads obtained were short (a discontinuous 13 bp from each tag),
this information was sufficient to sensitively and specifically identify adaptive mutations and genomic rearrangements in the evolved E. coli strain. On our prototype instrument, we estimated our costs at roughly an order of magnitude below those of conventional sequencing as implemented by the major sequencing centers.
4.3 Construction of Sequencing Libraries
Conventional DNA sequencing libraries are generated by cloning fragmented gDNA into a vector which can be transformed into E. coli for propagation and amplification. Paired reads can be obtained by sequencing from either direction into inserts with a known size distribution. As we set out to develop a library construction protocol, we had several constraints in mind. First, the transformation step imposes a numerical bottleneck on the complexity of the library, and propagation in E. coli can bias the library against certain sequences. We therefore sought to develop a purely in vitro method of library construction. Second, we wanted the library method to be compatible with obtaining paired reads, which are proven to be useful for detecting rearrangements as well as for genome assembly. Third, we had observed that our intended amplification protocol, emulsion PCR, poorly amplified sequences that were greater than 200 bp. We therefore sought a method that would yield relatively short library molecules. With these constraints in mind, the method that we developed involves the following steps: (1) mechanical shearing of genomic DNA; (2) gel-based size selection of fragments approximately 1 kb in length; (3) end-repair and A-tailing of these fragments; (4) circularization with a short, ∼30-bp, T-tailed fragment (T30) that contained outward-facing sites for a Type IIs restriction enzyme, MmeI; (5) exonuclease treatment; (6) hyperbranched rolling-circle amplification (hRCA) of the circles, to increase the available number of molecules; (7) digestion with MmeI, to release T30 fragments along with paired, 17- to 18-bp tags (each set of which is derived from the ends of an ∼1-kb genomic DNA fragment; a toy sketch of this paired-tag release appears at the end of this section); (8) ligation of adaptor sequences; and (9) PCR amplification of the library to increase the available number of molecules. We are presently working to improve and simplify this protocol in a number of ways. (1) Analysis of our sequencing data clearly indicates that the bottleneck in the complexity of our sequencing libraries is due to inefficient circularization of genomic DNA fragments to the T30 adaptor. This is likely due to the poor efficiency of TA ligation. We are replacing the A-tailing step with the ligation of adaptors bearing longer ssDNA overhangs, enabling efficient self-circularization of library molecules. (2) MmeI yields tags of 17 or 18 bp, at approximately a 1:1 ratio. We are exploring a number of different Type IIs enzymes, including EcoP15I and AcuI, that yield tags of ∼27 bp and ∼16 bp,
respectively. We have generally found that Type IIs enzymes that yield shorter tags have less uncertainty in the precise position at which they cut. (3) Several polyacrylamide gel electrophoresis (PAGE) size purifications are required in our published protocol. We have switched to a column-based gel system in which size ranges are isolated by time-sensitive elution, significantly reducing the labor due to these steps.
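To make the paired-tag structure concrete, here is a toy Python sketch (illustrative sequences and a fixed 18-bp tag length, not the published protocol's exact coordinates): MmeI cuts a fixed distance outside its recognition sites in the T30 adaptor, releasing a short tag from each end of the circularized ∼1-kb genomic insert, so the two tags are thereafter known to derive from positions roughly 1 kb apart.

def paired_tags(fragment, tag_len=18):
    """Return the two end tags that MmeI digestion of the circle would release."""
    left_tag = fragment[:tag_len]    # tag from one end of the genomic insert
    right_tag = fragment[-tag_len:]  # tag from the other end
    return left_tag, right_tag

genomic_fragment = "ACGT" * 250      # toy ~1-kb insert
left, right = paired_tags(genomic_fragment)
print(left, right)  # a mate pair: two tags separated by ~1 kb on the genome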
4.4 Template Amplification with Emulsion PCR
Although direct interrogation of single nucleic acid molecules is gaining in practicality, DNA and RNA generally require amplification to enable analysis. The application of restriction enzymes to develop generalized cloning methods [9] opened the gates to modern molecular biology. Cloning allows isolation and perpetual high-fidelity amplification of a specific DNA fragment. The operation is "clonal," in that all resulting amplicons originated from the same single molecule. In a shotgun-sequencing pipeline, the multiplex construction of a complex library is essentially disentangled by transformation into E. coli. Individual colonies propagate and amplify a single sequencing template, but a bottleneck exists in that colonies must be serially picked and then fed to a highly parallel sequencing platform. Since 1988, methods have been developed that enable in vitro polymerase-driven amplification of single molecules, such that all amplicons, analogous to a bacterial clone, reside at a shared location [10]. In vitro amplification can be based on either the polymerase chain reaction (PCR) or rolling circle amplification. Avoidance of mixing between amplification products can be achieved most simply through template dilution and physically distinct wells [11, 12]. Alternatively, immobilization of at least one end of each amplicon to a two-dimensional [8, 13, 14] or three-dimensional [6, 15] solid support matrix can result in physically separated PCR colonies. A third method of compartmentalizing reactions involves water-in-oil emulsions [16]. These were originally used for in vitro directed evolution, but more recently have been modified to allow capture of clonal PCR reaction products on recoverable beads [8], as described in further detail later. We and others [7] use the term "polony" broadly, to describe systems where multiplex PCR amplification is performed, but amplicon mixing is prevented by local immobilization of amplicons to yield colonies of PCR products. "Rolony" refers to amplified colonies generated by rolling circle amplification, rather than PCR amplification. Our initial focus centered on in situ polonies, in which amplification was performed within a polyacrylamide gel to restrict diffusion, and PCR products were locally immobilized by covalent linkage of one of the PCR primers to the gel matrix itself [6, 15]. In this system, multiplex
hybridizations and enzymatic reactions to interrogate amplified in situ polonies were performed with oligonucleotides and dNTPs bearing fluorescent labels. Data were acquired via a laser-based scanning microarray reader. With colleagues, we developed applications for querying nucleic acids in ways that are extremely difficult with conventional technology, such as long-range direct molecular haplotyping across kilobases [17] or whole chromosomes [18], quantitation of mRNAs bearing complex, combinatorial patterns of alternative splicing [19], and sequencing-by-synthesis of short tags using cleavable fluorescent nucleotides [20].

We have subsequently transitioned to generating bead-based polonies via emulsion PCR (ePCR) as described in [8]. The ePCR protocol involves generating a water-in-oil emulsion that contains all PCR reactants in the aqueous phase. Template is included at a low concentration, such that the vast majority of compartments contain zero or one template molecule, though a small percentage contains multiple templates. Also included in the reaction are 1-µm paramagnetic beads that bear one of the PCR primers (the “reverse” primer) via a streptavidin-biotin linkage, such that one strand of each PCR product will be tethered to the beads. The forward primer and a low concentration of the reverse primer are present in the aqueous phase. The PCR reaction is carried out via standard thermocycling, albeit with a greater number of cycles. In compartments containing a single template, amplicons are captured to the beads, resulting in a clonal bead. In compartments containing zero or multiple templates, beads end up with no amplicons or mixed amplicons, respectively. The primary advantages of bead polonies/ePCR, relative to in situ PCR, include the small, uniform size and high signal density of the resulting sequencing features. These in turn enable data acquisition by epifluorescence microscopy, rather than a microarray scanner.

In ePCR, the consequence of lowering the template concentration to avoid mixed beads (resulting from multiple templates in one compartment) is an increase in the fraction of beads that are “empty” because their compartment contained no template. The ratios are governed by a Poisson distribution, and a typical experiment might yield 90.5% empty beads, 9% clonal beads, and 0.5% mixed beads (see the short calculation sketched at the end of this section). As empty beads occupy substantial territory on the sequencing arrays, we developed a method (described in [5]) to enrich the fraction of template-bearing beads. The method is based on hybridization of ePCR-generated beads to large (3-µm), low-density beads that bear a primer complementary to the forward PCR primer. As this sequence is present only on amplicons tethered to PCR-product-bearing beads, these beads selectively hybridize to the 3-µm capture beads. This fraction is separated from empty 1-µm beads by centrifugation through a glycerol gradient. The protocol can be repeated for increased levels of enrichment.
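The occupancy statistics quoted above follow directly from the Poisson distribution. The short sketch below (ours, not part of any published protocol) reproduces them for a mean occupancy of about 0.1 templates per compartment.

    from math import exp

    def bead_fractions(lam):
        """Fractions of empty, clonal, and mixed beads when template
        molecules are Poisson-distributed over emulsion compartments
        with mean occupancy lam (templates per compartment)."""
        empty = exp(-lam)                # P(0 templates)
        clonal = lam * exp(-lam)         # P(exactly 1 template)
        mixed = 1.0 - empty - clonal     # P(2 or more templates)
        return empty, clonal, mixed

    empty, clonal, mixed = bead_fractions(0.1)
    print(f"empty {empty:.1%}, clonal {clonal:.1%}, mixed {mixed:.1%}")
    # -> empty 90.5%, clonal 9.0%, mixed 0.5%

Raising the template concentration trades empty beads for mixed beads, which is why the enrichment step described above is preferable to simply loading more template.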
Other emerging sequencing technologies, such as that of 454 Corp., also rely on emulsion PCR. Contrasts between their ePCR protocol and that of [8] include (1) different bead sizes (25 µm versus 1 µm, which translates to a 625-fold difference in cross-sectional area and an approximately 15,000-fold difference in volume), and (2) markedly different emulsion compositions. It is also notable that the protocols of [8] have recently been updated by several papers that describe protocols for greater ePCR consistency and a method for using rolling circle amplification (RCA) to enable a more than hundredfold increase in signal [21, 22]. We are currently working to integrate these improvements into our sequencing platform protocols.
4.5 Sequencing

Arrays of sequencing features, produced by layering beads, can potentially be ordered or random. Ordered arrays have the advantage of defined feature separation and can enable very high packing densities, while disordered arrays are much easier to generate yet retain the potential for high feature density (by self-assembly of monolayers, for example). In our transition from in situ polonies to bead-based sequencing features, we adapted the gel-based system to enable monolayering of beads at the surface of a thin acrylamide gel. We have subsequently adapted these protocols to enable gel-free bead arrays [23]. In initial experiments, we have found that in the absence of the acrylamide gel, enzymatic reactions are more efficient and the resulting imaging data are significantly less noisy.

In cyclic-array methods, each cycle generally involves an enzymatic step in which a fluorescent label is incorporated into a primer that is hybridized to a constant sequence present on each template molecule. The enzymatic reaction provides specificity, and the identity of the fluorophore indicates whether a particular base is present at a given position. Enzymes that have been used in this way include polymerases and ligases. We described a sequencing method that relies on ligation of degenerate nonamers to an “anchor” primer that hybridizes immediately adjacent to the unknown sequence of interest [5]. In each cycle:

1. An anchor primer is hybridized either immediately 5’ or 3’ to one of the unknown tag sequences.

2. A population of random 9-mers is flowed in; each bears a fluorescent label. Depending on which base is being queried at a given cycle, the population of nonamers is structured such that the identity of a specific base position correlates with the identity of the attached fluorophore.
3. Data are acquired via four-color imaging of beads (e.g., A = Cy3, G = Cy5, T = FITC, C = Texas Red).

4. The anchor primer/nonamer complexes are stripped to reset the array prior to the subsequent cycle, which will interrogate a different base position. In [5], stripping was performed by enzymatic digestion at deoxyuridines included in the anchor primer, but we have since begun chemically stripping with alkali to reduce cost and time.

We have found that T4 ligase will perform ligation efficiently, but the fidelity of ligation extends only 6 to 7 bases (in both directions) from the actual site of ligation. Taq ligase (commercially known as Ampligase) has strong potential to extend the specificity to 9 bases [24]. The drawback of nonamer ligation for sequencing, as we have currently implemented it, is that reads are limited to relatively short lengths. However, by positioning the anchor primer at either end of each unknown tag, one can sequence from both directions into the unknown tag. In the described work, we were able to sequence 6 to 7 bp from each direction into each of two 17- to 18-bp tags. Although this provided 13 bp of sequence for each of the two tags, the sequence itself was not continuous, and the discontinuity was compounded by uncertainty in the distance between the 6-mer and the 7-mer arising from the “wobble” in the MmeI cut distance. Nevertheless, we were able to use this information to place reads against a reference E. coli sequence and identify mutations, as sketched below. We feel our success in making use of these data supports the case that, for a variety of goals discussed below, relatively modest read-lengths will be sufficient. Specifically, applications that put a larger priority on the total amount of sequence, or on the number of independent sequence tags obtained, may be more accepting of shorter reads. However, discontinuous 6- to 7-bp reads are not sufficient for many of these purposes, and we are therefore working to enable continuous 20- to 25-bp reads via modifications to ligation-based protocols.
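To make the placement problem concrete, the sketch below (our own illustration, not the published matching pipeline) treats one discontinuous tag read as a 6-mer and a 7-mer separated by 4 to 5 unread bases, the residue of the 17- to 18-bp tag and the MmeI wobble, and finds all compatible placements on a reference string with a regular expression.

    import re

    def place_tag_read(reference, left6, right7):
        """Find every position where a discontinuous tag read fits the
        reference: the 6-bp read, then 4-5 unread ('wobble') bases,
        then the 7-bp read.  Exhaustive scanning like this is only
        illustrative; a real pipeline would use an indexed lookup."""
        pattern = re.escape(left6) + r".{4,5}" + re.escape(right7)
        return [m.start() for m in re.finditer(pattern, reference)]

    reference = "TTGACCATAGGCATTACGGGACGTACCT"
    print(place_tag_read(reference, "GACCAT", "TTACGGG"))   # -> [2]

A placement is informative only when the 13 observed bases occur once in the reference; multiple compatible placements flag repeats, while near-matches are candidate mutations.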
4.6 Future Directions

In both academia [5] and industry [25], we have seen several alternative DNA sequencing technologies achieve important proof-of-principle milestones. These successes have led to broadened interest in, and funding for, enabling new technologies. In the 1990s, a single sequencing platform (namely, the ABI series of instruments) was generally considered the instrument of choice for high-throughput sequencing. It remains unclear whether a single approach will become dominant in this next generation of technologies. The range of applications of sequencing has expanded considerably (e.g., genome resequencing, de
novo sequencing, barcode sequencing, transcript tag sequencing, digital karyotyping, and so forth), and the critical parameters for these applications differ. For example, de novo sequencing is simplest with long, paired sequencing reads. For genome resequencing, the key is to achieve high final consensus accuracies, and reads need only be long enough to match to a reference sequence. For transcript tag sequencing, digital karyotyping, and barcode sequencing, the priority is on depth—obtaining the greatest possible number of independent reads. It is possible that one technology will excel at all of these, but it is also possible that different platforms may be the best choice for specific applications. In any case, as this field gathers momentum, it is clear that the cost of DNA sequencing will continue to fall. The completion of a human genome may have marked the end of the beginning of the era of sequencing and, more generally, the era of nucleic-acid technologies. It is hard to avoid the feeling that we have only scratched the surface of what it will be possible to do with DNA.
References

[1] Sanger, F., “Sequences, Sequences, and Sequences,” Ann. Rev. Biochem., Vol. 57, 1988, pp. 1–28.

[2] Maxam, A. M., and W. Gilbert, “A New Method for Sequencing DNA,” Proc. Natl. Acad. Sci. USA, Vol. 74, No. 2, February 1977, pp. 560–564.

[3] Sanger, F., S. Nicklen, and A. R. Coulson, “DNA Sequencing with Chain-Terminating Inhibitors,” Proc. Natl. Acad. Sci. USA, Vol. 74, No. 12, December 1977, pp. 5463–5467.

[4] Shendure, J., et al., “Advanced Sequencing Technologies: Methods and Goals,” Nat. Rev. Genet., Vol. 5, No. 5, May 2004, pp. 335–344.

[5] Shendure, J., et al., “Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome,” Science, Vol. 309, No. 5741, September 9, 2005, pp. 1728–1732.

[6] Mitra, R. D., and G. M. Church, “In Situ Localized Amplification and Contact Replication of Many Individual DNA Molecules,” Nucleic Acids Res., Vol. 27, No. 24, December 15, 1999, p. e34.

[7] Fan, J. B., M. S. Chee, and K. L. Gunderson, “Highly Parallel Genomic Assays,” Nat. Rev. Genet., Vol. 7, No. 8, August 2006, pp. 632–644.

[8] Dressman, D., et al., “Transforming Single DNA Molecules into Fluorescent Magnetic Particles for Detection and Enumeration of Genetic Variations,” Proc. Natl. Acad. Sci. USA, Vol. 100, No. 15, July 22, 2003, pp. 8817–8822.

[9] Cohen, S. N., et al., “Construction of Biologically Functional Bacterial Plasmids In Vitro,” Proc. Natl. Acad. Sci. USA, Vol. 70, No. 11, November 1973, pp. 3240–3244.

[10] Li, H. H., et al., “Amplification and Analysis of DNA Sequences in Single Human Sperm and Diploid Cells,” Nature, Vol. 335, No. 6189, September 29, 1988, pp. 414–417.
[11] Leamon, J. H., et al., “A Massively Parallel PicoTiterPlate Based Platform for Discrete Picoliter-Scale Polymerase Chain Reactions,” Electrophoresis, Vol. 24, No. 21, November 2003, pp. 3769–3777.

[12] Vogelstein, B., and K. W. Kinzler, “Digital PCR,” Proc. Natl. Acad. Sci. USA, Vol. 96, No. 16, August 3, 1999, pp. 9236–9241.

[13] Adams, C., and S. Kron, “Method for Performing Amplification of Nucleic Acid with Two Primers Bound to a Single Solid Support,” U.S. Patent No. 5641658, 1997.

[14] Lizardi, P. M., et al., “Mutation Detection and Single-Molecule Counting Using Isothermal Rolling-Circle Amplification,” Nat. Genet., Vol. 19, No. 3, July 1998, pp. 225–232.

[15] Chetverin, A. B., H. V. Chetverina, and A. V. Munishkin, “On the Nature of Spontaneous RNA Synthesis by Q Beta Replicase,” J. Mol. Biol., Vol. 222, No. 1, November 5, 1991, pp. 3–9.

[16] Tawfik, D. S., and A. D. Griffiths, “Man-Made Cell-Like Compartments for Molecular Evolution,” Nat. Biotechnol., Vol. 16, No. 7, July 1998, pp. 652–656.

[17] Mitra, R. D., et al., “Digital Genotyping and Haplotyping with Polymerase Colonies,” Proc. Natl. Acad. Sci. USA, Vol. 100, No. 10, May 13, 2003, pp. 5926–5931.

[18] Zhang, K., et al., “Long-Range Polony Haplotyping of Individual Human Chromosome Molecules,” Nat. Genet., Vol. 38, No. 3, March 2006, pp. 382–387.

[19] Zhu, J., et al., “Single Molecule Profiling of Alternative Pre-mRNA Splicing,” Science, Vol. 301, No. 5634, August 8, 2003, pp. 836–838.

[20] Mitra, R. D., et al., “Fluorescent In Situ Sequencing on Polymerase Colonies,” Anal. Biochem., Vol. 320, No. 1, September 1, 2003, pp. 55–65; erratum in Anal. Biochem., Vol. 328, No. 2, May 15, 2004, p. 245.

[21] Diehl, F., et al., “BEAMing: Single-Molecule PCR on Microparticles in Water-in-Oil Emulsions,” Nat. Methods, Vol. 3, No. 7, July 2006, pp. 551–559.

[22] Li, M., et al., “BEAMing Up for Detection and Quantification of Rare Sequence Variants,” Nat. Methods, Vol. 3, No. 2, February 2006, pp. 95–97.

[23] Kim, J. B., et al., “Polony Multiplex Analysis of Gene Expression (PMAGE) in a Mouse Model of Hypertrophic Cardiomyopathy,” Science, Vol. 316, No. 5830, 2007, pp. 1481–1484.

[24] Housby, J. N., and E. M. Southern, “Fidelity of DNA Ligation: A Novel Experimental Approach Based on the Polymerisation of Libraries of Oligonucleotides,” Nucleic Acids Res., Vol. 26, No. 18, September 15, 1998, pp. 4259–4266.

[25] Margulies, M., et al., “Genome Sequencing in Microfabricated High-Density Picolitre Reactors,” Nature, Vol. 437, No. 7057, September 15, 2005, pp. 376–380.
5 Genome Sequencing: A Complex Path to Personalized Medicine

Lewis J. Frey, Victor Maojo, and Joyce A. Mitchell
5.1 Introduction

With the completion of the Human Genome Project in 2003, the worldwide focus has shifted to converting this vast storehouse of information into innovative health-care solutions. The ultimate promise is to ensure that every person has optimum health throughout their lives. This promise has many parts, including adequate nutrition, clean water supplies, up-to-date immunizations, and regular health screening. The part of the promise to be fulfilled by knowledge and information stemming from genomics and other “omics” is still unfolding. The clear informational basis for bioinformatics, the famous rule of Crick’s central dogma of biology (DNA to RNA to protein), gave researchers concrete ground that facilitated the rapid completion of the Human Genome Project. The situation is quite different for biomedical informatics, where the path from genomic sequences to personalized medicine is complex, given the many difficulties of modeling systems in clinical medicine.

The purpose of this chapter is to review some of the progress made toward personalized medicine from the perspective of biomedical informatics. With respect to their usefulness for human health care, the vast and heterogeneous storehouses of information are just beginning to be tapped. A major push from the informatics community, as it focuses on how to organize, synthesize, and deliver information to providers and
the public worldwide in an effective manner that demonstrably improves health, is critical to realizing the potential of personalized medicine.

Personalized medicine does not have a crisp definition, but rather reflects a broad coalition of ideas brought to bear on an age-old notion of personalized care. In the Genomics and Personalized Medicine Act of 2006 [1], personalized medicine is defined as “…the application of genomic and molecular data to better target the delivery of health care, facilitate the discovery and clinical testing of new products, and help determine a patient’s predisposition to a particular disease or condition.” The tradition of health care is to focus on the patient, whose care is provided in the best possible manner, thus delivering personalized care. The “personalized medicine” movement incorporates the use of molecular analysis, and specifically methods evolving from knowledge of genomics, to better manage a patient’s disease or predisposition toward a disease [2]. The belief of the worldwide scientific community is that knowledge of genomics will contribute to better health outcomes. The actual approaches are not well defined, but would include genetic screening programs, genetic risk analyses, and the use of diagnostic and therapeutic modalities that are still evolving.

The realization of this dream of personalized medicine is tied to the discoveries of bioinformatics and the implementations of health information technology that have been developed by the pioneers of medical informatics. Thus the two subfields of biomedical informatics are dependent on each other for realizing their full level of success. While the development of new genetic tests and diagnostic modalities is steadily advancing, the progress of incorporating the results of such tests into electronic medical records and using them in clinical decision support is inching forward and would benefit from a coordinated worldwide approach from the biomedical informatics research community.

The realization of the dream of personalized medicine is also tied to the Internet and the public’s use of the World Wide Web. There is growing interest from the public in finding information related to health problems they see in their families, and also in where they might be tested for specific genetic mutations [3]. Web sites that provide genetic information linked to specific human health concerns have attracted growing interest. For example, the Genetics Home Reference is a Web site created by the National Library of Medicine as part of its consumer health information resources (see Figure 5.1) [4, 5]. Links to this resource from such health-news outlets as CNN and CNBC demonstrate that the public is interested in genetic and health topics and is willing to follow up news stories by investigating related Web links. This interest from the public is creating pressure on care providers worldwide to order more genetic tests and to be more educated about the range of possible genomic-based medicine as it evolves [6].
Figure 5.1 Home page of the Genetics Home Reference. This National Library of Medicine–supported site is focused on presenting the health-related results of the Human Genome Project to the public. The research findings are linked to specific health conditions.
5.2 Personalized Medicine

After the completion of the Human Genome Project and other “omics” projects, biomedical scientists and professionals aimed to introduce biological information, such as SNPs, microarray data, biomarkers, and ever-increasing data sets, into clinical practice. Researchers hoped that new findings could have immediate clinical applications, leading to new visions of medicine, such as genomic medicine or “personalized medicine” [7–12]. Personalized medicine aims to adapt these mechanisms to individual patients [13]. Nevertheless, each patient will not be treated differently from every other patient [14]. Rather, patients are divided into groups by genetic and other markers that predict disease progression and treatment outcome. Personalized medicine, based on genetic factors or markers and more specific drugs [15], cannot be a magical solution for every individual. Instead, drugs will be tailored for groups of people with similar or related genetic characteristics.

In this scenario of personalized medicine, pharmacogenetics is at the center of research and practice [14, 16–18]. Pharmacogenetics explains the different
responses of individuals to the same drugs. To validate preliminary findings and models, experiments (e.g., clinical trials) must be carried out with large numbers of patients [13]. Therefore, classical clinical studies must be redesigned to adapt to these new situations. Weatherall [19] has emphasized the cautious steps to be taken before personalized medicine can be applied in clinical medicine. Personalized medicine can show that diseases such as cancer or type II diabetes, formerly classed as unique, isolated categories, can be reconsidered and reclassified because they arise from different causes. This reclassification leads to new diagnostic and therapeutic procedures. Once this process is established, researchers must show that it is cost-effective to test specific markers and use specific drugs. Otherwise, health systems may refuse to meet the enormous costs of this new shift in medicine, particularly if there is not enough evidence of its significant impact in changing medical outcomes in the population.
5.3 Heterogeneous Data Sources

Over the last decade a whole area of “data integration” has been created and expanded. It includes approaches such as data warehousing, information linkage, data translation, and query translation [20], as well as techniques such as ontologies [21]. We earlier proposed that genomic data could be integrated into health information systems [22–24], but recent research [25–29] suggests that this bridging process will not be easy.

At the time of writing, around 900 public biological databases (e.g., genomic, proteomic, metabolomic, and others) were available to researchers and other professionals. These databases have been designed and maintained as a result of many biological research projects that produced a huge amount of heterogeneous information about genes, proteins, and genetic diseases. Public databases usually include diverse data, ranging from biochemical to public-health data. A larger number of organizations maintain their own databases, with access restricted for different reasons (e.g., socioeconomic, confidentiality, strategic), often focused on one specific area or topic.

Within biomedical databases, information is often inconsistent or missing. In systems biology, for instance, we can find problems related to functional annotations of genes and proteins, genotype-phenotype relationships, and kinetic values for enzymes or components of pathways [16]. In clinical systems, which span decades, the inconsistencies can be even greater. In analyses of old medical records stored on paper, in settings where computerized medical records were not available until recently, the rate of missing or inconsistent data is quite variable and often very high [30].
From a scientific perspective, these inconsistencies compound the problems of data integration and analysis, owing to the frequent variability across settings in experiments, techniques, procedures, theoretical approaches, cognitive biases, and other factors [25, 31, 32]. Given this lack of consistency, using data obtained from heterogeneous sources to advance scientific research presents many problems, especially in biomedicine. In the “omics” areas, a data-driven research approach has predominated. Multiple sources of genomic-scale data must be integrated to develop more precise descriptions of clinical phenotypes. For instance, gene expression data reflect the effect of oncogenes on metabolic pathways leading to oncological disease. We now know that cancer is not a precise and unique disease, but a number of pathologies with different causes and therapies. In fact, the differences among all these databases, in hardware, in software (e.g., operating systems, database management systems), and in semantics, as well as the differences mentioned above in scientific approaches, cultural environments, and cognitive biases (subtly hidden in the stored data), must be resolved if researchers want to truly integrate information and extract useful patterns. Frequently, data must be normalized [25], and common data models and coding systems must be used or developed to standardize the representation of genotype-phenotype information.
5.4 Information Modeling

In this effort to enhance information storage and data exchange, bioinformaticians can draw on work that the Object Management Group (OMG; http://www.omg.org) has done in developing the model-driven architecture (MDA) approach, which represents systems with graphical object models. The OMG proposes the development of platform-independent models using the Unified Modeling Language (UML). For modeling complex systems that combine clinical and genomic data, a special kind of UML model, called a domain model, can be used [33]. Such a domain model incorporates the scientific domain knowledge that is necessary for using clinical and genomic data in personalized medicine. The models can be used to communicate information about the system to both its developers and its users. For example, when creating a system about transcription and translation, the objects DNA, RNA, and protein should be modeled along with their associations.

Models that use domain information to represent knowledge about the data are needed in order for the field to create systems with better information representations and data exchange. Bioinformaticians are in a position to communicate the underlying body of knowledge to developers in order to create
such domain models. In bridging the technology and scientific communities, they can help create domain models that communicate semantically meaningful information about the data. Semantically meaningful information models allow developers to construct objects that have meaningful counterparts in their area of application. Since many systems are being developed independently, if developers are enabled to represent objects in a meaningful way, then the chances of having reusable objects, or at least objects that are easier to map between systems, are increased. Reuse of objects between systems will help improve information storage, data exchange, and the development of interoperable systems.
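As a minimal illustration, the fragment below sketches the transcription-and-translation example as a Python domain model; the class and attribute names are ours and are not drawn from any published caBIG or OMG model.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Protein:
        name: str

    @dataclass
    class RNA:
        sequence: str
        # Association: an RNA may be translated into a protein.
        translates_to: Optional[Protein] = None

    @dataclass
    class DNA:
        sequence: str
        # Association: a DNA sequence may be transcribed into many RNAs.
        transcribes_to: List[RNA] = field(default_factory=list)

    gene = DNA(sequence="ATGGCC")
    gene.transcribes_to.append(
        RNA(sequence="AUGGCC", translates_to=Protein(name="ExampleProtein")))

In an MDA setting the same structure would be drawn as a UML class diagram and the code generated from it, rather than written by hand.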
5.5 Ontologies and Terminologies

An interesting effort to normalize and reuse vocabularies and knowledge across different projects and groups is ontological development. Ontologies provide the semantics needed to bridge the gaps between heterogeneous data sources, and a formal language for information retrieval. The vocabularies currently being used to support MDA development at the National Cancer Institute (NCI) are those provided by the Enterprise Vocabulary Services (EVS), which support a number of terminologies for the needs of the NCI. The two products being used in MDA at the NCI are the NCI Thesaurus [34–37] and the NCI Metathesaurus [38]. The former is a reference terminology covering clinical care, translational research, and basic research; it contains information on 10,000 cancers and related diseases and 8,000 therapies. The NCI Metathesaurus is a mapping between multiple terminologies. It includes a specialized version of the UMLS [39, 40], focused on terminologies that can be related to cancer [33]. Among the terminologies that the Metathesaurus includes are LOINC [41], SNOMED CT [42], the Veterans Health Administration National Drug File (VA NDF), the Gene Ontology (GO) [43], and the MGED Ontology [44]. The NCI Metathesaurus contains 1.2 million concepts mapped to 2.9 million terms with 5 million relationships. Mapping these terminologies supports the goal of representing and combining clinical and genomic information. These terminologies help the developers of domain UML models by giving them a broad range of vocabularies to draw on. In this way the projects are combining bioinformatics and clinical informatics concepts in data models that support interoperable systems for the field. This is a key component of creating systems that support the development of personalized medicine.

In the medical domain, vocabulary and coding systems such as ICD-9 and ICD-10, SNOMED, LOINC, MeSH, UMLS, ICNP, GALEN, the NCI Thesaurus, MedDRA, and others are now used for ontology-related tasks. Although it
can be assumed that they are not “true” ontologies from a formal computing perspective, they have been used in some concrete applications (e.g., the UMLS) [21]. The Foundational Model of Anatomy (FMA) is an extended work concerned with the classes and relationships used for the symbolic representation of the structure of the human body, in a format that is understandable to humans and also navigable by computers. Specifically, the FMA is a domain ontology that represents a coherent body of explicit declarative knowledge about human anatomy.

In genetics, GO has experienced great success and leads ontological development in the genetic area. GO is a collaborative effort among different organizations and professionals to create a controlled vocabulary of gene and protein roles in cells, so as to consistently describe gene products across different databases. GO includes three structured, controlled vocabularies (or “light” ontologies) that describe gene products in terms of their associated biological processes, cellular components, and molecular functions.

There are no current worldwide standards that represent genotype-phenotype data with specific data models that can be used to enhance information storage and exchange. Different coding and vocabulary systems have been used in the medical domain in electronic health records, HIS, epidemiological surveillance systems, and so on, but only recently have proposals been made to link clinical and genetic concepts. The UMLS has recently included genetics terms [45] and the GO [46] within its medical vocabularies and nomenclatures. In general, these efforts have been supported by medical informatics professionals, whereas bioinformaticians have only more recently realized that they should dedicate their efforts to this area. In a semantic mediation system, the users (humans or machines) do not care about the specific format of the information source, but only about the terms contained in the ontology, from which a query can be built in a proper way. In this regard, ontologies can be understood and used by both humans and machines.
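The practical payoff of such term-to-concept mappings can be shown with a toy Metathesaurus-style index; the identifier CUI0001 and the synonym list below are invented for illustration only.

    # Strings from several source vocabularies resolve to one concept
    # unique identifier (CUI); the CUI and synonyms here are fabricated.
    concept_index = {
        "hereditary hemochromatosis": "CUI0001",
        "haemochromatosis, hereditary": "CUI0001",
        "hfe-associated hemochromatosis": "CUI0001",
    }

    def same_concept(term_a: str, term_b: str) -> bool:
        """True when both terms map to the same known concept."""
        a = concept_index.get(term_a.lower())
        return a is not None and a == concept_index.get(term_b.lower())

    print(same_concept("Hereditary hemochromatosis",
                       "Haemochromatosis, hereditary"))   # -> True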
5.6 Applications

In a research environment where thousands of devices, databases, Web-based documents, and other sources are used, clinico-genomics will need semantic mediation and novel computing techniques for its computationally demanding tasks. For instance, “data grids” can be used in the short term to enhance access to computationally demanding clinico-“omics” applications from remote sites [47, 48]. Bioinformatics Web and grid services can be orchestrated into intelligent workflows of different applications, organized using program managers and semantic mediators. Semantic mediation will be needed to intelligently organize such “choreography” [47, 49].
ACGT (Advancing Clinico-Genomic Trials on Cancer) [50] is funded by the European Commission as an integration project to design new methods and resources for cancer research. Twenty-five partners from Europe and Japan participate in ACGT. The goal of the project is to identify technological gaps and barriers in cancer research and create novel techniques for diagnosis and prevention, as well as to design new models of clinico-genomic trials that will facilitate the creation of new drugs and therapies in the context of personalized medicine. From a biomedical informatics perspective, ACGT aims to develop a biomedical grid infrastructure at a European level to conduct research on two different kinds of cancer: pediatric nephroblastoma and breast cancer. In this grid-based scenario, research on systems interoperability (based on the development of the “ACGT master ontology on nephroblastoma” and a semantic mediator to organize the choreography of different Web and grid services), in-silico simulation of drug design, data mining, and clinico-genomic information modeling and management is being pursued to develop novel approaches to clinico-genomic trials. In this framework, heterogeneous data from three ongoing clinical trials are linked with “omic” information into a common virtual repository, using an ontology-based approach. Such a repository includes different types of clinical and genomic information, such as numerical data, text, and images from patients participating in the trials, external public databases, and in-silico simulation. This project, ending in 2010, aims to solve some of the problems that arise in this kind of research: for instance, building an efficient biomedical grid infrastructure and carrying out in-silico simulations to design and test new drugs in the context of personalized medicine.

Two European systems that provide intermediate solutions to the hard challenge of integrating heterogeneous data sources are OntoFusion and DiseaseCard. OntoFusion is a tool developed to integrate remote databases. DiseaseCard is an information retrieval tool for accessing and integrating genetic and medical information for medical applications. OntoFusion [21] uses an ontology-based approach, designed to build virtual repositories of physical databases by mapping the schemas of the latter onto a domain ontology. Once these schemas have been normalized according to the semantic contents of the ontology, the virtual repositories can be automatically unified using a unification algorithm described previously [21]. OntoFusion has been designed mainly for biomedical researchers, although it is domain independent. One of its main features is the ability to integrate public and private databases using the semantic-based approach of the integration method. While its primary objective was database integration, its developers are expanding its features to use ontology-based approaches to automatically preprocess data coming from remote databases.
This process eliminates inconsistencies in terminologies, formats, scales, and so on [51]. In such a virtual worldwide environment, researchers can access, for instance, a large number of SNP databases, or link specific concepts of computerized medical records with similar concepts from remote databases. Finally, users can retrieve the information they need by using Google-like searches or by navigating the ontology representing the global virtual repository. Figure 5.2 shows a screenshot of the OntoFusion tool. On the left, the ontology represents the schema of the different (unified or individual) virtual repositories. On the right, the virtual repositories are physically displayed. By clicking these virtual repositories, the contents of the physical databases can be displayed, navigated, or retrieved. In this case, the figure shows the integration of different biomedical ontologies for developing a vocabulary server used by the application.

Different Web technologies can be used to access and share remote heterogeneous information. Web services are a good example. Web services are based on service-oriented architectures using established Web standards such as the eXtensible Markup Language (XML), the Web Services Description Language (WSDL), the Simple Object Access Protocol (SOAP), and Universal Description, Discovery, and Integration (UDDI). Web services are platform independent and can be invoked from any platform or architecture. For the different needs related to personalized medicine, hundreds of Web services are currently available to practitioners and researchers. To integrate such heterogeneous services, new standards are needed for managing semantic heterogeneity.
Figure 5.2 OntoFusion tool screen shot. OntoFusion builds virtual repositories of physical databases by mapping the schemas with a domain ontology.
Isolated services are often unable to solve complex problems, so the possibility of combining previously available services into customized workflows, adapted to the exact demands of researchers, is a new challenge. To reuse and “orchestrate” Web services, some approaches have been reported, such as Taverna [52] and BioMOBY [53].

DiseaseCard is a comprehensive resource for accessing disease data available over the Internet in different public remote databases (e.g., OMIM, Medline, Orphanet, PharmGKB). DiseaseCard was developed by the Bioinformatics Group at the University of Aveiro, Portugal, with the collaboration of the Institute of Health Carlos III, Spain. The ultimate goal of DiseaseCard is to provide biomedical practitioners, researchers, and patients with rapid access to useful information related to many rare and common genetic diseases. In the framework of personalized medicine, these kinds of tools can be a useful aid for practitioners, researchers, and the public in general to rapidly access information about different diseases over the Internet and link it to specific patients’ data. Figure 5.3 shows an example related to hemochromatosis. On the left, users can expand different folders related to different subjects (e.g., references, genetic tests and labs, polymorphisms, nucleotide sequences, metabolic pathways, protein structure, and others). When these nodes are expanded, the related information (which can be locally stored or retrieved from remote databases) is shown on the right side.

In the United States, the National Cancer Institute has taken up the challenge of combining genomic and clinical data for cancer research and treatment. The perspective has been one of building a well-specified infrastructure and developing interoperable tooling upon it. The effort has been undertaken by the Cancer Biomedical Informatics Grid (caBIG) community. Their approach, based on binding controlled terminology, is described below, along with some of the tooling developed for caBIG.

The UML models in caBIG contain class diagrams that represent the scientifically relevant objects that are part of running software systems. These are objects like DNA sequence, RNA sequence, and protein. Each class is composed of the class name itself (e.g., protein) and the attributes that belong to the class (e.g., uniProtKB, name, symbol, and so forth). Between the classes there are association links that convey the relationships between the classes. A description, stored in the UML model, is required for each class and attribute. Using UML domain models to instantiate the MDA framework gives a conceptual representation of the underlying scientific objects via the classes, attributes, associations, and descriptions. This model does not by itself provide an unambiguous representation, because developers at different sites, while using the same scientific domain knowledge, can create similar classes and attributes but with different names, configurations, and descriptions.
Figure 5.3 Screen shot example of using DiseaseCard to access information related to hemochromatosis. The Bioinformatics Group at the University of Aveiro, Portugal, in collaboration with the Institute of Health Carlos III, Spain, developed DiseaseCard (http://bioserver.ieeta.pt/diseasecard/).
The binding of controlled terminology to the model can mitigate this problem of different semantics being associated with the scientific concepts in the model. The binding does this by unambiguously specifying the description in the model with controlled terminology. The binding is accomplished by mapping concepts from the EVS NCI Thesaurus and Metathesaurus to the classes and attributes in the UML model. These are bound using the concept codes, or concept unique identifiers (CUIs), that are maintained by the EVS. The descriptions of the classes and attributes are used to determine which CUIs from the EVS terminology should be used for the mapping. This mapping is then used in the creation of the metadata for the data elements that represent the model. This generation of metadata is supported by a suite of tools and infrastructure developed at the NCI.

The NCI has implemented the ISO 11179 standard for representing metadata in its Cancer Data Standards Repository (caDSR). This incorporates the data structures and format of ISO 11179 to store a data element (DE) as a combination of a data element concept (DEC) and a value domain (VD). The DEC maintains the set of CUIs associated with the DE, and the VD
specifies the data type and permissible values associated with the DE. Since ISO 11179 does not have associations in its representation, the caDSR extends the standard to store the metadata for the associations between the UML classes, in order to more fully support the MDA approach and incorporate the associations that are in the UML information models.

The potential benefit of binding UML models to terminology is the ability to more readily support reuse of model elements. Once an attribute is mapped to the terminology, it has a defining series of concept codes; this, along with the value domain, specifies the data element. Hence, if another model has an attribute mapped to the same series of concept codes and value domain, then the two models share an identical attribute, regardless of any differences in naming conventions or descriptions between the two models. The scientific meaning is encapsulated in the series of concept codes and the value domain, which are identical for both models. The point of this system is to help developers and users reach consensus and converge on common models for their systems. In this way, tooling can be built in an interoperable fashion.

Three tissue banking and pathology tools have been developed using the caBIG infrastructure to coordinate the underlying information models: caTISSUE Core, the caTISSUE Clinical Annotation Engine (CAE), and the cancer Text Information Extraction System (caTIES). The caTISSUE Core system is used to inventory and track biospecimens, including searching for specimens and requesting them for studies. The CAE annotation system is based on the College of American Pathologists (CAP) cancer protocols [54]. It has the functionality to import data from anatomic pathology laboratory systems, cancer tumor registries, and clinical pathology laboratory systems. The information is clinically oriented and is tightly tied in with the caDSR, enabling the definitions from the EVS terminology to be displayed as field titles in the interface. The caTIES tool extracts structured text from free-text surgical pathology reports and encodes it in caBIG-compliant terminology. This enables researchers to search for annotated tissue over structured terminology instead of free text in order to obtain relevant biospecimens.

The Cancer Translational Research Informatics Platform (caTRIP) system utilizes the EVS terminology and the metadata in the caDSR to run distributed queries on federated data resources, including the caTISSUE tools that have their metadata registered in the caDSR. This is an example of UML models bound to controlled terminology enabling systems to interoperate more effectively. A specific example of interoperability using the caBIG MDA infrastructure is illustrated by the CAE and caTRIP systems. The UML model for breast biomarkers was developed at the University of Pittsburgh in the Biomedical Informatics Department and implemented for their CAE system. It contains the
class Breast Biomarkers with the attribute her2Status [Figure 5.4(a)]. The class and attribute are mapped onto EVS concepts: the class, Breast Biomarkers, is mapped to Breast Carcinoma Tumor Marker [Figure 5.4(b)], and the attribute, her2Status, is mapped to Her2/Neu Status. These supply a list of concept codes with definitions that uniquely identify the common data element. The permissible values are also specified and are viewable under the Permissible Values tab in the CDE Browser (http://cdebrowser.nci.nih.gov/CDEBrowser/). These values are “Positive” and “Negative” and also have EVS concept codes associated with them.
Figure 5.4 The breast cancer biomarkers class (a) from the CAE UML model, developed at the University of Pittsburgh, has been mapped to the common data elements stored in the caDSR and is viewable with the NCI’s CDE Browser (b). Panel (a) shows the BreastCancerBiomarkers class with the string attributes erStatus, prStatus, her2Status, her2TestType, and efgrStatus.
This CDE is used in the CAE to store the status of the genetic test for her2. The caTRIP system (https://cabig.nci.nih.gov/tools/caTRIP) leverages the metadata in the caDSR with respect to the CAE, using it to construct the attribute filters for its distributed queries. In Figure 5.5 the attribute filter for Her2/Neu Status is specified. This reuses the CDE for her2Status developed for the CAE: the definition of the test result is identical, as are the permissible values. This is seen in the screen shot, with “Positive” in the Value field for the search on available biospecimens with a positive her2 status. Other filters can be added, such as the type of test used (e.g., ImmunoHistoChemistry); this also comes from the CAE UML model, with the attribute her2TestType (see Figure 5.4). The returned rows of available tissue with positive test results demonstrate an intermediate solution using MDA to achieve interoperability across institutions for genetic testing related to cancer.

The caIntegrator system (https://cabig.nci.nih.gov/tools/caIntegrator) combines a variety of biomedical data related to clinical trials together with bioinformatics experimental data. These latter data types include immunohistochemistry (IHC), microarray-based gene expression, and SNPs. The tools support the analysis of these data in an integrated system.

Some proposals aim to link genotype and phenotype information. One example is the Polymorphism Markup Language (PML), a language for representing and storing SNP (single nucleotide polymorphism) information [46]. This project, launched by a broad international consortium, aims to model the variation of genetic information, including a whole range of mutations. Following a different direction, the IBM Haifa Group has led the HL7 Clinical Genomics special interest group (SIG) to create standards for exchanging clinical and genomic data [55], by using genomic data in health care to support personalized medicine, or by integrating genomic data into classical electronic health records, linked to emerging bioinformatics formats such as MAGE-ML for gene expression or BSML for sequencing data. Its genotype model includes various types of genomic data, such as sequencing, expression, and proteomics data. Developers have tested this model on cystic-fibrosis data, bone-marrow transplantation, and pharmacogenetics projects [55].

From a European perspective, the European Commission has launched several initiatives since 2001 related to specific contributions that biomedical informatics can make to personalized medicine. A preliminary conference, called “Synergy between Research in Medical Informatics, Bioinformatics and Neuroinformatics: Knowledge Empowering Individualized Healthcare and Well-Being,” was held in Brussels in 2001. In June 2006, another meeting, “ICT for BIO Medical Sciences 2006,” analyzed the results obtained in the five years since 2001. In this time, various conferences, projects, and other activities were carried out.
Figure 5.5 The caTRIP Federated Query Builder, developed at Duke University, uses metadata consistent with that in caDSR to populate its query tool. The Her2Neu Status is the same as the one from the CAE system as demonstrated in Figure 5.4.
The BIOINFOMED study [22], delivered in 2002, established various significant challenges for biomedical informatics at a European level. These challenges were related to linking clinical and genomic information for biomedical research and practice, in issues such as biobanking, genomic-based computerized medical records, pharmainformatics, integrated ontologies, and integrated access to clinical and genomic databases. In summary, the goal was to introduce a strong scientific biological basis into clinical medicine that could lead to better diagnostic and therapeutic procedures.

The Network of Excellence (NoE) [56], an initiative started in 2004 to promote biomedical informatics in Europe in support of personalized health care, was created to launch different directions and ideas, aiming to establish a common meeting place for the biomedical informatics area within the European Union. In this NoE, several workpackages were envisioned to deal with dissemination, training and mobility, and data modeling, integration, and mining, as well as four clinical pilots, all of them linking different types of biomedical information. These pilots were designed to create new biomedical informatics approaches in (1) pharmacoinformatics, (2) genomics and infectious diseases, (3) periodontitis, and (4) genetics and colon cancer. These four pilots are examples of the different approaches and problems that biomedical informatics can face in the new environment of genomic medicine, described previously [57]. For instance, the periodontitis pilot project, led at the VU University Medical Center in Amsterdam, aimed to develop informatics methods to store and analyze clinical and genomic information. Periodontitis is an example of a chronic infectious and inflammatory disease caused by multiple factors (genetic, infectious, geographic, and environmental) that affects the tooth-supporting tissues. It affects more than 10% of the adult population and nearly 30% of elderly people, increasing the risk of cardiovascular disease in this group. Periodontitis was selected since it seems to have a small number of triggering factors, facilitating clinico-genomic research. A database has been built including clinico-genomic information. PML and other models are being used to represent genotype-phenotype links, and data-mining methods are being applied to such datasets to discover links between clinical information and genetic traits.

In addition to the intermediate opportunities of developing applications that serve specific biological and clinical needs, and the hard challenge of integrating heterogeneous data sources, there are examples of immediate low-hanging fruit. Creating an infrastructure to standardize genetic tests would realize an immediate benefit of personalized medicine. The first step is establishing a genetic test that identifies the specific mutation, or the set of mutations or functional changes, for the disorder in question. In many cases, such tests are registered on either a research basis or a clinical basis in the GeneTests system (http://www.genetests.org/), an international registry of all available genetic tests for clinical disorders. As of March 2006, there were over 900 different
genes and their variants that could be tested on a clinical basis, and another 300 genes that could be tested on a research basis, at almost 600 laboratories worldwide. These tests are performed by a variety of methods, depending on whether the test looks for a specific characteristic set of mutations or determines the DNA sequence of a section of the gene (or the entire gene and its flanking regions). All of these tests use simple free text to report results to the requesting physicians, with little or no standardization, making the results unsuitable for decision support within an electronic medical record and often resulting in information loss. Thus, these are a natural target for a standards-based approach.

A specific example of standardizing genetic tests is creating the infrastructure for reporting genetic tests on hemochromatosis. The first step would be for a clinician to order a genetic test to determine whether a patient has a genetic mutation that is causative for hemochromatosis. The reason for the test might be suspicious clinical symptoms (e.g., elevated serum ferritin, hepatic cirrhosis or fibrosis, hepatoma) or the diagnosis of hemochromatosis in a sibling. Since there are five different genes that cause various types of hemochromatosis, we will use the HFE gene as the example, since it is the gene most commonly related to this disorder in the United States. The ARUP laboratory at the University of Utah is one of the five largest molecular diagnostic testing laboratories in the United States. Its HFE test is a targeted mutation analysis looking for the C282Y, H63D, and S65C mutations. C282Y, the most common mutation in the HFE gene, is a G-to-A transition at nucleotide 845, which substitutes a tyrosine (Y) for a cysteine (C) at amino acid position 282. The ARUP result report explains the basic test and includes a textual interpretation, plus algorithms for diagnosis and other screening tests. A positive case report includes all of this information and is sent to the ordering physician as a multipage printed document, unless the order came from another laboratory that can receive an electronic report, in which case it is sent electronically as a multipage textual report. Printed documents are usually put into a paper chart at the receiving end, perhaps then scanned into an electronic record, but the results are not available in computable format using standardized vocabulary and data fields. Therefore, these test results are usually limited to the person who ordered the test and certainly not readily accessible to the next physician who sees the patient. This is low-hanging fruit in that a standardized report can be agreed upon relatively easily, which then opens the results up to more than just the original physician who saw the patient.

Standards are being developed for the representation of concepts and test results from the various kinds of genetic tests. The Human Genome Variation Society has a nomenclature for the description of sequence variations (http://www.genomic.unimelb.edu.au/mdi/mutnomen/), the American College of Medical Genetics has policy statements and nomenclature standards (http://www.acmg.net/resources/policies/pol-027.pdf), and the LOINC
Committee (Logical Observation Identifiers Names and Codes) has developed codes for the most common genetic tests and is developing specifications for many additional genetic tests (http://www.regenstrief.org/loinc/download/loinc/guide/LOINCManual.pdf). There is also an initiative underway to create standard definitions for genetic-testing panels using the LOINC coding system. This activity is a collaborative effort of the HL7 Clinical Genomics SIG, the LOINC Committee, CDISC, and investigators in the life sciences group at IBM. It will make it possible to order genetic tests using standard coded terms and transmit the orders using HL7 messages and standard test codes. Clinical, phenotypic, and family-history-related data that accompany such orders will also be encoded, so that testing laboratories will be able to create algorithms based on the data that guide the technologists in their selection of tests to be performed. Results of the testing will be returned in HL7 messages using standard codes as well. The LOINC code for a laboratory test result for the HFE mutation C282Y is:

21695-2 HFE GENE.P.C282Y ARB PT BLD/TISS ORD MOLGEN

Translated as a narrative description, the fully specified LOINC name indicates that the C282Y mutation (as named using protein-based nomenclature, “P”) of the HFE gene was tested for on a blood (BLD) or tissue (TISS) sample; the remaining fields follow the LOINC axes, indicating an arbitrary property (ARB) measured at a point in time (PT) and reported on an ordinal scale (ORD) by a molecular genetics method (MOLGEN). The result value associated with this test code would indicate: no mutation present, heterozygous mutation present, or homozygous mutation present. The precise definition of these test codes is essential if gene-testing results are to be referenced in computable rules and protocols.

Work is being done to store results of genetic testing in a coded format in the electronic medical record (EMR). If this LOINC-encoded test result is incorporated into the EMR, then computable recommendations for treatment can be represented using a standard decision-support language like Arden Syntax or GELLO (an object-oriented guideline expression language). These computable recommendations can be shared, implemented, and executed in any EMR system, as long as the system uses standard coded terminologies such as LOINC, SNOMED CT, or NDF-RT. The basic recommendations are to follow the serum ferritin level of the patient every few years, keeping in mind that serum ferritin levels are age and sex dependent. If the serum ferritin becomes elevated, then the physician should suggest that the patient become a blood donor or have frequent phlebotomies. If the serum ferritin level approaches 1,000 ng/mL, then the patient should be evaluated for difficulties arising from iron storage, including liver biopsy. If the patient has liver fibrosis and elevated serum ferritin, then the physician should keep a close watch for hepatocellular cancers. Finally, the
appropriate relatives should be tested for the same mutation. All of these recommendations can be structured in a computable knowledge representation format using standards, and once the algorithms have been tested and verified, the set of recommendations can be made available to the worldwide community.
The activities of the Biomedical Informatics project to translate clinical research findings into clinical practice and communities will focus on the standards and tool sets by which such translation can occur within the gene-testing domain. To accomplish this task, what is needed are clinical geneticists who work on the impact of genomics on clinical information systems; experts who are familiar with LOINC and in a position to incorporate new genetic tests into the LOINC system and into laboratory information systems; and experts who will develop the clinical decision support system rules in a standard form that can be used in all EMRs. One such system is being tested in the Cerner system at the University of Utah and the GE/IDX system at Intermountain Healthcare, a community health care delivery system that operates multiple clinics and hospitals within Utah, Idaho, and Wyoming. Ties have already been established with both the Cerner Corporation and the GE Corporation to start work on translational research methods using sharable standards. Additionally, there is very close collaboration with the Department of Veterans Affairs field office in Salt Lake City, including work on a number of federal interoperability projects relevant to data exchange standards and formats.
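To make the reporting pathway concrete, the fragment below shows how a LOINC-coded HFE result might be carried in an HL7 version 2 observation (OBX) segment. This is a hand-built illustration, not an actual message produced by ARUP or any of the systems named above, and the plain-text result value in the fifth field is an assumption.

```
OBX|1|CWE|21695-2^HFE gene p.C282Y^LN||Heterozygous||||||F
```

Here the third field identifies the test with the LOINC code (the trailing LN names the LOINC coding system), the fifth field carries the result value, and the final field marks the result as final. Because both the test identifier and the result are coded, a receiving EMR can route the value into decision support rules rather than into free text.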
5.7 Conclusion
Like all interdisciplinary areas, educational needs are demanding and complex. In the case of medical informatics, basic training included topics related, of course, to medicine (including clinical medicine and decision-making, public health, or health-services research) and computer science (including AI, probability, or statistics). Personalized medicine will also demand new knowledge and expertise. For instance, physicians will need to learn concepts related to genetics or systems biology, whereas biologists and bioinformaticians will have to deal with clinical data and issues that have been unknown to them until now. Such complexity may increase even more dramatically if nanotechnology begins to have a significant impact on clinical practice beyond current laboratory research. Large initiatives will be necessary to create the tooling, interoperability, and scientific-domain-driven knowledge base needed to effectively advance personalized medicine. After some preliminary overoptimistic expectations, it seems clear now that genetics alone cannot transform medicine [58, 59]. Research on the discovery of biomarkers that can be detected before clinical onset has signaled molecular profiling as a great challenge for personalized medicine, but biomarkers with
adequate specificity and sensitivity values are still scarce for most diseases. Biomarkers must be evaluated in order to demonstrate their medical significance and cost-effectiveness [60]. In order to achieve this, heterogeneous clinical and genomic data sources must be integrated in scientifically meaningful and productive systems. This will include hypothesis-driven scientific research systems along with well-understood information systems to support such research. These in turn will enable the faster advancement of personalized medicine.
References
[1] S. 3822 [109th]: Genomics and Personalized Medicine Act of 2006, http://www.govtrack.us/congress/billtext.xpd?bill=s109-3822.
[2] Personalized Health Care RFI, January 26, 2007, Personalized Medicine Coalition Response, http://www.personalizedmedicinecoalition.org/PMC_Response_to_HHS_RFI_on_HIT_final_1_25_071.pdf.
[3] Johnson, J. D., et al., “Genomics—The Perfect Information-Seeking Research Problem,” J. Health Commun., Vol. 10, No. 4, June 2005, pp. 323–329.
[4] Mitchell, J. A., F. Fun, and A. T. McCray, “Design of Genetics Home Reference: A New NLM Consumer Health Resource,” J. Am. Med. Inform. Assoc., Vol. 11, 2004, pp. 439–447.
[5] Patrick, T. B., et al., “Evidence-Based Retrieval in Evidence-Based Medicine,” J. Med. Libr. Assoc., Vol. 92, No. 2, 2004, pp. 196–199.
[6] Sifri, R., et al., “Use of Cancer Susceptibility Testing Among Primary Care Physicians,” Clin. Genet., Vol. 64, 2003, pp. 355–360.
[7] Collins, F. S., Shattuck Lecture, “Medical and Societal Consequences of the Human Genome Project,” N. Engl. J. Med., Vol. 341, 1999, pp. 28–37.
[8] Abrahams, E., G. S. Ginsburg, and M. Silver, “The Personalized Medicine Coalition: Goals and Strategies,” Am. J. Pharmacogenomics, Vol. 5, No. 6, 2005, pp. 345–355.
[9] Meadows, M., “Genomics and Personalized Medicine,” FDA Consum., Vol. 39, No. 6, November-December 2005, pp. 12–17.
[10] Nicholson, J. K., “Global Systems Biology, Personalized Medicine and Molecular Epidemiology,” Mol. Syst. Biol., Vol. 2, No. 52, 2006.
[11] Haselden, J. N., and A. W. Nicholls, “Personalized Medicine Progresses,” Nat. Med., Vol. 12, No. 5, May 2006, pp. 510–511.
[12] West, M., et al., “Embracing the Complexity of Genomic Data for Personalized Medicine,” Genome Res., Vol. 16, No. 5, May 2006, pp. 559–566.
[13] Gurwitz, D., J. E. Lunshof, and R. B. Altman, “A Call for the Creation of Personalized Medicine Databases,” Nat. Rev. Drug Discov., Vol. 5, No. 1, January 2006, pp. 23–26.
[14] Davies, S. M., “Pharmacogenetics, Pharmacogenomics and Personalized Medicine: Are We There Yet?” Hematology Am. Soc. Hematol. Educ. Program, 2006, pp. 111–117.
[15] Dietel, M., and C. Sers, “Personalized Medicine and Development of Targeted Therapies: The Upcoming Challenge for Diagnostic Molecular Pathology. A Review,” Virchows Arch., Vol. 448, No. 6, June 2006, pp. 744–755.
[16] Klein, T. E., et al., “Integrating Genotype and Phenotype Information: An Overview of the PharmGKB Project,” Pharmacogenomics J., Vol. 1, No. 3, 2001, pp. 167–170.
[17] Sadee, W., and Z. Dai, “Pharmacogenetics/Genomics and Personalized Medicine,” Hum. Mol. Genet., Vol. 14, Spec. No. 2, October 15, 2005, pp. R207–R214.
[18] Woodcock, J., “The Prospects for ‘Personalized Medicine’ in Drug Development and Drug Therapy,” Clin. Pharmacol. Ther., Vol. 81, No. 2, February 2007, pp. 164–169.
[19] Weatherall, D., “Sir David Weatherall Reflects on Genetics and Personalized Medicine,” Drug Discov. Today, Vol. 11, Nos. 13–14, July 2006, pp. 576–579.
[20] Sujansky, W., “Heterogeneous Database Integration in Biomedicine,” J. Biomed. Inform., Vol. 34, No. 4, 2001, pp. 285–298.
[21] Alonso-Calvo, R., et al., “An Agent- and Ontology-Based System for Integrating Public Gene, Protein, and Disease Databases,” J. Biomed. Inform., Vol. 40, No. 1, February 2007, pp. 17–29.
[22] Martin-Sanchez, F., et al., “Synergy Between Medical Informatics and Bioinformatics: Facilitating Genomic Medicine for Future Health Care,” J. Biomed. Inform., Vol. 37, No. 1, 2004, pp. 30–42.
[23] Del Fiol, G., et al., “Integrating Genetic Information Resources with an EHR,” AMIA Ann. Symp. Proc., 2006, p. 904.
[24] Mitchell, J. A., “The Impact of Genomics on E-Health,” Stud. Health Technol. Inform., Vol. 106, 2004, pp. 63–74.
[25] Searls, D. B., “Data Integration: Challenges for Drug Discovery,” Nat. Rev. Drug Discov., Vol. 4, No. 1, January 2005, pp. 45–58.
[26] Sax, U., and S. Schmidt, “Integration of Genomic Data in Electronic Health Records—Opportunities and Dilemmas,” Methods Inf. Med., Vol. 44, No. 4, 2005, pp. 546–550.
[27] Mitchell, D. R., and J. A. Mitchell, “Status of Clinical Gene Sequencing Data Reporting and Associated Risks for Information Loss,” J. Biomed. Inform., Vol. 40, No. 1, February 2007, pp. 47–54.
[28] Mitchell, J. A., A. T. McCray, and O. Bodenreider, “From Phenotype to Genotype: Issues in Navigating the Available Information Resources,” Methods Inf. Med., Vol. 42, No. 5, 2003, pp. 557–563.
[29] Adida, B., and I. S. Kohane, “GenePING: Secure, Scalable Management of Personal Genomic Data,” BMC Genomics, Vol. 7, No. 93, 2006, pp. 1–10.
[30] Sanandres-Ledesma, J. A., et al., “A Performance Comparative Analysis Between Rule-Induction Algorithms: Application to Rheumatoid Arthritis,” Lecture Notes in Computer Science, Vol. 3337, 2004, pp. 224–234.
[31] Pazzani, M., “Knowledge Discovery from Data?” IEEE Intelligent Systems, Vol. 15, No. 2, 2000, pp. 10–13.
[32] Maojo, V., “Domain-Specific Particularities of Data Mining: Lessons Learned,” Lecture Notes in Computer Science, Vol. 3337, 2004, pp. 235–242.
[33] Komatsoulis, G. A., et al., “caCORE Version 3: Implementation of a Model Driven, Service-Oriented Architecture for Semantic Interoperability,” J. Biomed. Inform., 2007.
[34] Sioutos, N., et al., “NCI Thesaurus: A Semantic Model Integrating Cancer-Related Clinical and Molecular Information,” J. Biomed. Inform., 2006.
[35] Hartel, F. W., et al., “Modeling a Description Logic Vocabulary for Cancer Research,” J. Biomed. Inform., Vol. 38, No. 2, 2005, pp. 114–129.
[36] Fragoso, G., et al., “Overview and Utilization of the NCI Thesaurus,” Comparative and Functional Genomics, 2004, p. 5.
[37] de Coronado, S., et al., “NCI Thesaurus: Using Science-Based Terminology to Integrate Cancer Research Results,” Medinfo, Vol. 11, Pt. 1, 2004, pp. 33–37.
[38] Covitz, P. A., et al., “caCORE: A Common Infrastructure for Cancer Informatics,” Bioinformatics, Vol. 19, No. 18, 2003, pp. 2404–2412.
[39] Lindberg, C., “The Unified Medical Language System (UMLS) of the National Library of Medicine,” J. Am. Med. Rec. Assoc., Vol. 61, No. 5, 1990, pp. 40–42.
[40] Tuttle, M. S., et al., “The Homogenization of the Metathesaurus Schema and Distribution Format,” Proc. Ann. Symp. Comput. Appl. Med. Care, 1992, pp. 299–303.
[41] McDonald, C. J., et al., “LOINC, A Universal Standard for Identifying Laboratory Observations: A 5-Year Update,” Clin. Chem., Vol. 49, No. 4, 2003, pp. 624–633.
[42] Kudla, K. M., and M. C. Rallins, “SNOMED: A Controlled Vocabulary for Computer-Based Patient Records,” J. Ahima, Vol. 69, No. 5, 1998, pp. 40–44; quiz pp. 45–46.
[43] Ashburner, M., et al., “Gene Ontology: Tool for the Unification of Biology: The Gene Ontology Consortium,” Nat. Genet., Vol. 25, No. 1, 2000, pp. 25–29.
[44] Whetzel, P. L., et al., “The MGED Ontology: A Resource for Semantics-Based Description of Microarray Experiments,” Bioinformatics, Vol. 22, No. 7, 2006, pp. 866–873.
[45] Yu, H., et al., “Representing Genomic Knowledge in the UMLS Semantic Network,” Proc. AMIA Symp., 1999, pp. 181–185.
[46] Bodenreider, O., J. A. Mitchell, and A. T. McCray, “Evaluation of the UMLS as a Terminology and Knowledge Resource for Biomedical Informatics,” Proc. AMIA Symp., 2002, pp. 61–65.
[47] Konagaya, A., “Trends in Life Science Grid: From Computing Grid to Knowledge Grid: Pharmacogenetics Research Network and Knowledge Base,” BMC Bioinformatics, Vol. 18, No. 7, Suppl. 5, December 2006, p. S10.
[48] Saltz, J., et al., “caGrid: Design and Implementation of the Core Architecture of the Cancer Biomedical Informatics Grid,” Bioinformatics, Vol. 22, No. 15, 2006, pp. 1910–1916.
[49] de Knikker, R., et al., “A Web Services Choreography Scenario for Interoperating Bioinformatics Applications,” BMC Bioinformatics, Vol. 5, No. 25, March 10, 2004.
[50] ACGT, http://www.eu-acgt.org.
[51] Pérez-Rey, D., A. Anguita, and J. Crespo, “OntoDataClean: Ontology-Based Integration and Preprocessing of Distributed Data,” Lecture Notes in Computer Science, Vol. 4345, 2006, pp. 262–272.
[52] Hull, D., et al., “Taverna: A Tool for Building and Running Workflows of Services,” Nucleic Acids Res., Vol. 34 (Web Server issue), July 1, 2006, pp. W729–W732.
[53] Wilkinson, M. D., and M. Links, “BioMOBY: An Open Source Biological Web Services Proposal,” Brief Bioinform., Vol. 3, No. 4, December 2002, pp. 331–341.
[54] Tobias, J., et al., “The CAP Cancer Protocols—A Case Study of caCORE Based Data Standards Implementation to Integrate with the Cancer Biomedical Informatics Grid,” BMC Med. Inform. Decis. Mak., Vol. 6, June 20, 2006, pp. 25–40, http://stdsnp.genes.nig.ac.jp/index.html/.
[55] HL7 SIG, http://www.haifa.ibm.com/projects/software/cgl7/specifications.html.
[56] INFOBIOMED, http://www.infobiomed.org.
[57] Kulikowski, C., “The Micro-Macro Spectrum of Medical Informatics Challenges: From Molecular Medicine to Transforming Health Care in a Globalizing Society,” Methods Inf. Med., Vol. 41, 2002, pp. 20–24.
[58] Kiberstis, P., and L. Roberts, “It’s Not Just the Genes,” Science, Vol. 296, 2002, p. 685.
[59] Lunshof, J. E., M. Pirmohamed, and D. Gurwitz, “Personalized Medicine: Decades Away?” Pharmacogenomics, Vol. 7, No. 2, March 2006, pp. 237–241.
[60] Collins, C. D., et al., “The Application of Genomic and Proteomic Technologies in Predictive, Preventive and Personalized Medicine,” Vascul. Pharmacol., Vol. 45, No. 5, November 2006, pp. 258–267.
Part II Genome Sequencing and Fragment Assembly
6 Overview of Genome Assembly Techniques Sun Kim and Haixu Tang
The most common laboratory mechanism for reading DNA sequences (e.g., gel electrophoresis) can determine sequences of up to approximately 1,000 nucleotides at a time.1 However, the size of an organism’s genome is much larger; for example, the human genome consists of approximately 3 billion nucleotides. The most commonly used and most cost-effective process is shotgun sequencing, which physically breaks multiple copies (or clones) of a target DNA molecule into short, readable fragments and then reassembles the short fragments to reconstruct the target DNA sequence. The assembly of short fragments in shotgun sequencing was originally done by hand, but manual assembly clearly is not desirable since it is error prone and not cost-effective. Automatic fragment assembly has been studied for a long period of time [1–11]. Various sequence assemblers contributed to the determination of many genome sequences, including the most recent announcement of the human genome [12, 13].
1. Recently, several promising new experimental techniques have been developed. Note that we survey new sequencing technology in Part I.
6.1 Genome Sequencing by Shotgun-Sequencing Strategy
Almost all large-scale sequencing projects employ the shotgun strategy that assembles (deduces) the target DNA sequence from a set of short DNA
fragments determined from DNA pieces randomly sampled from the target sequence. The set of short DNA fragments, called shotgun reads, is assembled into a set of contigs, or sets of aligned fragments, using a computer program called a fragment assembler. Fragment assembly is a conceptually simple procedure that generates longer sequences by detecting overlapping fragments. If fragment assembly could be done perfectly, genome sequencing would be a simple problem. However, a genomic sequence contains extensive repetitive sequences (repeats, for short), which can easily mislead the fragment assembly process (see Figure 6.1).
A useful technique for overcoming the difficulty caused by repeats is to sequence both ends of a clone, generating two fragment reads per clone. Since the insert size of the clone is known, we know the approximate distance between the two fragments. This technique was developed by Hood and his colleagues [14]. The resulting fragment-pairing information is often referred to as mate-pair information, and it has become essential for large-scale shotgun sequencing. The main issue in utilizing this information during the assembly process is that we do not know the sequence between the two reads, which can only be deduced by assembling other fragments into a single contig. As a result, we can utilize the clone-length information only after assembly, where it supports either a correct or an incorrect assembly based on the clone-length information. See Figure 6.2. One strategy for using the mate-pair information effectively is to assemble contigs as accurately as possible by detecting potentially misassembled contigs and then to utilize the mate-pair information using only contigs that are likely to be assembled correctly. See Sections 7.4 and 7.5 for this strategy.
6.1.1 A Procedure for Whole-Genome Shotgun (WGS) Sequencing
In general, assembly of shotgun reads generates a large number of contigs and some of them are misassembled probably due to repetitive sequences in the target DNA. As a result, genome sequencing is usually carried out in multiple steps and, unfortunately, there is no consensus on the steps of all genome-sequencing
Figure 6.1 Effect of repeats in fragment assembly. Since assembly is based on overlapping regions between fragments, repeats can easily mislead the assembly process, putting all five fragments from two repeat copies into one contig in this figure.
Figure 6.2 (Upper panel) Mate-pair information. Two fragments, f1 and f2, are read from the two ends of the same clone, whose approximate size is 2 kb. (Bottom panel) The main issue in utilizing this information during the assembly process is that we do not know the sequence between the two reads, which can only be deduced by assembling other fragments into a single contig. As a result, we can utilize the clone-length information only after assembly, where it supports either a correct or an incorrect assembly based on the clone-length information.
projects. We describe a general procedure for genome sequencing below, but we emphasize that procedures used at genome-sequencing centers differ in details.
1. Fragment readout: The sequences of each fragment are determined using automatic base-calling software. Phred [15] is the most widely used program.
2. Trimming vector sequences: Shotgun reads often contain part of the vector sequences, which have to be removed before sequence assembly.
3. Trimming low-quality sequences: Shotgun reads contain poor-quality base calls, and removing or masking out these low-quality base calls often leads to more accurate sequence assembly. However, this step is optional, and some sequencing centers do not mask out low-quality
base calls, relying on the fragment assembler to utilize quality values to decide true fragment overlaps.
4. Fragment assembly: The shotgun data is input to a fragment assembler that automatically generates sets of aligned fragments called contigs. A survey of fragment assembly algorithms is in Chapter 7.
5. Assembly validation: Some contigs assembled in the previous steps may be misassembled due to repeats. Since we do not have a priori knowledge of repeats in the target DNA, it is very difficult to verify the correctness of the assembly of each contig, and this step is largely done manually. There are recent algorithmic developments on automatic verification of contig assemblies (see Section 6.4).
6. Scaffolding contigs: Contigs need to be oriented and ordered. The mate-pair information is a primary information source for this step; thus this step is not achievable if the input shotgun data is not prepared by reading both ends of clones (see Section 6.5 for more details).
7. Finishing: Assuming that all contigs are assembled correctly and contigs are oriented and ordered correctly, we can close gaps between two contigs by sequencing specific regions that correspond to the positions of gaps.
6.2 Trimming Vector and Low-Quality Sequences
DNA characters in a fragment are determined from a chromatogram, which can be viewed as a plot showing the likelihood of each of the four DNA characters. A base call is a DNA character determined from a chromatogram, and this process is done automatically by a computer program. Phred [15], the most widely used base-calling program, generates numeric values to denote the confidence level of each base call. The quality value of a base is q = −10 × log10(p), where p is the estimated error probability for the base [15]. Thus a sequencing machine generates two types of output files, one for the DNA fragment sequence and another for the base-call-quality values of the fragment. Using this information, the next step is to trim sequences from vectors and identify low-quality regions.
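As a quick illustration of this quality scale, the sketch below converts between error probabilities and quality values (the function names are mine, not part of Phred):

```python
# Convert between Phred quality values and error probabilities,
# following q = -10 * log10(p).
import math

def phred_quality(error_prob: float) -> float:
    """Quality value q for an estimated base-call error probability p."""
    return -10.0 * math.log10(error_prob)

def error_probability(quality: float) -> float:
    """Inverse mapping: estimated error probability for a quality value q."""
    return 10.0 ** (-quality / 10.0)

# A quality of 20 corresponds to a 1-in-100 chance of a wrong base call.
assert round(phred_quality(0.01)) == 20
assert abs(error_probability(30) - 0.001) < 1e-12
```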
6.2.1 The Trimming Vector and Low-Quality Sequences Problem
Input: A set of fragment reads with base-call-quality information; a set of vector sequences.
Output: A set of fragment reads with vector sequences trimmed, and with information on the start and end positions of good-quality regions.
Overview of Genome Assembly Techniques
83
The problem is relatively simple, but it requires a carefully written computer program to handle issues related to fragment trimming, such as vector sequence removal, identification of low-quality regions, and identification of contaminant reads. Chou and Holmes [16] wrote a suite of programs called LUCY for trimming vector and low-quality sequences from each fragment read. The design goal of LUCY is to process fragments so that trimmed fragments have the best overall quality rather than the best individual base-quality values. LUCY operates in multiple steps.
6.2.1.1 Quality Region Determination
The quality region is determined as follows:
1. To determine the good-quality region of a fragment, LUCY first removes low-quality regions from both ends, since the beginning and end of each sequence are typically of low quality. This is done by scanning the fragment from its left end for the first window of 10 bases with an error probability rate of 0.02 or less. Similarly, it finds the first such window from the right end.
2. The next step is to identify regions with high error rates. This is done by scanning the remaining fragment region with a window size of 50 bases and a maximum error probability rate of 0.08, and then with a window size of 10 bases and a maximum error probability rate of 0.3. Any clean-range sequence of less than 100 bases is discarded.
3. Each of the remaining clean-range sequences is further examined by checking it against two parameters, the overall maximum error probability (0.025) and the maximum error probability of the two consecutive bases at each end (0.02). Note that this step looks at the entire sequence range rather than a window.
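A minimal sketch of the windowed end trimming in step 1 is shown below, assuming per-base error probabilities have already been derived from the quality file; this illustrates the idea and is not LUCY's actual implementation:

```python
# LUCY-style end trimming (step 1 above): keep the region between the
# first acceptable 10-base window from the left and from the right.
def trim_low_quality_ends(probs, window=10, max_err=0.02):
    """Return (start, end) of the kept region, or None if no window of
    `window` bases meets the average error threshold."""
    n = len(probs)
    ok = lambda i: sum(probs[i:i + window]) / window <= max_err
    start = next((i for i in range(n - window + 1) if ok(i)), None)
    if start is None:
        return None
    end = next(i + window for i in range(n - window, -1, -1) if ok(i))
    return (start, end)

# Example: noisy ends, clean middle.
probs = [0.2] * 5 + [0.01] * 30 + [0.3] * 5
print(trim_low_quality_ends(probs))  # -> (5, 35)
```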
6.2.1.2 Vector Splice Site Trimming
This step requires two input files, one with the whole vector sequence and another with two splice-site template sequences, upstream and downstream from the insertion point on the vector. Since vector splice sites are usually at the beginning of a fragment, where the quality of bases is low, simple sequence matching is not guaranteed to find vector splice sites. Note that vector splice sites may lie outside the good-quality region of a fragment; these splice sites are still sought, since splice-site information is useful (e.g., for estimating clone length). To deal with sequences containing low-quality bases, LUCY uses three consecutive windows of 40, 60, and 100 bases. Vector sequences are matched with a minimum match length of 8, 12, and 16 bases within the three windows, respectively.
6.2.1.3 Contaminant Detection
Contaminants can come from many sources, including Escherichia coli or human, and these can be identified easily by sequence comparison methods. The real challenge is to identify contaminants that come from the cloning vectors themselves. Two common contaminants are vector inserts formed by one vector splicing with another vector, and short inserts, in which case most of a fragment read is vector sequence. Detection proceeds in two steps:
1. The first step is to prepare a sequence tag pool of, say, 10 bases from a full-length vector sequence.
2. Each fragment is converted into tags and searched against the contaminant tag pools. A contaminant sequence is detected by counting the number of matched tags. Since the tag-matching step is performed after trimming low-quality regions, tag matching is done only for good-quality regions, so matched tags are of very high confidence.
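A toy sketch of this tag-matching idea follows; the tag length matches the text, while the detection threshold and all names are illustrative assumptions rather than LUCY's:

```python
# Build a pool of 10-base tags from the vector, then count how many tags
# from the good-quality region of a read hit the pool.
TAG = 10

def tag_pool(vector_seq):
    return {vector_seq[i:i + TAG] for i in range(len(vector_seq) - TAG + 1)}

def matched_tag_count(read_seq, pool):
    return sum(read_seq[i:i + TAG] in pool
               for i in range(len(read_seq) - TAG + 1))

vector = "ACGTACGGTTACGATCGATTACGGA"
pool = tag_pool(vector)
read = vector[3:23]          # a short-insert read that is mostly vector
if matched_tag_count(read, pool) > 5:   # threshold chosen for illustration
    print("likely vector contaminant")
```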
6.3 Fragment Assembly
Given a set of shotgun reads with vector and low-quality sequences trimmed, a fragment assembly program is used to assemble the reads to reconstruct the target sequence.
6.3.1 The Fragment Assembly Problem
Input: A set of fragment reads with vector sequences trimmed and with information on the start and end positions of good-quality regions; a set of base-call-quality information for each fragment.
Output: A set of contigs, each of which is a set of aligned fragments; a set of consensus sequences, one for each contig.
Typically, a fragment assembly program generates many sets of assembled fragments instead of a single contiguous sequence. The two major reasons are repeats in the target sequence and low-coverage regions in the shotgun data. See Chapter 7 for details. Most fragment assembly algorithms employ the overlap-layout-consensus approach.
6.3.1.1 Overlap-Layout-Consensus Approach
The most widely used “overlap-layout-consensus” approach, pioneered by Peltola et al. [11], consists of three major steps: (1) identification of candidate overlaps, (2) fragment layout, and (3) consensus sequence generation from the layout. The first step is achieved using string pattern matching techniques, generating possible overlaps between fragments. The second and third steps involve
building models, implicit or explicit, for computing the layout of fragments and generating consensus sequences by enumerating the search space based on the model. Many successful sequence assembly algorithms have been developed based on this paradigm [1–3, 5, 7–9, 12]. There are also other approaches explicitly based on graph theory [6, 17]. These sequence assemblers contributed to the determination of many genome sequences, including the most recent announcement of the human genome [12, 13]. However, sequence assemblers typically generate a large number of contigs rather than single contiguous sequences, due to repetitive sequences and technical difficulties encountered at different stages of a genome-sequencing project. For example, the four most widely used assemblers generated 149 to more than 300 contigs for the N. meningitidis genome of 2.18 Mb [6]. The complete determination of the target sequence from the set of contigs requires a significant amount of work, which is called finishing.
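To make the overlap and layout steps concrete, here is a deliberately tiny greedy illustration for error-free reads; real assemblers use inexact matching, quality values, and far more careful layout models, so this is only a sketch with invented names:

```python
# Greedy overlap-layout-consensus on exact-match reads.
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b."""
    for olen in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-olen:] == b[:olen]:
            return olen
    return 0

def greedy_assemble(reads, min_len=3):
    reads = list(reads)
    while len(reads) > 1:
        olen, a, b = max(((overlap(x, y, min_len), x, y)
                          for x in reads for y in reads if x is not y),
                         key=lambda t: t[0])
        if olen == 0:        # no overlaps left: reads stay separate contigs
            break
        reads.remove(a); reads.remove(b)
        reads.append(a + b[olen:])   # layout plus trivial consensus
    return reads

print(greedy_assemble(["TTACGGA", "ACGGATC", "GATCCAT"]))  # ['TTACGGATCCAT']
```

Note how a repeat longer than the reads would let two different layouts score identically, which is exactly why assemblers report multiple contigs and validate them afterward.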
6.4 Assembly Validation
Repeats in the genome can easily lead to misassembly of contigs. Thus it is very important to validate contig assembly before scaffolding contigs or finishing gaps between contigs. The most accurate method to detect such misassembled contigs is to perform wet-lab experiments. However, this is time consuming and requires carefully designed experiments.
6.4.1 The Assembly Validation Problem
Input: A set of contigs with fragment alignment information; mate-pair information (some methods, e.g., the information theoretic probabilistic approach in this section, do not require this information).
Output: For each base position, a prediction of whether the position is assembled correctly or not.
Rouchka and States [18] proposed a computational technique to design wet-lab experiments for contig assembly validation, including high clone coverage maps, multiple complete digest mapping, optical restriction mapping, and ordered shotgun sequencing. Recently, several computational techniques that do not use wet-lab experiments have been developed. These techniques can be implemented as separate computational tools [19–22] or embedded in assemblers. The assembly validation techniques used in the sequence assemblers are reviewed in Sections 7.3, 7.4, and 7.5. Another interesting approach is to compare sequence assemblies from two or more fragment assembly programs to detect misassembled regions and to get a higher quality assembly (e.g., [23]).
6.4.1.1 TAMPA
Dew et al. [19] developed a sequence-assembly validation method that utilizes mate-pair data to evaluate and compare assemblies. The basic assumption is that the lengths of mate pairs from a clone library follow a Gaussian distribution with a mean µ and a standard deviation σ, which can be observed in the plots of clone mate lengths in the final curated assembly. Thus mate pairs are unsatisfied if the distance between the paired reads is beyond the range µ ± 3σ. TAMPA is a computational geometry-based approach to detecting assembly breakpoints by exploiting constraints that mate pairs impose on each other. They classified mate pairs into four assembly problems: insertion of incorrect sequence between a mate pair, deletion of sequence between the reads of a mate pair, inversion between two or more mate pairs, and transposition of mate pairs. The effects of the four assembly problems are stretched mate pairs (insertion and transposition), compressed mate pairs (deletion or transposition), and (anti)normal mate pairs (inversion).
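A small sketch of the underlying mate-pair test follows (function and label names are mine; TAMPA itself goes much further by intersecting the constraints of many pairs):

```python
# Classify a mate pair by its observed separation in the assembly.
# A pair is "satisfied" if the separation lies within mu +/- 3*sigma.
def classify_mate_pair(distance, mu, sigma):
    if distance > mu + 3 * sigma:
        return "stretched"      # e.g., insertion between the mates
    if distance < mu - 3 * sigma:
        return "compressed"     # e.g., deletion between the mates
    return "satisfied"

# Library with mean 2,000 bp and standard deviation 100 bp:
print(classify_mate_pair(2150, 2000, 100))  # satisfied
print(classify_mate_pair(3100, 2000, 100))  # stretched
```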
6.4.1.2 Compression/Expansion Statistics
Zimin and Yorke [22] developed compression/expansion (CE) statistics to identify misassembled regions (i.e., assembly regions that are either compressed or expanded due to repeats). The basic idea is again to assume that insert lengths between two mate fragments are distributed according to a Gaussian distribution with a given mean and variance. Given a contig, a global mean and a global variance of insert lengths are estimated. Then a sample mean and a sample variance for a given library at a given base position in the contig are computed as follows. The sample mean is the average length of the inserts that cover the given position. The sample standard deviation is estimated as

sample standard deviation = global standard deviation / √N

where N is the number of inserts that cover the given base position. Using the sample and global means and standard deviations, the CE statistic is computed as

CE = (sample mean − global mean) / sample standard deviation

The CE statistic is negative at a collapsed region and positive at an expanded region. The thresholds for collapsed and expanded regions are empirically determined as −4 and 4.7, respectively. Using the CE statistic, Zimin et al. developed a method to compare and correct (reconcile) misassembled regions using two different assemblies.
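The statistic is simple to compute once the inserts covering a position are known; a sketch under that assumption, with invented names:

```python
# CE statistic at one base position. `insert_lengths` holds the observed
# lengths of all inserts covering that position.
import math

def ce_statistic(insert_lengths, global_mean, global_sd):
    n = len(insert_lengths)
    sample_mean = sum(insert_lengths) / n
    sample_sd = global_sd / math.sqrt(n)   # as defined in the text
    return (sample_mean - global_mean) / sample_sd

# Inserts covering a collapsed repeat look shorter than the library average:
lengths = [1500, 1550, 1480, 1520, 1490]
c = ce_statistic(lengths, global_mean=2000, global_sd=200)
print(round(c, 1))   # about -5.5, well below the -4 threshold
print(c < -4)        # True: flagged as a collapsed (compressed) region
```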
6.4.1.3 Clone Coverage Analysis
Sequence assembly validation based on clone coverage can be used to detect large-scale misassemblies, especially collapsed repeats [20, 24]. This approach works in the three steps listed next.
1. Contigs are oriented and ordered.
2. The estimated lengths for all library clone types are computed, and clones are classified into two classes: bad clones, whose lengths deviate greatly from the expected clone length, and good clones, whose lengths are within an acceptable range of the expected clone length.
3. A good-minus-bad clone coverage plot is computed for each contig by subtracting the number of bad clones from the number of good clones.
The basic idea is simple. Any region where more bad clones are aligned than good clones is likely to be misassembled.
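A sketch of the good-minus-bad plot computation, assuming each clone is represented as a (start, end, is_good) triple in contig coordinates (this representation and the names are mine):

```python
# Good-minus-bad clone coverage along a contig.
def good_minus_bad(clones, contig_len):
    plot = [0] * contig_len
    for start, end, is_good in clones:
        delta = 1 if is_good else -1
        for pos in range(max(0, start), min(end, contig_len)):
            plot[pos] += delta
    return plot

clones = [(0, 6, True), (2, 9, False), (3, 8, False)]
print(good_minus_bad(clones, 10))
# -> [1, 1, 0, -1, -1, -1, -2, -2, -1, 0]
# Negative stretches (more bad clones than good) flag suspect regions.
```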
6.4.1.4 An Information Theoretic Probabilistic Approach
This approach [21] identifies misassembled regions using entropy plots that are computed from statistics on the number of patterns per fragment. To compute the entropy of fragments, we need to construct a probability model that measures how much each aligned fragment contributes to misassembly. The probability prob(fi) of each fragment fi is built using the fragment distribution, a measure used for repeat handling in a sequence assembler called AMASS [3]. From the probability model, we compute the entropy at base position p in a contig as

entropy(p) = Σ −prob(fi) × log(prob(fi))

where the sum is taken over all fragments fi with p − δ ≤ pos(fi) ≤ p + δ, pos(fi) denotes the left-end position of fi in the contig, and δ is a user-input parameter (by default, it is the same as the window size used for the fragment distribution calculation). Figure 6.3 shows how the entropy plot can detect misassembled regions.
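Computing the plot is straightforward once each aligned fragment carries a probability; a sketch with an assumed data layout (a list of (left_end_position, probability) pairs):

```python
# Entropy plot over a contig, following the formula above.
import math

def entropy_at(p, fragments, delta=500):
    return sum(-prob * math.log(prob)
               for pos, prob in fragments
               if p - delta <= pos <= p + delta and prob > 0)

def entropy_plot(contig_len, fragments, step=100, delta=500):
    """Sample entropy(p) every `step` bases along the contig."""
    return [entropy_at(p, fragments, delta)
            for p in range(0, contig_len, step)]
```

Peaks in the resulting plot mark candidate misassembled regions, as in panel (c) of Figure 6.3.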
6.5 Scaffold Generation
Some sequence assembly packages include a scaffold generation module that generates scaffolds of assembled contigs [5, 6, 9, 25–28]. There are also separate packages such as GigAssembler [26] and Bambus [28], which will be surveyed in this section.
Figure 6.3 (a) The fragment coverage plot, (b) the clone coverage plot, and (c) the entropy plot, for contig 8 generated by Phrap (version 2001). There is a misassembled region from 89,415 to 90,332 where the fragment coverage is not distinctly high, but the valleys in the clone coverage plot and the peaks in the entropy plot are distinct, effectively identifying the misassembled region.
6.5.1 The Scaffold Generation Problem
Input: A set of contigs; mate-pair information; a physical/genetic map (optional); expressed sequence tags (ESTs) (optional).
Output: A set of linearly ordered contigs, with optional gap-size information between adjacent contigs.
In essence, scaffolding generates a linear layout of contigs by orienting and ordering them. Issues related to scaffolding are:
1. Mate-pair information is erroneous. Some mate pairs come from chimeric clones. More seriously, mate-pair information from fragments aligned at wrong places can easily confuse contig orientation and ordering.
2. There are typically several types of clone libraries that differ in length, say, 2 kb, 10 kb, 40 kb, 100 kb, and so on. In general, mate-pair information from shorter clones is more accurate than that from longer clones. How to utilize mate-pair information of different quality is not trivial. Bambus is one of the hierarchical scaffolding methods that utilize mate-pair information from clones of different lengths in a hierarchical fashion.
3. There are external information sources that can be utilized for scaffolding contigs, such as physical/genetic maps, alignment information obtained by aligning contigs to already finished genomes, and conservation of gene synteny.
6.5.2 Bambus
The main design philosophy of Bambus is to be a stand-alone scaffolding package, so that it can be coupled with other fragment assembly packages and users can easily control the parameters for scaffolding contigs. Note that recent assemblers, such as the Celera Whole Genome Assembler and Arachne, embed a scaffolding module, but those scaffolding modules are tightly coupled with their specific assemblers. In this section, we explain the steps of Bambus, discussing how Bambus deals with the main issues of scaffolding contigs.
6.5.2.1 Edge Bundling to Handle Errors in Mate Pairs
Mate-pair information is very important in orienting and ordering contigs. However, some mate pairs are incorrect due to misassembly of contigs or fragments from chimeric clones. Intuitively, the more mate pairs between two contigs that are consistent in how they orient and order the two contigs, the more confidently those mate pairs can be considered correct. This problem can be formally defined and solved by finding the largest clique (i.e., a fully connected subgraph) in the interval graph induced by the inter-contig gap ranges for the links. The cluster with the most links is chosen, and all links in the other clusters are given invalid orientation tags. Bambus allows the user to specify different redundancies to be used for contig links, depending on the confidence in the data. For example, shorter clones, say, of 2 kb, require fewer bundled links, while longer clones, say, of 100 kb, require more bundled links to be used for contig orientation. The output of this step is a set of contig edges between contig pairs. The remaining task is to orient and order contigs using these contig edges.
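Because the implied gap ranges are intervals, the largest clique is simply the set of intervals covering the point of maximum overlap, which an endpoint sweep finds directly; a sketch with an assumed (lo, hi) representation of each link's gap range:

```python
# Bundle the mate-pair links between one contig pair: keep the largest
# mutually consistent cluster, tag the rest invalid.
def bundle_links(gap_intervals):
    events = sorted([(lo, 1) for lo, hi in gap_intervals] +
                    [(hi, -1) for lo, hi in gap_intervals])
    depth = best_depth = 0
    best_point = None
    for x, d in events:            # sweep across interval endpoints
        depth += d
        if depth > best_depth:
            best_depth, best_point = depth, x
    valid = [iv for iv in gap_intervals if iv[0] <= best_point <= iv[1]]
    invalid = [iv for iv in gap_intervals if iv not in valid]
    return valid, invalid

links = [(800, 1200), (900, 1300), (950, 1250), (4000, 4400)]
valid, invalid = bundle_links(links)
print(len(valid), invalid)   # 3 [(4000, 4400)]
```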
6.5.2.2 Contig Orientation
The contig orientation problem is to find a consistent orientation for all contigs. This is sometimes challenging. Consider three contigs, A, B, and C. The orientation of a contig, say B, with respect to another contig, say A, can be A → B, A → rc(B), rc(A) → B, or rc(A) → rc(B), where rc(A) and rc(B) represent the reverse complements of A and B, respectively. Suppose that contig edges impose the orientations A → rc(B) and A → rc(C). In addition, suppose that a contig edge imposes the orientation B → rc(C). Then the three-contig orientation is not consistent. Note that there are clones of different lengths, thus bundled edges are also of different lengths. In case there are errors in bundled edges, the situation described above can happen. Based on the principle of parsimony, we can formulate the contig orientation problem as computing the minimum number of contig edges that must be removed to make a consistent orientation possible for all contigs. Unfortunately, this problem is NP-hard [4]. Thus a greedy heuristic algorithm is used in practice. In general, the greedy contig orientation algorithm works well, since the edge-bundling step generates contig edges of high accuracy.
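A compact sketch of such a greedy heuristic follows (the +1/−1 edge-sign encoding and all names are assumptions of mine, not Bambus's data structures): orientations are propagated breadth-first from a seed contig, and any edge that disagrees with orientations already fixed is counted as a conflict to be removed.

```python
# Greedy contig orientation. `edges` maps a contig pair to +1 ("same
# orientation") or -1 ("opposite orientation").
from collections import deque

def orient_contigs(contigs, edges):
    orient, conflicts = {}, 0
    adj = {c: [] for c in contigs}
    for (a, b), sign in edges.items():
        adj[a].append((b, sign))
        adj[b].append((a, sign))
    for seed in contigs:
        if seed in orient:
            continue
        orient[seed] = +1
        queue = deque([seed])
        while queue:
            c = queue.popleft()
            for nbr, sign in adj[c]:
                want = orient[c] * sign
                if nbr not in orient:
                    orient[nbr] = want
                    queue.append(nbr)
                elif orient[nbr] != want and c < nbr:
                    conflicts += 1   # inconsistent edge: would be removed
    return orient, conflicts

# The inconsistent triple from the text: A->rc(B), A->rc(C), B->rc(C).
edges = {("A", "B"): -1, ("A", "C"): -1, ("B", "C"): -1}
print(orient_contigs(["A", "B", "C"], edges))  # one edge must be dropped
```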
6.5.2.3 Contig Ordering
The contig-ordering problem is to embed contigs on a line while preserving the gap lengths suggested by the bundled edges. This can be formulated as a problem of topological ordering of contigs subject to length constraints between contigs. An optimization formulation would be to find a topological ordering with a minimum number of edges removed. This, too, is known to be an NP-hard problem. Bambus uses an “expand-contract” greedy heuristic for contig ordering. The first step, expand, anchors the first unplaced contig with edges at their maximum allowable length and then traverses the contig graph in a breadth-first search manner to fill in the range. As this is a greedy placement of contigs, any contig with an inconsistent ordering is not placed. After the expand step, contigs are brought back and placed as close as possible to the midpoint of the range defined by the length constraints of the edges. This contraction step allows placement of as many contigs as possible. The resulting ordering may not be consistent, meaning that two contigs may occupy the same space. This ambiguous placement can be helpful for the final genome-finishing step.
6.5.2.4 Hierarchical Scaffolding
Contig edges from smaller insert libraries have significantly fewer errors than those from longer insert libraries; thus how to utilize clone edges of different accuracies is not trivial. Bambus generates scaffolds of contigs in a hierarchical fashion, starting with contig edges from the smallest libraries, say, of 2 kb; Bambus then adds lower-quality edges from larger insert libraries. The quality of contig edges is evaluated not only by the library length but also by the number of edges connecting two contigs, since confirmation from two independent edges is an indication of higher quality in connecting contigs.
6.5.2.5 Untangling
Given a scaffold of contigs, there can be contigs that are involved in multiple paths of contigs. In this case, it may be desirable to untangle those contigs to convert an ambiguous scaffold into a single linear stretch. The Bambus untangler resolves an ambiguous scaffold by iteratively finding the longest nonself-overlapping path in a greedy fashion. If a contig is involved in multiple potential paths, it may be desirable to break the contig into multiple pieces and then test whether single linear (nonoverlapping) stretches of contigs can be generated. Bambus plans to incorporate this capability in a future release.
6.5.3 GigAssembler
GigAssembler [26] was used for the human genome assembly in the public Human Genome Project [13]. It generated scaffolds using contigs, map, mRNA, EST, and BAC end data. The overview of the scaffold-generation process is as follows.
1. Decontaminating and repeat masking the sequence. RepeatMasker [29] is used to mask known repeats and contaminants from bacteria, vectors, and others.
2. Aligning mRNA, EST, BAC end, and paired-plasmid reads against initial sequence contigs.
3. Creating an input directory structure using map and other data. For the human genome, they used Washington University’s map data. A directory is created for each chromosome and a subdirectory for each fingerprint clone contig.
4. For each fingerprint clone contig, aligning the initial sequence contigs within that contig against each other.
5. Using the GigAssembler program within each fingerprint clone contig to merge overlapping initial sequence contigs, and to order and orient the resulting sequence contigs into scaffolds.
6. Combining the contig assemblies into full chromosome assemblies.
6.5.3.1 Preprocessing: Alignment of mRNA, ESTs, BAC Ends, and Paired Reads
Contigs are oriented and ordered by aligning mRNA, ESTs, BAC ends, and paired reads to contigs using a program called psLayout. It reports all matches above a certain minimal quality between query sequences and database sequences. To compute alignments, it collects candidate-matching regions by using 10-mer indices in a set of overlapping 500-base regions of the query sequence. The candidate-matching regions are then aligned, tolerating intron regions in the case of aligning mRNA and ESTs. The resulting aligned regions are combined using a dynamic programming algorithm. To reduce the effects of repeats, two techniques are used. The first technique is to use repeat sequences (repeats are not masked, although they are detected by RepeatMasker). The second technique is to maintain “near best” matches only, which means discarding matches (even good ones) if they fall below the best match.
6.5.3.2 Assembly and Ordering of Contigs
GigAssembler operates in several steps. The task is to generate consensus sequences determined by contigs and their ordering. Contigs are ordered and merged gradually into larger ones by building rafts, barges, raft-ordering graphs, and bridge graphs. Next we describe the algorithm in more detail.
1. Build merged sequence contigs, called rafts, from overlapping initial sequence contigs. A score is assigned to each aligning pair, and the alignments are then processed from the best scoring ones to the least.
2. Build sequenced clone contigs, called barges, from overlapping clones. Barges are constructed in a greedy fashion, where the clone overlap is the sum of all initial sequence contig overlaps. Each clone is assigned a coordinate in the resulting barge.
3. Once the orientation and order of clones are determined while constructing barges, rafts (merged sequence contigs) can be ordered using a raft-ordering graph. This is a directed graph with two types of nodes, rafts and sequenced clone endpoints. To understand what is happening at this stage, see Figure 6.4.
4. Rafts are bridged with mRNAs, ESTs, paired-plasmid reads, BAC end pairs, and ordering information from the sequencing centers. The resulting graph is called a bridge graph. Bridges are added one at a time, starting with the best scoring, to the ordering graph. The score
Figure 6.4 How to build a raft-ordering graph. Six initial contigs (a1, a2, b1, b2, c1, c2) are aligned to three clones (A, B, C) (top figure), an ordering graph of clone starts and ends is given (middle figure), and the final raft-ordering graph is obtained after adding the rafts to the ordering graph (bottom figure). From the top figure, we can construct three rafts, (a1, b1), (a2, b2), and (c1, c2), based on their overlaps. Then an ordering graph of clone starts and ends can be constructed based on the clone start and end positions, as in the middle figure. The node names As and Ae denote the start and the end of clone A, respectively. Finally, the three rafts are added to the ordering graph, as in the bottom figure.
function for bridges is based on the type of information: mRNA information is given the highest weight, then paired-plasmid reads, information provided by the sequencing centers, ESTs, and BAC end matches, in that order.
5. Walk the bridge graph to get an ordering of rafts. Each bridge is walked in the order of the default coordinates assigned, with the constraint that if a raft has predecessors, all the predecessors must be walked before the raft is walked.
6. A sequence path through each raft is built in a greedy fashion, starting with the longest, most finished initial sequence contig that passes through each section of the raft.
7. Build the final sequence for the fingerprint clone contig by inserting the appropriate number of Ns between raft sequence paths.
6.6 Finishing
Finishing is labor intensive and constitutes a major bottleneck in any genome-sequencing project. Input to the finishing stage is a set of oriented and ordered contigs. However, as we discussed in previous sections, it is still challenging to verify the correctness of contigs and to generate scaffolds of contigs. Due to these difficulties, there are two different views on pursuing genome-level sequencing, one in favor of the whole-genome shotgun strategy [30] and another in favor of a hierarchical strategy involving only smaller-scale shotgun sequencing [31] (see Section 6.7 for more discussion). In summary, there should be more effort in developing frameworks for genome sequencing as well as component tools such as scalable, reliable sequence assemblers and contig assembly-validation methods. As one of the initial efforts, the AMOS project aims at developing open-source whole-genome assembly software for the genome-sequencing community [32].
6.7 Three Strategies for Whole-Genome Sequencing
The first whole bacterial genome, H. influenzae, was sequenced at TIGR in 1995 using the whole-genome shotgun strategy [33]. Since then, the whole-genome shotgun strategy has been successfully used for many genomes, including human [12]. Basically, the longer the DNA region, the more repeats it contains, and this is clearly a major hurdle for genome sequencing. There are three different strategies for employing shotgun sequencing at the whole-genome level.
1. The whole-genome shotgun strategy applies the shotgun-sequencing strategy at the whole-genome level. The advantage of this approach is that it is the most cost-effective, since shotgun data can be prepared in a single step at the whole-genome level. The human genome assembly by Celera was achieved using this strategy [12].
2. The hierarchical approach uses libraries of different insert sizes. Libraries of larger insert size are further split into libraries of smaller insert size, while keeping track of which library the current library is a descendant of. This often requires a high-resolution genetic map and a low-resolution physical map prior to the whole-genome assembly. The shotgun strategy is applied when a library becomes small enough that the
current assembly algorithm can determine the target sequence with confidence. Since the library hierarchy information can be easily used to produce the target sequence, this approach may produce more accurate genome sequences. The major drawback of this approach is the time and cost of genome sequencing. The human genome consortium used this approach [13].
3. The hybrid approach, called pooled genomic indexing—a technique pioneered at the Baylor College of Medicine—employs both the hierarchical and the whole-genome shotgun approaches, but without physical and genetic mapping information. This approach combines two different types of shotgun reads, one from the whole-genome shotgun and the other from the shotgun sequencing of individual BACs. BACs are generated by using large-insert BAC clones, and a minimum tiling path of BACs that covers the whole genome is computed. Then the shotgun strategy with a low coverage is applied to a set of selected BACs from the tiling path. In parallel, the whole-genome shotgun strategy is applied to generate another set of shotgun data at the whole-genome level. These two shotgun data sets are combined to determine the whole-genome sequence. The brown Norway rat genome was assembled with this strategy [34].
6.8 Discussion
In this section, we briefly summarize techniques for genome sequencing. For more information, readers may refer to several review articles on genome sequencing [35–37]. Although genome sequencing still remains an open problem, recent advances in computational techniques have made it possible to sequence very large eukaryotic genomes such as Drosophila melanogaster [27] and the human genome [12, 13]. We discuss some of the recent trends in genome-sequencing strategies next.
1. To achieve very large-scale genome sequencing, it is necessary to have shotgun data of very high quality (see [27]). Otherwise, it is not possible to distinguish repeats from errors in the shotgun data. What is really interesting is that recent assemblers attempt to correct errors in the input shotgun data before sequence assembly [5, 6]. There is no guarantee that errors can be corrected without knowing the target sequence. Indeed, EULER, a genome assembly package, names this procedure data corruption instead of error correction. Nonetheless, this is a promising technique that works for large-scale sequence assembly.
2. Repeat boundaries are identified before sequence assembly, and then contigs are assembled up to the boundaries [5, 27]. Like error correction, there is no guarantee of identifying the repeat boundaries correctly without knowing the target sequence, but this is another promising technique.
3. Computational techniques to ensure the correctness of contig assembly become more important. Correctness can be checked using the characteristics of the shotgun data (i.e., random sampling).2 The characteristics of the data can be utilized on the fragment level [2, 5, 9, 27], and there is also an interesting approach based on pattern statistics [21].
4. Mate-pair information and base-call-quality values have become essential data for genome sequencing.
6.8.1 A Thought on an Exploratory Genome Sequencing Framework
As hinted in Section 6.5.2, contigs that are involved in multiple paths of contigs may be broken into smaller pieces and then tested to see whether linear paths can be generated. To realize this idea, methods for assembly validation that detect misassembled regions in contigs are much needed (see Section 6.4). We tested this idea for several bacterial genomes, including Agrobacterium [38]. The schematic overview of a genome-sequencing framework [39] developed at DuPont is depicted in Figure 6.5. This approach can be viewed as “a hypothesis generation and validation paradigm” in search of a set of correctly assembled contigs and their ordering. All decisions made at user interaction points are hypotheses that will be subsequently tested with larger clones in the next step. This approach was successful in assembling several genome sequences. For example, 502 contigs in the Phrap assembly of the Agrobacterium shotgun data were grouped and ordered into only 15 sets of contigs (the largest set longer than 2 Mb) using a Web interface in a single iteration of our genome-sequencing framework. Note that there are four replicons in the Agrobacterium genome. As more accurate assembly validation methods are developed, this technique might be useful for automating the sequencing of microbial genomes by embedding sequence assembly modules into an assembly package such as Minimus [40].
2. There are regions where sampling is biased due to biological reasons. However, random sampling can be assumed for the shotgun data as a whole.
[Figure 6.5 flowchart: a set of DNA fragments → sequence assembler → a set of contigs → assembly validation (split contigs) → ordering contigs using clone linkage information → a set of groups of contigs → ordering groups using large clone linkage information → a set of groups of groups.]
Figure 6.5 A framework for genome sequencing. The framework searches for a set of correctly assembled contigs and their ordering in an iterative fashion.
Acknowledgments
Sun Kim was supported in part by CAREER award DBI-0237901 from the National Science Foundation (United States) and a grant from the Korea Institute of Science and Technology Information. We thank the anonymous reviewer for his or her valuable comments.
References
[1] Green, P., http://www.phrap.org.
[2] Sutton, G., et al., “TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects,” Genome Science and Technology, Vol. 1, 1995, pp. 9–19.
[3] Kim, S., and A. M. Segre, “AMASS: A Structured Pattern Matching Approach to Shotgun Sequence Assembly,” Journal of Computational Biology, Vol. 6, No. 4, 1999.
[4] Kececioglu, J. D., and E. W. Myers, “Combinatorial Algorithms for DNA Sequence Assembly,” Algorithmica, Vol. 13, 1995.
[5] Batzoglou, S., et al., “Arachne: A Whole-Genome Shotgun Assembler,” Genome Research, Vol. 12, No. 1, 2002, pp. 177–189.
[6] Pevzner, P. A., et al., “An Eulerian Path Approach to DNA Fragment Assembly,” PNAS, Vol. 98, 2001, pp. 9748–9753.
[7] Huang, X., “A Contig Assembly Program Based on Sensitive Detection of Fragment Overlaps,” Genomics, Vol. 14, 1992.
[8] Huang, X., “An Improved Sequence Assembly Program,” Genomics, Vol. 33, 1996.
[9] Huang, X., and A. Madan, “CAP3: A DNA Sequence Assembly Program,” Genome Research, Vol. 9, No. 9, 1999, pp. 868–877.
[10] Huang, X., et al., “PCAP: A Whole-Genome Assembly Program,” Genome Res., Vol. 13, 2003, pp. 2164–2170.
[11] Peltola, H., et al., “SEQAID: A DNA Sequence Assembling Program Based on a Mathematical Model,” Nucleic Acids Res., Vol. 12, No. 1, Pt. 1, 1984, pp. 307–321.
[12] Venter, J. C., et al., “The Sequence of the Human Genome,” Science, Vol. 291, 2001, pp. 1304–1351.
[13] Lander, E. S., et al., “Initial Sequencing and Analysis of the Human Genome,” Nature, Vol. 409, 2001, pp. 860–921.
[14] Roach, J. C., “Pairwise End Sequencing: A Unified Approach to Genome Mapping and Sequencing,” Genomics, Vol. 26, 1995, pp. 345–353.
[15] Ewing, B., et al., “Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment,” Genome Research, Vol. 8, 1998, pp. 175–185.
[16] Chou, H. H., and M. H. Holmes, “DNA Sequence Quality Trimming and Vector Removal,” Bioinformatics, Vol. 17, No. 12, 2001, pp. 1093–1104.
[17] Idury, R., and M. S. Waterman, “A New Algorithm for DNA Sequence Assembly,” Journal of Computational Biology, Vol. 2, No. 2, 1995, pp. 291–306.
[18] Rouchka, E. C., and D. J. States, “Sequence Assembly Validation by Multiple Restriction Digest Fragment Coverage Analysis,” Proc. of Intelligent Systems for Molecular Biology (ISMB), 1998, pp. 140–147.
[19] Dew, I. M., et al., “A Tool for Analyzing Mate Pairs in Assemblies (TAMPA),” J. Comput. Biol., Vol. 12, No. 5, 2005, pp. 497–513.
[20] Kim, S., et al., “Enumerating Repetitive Sequences from Pairwise Sequence Matches,” manuscript, DuPont Central Research, 2000.
[21] Kim, S., et al., “A Probabilistic Approach to Sequence Assembly Validation,” ACM SIGKDD Workshop on Data Mining in Bioinformatics (BioKDD2001), 2001, pp. 38–43.
[22] Zimin, R., and J. A. Yorke, “Assembly Reconciliation Method,” http://www.genome.umd.edu/reconciliation.htm, 2007.
[23] Shatkay, H., et al., “ThurGood: Evaluating Assembly-to-Assembly Mapping,” Journal of Computational Biology, Vol. 11, No. 5, 2004, pp. 800–811.
[24] Kim, S., et al., “A Computational Approach to Sequence Assembly Validation,” manuscript, DuPont Central Research, 2000.
[25] She, X., et al., “Shotgun Sequence Assembly and Recent Segmental Duplications Within the Human Genome,” Nature, Vol. 431, 2004, pp. 927–930.
[26] Kent, W. J., and D. Haussler, “Assembly of the Working Draft of the Human Genome with GigAssembler,” Genome Res., Vol. 11, 2001, pp. 1541–1548.
[27] Myers, G., et al., “A Whole-Genome Assembly of Drosophila,” Science, Vol. 287, 2000, pp. 2196–2204.
[28] Pop, M., et al., “Hierarchical Scaffolding with Bambus,” Genome Res., Vol. 14, No. 1, 2004, pp. 149–159.
[29] Smit, A., “RepeatMasker,” http://repeatmasker.genome.washington.edu/, 2007.
[30] Weber, J. L., and E. Myers, “Human Whole-Genome Shotgun Sequencing,” Genome Research, Vol. 7, 1997, pp. 401–409.
[31] Olson, M., and P. Green, “A ‘Quality-First’ Credo for the Human Genome Project,” Genome Research, Vol. 8, No. 5, 1998, pp. 414–415.
[32] AMOS: A Modular Open-Source Assembler, http://amos.sourceforge.net/, 2007.
[33] Fleischmann, R. D., et al., “Whole-Genome Random Sequencing and Assembly of Haemophilus Influenzae Rd.,” Science, Vol. 269, No. 5223, 1995, pp. 496–512.
[34] Gibbs, R. A., et al. (Rat Genome Sequencing Project Consortium), “Genome Sequence of the Brown Norway Rat Yields Insights into Mammalian Evolution,” Nature, Vol. 428, No. 6982, 2004, pp. 493–521.
[35] Pop, M., et al., “Genome Sequence Assembly: Algorithms and Issues,” IEEE Computer, Vol. 35, No. 7, 2002, pp. 47–55.
[36] Pop, M., et al., “Shotgun Sequence Assembly,” Advances in Computers, Vol. 60, June 2004.
[37] Batzoglou, S., “Algorithmic Challenges in Mammalian Genome Sequence Assembly: Special Review,” in D. M. Jordel, P. Little, and S. Subramaniam, (eds.), Encyclopedia of Genomics, Proteomics, and Bioinformatics, New York: John Wiley & Sons, 2005.
[38] Wood, D. W., et al., “The Genome of Agrobacterium Tumefaciens C58: Insights into the Evolution and Biology of a Natural Genetic Engineer,” Science, 2001, pp. 2317–2323.
[39] Kim, S., “The AMASS Genome Sequencing Package,” Advances in Genome Biology and Technology Conference, February 2002.
[40] Sommer, D. D., et al., “Minimus: A Fast, Lightweight Genome Assembler,” BMC Bioinformatics, Vol. 8, February 26, 2007, p. 64.
7
Fragment Assembly Algorithms
Sun Kim and Haixu Tang
Fragment assembly algorithms take as input a set of fragment reads with vector sequences trimmed, information on the start and end positions of good-quality regions, and base-call-quality information. The output of a fragment assembly algorithm is a set of aligned fragments, called contigs, and a consensus sequence for each contig. Repeats that are often longer than the average fragment length make assembly very difficult, so assembly algorithms generate many contigs rather than a single contiguous sequence. See Chapter 6 for more information. One of the most effective methods for handling repeats is to use mate-pair information from a clone, which most modern fragment assembly algorithms utilize. However, using mate-pair information is not trivial. The main issue is that we do not know the sequence between the two reads, which can only be deduced by assembling other fragments into a single contig. As a result, we can use the clone-length information only after assembly, and an assembly decision based on clone-length information can turn out to be either correct or incorrect (see Figure 6.2). In this chapter, we survey nine fragment assembly algorithms. Recent assemblers are comprehensive genome-sequencing packages rather than just fragment assembly algorithms, so the descriptions of two of them cover all issues in genome assembly: the Celera Whole Genome Assembler [1] in Section 7.4 and Arachne [2] in Section 7.5. The rest of this chapter is organized as follows. In the next six sections, we describe seven of the most widely used fragment assembly algorithms as of
today. The following section describes two more algorithms that are of interest for their algorithm design.
7.1 TIGR Assembler

The TIGR assembler [3] was the first assembler used to assemble a whole bacterial genome, H. influenzae, by the shotgun sequencing strategy in 1995. The TIGR assembler operates in two phases: pairwise comparison of fragments and fragment assembly. A consensus sequence is determined automatically as fragments are assembled. The assembler handles repetitive sequences with increased match criteria, while incorporating mate-pair information into assembly.
7.1.1 Merging Fragments with Assemblies
After computing pairwise overlaps among the input fragments, a fragment is merged with the current assembly if four conditions are satisfied: minimum length of overlap, minimum similarity in the overlap region as a percentage of the best possible score, maximum length of overhang, and maximum number of local errors. An overhang is a region at either end of the alignment where the two fragments do not match, and the overhang length is the larger of the overhangs at the two ends of the alignment. Local errors are the number of base-pair mismatches between the fragment and the assembly within any contiguous 32 bp of the overlap region. The maximum number of local errors is used to reject overlaps with clustered errors, which might survive the similarity test. For example, 16 errors spread out evenly over a 400-bp overlap (96% similarity) are more likely to indicate a real overlap than 8 errors clustered in a 32-bp window of a 400-bp overlap (98% similarity). Note that, when a fragment is considered for assembly, the match criteria are computed against the current assembly, not against another fragment; thus the assembly itself does not rely on pairwise alignment, although the selection of the next fragment to assemble does.
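To make the clustered-error test concrete, here is a minimal Python sketch; only the 32-bp window size comes from the text, while the function names, the 0/1 mismatch-vector representation of an overlap, and the numeric thresholds are illustrative assumptions rather than TIGR's actual implementation.

```python
def max_local_errors(mismatch, window=32):
    """Largest number of mismatches in any contiguous window of the
    overlap; `mismatch` is a 0/1 vector over aligned positions."""
    count = sum(mismatch[:window])  # errors in the first window
    worst = count
    for i in range(window, len(mismatch)):
        count += mismatch[i] - mismatch[i - window]  # slide the window
        worst = max(worst, count)
    return worst

def accept_overlap(mismatch, min_len=40, min_sim=0.90, max_local=4):
    """Illustrative merge test: overlap length, overall similarity,
    and the clustered-error criterion must all pass."""
    n = len(mismatch)
    if n < min_len:
        return False
    if 1.0 - sum(mismatch) / n < min_sim:
        return False
    return max_local_errors(mismatch) <= max_local
```

On the chapter's example, 16 evenly spread errors in a 400-bp overlap pass this test, while 8 errors packed into a single 32-bp window fail it, even though the latter overlap has higher overall similarity.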
7.1.2 Building a Consensus Sequence
When a fragment is added to the current assembly, TIGR does not commit to a consensus sequence but keeps a history of which bases have been aligned to each position in the past. This yields a column, or profile, recording a count of bases plus gaps. From this profile, a consensus sequence is generated after the assembly is completed, by selecting the most frequent base in each column.
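A profile-based consensus is easy to sketch. The fragment below assumes the assembly keeps, for each column, the multiset of symbols aligned to it, and simply emits the most frequent symbol per column, dropping gap winners; the representation and names are illustrative, not TIGR's data structures.

```python
from collections import Counter

def consensus_from_profile(columns):
    """columns[i] is the multiset of symbols ('A','C','G','T','-')
    aligned to position i; pick the most frequent one per column
    and drop columns whose winner is a gap."""
    winners = (Counter(col).most_common(1)[0][0] for col in columns)
    return "".join(b for b in winners if b != "-")

# Three reads aligned over four columns:
print(consensus_from_profile([list("AAA"), list("CCG"),
                              list("--T"), list("TT-")]))  # ACT
```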
7.1.3 Handling Repetitive Sequences
TIGR uses two strategies to handle repetitive sequences: increased match criteria for fragments that may come from repeats, and clone-length constraints. Fragments from repetitive sequences are identified by the number of potential overlaps each fragment has, based on the pairwise comparisons. Any fragment with more overlaps than a threshold is labeled as coming from repetitive sequences. When a repeat fragment is added to the current assembly, the stringency of the match criteria is increased so that inexact repeats (for example, Alu repeats, at about 80% similarity) can be distinguished. Even with strict match criteria, it is impossible to avoid false overlap identifications when repeats are longer than the average fragment size. TIGR therefore also incorporates mate-pair information, which constrains the distance between two reads from the same clone, to further eliminate false overlaps.
7.2 Phrap

Phrap is one of the most widely used fragment assemblers. However, there is no published paper that describes the Phrap algorithm; the description here is based on the online documentation at http://www.phrap.org.

1. All matching words of length minmatch or greater are identified, and a sorted list of the words is created.
2. For each pair of reads, a band around the diagonal defined by matching words is formed, and overlapping words are merged. SWAT, an implementation of the Smith-Waterman algorithm, is used to find matching segments scoring above a preset minscore. By masking out the currently matched regions, SWAT is applied recursively between matches.
3. A log-likelihood ratio (LLR) score is computed to compare two hypotheses: that the reads truly overlap, and that they come from repeats of 95% similarity. The LLR uses the probability of the observed data under each hypothesis, quantified by interpreting the Phrap quality values as error probabilities (see the sketch after this list). Since the LLR is the log of the ratio of the two hypotheses' likelihoods, a positive LLR score implies a true overlap while a negative LLR score implies an overlap between repeats.
4. A fragment layout is generated progressively, using a list of matches sorted by their LLR scores.
5. The consensus sequence of each contig is generated as a mosaic of individual reads, by building a weighted graph whose vertices are selected positions of matches (ends of alignments, and midpoints of perfectly matching segments of sufficient length). Unidirectional edges are added from 5' to 3' positions within a single read, with weight equal to the total quality of the sequence between the two nodes. To cross-link matches, bidirectional edges of weight 0 are added between aligned bases in overlapping reads. Then a standard single-source maximum-weight path is computed (the complexity is linear in the number of vertices).
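The LLR score of step 3 has a simple per-column form if quality values are read as error probabilities. The sketch below assumes an overlap summarized as per-column match flags and quality values; the 95% repeat-similarity figure comes from the text, while the function names and the independence assumption across columns are simplifications, not Phrap's exact computation.

```python
import math

def phred_error(q):
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10.0)

def overlap_llr(matches, quals, repeat_sim=0.95):
    """Log-likelihood ratio of H1 (reads truly overlap; mismatches
    come only from sequencing error, as given by the qualities)
    versus H2 (reads come from repeat copies of similarity
    repeat_sim).  Positive => likely true overlap."""
    llr = 0.0
    for is_match, q in zip(matches, quals):
        e = phred_error(q)
        p_true = (1.0 - e) if is_match else e
        p_rep = repeat_sim if is_match else (1.0 - repeat_sim)
        llr += math.log(p_true / p_rep)
    return llr
```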
7.3 CAP3

CAP3 is an improved version of its predecessors, CAP2 [4] and CAP [5]. Three major improvements in CAP2 were: (1) better performance, by filtering out potentially nonoverlapping fragments; (2) identification of chimeric fragments, by computing and using an error-rate vector for each input fragment; and (3) handling of repetitive sequences, by constructing repetitive contigs while merging two different contigs. CAP2 was further improved to CAP3, as described in this section. There is also a parallel version of CAP, called PCAP [6].

7.3.1 Automatic Clipping of 5' and 3' Poor-Quality Regions
Clipping uses both base-quality values and sequence similarities, based on the following notion of good regions of a read:

1. Any sufficiently long region of high quality values;
2. Any sufficiently long region that is highly similar to a good high-quality region of another read.

The 3' and 5' clipping positions of a read are determined by the boundaries of its good regions.
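Criterion 1 (a sufficiently long run of high quality values) can be sketched directly; the quality and length thresholds below are illustrative placeholders rather than CAP3's defaults, and criterion 2 (similarity to another read's good region) is omitted for brevity.

```python
def quality_good_regions(quals, min_q=20, min_len=50):
    """Return (start, end) spans where base quality stays at or
    above min_q for at least min_len bases."""
    regions, start = [], None
    for i, q in enumerate(quals + [-1]):   # sentinel closes the last run
        if q >= min_q and start is None:
            start = i
        elif q < min_q and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions

def clip_positions(quals, **kw):
    """5' and 3' clip positions = boundaries of the good regions."""
    regions = quality_good_regions(quals, **kw)
    return (regions[0][0], regions[-1][1]) if regions else None
```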
7.3.2 Computation and Evaluation of Overlaps
A global alignment of two fragments is computed over the band determined by the optimal local alignment, after clipping poor-quality regions. Each overlap is evaluated by five measures: minimum length, minimum percent identity, minimum similarity score, differences between the overlapped reads at bases of
high-quality values, and the difference between the observed error rate of the overlap under consideration and the estimated sequencing error rate.

7.3.3 Use of Mate-Pair Constraints in Construction of Contigs
The mate-pair information is used while assembling contigs:

1. An initial layout of reads is produced by a greedy method, in decreasing order of overlap scores.
2. The quality of the current layout is assessed with mate-pair constraints (i.e., the existence of paired reads and the distance between them).
3. The region of the current layout with the largest number of unsatisfied constraints is located, and the unsatisfied constraints are checked for whether they can be satisfied by aligning the unaligned mates according to their expected distance. If such a region exists, the region is corrected by adding satisfiable mate pairs and breaking unsatisfiable ones, and steps 2 and 3 are repeated. Otherwise, the correction procedure terminates.
4. Contigs are ordered and linked using the remaining unsatisfied constraints (i.e., mate pairs spanning two different contigs).
7.4 Celera Assembler

The Celera Whole Genome Assembler (WGA) was the first fragment assembler to successfully assemble WGS reads from large eukaryotic genomes (>100 Mbp). The description of the Celera assembler here is based on the article describing the Drosophila genome assembly [1]. Before explaining the Celera WGA, we survey the approach of Kececioglu and Myers [7], which was a basis for it.

7.4.1 The Kececioglu and Myers Approach
The Kececioglu and Myers approach [7] formulated the fragment assembly problem formally. The algorithm operates in four steps:

1. Construct a graph of approximate overlaps between every pair of fragments.
2. Assign an orientation to each fragment (i.e., choose the forward or reverse-complement sequence for each fragment).
3. Select a set of overlaps that induces a consistent layout of the oriented fragments.
4. Merge the selected overlaps into a multiple sequence alignment, and vote on a consensus.

Each of the four phases can be viewed as a separate problem. Unfortunately, the last three problems are NP-complete, and each is handled by a heuristic algorithm in the Kececioglu and Myers approach.

7.4.1.1 Overlap Graph Construction
There are four possible types of overlap between two fragments f1 and f2: a fragment f1 dovetails to another fragment f2 when a suffix of f1 aligns to a prefix of f2, and f1 contains f2 when f2 is a substring of f1 (and symmetrically with the roles of f1 and f2 exchanged). An overlap graph G = (V, E, w) represents fragments by the vertex set V and overlaps by the edge set E; an edge-weight function w gives a plausibility metric for each overlap. To choose which of the possible overlaps actually appears between each vertex pair in the overlap graph, Kececioglu and Myers select the maximum-likelihood overlap among all possible overlaps, based on the edit distance and the probabilistic metric of Chvatal and Sankoff. The probabilistic metric is computed as follows: given an overlap between two fragments, an overlap of length l + d is a string containing l exact matches (substrings on which both fragments agree) and d mismatches. The Chvatal and Sankoff algorithm computes the probability that a string of length l + d containing the fixed substring of length l occurs. Finally, overlap edges are culled according to two criteria: (1) match significance, and (2) error distribution. An overlap with match probability below the match-significance threshold is rejected. Assuming a binomial distribution of errors, some edges are also rejected by an error-distribution threshold; roughly speaking, an overlap with highly clustered errors is rejected by a probabilistic argument.

7.4.1.2 Fragment Orientation
Once an overlap graph has been constructed, the next step is to determine the orientation of each fragment. This is formulated as an optimization problem with the objective function

$$\text{maximize } w(O) = \sum_{(f_1, f_2) \in (O, O)} same(f_1, f_2) \;+ \sum_{(f_1, f_2) \notin (O, O)} opp(f_1, f_2)$$

where $same(f_1, f_2) = \max(w(f_1, f_2),\, w(\bar{f}_1, \bar{f}_2))$, $opp(f_1, f_2) = \max(w(f_1, \bar{f}_2),\, w(\bar{f}_1, f_2))$, and $\bar{f}$ denotes the reverse complement of $f$. This problem is NP-complete, shown by a reduction from the maximum-weight cut problem [7]. Thus, Kececioglu and Myers proposed a greedy approximation algorithm.
7.4.1.3 Fragment Layout

A layout of fragment overlaps is generated by traversing the oriented graph according to the following four constraints:

1. Every vertex in G has at most one incoming edge.
2. The edges do not form cycles.
3. No two dovetail edges leave the same vertex.
4. No containment edge f1 ⇒ f2 is followed by a dovetail edge f2 → f3.
A set of edges that satisfies constraints 1 and 2 is called a branching; a branching can be seen as a collection of trees. A set of edges that also satisfies constraints 3 and 4 is called a dovetail-chain branching. The layout problem is then formulated as the maximum dovetail-branching problem, which can be solved by repeatedly generating a maximum-weight branching and then spawning subproblems that each remove one of a pair of edges violating constraint 3 or 4. The number of subproblems can obviously grow exponentially; Kececioglu and Myers presented two techniques to reduce the search space.

7.4.1.4 Multiple Sequence Alignment
Once the pairwise alignments of overlapping fragments are computed, a trace graph of aligned characters in each column is constructed. A trace yields a sequence of characters, so the multiple-alignment phase first determines a trace and then generates a sequence by topologically sorting the connected components of the trace. The problem is how to find the best trace from the graph of characters. The basic idea is that any pairwise alignment already determined forms a trace. Given the pairwise alignments, a supergraph is computed whose supernodes are the sequences to be aligned and whose superedges are defined by the pairwise alignments. A superedge is weighted by a similarity metric, the sum of similarities between characters of the two sequences. The goal is then to find a tree of pairwise traces with maximum weight in the supergraph. The heuristic proposed by Kececioglu and Myers is to compute the closure of the pairwise traces, add the resulting edges to the superedges, and then compute a maximum-weight spanning tree.

7.4.2 The Design Principle of the Celera Whole-Genome Assembler
The following is the design principle of the Celera Whole-Genome Assembler.

1. Clone-mate information is used to generate a scaffold of the genome as well as contigs. One important feature is that contig assembly
is made as accurate as possible by detecting repeat boundaries during assembly. Contigs without repeats are called U-unitigs. Note that clone-length information is reliable only as long as the assembly of the contigs (unitigs) is correct; see Figure 6.2 for the issues in using mate-pair information.
2. The assembler can utilize external information such as physical and genetic maps.
3. The assembler places fragment reads in a series of stages, starting with the safest moves and proceeding to more aggressive ones. The stage and evidence for each read's placement are open to inspection, providing an audit trail of the assembler's decisions.
7.4.3 Overlapper
Fragment overlaps are detected with criteria requiring fewer than 6% differences and at least 40 bp of unmasked sequence, similar to the well-known seed-and-extend method developed for BLAST [8]. A simple statistic can reveal the presence of extensive repeats: for example, the average number of overlaps per fragment read was 33.7, much higher than the coverage of the shotgun data (14.6). As in many other assemblers, an unusually high number of overlaps for a fragment can be used to detect repeats.
7.4.4 Unitigger
A collection of fragments whose arrangement is not contested by overlaps from other fragments is assembled into a unitig, a contig built under more stringent criteria. Each unitig is then examined to determine whether it contains repeats, as follows. Unitigs that represent nonrepetitive, unique DNA are called U-unitigs. Potential repeat boundaries in U-unitigs are computed, and the U-unitigs are extended up to these repeat boundaries. To detect misassembled unitigs, the WGA uses the A-statistic, a log-odds ratio based on the probability of the observed distribution of fragment start points; U-unitigs are unitigs with an A-statistic of 10 or higher. The A-statistic is calculated as follows. Let F be the number of fragments in the database and G be the (estimated) size of the genome, and assume that fragments are not oversampled. For a unitig with k fragments and distance $\rho$ between the start of its first fragment and the start of its last fragment, the probability of seeing the k − 1 start points in the interval of length $\rho$ is $[(\rho F/G)^{k-1}/(k-1)!]\exp(-\rho F/G)$. If the unitig is instead the result of collapsing two repeat copies, the probability is $[(2\rho F/G)^{k-1}/(k-1)!]\exp(-2\rho F/G)$. The A-statistic is the log of the ratio of these two probabilities.
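Because the factorial terms cancel in the ratio, the A-statistic reduces to a one-line computation. The sketch below uses natural logarithms (the choice of base only rescales the numeric threshold) and assumes the unitig's span, fragment count, total read count, and genome size are known; the function name is illustrative.

```python
import math

def a_statistic(k, rho, F, G):
    """Log of the ratio of the two Poisson terms above:
    log(p_unique / p_two_copy) = rho*F/G - (k - 1)*log(2),
    since the (k-1)! factors cancel."""
    return rho * F / G - (k - 1) * math.log(2)

# A unitig with 20 fragments spanning 10 kb, in a project with
# F = 20,000 reads over a G = 2-Mbp genome:
print(a_statistic(20, 10_000, 20_000, 2_000_000))  # 100 - 19*ln2 ~ 86.8
```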
By detecting repeat boundaries, unitigs can be extended by up to roughly a fragment length. Repeat boundaries are detected as follows: whenever a unitig A overlaps two different unitigs B and C, but the corresponding regions of B and C fail to overlap each other, a repeat boundary can be located by aligning B and C with a dynamic programming technique.
7.4.5 Scaffolder
The Celera scaffolder [9] uses a greedy algorithm, operating on a graph built from mate-pair information and BAC ends, to orient and order unitigs. We briefly sketch the algorithm in two steps.

7.4.5.1 Bactig Graph Construction
Bactigs are contiguous sequences from a common source region, obtained by the shotgun strategy and assembly process. In the Drosophila shotgun data, there were four types of library clones, of lengths 2k, 10k, 50k, and 100k. The graph is constructed in three steps.

1. The initial graph: A bactig graph G is a weighted, undirected multigraph without self loops. Each bactig Bi gives rise to precisely two nodes v and w in the graph and an edge e connecting them. Each edge e is associated with a standard deviation σ(e) of its length variation.
2. Edge bundling: Let M denote the set of mate edges between nodes v and w. Edges are bundled by first greedily choosing a median-length mate edge e′ ∈ M, collecting every edge e ∈ M whose length l(e) is within 3σ(e′) of l(e′), and bundling them with e′. The weight of an edge e, w(e), is 1 if e is a simple mate edge and $\sum_{i=1}^{k} w(e_i)$ if e is obtained by bundling k mate edges {e1, ..., ek}.
3. Transitive reduction: This step transitively reduces long mate edges. Recall that there are four clone-library lengths: 2k, 10k, 50k, and 100k. By transitively reducing long edges from a larger clone library, jumping edges that could cause ambiguity in the ordering are removed. A mate edge e from v to w can be transitively reduced onto a path P between v and w if e and P have approximately the same length (within three times the variation). If so, the edge e is removed from the graph and the weight of every mate edge mi in P is increased by w(e). In this way, the final bactig graph is constructed.
7.4.5.2 Path-Merging Algorithm
The bactig-ordering problem is solved by a heuristic called the greedy path-merging algorithm. Edges are classified as happy or unhappy: a happy edge is one whose implied length is within three times its length variation, and any other edge is unhappy. Given a bactig graph G, the goal is to find an ordering φ of G that maximizes the total weight of the happy edges.
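A sketch of the happiness test and of the objective follows, assuming each mate edge is a (v, w, expected_length, sigma, weight) tuple and an ordering is given as a coordinate for each node; these representations are illustrative choices, not the scaffolder's internal ones.

```python
def is_happy(pos, edge, k=3):
    """A mate edge is 'happy' under an ordering if the distance it
    implies is within k standard deviations of its expected length."""
    v, w, length, sigma, weight = edge
    return abs((pos[w] - pos[v]) - length) <= k * sigma

def ordering_score(pos, edges):
    """Objective maximized by the greedy path-merging heuristic:
    the total weight of the happy mate edges."""
    return sum(e[4] for e in edges if is_happy(pos, e))
```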
7.5 Arachne

Arachne [2] has also been used to assemble large genomes and has a number of well-designed features for large-scale genome projects: an efficient and sensitive procedure for finding real overlaps, an overlap-scoring procedure that achieves high accuracy by correcting errors before assembly, read merging based on forward-reverse links, and detection of repeat contigs through mate-pair consistency. The Arachne assembly algorithm operates as follows:

1. Overlap detection and alignment: Pairwise overlaps are computed by a three-step process: (a) identify all shared k-mers (k = 24) and merge overlapping shared k-mers, (b) extend the shared k-mers into alignments, and (c) refine the alignments by dynamic programming.
2. Error correction: Sequencing errors are detected and corrected by generating multiple alignments among overlapping reads and applying a majority rule weighted by base-quality scores (from Phred).
3. Evaluation of alignments: Each alignment is evaluated by an overall penalty score that combines the individual discrepancies in base calls (taking quality scores into account). Overlaps incurring a high penalty score are discarded. This step detects repeats and chimeric reads.
4. Identification of mate pairs: Complexes of mate pairs are built by starting from two overlapping read pairs joined at both ends and iteratively adding paired reads that match both ends.
5. Contig assembly: Potential repeat boundaries are identified by aligning fragments that extend the same fragment. Fragment reads are merged and extended until a repeat boundary is reached.
6. Detection of repeat contigs: Once contigs are assembled, contigs that are potentially misassembled due to repeats are identified by two criteria: the depth of coverage and the consistency of links to other contigs.
7. Creation of supercontigs (scaffold generation): Once contigs are assembled and repeat contigs are marked, supercontigs are built incrementally
using the unmarked contigs. To make this procedure recursive, every contig initially becomes a supercontig, and supercontigs are then merged incrementally using a priority queue.
8. Filling gaps in supercontigs: Once the scaffolds of contigs are generated, the assembler attempts to fill the gaps between contigs with repeat contigs, computing shortest paths between contigs and then performing a breadth-first search for contigs that can be placed within a supercontig.

7.5.1 Contig Assembly
Reads are merged into contigs up to potential repeat boundaries. Repeat boundaries are detected from inconsistencies among extending overlaps. For example, suppose a read r can be extended by both x and y, but x and y do not overlap each other; then a repeat boundary can be detected by aligning x and y. This alone, however, can flag too many potential repeat boundaries, so Arachne uses two techniques, dominated reads and subreads, to eliminate spurious ones.
7.5.2 Detecting Repeat Contigs and Repeat Supercontigs
Two heuristics are used to detect contigs and supercontigs that are misassembled due to repeats.

1. Density of reads: Arachne computes the log-odds ratio

$$\log \frac{P(\text{observed read density} \mid \text{unique region of the genome})}{P(\text{observed read density} \mid \text{at least two copies of a repeated region})}$$

Any contig with a log-odds ratio less than one is marked as repeated. This technique is also used in the Celera assembler (see Section 7.4).
2. Consistency of forward-reverse links: For two contigs A and B, the distance between them is estimated whenever there are at least two links. The mean and standard deviation, denoted d(A, B) and ERR(A, B), respectively, are computed and used to detect repeat (super)contigs with two rules. (a) Rule 1: If d(A, B) < −2,000 − 4 × ERR(A, B), then the (super)contigs are marked as repeated. (b) Rule 2: If contig A is linked to both B and C, then from d(A, B), d(A, C), and the lengths of B and C, we can compute d(B, C), the estimated gap or overlap length between B and C. If d(B, C) < −2,000 − 4 × ERR(B, C), then A is marked as repeated.
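Rule 1 is easy to state in code. The sketch below assumes each forward-reverse link yields an (implied_gap, std) pair; the way the per-link estimates are combined into d(A, B) and ERR(A, B) here (mean and RMS-combined deviation) is an illustrative assumption, while the −2,000 − 4 × ERR cutoff is from the text.

```python
import math

def link_estimate(links):
    """Combine >= 2 forward-reverse links between two contigs into
    an estimated gap d(A, B) and a standard error ERR(A, B)."""
    gaps = [g for g, _ in links]
    d = sum(gaps) / len(gaps)
    err = math.sqrt(sum(s * s for _, s in links)) / len(links)
    return d, err

def rule1_repeated(d, err):
    """Rule 1: an implied overlap deeper than 2,000 bases plus four
    standard errors is taken as evidence of a repeat."""
    return d < -2000 - 4 * err
```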
7.6 EULER

We now review a graph-based fragment assembly approach, EULER [11]. Earlier approaches related to EULER include sequencing by hybridization [12, 13] and the Idury-Waterman algorithm [14].

7.6.1 The Idury-Waterman Algorithm
Idury and Waterman [14] proposed a new way to assemble shotgun reads, using the computational methods developed for sequencing by hybridization (SBH) [12, 13]. To reconstruct a sequence such as ATGCAGGTCC, we can adopt a Hamiltonian-path approach over its 3-tuples: ATG, TGC, GCA, CAG, AGG, GGT, GTC, TCC. There is a directed edge from ATG to TGC since the suffix TG of ATG equals the prefix TG of TGC. After constructing this graph, the sequence is determined by finding a Hamiltonian path through it. Since finding a Hamiltonian path is NP-complete, Pevzner proposed an alternative graph formulation: in the new graph G, each (k − 1)-tuple becomes a node, and an edge is drawn from node N1 to N2 when there is a probe (k-tuple) whose prefix of length k − 1 is N1 and whose suffix of length k − 1 is N2. With this graph, the sequence can be determined by finding an Eulerian tour, for which a simple linear-time algorithm is known (Figure 7.1 shows the Hamiltonian graph H and the Eulerian graph G for the sequence ATGCAGGTCC). A further advantage of the Eulerian approach is that there are cases where the Eulerian tour determines a unique sequence while the Hamiltonian formulation cannot, because multiple Hamiltonian tours exist for the same sequence.

Figure 7.1 Graphs H and G for ATGCAGGTCC.

The basic idea of the Idury-Waterman algorithm is that the error rate of shotgun reads is low enough that the true k-tuples (playing the role of the probes in SBH) occur sufficiently often. For example, at a 2% error rate, there will on average be about 50 − k correct k-tuples in every 50 nucleotides of sequence. Idury and Waterman rely on this multiple coverage to sidestep the errors. From the shotgun data, an Eulerian graph on (k − 1)-tuples is constructed from the k-tuples, and the consensus sequence is generated by searching for Eulerian tours.
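The Eulerian formulation is short enough to show in full. Below is a self-contained sketch that builds the de Bruijn graph on (k − 1)-tuples and recovers a sequence with Hierholzer's linear-time Eulerian-path algorithm; it assumes error-free k-tuples and a unique Eulerian path, so it is the idealized core of the method rather than the full Idury-Waterman algorithm.

```python
from collections import defaultdict

def de_bruijn(kmers):
    """Nodes are (k-1)-mers; each k-mer contributes one directed edge
    from its length-(k-1) prefix to its length-(k-1) suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; starts at a node with out-degree
    exceeding in-degree, if one exists.  Consumes the graph."""
    out_deg = {v: len(ws) for v, ws in graph.items()}
    in_deg = defaultdict(int)
    for ws in graph.values():
        for w in ws:
            in_deg[w] += 1
    start = next((v for v in out_deg if out_deg[v] > in_deg[v]),
                 next(iter(out_deg)))
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def assemble(kmers):
    path = eulerian_path(de_bruijn(kmers))
    return path[0] + "".join(v[-1] for v in path[1:])

# The 3-mers of ATGCAGGTCC reconstruct the sequence:
print(assemble(["ATG", "TGC", "GCA", "CAG",
                "AGG", "GGT", "GTC", "TCC"]))  # ATGCAGGTCC
```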
7.6.2 An Overview of EULER
EULER uses an Eulerian-superpath approach rather than the traditional "overlap-layout-consensus" approach. As in the Idury-Waterman approach, a de Bruijn graph G is constructed from tuples of length l, and an Eulerian path of the graph is computed for fragment assembly (see Section 7.6.1). However, there are two major differences between EULER and the Idury-Waterman approach.

1. Errors in fragments make a tangle of erroneous edges, which becomes a major hurdle in computing the Eulerian path. EULER tries to correct as many errors in the fragments as possible before computing the Eulerian path, by solving the error-correction problem.
2. EULER solves the Eulerian-superpath problem, in which a series of transformations of an Eulerian graph and a collection of paths into a new graph and a new collection of paths is made to compute a solution to the fragment assembly problem.
7.6.3 Error Correction and Data Corruption
To correct errors in the input shotgun data, we would seemingly need to know the target sequence; however, the target sequence is exactly what is to be assembled from the (erroneous) shotgun reads. This seemingly impossible task is formulated as computing $G_l$, the set of all l-tuples of the target G. Of course, $G_l$ is unknown, but the strategy is to approximate $G_l$ without knowing the sequence of G. The error-correction problem was proposed to eliminate errors in the shotgun data. The basic idea comes from the observation that a single error in a read s affects at most l of the l-tuples in s, and l of the l-tuples in the reverse complement of s, all pointing to the same sequencing error. EULER employs a greedy approach that looks for error corrections in the reads that reduce the size of $S_l$, the set of l-tuples occurring in the reads, by 2l. This greedy procedure sometimes introduces errors rather than correcting them; for example, it corrected 234,410 errors and introduced 1,452 errors in the N. meningitidis genome project. Pevzner et al. argue that introducing errors is acceptable as long as consistent overlaps can be detected, so that false edges in the de Bruijn graph are reduced (the introduced errors can be corrected later).
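The error-correction idea, that a corrected base should remove many low-multiplicity l-tuples at once, can be sketched as follows. This is a simplified stand-in for EULER's procedure: the spectrum, the min_mult cutoff, and the single-substitution search are illustrative choices, not the paper's exact formulation.

```python
from collections import Counter

def l_tuple_spectrum(reads, l):
    """Multiplicity of every l-tuple across the read set."""
    spec = Counter()
    for r in reads:
        for i in range(len(r) - l + 1):
            spec[r[i:i + l]] += 1
    return spec

def greedy_correct(read, spec, l, min_mult=2):
    """Try single-base substitutions that turn rarely seen l-tuples
    into frequently seen ones, mimicking the reduction of |S_l|."""
    read = list(read)
    for i in range(len(read)):
        window = range(max(0, i - l + 1), min(i, len(read) - l) + 1)
        solid = lambda j: spec["".join(read[j:j + l])] >= min_mult
        if all(solid(j) for j in window):
            continue                 # every tuple covering i looks fine
        best, best_score = read[i], sum(solid(j) for j in window)
        for base in "ACGT":
            read[i] = base
            score = sum(solid(j) for j in window)
            if score > best_score:
                best, best_score = base, score
        read[i] = best
    return "".join(read)
```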
7.6.4 Eulerian Superpath
Once $S_l$, the set of all l-tuples from the reads S = {s1, ..., sn}, is computed by solving the error-correction problem, the de Bruijn graph $G(S_l)$ with vertex set $S_{l-1}$ (the set of all (l − 1)-tuples from S) is built as follows: a directed edge is added from an (l − 1)-tuple v ∈ $S_{l-1}$ to another w ∈ $S_{l-1}$ if $S_l$ contains an l-tuple whose prefix of length l − 1 is v and whose suffix of length l − 1 is w.

7.6.4.1 Eulerian Superpath Problem
Given an Eulerian graph and a collection of paths in it, the Eulerian-superpath problem asks for an Eulerian path in the graph that contains all of these paths as subpaths. The problem is solved by transforming the graph G and the system of paths P into a new graph G1 and a new system of paths P1. Such a transformation is called equivalent if there is a one-to-one correspondence between the Eulerian superpaths of (G, P) and those of (G1, P1). The goal is to make a series of equivalent transformations
$$(G, P) \rightarrow (G_1, P_1) \rightarrow \cdots \rightarrow (G_k, P_k)$$
where the final system of paths $P_k$ consists of single edges. Because the transformations are equivalent, every solution of the Eulerian-superpath problem in $(G_k, P_k)$ provides a solution of the Eulerian-superpath problem in (G, P) (i.e., contigs). The transformations are implemented using two techniques, detachment and cut (Figures 7.2 and 7.3). The x,y-detachment is a transformation that adds a new edge $z = (v_{in}, v_{out})$ and deletes the edges x and y from G (see Figure 7.2). The x-cut is a transformation that removes the edge x from all paths that contain it (see Figure 7.3). The authors report that detachments and cuts are powerful enough to untangle the graph and to reduce fragment assembly to the Eulerian path problem for the bacterial genomes they studied.

Figure 7.2 The x,y-detachment: edges x and y are substituted with a new edge z. As a result, all three paths have direct paths from x to y.

Figure 7.3 The x-cut, used for a tangle (i.e., a repeat that does not have a read-path). The edge x is deleted from all paths that contain x.

7.6.5 Use of Mate-Pair Information
There are two EULER programs that utilize mate-pair information, EULER-DB and EULER-SF. EULER-DB tries to untangle repeats by treating each clone-mate pair as an artificial path in the graph with its expected length. EULER-SF generates scaffolds using mate-pair information.
7.7 Other Approaches to Fragment Assembly

In this section, we survey two interesting approaches that are not yet widely used in the genome-sequencing community. The first, by Parsons et al. [15], employs a genetic algorithm. The second, by Kim and Segre [16], is a structured pattern-matching approach using exact patterns sampled from the shotgun reads.

7.7.1 A Genetic Algorithm Approach
Parsons et al. [15] proposed an approach based on genetic algorithms. Since the fragment assembly problem is NP-complete, applying genetic algorithms makes sense. Furthermore, as Parsons, Forrest, and Burks argue, the basic idea underlying all genetic algorithms (building a better composite solution from good partial solutions) is well suited to fragment assembly.

7.7.1.1 Representing the Sequence Assembly Problem
A permutation of integers represents an ordering of fragment numbers in which successive fragments overlap when there is evidence for an overlap. Parsons et al. used the sorted-order representation for the permutation ordering [17]. A legal ordering has two requisite properties: (1) every fragment is present in the ordering, and (2) no fragment is duplicated. Consider the example in Figure 7.4, where each group of 3 successive bits in the bit string encodes a fragment number. The bit string encodes the sequence 2 7 1 5 3. Each number is given an associate recording its position in this sequence: the number 2 appears first, so 2 is associated with 1, and so on. The sequence is then sorted in ascending order, giving 1 2 3 5 7, and the permutation is the corresponding sequence of associates, that is, 3 1 5 4 2. The final step is to read this sequence starting at position 2, the starting position designated by the last number in the original bit string. This reading gives the permutation 1 5 4 2 3.

Figure 7.4 How to get a permutation ordering, 1 5 4 2 3.
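The decoding of the sorted-order representation is mechanical; the sketch below reproduces the worked example, with the 3-bit field width taken from the figure and everything else (names, the 1-based conventions) illustrative.

```python
def decode_sorted_order(bits, n, width=3):
    """Decode the sorted-order representation: the first n
    fixed-width fields are fragment numbers, the final field is
    the 1-based start position."""
    nums = [int(bits[i * width:(i + 1) * width], 2) for i in range(n)]
    start = int(bits[n * width:], 2)
    # Associate = 1-based position of each number, read off after
    # sorting the numbers in ascending order.
    order = sorted(range(n), key=lambda p: nums[p])
    associates = [p + 1 for p in order]
    k = start - 1                     # rotate to the start position
    return associates[k:] + associates[:k]

# Fragments 2 7 1 5 3 with start position 2:
bits = "010111001101011" + "010"
print(decode_sorted_order(bits, 5))   # [1, 5, 4, 2, 3]
```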
7.7.1.2 Fitness Function

The choice of a fitness function is crucial to the success of a genetic algorithm. Parsons, Forrest, and Burks used two fitness functions, F1 and F2. Let I = f[0], f[1], ..., f[n − 1] be an ordering of fragments, where f[i] = j denotes that fragment j appears in position i of the ordering. The first fitness function is defined by

$$F_1(I) = \sum_{i=0}^{n-2} w_{f[i], f[i+1]}$$

where $w_{i,j}$ is the pairwise overlap strength of fragments i and j. Note that this function examines only the overlap strength of directly adjacent fragments in the ordering; the optimization process finds a layout that maximizes it. The second fitness function considers the overlap strength between all possible pairs, penalizing orderings in which fragments with strong overlap lie far apart:

$$F_2(I) = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} |i - j| \times w_{f[i], f[j]}$$
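Both fitness functions are one-liners given an overlap-strength matrix w indexed by fragment number; this matrix representation is an assumed convenience.

```python
def f1(layout, w):
    """F1: total overlap strength of adjacent fragments (maximize)."""
    return sum(w[layout[i]][layout[i + 1]]
               for i in range(len(layout) - 1))

def f2(layout, w):
    """F2: distance-weighted overlap strength over all pairs; strong
    overlaps placed far apart inflate it."""
    n = len(layout)
    return sum(abs(i - j) * w[layout[i]][layout[j]]
               for i in range(n) for j in range(n))
```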
Four genetic operators (order crossover, edge recombination, inversion, and transposition) are then applied, and the resulting orderings are evaluated with the fitness function. An important property of these operators is that they try to preserve or extend existing contigs. In particular, the two mutation-like operators never contract existing contigs but only try to extend them. The underlying idea is to allow larger contigs to evolve by treating smaller contigs as basic building blocks.

7.7.2 A Structured Pattern-Matching Approach
Kim and Segre [16] proposed a fragment assembly algorithm that uses exact matches of short patterns, randomly sampled from the fragment data, to identify fragment overlaps, construct an overlap map, and finally deliver a consensus sequence.

7.7.2.1 Overview of AMASS
AMASS starts by sampling patterns of fixed length from the shotgun data and then finding all occurrences of the sampled patterns in the shotgun data. The occurrences
of the patterns are used to detect overlapping fragments and to handle repeats, a major hurdle in fragment assembly. Two distributions, the pattern distribution and the fragment distribution, are used to identify fragments that come from repeats. Fragments from nonrepeat regions are assembled first, and the resulting contigs are later extended into repeat regions. Figure 7.5 illustrates how AMASS works.

7.7.2.2 Building Contigs
AMASS represents a fragment as an ordered set of pattern occurrences (probes) together with their interprobe spacings, and fragment overlaps are determined from this representation. Contigs are built in a greedy fashion, adding the best-scoring fragment one at a time, as summarized next.

1. Build a pairwise overlap table.
2. Take the best pair of unassigned fragments and build an initial contig.
3. Compute the overlap score between the current contig and each remaining fragment.
4. Select the fragment with the highest score and grow the current contig by adding it.
5. Repeat steps 3 and 4 while the overlap scores remain significant.
6. Go to step 2 while unassigned fragments remain.

Figure 7.5 An overview of the AMASS fragment assembler.

7.7.2.3 Handling Repeats
Repeats are handled in three ways: (1) satellite matching, (2) the pattern distribution, and (3) the fragment distribution. To handle very short repeats, AMASS employs a technique called satellite matching, which tests for the coexistence of very short patterns around a pair of common pattern (probe) occurrences. If two probes come from different repeat copies, their surrounding regions will differ for short repeats, and the satellite-matching test can identify probes from different repeat copies quickly and effectively. Any probe occurrences that fail the satellite-matching test are discarded so that they are not used for contig construction. Probe occurrences are further screened using the pattern distribution, the distribution of the number of probe occurrences versus the number of probes. Any probes that occur an unusual number of times are very likely from repeats and are discarded; this technique is used by nearly all fragment assemblers to quickly filter out probes from repeats. AMASS also attempts to identify fragments that come from repeats using the fragment distribution, the distribution of the number of fragments versus the number of probe occurrences per fragment. The basic idea is that fragments from long repeats will exhibit a significant increase in the number of probe occurrences, since probes from one repeat copy will also occur in the other copies of the same repeat. AMASS marks fragments in the tail of this distribution as repeat fragments. Contigs are first constructed for nonrepeat regions using the fragments not marked as repeat fragments, and the repeat regions are expanded later, whenever the ambiguity in the overlapping regions is acceptable.
7.8 Incompleteness of the Survey

Genome sequencing is a very important problem in biology, but it is also a very difficult computational problem, and there has consequently been a great deal of research on computational techniques for it. There are many fragment assembly algorithms, and this survey is by no means complete. Among the assemblers we could not include in this chapter due to page limitations are Atlas [18], Phusion [19], STROLL [20], the Staden package [21], JAZZ [22], AMOS [23], and Minimus [24]. Genome sequencing is still not a fully solved problem, and more sophisticated
techniques will continue to be developed, especially for short-fragment assembly, some of which are described in this volume.
Acknowledgments

Sun Kim was supported in part by CAREER award DBI-0237901 from the National Science Foundation (United States) and by a grant from the Korea Institute of Science and Technology Information. We thank the anonymous reviewer for valuable comments.
References

[1] Myers, E. W., et al., "A Whole-Genome Assembly of Drosophila," Science, Vol. 287, 2000, pp. 2196–2204.
[2] Batzoglou, S., et al., "Arachne: A Whole-Genome Shotgun Assembler," Genome Res., Vol. 12, No. 1, 2002, pp. 177–189.
[3] Sutton, G., et al., "TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects," Genome Science and Technology, Vol. 1, 1995, pp. 9–19.
[4] Huang, X., "An Improved Sequence Assembly Program," Genomics, Vol. 33, 1996.
[5] Huang, X., "A Contig Assembly Program Based on Sensitive Detection of Fragment Overlaps," Genomics, Vol. 14, 1992.
[6] Huang, X., et al., "PCAP: A Whole-Genome Assembly Program," Genome Res., Vol. 13, 2003, pp. 2164–2170.
[7] Kececioglu, J. D., and E. W. Myers, "Combinatorial Algorithms for DNA Sequence Assembly," Algorithmica, Vol. 13, 1995.
[8] Altschul, S. F., et al., "Basic Local Alignment Search Tool," Journal of Molecular Biology, Vol. 215, 1990, pp. 403–410.
[9] Huson, D. H., et al., "The Greedy Path-Merging Algorithm for Sequence Assembly," RECOMB 2001, 2001, pp. 157–163.
[10] Kim, S., et al., "Enumerating Repetitive Sequences from Pairwise Sequence Matches," manuscript, DuPont Central Research, 2000.
[11] Pevzner, P. A., et al., "An Eulerian Path Approach to DNA Fragment Assembly," PNAS, Vol. 98, 2001, pp. 9748–9753.
[12] Cantor, C., et al., "SBH: An Idea Whose Time Has Come," Genomics, Vol. 11, 1992.
[13] Pevzner, P. A., and R. J. Lipshutz, "Towards DNA Sequencing Chips," 19th Symposium on Mathematical Foundations of Computer Science, Vol. 841, 1994.
[14] Idury, R., and M. S. Waterman, "A New Algorithm for DNA Sequence Assembly," Journal of Computational Biology, Vol. 2, No. 2, 1995, pp. 291–306.
[15] Parsons, R. J., et al., "Genetic Algorithms, Operators, and DNA Fragment Assembly," Machine Learning, Vol. 21, 1995.
[16] Kim, S., and A. M. Segre, "AMASS: A Structured Pattern Matching Approach to Shotgun Sequence Assembly," Journal of Computational Biology, Vol. 6, No. 4, 1999.
[17] Syswerda, G., "Uniform Crossover in Genetic Algorithms," Proc. of the Third International Conference on Genetic Algorithms, 1989.
[18] Havlak, P., et al., "The Atlas Genome Assembly System," Genome Res., Vol. 14, 2004, pp. 721–732.
[19] Mullikin, J. C., and Z. Ning, "The Phusion Assembler," Genome Res., Vol. 13, 2003, pp. 81–90.
[20] Chen, T., and S. Skiena, "Trie-Based Data Structures for Fragment Assembly," Eighth Symposium on Combinatorial Pattern Matching, Aarhus, Denmark, June 30–July 2, 1997.
[21] Staden, R., et al., "Sequence Assembly and Finishing Methods," in Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2nd ed., A. D. Baxevanis and B. F. Ouellette, (eds.), New York: John Wiley & Sons, 2001.
[22] Shapiro, H., "Outline of the Assembly Process: JAZZ, the JGI In-House Assembler," Lawrence Berkeley National Laboratory, Paper LBNL-58236, July 8, 2005.
[23] "AMOS: A Modular, Open Source Assembler," http://amos.sourceforge.net, 2007.
[24] Sommer, D. D., et al., "Minimus: A Fast, Lightweight Genome Assembler," BMC Bioinformatics, Vol. 8, February 26, 2007, p. 64.
8
Assembly for Double-Ended Short-Read Sequencing Technologies
Jiacheng Chen and Steven Skiena
Recently developed pyrosequencing-like technologies are extremely promising; however, the resulting reads are drastically shorter than those produced by current sequencing machines. We study the space of read-length, sequencing error rate, and coverage that lies well outside conventional assumptions, to determine the technological/economic parameters under which de novo sequencing will be achievable with these new technologies. We demonstrate that genome assembly of bacterial and human sequences is possible with astonishingly short reads, given sufficiently high (but still very economical) coverage. In particular, we show that paired reads of length 15 to 20 bases suffice to accurately sequence bacterial genomes. Further, the required level of coverage can be obtained at a projected cost of under $1,000 per megabase of genome.
8.1 Introduction

Roughly a thousand bacterial genomes and a few dozen higher organisms have been sequenced to date, but full-genome sequencing remains extremely expensive. The cost of adding the (n + 1)st genome to the n previously sequenced will make it economically impossible to justify new projects long before we have sequenced a reasonable fraction of the diversity of life.
Significantly cheaper sequencing technologies must be developed to read the complete book of life. Several surveys [1, 2] identify many promising new technologies. However, all current efforts focus on the problem of resequencing to study human variation. The economic and medical importance of resequencing ensures that it will drive technology development, at least until the goal of universal diagnostic sequencing is achieved. Resequencing humans is a much less demanding problem than de novo sequencing, because the ability to align sequence reads to a reference genome lets us get away with much noisier reads and vastly lower sequence coverage. We are convinced that coupling these new technologies with significant advances in computational sequence assembly has the potential to dramatically reduce the cost of de novo sequencing as well. Contemporary sequencing projects generate reads of 500 to 1,000 bases, with base-error rates of roughly 2% and project coverages on the order of 10 to 20. In this chapter, we study the space of read-length, sequencing error rate, and coverage that lies well outside conventional assumptions, to determine the technological/economic parameters under which de novo sequencing will be achievable with these new technologies. We demonstrate that genome assembly is possible with very short double-ended reads, given sufficiently high (but still economical) coverage. In particular, we describe a double-ended, short-read sequencing protocol and demonstrate that read-lengths of 15 to 20 bases suffice to sequence bacterial genomes under reasonable experimental conditions. Although our protocol appears similar to conventional double-barreled shotgun sequencing, the reason it works with such short reads is quite different from that of conventional protocols: we use double-ended reads not to close gaps, but for read subset selection. With sufficiently high coverage, double-ended reads increase the potential information content of the experiment to the point where assembly becomes feasible. We present new algorithms for assembling double-ended short-read sequencing data, along with analytical results demonstrating that reads of length $k = \frac{2}{3}\log_4 n + o(\log n)$ suffice to sequence random n-base strings with high probability. That this works is quite surprising; note that this read-length is so short that most k-mers will repeat frequently in the target. Our analysis implies, and our simulation results confirm, that read-lengths currently produced by pyrosequencing-based technologies suffice to assemble bacterial sequences. We present encouraging experimental results on assembling bacterial and human genome sequences. Our chapter is organized as follows. In Section 8.2, we briefly review the current state of the art in short-read sequencing technologies. In Section 8.3, we show that extremely short reads are sufficient for assembly, both in theory and in simulation on bacterial and mammalian genomes, given high
enough coverage. Finally, in Section 8.4 we describe our design and implementation of a complete short-paired-read assembler, with experimental results. Our conclusion is that microbial genomes can be reliably sequenced with paired short reads (of length 20, and perhaps less) at a coverage of 500 under realistic error rates. New technologies (in particular, polony sequencing [3, 4]) can be used to obtain such a level of coverage very economically, at a projected cost of under $1,000 per megabase of genome. Thus double-ended short-read assembly promises to dramatically reduce the cost of de novo genome sequencing. The most significant previous work we are aware of on short-read sequence assembly concerns experiments with the EULER assembler [5–7] on single reads as short as 70 bases with as much as 30-times coverage [8]. They conclude that the resulting contigs will require significant (if not prohibitive) finishing efforts. Our primary interest is in even shorter reads (15 to 25 bases) at much higher coverage (100 to 1,000 times), using a to-be-determined mix of double-ended reads. Whiteford et al. [9] analyze the feasibility of short-read sequencing by computing (1) the number of unique k-mers and (2) the size of contigs between repeated k-mers on reference genomes for small values of k. They conclude that de novo assembly of large portions of the E. coli genome is possible with reads of length 20 ≤ k ≤ 50. However, this analysis does not consider the impact of either sequencing errors or nonuniform sampling of reads, both of which significantly complicate the task of assembly. Chaisson et al. [8] observe that short-read sequencing at sufficiently high coverage essentially reduces to classical sequencing by hybridization [10], but they do not explore the potential of short double-ended read assembly in any detail. As we will show, the combinatorics of such reads at high coverage implies that two very short reads can be much better than one, even if the latter read is more than twice as long.
8.2 Short-Read Sequencing Technologies

The basic chemistry used in conventional sequencing machines is largely unchanged from the Gilbert and Sanger procedures developed in the 1970s. The great progress in sequencing came not so much from significantly longer or higher-quality reads as from a dramatic cost reduction through high levels of automation. Today, a large-scale sequencing project costs roughly $1 per read for fixed costs and $1 per read for variable costs [11]. Several companies and laboratories have developed what we classify as short-read sequencing technologies, capable of producing reads ranging in length from 10 to 50 bases. Only at the high end of this range might
conventional assembly strategies have a chance of sequencing simple genomes (e.g., viruses) that do not contain significant repeats, and only if the raw sequence data is very accurate. Most short-read sequencing technologies are based on pyrosequencing [12–14], a "sequencing-by-synthesis" technology. It begins when a sequencing primer is hybridized to a single-stranded DNA target. Sequencing proceeds in rounds of base extensions, alternating A, C, G, and T. Each base-incorporation event releases pyrophosphate, which drives the release of visible light. This light is detected by a CCD camera, with the peak of the light signal proportional to the number of nucleotides incorporated. Sequencing errors occur particularly in runs of repeated bases, where it becomes difficult to distinguish the flash of k bases from that of k + 1 bases. Reads of up to perhaps 200 bases are possible using pyrosequencing. Indeed, reads of over 100 bases are currently being obtained by the 454 Corporation (http://www.454.com), using technology based on pyrosequencing [15]. We note that reads of over 100 bases are sufficiently long that assembly techniques based on merging reads with long pairwise overlaps work roughly as well as on traditional 500-base Sanger-sequencing reads. Several technologies now under development produce substantially shorter but more inexpensive reads, however; these require different approaches to assembly. The new technologies differ in the details of molecule localization, amplification, and sequencing approach. Representative examples include:

• Polony sequencing, developed by Mitra and Church [16], is a new
technology for achieving massive parallelism. A polony (or PCR colony) is a tiny localized spot on a surface. Single molecules can be dispersed over a gel and amplified locally, without diffusing to neighboring polonies. Thus a single surface can hold millions of distinguishable molecules, as opposed to a fixed number of mechanical wells, and all of these molecules can be sequenced simultaneously using a pyrosequencing-like technology. Mitra and Church project that their method can yield raw sequence at $0.10 per megabase, which radically alters notions of achievable sequence coverage. Sequencing a (typical) 1-megabase bacterial genome at thousandfold coverage would cost only $100; even 100,000-fold coverage would be easily affordable by a single investigator (only $10,000 per bacterial genome). They have demonstrated the ability to produce 15-base reads with 99.3% accuracy, as measured over 2 million reads on a small collection of templates, and, more importantly, 13-base paired reads [17]. As will be shown in Section 8.3, such read-lengths
are already in the neighborhood needed for assembling bacterial genomes. Agencourt Bioscience Corp. (http://www.agencourt.com) has licensed this technology and uses it to sequence over 200 million bases per day [2].

• Solexa (http://www.solexa.co.uk) seeks to determine human sequence
variation for research and diagnostic purposes by obtaining large numbers of 25-base reads and aligning them to a reference sequence. Their single-molecule array technology disperses single-molecule fragments along a surface, creating an unaddressed array of extremely high density (about 10^8 sites per square centimeter). Solexa has a proprietary sequencing chemistry using base extension,
where one new base is extended per round and the color of fluorescence determines which nucleotide it is. Recently they sequenced a 162-kilobase contig of human DNA with more than 99.99% accuracy [2].

• Helicos (http://www.helicosbio.com) is pioneering a single-molecule
approach to sequencing based on technology from Quake [18]. Individual molecules are immobilized on a glass plate and then imaged as synthesis occurs. Because single molecules are sequenced without amplification, there is no PCR-induced bias and there are no amplification errors. The initial paper reported 5-base reads, but the obtainable read-length has been substantially increased since then.

• Single-molecule molecular-motion sequencing [19] obtains
information about base composition by observing the relative movements of two beads, attached to the RNA polymerase enzyme and the DNA template, respectively. Transcriptional motion of the polymerase indicates the incorporation of a new base. When such a reaction is run in an environment where one of the four nucleotides is present in low concentration, a pause is observed at the incorporation of each rare base; aggregating data from four such reactions thus gives the resulting sequence. The proof-of-principle experiment reported in [19] produced a 32-base read with two errors.
Other companies working on short-read sequencing include Genovoxx (http://www.genovoxx.de), Sequenom (http://sequenom.com), and VisiGen Biotechnologies (http://www.visigenbio.com).
8.3 Assembly for Short-Read Sequencing

Sequence repeats limit the usefulness of short reads, as any sequence repetition exceeding the read-length defines an unresolvable ambiguity. In particular, the shortest common superstring of a collection of short reads is likely to be a highly over-compressed representation of the target. To solve the problem of repeats, we propose the following variable insert-length, double-ended read protocol: fragment multiple target clones and use gel electrophoresis to separate out all fragments of length a ± b%, or (equivalently stated) of length d to d + w for given integers d and w. Although the details of realizable insert lengths depend upon the gel/substrate, such a protocol is quite standard. For polony sequencing [R. Mitra, personal communication, June 2004], insert sizes of 2,000 ± 25% base-pairs can be easily realized, and a variation of ± 10% is achievable with more effort. We show that there is enough information in the spectrum of very short read pairs to determine the sequence of random strings. DNA sequences are clearly not random strings; however, our analysis captures why very short reads with sufficiently high coverage can suffice for sequencing, and our experimental results reported below demonstrate that short reads do in fact suffice for human and bacterial sequences. The ideas are based on de Bruijn graphs and Eulerian paths, which are discussed in this chapter.

Theorem 1: The variable insert-length, double-ended read protocol suffices to determine a random n-base sequence S with high probability, even for $k = \frac{2}{3}\log_4 n + o(\log n)$ and $w = \Theta(\log n)$.

Note that this read-length is so short that k-mers will frequently repeat in S. However, if the window length w is short enough, the set of all mate pairs for a given k-mer is unlikely to contain a repeated (k − 1)-mer among themselves. Thus the mating k-mers will define paths in an appropriate subgraph of the de Bruijn graph [10], and hence be uniquely defined. On the other hand, the window size must be sufficiently large that these length-w paths/sequences are long relative to the expected repeat length of the target sequence, and occur frequently enough to provide high enough coverage for assembly. Any given k-mer occurs an expected $n/4^k$ times in S, so the expected size of the set of mate-pair k-mers of any given k-mer is $(n/4^k)w$. These are drawn from a universe of $4^k$ possible k-mers. By the analysis associated with the birthday paradox, in sampling with replacement we are unlikely to see a duplicate until we have sampled on the order of the square root of the universe. Thus, to ensure unambiguous reconstruction within windows,

$$c\,\sqrt{4^k} \;\ge\; \frac{n}{4^k}\,w \;\;\Rightarrow\;\; k \;\ge\; \frac{2}{3}\log_4\!\left(\frac{nw}{c}\right)$$
Any given sequence of length c₀ log₄ n will appear within S with probability 1/(n^(c₀ − 1)). Hence the probability of a duplicate of length w = c₀ log₄ n decreases exponentially with increasing c₀, and so a window length of this size suffices to exceed all repeats in S, giving the result.

8.3.1 Algorithmic Methods
The proof of Theorem 1 implies a reconstruction algorithm for double-barreled short reads:

• Group the mate pairs of each given k-mer.
• Build the de Bruijn graph on the set of pairing k-mers.
• Identify any long unbranching paths in this graph as substrings (fragments) of the target sequence (see the sketch below).
• Repeatedly merge any pairs of fragments that overlap by longer than the expected repeat length of the target, as in conventional shotgun-sequence assembly.

This basic algorithm can readily be extended by heuristics and statistical analysis of read frequencies to handle sequencing errors such as substitutions/indels and missing reads; details appear in Section 8.4.

An additional reconstruction criterion can be employed under the less-robust assumption that the complete set of pairing (n − d − w + 1)·w reads is available. If so, consider the end of a sufficiently long (> d + w + k) fragment. The terminating window of size w represents all of the mate k-mers of the k-mer upstream by d + w bases. The window of this k-mer (si) and the one just to its right (si+1) overlap by w − 1 bases. If all but one of the windows with left k-mer si+1 are completely resolved, then the last base of this window must be the only right k-mer which does not yet appear in a resolved window. Given a sufficiently large fragment, we can in principle repeat this extension algorithm to determine the entire sequence. In Tables 8.1 and 8.2, we report the results of assembly using both the basic and extension algorithms for each sequence.
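To make the core step concrete, the following is a minimal sketch (an illustrative toy, not the chapter's paired-window assembler) of building a de Bruijn graph from a set of k-mers and reporting its maximal unbranching paths as fragments. The function name and the toy k-mers are assumptions made for the example.

```python
from collections import defaultdict

def de_bruijn_fragments(kmers):
    """Return the maximal unbranching paths (fragments) of the de Bruijn
    graph whose nodes are (k-1)-mers and whose edges are the k-mers."""
    succs, preds = defaultdict(list), defaultdict(list)
    for kmer in set(kmers):
        u, v = kmer[:-1], kmer[1:]
        succs[u].append(v)
        preds[v].append(u)
    nodes = set(succs) | set(preds)

    def is_1_in_1_out(x):
        return len(succs[x]) == 1 and len(preds[x]) == 1

    fragments = []
    for v in nodes:
        if not is_1_in_1_out(v):            # a maximal path must start here
            for w in succs[v]:
                path = [v, w]
                while is_1_in_1_out(path[-1]):
                    path.append(succs[path[-1]][0])
                # spell the fragment: first node plus last char of each successor
                fragments.append(path[0] + "".join(p[-1] for p in path[1:]))
    return fragments

# Toy example: the 3-mers of "ATGGCGTGCA"; branches break the target into unitigs
print(de_bruijn_fragments(["ATG", "TGG", "GGC", "GCG",
                           "CGT", "GTG", "TGC", "GCA"]))
```

Running the example yields the fragments ATG, TGGC, TGC, GCGTG, and GCA: the repeated 2-mers TG and GC create branching nodes, which is exactly the ambiguity that the window structure of the mate pairs is used to resolve.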
8.3.2 Simulation Results
We performed an extensive series of simulation experiments on both random strings and actual genomic sequences to demonstrate the efficacy of the double-barreled short-read protocol. In our experiments, we assumed that the sequencing experiment provided the frequency of each mate k-mer pair, and that it distinguished the left from the right k-mer experimentally prior to assembly.
Table 8.1 Large Contig Coverage Fraction for Bacterial Sequences Via Double-Barreled, Short-Read Sequencing, for Various Read-Lengths (Results Provided for Basic [bas] and Extension [ext] Assembly Algorithms)

Species                     Length     k=13       k=14       k=15       k=16       k=17
                                       bas  ext   bas  ext   bas  ext   bas  ext   bas  ext
Borrelia burgdorferi        910,681    0.75 0.75  0.97 0.98  0.99 1.00  0.99 1.00  0.99 1.00
Haemophilus influenzae      1,830,023  0.93 0.94  0.97 0.98  0.98 0.99  0.98 0.99  0.98 1.00
Helicobacter pylori         1,667,825  0.85 0.86  0.95 0.96  0.96 0.99  0.97 0.99  0.97 1.00
Mycoplasma genitalium       580,074    0.95 0.96  0.97 1.00  0.97 1.00  0.97 1.00  0.97 1.00
Pseudomonas aeruginosa      4,164,955  0.86 0.86  0.98 0.98  0.99 0.99  0.99 1.00  0.99 1.00
Staphylococcus aureus       2,814,816  0.89 0.90  0.94 0.95  0.95 0.97  0.96 0.99  0.96 0.99
Streptococcus pneumoniae    1,326,684  0.91 0.92  0.94 0.97  0.96 0.98  0.96 0.99  0.96 0.99
Thermoplasma acidophilum    1,564,906  0.99 1.00  0.99 1.00  0.99 1.00  0.99 1.00  0.99 1.00
The first assumption is reasonable given the low cost of sufficiently high coverage for statistical analysis, and it is validated by our results in the next section. The latter was done for programming convenience in the simulation, but does not fundamentally contribute to the performance of the method.

Simulation results on full-genome bacterial sequences appear in Table 8.1. For each set of conditions, we report the fraction of the target assembled in large contigs, defined as pieces larger than twice the window length. The read-length required to reconstruct these targets grows very slowly with n, as predicted by Theorem 1. For all the species we tried, paired 17-base reads with a window size of 2,000 suffice to completely, or almost completely, determine the entire bacterial genome. The less-robust extension algorithm proves somewhat more helpful in achieving this high degree of reconstruction. However, we are confident that more careful analysis of the de Bruijn subgraphs can achieve most, if not all, of this performance gain in a more robust way. Enlarging the read-length to k = 30 did not necessarily remove all assembly artifacts from reconstruction, due to large repeats in these bacterial genomes. The significantly smaller read-lengths reported here performed essentially as well.
Table 8.2 Large Contig Coverage Fraction for 100-kb Human Sequences Via Double-Barreled, Short-Read Sequencing, for Various Read-Lengths (Results Provided for Basic [bas] and Extension [ext] Assembly Algorithms)

Chrm  Genbank-ID    k=15       k=20       k=25       k=30       k=50       k=100      Max
                    bas  ext   bas  ext   bas  ext   bas  ext   bas  ext   bas  ext   Repeat
1     NT_032977.6   0.36 0.57  0.76 0.94  0.89 0.95  0.98 1.00  1.00 1.00  1.00 1.00  102
2     NT_005058.14  0.75 0.93  0.91 0.98  0.93 1.00  0.95 1.00  0.99 1.00  1.00 1.00  520
3     NT_022459.13  0.99 1.00  1.00 1.00  1.00 1.00  1.00 1.00  1.00 1.00  1.00 1.00  60
4     NT_016606.16  0.85 1.00  1.00 1.00  1.00 1.00  1.00 1.00  1.00 1.00  1.00 1.00  205
5     NT_029289.10  0.31 0.59  0.54 0.77  0.85 0.99  0.99 1.00  1.00 1.00  1.00 1.00  85
6     NT_007592.13  0.93 1.00  0.99 1.00  0.99 1.00  0.99 1.00  1.00 1.00  1.00 1.00  115
7     NT_007819.14  0.12 0.21  0.45 0.60  0.66 0.85  0.69 0.89  0.80 1.00  0.96 1.00  255
8     NT_008183.17  0.75 0.94  0.97 1.00  0.99 1.00  1.00 1.00  1.00 1.00  1.00 1.00  105
9     NT_008413.16  0.82 0.98  0.93 0.98  0.97 1.00  0.99 1.00  1.00 1.00  1.00 1.00  80
Y     NT_011903.10  0.83 0.92  0.97 1.00  0.98 1.00  0.98 1.00  0.99 1.00  0.99 1.00  1665
Simulation results on 100-kb human sequence fragments drawn from 10 different chromosomes appear in Table 8.2. As expected, human sequence proves significantly more difficult to assemble. Still, in 9 of 10 cases read-lengths of k = 30 suffice for complete assembly, even though these fragments contained repeats of length from 60 to 1,665.
8.4 Developing a Short-Read-Pair Assembler

The simulation results presented in the previous sections clearly demonstrate that it is possible to use double-ended short reads to assemble genome sequences, given arbitrarily high coverage of error-free reads. Determining the actual coverage needed for reads of a particular length, subject to a given sequencing error rate, requires more sophisticated experiments. Towards that end, we have developed a full assembler for the short paired-read technologies discussed in this chapter. In this section, we report on the design of our assembler and our results on simulated bacterial genome-sequencing projects. Our results demonstrate that far lower levels of coverage are needed for successful assembly than might be inferred from our theoretical analysis.

We first describe our experimental methodology. All experiments simulate the de novo sequencing of Mycoplasma genitalium, the shortest known bacterial genome. This was selected to maximize the variety of read-length, coverage, and error conditions we could evaluate with our computational resources. Selected experiments with larger bacterial genomes (in particular, a strain of E. coli) produced similar results. Substitution base errors were generated independently and uniformly at random. This error model does not completely reflect the sequence artifacts of pyrosequencing and related technologies, although our experiments with actual polony sequencing reads make clear that such artifacts can be effectively corrected with techniques similar to those described next.

The phases of our assembly process are as follows, and are illustrated in Figure 8.1:

• Cleaning input read-pairs—Our initial phase of processing takes the set of input read-pairs and attempts to discard or correct reads containing base-sequencing errors. Similar ideas in various contexts appear in [8, 9, 20].
• Read frequency analysis—Let t be the total number of sampled reads (i.e., t = c · n/k, where c is the coverage, n is the genome size, and k is the read-length). Let A(x) denote the probability that there are x samples of a particular k-mer r, given that r occurs on the genome G. Then
    A(x) = C(t, x) · (1/n)^x · ((n − 1)/n)^(t − x)

where C(t, x) denotes the binomial coefficient.

Figure 8.1 Phases of the assembly process: cleaning input read-pairs; read frequency analysis; grouping the "left" reads using the de Bruijn graph; assembling the associated "right" reads; and postassembly contig extension.
Thus the probability that a given k-mer of G is sampled at most s times is B(s) = Σ_{i=0}^{s} A(i). From this, a frequency threshold can easily be determined, below which reads can be identified as sequencing errors instead of correct reads. These equations can be generalized to correct for the impact of sequencing errors (given a technology-specific base-error rate) and multiple occurrences of a given k-mer on G.

• Read correction—To glean some information from reads with sequencing errors, we map any read r′ occurring less frequently than the above threshold to a k-mer r if there is a unique r of greater frequency than r′ such that the Hamming distance from r′ to r is 1 (see the sketch below). Such matches occur frequently because our read-lengths are so short that even base-error rates of 3% or so leave a substantial probability that reads contain at most a single base error.
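As an illustration, here is a minimal sketch of this cleaning step under simplifying assumptions: the frequency threshold is taken as a given parameter (in practice it would be derived from B(s) above), and reads are plain strings over {A, C, G, T}.

```python
from collections import Counter

ALPHABET = "ACGT"

def clean_reads(reads, threshold):
    """Discard or correct infrequent reads: a read below the frequency
    threshold is mapped to a unique Hamming-distance-1 neighbor of
    higher frequency, if one exists; otherwise it is discarded."""
    freq = Counter(reads)
    corrected = Counter()
    for read, count in freq.items():
        if count >= threshold:
            corrected[read] += count
            continue
        # enumerate all Hamming-distance-1 neighbors of the read
        neighbors = [read[:i] + b + read[i + 1:]
                     for i in range(len(read)) for b in ALPHABET
                     if b != read[i]]
        candidates = [n for n in neighbors if freq[n] > count]
        if len(candidates) == 1:          # unique higher-frequency neighbor
            corrected[candidates[0]] += count
        # otherwise: ambiguous or no match, so discard the read
    return corrected

# Toy usage: one erroneous copy of "ACGTT" is folded into "ACGTA"
reads = ["ACGTA"] * 9 + ["ACGTT"]
print(clean_reads(reads, threshold=3))   # -> Counter({'ACGTA': 10})
```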
Table 8.3 demonstrates the effectiveness of our read-correction efforts, as reflected by both the remaining sequence coverage (after discarding/correcting erroneous reads) and the percentage of undiscarded erroneous reads. Even with base-error rates of 3%, a healthy fraction of the original coverage always remains, even as a substantial percentage of false reads is eliminated.
Table 8.3 Remaining Effective Coverage and Percentage of Surviving Erroneous Reads After Read-Error Correction (as a Function of Input Coverage, Read-Length, and Base-Error Rate)

                   Effective Coverage            Surviving False Reads %
k    Raw Coverage  0%    1%    2%    3%          0%    1%    2%     3%
25   100           100   82    59    36          0.0%  0.1%  1.4%   2.2%
     150           150   137   115   77          0.0%  1.0%  2.5%   3.7%
     200           200   188   158   121         0.0%  1.7%  4.1%   5.8%
     250           250   235   165   161         0.0%  2.6%  6.2%   8.4%
     350           350   331   289   234         0.0%  0.1%  0.6%   1.3%
     500           500   351   409   323         0.0%  0.0%  0.9%   2.0%
20   100           100   91    76    58          0.0%  0.6%  1.9%   3.1%
     150           150   143   127   106         0.0%  1.4%  3.6%   5.6%
     200           200   192   175   150         0.0%  2.3%  6.0%   8.9%
     250           250   240   219   191         0.0%  3.6%  9.0%   13.0%
     350           350   337   307   267         0.0%  0.1%  0.6%   1.6%
     500           500   481   434   364         0.0%  0.2%  1.7%   2.0%
15   100           100   94    85    75          0.0%  1.1%  3.3%   5.7%
     150           150   141   130   118         0.0%  2.3%  6.8%   11.0%
     200           200   188   174   158         0.0%  4.1%  11.3%  17.7%
     250           250   235   215   193         0.0%  0.1%  1.3%   3.8%
     350           350   328   298   263         0.0%  0.3%  3.0%   8.1%
     500           500   467   419   360         0.0%  0.1%  7.0%   3.8%
• Construct the de Bruijn subgraph on "left" reads so as to group the associated "right" reads—This phase performs the selection process to identify the paired reads likely to occur within the variance of the fragment size. To reduce the coverage necessary to construct contigs of the "right" reads, we partition the "left" reads into groups consisting of all k-mers on a particular maximal subpath between nodes of outdegree > 1.
• Construct the de Bruijn graph on the "right" reads of each left group—All maximal subpaths in this graph between nodes of outdegree > 1 are identified as sequence contigs. To reduce the required coverage, the amount of overlap required to generate an edge in this graph is governed by a parameter that can in general be safely set substantially below k − 1.
• Select contigs of sufficient size to pass through a shotgun assembler—The sequence contigs produced by the previous process can be interpreted as
reads produced by conventional sequencing technologies and sent to a conventional sequence assembler for further processing. The primary decision to be made is identifying the contig-length threshold that governs what gets passed through for further analysis. Clearly, longer contigs are easier to assemble, but we would needlessly reduce coverage by demanding long contigs.
• Postassembly contig extension—The extension assembly algorithm of Section 8.3.1 provides a natural way to extend the large contigs resulting from assembly to the point where they may be joined together. By mapping each read pair to its associated large contig, we can select the read pairs with exactly one end accounted for in the large contig. The remaining end reads must lie close to the ends of the large contig, and will be densely enough sampled so as to permit substantial extension of the long contigs.
8.4.1 Analysis
Figure 8.2 presents the distribution of sequence contigs produced by our assembler as a function of contig length, read-length, coverage, and base-error rate. Under all conditions evaluated, a coverage of 500 sufficed to produce contigs of length greater than 200 covering over 98% of the genome. In general, the percentage of the genome covered decreased very slowly even for contigs greater than 500 in length. Except under the most strenuous error/read-length conditions we tested, coverages of 250 produced similar results.

In our experiments, we used the TIGR Assembler [17] to assemble these contigs into genomic sequence. Figure 8.3 presents the distribution of genomic sequence produced by our assembler as a function of contig length, read-length, coverage, and base-error rate. Under all conditions, over 96% of each genome was correctly assembled into fragments of length 1,000 or greater, and a substantial percentage of the genome assembled into fragments of length 20,000 or greater.

While these results are very good, they fall short of complete assembly. Still, we consider them extremely encouraging, because there are several remaining directions that clearly lead to even better assemblies:

• Although the contigs fed to the TIGR Assembler were treated as single reads, these contigs can largely be paired through proper interpretation of the paired reads. We have not yet fully exploited the potential of the length constraints between paired reads. We are investigating techniques based on constrained path lengths in de Bruijn graphs similar to those developed for EULER [12].
Figure 8.2 Percentage of the genome covered by raw (pre-TIGR) contigs of lengths 100 through 750, as a function of read-length (k = 15, 20, 25), base-error rate (e = 1%, 2%, 3%), and coverage (from 100 to 500).
Figure 8.3 Postassembly (post-TIGR) contig sizes, from 1,000 through 100,000, as a function of read-length (k = 15, 20, 25), base-error rate (e = 1%, 2%, 3%), and coverage (from 100 to 500).
• We have not attempted to use multiple window lengths between double-ended reads, as is standard in shotgun-sequencing protocols. In particular, we could easily extend our protocol by producing a second set of paired reads with larger read separations, and use these to chain together and order our assembled large contigs.
• Furthermore, the results of Figure 8.3 do not reflect the effect of postassembly contig extension. By coupling this with paired reads for gap closure, we are confident that we can fully assemble bacterial genomes (modulo a small number of gaps) with read-lengths of 20 (and quite possibly less) for error rates of up to 3%. Finally, the nonmonotonicities in Table 8.3 demonstrate that we can do a substantially better job of cleaning erroneous reads.
To put these results into perspective, the 3% base-error rate we have demonstrated we can work with is currently being achieved by polony sequencing on paired reads. The cost of the sequencing necessary to achieve a coverage level of 500 is projected to be only a few hundred dollars for a 1-megabase bacterial genome. Thus only modest, and already projected, increases in the read-length of that technology, coupled with an assembler along the lines we have developed, should lead to a substantial cost breakthrough in de novo sequencing.
References

[1] Shendure, J., et al., "Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome," Science, Vol. 309, 2005, pp. 1728–1732.
[2] Shendure, J., et al., "Advanced Sequencing Technologies: Methods and Goals," Nature Reviews: Genetics, Vol. 5, 2004, pp. 335–344.
[3] Mitra, R., et al., "Fluorescent In Situ Sequencing on Polymerase Colonies," Analyt. Biochem., 2003.
[4] Nyren, P., et al., "Solid Phase DNA Minisequencing by Enzymatic Luminometric Inorganic Pyrophosphate Detection Assay," Anal. Biochem., Vol. 208, 1993, pp. 171–175.
[5] Pevzner, P., H. Tang, and M. Waterman, "An Eulerian Path Approach to DNA Fragment Assembly," Proc. Natl. Acad. Sci., Vol. 98, 2001, pp. 9748–9753.
[6] Pevzner, P., H. Tang, and M. Waterman, "A New Approach to Fragment Assembly in DNA Sequencing," Proc. Fifth ACM Conf. on Computational Molecular Biology (RECOMB 2001), 2001, pp. 256–267.
[7] Pevzner, P. A., "l-Tuple DNA Sequencing: Computer Analysis," J. Biomolecular Structure and Dynamics, Vol. 7, 1989, pp. 63–73.
[8] Chaisson, M., et al., "Fragment Assembly with Short Reads," Bioinformatics, Vol. 20, 2004, pp. 2067–2074.
[9] Whiteford, N., et al., "An Analysis of the Feasibility of Short Read Sequencing," Nucleic Acids Research, Vol. 33, No. 19, 2005, p. e171.
[10] Ronaghi, M., et al., "Real-Time DNA Sequencing Using Detection of Pyrophosphate Release," Anal. Biochemistry, Vol. 244, 1996, pp. 367–373.
[11] Lander, E., and B. Austin, "Workshop Summary: Sequencing and Re-Sequencing the Biome!" http://www.genome.gov/10005564, July 2002.
[12] Pevzner, P., and H. Tang, "Fragment Assembly with Double-Barreled Data," Bioinformatics, 2001, pp. S225–S233.
[13] Ronaghi, M., et al., "A Sequencing Method Based on Real-Time Pyrophosphate: A Review," Science, Vol. 281, 1998, pp. 363–367.
[14] Service, R., "The Race for the $1000 Genome," Science, Vol. 311, 2006, pp. 1544–1548.
[15] Margulies, M., et al., "Genome Sequencing in Microfabricated High-Density Picolitre Reactors," Nature, Vol. 437, 2005, pp. 376–380.
[16] Mitra, R., and G. Church, "In Situ Localized Amplification and Contact Replication of Many Individual DNA Molecules," Nucleic Acids Research, Vol. 27, 1999, pp. 1–6.
[17] Sutton, G. G., et al., "TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects," Genome Science and Technology, Vol. 1, 1995, pp. 9–19.
[18] Braslavsky, I., et al., "Sequence Information Can Be Obtained from Single DNA Molecules," PNAS, Vol. 100, 2003, pp. 3960–3964.
[19] Greenleaf, W., and S. Block, "Single-Molecule, Motion-Based DNA Sequencing Using RNA Polymerase," Science, Vol. 313, 2006, p. 801.
[20] Bailey, J., et al., "Recent Segmental Duplications in the Human Genome," Science, Vol. 297, 2002, pp. 1003–1007.
Part III Beyond Conventional Genome Sequencing
9 Genome Characterization in the Post–Human Genome Project Era Haixu Tang and Sun Kim
The completion of the Human Genome Project (HGP) in 2003 [1], 2 years earlier than initially expected, received great help from the rapid advancement (with increasing throughput and decreasing cost) of DNA sequencing technology. The cost of sequencing a mammalian-sized genome has fallen from the billions of dollars spent by the HGP to less than $50 million today. It is, however, still far too expensive for DNA sequencing to become a practical technology for personalized medicine (see Chapter 5). How much should it cost? The National Institutes of Health (NIH) thought it should be cut by another four orders of magnitude (i.e., to $1,000 for sequencing a human individual's genome). To achieve this ultimate goal, NIH set a target period of 10 years, and meanwhile planned a near-term target of reducing the cost by two orders of magnitude (i.e., a $100,000 genome) in 5 years [2]. This is not even the most optimistic anticipation for the advancement of genome-sequencing technologies: the X Prize Foundation launched a challenge with an offer of a $10 million prize to any private team that "can build a device and use it to sequence 100 human genomes within 10 days or less, … and at a demonstrated cost of no more than $10,000 per genome" [3]. Several new-generation technologies have shown the potential capability (see Part I of this book for details). But a major circumstantial difference between this sequencing race and the previous one, in which DNA sequencing cost was reduced more than fifty-fold over a decade, is the completion of the HGP and the availability of a reference human genome.
Genome characterization in the post-HGP era appears both easier and more difficult than sequencing the reference human genome. On the easy side, we now have the reference genome in hand, which provides valuable information for genome characterization. On the difficult side, the need for fast genome sequencing, and for genome characterization beyond sequencing, may pose new challenges for both experimental and computational biologists. In this section of the book, we aim to provide a perspective on the new challenges that we may encounter.
9.1 Genome Resequencing and Comparative Assembly

Since the genome of one person is approximately 99.9% identical to another's, to sequence the genome of an individual is in effect to determine the ~0.1% of variation between it and the available reference human genome. This task, sometimes referred to as genome resequencing, can be accomplished by a comparative assembly procedure instead of a de novo one: the reference genome is compared with the sequencing information obtained from the target genome in order to determine the variations. For example, one can map the sequencing reads to the reference genome using rapid sequence alignment algorithms, because the reads should be nearly identical to their corresponding sequences on the reference genome. As a result, a low-coverage (e.g., 2–3×) sequencing project may generate a sufficient number of reads to characterize almost all the positional variations across the entire target genome. This method can also be used for the comparative assembly of the genome of a species closely related to human, for example, the chimpanzee genome [4], in which the de novo assembled contigs were oriented and ordered with the help of the reference human genome. The new-generation DNA sequencing techniques (e.g., pyrosequencing) are particularly useful in resequencing, because (1) they can be applied at an extremely high throughput; and (2) they generate short reads that are hard to assemble de novo.

Sequencing by hybridization (SBH) was initially proposed as an alternative approach to conventional gel-based sequencing [5]. Despite its considerably lower cost, it is hard to reconstruct a unique long DNA sequence de novo from the SBH experimental data (i.e., the spectrum of short subsequences). Nevertheless, by combining the information of a reference genome with the experimental data, a spectrum alignment algorithm can be used to efficiently resequence a long target DNA [6]. This demonstrates that sequencing technologies that are not suitable for de novo genome sequencing may be adapted for the purpose of genome resequencing. However, specific algorithms need to be designed for comparative assembly using these data (see Chapter 13).
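To illustrate the comparative-assembly idea in its simplest form, the sketch below is an illustrative toy, not a production resequencing pipeline: it maps each error-free read to its best position on the reference by brute force and reports single-base mismatches as candidate variants. Real pipelines use fast, error-tolerant alignment algorithms instead, and the function name and one-mismatch threshold are assumptions made for the example.

```python
def call_variants(reference, reads, max_mismatch=1):
    """Map each read to its best position on the reference (brute force)
    and report single-base differences as candidate variants."""
    variants = {}  # position -> (reference base, read base)
    for read in reads:
        best_pos, best_mm = None, None
        for pos in range(len(reference) - len(read) + 1):
            mm = [i for i in range(len(read))
                  if read[i] != reference[pos + i]]
            if best_mm is None or len(mm) < len(best_mm):
                best_pos, best_mm = pos, mm
        if best_mm is not None and len(best_mm) <= max_mismatch:
            for i in best_mm:
                variants[best_pos + i] = (reference[best_pos + i], read[i])
    return variants

reference = "ACGTACGGTACGTTAGC"
reads = ["ACGTACGATACG", "GTTAGC"]   # the first read carries a G->A change
print(call_variants(reference, reads))  # -> {7: ('G', 'A')}
```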
9.2 Genotyping Versus Haplotyping

Even though different persons may carry different sequence variations, the most important (e.g., disease-associated) variations are usually common ones (i.e., those carried by a large fraction of individuals within the human population). Therefore, it is possible to first characterize all frequent genetic variations by resequencing the genomes of a limited number of randomly selected individuals. Afterwards, genome resequencing becomes equivalent to genomewide genotyping, in which the allelic types are determined for each previously characterized variation site in the target genome. One remarkable effort along this direction is made by the International HapMap Consortium, which attempts to map all common single-nucleotide polymorphisms (SNPs) in the human genome [7]. The HapMap project has two phases: in phase I, over a million SNPs were mapped across the genome in each of 269 individuals from various populations; an additional 4.6 million SNPs will be added in phase II [8].

It is known as linkage disequilibrium (LD) that the allele of one polymorphic site often correlates with the alleles of the flanking sites. Therefore, a segment of SNP sites often shows a particular combination of alleles, called a haplotype. The utilization of this property can significantly reduce the effort required for genome resequencing: one needs to determine the allelic types of only a small subset of all SNP sites in order to predict the alleles of all the other sites. In the HapMap project, since the SNPs were obtained from diploid human individuals through high-throughput DNA sequencing, they do not contain any haplotype information. In order to accurately determine the long-range haplotypes in the human genome, computational methods have been developed [9] (see Chapter 10).
9.3 Large-Scale Genome Variations

The HapMap project attempts to identify common SNPs in the human genome. Apparently, not all sequence variations in the human genome are single-nucleotide mutations. Large-scale genome rearrangements, including chromosomal inversions, translocations, segmental deletions, segmental duplications, and changes in chromosome copy number (i.e., aneuploidy and polyploidy), are also observed among different individuals [10], and sometimes among different tissues and cells of the same individual, in particular malfunctioning cells such as cancer cells (see Chapter 11). These structural variations, however, are hard to detect using the conventional shotgun genome-sequencing strategy, and in the past the majority of information related to large-scale genome variation was discovered through molecular cytogenetic techniques. The availability of the reference human genome provides a new opportunity for the high-throughput identification of large-scale genome variation. There are currently two
sequence-based techniques used to detect these genomic variations: array-based hybridization approaches that identify copy-number variations of genomic-sequence segments, and double-end sequencing that can identify both copy-number variations and genome rearrangements [10]. Array-based comparative genome hybridization (array-CGH) approaches competitively hybridize two sets of differentially labeled genomic fragments from two different individuals to arrays of spotted DNA fragments, and determine the copy-number differences of the spotted DNA fragments in the two genomes. Depending on the DNA fragments used in the array (e.g., genomic clones, cDNAs, or synthetic oligonucleotides that cover the whole genome), the resulting map of copy-number variation can be at different resolutions. In the second approach, pairs of reads are obtained from the ends of long clones (e.g., fosmids) in a genomic library constructed from a target genome. These reads are subsequently mapped to the reference human genome, and a putative breakpoint in the target genome can be inferred if the distance between the mapping locations of a read pair deviates significantly from the expected clone size [11]. This paired-end sequencing approach can identify not only copy-number variations but also large-scale genome rearrangements, such as inversions, although computational analysis of the mapping of read pairs is required to differentiate the types of variations (see Chapter 11).
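As a toy illustration of the paired-end idea, the following sketch flags clones whose mapped end-to-end distance falls outside an expected size window. The window bounds, clone sizes, and input format (clone name plus the mapped positions of the two ends) are assumptions for the example, not a description of the method of [11].

```python
def discordant_clones(pairs, expected_size, tolerance):
    """Flag read pairs whose mapped span deviates from the expected
    clone size by more than the given tolerance, suggesting a putative
    breakpoint (deletion, insertion, or rearrangement) in the target."""
    lo, hi = expected_size - tolerance, expected_size + tolerance
    flagged = []
    for name, left_pos, right_pos in pairs:
        span = abs(right_pos - left_pos)
        if not (lo <= span <= hi):
            flagged.append((name, span))
    return flagged

# Fosmid-sized clones of ~40 kb, allowing +/- 5 kb of library variation
pairs = [("cloneA", 1_000, 41_200),    # span ~40 kb: concordant
         ("cloneB", 90_000, 155_000)]  # span 65 kb: putative deletion
print(discordant_clones(pairs, expected_size=40_000, tolerance=5_000))
```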
9.4 Epigenomics: Genetic Variations Beyond Genome Sequences

There exist certain genetic variations that cannot be represented in the genomic sequence. Epigenetics studies the genetic variations carried by modifications of chromosomal structure other than the primary sequence. Two commonly studied chromosomal modifications are DNA methylation, the covalent addition of a methyl group to the base cytosine (C), and the posttranslational modification (PTM) of the histone proteins that help to stabilize the compact chromatin structure, such as methylation, acetylation, phosphorylation, and sumoylation [12]. Epigenetic variations may result in changes to important biological functions (e.g., the differential regulation of gene expression) or the silencing of active transposable elements. The completion of the reference human genome pushes forward the development of high-throughput technologies for the genome-wide study of epigenetic variations and their biological consequences, known as epigenomics [13]. There are three experimental approaches to detecting DNA methylation: (1) DNA digestion by a methylation-sensitive DNA restriction enzyme; (2) the chemical modification of DNA by sodium bisulfite or metabisulfite; and (3) immunoprecipitation of methylated cytosines. Various applications of the microarray platform have been coupled with each of these approaches, introducing several high-throughput techniques for both qualitative
and quantitative genome-scale scanning of DNA methylation. The epigenetic modifications of histones are mainly studied by chromatin immunoprecipitation (ChIP). An antibody specific to a histone modification of interest can be developed and used to immunoprecipitate the chromatin-histone complex. The DNA fragments that interact with the modified histones can then be detected by custom-designed microarrays. Based on these new technologies, multiple human epigenome consortia have been formed across the world, which may lead to new findings on epigenetic variations and their relationship to human diseases.
9.5 Conclusion

The completion of the reference human genome is just a starting point toward the ultimate goal of personalized medicine. In addition to developing new technologies for low-cost DNA sequencing, genome scientists are also expanding their research into new areas of genome characterization, and even beyond genome sequencing. These developments, however, cannot proceed without the help of new high-throughput techniques, most of which are established upon the complete reference human genome sequence.
References

[1] The International Human Genome Sequencing Consortium, "Finishing the Euchromatic Sequence of the Human Genome," Nature, Vol. 431, No. 7011, 2004, pp. 931–945.
[2] Anderson, M., "NIH Offers $1000 Genome Grant," The Scientist, Vol. 5, 2004.
[3] The X Prize Foundation, "$10 Million Archon X Prize for Genomics," http://genomics.xprize.org/, 2006.
[4] Chimpanzee Sequencing and Analysis Consortium, "Initial Sequence of the Chimpanzee Genome and Comparison with the Human Genome," Nature, Vol. 437, No. 7055, 2005, pp. 69–87.
[5] Southern, E. M., "DNA Chips: Analyzing Sequence by Hybridization to Oligonucleotides on a Large Scale," Trends in Genetics, Vol. 12, 1996, pp. 110–115.
[6] Pe'er, I., N. Arbili, and R. Shamir, "A Computational Method for Resequencing Long DNA Targets by Universal Oligonucleotide Arrays," Proc. Natl. Acad. Sci. USA, Vol. 99, No. 24, 2002, pp. 15492–15496.
[7] The International HapMap Consortium, "The International HapMap Project," Nature, Vol. 426, 2003, pp. 789–796.
[8] The International HapMap Consortium, "A Haplotype Map of the Human Genome," Nature, Vol. 437, 2005, pp. 1299–1320.
[9] Stephens, M., and P. Donnelly, "A Comparison of Bayesian Methods for Haplotype Reconstruction from Population Genotype Data," Am. J. Hum. Genet., Vol. 73, 2003, pp. 1162–1169.
[10] Feuk, L., A. R. Carson, and S. W. Scherer, "Structural Variation in the Human Genome," Nat. Rev. Genet., Vol. 7, No. 2, 2006, pp. 85–97.
[11] Tuzun, E., et al., "Fine-Scale Structural Variation of the Human Genome," Nat. Genet., Vol. 37, 2005, pp. 727–732.
[12] Strahl, B. D., and C. D. Allis, "The Language of Covalent Histone Modifications," Nature, Vol. 403, 2000, pp. 41–45.
[13] Callinan, P. A., and A. P. Feinberg, "The Emerging Science of Epigenomics," Hum. Mol. Genet., Vol. 15, 2006, pp. R95–R101.
10 The Haplotyping Problem: An Overview of Computational Models and Solutions Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi, and Jing Li
The investigation of genetic differences among humans has given evidence that mutations in DNA sequences are responsible for some genetic diseases. The most common mutation is the one that involves only a single nucleotide of the DNA sequence, called a single-nucleotide polymorphism (SNP). As a consequence, computing a complete map of all SNPs occurring in human populations is one of the primary goals of recent studies in human genomics. The construction of such a map requires the determination of the DNA sequences that form all chromosomes. In diploid organisms like humans, each chromosome consists of two sequences called haplotypes. Distinguishing the information contained in both haplotypes when analyzing chromosome sequences poses several new computational issues that collectively form a new emerging topic of computational biology known as haplotyping. This chapter is a comprehensive study of some new combinatorial approaches proposed in this research area, and it mainly focuses on the formulations and algorithmic solutions of some basic biological problems. Three statistical approaches are briefly discussed at the end of the chapter.
10.1 Introduction

The completion of the Human Genome Project [1, 2] has resulted in a draft map of the DNA sequence (which may be thought of as a string over the
alphabet {A, C, G, T}) present in each human being. At this point, one of the main topics of research in genomics is determining the relevance of all mutations as causes of some genetic diseases. Mutation in DNA is the principal factor responsible for the phenotypic differences among human beings, and SNPs are the most common mutations; hence it is fundamental to complete a map of all SNPs in the human population. For this purpose, an SNP is defined as a position in a chromosome where each one of two (or more) specific nucleotides is observed in at least 10% of the population [3]. The nucleotides involved in an SNP are called alleles. It has been observed that for almost all SNPs only two different alleles are present; in such a case the SNP is said to be biallelic, otherwise it is said to be multiallelic. In this survey we will consider exclusively biallelic SNPs.

In diploid organisms, such as humans, each chromosome is made of two distinct copies, and each copy is called a haplotype. It is known that exactly one haplotype is inherited from the father and the other from the mother. More precisely, in the absence of recombination events, each haplotype in a child is identical to one of the two haplotypes of each parent. Whenever recombinations occur, a haplotype of the child may consist of portions of both haplotypes of a parent. Studies [4, 5] show a block structure in human chromosomes, implying that it is possible to partition a chromosome into blocks where no (or only a few) recombinations have occurred within each block. This observation justifies the fact that different formulations, with or without recombinations, of the biological problem of completing SNP haplotype maps make sense. Furthermore, results in [4, 5] also show that the SNPs within each block induce only a few distinct common haplotypes in the majority of the population, even though the theoretical number of different haplotypes for a block containing n SNPs is exponential in n. The above facts make it interesting to build an SNP haplotype map that consists of the information on haplotype block structure, common haplotypes, and their frequencies, which, hopefully, could be used to correlate the common haplotypes with common diseases in gene mapping.

Computing a haplotype map requires the determination of the possible SNP combinations that are common in a population; hence it is necessary to analyze data derived from a large-scale SNP screening of single haplotypes in a population. Unfortunately, even the most recent technologies are too expensive for large-scale analysis or cannot provide good haplotype data from diploid organisms in a large population. Indeed, experimental data only provide the genotype of each individual at each SNP site on the chromosome, which is the combined information of the two alleles. For example, we may know that an SNP occurs at a certain site, and that the two alleles occurring are A and T, but we are not able to determine to which haplotype the A belongs. This chapter focuses on presenting, in a computational framework, some combinatorial problems arising from the following two basic biological issues:
1. To infer haplotypes from genotypes;
2. To infer haplotypes from DNA sequence fragments.

The first problem basically consists of examining the genotypes of an entire population in order to derive the correct haplotypes. Different computational problems may be defined depending on whether recombinations are allowed or forbidden, and on whether some parental relations among the individuals are known. The second biological problem arises in DNA sequencing, where some fragments of two haplotypes are known, and it is desired to compute both haplotypes in their complete form (i.e., the whole sequences).

This chapter is organized as follows. In Section 10.2, some preliminary definitions and notations used in haplotype inference are given. In Section 10.3 we introduce the problem of inferring haplotypes in a population, first by discussing such problems in a general framework in Sections 10.3.1 and 10.3.2, then exploiting a natural biological property in Sections 10.3.3 and 10.3.4, while in Section 10.3.5 we introduce a basic issue in the inference problem. The problem of haplotype inference given a pedigree is treated in Section 10.4. In Section 10.5 some of the recent results regarding the reconstruction of haplotypes from DNA fragments are presented. Finally, Section 10.6 aims to give the reader some references regarding the use of statistical methods for the haplotyping problem. We conclude this chapter with some discussions.
10.2 Preliminary Definitions

We have already pointed out in the introduction that we will restrict ourselves to biallelic SNPs. Without loss of generality we can assume that the values of the two alleles of each SNP are always 0 or 1. Since the SNPs are located sequentially on a chromosome, a haplotype of length m is a vector over {0, 1}^m, where each position i is also called a site or a locus. A genotype vector, or simply genotype, represents two haplotypes as a sequence of unordered pairs over the set {0, 1}. Each pair represents the nucleotides at a given site, and since the pairs are unordered we are not able to determine the two haplotypes from the genotype alone. For example, the two haplotypes of length 3 <0, 1, 1> and <1, 0, 1> are combined into the genotype <(0, 1), (0, 1), (1, 1)>. Whenever a pair is made of two identical values, the SNP site is homozygous; otherwise it is heterozygous. Clearly, by the assumption on the values of the alleles, the pair for a homozygous site is (0, 0) or (1, 1), while the pair for a heterozygous site is (0, 1). Hence a compact representation of the genotype consists of a vector over the alphabet {1, 0, ?}, where the first two symbols are used if the site is homozygous, and a ? encodes a heterozygous site. For example,
the compact representation of the genotype <(0, 1), (1, 0), (1, 1)> is <?, ?, 1>. Given a genotype g = <g1, ..., gm>, a resolution of g is a pair of haplotypes <h, k>, where h = <h1, ..., hm> and k = <k1, ..., km>, such that hi = ki = gi if gi ≠ ?, and hi, ki ∈ {0, 1} with hi ≠ ki if gi = ?. When the above conditions hold we also say that <h, k> resolves g. Given a genotype g and a haplotype h, h is said to be compatible with g if and only if there exists a haplotype h′ such that <h, h′> is a resolution of g; in such a case the haplotype h′ is called the realization of g by h, and is denoted by R(g, h). Given a genotype g (a haplotype h, respectively), let us denote by g[i] (h[i]) the element of g at site i. Notice that, given a genotype g and a compatible haplotype h, there exists exactly one haplotype R(g, h) such that <h, R(g, h)> resolves g. Computing R(g, h) is straightforward; in fact, for each position i where g[i] ≠ ?, R(g, h)[i] = h[i]; otherwise R(g, h)[i] = 1 − h[i]. The general problem of inferring haplotypes from genotypes can be stated as follows.

Problem 1: Haplotype Inference problem (HI)
Input: a set G = {g1, ..., gn} of genotypes.
Output: for each genotype g ∈ G, a pair of haplotypes resolving g.

The HI problem stated before is actually a metaproblem; in the next sections we will analyze some of the formulations of the general problem that have been proposed in the literature.
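Since computing R(g, h) underlies everything that follows, here is a minimal sketch of the realization computation; encoding genotypes and haplotypes as strings over {'0', '1', '?'} is an assumption made for compactness.

```python
def realize(g, h):
    """Return R(g, h), the unique haplotype such that (h, R(g, h))
    resolves genotype g, or None if h is not compatible with g."""
    result = []
    for gi, hi in zip(g, h):
        if gi == '?':                 # heterozygous site: take the complement
            result.append('1' if hi == '0' else '0')
        elif gi == hi:                # homozygous site: both haplotypes match g
            result.append(hi)
        else:                         # mismatch at a homozygous site
            return None
    return "".join(result)

# Values reused in Example 10.1 later in this chapter
print(realize("0?0?01", "010101"))   # -> "000001"
```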
10.3 Inferring Haplotypes in a Population

Recombination events make a number of problems harder, but experimental data show that human chromosomes can be partitioned into large regions (usually called blocks), where no or few recombinations could occur within each block. By the definition of a block as a portion of the chromosome where there are few SNPs that account for the differences among individuals, when we restrict ourselves to the analysis of a specific block in the population, only a relatively small number of distinct haplotypes can be found. The above observations justify some simplifying assumptions that are present in a number of models for inferring haplotypes from genotypes:

• Haplotypes consist only of certain portions of the chromosomes or SNP sites [6].
• Only a block is considered; consequently, no recombinations are allowed [7, 8].
The approaches described in Sections 10.3.1 and 10.3.3 make use of these two assumptions. Clearly, exploiting those assumptions depends on a feasible solution to the computational problem of partitioning chromosomes into blocks. The approach proposed in [8, 9] faces the problem by computing a partition of the chromosome minimizing the total number of regions or blocks, where each block induces at most a fixed number of distinct haplotypes in the population. This computational problem and other related issues are discussed in [9, 10], where efficient algorithms have been proposed.

When we consider the haplotype inference problem, the alleles at each site i of an individual consist of exactly one allele from each of his parents at the corresponding sites: this behavior is known as the Mendelian law (see Figure 10.1). When no recombination occurs, each of the two haplotype copies is equal to one of the haplotype copies of its parents, that is, all alleles of a haplotype are derived from the same haplotype copy of a parent. If, on the contrary, recombinations occur, then a haplotype can consist of alleles coming from two haplotypes of the same parent. By the Mendelian law, the consequence is that the given haplotype derives from two grandparents of the individual. Thus, the parental and grandparental sources of an allele are the basic information used to determine and measure recombinations.

Figure 10.1 shows a recombination event. A recombination occurs between the second locus and the third locus in the left (paternal) haplotype (the haplotype 1A, 2A, 3B, 4B) of the child, since under the Mendelian law the alleles at the first two loci are inherited from the left (paternal) haplotype of the father, while the alleles at the last two loci are inherited from the right (maternal)
Figure 10.1 An example of recombination. The father's haplotypes are (1A, 2A, 3A, 4A) and (1B, 2B, 3B, 4B); the mother's haplotypes are (1C, 2C, 3C, 4C) and (1D, 2D, 3D, 4D); the child's haplotypes are (1A, 2A, 3B, 4B) and (1D, 2D, 3D, 4D).
haplotype of the father. Note that no recombination occurs in the right (maternal) haplotype (the haplotype 1D, 2D, 3D, 4D) of the child, since it is equal to the right (maternal) haplotype of the mother. The number of recombination events in an individual is the total number of such switches of grandparental source occurring in its haplotypes. In the example of Figure 10.1, the number of recombinations is one.

10.3.1 The Inference Problem: A General Rule
Various methods to infer haplotypes from genotype data have been proposed. Among them, the inference method proposed in [11] and later largely discussed in [12] deserves our interest, since it is the first approach that points out some basic computational issues related to the haplotype inference problem under a general inference rule. The input data of the inference method consist of n individuals and an m-site genotype for each individual. The expected output for each individual genotype is a haplotype pair, among 2^(k−1) different possibilities (where k is the number of heterozygous sites), that resolves the original genotype of the individual.

In [11] a parsimonious principle has been used to formalize the HI problem. This principle has been suggested by the empirical observation that a valid solution is usually the one that resolves the largest number of genotypes. Moreover, in the same paper a number of experiments suggest the conjecture that no two distinct solutions resolve all input genotypes correctly. Note that if a genotype g contains only one heterozygous site, then we can infer without ambiguity the two haplotypes that resolve g (the two haplotypes are obtained by computing the only two possible resolutions of the single heterozygous site). Thus a genotype is considered ambiguous when it contains at least two unresolved sites, that is, sites of value ?.

Definition 10.1 (Inference rule): Let G = {g1, ..., gm} be a set of genotypes and let H be a nonempty set of haplotypes. The application of the inference rule to a haplotype h ∈ H compatible with a genotype g ∈ G consists of adding R(g, h) to H and removing g from G.

The computational problem is therefore, given (G, H), finding a sequence of applications of the inference rule that leads to a set H′ of haplotypes that resolves all genotypes in G (that is, all genotypes are removed from G), whenever this is possible. In this case we say that H′, with H′ ⊇ H, resolves G. Example 10.1 shows an instance of the problem.

Example 10.1 (Inference rule): Let G = {g1 = 0?0?01, g2 = 00?00?} and H = {h1 = 010101, h2 = 000101}. The optimal sequence of applications consists of applying the inference rule first to (g1, h1), and then to g2; this application allows us to resolve all genotypes in G. In fact, R(g1, h1) = h3 = 000001; after having applied the rule, G = {g2} and H = {h1, h2, h3}. Now it is possible to apply the rule to (g2, h3); since R(g2, h3) = 001000, the set G becomes empty. If we had decided to apply the inference rule first to (g1, h2), the obtained haplotype would not have allowed us to resolve g2. In fact, R(g1, h2) = 010001, but none of the vectors now in H can be used to resolve g2 (a sketch of this greedy process appears at the end of this subsection).

In [12] a formal framework to analyze and investigate the computational complexity of this problem has been proposed, by stating an optimization problem whose corresponding decision version is NP-hard. The optimization problem follows:

Problem 2: Maximum Resolution problem (MR)
Input: a set G of genotypes and a set H of haplotypes.
Output: a maximum-cardinality subset G′ of G of genotypes that are removed from G by a sequence of applications of the inference rule starting from G and H.

The computational complexity of some restricted versions of the MR problem is investigated in [12]. Given a set of genotypes G and a set H of haplotypes, H has the unique expression property with respect to G, in short the UE property, if for every g ∈ G there exists at most one pair h1, h2 in H such that R(g, h1) = h2. The consequent problem follows:

Problem 3: Unique Expression Maximum Resolution problem (UEMR)
Input: a set G of genotypes and a set H of haplotypes with the UE property.
Output: a maximum-cardinality subset G′ of G of genotypes that are removed from G by a sequence of applications of the inference rule starting from G and H that leaves a set H′ of haplotypes having the UE property with respect to G′.

The proof in [12] that the MR problem is NP-hard makes use of a set H of haplotypes with the UE property, thus proving that the restricted UEMR problem is also NP-hard. In [12] a heuristic for the MR problem is proposed. In particular, the MR problem is reduced, through a worst-case exponential-time reduction, to a new graph problem which consists of finding certain induced subtrees in a graph [12]. A heuristic for this last problem using integer linear programming is presented. More recently, the complexity of the MR problem has been further investigated, namely the computational complexity of resolving a single genotype from a set containing both genotypes and haplotypes by iteratively using the inference rule. The problem below is the formalization of this question; it has been shown [13] that this problem is also NP-hard.
to resolve all genotypes in G. In fact, R(g1, h1) h3 000001, after having applied the rule G {g2} and H = {h1, h2, h3}. Now it is possible to apply the rule to (g2, h3), since R(g2, h3) the set G becomes empty. If we had decided to apply the inference rule first to (g1, h2), the obtained haplotype would not allow to resolve g2. In fact, R(g1, h2), but none of the vectors now in H can be used to resolve g2. In [12] a formal framework to analyze and investigate the computational complexity of such problem has been proposed, by stating an optimization problem, whose corresponding decision version is NP-hard. The optimization problem follows: Problem 2: Maximum Resolution problem (MR) Input: a set G of genotypes and a set H of haplotypes. Output: a maximum cardinality subset G ′ of G of genotypes that are removed from G by a sequence of applications of the inference rule starting from G and H. The computational complexity of some restricted versions of the MR problem is investigated in [12]. Given a set of genotypes G and a set H of haplotypes, H has the unique expression property with respect to G, in short UE property, if for every g ∈G, there exists at most a pair h1, h2 in H such that R(g, h1) = h2. The consequent problem follows: Problem 3: Unique Expression Maximum Resolution problem (UEMR) Input: a set G of genotypes and a set H of haplotypes with the UE property. Output: a maximum-cardinality subset G ′ of G of genotypes that are removed from G by a sequence of applications of the inference rule starting from G and H that leaves a set H ′ of haplotypes having the UE property with respect to G ′. The proof in [12] that the MR problem is NP-hard makes use of a set H of haplotypes with the UE property, thus proving that also the restricted UEMR problem is NP-hard. In [12] a heuristic for the MR problem is proposed. In particular, the MR problem is reduced through a worst-case exponential time reduction to a new graph problem, which consists of finding some induced subtrees in a graph [12]. A heuristic for this last problem by using integer linear programming is presented. More recently, the complexity of the MR problem has been further investigated. Mainly, the computational complexity of solving a single genotype from a set containing both genotypes and haplotypes, by iteratively using the inference rule has been investigated. The problem is the formalization of the above question. It has been shown [13] that also the problem is NP-hard.
158
Genome Sequencing Technology and Algorithms
Problem 4: Single Genotype Resolution problem (SGR)
Input: a nonempty set H of haplotypes and a distinguished genotype g ∈ G in a set G of genotypes.
Output: a sequence of applications of the inference rule that resolves a subset of G including g.
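Before moving on, here is a minimal sketch (an illustrative toy, not the heuristic of [12]) of a greedy application of the inference rule of Definition 10.1. Genotypes and haplotypes are encoded as strings over {'0', '1', '?'}, an assumption made for compactness, and the scanning order is fixed arbitrarily.

```python
def realize(g, h):
    """R(g, h) as defined in Section 10.2, or None if h is incompatible with g."""
    out = []
    for gi, hi in zip(g, h):
        if gi == '?':
            out.append('1' if hi == '0' else '0')
        elif gi == hi:
            out.append(hi)
        else:
            return None
    return "".join(out)

def apply_inference_rule(G, H):
    """Greedily apply the inference rule: whenever some h in H is
    compatible with some g in G, add R(g, h) to H and remove g from G.
    Returns the unresolved genotypes and the final haplotype set."""
    G, H = list(G), set(H)
    progress = True
    while progress:
        progress = False
        for g in list(G):
            for h in sorted(H):            # fixed (arbitrary) order
                r = realize(g, h)
                if r is not None:
                    H.add(r)
                    G.remove(g)
                    progress = True
                    break
            if progress:
                break
    return G, H

# With this arbitrary order, h2 = 000101 is tried on g1 first, so g2 stays
# unresolved -- illustrating Example 10.1's point that application order matters.
print(apply_inference_rule(["0?0?01", "00?00?"], ["010101", "000101"]))
```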
10.3.2 The Pure Parsimony Haplotyping Problem

The Pure Parsimony Haplotyping problem approaches the inference problem similarly to the Clark's-rule approach presented in the previous section. Here we are given a set of genotypes G and we want to compute a set H of haplotypes containing a resolution for each genotype g ∈ G. The problem is based on the parsimony criterion and thus assumes that the correct solution consists of a minimum-cardinality set H of haplotypes that resolves G.

Problem 5: Pure Parsimony Haplotyping problem (PurePH)
Input: a set G of genotypes.
Output: a set H of haplotypes that resolves G and has minimum cardinality.

The PurePH problem has been introduced in [14], where a solution based on an integer-programming formulation has been proposed. The problem has been proved to be APX-hard in [15], even for the restriction where each genotype contains at most three ambiguous positions. Conversely, when each genotype has at most two ambiguous positions, the problem admits a polynomial-time algorithm [16, 17]. The problem can be approximated within factor 2^(k−1) [15] when each genotype contains at most k heterozygous positions, and it admits a log n approximation factor [18] when the set H of haplotypes that resolves G is taken from a given set H′ of haplotypes of polynomial size. Different restrictions of the PurePH problem have been studied in [19].
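For very small instances, pure parsimony can be illustrated by exhaustive search: the sketch below enumerates candidate haplotype sets of increasing size until one resolves every genotype. This brute force is exponential and purely illustrative; it is not the integer-programming formulation of [14], and the string encoding is again an assumption.

```python
from itertools import combinations, product

def resolutions(g):
    """All haplotype pairs (h, k) resolving genotype g (string over 0/1/?)."""
    het = [i for i, c in enumerate(g) if c == '?']
    pairs = []
    for bits in product('01', repeat=len(het)):
        h = list(g)
        for i, b in zip(het, bits):
            h[i] = b
        k = [h[i] if g[i] != '?' else ('1' if h[i] == '0' else '0')
             for i in range(len(g))]
        pairs.append(("".join(h), "".join(k)))
    return pairs

def pure_parsimony(G):
    """Minimum-cardinality haplotype set resolving all of G
    (exhaustive search; feasible only for tiny instances)."""
    candidates = sorted({hap for g in G
                             for pair in resolutions(g) for hap in pair})
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            if all(any(h in chosen and k in chosen
                       for h, k in resolutions(g)) for g in G):
                return chosen
    return set()

# Two genotypes sharing the haplotype 0000 in an optimal solution of size 3
print(pure_parsimony(["0?0?", "00??"]))  # -> {'0000', '0011', '0101'} (order may vary)
```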
10.3.3 The Inference Problem by the Coalescent Model

One of the main drawbacks of the approach presented previously is that no biological assumption is made, and sometimes biological assumptions allow researchers to restrict the problem so that efficient and more realistic solutions can be obtained. Consequently, some specific biological models have been introduced recently in the framework of haplotyping. An interesting model has been proposed in [7]: the coalescent model, which assumes that the evolutionary history is represented by a rooted tree, where each given sequence labels one of the leaves of the tree. The infinite-site model is also assumed, that is, at most one mutation can occur at a given site in the whole tree. This last assumption, which forbids recurrent mutations, is suitable to represent the evolutionary history in
the absence of recombinations and when the basic evolutionary event is the change of the value of an SNP site from 0 to 1. Consequently, mutations are directed, that is, descendants of individuals in which a mutation has occurred still carry the given mutation [20]. The following definition introduces the main combinatorial tool for describing some computational problems related to the coalescent model.

Definition 10.2: Let B be an n × m {0, 1}-matrix, where each row of B is a binary haplotype and each column j is the vector of values at SNP site j over the n haplotypes. A haplotype perfect phylogeny for B, in short hpp, is a rooted tree T with n leaves such that the following properties hold:

1. Each leaf of the tree is labeled by a distinct haplotype from B, that is, a distinct row of B.
2. Each internal edge of T is labeled by at least one SNP site j changing from 0 to 1, while each site labels at most one edge.
3. For each haplotype leaf h, the unique path from the root of T to h specifies exactly all SNP sites that are 1 in h.

Without loss of generality, the root of the phylogeny is assumed to be labeled by (0, 0, ..., 0). Consider now a matrix A where each row is a genotype, that is, A is a {0, 1, ?}-matrix. Analogously to the case of genotype vectors, it is possible to give the definition of a realization of matrix A by a matrix B, that is, a matrix B such that each row of A is resolved by a pair of rows of B.

Definition 10.3: A {0, 1}-matrix B is a realization of a {0, 1, ?}-matrix A if each row Aj of A is resolved by a pair of rows of B.

The third point of Definition 10.2 implies that each path in the perfect phylogeny T from the root to a haplotype leaf h is a compact representation of the row of matrix B corresponding to h, since it represents all the sites of that row with value 1. Moreover, let v be an internal vertex of T, let u be the parent of v, and let Hv be the set of all haplotype leaves of the subtree of T rooted at v; then Hv consists of exactly those haplotypes that have value 1 at the SNP sites labeling (u, v). Hence Hv provides a compact representation of column j of matrix B.

In [7, 8] the haplotype inference problem is then stated using the notion of a haplotype perfect phylogeny. The basic idea of the common approach in [7, 8] is that n genotypes must be resolved by haplotypes that can be related by a haplotype perfect phylogeny as in Definition 10.2. Formally, the approach described earlier leads to the following problem as stated in [8].

Problem 6: Perfect Phylogeny Haplotyping problem (PPH)
Input: an n × m matrix A over alphabet {0, 1, ?}.
Output: a matrix B which is a realization of matrix A, together with a haplotype perfect phylogeny for B, or decide that such a matrix does not exist.

In [7] the PPH problem is stated by requiring that a realization B of matrix A be obtained by doubling each row rj of matrix A, in such a way that rows r2j and r2j+1 of B resolve row rj of A. We will call such a realization a full realization of matrix A. The definition given above is more general and allows us to define an optimization version of the problem, obtained by applying a parsimony criterion to the inference of the haplotypes: the MPPH problem stated below. Indeed, it seems reasonable to require that the inference process from genotypes should produce a minimum number of distinct haplotypes, as pointed out by the empirical results in [11].

Problem 7: Minimum Perfect Phylogeny Haplotyping problem (MPPH)
Input: an n × m matrix A over alphabet {0, 1, ?}.
Output: a matrix B which is a realization of matrix A with the smallest number of distinct rows, together with a haplotype perfect phylogeny for B, or decide that such a matrix does not exist.

In [21] it is shown that the MPPH problem is NP-hard, via a reduction from minimum vertex cover. Given an instance of the PPH problem, a first algorithmic issue concerns the existence of a solution for that instance. Indeed, a haplotype perfect phylogeny induces two relations between pairs of SNP sites labeling edges in the tree, one between siblings and one between an ancestor and a descendant. These relations do not always allow a solution to be found for every instance of the PPH problem. Two sites i, j of an individual haplotype h are related by the ancestor-descendant relation whenever the change from 0 to 1 holds at both sites i and j of h: indeed, in the hpp, the path from the root to the leaf labeled by h contains two edges labeled i and j. Moreover, i and j are 1-0 siblings (0-1 siblings) in an individual haplotype h whenever the change from 0 to 1 occurs at position i of h and not at j (or vice versa, respectively). It is easy to verify that in a hpp two sites i and j that are related in one haplotype by the ancestor-descendant relation cannot be 0-1 siblings in one haplotype and 1-0 siblings in another haplotype (see Figure 10.3). Formally, this situation is described by the existence of three rows h1, h2, h3 and two columns i, j of a matrix B such that B[h1, i] = B[h1, j] = 1, while B[h2, i] = 1 and B[h2, j] = 0, and B[h3, i] = 0 and B[h3, j] = 1. The submatrix of B induced by i, j and h1, h2, h3 is used [20] to characterize matrices that cannot be represented by a hpp: we call such a submatrix the forbidden matrix; see Figure 10.2.

Lemma 1: Let A be an n × m matrix over alphabet {0, 1}. Then A admits an hpp if and only if no submatrix of A induced by three rows and a pair of columns is the forbidden matrix of Figure 10.2.
      i   j
h1    1   1
h2    1   0
h3    0   1

Figure 10.2 Example of a forbidden matrix M.
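Lemma 1 immediately suggests a simple O(nm²) test for the existence of an hpp: scan every pair of columns for the three forbidden row patterns. A minimal sketch follows, assuming the rooted phylogeny with the all-zero ancestor described above; the function name is illustrative.

```python
# Forbidden-matrix test of Lemma 1: matrix B (rows = haplotypes, entries 0/1)
# admits an hpp iff no two columns contain all of (1,1), (1,0), (0,1).
def admits_hpp(B):
    n, m = len(B), len(B[0])
    for i in range(m):
        for j in range(i + 1, m):
            patterns = {(B[r][i], B[r][j]) for r in range(n)}
            if {(1, 1), (1, 0), (0, 1)} <= patterns:
                return False  # columns i, j induce the forbidden matrix
    return True

print(admits_hpp([[1, 1], [1, 0], [0, 1]]))  # False: Figure 10.2 itself
print(admits_hpp([[0, 0], [1, 0], [1, 1]]))  # True: a chain root -> i -> j
```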
Figure 10.3 The forbidden matrix M cannot be represented by a perfect phylogeny. Tree (a) represents the relation between sites i and j induced by haplotypes h1 and h2 of matrix M; tree (b) represents the relation between sites i and j induced by haplotypes h1 and h3 of matrix M. Note that in tree (a) i must be an ancestor of j, while in tree (b) j must be an ancestor of i.
An analogous version of Lemma 1 holds for matrices over alphabet {0, 1, ?}. In the first paper on the PPH problem, a polynomial solution to the problem of computing a full realization of a matrix A, based on a reduction to the graph realization problem, was proposed [7], while direct algorithms of O(nm²) time complexity are proposed in [8, 22]. More recently, a linear time algorithm for the PPH problem has been presented in [23, 24]. An experimental study of the biological validity and relevance of the coalescent model in haplotype inference is discussed at length in [9].

10.3.4 Xor-Genotyping
Some sequencing methods can determine whether an individual is homozygous or heterozygous at each SNP, but cannot distinguish between the two homozygous states. The list of heterozygous sites of an individual is referred to in [25] as its xor-genotype. Xor-genotypes are less informative than full genotypes, but they can be generated by less costly experimental methods; therefore, the inference of haplotypes from xor-genotypes is of definite interest. Methods
extending the perfect phylogeny model to infer haplotypes from xor-genotypes have been proposed in [25]. Formally, the xor-genotype of a pair of haplotypes (h1, h2) consists of the set of heterozygous sites of h1 and h2. Then, a set H of haplotypes resolves a set G of xor-genotypes if each member of G is the xor-genotype of a pair of haplotypes in H.

Problem 8: Xor Perfect Phylogeny Haplotyping (XPPH)
Input: a set G of xor-genotypes.
Output: a haplotype matrix H which resolves the set G, together with a haplotype perfect phylogeny for H, or decide that such a matrix does not exist.

The XPPH problem has been proved to be polynomial time solvable in [25].

10.3.5 Incomplete Data
Current DNA sequencing technology often produces haplotypes with missing bases at some positions. Hence, one of the main issues in haplotyping is the inference of complete haplotypes from input data containing missing values. Observe that the input data consist of either haplotypes or genotypes with missing values. We call the first problem partial haplotype completion, while we refer to the second one as haplotyping with missing data. Both problems have been attacked with combinatorial methods based on inference rules [26] that deal with the case of missing data. These methods usually use inference rules based either on genetic models for human haplotypes or on a statistical analysis of haplotype frequencies. The first type of method uses the well-known coalescent model, while successful statistical methods are based on the minimum entropy model. The minimum entropy model, proposed in [27], assumes that the frequency distribution of the haplotypes within the population has a small entropy. Haplotyping with missing data in the presence of the coalescent model has been shown to be NP-hard for both rooted and unrooted phylogenies [28], and even in the restricted case of path phylogenies [29]. In [26], it has been shown that the problem is tractable on input data satisfying the “rich-data hypothesis.” Moreover, the haplotyping with missing data problem under the coalescent model is fixed-parameter tractable [30] when restricted to path phylogenies and when the chosen parameter is the maximum number of missing entries at a particular SNP site. The solution of partial haplotype completion under the minimum entropy model is obtained by finding the completion of the partial haplotypes that minimizes the entropy value of the input haplotype matrix; finding such a completion is APX-hard [27].
On the other hand, the problem of finding a completion of partial haplotypes such that the haplotypes in the complete matrix fit the coalescent model is tractable when at least one complete haplotype is known [31]; more precisely, it admits a near-optimal algorithm. In this case it is possible to change the 0,1-representation of the bases at each SNP site in such a way that the known complete haplotype is represented by the all-zero vector. Indeed, in terms of the evolutionary tree, this corresponds to fixing the known all-zero haplotype as the root of the perfect phylogeny, which yields the directed haplotype perfect phylogeny model of Definition 10.2. A method combining the coalescent and the minimum entropy models for solving the partial haplotype completion has been proposed in [32], where a heuristic is given for the version of the problem requiring one complete haplotype to be known. A different approach to partial haplotype completion is based on pure parsimony. In [33] it has been shown that the PurePH problem whose input is a set of incomplete haplotypes is NP-hard, and a probabilistic algorithm for the problem is proposed. In [34] it has been proven that the problem is APX-hard, even in the restricted case where each haplotype in the input contains at most two missing values. Observe that this result has been obtained for the equivalent problem of clustering fingerprint vectors with missing values.
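A small sketch of the re-encoding trick described above: flipping the 0/1 representation at every site where the known complete haplotype reads 1 maps that haplotype to the all-zero vector, which can then serve as the root of a directed perfect phylogeny. The function name is illustrative.

```python
# Re-encode 0/1 haplotypes so that a known complete haplotype becomes the
# all-zero root. Illustrative sketch, not taken from [31].
def rebase(haplotypes, root):
    return [[h[i] ^ root[i] for i in range(len(root))] for h in haplotypes]

hs = [[1, 0, 1], [1, 1, 0]]
print(rebase(hs, [1, 0, 1]))  # [[0, 0, 0], [0, 1, 1]]
```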
10.4 Inferring Haplotypes in Pedigrees

In this section we investigate the HI problem in a pedigree. The difference between population data and pedigree data lies in the relations among the individuals: individuals taken from a population are regarded as independent, while the individuals of a pedigree are related by a parenthood relation. The dependency relationships in pedigree data have implications in two different aspects: (1) a structure (called a pedigree graph) is imposed over pedigree data, whereas for population data building such a structure (corresponding to recovering the evolutionary history of the involved haplotypes) might be one of the goals; and (2) Mendelian law is assumed, that is, each child receives one allele from the father and one from the mother at each site, and thus no mutations occur in the pedigree. Consequently, Mendelian law can be used to partially resolve some genotypes. Although the parental information places some constraints on the reconstruction of haplotypes, there are still too many solutions that are consistent with the genotype data and Mendelian law, especially for biallelic data such as SNPs, where in general the probability that several individuals have the same heterozygous genotype is higher than for multiallelic data. Based on the fact that genetic recombinations are rare in human data [4, 5, 35], it is widely believed
that haplotypes with fewer recombinations should be preferred in a haplotype reconstruction [36–38]. As already pointed out in this chapter, the parsimony principle naturally leads to optimization problems. In this case the computational problem is finding a haplotype configuration with the minimum number of recombinants. Again, recombinations make the problem hard; in fact, a first formal proof of the NP-hardness of this problem is given in [39] for the general case of pedigree graphs. A heuristic algorithm for this problem and a polynomial exact algorithm for the 0-recombination situation are also presented in that paper. We will discuss those results later, but first we need to formally define the notion of a pedigree graph and several models for the HI problem on pedigree data. Let us first define the general notion of a pedigree graph, without genotype information, as it has become widespread in biology.

Definition 10.4: A pedigree graph is a weakly connected directed acyclic graph G = (V, E), where V = M ∪ F ∪ N; here M stands for the male nodes, F for the female nodes, and N for the mating nodes, and E ⊆ {(u, v) : u ∈ M ∪ F and v ∈ N, or u ∈ N and v ∈ M ∪ F}. The nodes in M ∪ F are called the individual nodes. The indegree of each individual node is at most 1. The indegree of a mating node must be 2, with one edge starting from a male node (called father) and the other edge from a female node (called mother), and the outdegree of a mating node must be larger than zero.

In a pedigree, the individual nodes adjacent to a mating node (i.e., they have edges from the mating node) are called the children of the two individual nodes adjacent from the mating node (i.e., the father and mother nodes, which have edges to the mating node). The individual nodes that have no parents (indegree zero) are called founders. For each mating node, the induced subgraph containing the father, mother, mating, and children nodes is called a nuclear family. A parents-offspring trio (or simply trio) consists of two parents and one of their children. A mating loop is a cycle in the graph obtained from G when the directions of edges are ignored. An equivalent definition of pedigree graph, which points out the combinatorial nature of the representation, is the following:

Definition 10.5: A pedigree graph G is a weakly connected directed acyclic graph (V, E) where each vertex has indegree 2 or 0.

In Definition 10.5 only individual nodes are represented, and their gender information is omitted. The founders are the vertices without incoming edges, and a trio is any subgraph of G with three vertices u, v, w, where (u, v) and (w, v) are the only arcs of the pedigree graph (here (u, v) stands for an arc from u to v, since we are dealing with a directed graph); such a trio is denoted by the triple ⟨u, v, w⟩. In the following we will mean Definition 10.5 when we refer to the notion
of pedigree graph. The only substantial difference with respect to Definition 10.4 regards the definition of mating loops: when Definition 10.5 is considered, a mating loop consists of two distinct paths from a vertex x to a vertex y. Figure 10.4 shows side by side an example pedigree according to the two definitions of pedigree graph; in particular, Figure 10.4(a) reports the common representation of pedigree graphs. This definition of pedigree graph is very general, and there are several restricted versions, defined as follows.

Definition 10.6: A pedigree tree T is a pedigree graph without mating loops.

We further distinguish pedigree trees by restricting the number of mating partners each individual node of the tree can have. Given a pedigree tree T, for any vertex v, let P(v) be the set of vertices x such that (x, v) is an arc of T. Then T is a single-mating pedigree tree if, for any two vertices v, w, the two sets P(v) and P(w) are either disjoint or the same set; otherwise the pedigree tree is called multimating. The definition of pedigree graph introduced so far allows us to describe the structure of the parental relations. We still need a notion that allows us to relate the actual genotypes to the structure of a pedigree graph.

Definition 10.7: A genotyped pedigree graph is a pedigree graph G where each individual vertex is labeled by an m-site genotype vector.

With a slight abuse of language, by pedigree graph we will denote both a labeled (genotyped) and an unlabeled pedigree graph; in the case of a labeled pedigree, we use the node itself to denote the genotype vector associated with the node.
Figure 10.4 A pedigree with 15 members. (a) A square represents a male node and a circle represents a female node, and a solid (round) node represents a mating node. The children (e.g., 3-3, 3-5, and 3-7) are placed under their parents (e.g., 3-1 and 3-2). (b) The representation of the same pedigree according to Definition 10.5.
Figure 10.5 A pedigree with 17 members and a mating loop without showing the mating nodes.
Recall that, for any node u with genotype vector ⟨u[1], ..., u[m]⟩, each u[i] ∈ {0, 1, ?}; if u[i] ≠ ?, we say that u[i] is defined. Similarly, a haplotyped pedigree graph is a pedigree graph where each individual vertex is labeled with two haplotypes.

Definition 10.8: A genotyped pedigree graph G is g-valid if the following consistency rules hold for every trio ⟨u, v, w⟩, where u and w are the parents and v the child, and for each i, 1 ≤ i ≤ m:

• If u[i] ≠ w[i] and both are defined, then v[i] = ?;
• If u[i] ≠ w[i] and only one of u[i] or w[i] is defined, then v[i] = w[i] or v[i] = u[i];
• If u[i] = w[i] = ?, then v[i] can be 0, 1, or ?;
• u[i] = v[i] = w[i], otherwise.
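The rules of Definition 10.8 translate directly into a Mendelian-consistency check. Below is a minimal sketch, assuming genotypes encoded as strings over {'0', '1', '?'} ('?' for a heterozygous site) and trios given as (father, child, mother) triples; all names are illustrative, not from the chapter.

```python
# g-validity check of Definition 10.8 for one trio: u, w parents, v child.
def trio_consistent(u, v, w):
    for ui, vi, wi in zip(u, v, w):
        if ui != '?' and wi != '?':          # both parents defined
            if ui != wi and vi != '?':       # rule 1: child must be heterozygous
                return False
            if ui == wi and vi != ui:        # rule 4: child copies the common allele
                return False
        elif ui != wi:                       # rule 2: exactly one parent defined
            if vi not in (ui, wi):
                return False
        # rule 3: both parents '?' -> any child value is consistent
    return True

def is_g_valid(trios):
    return all(trio_consistent(u, v, w) for (u, v, w) in trios)

print(trio_consistent("0?1", "0?1", "011"))  # True
print(trio_consistent("0?1", "101", "011"))  # False (site 1 violates rule 4)
```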
Then, given a g-valid genotyped pedigree graph G, we are interested in a haplotyped pedigree graph with the same sets of vertices and edges, such that the haplotypes labeling a vertex v resolve the genotype of v in G and each haplotype (paternal/maternal) of a child is inherited from the corresponding parent (father/mother), with or without recombinations. In such a case we say that the haplotyped graph is a realization of G.

Problem 9: Pedigree Graph Haplotype Inference problem (PHI)
Input: a g-valid genotyped pedigree graph G.
Output: a haplotyped pedigree graph that is a realization of G.

Even though a realization of the genotyped graph explicitly associates the two haplotypes of each child with those of its parents, a realization might not unambiguously determine, for each allele on a given haplotype, the haplotype of the corresponding parent from which it derives. For instance, consider Figure 10.6: we know that the first allele of a haplotype of the child comes from her mother, but does it come originally from a grandmother or a grandfather? In order to disambiguate such situations we introduce the notion of a GS (grandparental source) value for each allele, which states whether the allele is inherited from the parent's paternal haplotype or maternal haplotype.
Figure 10.6 The child in a pedigree tree: it is known that the first allele of a haplotype of the child comes from her mother, but it is not known if it is originally from a grandmother or a grandfather.
The introduction of a GS value for each allele is necessary due to the presence of recombinations, because in this case a haplotype need not be the exact copy of one haplotype of a parent. More formally, given a trio in which v is the child and u and w are, respectively, the father and the mother, and assuming that h1(x), h2(x) denote the pair of haplotypes resolving the genotype at node x, GS(h1(v)[i]) = 0 if h1(v)[i] is inherited from h1(u), and GS(h1(v)[i]) = 1 if it is inherited from h2(u). Similarly we can define GS(h2(v)[i]). We are now able to exploit the GS values to count the number of recombinants as follows: two alleles at adjacent sites of the same haplotype induce a recombinant (or recombination event) if their GS values differ. Formally, given haplotype h, a recombination occurs at site i of h if GS(h[i]) ≠ GS(h[i + 1]). According to the parsimony principle, we can derive an optimization problem using the above notions.

Problem 10: General Minimum Recombination Haplotype Inference problem (GMRHI)
Input: a g-valid genotyped pedigree graph G.
Output: a realization of G minimizing the number of recombination events.

In [39] it has been proven that GMRHI (originally called the MRHC problem) is in general NP-hard. Actually, the proof shows that even with two sites (m = 2) the GMRHI problem remains NP-hard when the pedigree graph is allowed to contain mating loops. The reduction is from a stronger version of three-dimensional matching, which is NP-hard [40]. An iterative heuristic is proposed based on the assumption that recombination events are rare. Later, the same group proposed an integer linear programming formulation of GMRHI with missing data and adopted a branch-and-bound algorithm that is efficient for problems of practical size [41]. In [39] a polynomial time (O(m³n³)) exact algorithm is also given for the restriction of the problem where no recombination occurs, that is:

Problem 11: Zero-Recombinant Haplotype Inference problem (ZRHI)
Input: a g-valid genotyped pedigree graph G.
Output: a realization of G such that no recombination events occur, if such a realization exists; otherwise, report that no realization exists.

The algorithm first identifies all the necessary constraints based on Mendelian law and the zero-recombinant assumption, and represents them using a system of linear equations over the cyclic group Z2. By using a simple method based on Gaussian elimination, all feasible haplotype configurations can be obtained. The running time for ZRHI has been further improved to O(mn² + n³ log n log log n) by Xiao et al. [42]. Their algorithm can efficiently eliminate redundant equations in a system of linear equations collected based on
some spanning tree of the pedigree graph. Chan et al. [43] claimed a linear-time algorithm for ZRHI when the pedigree has no mating loops. Since mating loops are very rare in human pedigrees, it is interesting to consider the following more restricted formulations of the general problem.

Problem 12: Multimating Pedigree Tree Minimum Recombination Haplotype Inference problem (MPT-MRHI)
Input: a multimating g-valid genotyped pedigree tree T.
Output: a realization of T minimizing the number of recombination events.

Problem 13: Single-Mating Pedigree Tree Minimum Recombination Haplotype Inference problem (SPT-MRHI)
Input: a single-mating g-valid genotyped pedigree tree T.
Output: a realization of T minimizing the number of recombination events.

It has been proved in [44] that even SPT-MRHI is NP-hard, by a reduction from MAX-CUT [40]. Unlike the NP-hardness proof of GMRHI in [39], where only two sites are needed, the proof for SPT-MRHI requires an unbounded number of sites. More recently, Liu et al. [45] have shown that the SPT-MRHI problem is NP-hard even when an individual node has at most one mate and at most one child (binary-tree pedigrees), by a reduction from ≠3SAT [46]. In the same paper, the authors also show that, with missing data, GMRHI on pedigrees with two loci and SPT-MRHI on binary-tree pedigrees cannot be approximated unless P = NP.
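To make the recombination count used in Problems 10–13 concrete, here is a minimal sketch: given the GS vector of a haplotype (one 0/1 entry per site, recording whether the allele comes from the parent's paternal or maternal haplotype), recombination events are exactly the adjacent positions whose GS values differ. The encoding is an assumption for illustration, not taken verbatim from [39].

```python
# Count recombination events from a GS vector (illustrative encoding).
def count_recombinations(gs):
    """Number of adjacent sites whose grandparental source differs."""
    return sum(1 for a, b in zip(gs, gs[1:]) if a != b)

# Three sites copied from one grandparent, two from the other, then one
# from the first again => 2 recombination events.
print(count_recombinations([0, 0, 0, 1, 1, 0]))  # 2
```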
10.5 Inferring Haplotypes from Fragments

The Human Genome Project has successfully produced a draft version of the human genome through a sequencing process. A different kind of problem arises when the sequencing process aims also at reconstructing haplotypes. Roughly speaking, the sequencing process consists of two phases: in the first phase a number of fragments are obtained, where each fragment is a small piece (a few hundred bases long) of the examined DNA; afterwards all such fragments are merged into a chromosome, for example via shotgun sequencing [2]. In its original formulation the sequencing problem assumes that all fragments come from only one copy of the chromosomes of a DNA strand. This assumption, however, does not hold: not only do the fragments come from both copies, but it is not possible to associate each fragment with the copy of the chromosome from which it originates. Thus, computing the two sequences that form the haplotypes becomes more challenging. Moreover, the presence of
errors in the fragments that need to be assembled makes the problem of reconstructing the original sequence from fragments harder when SNPs are considered. In [6], the problem of reconstructing the pair of haplotype sequences from the fragments of a human chromosome is investigated by introducing several formulations. First of all, the location of each SNP over a fragment is assumed to be known; that is, the genomic sequence is thought of as a sequence of positions with one (not an SNP site) or two (a biallelic SNP site) symbols associated with each position. The sequence of SNP sites along a fragment is described by a vector over the binary alphabet {0, 1}, used to denote the two distinct alleles of the SNP sites contained in the fragment. The formal definitions of the problems introduced in [6] share the fact that the instance is always an n × m matrix M, where each entry M[i, j] is 0, 1, or −, the ith row corresponds to the ith fragment, and the jth column corresponds to the jth SNP. When M[i, j] = −, the ith fragment does not cover the jth SNP, which means that the allele of the fragment at position j is unknown: the entry − of matrix M is called a hole. Moreover, we say that two fragments conflict with each other if they disagree on an SNP covered by both fragments; that is, fragments i and j of matrix M are in conflict whenever there exists an SNP site k such that M[i, k] ≠ M[j, k] and neither M[i, k] nor M[j, k] is a hole. A conflict at a given SNP site indicates that the two fragments come from two distinct haplotypes or, more precisely, that the SNP site is heterozygous and has two distinct alleles on the two chromosome copies. If the matrix M represents n fragments obtained from one pair of haplotypes and the fragments contain no errors in the alleles reported in the matrix, conflicts among the fragments can be used to derive a partition of the fragments into two sets, each containing the mutually nonconflicting fragments from the same haplotype. Thus, we say that a matrix M is error free if there exists a partition of the rows of M into two matrices M1 and M2 such that neither M1 nor M2 contains conflicting fragments. In [6] the following biological problem is investigated: the reconstruction of the two haplotypes of a chromosome from the SNP values of fragments represented by an instance matrix that is not necessarily error free. It must be pointed out that, even when the matrix is error free, there may be more than one pair of haplotypes whose fragments give the same instance matrix: in this case the reconstruction of the original haplotypes of the chromosome from an SNP matrix is not solvable, as the instance matrix does not contain enough information to disambiguate among all possible haplotypes. Hence, we restrict ourselves to the problem of finding a pair of haplotypes whose fragments are represented by a given matrix. The problem then consists of assigning each fragment to a copy of the chromosome.
When the matrix is error free, this problem can be solved easily by computing the fragment conflict graph, whose vertices are the fragments and where the pair (i, j) is an edge if fragments i and j conflict (see Figure 10.7). Indeed, the graph must be bipartite, with each shore representing the fragments belonging to one of the two copies of the chromosome. Notice that the solution to the problem of inferring haplotypes from fragments obtained from the fragment conflict graph is unique if the graph is connected, in which case the solution consists of a unique pair of haplotypes. In the presence of errors the problem becomes more complex; in fact, we look for the minimum number of modifications to the instance matrix that make it error free. Figure 10.8 gives an example of a matrix with errors; observe that the fragment conflict graph of this matrix is not bipartite. (A small code sketch of this bipartiteness test is given after Figure 10.8.) Different operations on the matrix may be defined, and each such operation leads to a specific computational problem, as pointed out in [6], where the following problems have been introduced:

Problem 14: Minimum Fragment Removal problem (MFR)
Compute the minimum set of rows to remove from the matrix so that the matrix is error free.
        SNPs:  1  2  3  4  5  6
Fragment 1:    0  1  —  0  —  0
Fragment 2:    0  —  1  —  —  0
Fragment 3:    1  0  —  —  1  1
Fragment 4:    —  1  1  —  0  —
Fragment 5:    1  —  —  1  —  1

Figure 10.7 An error-free matrix and its associated fragment conflict graph.
        SNPs:  1  2  3  4  5  6
Fragment 1:    0  1  —  0  —  0
Fragment 2:    0  —  1  —  —  0
Fragment 3:    1  0  —  —  1  1
Fragment 4:    —  1  1  —  0  —
Fragment 5:    1  1  —  1  —  1

Figure 10.8 A matrix with errors and its associated fragment conflict graph.
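The bipartiteness argument above can be sketched directly in code: build the fragment conflict graph and 2-color it. The following illustrative Python uses this section's matrix conventions (None standing for the hole symbol −); applied to the reconstructed matrix of Figure 10.7 it returns the two haplotype classes, while the matrix of Figure 10.8 is rejected. All names are assumptions for illustration.

```python
# 2-coloring of the fragment conflict graph: error-free test and partition.
from collections import deque

def conflict(f, g):
    return any(a is not None and b is not None and a != b for a, b in zip(f, g))

def error_free_partition(M):
    """Return (shore0, shore1) if the conflict graph is bipartite, else None."""
    n = len(M)
    color = [None] * n
    for s in range(n):
        if color[s] is not None:
            continue
        color[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if v != u and conflict(M[u], M[v]):
                    if color[v] is None:
                        color[v] = 1 - color[u]
                        queue.append(v)
                    elif color[v] == color[u]:
                        return None  # odd cycle: the matrix has errors
    return ([i for i in range(n) if color[i] == 0],
            [i for i in range(n) if color[i] == 1])

M7 = [[0, 1, None, 0, None, 0], [0, None, 1, None, None, 0],
      [1, 0, None, None, 1, 1], [None, 1, 1, None, 0, None],
      [1, None, None, 1, None, 1]]
print(error_free_partition(M7))   # ([0, 1, 3], [2, 4]): the two haplotypes

M8 = [row[:] for row in M7]
M8[4][1] = 1                      # Figure 10.8: fragment 5 now reads 1 at SNP 2
print(error_free_partition(M8))   # None: the conflict graph has an odd cycle
```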
Problem 15: Minimum SNP Removal problem (MSR)
Compute the minimum set of columns to remove from the matrix so that the matrix is error free.

Problem 16: Longest Haplotype Reconstruction problem (LHR)
Compute a set of rows to remove from the matrix so that the matrix is error free and the sum of the lengths of the inferred haplotypes is maximized.

Problem 17: Minimum Error Correction problem (MEC)
Compute the minimum number of corrections to the entries of the matrix so that the resulting matrix is error free.

A single fragment may (but need not) cover SNP sites that are consecutive on the fragment: in such a case the fragment is gapless. A fragment has k gaps if it covers k + 1 blocks of consecutive SNPs. Of particular interest are the cases of 0 or 1 gaps, as these cases are common when sequencing. The classical shotgun-sequencing procedure deals with probes that are consecutive nucleotides from a DNA strand. Recent advances in sequencing technology have made the production of mate pairs feasible, where a mate pair is made of two probes from the same copy of a chromosome with a distance between the two probes that is approximately known. Representing a mate pair with a 1-gap fragment is immediate. Table 10.1 summarizes some results described in [6, 16, 47]. The presence of gaps in fragments, and also the number of holes, is strictly related to the computational complexity of the above problems (Table 10.1 reports all known results about the complexity of such problems).

Table 10.1 Known Results for Some Problems on Fragments

Problem   Gaps        Exact                                        Approximation
MFR       ≤ 1         NP-hard                                      APX-hard*
MFR       0           Solvable in time O(m²n + m³)
MFR       k holes     Solvable in time O(2^{2k}nm² + 2^{3k}m³)
MSR       ≤ 2         NP-hard                                      APX-hard**
MSR       0           Solvable in time O(mn²)
MSR       k holes     Solvable in time O(mn^{2k+2})
LHR       k holes     ?                                            ?
LHR       0           Solvable in time O(n²m + n³)                 ?
MEC       with gaps   NP-hard                                      ?
MEC       0           NP-hard                                      ?

*Not known if in APX; see MinNodeBipartizer.
**Not known if in APX; see MinEdgeBipartizer.
Indeed, whenever fragments are gapless the problems are in general polynomial. The 1-gap case is relevant since the presence of at most one gap in each fragment is already a sufficient condition to make the MFR problem hard. It is an interesting open question to investigate the 1-gap case for the MSR and the other problems. In the following we describe a graph-based method to solve the MSR problem for gapless instance matrices. This method uses the notion of conflict among SNP sites. Given an SNP matrix M, two SNP sites i and j are in conflict in M if both values 0 and 1 appear in each of columns i and j of M and there exist two fragments x and y such that the submatrix induced by rows x and y and columns i and j contains three symbols of one value and one of the other. The SNP conflicts of an SNP matrix M are represented by the SNP conflict graph, whose vertices are the SNPs, with an edge for each pair (i, j) of conflicting SNPs (a small code sketch of this construction follows Figure 10.9). An interesting property relates the SNP matrices without gaps to error-free matrices.

Lemma 2: A matrix without gaps is error free if and only if it has no SNP conflicts.

By using this property, which can be easily verified, the MSR problem reduces to solving the Min Vertex Cover problem over the SNP conflict graph, as making a matrix error free means removing from this graph the minimum number of vertices (matrix columns) so that the graph has no edges. In [6] it is proved that whenever a matrix is without gaps, its SNP conflict graph is perfect [48]. Since the Min Vertex Cover problem is polynomial-time solvable on perfect graphs, MSR can be solved efficiently via the above reduction. On the other hand, when the fragments contain gaps the reduction does not work, as illustrated by the following example.

Example 10.9: Assume that M is the matrix of Figure 10.9 and G the corresponding SNP conflict graph. Then vertex 4 is a minimum vertex cover, but the matrix M′ obtained from M by removing column 4 is not error free.
        SNPs:  1  2  3  4
Fragment 1:    0  —  0  0
Fragment 2:    1  1  —  0
Fragment 3:    —  0  1  1

Figure 10.9 A matrix with gaps and the corresponding SNP conflict graph.
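Here is a sketch of the SNP conflict graph construction used by the MSR reduction, under the same matrix conventions as above (None for a hole); on the matrix of Figure 10.9 it reports the single conflict between sites 1 and 4 (0-based indices in the code). Names are illustrative.

```python
# Build the SNP conflict graph of an SNP matrix (illustrative sketch).
from itertools import combinations

def both_values(M, c):
    vals = {M[r][c] for r in range(len(M)) if M[r][c] is not None}
    return vals == {0, 1}

def snp_conflicts(M):
    """Edges (i, j) of the SNP conflict graph of matrix M (0-based sites)."""
    n, m = len(M), len(M[0])
    edges = []
    for i, j in combinations(range(m), 2):
        if not (both_values(M, i) and both_values(M, j)):
            continue
        rows = [(M[r][i], M[r][j]) for r in range(n)
                if M[r][i] is not None and M[r][j] is not None]
        if any(a + b + c + d in (1, 3)   # three symbols of one value, one of the other
               for (a, b), (c, d) in combinations(rows, 2)):
            edges.append((i, j))
    return edges

M9 = [[0, None, 0, 0], [1, 1, None, 0], [None, 0, 1, 1]]
print(snp_conflicts(M9))  # [(0, 3)]: sites 1 and 4 of Figure 10.9 conflict
```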
All algorithms for the gapless cases are via dynamic programming; the main idea is that it is possible to infer the optimal solution from the optimal solution on a submatrix obtained by removing an SNP column (MSR) or a fragment row (MFR, LHR). Let M be an SNP matrix; we denote by M[1...k] the submatrix of M formed by the first k rows of M. Let us consider the MFR problem. Since the matrix M is gapless, each row of M consists of a sequence of defined values encoding the actual fragment, preceded or followed by some (possibly zero) holes. For each row f of M we denote by l(f) and r(f), respectively, the leftmost and the rightmost positions (SNPs) of f that are not holes. The algorithm exploits the fact that all positions between l(f) and r(f) must be defined. A first preprocessing step of the algorithm is to sort the rows of M in increasing order of l(f), that is, l(i) ≤ l(j) whenever i < j. A dynamic programming algorithm mainly consists of showing that an optimal solution of an instance can be computed from an optimal solution of some induced subinstance; in our case we compute the optimal solution over the matrix M by exploiting the optimal solution where the “rightmost” fragment is removed from M. Consequently, we show how to solve the instance M[1...k] from an optimal solution of M[1...k − 1] (recall that an optimal solution is a minimum-size set of rows that must be removed to obtain an error-free matrix). The algorithm computes the value D[i, j, k], that is, the optimum over instance M[1...k] with the additional restriction that fragment i (respectively, j) has the maximum value of r(i) (respectively, r(j)) among all r(f) over the fragments placed on the first (respectively, the second) haplotype. See Figure 10.10 for an illustrative example. Moreover, for each row f considered, we denote by OK(f) the set of rows with index g < f that agree with row f, that is, rows representing fragments that may be on the same haplotype copy. Now we can explicitly state the recurrence that allows us to solve the problem. The following cases must be considered:

1. D[i, j, 0] := 0.
2. If k > i and k > j, then D[i, j, k] is the minimum of:
   • D[i, j, k − 1], if r(k) ≤ r(j) and rows k and j agree;
   • D[i, j, k − 1], if r(k) ≤ r(i) and rows k and i agree;
   • D[i, j, k − 1] + 1 (row k is removed).
3. If k = i, then D[k, j, k] := min {D[h, j, k − 1] : h ∈ OK(k), r(h) ≤ r(k)}.
4. If k = j, then D[i, k, k] := min {D[i, h, k − 1] : h ∈ OK(k), r(h) ≤ r(k)}.
Figure 10.10 Example of fragments placed on the haplotypes by the algorithm in [47].
The optimal solution is then min {D[h, k, n] : h, k}, where n is the number of rows of M. In [47] the dynamic programming algorithm described above has also been extended to a fixed-parameter algorithm for the same problem, where the parameter is the number of holes contained in the fragments, that is, the number of unresolved symbols between the leftmost and the rightmost resolved symbols of each fragment. The algorithm can also be extended to the class of matrices whose columns can be permuted so that in each row the (0/1) symbols appear consecutively. The latter problem is also a version of the consecutive ones problem, for which a polynomial-time algorithm is described in [49].
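The recurrence above can be turned into a short (if unoptimized) program. The following Python sketch assumes gapless rows over {0, 1, None} (defined entries consecutive) and uses a sentinel index 0 for an empty haplotype with r(0) = −1; it is an illustration of the recurrence, not the tuned implementation of [47], and all names are assumptions.

```python
# Memoized version of the gapless-MFR recurrence (illustrative, small inputs).
from functools import lru_cache

def solve_mfr(rows):
    rows = sorted(rows, key=lambda f: next(i for i, x in enumerate(f) if x is not None))
    n = len(rows)
    INF = float('inf')

    def R(f):  # rightmost defined position; -1 for the sentinel 0
        return -1 if f == 0 else max(i for i, x in enumerate(rows[f - 1]) if x is not None)

    def agree(a, b):  # no conflicting defined entries (sentinel agrees with all)
        if a == 0 or b == 0:
            return True
        return all(x is None or y is None or x == y
                   for x, y in zip(rows[a - 1], rows[b - 1]))

    @lru_cache(maxsize=None)
    def D(i, j, k):  # optimum on rows 1..k; i, j rightmost on haplotypes 1, 2
        if k == 0:
            return 0 if i == j == 0 else INF
        if k > i and k > j:              # row k is covered by i or j, or removed
            best = D(i, j, k - 1) + 1    # remove row k
            if R(k) <= R(j) and agree(k, j):
                best = min(best, D(i, j, k - 1))
            if R(k) <= R(i) and agree(k, i):
                best = min(best, D(i, j, k - 1))
            return best
        if k == i:                       # k becomes rightmost on haplotype 1
            return min((D(h, j, k - 1) for h in range(k)
                        if (h == 0 or h != j) and agree(h, k) and R(h) <= R(k)),
                       default=INF)
        if k == j:                       # symmetric case for haplotype 2
            return min((D(i, h, k - 1) for h in range(k)
                        if (h == 0 or h != i) and agree(h, k) and R(h) <= R(k)),
                       default=INF)
        return INF

    return min(D(i, j, n) for i in range(n + 1) for j in range(n + 1)
               if i != j or i == 0)

F = [[0, 0, 1, None, None], [None, 1, 1, 1, None], [1, 1, 0, None, None],
     [None, None, 0, 0, 1], [None, 1, 0, 0, None]]
print(solve_mfr(F))  # 1: removing the second fragment makes the matrix error free
```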
10.6 A Glimpse over Statistical Methods

In this section we present some basic aspects of the application of statistical methods to the haplotype inference problem. These methods are among the most widely used by biologists; indeed, a number of results on haplotype inference from real data have been produced with statistical models. Here we only briefly review some of these algorithms, since this survey mainly focuses on combinatorial approaches to haplotype inference. Moreover, some of the combinatorial problems discussed here aim to address biological issues that are different from the ones addressed by the algorithms based on statistical models. The introduction of statistical models is mainly due to some shortcomings of the method introduced in [11]. In fact, the method in [11] requires an initial set of resolved haplotypes, and it also depends strongly on the order in which haplotypes are resolved. The approach of [12, 50] reduces the dependency on the order, hence making the whole procedure more reliable. The main idea of statistical models is that haplotypes have an unknown distribution in the target population and the observed genotype of each individual is simply a combination of two haplotypes randomly drawn from the population. The goal of statistical haplotype inference is thus to estimate the haplotype frequencies; the haplotypes of each individual can then be easily inferred based on the haplotype frequencies under some biological assumptions (such as the random mating assumption). Two different formulations of the haplotype inference problem, namely, maximum-likelihood inference [51] and Bayesian inference [52, 53], have been investigated and are briefly discussed here.
Problem 18: Frequencies Haplotype Inference problem (FHI)
Input: a sample G of genotypes.
Output: the set of haplotype frequencies {h1, h2, ..., hn} (where n is the number of all possible haplotypes) that maximizes the likelihood function of observing the genotype sample G.

Problem 19: Bayesian Frequencies Haplotype Inference problem (BFHI)
Input: a sample G of genotypes and a prior distribution of the haplotype frequencies.
Output: the posterior distribution of the haplotype frequencies given the sample G.

The FHI problem is tackled in [51], where an expectation-maximization (EM) algorithm has been proposed to estimate the haplotype frequencies that maximize the likelihood function of a genotype sample. The general framework of the EM algorithm [54] is to find the maximum-likelihood estimator(s) in the presence of missing data by iteratively executing the E-step and the M-step until convergence. Let Ht denote the set of haplotype frequencies and Gt denote the set of probabilities of all the genotypes at time t. The EM algorithm works by arbitrarily assigning an initial value H0 (a possible initial set of frequency values is the one corresponding to the assumption that all possible haplotypes are equiprobable). Based on H0, the expected frequency of each observed genotype can be easily calculated; these expected genotype frequencies form G1 and are used in turn to estimate the haplotype frequencies at the M-step, resulting in a new set H1. The two steps are iterated until convergence (i.e., until the difference between Ht+1 and Ht is smaller than a predefined value). At each iteration, the solution Ht is improved in the M-step by maximizing the likelihood function of the genotype sample. Different initial values can be used in order to increase the chance of obtaining a globally optimal solution. The BFHI problem is tackled by an iterative stochastic-sampling strategy, the pseudo Gibbs sampler (PGS) [52], which makes use of the Markov chain Monte Carlo method under the assumption of a coalescent model. A Gibbs sampler iteratively samples a pair of compatible haplotypes for each genotype, conditional on the genotypes G and on the other haplotypes, and uses these values to update the frequencies of the haplotype distribution. This iterative algorithm produces haplotype frequencies h1, ..., hm as if they were sampled from the desired posterior distribution of the haplotype frequencies given the sample genotypes G. The two methods described above cannot satisfactorily handle a large number of SNPs or missing data. These issues are addressed by the method proposed in [53], where a robust Bayesian procedure has been introduced. The method in [53] makes use of the biological model also used in [51], but imposes no assumptions on the population history (in contrast to the coalescent model
used in [52]). In particular, the method introduces a divide-and-conquer technique so that a larger number of haplotypes can be studied. The algorithm partitions the genotypes into units, where each unit has a maximum length of 8 loci. The method first constructs a set of most probable partial haplotypes compatible with each unit using the Gibbs sampler. Two adjacent units are then combined to construct a set of the most probable partial haplotypes that are compatible with the genotypes of the two units. The algorithm recursively combines the partial haplotypes until the whole haplotype is created. A detailed comparison of the different statistical methods and the combinatorial method of [11], based on an experimental study, can be found in [53]. We conclude this section by observing that, since there is an exponential number of possible haplotype solutions for a given set of genotypes, the statistical methods may have to analyze an exponential number of haplotypes in order to generate a solution to the FHI and BFHI problems. Even equipped with advanced numerical methods such as the EM algorithm and the Gibbs sampler, these statistical methods are still very time consuming. Thus, the application of these methods is restricted to a small number of individuals in a population; moreover, the maximum number of individuals that can be handled decreases as the number of SNPs increases. This limitation is not present in the combinatorial approaches based on polynomial algorithmic solutions.
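A minimal sketch of the EM approach to Problem 18 follows, assuming genotypes encoded as strings over {'0', '1', '?'} ('?' for a heterozygous site). The enumeration of resolving pairs mirrors the sketch in Section 10.3, and all names are illustrative, not taken from [51].

```python
# EM estimation of haplotype frequencies (Problem 18, illustrative sketch).
from itertools import product
from collections import defaultdict

def resolving_pairs(g):
    amb = [i for i, c in enumerate(g) if c == '?']
    out = []
    for bits in product('01', repeat=len(amb)):
        h1, h2 = list(g), list(g)
        for i, b in zip(amb, bits):
            h1[i], h2[i] = b, '1' if b == '0' else '0'
        out.append((''.join(h1), ''.join(h2)))
    return out

def em_frequencies(genotypes, iters=100):
    pairs = {g: resolving_pairs(g) for g in genotypes}
    haps = {h for pl in pairs.values() for p in pl for h in p}
    freq = {h: 1.0 / len(haps) for h in haps}          # uniform start
    for _ in range(iters):
        counts = defaultdict(float)
        for g in genotypes:                            # E-step: expected pair counts
            weights = [freq.get(a, 0.0) * freq.get(b, 0.0) for a, b in pairs[g]]
            total = sum(weights) or 1.0
            for (a, b), w in zip(pairs[g], weights):
                counts[a] += w / total
                counts[b] += w / total
        norm = sum(counts.values()) or 1.0             # M-step: re-estimate
        freq = {h: counts[h] / norm for h in haps}
    return freq

print(em_frequencies(["?1", "11", "11"]))  # mass concentrates on '11' (5/6)
```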
10.7 Discussion

Many models and algorithms have been proposed for haplotype reconstruction and haplotype frequency estimation. Another important task is how to take advantage of the inferred haplotype information in biomedical research. Knowledge about haplotype structure learned from the HapMap project has been used in selecting tag SNPs for association studies. Various methods have also been proposed to use haplotype information directly in disease gene association mapping, and it has been shown that haplotype-based methods may provide higher power than single-SNP-based methods under some assumptions. However, increasing evidence has shown that gene-gene interactions and gene-environment interactions play an important role in the etiology of complex diseases. We believe that information from various data sources, such as SNP variations, gene/protein expression, and interactions, has to be integrated to achieve a better understanding of the genetic risks of many complex diseases.
Acknowledgments Jing Li is supported in part by NIH/NLM grant R01 LM008991 and a start-up fund from Case Western Reserve University. Paola Bonizzoni is supported in part by an FAR2006 grant, “Algorithm for the Systems Biology and Bioinformatics.”
References

[1] The International Human Genome Sequencing Consortium, “Initial Sequencing and Analysis of the Human Genome,” Nature, Vol. 409, No. 6822, 2001, pp. 860–921.
[2] Venter, J. C., et al., “The Sequence of the Human Genome,” Science, Vol. 291, No. 5507, 2001, pp. 1304–1351.
[3] Patil, N., et al., “Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21,” Science, Vol. 294, No. 5547, 2001, pp. 1669–1670.
[4] Daly, M., et al., “Fine-Structure Haplotype Map of 5q31: Implications for Gene-Based Studies and Genomic LD Mapping,” Am. J. Hum. Genet., Vol. 69, No. 4, 2001.
[5] Gabriel, S. B., et al., “The Structure of Haplotype Blocks in the Human Genome,” Science, Vol. 296, No. 5576, 2002, pp. 2225–2229.
[6] Lancia, G., et al., “SNPs Problems, Complexity and Algorithms,” Proc. 9th European Symp. on Algorithms (ESA), 2001, pp. 182–193.
[7] Gusfield, D., “Haplotyping as Perfect Phylogeny: Conceptual Framework and Efficient Solutions,” Proc. 6th Annual Conference on Research in Computational Molecular Biology (RECOMB), 2002, pp. 166–175.
[8] Halperin, E., E. Eskin, and R. M. Karp, “Efficient Reconstruction of Haplotype Structure Via Perfect Phylogeny,” Journal of Bioinformatics and Computational Biology, Vol. 1, No. 1, 2003, pp. 1–20.
[9] Halperin, E., E. Eskin, and R. M. Karp, “Large Scale Reconstruction of Haplotypes from Genotype Data,” Proc. 7th Annual Conference on Research in Computational Molecular Biology (RECOMB), 2003, pp. 104–113.
[10] Zhang, K., et al., “A Dynamic Programming Algorithm for Haplotype Block Partitioning,” Proc. Natl. Acad. Sci. USA, Vol. 99, No. 11, 2002, pp. 7335–7339.
[11] Clark, A., “Inference of Haplotypes from PCR-Amplified Samples of Diploid Populations,” Molecular Biology and Evolution, Vol. 7, No. 2, 1990, pp. 111–122.
[12] Gusfield, D., “Inference of Haplotypes from Samples of Diploid Populations: Complexity and Algorithms,” Journal of Computational Biology, Vol. 8, No. 3, 2001, pp. 305–323.
[13] Lin, H., et al., “A Note on the Single Genotype Resolution Problem,” J. Comput. Sci. Technol., Vol. 19, No. 2, 2004, pp. 254–257.
[14] Gusfield, D., “Haplotype Inference by Pure Parsimony,” Proc. of 14th Annual Symposium on Combinatorial Pattern Matching (CPM 2003), 2003, pp. 144–155.
[15] Lancia, G., M. Pinotti, and R. Rizzi, “Haplotyping Populations by Pure Parsimony: Complexity of Exact and Approximation Algorithms,” INFORMS Journal on Computing, Vol. 14, No. 4, 2004, pp. 348–359.
[16] Cilibrasi, R., et al., “On the Complexity of Several Haplotyping Problems,” Proc. of Algorithms in Bioinformatics, 5th International Workshop (WABI 2005), 2005, pp. 128–139.
[17] Lancia, G., and R. Rizzi, “A Polynomial Case of the Parsimony Haplotyping Problem,” Oper. Res. Lett., Vol. 34, No. 3, 2006, pp. 289–295.
[18] Huang, Y., K. Chao, and T. Chen, “An Approximation Algorithm for Haplotype Inference by Maximum Parsimony,” Journal of Computational Biology, Vol. 12, No. 10, 2005, pp. 1261–1274.
[19] Sharan, R., B. V. Halldorsson, and S. Istrail, “Islands of Tractability for Parsimony Haplotyping,” IEEE/ACM Trans. on Computational Biology and Bioinformatics, Vol. 3, No. 3, 2006, pp. 303–311.
[20] Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge, U.K.: Cambridge University Press, 1997.
[21] Bafna, V., et al., “A Note on Efficient Computation of Haplotypes Via Perfect Phylogeny,” Journal of Computational Biology, Vol. 11, No. 5, 2004, pp. 858–866.
[22] Bafna, V., et al., “Haplotyping as Perfect Phylogeny: A Direct Approach,” Journal of Computational Biology, Vol. 11, No. 5, 2004, pp. 858–866.
[23] Ding, Z., V. Filkov, and D. Gusfield, “A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem,” Proc. 9th Annual Conference on Research in Computational Molecular Biology (RECOMB), 2005, pp. 585–600.
[24] Bonizzoni, P., “A Linear-Time Algorithm for the Perfect Phylogeny Haplotype Problem,” Algorithmica, Vol. 48, No. 3, 2007, pp. 233–248.
[25] Barzuza, T., et al., “Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs,” Proc. 15th Annual Symposium on Combinatorial Pattern Matching (CPM), 2004, pp. 14–31.
[26] Halperin, E., and R. M. Karp, “Perfect Phylogeny and Haplotype Assignment,” Proc. of 8th Annual Conference on Research in Computational Molecular Biology (RECOMB), 2004, pp. 10–19.
[27] Halperin, E., and R. M. Karp, “The Minimum-Entropy Set Cover Problem,” Proc. of 31st International Colloquium on Automata Languages and Programming (ICALP), 2004, pp. 733–744.
[28] Kimmel, G., and R. Shamir, “The Incomplete Perfect Phylogeny Haplotype Problem,” J. Bioinformatics and Computational Biology, Vol. 3, No. 2, 2005, pp. 359–384.
[29] Gramm, J., et al., “On the Complexity of Haplotyping Via Perfect Phylogeny,” Proc. of 2nd RECOMB Satellite Workshop on Computational Methods for SNPs and Haplotypes, 2004, pp. 35–46.
[30] Gramm, J., T. Nierhoff, and T. Tantau, “Perfect Path Phylogeny Haplotyping with Missing Data Is Fixed-Parameter Tractable,” Parameterized and Exact Computation, First International Workshop, IWPEC 2004, 2004, pp. 174–186.
[31] Pe’er, I., et al., “Incomplete Directed Perfect Phylogeny,” SIAM Journal on Computing, Vol. 33, No. 3, 2004, pp. 590–607.
[32] Bonizzoni, P., et al., “Experimental Analysis of a New Algorithm for Partial Haplotype Completion,” International Journal of Bioinformatics Research and Applications (IJBRA), Vol. 1, No. 4, 2005, pp. 461–473.
[33] Kimmel, G., R. Sharan, and R. Shamir, “Computational Problems in Noisy SNP and Haplotype Analysis: Block Scores, Block Identification, and Population Stratification,” INFORMS Journal on Computing, Vol. 14, No. 4, 2004, pp. 360–370.
[34] Bonizzoni, P., et al., “Fingerprint Clustering with Bounded Number of Missing Values,” Proc. 17th Annual Symposium on Combinatorial Pattern Matching (CPM), 2006, pp. 106–116.
[35] Helmuth, L., “Genome Research: Map of Human Genome 3.0,” Science, Vol. 293, No. 5530, 2001, pp. 583–585.
[36] O’Connell, J. R., “Zero-Recombinant Haplotyping: Applications to Fine Mapping Using SNPs,” Genet. Epidemiol., Vol. 19, Suppl. 1, 2000, pp. S64–S70.
[37] Qian, D., and L. Beckmann, “Minimum-Recombinant Haplotyping in Pedigrees,” Am. J. Hum. Genet., Vol. 70, No. 6, 2002, pp. 1434–1445.
[38] Tapadar, P., S. Ghosh, and P. P. Majumder, “Haplotyping in Pedigrees Via a Genetic Algorithm,” Hum. Hered., Vol. 50, No. 1, 2000, pp. 43–56.
[39] Li, J., and T. Jiang, “Efficient Inference of Haplotypes from Genotypes on a Pedigree,” J. Bioinfo. Comp. Biol., Vol. 1, No. 1, 2003, pp. 41–69.
[40] Garey, M. R., and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, New York: W. H. Freeman, 1979.
[41] Li, J., and T. Jiang, “Computing the Minimum Recombinant Haplotype Configuration from Incomplete Genotype Data on a Pedigree by Integer Linear Programming,” J. Comp. Biol., Vol. 12, 2005, pp. 719–739.
[42] Xiao, J., et al., “Fast Elimination of Redundant Linear Equations and Reconstruction of Recombination-Free Mendelian Inheritance on a Pedigree,” Proc. 18th ACM-SIAM Symposium on Discrete Algorithms (SODA), New Orleans, LA, January 2007.
[43] Chan, M. Y., et al., “Linear-Time Haplotype Inference on Pedigrees Without Recombinations,” Proc. of 6th International Workshop on Algorithms in Bioinformatics (WABI), 2006, pp. 56–67.
[44] Doi, K., J. Li, and T. Jiang, “Minimum Recombinant Haplotype Configuration on Tree Pedigrees,” Proc. of 3rd International Workshop on Algorithms in Bioinformatics (WABI), Hungary, 2003, pp. 339–353.
[45] Liu, L., et al., “Complexity and Approximation of the Minimum Recombinant Haplotype Configuration Problem,” Proc. of the 15th International Symposium on Algorithms and Computation (ISAAC), 2005, pp. 370–379.
[46] Schaefer, T. J., “The Complexity of Satisfiability Problems,” Proc. of the 10th Symposium on the Theory of Computing (STOC), 1978, pp. 216–226.
[47] Rizzi, R., et al., “Practical Algorithms and Fixed-Parameter Tractability for the Single Individual SNP Haplotyping Problem,” Theor. Comput. Sci., Vol. 335, No. 1, 2005, pp. 109–125.
[48] Grötschel, M., L. Lovász, and A. Schrijver, “A Polynomial Algorithm for Perfect Graphs,” Annals of Discrete Mathematics, Vol. 21, 1984, pp. 325–356.
[49] Booth, K. S., and G. S. Lueker, “Testing for the Consecutive Ones Property, Interval Graphs, and Graph Planarity Using PQ-Tree Algorithms,” Journal of Computer and System Sciences, Vol. 13, No. 3, 1976, pp. 335–379.
[50] Orzack, S., D. Gusfield, and V. P. Stanton, “The Absolute and Relative Accuracy of Haplotype Inferral Methods and a Consensus Approach to Haplotype Inferral,” 51st Annual Meeting of the American Society of Human Genetics, 2001.
[51] Excoffier, L., and M. Slatkin, “Maximum-Likelihood Estimation of Molecular Haplotype Frequencies in a Diploid Population,” Molecular Biology and Evolution, Vol. 12, No. 5, 1995, pp. 921–927.
[52] Stephens, M., N. J. Smith, and P. Donnelly, “A New Statistical Method for Haplotype Reconstruction from Population Data,” American Journal of Human Genetics, Vol. 68, 2001, pp. 978–989.
[53] Niu, T., et al., “Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms,” American Journal of Human Genetics, Vol. 70, 2002, pp. 157–169.
[54] Mitchell, T. M., Machine Learning, New York: McGraw-Hill, 1997.
11 Analysis of Genomic Alterations in Cancer Benjamin J. Raphael, Stas Volik, and Colin C. Collins
11.1 Introduction

Cancer is driven by selection for mutations that include single nucleotide substitutions, short indels, and large-scale rearrangements of the genome (e.g., chromosomal inversions, translocations, segmental deletions, segmental duplications, and changes in chromosome copy number, i.e., aneuploidy and polyploidy). The frequency of these events varies greatly among tumors. For example, some tumors exhibit a large number of single nucleotide mutations but relatively normal chromosomal organization, while other tumors exhibit extensive chromosomal aberrations and rearrangement. In some types of cancer, these large-scale rearrangements produce changes in gene structure and regulation that are directly implicated in cancer progression and are targets for cancer therapeutics. A classic example is the Philadelphia chromosome [1], a 9;22 translocation observed in chronic myeloid leukemia. This translocation results in the ABL-BCR fusion protein [2] that is targeted by the drug Gleevec [3]. Another example is the chromosome 8;14 translocation in Burkitt’s lymphoma. This translocation activates the c-myc gene by placing it under the control of a strong promoter of an immunoglobulin gene [4]. In contrast to these and other well-characterized translocations in leukemias and lymphomas, solid tumors frequently exhibit many chromosomal aberrations [5]. However, very few aberrations have been found to be recurrent across multiple patients, and thus it was believed that fusion genes like ABL-BCR were nonexistent in solid tumors.
However, more recent analysis has challenged this view [6], and two gene fusions were recently reported whose combined frequency exceeds 50% of tested prostate cancer patients [7]. These results suggest that additional fusion genes remain to be discovered or delineated. During the past few decades, the vast majority of information about large-scale alterations in tumor genomes (e.g., gain and loss of whole chromosomes, translocations, inversions, or large regions of duplication) has resulted from the application of cytogenetic and molecular cytogenetic techniques. These techniques are based on direct visualization of chromosomes and include chromosome banding, multiplex fluorescent in situ hybridization (mFISH) [8], and spectral karyotyping (SKY) [9]. Collectively these techniques have revealed numerous chromosomal aberrations, over 50,000 of which are recorded in the Mitelman database [10]. Despite these data, little is known about the detailed organization of tumor genomes or about the role of genome rearrangements in cancer progression. For example, the relative importance or prevalence of duplications in comparison to translocations and inversions in tumors is not known, and the extent of variation in the frequency of different events across different tumor types is unclear. The reason for this knowledge gap is that molecular cytogenetic techniques for identifying genome rearrangements have limited resolution (on the order of megabases) because they rely on the isolation and analysis of metaphase chromosomes. This means that changes on a smaller scale will not be observed. Moreover, cytogenetic techniques are relatively low-throughput, meaning that detailed studies of many samples are difficult. Also, most of these techniques require the isolation of metaphase chromosomes, which is challenging for some tissues. In the era of genome sequencing, it is apparent that resequencing tumor genomes would provide the ultimate dataset for cancer mutation and rearrangement studies. However, it is presently unrealistic to sequence more than a few tumor genomes in view of the high cost of mammalian genome sequencing. Moreover, in contrast to the sequencing of the human genome, tumor genomes present unique computational and experimental challenges. First, assembly of a tumor genome by whole-genome shotgun sequencing is challenging because the extensive segmental duplications present in many tumors pose a formidable fragment assembly problem. Second, solid tumors are a heterogeneous collection of cells with varying numbers and types of mutations. Thus, if one shotgun sequences DNA extracted from a tumor sample, one is not sequencing a single genome, but a population of different (albeit related) genomes. Despite these obstacles, the availability of a high-quality reference human genome sequence coupled with rapid advances in sequencing technology affords the opportunity for high-resolution sequence-based analysis of tumor genomes. One approach that mitigates the assembly problem is to restrict attention to protein coding regions and sequence only these regions to discover somatic point mutations
important in cancer. Notable efforts in this direction include [11–14] and the recently initiated Cancer Genome Atlas [15]. Here we focus on larger-scale rearrangements, duplications, and deletions. There are currently two sequence-based techniques used to examine these genomic alterations in tumors: comparative genomic hybridization (CGH) to arrayed representations of the human genome, and paired-end sequencing. CGH is restricted to the detection of changes in copy number in the tumor, while paired-end sequencing detects rearrangements, changes in copy number, and point mutations. We note that both of these techniques have also been applied to assess inherited structural polymorphisms in the human genome [16, 17].

11.1.1 Measurement of Copy Number Changes by Array Hybridization
Array comparative genomic hybridization (aCGH) [18] has become a dominant tool for the analysis of copy number changes in cancer. This technique involves the hybridization of differentially fluorescently labeled normal human and tumor DNA fragments to a set of genomic probes derived from normal human DNA. Measurements of tumor-to-normal fluorescence ratios at each probe identify locations in the human genome (Figure 11.1) that are present in higher or lower copy number in the tumor genome than in the normal genome. Comparative genomic hybridization was first developed as a cytogenetic technique for hybridizations to metaphase chromosomes [19], but the use of arrays has steadily improved the resolution of aCGH: early spotted clone arrays [20] have resolution of at most 0.5–1 Mb, but more recent arrays based on overlapping clones [21], PCR products [22], or oligonucleotides [23, 24] offer resolutions approaching 50 kb or less. A major challenge in the interpretation of aCGH data is noise in the hybridization, and a variety of statistical techniques have been developed for this analysis. These techniques rely on the principle that if a duplicated or deleted region is large relative to the genomic spacing between probes on the array, then multiple adjacent probes will record the duplication or deletion; measurements at probes from adjacent locations on the human genome will thus typically be correlated. Statistical methods exploit these correlations to transform the set of noisy probe measurements into contiguous segments of the genome that have normal or altered copy number in the tumor. Methods include change-point models [25], hidden Markov models [26], clustering [27], and a variety of other techniques, several of which are compared in [28]. Array-CGH has become a widespread tool in genomic analysis of cancer and has been used to: (1) identify candidate oncogenes and tumor-suppressor genes; (2) assay tumors for specific well-characterized aberrations, such as amplification of ERBB2; and (3) correlate copy number profiles with prognosis, recurrence, or response to treatment. Pinkel and Albertson [29] review these applications of aCGH in cancer.
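To make the segmentation step concrete, the following minimal sketch (our own illustration, not any of the published methods [25–28]; the function name, window size, and threshold are invented) median-smooths noisy log2 tumor/normal probe ratios and merges runs of adjacent probes into gain, loss, or neutral segments:

def segment_ratios(positions, log_ratios, window=5, threshold=0.3):
    # Median smoothing exploits the correlation among adjacent probes.
    n = len(log_ratios)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - window // 2), min(n, i + window // 2 + 1)
        smoothed.append(sorted(log_ratios[lo:hi])[(hi - lo) // 2])
    # Classify each probe, then merge adjacent probes sharing a state.
    state = ["gain" if s > threshold else "loss" if s < -threshold else "neutral"
             for s in smoothed]
    segments, start = [], 0
    for i in range(1, n + 1):
        if i == n or state[i] != state[start]:
            segments.append((positions[start], positions[i - 1], state[start]))
            start = i
    return segments

Each returned triple gives the genomic start, end, and inferred copy number state of one segment; the published change-point and hidden Markov model approaches replace this crude smoothing and thresholding with statistically principled estimates.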
Figure 11.1 In array-CGH (aCGH), normal and tumor DNA (ideally from the same patient) are differentially labeled and hybridized to an array of genomic probes spaced across the reference human genome. The relative intensity measured at a probe indicates the copy number of the region in the tumor genome. Statistical methods are employed to filter the resulting copy number profile into duplicated segments, deleted segments, or segments with no change in copy number.
Array-CGH has some limitations as a tool for tumor genome analysis. First, aCGH does not detect rearrangements that have no effect on copy number including inversions and reciprocal (balanced) translocations. Second, although aCGH will reveal regions of the genome that are duplicated in a tumor, aCGH gives little information about the organization and locations of duplicated material within the tumor genome. For example, an aCGH experiment will not reveal whether two duplicated regions are located close together (or even on the same chromosome) in the tumor genome. In some tumor genomes, duplicated material from several disparate regions of the human genome is colocalized [30–32], and this knowledge is potentially useful for understanding altered regulation of genes in tumors. Finally, aCGH is impeded by genomic heterogeneity in the sample, arising either from contamination of tumor samples with normal (unmutated) cells or from heterogeneity in the alterations found within different cells of the tumor.
11.1.2 Measurement of Genome Rearrangements by End Sequence Profiling
Sequencing of tumor genomes overcomes some of the limitations of CGH, but as mentioned earlier, high-coverage shotgun sequencing of a large number of tumors is not yet practical. An approach called end-sequence profiling (ESP) [31] has proven to be effective for genome-wide analysis of rearrangements in tumor cells. ESP provides a balance between imprecise but inexpensive cytogenetic technologies and very precise but expensive full-genome sequencing. ESP involves the sequencing of paired ends of tumor genome fragments and the mapping of these ends to the reference human genome sequence [Figure 11.2(a)]. ESP is able to reveal all types of rearrangements present in a tumor, including inversions, translocations, transpositions, duplications, and deletions. ESP gives an at least an order of magnitude more accurate representation of the tumor genome than cytogenetic techniques like SKY and, in addition, yields detailed information about the organization of the tumor genome that is lacking in CGH. Moreover, ESP is less impeded by heterogeneity in the sample than CGH, since an end-sequenced fragment arises from a distinct piece of DNA from an individual tumor cell. It is therefore possible to overcome problems with heterogeneous samples or contamination by normal cell admixture by sequencing additional clones. Ultimately, the resolution of ESP is limited only by the number of clones sequenced and the size of each clone.
[Figure 11.2(a) panel annotations: (1) cut tumor genome into overlapping fragments; (2) sequence tags from the ends of each fragment; (3) map end sequences to the human genome. Panel (b) plots ES pairs as points (x, y) over genome coordinates ranging from 0 to 3 × 10^9.]
Figure 11.2 (a) In the end sequence profiling (ESP) technique, short tag sequences from the ends of fragments of the tumor genome are mapped to the human genome. Each mapped fragment is associated with a pair (x, y) of locations in the human genome. (b) The data from an ESP experiment consists of a set of ES pairs (x1, y1), ..., (xn, yn) represented as points in a two-dimensional plot. Typically, the distance between elements of an ES pair will approximately equal the length of a fragment (points near diagonal). However, since the tumor genome is a rearranged version of the human genome, there will also be a number of invalid ES pairs whose ends map far apart (points off diagonal). The goal is to reconstruct the organization of the tumor genome from the ES pairs and to find a plausible sequence of rearrangements that transform the human genome into the tumor genome.
ESP was first applied to a comprehensive study of the MCF7 breast cancer cell line [31, 32] and later to additional cell lines and primary tumors [33]. The following methodology was used. First, a bacterial artificial chromosome (BAC) library was constructed from the MCF7 cell line. That is, DNA from the MCF7 cell line was split into small fragments varying in size from 80 to 250 kb, and these pieces of DNA were cloned into BACs.1 Second, the ends (≈500 bp) of each BAC were sequenced. Third, the resulting end sequences were mapped to the reference human genome. Only BACs with both end sequences mapping uniquely to the human genome were retained for further analysis. Each such BAC corresponds to a pair (x, y) of locations in the human genome where the end sequences map. In addition, since an end sequence may map to either DNA strand, each mapped end has a sign (+ or −) to indicate the strand. We call such a signed pair an end sequence pair (ES pair). Thus, the data from an ESP experiment consists of a set of ES pairs (x1, y1), ..., (xN, yN) [Figure 11.2(b)]. Typically, the distance between elements of an ES pair will equal the length L of a BAC clone (e.g., 80–250 kb), and the ends will have opposite, convergent orientations [i.e., an ES pair of the form (+x, −(x + L))]. We call such ES pairs valid pairs. However, since the tumor genome is a rearranged version of the human genome, there will also be a number of invalid pairs whose ends map far apart, or have the wrong orientation, or both. The valid and invalid pairs reveal information about the organization of the tumor genome. In particular, invalid pairs indicate distant regions of the human genome that are fused in the tumor, possibly revealing novel fusion genes [32]. However, in highly rearranged tumor genomes like MCF7, the complicated patterns of invalid pairs defy simple explanation and require the development of computational methods for analysis.
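As a small illustration of the conventions just described, the sketch below (our own; the field layout is invented, and the 80–250-kb range is the BAC size range quoted above) classifies a single ES pair as valid or invalid:

def classify_es_pair(end1, end2, min_size=80_000, max_size=250_000):
    # Each end is a (chromosome, coordinate, strand) triple, strand '+' or '-'.
    (chrom1, pos1, strand1), (chrom2, pos2, strand2) = sorted(
        [end1, end2], key=lambda e: (e[0], e[1]))
    if chrom1 != chrom2:
        return "invalid: interchromosomal"
    if (strand1, strand2) != ("+", "-"):   # valid pairs converge: (+x, -(x+L))
        return "invalid: wrong orientation"
    if not (min_size <= pos2 - pos1 <= max_size):
        return "invalid: wrong distance"
    return "valid"

For example, classify_es_pair(("chr1", 1_000_000, "+"), ("chr1", 1_150_000, "-")) returns "valid", while ends mapping to different chromosomes, in the wrong orientation, or at the wrong separation are flagged as invalid and hint at a rearrangement.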
11.2 Analysis of ESP Data

The first step in the analysis of ESP data is to map the end sequences to the reference human genome sequence. For end sequences of 500 bp, this step is easily accomplished using tools like MegaBlast [34] or BLAT [35]. The main challenge results from repeats and duplications in the human genome, which lead to nonunique mappings for end sequences. Clones with nonunique mappings can be removed from consideration, as the genomic region that they contain will likely be covered by other clones, assuming that the clone library is sufficiently large. Completion of end sequence mapping gives a set of ES pairs.

1. Note that the ESP methodology is flexible in terms of the cloning vector that is used. Fosmids with insert sizes of ≈40 kb or plasmids with insert sizes of ≈2 kb can give increased resolution of the tumor genome, at the cost of requiring a larger number of clones than with BACs.
The second step is to cluster the ES pairs to overcome experimental errors and to identify clones spanning the same rearrangement. The primary source of experimental errors is chimeric BACs in the library. Chimeric BACs are produced by the joining of two noncontiguous regions of DNA, and thus chimeric BACs will also correspond to invalid ES pairs. However, chimeric BACs are artifacts, rather than signs of real rearrangements. When an ESP project includes a sufficiently large number of BAC clones, chimeric BACs are easily distinguished from real rearrangements because the breakpoints of a rearrangement will likely be covered by two or more BACs. In contrast, because chimeric BACs typically combine two "random" segments of DNA, different chimeric BACs will rarely have ends mapped in close proximity. Thus, we define an ES cluster as a set of ES pairs whose entries are close enough that all ES pairs in the set could be explained by a single rearrangement event. That is, we say that ES pairs (x1, y1), ..., (xn, yn) form an ES cluster if there exist locations a and b such that

l ≤ sign(xi)(a − xi) + sign(yi)(b − yi) ≤ L,  for i = 1, ..., n,

where l and L are the minimum and maximum clone sizes, respectively. ES pairs arising from chimeric BACs are extremely unlikely to be members of ES clusters. Having obtained a set of ES clusters, we can identify putative rearrangements in the tumor, including inversions, translocations, and duplications (Figure 11.3). However, some ES clusters do not result from a single rearrangement of the human genome, but from multiple overlapping rearrangements [Figure 11.3(d)]. To analyze these overlapping rearrangements, we apply methods from comparative genomics to derive a putative tumor genome sequence and analyze the genome rearrangements that transform the normal human genome into the tumor genome. In an early attempt to reconstruct tumor genomes from ESP data [36], we developed a computational approach with a few simplifying assumptions: (1) a sequence of inversions, translocations, chromosomal fissions, and chromosomal fusions generates the tumor genome from the normal human genome; (2) no duplications occurred in the tumor genome; and (3) each BAC clone contains at most one rearrangement breakpoint. These assumptions allowed us to use a theoretical framework originally developed to study genome rearrangements that occur during species evolution. In this framework, both the human and tumor genomes are represented by integer permutations, and the problem is to find a minimal sequence of rearrangement operations that transforms one permutation into the other. Under the assumption that only inversions, translocations, fissions, and fusions occur, such a minimal sequence can be computed efficiently, specifically in time that is polynomial in the length of the permutation [37].
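A brute-force sketch of this cluster test follows (our own illustration; a practical implementation would solve the linear constraints on (a, b) exactly rather than scanning a coarse grid, which can miss narrow feasible regions):

def is_es_cluster(pairs, l=80_000, L=250_000, step=1_000):
    # pairs: list of ((x, sx), (y, sy)) with signs sx, sy in {+1, -1}.
    xs = [x for (x, _), _ in pairs]
    ys = [y for _, (y, _) in pairs]
    # Scan candidate breakpoint locations (a, b) near the mapped ends.
    for a in range(min(xs) - L, max(xs) + L + 1, step):
        for b in range(min(ys) - L, max(ys) + L + 1, step):
            if all(l <= sx * (a - x) + sy * (b - y) <= L
                   for (x, sx), (y, sy) in pairs):
                return True   # one (a, b) explains every pair in the set
    return False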
Figure 11.3 Locations and orientation of end sequence (ES) pairs suggest rearrangement events in the tumor genome including: (a) inversions on a single chromosome, (b) translocations between two chromosomes, (c) duplications or transpositions, or (d) compound events suggesting multiple rearrangements. Here, arrows indicate the locations and orientation of mapped end sequences, and arcs join end sequences (ES) that form an ES pair. Each of these events transforms the indicated invalid ES pair by rearranging the labeled segments of the genome.
We applied this permutation-based approach to ESP data from the MCF7 breast cancer cell line and derived a putative reconstruction of the MCF7 genome (Figure 6 in [36]). This study produced the first high-resolution reconstruction of a tumor genome and directed further BAC sequencing experiments. Of course, duplications and deletions are quite common in tumor genomes, as shown by numerous CGH studies. The availability of ESP data allows us to study the organization of duplicated regions in tumors. In the MCF7 ESP data, we observed a complex pattern of ES pairs that suggested a process of overlapping rearrangements and duplications (see Figure 3b in [38]). We developed a computational technique to analyze duplications in this data using a model based on the biological process of duplication by amplisome2 [38]. Amplisomes are essentially minimal units of duplication that replicate extrachromosomally and can reintegrate into chromosomes [39]. Our method reconstructs the amplisome sequence that explains the ESP data by finding the shortest path in a graph derived from the ES pairs. Using our method, we reconstructed a putative amplisome for the MCF7 genome (see Figure 5 in [38]).

2. Also referred to as episomes or double minutes, depending on their size.
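Both reconstructions rest on the signed-permutation representation of genomes [37]. As a toy illustration of that representation (ours, not the chapter's algorithm), the sketch below counts breakpoints — adjacencies present in the tumor permutation but not in the reference order — a standard quantity that lower-bounds the number of rearrangement operations required:

def count_breakpoints(perm):
    # perm: tumor genome as a signed permutation of reference segments,
    # e.g. [1, -3, -2, 4]; frame it with 0 and n + 1 before scanning.
    extended = [0] + perm + [len(perm) + 1]
    return sum(1 for a, b in zip(extended, extended[1:]) if b - a != 1)

Here count_breakpoints([1, -3, -2, 4]) returns 2, consistent with a single inversion of the block (-3, -2) restoring the reference order 1, 2, 3, 4: each inversion can remove at most two breakpoints.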
While amplisomes have been observed in vitro and in vivo, they are only one mechanism by which tumor genomes evolve their new organization. A process called the breakage/fusion/bridge cycle also yields duplications at the ends of damaged chromosomes [40], and there is evidence that this process may be active in human solid tumors [32, 33, 41]. The precise mechanisms that produce duplications in human tumors are not completely understood, and it is not known whether a tumor preferentially uses one mechanism of duplication or combines multiple mechanisms. ESP analysis of tumor genomes can help resolve these mysteries, particularly in concert with the development of computational models of additional rearrangement and duplication mechanisms.
11.3 Combination of Techniques

It is likely that no single sequence-based technique is optimal for analyzing all types of alterations in every tumor genome. However, different methods of analyzing tumor genomes should ideally produce concordant results. We recently compared ESP and aCGH data for MCF7 and discovered that there was significant overlap between the genomic locations where aCGH identifies a change in copy number and the locations where ESP identifies rearrangements [33]. Of course, agreement between the two techniques is not expected to be perfect, both because of experimental noise and because aCGH cannot measure certain types of rearrangements. We note that even in the case of structural polymorphisms, there are discrepancies between different techniques [42]. Robust statistical methods to integrate measurements from different experimental techniques are needed.
11.4 Future Directions

An important consideration in the analysis of tumor genomes is the genomic heterogeneity found in a tumor. In principle, sequencing approaches like ESP have an advantage over aCGH in this regard. In the creation of a BAC (or other clone) library, DNA from multiple cells is pooled, but an individual clone does measure a rearrangement in a single cell, in contrast to aCGH, which averages over the cell population. ES pairs from rearrangement variants in different cells will be mixed together in the ESP data, but rearrangements that are common to all or most cells in a tumor population are more likely to be cloned and sequenced than sporadic rearrangements. Thus, ESP is biased toward finding these common, and potentially early, rearrangement events in the development of the tumor. At the same time, deep sequencing via ESP can reveal rare events present in a subpopulation of tumor cells.
A key question is to determine the number of clones necessary to analyze populations of tumor cells with different amounts of heterogeneity; mathematical analyses, simulation studies, and pilot sequencing projects are needed to address this question. There is a parallel between sequencing a tumor and the "community sequencing" or metagenomics approaches that simultaneously sequence an environmental sample containing a mixture of organisms [43, 44]. Community sequencing identifies rare organisms in a heterogeneous mixture with deep sequencing, just as ESP identifies rare rearrangements in tumors with more sequenced clones. There is room for cross-fertilization of ideas between these two approaches.

DNA sequencing technology continues to fall in cost and improve in efficiency. Steady improvements in current technologies will make large-scale ESP studies more common. Moreover, other ESP-like strategies have been proposed, including a paired-end sequencing technique that improves efficiency by concatenating multiple short paired-end tags into a single read in a SAGE-like approach, yielding an order of magnitude more paired ends for the same number of sequenced reads [45]. Presently, the 18-bp end sequences produced by this technique are too short for tumor-genome rearrangement studies because too few of these short tags can be uniquely identified in the human genome. However, with slightly longer end sequences (e.g., 22–25 bp), enough will map uniquely to the human reference sequence to undertake effective ESP studies. The greatest promise in the near term lies in the application of the new generation of short-read sequencers that will be able to produce a significantly larger number of short end-sequence pairs in a cost-effective manner [46].

Such large-scale sequencing efforts and the future development of single-cell sequencing techniques will give an unprecedented catalog of tumor mutations, including both point mutations and large-scale alterations. Eventually, complete mutational analysis of tumors will become feasible. There will be a great demand for bioinformatic techniques to uncover sets of recurrent mutations and to define similarity between highly mutated tumor samples. Similarity might mean more than possessing specific mutations or mutated genes in common; for example, different tumors might share similar mutated pathways. Ultimately, the knowledge gained from tumor genome studies will be used not only to discover gene targets for diagnostics and therapeutics, but also to better understand the temporal and population dynamics of the mutational process of tumor development.
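As one starting point for the clone-number question raised at the beginning of this section, consider a back-of-the-envelope binomial model (entirely our own illustration, not a method from this chapter): if a breakpoint is carried by a fraction f of tumor cells, and a random clone spans any fixed breakpoint with probability p (roughly clone length divided by genome length), the chance that at least k of n end-sequenced clones cover it follows a binomial tail:

from math import comb

def detection_probability(n_clones, f, p, min_clones=2):
    # q: probability a single clone comes from a carrier cell AND spans the site.
    q = f * p
    missed = sum(comb(n_clones, k) * q**k * (1 - q)**(n_clones - k)
                 for k in range(min_clones))
    return 1 - missed

For 150-kb clones on a 3-Gb genome, p is about 5 × 10^-5, so covering a clonal breakpoint (f = 1) with at least two of 100,000 clones is likely, while a rare subclone (f = 0.05) requires far deeper sampling — exactly the trade-off that simulation studies would need to quantify.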
References

[1] Rowley, J. D., "Letter: A New Consistent Chromosomal Abnormality in Chronic Myelogenous Leukaemia Identified by Quinacrine Fluorescence and Giemsa Staining," Nature, Vol. 243, 1973, pp. 290–293.
[2] Heisterkamp, N., et al., "Localization of the c-abl Oncogene Adjacent to a Translocation Break Point in Chronic Myelocytic Leukaemia," Nature, Vol. 306, 1983, pp. 239–242.
[3] Druker, B. J., et al., "Efficacy and Safety of a Specific Inhibitor of the BCR-ABL Tyrosine Kinase in Chronic Myeloid Leukemia," N. Engl. J. Med., Vol. 344, 2001, pp. 1031–1037.
[4] Croce, C. M., et al., "Molecular Genetics of Human B- and T-Cell Neoplasia," Cold Spring Harb. Symp. Quant. Biol., Vol. 51, Pt. 2, 1986, pp. 891–898.
[5] Albertson, D. G., et al., "Chromosome Aberrations in Solid Tumors," Nat. Genet., Vol. 34, 2003, pp. 369–376.
[6] Mitelman, F., B. Johansson, and F. Mertens, "Fusion Genes and Rearranged Genes as a Linear Function of Chromosome Aberrations in Cancer," Nat. Genet., Vol. 36, 2004, pp. 331–334.
[7] Tomlins, S. A., et al., "Recurrent Fusion of TMPRSS2 and ETS Transcription Factor Genes in Prostate Cancer," Science, Vol. 310, 2005, pp. 644–648.
[8] Speicher, M. R., S. Gwyn Ballard, and D. C. Ward, "Karyotyping Human Chromosomes by Combinatorial Multi-Fluor FISH," Nat. Genet., Vol. 12, 1996, pp. 368–375.
[9] Schrock, E., et al., "Multicolor Spectral Karyotyping of Human Chromosomes," Science, Vol. 273, 1996, pp. 494–497.
[10] Mitelman, F., B. Johansson, and F. Mertens, Mitelman Database of Chromosome Aberrations in Cancer, 2006, http://cgap.nci.nih.gov/chromosomes/mitelman.
[11] Ley, T. J., et al., "A Pilot Study of High-Throughput, Sequence-Based Mutational Profiling of Primary Human Acute Myeloid Leukemia Cell Genomes," Proc. Natl. Acad. Sci. USA, Vol. 100, 2003, pp. 14275–14280.
[12] Stephens, P., et al., "A Screen of the Complete Protein Kinase Gene Family Identifies Diverse Patterns of Somatic Mutations in Human Breast Cancer," Nat. Genet., Vol. 37, 2005, pp. 590–592.
[13] Sjoblom, T., et al., "The Consensus Coding Sequences of Human Breast and Colorectal Cancers," Science, Vol. 314, 2006, pp. 268–274.
[14] Greenman, C., et al., "Patterns of Somatic Mutation in Human Cancer Genomes," Nature, Vol. 446, No. 7132, March 8, 2007, pp. 153–158.
[15] Kaiser, J., "National Institutes of Health: NCI Gears Up for Cancer Genome Project," Science, Vol. 307, 2005, p. 1182.
[16] Feuk, L., A. R. Carson, and S. W. Scherer, "Structural Variation in the Human Genome," Nat. Rev. Genet., Vol. 7, 2006, pp. 85–97.
[17] Tuzun, E., et al., "Fine-Scale Structural Variation of the Human Genome," Nat. Genet., Vol. 37, 2005, pp. 727–732.
[18] Pinkel, D., et al., "High Resolution Analysis of DNA Copy Number Variation Using Comparative Genomic Hybridization to Microarrays," Nat. Genet., Vol. 20, 1998, pp. 207–211.
[19] Kallioniemi, A., et al., "Comparative Genomic Hybridization for Molecular Cytogenetic Analysis of Solid Tumors," Science, Vol. 258, 1992, pp. 818–821.
[20] Pollack, J. R., et al., "Genome-Wide Analysis of DNA Copy-Number Changes Using cDNA Microarrays," Nat. Genet., Vol. 23, 1999, pp. 41–46.
[21] Ishkanian, A. S., et al., "A Tiling Resolution DNA Microarray with Complete Coverage of the Human Genome," Nat. Genet., Vol. 36, 2004, pp. 299–303.
[22] Dhami, P., et al., "Exon Array CGH: Detection of Copy-Number Changes at the Resolution of Individual Exons in the Human Genome," Am. J. Hum. Genet., Vol. 76, 2005, pp. 750–762.
[23] Lucito, R., et al., "Representational Oligonucleotide Microarray Analysis: A High-Resolution Method to Detect Genome Copy Number Variation," Genome Res., Vol. 13, 2003, pp. 2291–2305.
[24] Barrett, M. T., et al., "Comparative Genomic Hybridization Using Oligonucleotide Microarrays and Total Genomic DNA," Proc. Natl. Acad. Sci. USA, Vol. 101, 2004, pp. 17765–17770.
[25] Olshen, A. B., et al., "Circular Binary Segmentation for the Analysis of Array-Based DNA Copy Number Data," Biostatistics, Vol. 5, October 2004, pp. 557–572.
[26] Fridlyand, J., et al., "Hidden Markov Models Approach to the Analysis of Array CGH Data," Journal of Multivariate Analysis, Vol. 90, 2004, pp. 132–153.
[27] Wang, P., et al., "A Method for Calling Gains and Losses in Array CGH Data," Biostatistics, Vol. 6, 2005, pp. 45–58.
[28] Lai, W. R., et al., "Comparative Analysis of Algorithms for Identifying Amplifications and Deletions in Array CGH Data," Bioinformatics, Vol. 21, 2005, pp. 3763–3770.
[29] Pinkel, D., and D. G. Albertson, "Array Comparative Genomic Hybridization and Its Applications in Cancer," Nat. Genet., Vol. 37, Suppl., 2005, pp. S11–S17.
[30] Guan, X. Y., et al., "Identification of Cryptic Sites of DNA Sequence Amplification in Human Breast Cancer by Chromosome Microdissection," Nat. Genet., Vol. 8, 1994, pp. 155–161.
[31] Volik, S., et al., "End-Sequence Profiling: Sequence-Based Analysis of Aberrant Genomes," Proc. Natl. Acad. Sci. USA, Vol. 100, 2003, pp. 7696–7701.
[32] Volik, S., et al., "Decoding the Fine-Scale Structure of a Breast Cancer Genome and Transcriptome," Genome Res., Vol. 16, 2006, pp. 394–404.
[33] Raphael, B. J., et al., "A Sequence Based Survey of the Complex Structural Organization of Tumor Genomes," unpublished document.
[34] Zhang, Z., et al., "A Greedy Algorithm for Aligning DNA Sequences," J. Comput. Biol., Vol. 7, 2000, pp. 203–214.
[35] Kent, W. J., "BLAT—The BLAST-Like Alignment Tool," Genome Res., Vol. 12, 2002, pp. 656–664.
[36] Raphael, B. J., et al., "Reconstructing Tumor Genome Architectures," Bioinformatics, Vol. 19, Suppl. 2, 2003, pp. II162–II171.
[37] Pevzner, P., Computational Molecular Biology: An Algorithmic Approach, Cambridge, MA: MIT Press, 2000.
[38] Raphael, B. J., and P. A. Pevzner, "Reconstructing Tumor Amplisomes," Bioinformatics, Vol. 20, Suppl. 1, 2004, pp. I265–I273.
[39] Windle, B. E., and G. M. Wahl, "Molecular Dissection of Mammalian Gene Amplification: New Mechanistic Insights Revealed by Analyses of Very Early Events," Mutat. Res., Vol. 276, 1992, pp. 199–224.
[40] McClintock, B., "The Stability of Broken Ends of Chromosomes in Zea Mays," Genetics, Vol. 26, 1941, pp. 234–282.
[41] Chin, K., et al., "In Situ Analysis of Genome Instability in Breast Cancer," Nat. Genet., Vol. 36, 2004, pp. 984–988.
[42] Eichler, E. E., "Widening the Spectrum of Human Genetic Variation," Nat. Genet., Vol. 38, 2006, pp. 9–11.
[43] Venter, J. C., et al., "Environmental Genome Shotgun Sequencing of the Sargasso Sea," Science, Vol. 304, 2004, pp. 66–74.
[44] Tyson, G. W., et al., "Community Structure and Metabolism Through Reconstruction of Microbial Genomes from the Environment," Nature, Vol. 428, 2004, pp. 37–43.
[45] Ng, P., et al., "Gene Identification Signature (GIS) Analysis for Transcriptome Characterization and Genome Annotation," Nat. Methods, Vol. 2, 2005, pp. 105–111.
[46] Bentley, D. R., "Whole Genome Re-Sequencing," Curr. Opin. Genet. Dev., Vol. 16, 2006, pp. 545–552.
12

High-Throughput Assessments of Epigenomics in Human Disease

Curt Balch, Tim H.-M. Huang, and Kenneth P. Nephew
12.1 Introduction In 1942, Waddington coined the term epigenetics to describe the development of phenotype from genotype [1, 2]. That term evolved to describe all heritable gene regulatory events distinct from primary DNA sequence [3], and the field of epigenetics now encompasses DNA methylation, covalent modifications of histones, nucleosome-DNA interactions, and most recently, small inhibitory RNA molecules [4]. While the Human Genome Project has generated a wealth of information regarding gene discovery, repeat elements, and possible transcription factor-binding sites, the detailed DNA-external events that regulate expression of coding sequences remain largely unstudied. Consequently, a “Human Epigenome Project” has been initiated in Europe to delineate all methylation sites throughout the entire DNA complement of a human cell; that effort has been proposed for expansion into a comprehensive international undertaking that would also include other chromatin modifications [5]. Such an epigenetic “map” would allow for a greater understanding of the role of nongenetic modifications in mediating both normal and abnormal phenotypes, by exerting specific effects on gene expression patterns.
12.2 Epigenetic Phenomena That Regulate Gene Expression

12.2.1 Methylation of Deoxycytosine
To date, the best-studied epigenetic modification is methylation of the pyrimidine cytosine, within the dinucleotide CpG, resulting in 5-methylcytosine (5-mC), often referred to as the "fifth base" present in DNA. In normal cells, 5-mC is widely believed to act to "silence" expression and/or retrotransposition of parasitic repeat sequences, such as Alu repeats and long-interspersed elements (LINEs) [6]. Although 5-mC comprises approximately 1% of the genome, it is underrepresented due to spontaneous deamination to thymine, which is not recognized by mismatch repair systems [7]. However, distinct regions of 5-mC, located within specific CG-rich sequences known as "CpG islands," are often found unmethylated and associated with active transcriptional units [8]. DNA methylation is firmly associated with mammalian development, as evidenced by the vast alterations in methylation patterns that occur during embryogenesis and lineage commitment [9]. In likely connection with its role in normal development, aberrations in DNA methylation patterns are associated with a number of differentiation-related disorders, including cancer.
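Because CpG islands are defined operationally by sequence statistics, candidate islands can be flagged with a few lines of code. The sketch below uses the commonly cited Gardiner-Garden and Frommer thresholds (length ≥ 200 bp, GC content > 50%, observed/expected CpG > 0.6), which come from the wider literature rather than from this chapter:

def is_cpg_island(seq):
    # Classic criteria: >= 200 bp, GC > 50%, observed/expected CpG > 0.6.
    seq = seq.upper()
    n = len(seq)
    if n < 200:
        return False
    c, g = seq.count("C"), seq.count("G")
    observed_cpg = seq.count("CG")   # CpG dinucleotide count
    expected_cpg = c * g / n         # expectation if C and G were independent
    if expected_cpg == 0:
        return False
    return (c + g) / n > 0.5 and observed_cpg / expected_cpg > 0.6

In practice such a test is slid along the genome in overlapping windows; the promoter-associated islands that pass it are the regions whose aberrant methylation is discussed below.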
12.2.2 Histone Modifications and Nucleosome Remodeling

While cytosine methylation is currently the best-studied epigenetic modification, other well-known chromatin alterations include histone modifications and nucleosome repositioning. Histone modifications include acetylation, phosphorylation, ubiquitination, and methylation, primarily within the N-terminal "tail" regions that extend from the nucleosome core octamer [10]. The sum total of these covalent histone alterations has been referred to as the "histone code," which can be "written" by the various modifying enzymes and "read" by various binding proteins that act to further modify chromatin and/or alter gene expression [11, 12]. Even more so than DNA methylation, histone methylation has been strongly associated with differentiation and lineage commitment. In particular, methylation of histone H3 lysine 27, a repressive histone code "mark," is associated with a family of transcriptional repressors known as the Polycomb group (PcG), which are believed to establish and maintain pluri- or multipotency in embryonic/tissue stem cells [13–15]. Conversely, Trithorax group (TrxG) proteins are linked to methylation of histone H3 lysine 4, a transcriptionally permissive mark; TrxG members thus act to directly oppose PcG actions and, as might be expected, are affiliated with differentiation and lineage commitment [16, 17]. In addition to modification of individual histones, "variants" of whole histone proteins are often found associated with DNA in assorted states of transcriptional activity [18, 19]. For example, histone variant
H3.3 is a "mark" of active chromatin [20], while phosphorylated histone H2AX is intricately linked to sites of DNA damage and can actually serve as a marker for that state of cellular injury [21, 22]. Consequently, global or local examination of histone variants, and their associated DNA sequences, could possibly provide insight into epigenetic responses to various states of cell physiology. Similar to variant histone proteins and posttranslational modifications, chromatin remodeling is an additional epigenetic phenomenon associated with gene regulation and, possibly, DNA repair [23, 24]. Early structural studies of chromatin revealed that 146 base pairs (bp) of DNA are wrapped around a core nucleosome consisting of an octamer of histones, with 50 bp of "linker" DNA between repeating octamers [25]. This "beads-on-a-string" model, however, is a basic structure that becomes further compacted into higher-order heterochromatic conformations [26]; consequently, nucleosome-DNA interactions must necessarily be disrupted during gene transcription [27]. To allow access of the transcriptional machinery to coding sequences, ATP-dependent chromatin-remodeling complexes, originally discovered in yeast, promote repositioning of nucleosomes [23]. These complexes are theorized to effect DNA "looping" to relieve nucleosomal interactions and thus allow promoter access by the transcription machinery [28]. Other chromatin-modulating complexes include the nucleosome remodeling and deacetylation (NuRD) multimer and the chromatin accessibility complex (CHRAC) [23, 29]. Thus, histone modifications and nucleosome architecture play critical roles in natural and aberrant physiological processes, and consequently, technologies are rapidly being developed to globally monitor these phenomena across the entire human genome.
12.2.3 Small Inhibitory RNA Molecules
In addition to direct chromatin modification, another more recently discovered epigenetic gene regulatory process is that mediated by small inhibitory RNA molecules known as microRNAs (miRNAs). To date, over 220 miRNAs have been discovered in humans, although the actual number is likely greater than 800 [30]. These inhibitory RNAs are generally transcribed as polycistrons and subsequently processed into "stem-loop" precursor structures that are exported from the nucleus. In the cytoplasm, miRNAs are further processed to mature 18–25-nucleotide forms, allowing them to associate with a multiunit complex linked to the degradation and/or translational inhibition of complementary mRNA transcripts [31]. It is now speculated, however, that in addition to direct effects on mRNAs, microRNA/protein complexes can reenter the nucleus and, by a poorly understood process, actually direct methylation machinery to specific DNA sequences [32].
12.3 Epigenetics and Disease

12.3.1 Epigenetics and Developmental and Neurological Diseases
Aberrant DNA methylation has now been associated with numerous human diseases, likely in association with its intricate role in development. Such a role has particularly been suggested for psychiatric diseases, shown to differentially afflict monozygotic (MZ) twins (who are genetically identical) [33]. Other developmental diseases associated with DNA methylation include ICF (immunodeficiency, centromeric region instability, and facial anomalies) syndrome, which often results in defective lymphogenesis [34], and the neurodevelopmental disorders Prader-Willi and Angelman syndromes [35]. In fact, epigenetics is now strongly believed to play some role in the majority of noninfectious diseases, as evidenced by differential DNA methylation and histone acetylation patterns in MZ twins [36] (i.e., "epigenetic drift") and the delayed (adult) onset of inherited genetic diseases, such as familial amyotrophic lateral sclerosis and Huntington's chorea [37]. Other disorders, including Alzheimer's disease, Down's syndrome, coronary artery disease, and a developmental syndrome known as neural tube defect, have been associated with deficiencies in the metabolism of folate, a dietary precursor of S-adenosylmethionine, which provides the donor methyl group for cytosine methylation [38].

12.3.2 Epigenetics and Cancer
Epigenetic alterations are now well known in tumorigenesis. Specifically, tumors exhibit overall hypomethylation (primarily in repeat elements and pericentromeric regions) and hypermethylation of normally unmethylated CpG islands [39, 40]. It is also now established that CpG methylation within the promoters of specific tumor suppressor genes is associated with transcriptional silencing of those genes and diminished control of cell proliferation [41, 42]. Consequently, aberrant methylation patterns are now being investigated as potential biomarkers and pathway-specific therapeutic targets [43]. Similar to DNA methylation, deviant histone modifications have also been associated with a number of pathologies, including cancer. Various well-known leukemia fusion proteins, including several partnered with the retinoic acid receptor, aberrantly recruit histone deacetylases, resulting in atypical gene repression [44, 45], and, in general, one characteristic of tumors is hypoacetylation of histones [46]. Additionally, histone acetyltransferases have been reported as up- or downregulated in a number of malignancies [47]. Although presently less studied, chromatin-remodeling enzymes and miRNAs have also been associated with numerous diseases. In particular, specific subunits of one remodeling complex, SWI/SNF, have been discovered to act as tumor suppressors for a number of malignancies [48, 49], while, conversely, components of the repressive
chromatin complex NuRD have been associated with tumor invasion and metastasis [48, 50]. Despite being discovered only within the past twelve years, miRNAs are now also recognized as widely involved in normal development and disease [31]. A specific class of miRNAs associated with malignant disease has been coined oncomirs (for oncogenic miRNAs); oncomirs appear to be capable of functioning as either tumor suppressors or tumor promoters [51]. Similarly, as numerous miRNAs have been demonstrated to regulate differentiation, miRNAs are likely associated with several developmental disorders [52].
12.4 High-Throughput Analyses of Epigenetic Phenomena

As epigenetic chromatin modifications, miRNAs, and regulatory enzymes are now implicated in numerous disease processes, several high-throughput approaches have been developed for their analysis. While many epigenetic studies are still performed at the single-gene level, emerging global approaches now allow comprehensive analyses of the entire "epigenome" and its alterations that contribute to pathological conditions (Figure 12.1). For one such alteration, DNA methylation, numerous assessment methods are currently in use. Many of these rely on the use of methylation-sensitive restriction enzymes, such as SmaI or HpaII (Table 12.1); consequently, a limiting factor for such assays is bias toward the cleavage sites of those nucleases. More recently, mass spectrometry has been used to map methylation of specific gene promoters, although mass limitations prevent genome-wide analyses. Techniques commonly used for the comprehensive analysis of histone modifications are summarized in Table 12.2. Protein complexes or posttranslational modifications can be assessed by mass spectrometry, while DNA sequences associated with such modifications are now analyzed by a number of microarray and cloning methods. Most of these approaches rely on a process known as chromatin immunoprecipitation (ChIP), in which antibodies specific to distinct epigenetic modifications/enzymes are used to "pull down" DNA associated with those particular chromatin components [53]. Thus, such analyses are limited by both the sensitivity and the specificity of the antibody of interest. High-throughput, multiplex sequencing has also been employed for genome-wide study of DNA associated with a number of specific epigenetic events [54, 55].

12.4.1 Gel-Based Approaches
One well-established method for globally examining DNA methylation is known as restriction landmark genome scanning (RLGS).
[Figure 12.1: panels depict DNA methylation (assessed by microarray approaches such as DMH, MIAMI, MIRA, and MSO; MeDIP/McIP; RLGS; MSDK; mass spectrometry; MCA/RDA; pyrosequencing; and massive parallel sequencing); DNA associated with histone modifications (ChIP-on-chip, ChIP-cloning, ChIP-PET, SACO, GMAT); histone modifications/variants (mass spectrometry); microRNA expression/discovery (custom microarrays, northern blot, SAGE, massive parallel sequencing); and nucleosome positioning/remodeling complexes (nuclease protection assays, DNase-chip, DNase-massive parallel sequencing, protein tagging/mass spectrometry, micrococcal nuclease array).]

Figure 12.1 Depiction of various epigenetic processes and associated high-throughput approaches for their assessment. Only representative modifications are shown. DNA is represented by thin lines. The multimeric assembly represents a chromatin-remodeling complex ("CRC") that relaxes DNA-histone interactions. For additional details regarding assessment techniques, see Tables 12.1 and 12.2 [56–133].
Table 12.1 High-Throughput Methods for DNA Methylation Analysis

Restriction landmark genome scanning (RLGS) [106, 107]. Approach: gel-based. DNA is digested with the methylation-sensitive enzyme NotI and another enzyme, run in a first dimension, digested in-gel with a third enzyme, and run in a second dimension. Advantages: comprehensive and high-throughput; gel fragments can be cloned and sequenced. Disadvantages: technically arduous, requiring several days; 25% of CpG islands do not possess a NotI site [103].

NotI digestion coupled to BAC array hybridization [109]. Approach: microarray. DNA is digested with NotI, biotin-labeled, pulled down, and hybridized to a BAC clone microarray. Advantages: microarray-based; no use of 2-D gels. Disadvantages: use of NotI (see RLGS above); possibility of false positives.

Differential methylation hybridization (DMH) [60, 61]. Approach: microarray. DNA from test and reference samples is differentially labeled (e.g., red versus green fluorophores) and hybridized to tiled microarrays. Advantages: high-throughput; direct comparison of normal and test samples. Disadvantages: restriction enzyme-based (see RLGS); some methylation-sensitive enzymes exhibit poor coverage.

Methylated CpG island recovery assay (MIRA) [72, 73]. Approach: microarray. DNA is precipitated using MBD2b/MBD3L1 complexes and hybridized to arrays. Advantages: high-throughput; not biased to restriction enzyme sites. Disadvantages: diminished sensitivity in regions of high CpG density.

Promoter methylation array (Panomics, Fremont, California). Approach: membrane array. Methylated DNA is purified using a methyl-binding-protein (MeCP2) column, amplified, biotin-labeled, and hybridized to a nylon microarray of commonly methylated promoters; detection is by standard streptavidin-alkaline phosphatase. Advantages: semiquantitative; does not require microarray equipment/facilities. Disadvantages: currently limited to 40 genes.

Methylated DNA immunoprecipitation (MeDIP) [71, 110]. Approach: microarray. An antimethylcytosine antibody is used to pull down associated DNA, followed by array hybridization. Advantages: no PCR amplification or restriction enzyme bias. Disadvantages: limited by the sensitivity and specificity of the antibody.

Methyl-binding protein immunoprecipitation (McIP) [74]. Approach: microarray. DNA bound to a methyl-binding domain/Fc antibody fusion protein is pulled down by protein A-sepharose, followed by purification and hybridization to an array. Advantages: no use of restriction enzymes or PCR. Disadvantages: possible diminished sensitivity in densely methylated sequences (see MIRA).

Microarray-based integrated analysis of methylation (MIAMI) [62]. Approach: microarray. DNA from test and reference samples is restricted with methylation-sensitive and methylation-insensitive isoschizomer enzymes; the two samples are then differentially labeled and hybridized to a tiling microarray. A second array uses insensitive digestion alone, to detect restriction site polymorphisms (false positives). Advantages: corrects for incomplete methylation-sensitive enzyme digestion and restriction site polymorphisms. Disadvantages: requires two separate arrays; assay remains biased to HpaII restriction sites.

Methylation profiling by PCR coupled to ligase detection reaction (LDR) and universal array [68]. Approach: "zip code" microarray. Bisulfite-treated DNA is amplified and annealed to base-specific fluorescent probes that are then ligated to a separate sequence-specific probe attached to a "zip code" present on a universal microarray. Advantages: quantitative; possible clinical applications for interrogating specific methylated CG sites. Disadvantages: requires bisulfite treatment; coverage is limited to zip code probes present on the array.

Universal bead array (BeadArray technology, Illumina, San Diego, California) [67]. Approach: "zip code" glass bead array. Similar to the LDR zip code array described above, except that the ligation products are PCR amplified, fluorescently labeled, and captured by randomly arrayed oligonucleotides attached to glass beads. Advantages: can be automated; high sensitivity. Disadvantages: requires bisulfite treatment; coverage is limited to probes on beads.

MethylScope [63]. Approach: microarray (proprietary, Orion Genomics, St. Louis, Missouri). DNA is nebulized, digested with the methylation-requiring enzyme McrBC, gel-isolated to fragments >1 kb, and hybridized to a tiling array. Advantages: high-throughput; does not require a reference sample. Disadvantages: bias toward McrBC restriction sites; proprietary technique.

HpaII tiny fragment enrichment by ligation-mediated PCR (HELP) [111]. Approach: microarray (NimbleGen Systems, Madison, Wisconsin). 200–1,200 bp restriction fragments of HpaII and its isoschizomer MspI are cohybridized to a custom array, allowing mapping of both methylated and unmethylated CpG sites. Advantages: high resolution (~200 bp); quantitative; internal controls allow correction for copy number and base composition. Disadvantages: restriction site bias.

MethylScreen. Approach: PCR-based (proprietary, Orion Genomics). DNA is differentially digested with methylation-sensitive versus methylation-requiring endonucleases, followed by quantitative fluorescent PCR. Advantages: high-throughput and quantitative; does not require bisulfite treatment. Disadvantages: restriction enzyme-based; proprietary reagents.

Methylation target array (MTA) [66]. Approach: microarray (nylon membrane). Similar to tissue microarrays, MTA relies on the printing of methylation-sensitive enzyme-digested, amplified DNA from distinct tissues onto nylon membranes; the membrane arrays are then hybridized with radiolabeled PCR products representing commonly methylated genes. Advantages: capable of analyzing multiple tissues, for specific methylated genes, in a high-throughput manner. Disadvantages: coverage is limited to known methylated genes; requires radiolabeling; limited quantitation.

Methylation-specific oligonucleotide array (MSO) [70]. Approach: microarray. DNA is bisulfite-treated, PCR amplified, and hybridized to an array containing oligonucleotides matching specific CG (methylated) or TG (unmethylated) sites within distinct promoters. Advantages: capable of producing fully quantitative, detailed "methylation maps" of specific promoters. Disadvantages: requires bisulfite treatment; coverage is limited to arrayed, promoter-specific oligonucleotides.

Amplification of intermethylated sites (AIMS) [112]. Approach: gel, PCR-based. DNA is digested with the methylation-sensitive enzyme SmaI and its isoschizomer PspAI; adaptor ligation to sticky ends is followed by PCR primed against adaptors plus 3–4 arbitrarily chosen nucleotides (to reduce complexity). Advantages: straightforward and relatively inexpensive. Disadvantages: gel-based, with possibly difficult resolution due to inadequate band separation; some restriction enzyme bias.
Table 12.2 High-Throughput Methods for Analysis of Histone Modifications, MicroRNA Expression, and Nucleosome Localization

Histone modifications:

Chromatin immunoprecipitation on microarray (ChIP-on-chip) [120, 121]. Approach: microarray. DNA is crosslinked to chromatin, followed by coimmunoprecipitation using antibodies against specific histone residues or chromatin proteins; DNA is then purified and hybridized to tiled microarrays. Advantages: high-throughput; using bioinformatics tools, can be used to determine protein-to-DNA binding motifs. Disadvantages: contingent on the specificity and sensitivity of the antibody; possible artifacts due to crosslinking of DNA not associated with chromatin.

ChIP-cloning [122]. Approach: cloning/sequencing. "ChIPed" DNA is blunt-ended and cloned into vectors, followed by screening and sequencing; one group [122] has used the TOPO cloning system (Invitrogen). Advantages: can be used to discover novel protein-associated sequences, as opposed to microarrays. Disadvantages: laborious; possibility of false positives.

ChIP-paired-end-tagging (ChIP-PET) [85, 124]. Approach: cloning/sequencing. ChIPed DNA is blunt-ended, cloned, and amplified, followed by digestion with MmeI to create a single-PET library; single PETs are then liberated, dimerized, and multiplex sequenced. Advantages: unbiased; can compensate for the disadvantages of the short reads characteristic of multiplex sequencing. Disadvantages: computationally challenging (extraction of tags from raw sequences and mapping to known genomic sites).

ChIP coupled to arbitrarily primed PCR (ChAP) [125]. Approach: PCR-based, cloning/sequencing. ChIPed DNA is amplified by arbitrary CG-rich or CG-poor primers, electrophoresed, and informative bands cloned and sequenced. Advantages: not limited to known sequences, as with microarrays. Disadvantages: limited to regions that successfully anneal primers.

Genome-wide mapping technique (GMAT) [126, 127]. Approach: cloning/sequencing. Couples ChIP to SAGE technology, using 21-bp "tags" for each DNA fragment, followed by concatenation, cloning, and sequencing. Advantages: does not rely on known sequences, as do microarray methods. Disadvantages: technically arduous; biased to NlaIII restriction sites.

Serial analysis of chromatin occupancy (SACO) [128]. Approach: cloning/sequencing. ChIPed DNA is blunted, ligated to adapters, and amplified; the amplified DNA is then digested with NlaIII (~1 site in every 120 bp), and a modified SAGE procedure is used to create concatemerized 21-bp tags for sequencing. Advantages: 75% of tags should be present in unique loci. Disadvantages: short read lengths; computational difficulty in reassembling sequence fragments; possible GC bias [128].

ChIP coupled to massively parallel signature sequencing (MPSS) [54]. Approach: high-throughput sequencing (proprietary, Lynx Therapeutics, Hayward, California). ChIPed DNA fragments are linker-ligated and cloned into vectors containing >10 million 32-mer oligonucleotide tags and immobilized; sequencing is performed by repeated cycles of annealing of uniquely identifiable adaptors, shortening of the template, and annealing of a new adaptor, with adaptor "signatures" decoded by hybridization of fluorescent probes. Advantages: very high throughput; extremely sensitive; unbiased. Disadvantages: expensive; tags having repetitive sequences cannot be assigned; computationally difficult.

Mass spectrometry assay of histone modifications [98, 99]. Approach: mass spectrometry. Histones are isolated and subjected to LC/MS, or are digested and subjected to MALDI-TOF. Advantages: can examine global alterations in specific histone modifications. Disadvantages: cannot be used for examining gene- or sequence-specific events; possible difficulty in fragment resolution.

MicroRNA:

Custom microarray [78]. Approach: microarray. Custom 124-feature array of nonredundant, conserved mouse and human miRNAs. Advantages: possesses relevant miRNAs implicated in development. Disadvantages: two years old; additional miRNAs have been discovered during that time.

Custom microarray [77]. Approach: microarray. Custom 368-probe (in triplicate) microarray, corresponding to 245 mouse and human miRNA genes. Advantages: demonstrated as promising for clinical application [77]. Disadvantages: dated.

mirVana microRNA analysis system (Ambion, Austin, Texas) [76, 130]. Approach: northern blot/microarray. Collection of modified DNA oligonucleotides against rat, mouse, and human miRNAs; can be arrayed by the user or purchased as an array. Advantages: straightforward, accurate procedure in kit format; high sensitivity and specificity. Disadvantages: limited coverage; possible exclusion of rare miRNAs.

MicroRNA cloning/direct sequencing [86]. Approach: cloning/sequencing. Total RNA is size-fractionated to obtain 18–25-mer fragments, adapter-ligated or polyadenylated, reverse transcribed, and PCR amplified; PCR products are then sequenced. Advantages: straightforward; premanufactured kits are available for the procedure. Disadvantages: relatively low-throughput (can be increased by concatenation using SAGE technology); difficulty in discovery of miRNAs expressed at low levels.

MicroRNA cloning coupled to massive parallel signature sequencing (MPSS) and/or 454 sequencing [94]. Approach: high-throughput sequencing (proprietary, Lynx Therapeutics, Hayward, California). Identical to the previous entry except that final PCR products are cloned and subjected to MPSS and/or 454 sequencing (see descriptions in the histone modification and DNA methylation sections, respectively). Advantages: very high throughput. Disadvantages: expensive; difficulty in discovery of miRNAs expressed at low levels.

Computational approaches to miRNA discovery [86, 131]. Approach: bioinformatics. Many microRNAs have now been discovered using sequence prediction algorithms based on structural features (e.g., "stem-loops") and homology to lower organisms. Advantages: sophisticated prediction models have been quite successful. Disadvantages: requires laboratory validation.

Nucleosome positioning:

DNase-chip [79, 80]. Approach: DNase I digestion/microarray. Digestion of nuclei with or without DNase I, size fractionation, and hybridization to tiled microarrays. Advantages: high-throughput, straightforward, microarray-based. Disadvantages: expensive; resolution limited by the size fractionation step (sucrose gradient or sonication).

Micrococcal nuclease array [132]. Approach: microarray. DNA is crosslinked to chromatin, treated with micrococcal nuclease (MNase), and treated with protease; mononucleosomal DNA is then green-labeled and cohybridized with total DNA (red) to a tiling array. Advantages: can be used to map promoter regions and transcription factor-binding sites. Disadvantages: requires gel purification; possibility of protruding overhangs following MNase digestion.

Protein tagging/mass spectrometry [97, 133]. Approach: mass spectrometry. Components of chromatin-remodeling complexes are tagged, expressed in vivo, and purified; components are then gel-purified, digested, and subjected to tandem mass spectrometry. Advantages: can determine global changes in histone modifications during specific cellular states (e.g., development, stress, etc.). Disadvantages: cannot be linked to DNA sequence or structure; possible difficulty in fragment resolution.

Mapping of DNase I-hypersensitive (HS) sites by massive parallel signature sequencing and 454 sequencing [95]. Approach: high-throughput sequencing. Nuclei are digested with DNase I, deproteinized, ligated, and placed into a cloning vector having >10^7 oligonucleotide tags (see the description of MPSS in the histone modification section). Advantages: very high throughput; clustering analysis can distinguish true sites from background noise. Disadvantages: complexity of DNA fragments likely limits the scope of the map of DNase sites; possibility of identified sites being false positives.
RLGS uses combinations of methylation-sensitive and methylation-insensitive enzymes, followed by two-dimensional electrophoresis, to establish methylation patterns that differ between diseased and normal tissue. While this method is quite arduous, it provides high resolution and has successfully led to the discovery of methylation biomarkers for various malignancies [56, 57]. Although methylation studies have largely shifted to microarray approaches, gel-based analyses remain the mainstay for examinations of nucleosome positioning. These well-established techniques include micrococcal nuclease "laddering" and "DNA footprinting" methods that rely on protein-mediated protection of DNA [58, 59]. However, these approaches are laborious and low-throughput and thus unsuitable for global studies.

12.4.2 Microarrays
Another approach for high-throughput, genome-wide epigenetic analysis is the use of oligonucleotide microarrays. For the global analysis of DNA methylation, one microarray-based method, pioneered by our group, is known as differential methylation hybridization (DMH). Similar to RLGS, DMH relies on digestion of test (e.g., tumor) and reference (e.g., normal tissue) DNA with methylation-sensitive enzyme(s). In contrast to RLGS, in which the reference and test DNA digests are analyzed on separate gels and then compared, DMH entails cohybridization of DNA from both tissues, following distinctive labeling of each (e.g., Cy5 for tumor and Cy3 for reference) [60, 61]. To correct for false positive results due to restriction-site polymorphisms or incomplete digestion, a recent method known as MIAMI (microarray-based integrated analysis of methylation) uses a separate array hybridized with DNA digested by the methylation-insensitive isoschizomer of the methylation-sensitive enzyme [62]. Other approaches, such as MethylScope, specifically eliminate methylated DNA prior to array hybridization [63, 64] by sample digestion with the methylation-requiring enzyme McrBC [65] (Table 12.1). Another method developed by our group, the methylation target array, is similar to tissue microarrays in that genomic DNA from patient specimens is printed on nylon membranes and then hybridized with various probes to examine methylation of specific genes [66]. To quantitatively examine methylation at specific CG sites, rather than large promoter regions, a number of novel approaches are now in use. Several of these use sequence-dependent ligation methods that result in the production of "addressable" or "zip-coded" fluorescent oligonucleotides. Two of these are now in use, employing a universal microarray and a glass-bead array (Illumina, San Diego, California), respectively [67, 68]. In a more direct approach (not ligation-dependent), we and others developed a method known as the methylation-specific oligonucleotide array, in which probes to both methylated and unmethylated CG sites of specific genes are printed on slides [69, 70]. That
That technique was successfully used to distinguish methylation patterns of several gene promoters specific to various malignancies, as compared to the normal counterpart tissue [69]. Some DNA methylation microarray methods that do not rely on restriction enzymes employ an immunoprecipitation step. One such approach, known as MeDIP, utilizes an antimethylcytosine antibody for "pulldown" of methylated DNA, followed by hybridization of the precipitated DNA to a tiled microarray [71]. Another recently described, related approach, known as MIRA (methylated-CpG island recovery assay), employs a combination of methylcytosine-binding proteins, MBD2 and MBD3L1, for DNA pulldown [72, 73], followed by microarray hybridization. A similar strategy uses MBD2 fused to the Fc fragment of human IgG1 for pulldown by protein A-sepharose [74]. In a comparable study, a collaborative effort by our group used a method known as "ChIP-on-chip" to globally examine the distribution of MBD2 (and thus likely sites of methylation-mediated gene repression) throughout the genome of breast cancer cells [75]. For high-throughput analysis of miRNAs, isolated small RNA molecules can also be hybridized to various microarray platforms for expression studies. One biotechnology corporation (Ambion, Austin, Texas) manufactures and markets a set of over 662 miRNA probes for microarray printing or northern blotting, in addition to a small RNA isolation product in kit format [76]. In addition, custom miRNA arrays have been prepared by various investigators for examining developmental patterns or possible clinical biomarkers [77, 78]. Nucleosome position studies remain primarily low throughput, relying mostly on gel analysis of assembled histones or micrococcal nuclease-digested chromatin. One group, however, has utilized DNase I digestion, followed by tiled microarray hybridization (an approach designated "DNase-array" or "DNase-chip"), to map DNase-hypersensitive sites, which demarcate the boundaries of DNA-protein interactions [79, 80].
12.4.3 Cloning/Sequencing
While microarray methods are highly useful for mapping epigenetic modifications associated with specific known DNA regions, they cannot be used for the discovery of novel chromatin-linked sequences. Consequently, numerous strategies for "tagging" and sequencing distinct chromatin-interacting domains have recently been developed. Most of these are adaptations of SAGE (serial analysis of gene expression) technology, originally developed for the mapping and discovery of RNA transcripts [81]. Typically, these approaches rely on the immobilization of chromatin-coimmunoprecipitated DNA and digestion with the restriction enzyme MmeI, which cuts at a distance from its recognition site, yielding unique 20-bp sequence "tags." These tags are eventually released and concatenated for sequencing and identification [82].
A variation of this technology, known as MSDK (methylation-sensitive digital karyotyping), utilizes an initial restriction by the methylation-sensitive enzyme AscI, followed by modified SAGE technology, to map tags (i.e., DNA methylation sites) to specific chromosomal locations [83, 84]. Another recent approach, known as GIS (gene identification signature) analysis, employs tags from both the 5′ and 3′ ends of DNA fragments [paired-end ditags (PETs)] for concatenation, sequencing, and mapping [82]. GIS technology has also been combined with ChIP (ChIP-PET) to globally map p53 transcription factor binding sites across the genome of colon cancer cells [85]. For microRNA discovery, cloning is also used. In those approaches, total RNA is fractionated to obtain transcripts of 18–25 nucleotides and subjected to adaptor ligation or polyadenylation, RT-PCR, concatenation (using SAGE technology), and direct sequencing [86]. Alternatively, RT-PCR products can be cloned for high-throughput multiplex parallel sequencing. These methods have been used to successfully identify miRNA profiles specific to various developmental stages of rice, Drosophila, and chickens [87–89]. Similar approaches are now being employed for miRNAs that play specific roles in human cancers [51]. Sequencing of DNA associated with epigenetic modifications can be performed by a number of traditional or recently developed methods. For methylation analysis, the standard approach remains DNA modification by sodium bisulfite, followed by Sanger dideoxynucleotide sequencing [90]. As sodium bisulfite elicits deamination of unmethylated cytosine to uracil, any "C" residues that remain following bisulfite treatment were originally methylated. While Sanger-based bisulfite sequencing remains in wide use, a recent approach relies on pyrophosphate product detection following specific incorporation of distinct adenine, cytosine, guanine, or thymine nucleotides into bisulfite-treated DNA [91]. This method, known as pyrosequencing (Biotage, Uppsala, Sweden), is now used for quantitative methylation analysis and has been utilized to evaluate differential promoter methylation in oral and nonsmall cell lung cancers [92, 93]. A more recent approach, patented by 454 Life Sciences (Branford, Connecticut), uses pyrosequencing technology following DNA-to-bead immobilization and amplification in oil-emulsion microreactors [55]. Another multiplex sequencing approach, known as massively parallel signature sequencing or MPSS (Illumina, San Diego, California), has been used for sequencing miRNAs in Arabidopsis [94] and for mapping DNase I-hypersensitive sites in yeast [95]. The MPSS method relies on the sequential hybridization of a uniquely labeled adaptor to an immobilized template, followed by template shortening and hybridization to another adaptor; the released adaptor is then fluorescently detected [54].
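The bisulfite readout logic described above is simple enough to state in code. The following is an illustrative sketch of in-silico methylation calling from an already aligned, bisulfite-converted read; the function and its inputs are our own assumptions, not the interface of any published tool.

```python
def call_methylation(reference, bisulfite_read):
    """For each cytosine in the reference, a 'C' that survives bisulfite
    treatment was protected by methylation, while a 'C' read as 'T' was
    unmethylated (deaminated to uracil, sequenced as T)."""
    calls = {}
    for pos, (ref_base, read_base) in enumerate(zip(reference, bisulfite_read)):
        if ref_base != 'C':
            continue
        if read_base == 'C':
            calls[pos] = 'methylated'
        elif read_base == 'T':
            calls[pos] = 'unmethylated'
        else:
            calls[pos] = 'ambiguous'   # sequencing error or variant
    return calls

print(call_methylation("ACGCGT", "ATGCGT"))
# {1: 'unmethylated', 3: 'methylated'}
```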
12.4.4 Mass Spectrometry
Another emerging technique for the analysis of epigenetic proteins and modifications is mass spectrometry (MS). MS relies on the assessment of the mass-to-charge ratio of ions and can thus be used to analyze peptide and polynucleotide structures [96]. MS has now been successfully used to examine protein-protein interactions in nucleosome-remodeling complexes [97] and alterations in histone modifications under specific cellular conditions, such as differentiation or growth [98]. MS has also been used to map modifications of histone H3 associated with differentiation [99]. One specific MS method, known as matrix-assisted laser desorption/ionization coupled to time-of-flight (MALDI-TOF) analysis, has been used to map mass changes in DNA due to base-specific alterations (e.g., methylation) [100, 101]. While these assessments are made only at the single-gene level, they are high throughput and quantitative. Such determinations have now been used to examine specific methylation differences between cancerous and normal lung and brain [102, 103].
12.5 Conclusions
Comprehensive elucidation of epigenetic modifications, in both normal and diseased tissues, will allow for an extensive understanding of the gene regulatory networks that control both normal and disease phenotypes. Such knowledge would be readily translatable to various developmental disorders, numerous adult-onset diseases, psychiatric disease, and cancer. Moreover, as epigenetic therapies are now approved for various hematologic malignancies and are currently being studied in clinical trials for solid tumors, high-throughput epigenetic assessments could be used for evaluation of the bioactivity of patient-prescribed regimens (e.g., DNA hypomethylation induced by methyltransferase inhibitors). In fact, several epigenetic biomarkers have now been identified for a number of diseases [104, 105], and high-throughput analyses could be used for prognosis or, ideally, early detection. One immediate application of these technologies is the realization of the Human Epigenome Project, proposed as an exhaustive annotation of all histone and deoxycytosine modifications throughout the human genome. In summary, as almost all human disorders are at least partially associated with the DNA-external dysregulation of gene expression, the impact of comprehensive analyses of epigenetic phenomena cannot be overstated.
Acknowledgments
The authors gratefully acknowledge the following agencies for supporting this work: U.S. National Institutes of Health (NIH), National Cancer Institute
Grants CA 085289 (to K. P. Nephew) and CA 113001 (to T. H.-M. Huang); American Cancer Society Research and Alaska Run for Woman Grant TBE-104125; U.S. Army Medical Research Acquisition Activity, Award Numbers DAMD17-02-1-0418 and DAMD17-02-1-0419; and the Walther Cancer Institute (Indianapolis, Indiana).
References
[1] Jaenisch, R., and A. Bird, "Epigenetic Regulation of Gene Expression: How the Genome Integrates Intrinsic and Environmental Signals," Nat. Genet., Vol. 33, Suppl., 2003, pp. 245–254.
[2] Waddington, C., "The Epigenotype," Endeavor, Vol. 1, 1942, pp. 18–20.
[3] Jablonka, E., and M. J. Lamb, "The Changing Concept of Epigenetics," Ann. NY Acad. Sci., Vol. 981, 2002, pp. 82–96.
[4] Egger, G., et al., "Epigenetics in Human Disease and Prospects for Epigenetic Therapy," Nature, Vol. 429, 2004, pp. 457–463.
[5] Esteller, M., "The Necessity of a Human Epigenome Project," Carcinogenesis, Vol. 27, 2006, pp. 1121–1125.
[6] Wolffe, A. P., and M. A. Matzke, "Epigenetics: Regulation Through Repression," Science, Vol. 286, 1999, pp. 481–486.
[7] Cooper, D. N., and H. Youssoufian, "The CpG Dinucleotide and Human Genetic Disease," Hum. Genet., Vol. 78, 1988, pp. 151–155.
[8] Robertson, K. D., "DNA Methylation and Chromatin—Unraveling the Tangled Web," Oncogene, Vol. 21, 2002, pp. 5361–5379.
[9] Li, E., "Chromatin Modification and Epigenetic Reprogramming in Mammalian Development," Nat. Rev. Genet., Vol. 3, 2002, pp. 662–673.
[10] Berger, S. L., "Histone Modifications in Transcriptional Regulation," Curr. Opin. Genet. Dev., Vol. 12, 2002, pp. 142–148.
[11] Jenuwein, T., and C. D. Allis, "Translating the Histone Code," Science, Vol. 293, 2001, pp. 1074–1080.
[12] Strahl, B. D., and C. D. Allis, "The Language of Covalent Histone Modifications," Nature, Vol. 403, 2000, pp. 41–45.
[13] Boyer, L. A., et al., "Polycomb Complexes Repress Developmental Regulators in Murine Embryonic Stem Cells," Nature, 2006.
[14] Bracken, A. P., et al., "Genome-Wide Mapping of Polycomb Target Genes Unravels Their Roles in Cell Fate Transitions," Genes Dev., Vol. 20, 2006, pp. 1123–1136.
[15] Kamminga, L. M., et al., "The Polycomb Group Gene Ezh2 Prevents Hematopoietic Stem Cell Exhaustion," Blood, Vol. 107, 2006, pp. 2170–2179.
[16] Hanson, R. D., et al., "Mammalian Trithorax and Polycomb-Group Homologues Are Antagonistic Regulators of Homeotic Development," Proc. Natl. Acad. Sci. USA, Vol. 96, 1999, pp. 14372–14377.
[17] Cernilogar, F. M., and V. Orlando, "Epigenome Programming by Polycomb and Trithorax Proteins," Biochem. Cell Biol., Vol. 83, 2005, pp. 322–331.
[18] Jin, J., et al., "In and Out: Histone Variant Exchange in Chromatin," Trends Biochem. Sci., Vol. 30, 2005, pp. 680–687.
[19] Polo, S. E., and G. Almouzni, "Chromatin Assembly: A Basic Recipe with Various Flavours," Curr. Opin. Genet. Dev., Vol. 16, 2006, pp. 104–111.
[20] Ahmad, K., and S. Henikoff, "The Histone Variant H3.3 Marks Active Chromatin by Replication-Independent Nucleosome Assembly," Mol. Cell, Vol. 9, 2002, pp. 1191–1200.
[21] Fernandez-Capetillo, O., A. Celeste, and A. Nussenzweig, "Focusing on Foci: H2AX and the Recruitment of DNA-Damage Response Factors," Cell Cycle, Vol. 2, 2003, pp. 426–427.
[22] Lowndes, N. F., and G. W. Toh, "DNA Repair: The Importance of Phosphorylating Histone H2AX," Curr. Biol., Vol. 15, 2005, pp. R99–R102.
[23] Becker, P. B., and W. Horz, "ATP-Dependent Nucleosome Remodeling," Ann. Rev. Biochem., Vol. 71, 2002, pp. 247–273.
[24] Varga-Weisz, P., "ATP-Dependent Chromatin Remodeling Factors: Nucleosome Shufflers with Many Missions," Oncogene, Vol. 20, 2001, pp. 3076–3085.
[25] McGhee, J. D., and G. Felsenfeld, "Nucleosome Structure," Ann. Rev. Biochem., Vol. 49, 1980, pp. 1115–1156.
[26] McGhee, J. D., et al., "Orientation of the Nucleosome Within the Higher Order Structure of Chromatin," Cell, Vol. 22, 1980, pp. 87–96.
[27] Kornberg, R. D., and Y. Lorch, "Interplay Between Chromatin Structure and Transcription," Curr. Opin. Cell Biol., Vol. 7, 1995, pp. 371–375.
[28] Kulic, I. M., and H. Schiessel, "Nucleosome Repositioning Via Loop Formation," Biophys. J., Vol. 84, 2003, pp. 3197–3211.
[29] Langst, G., et al., "Nucleosome Movement by CHRAC and ISWI Without Disruption or Trans-Displacement of the Histone Octamer," Cell, Vol. 97, 1999, pp. 843–852.
[30] Bentwich, I., et al., "Identification of Hundreds of Conserved and Nonconserved Human MicroRNAs," Nat. Genet., Vol. 37, 2005, pp. 766–770.
[31] Mendell, J. T., "MicroRNAs: Critical Regulators of Development, Cellular Physiology and Malignancy," Cell Cycle, Vol. 4, 2005, pp. 1179–1184.
[32] Ronemus, M., and R. Martienssen, "RNA Interference: Methylation Mystery," Nature, Vol. 433, 2005, pp. 472–473.
[33] Singh, S. M., B. Murphy, and R. O'Reilly, "Epigenetic Contributors to the Discordance of Monozygotic Twins," Clin. Genet., Vol. 62, 2002, pp. 97–103.
[34] Ehrlich, M., et al., "DNA Methyltransferase 3B Mutations Linked to the ICF Syndrome Cause Dysregulation of Lymphogenesis Genes," Hum. Mol. Genet., Vol. 10, 2001, pp. 2917–2931.
[35] Paulsen, M., and A. C. Ferguson-Smith, "DNA Methylation in Genomic Imprinting, Development, and Disease," J. Pathol., Vol. 195, 2001, pp. 97–110.
[36] Fraga, M. F., et al., "Epigenetic Differences Arise During the Lifetime of Monozygotic Twins," Proc. Natl. Acad. Sci. USA, Vol. 102, 2005, pp. 10604–10609.
[37] Bjornsson, H. T., M. D. Fallin, and A. P. Feinberg, "An Integrated Epigenetic and Genetic Approach to Common Human Disease," Trends Genet., Vol. 20, 2004, pp. 350–358.
[38] Loenen, W. A., "S-Adenosylmethionine: Jack of All Trades and Master of Everything?" Biochem. Soc. Trans., Vol. 34, 2006, pp. 330–333.
[39] Das, P. M., and R. Singal, "DNA Methylation and Cancer," J. Clin. Oncol., Vol. 22, 2004, pp. 4632–4642.
[40] Feinberg, A. P., and B. Tycko, "The History of Cancer Epigenetics," Nat. Rev. Cancer, Vol. 4, 2004, pp. 143–153.
[41] Jones, P. A., "Overview of Cancer Epigenetics," Semin. Hematol., Vol. 42, 2005, pp. S3–S8.
[42] Karpf, A. R., and D. A. Jones, "Reactivating the Expression of Methylation Silenced Genes in Human Cancer," Oncogene, Vol. 21, 2002, pp. 5496–5503.
[43] Esteller, M., "Relevance of DNA Methylation in the Management of Cancer," Lancet Oncol., Vol. 4, 2003, pp. 351–358.
[44] Gelmetti, V., et al., "Aberrant Recruitment of the Nuclear Receptor Corepressor-Histone Deacetylase Complex by the Acute Myeloid Leukemia Fusion Partner ETO," Mol. Cell Biol., Vol. 18, 1998, pp. 7185–7191.
[45] Grignani, F., et al., "Fusion Proteins of the Retinoic Acid Receptor-Alpha Recruit Histone Deacetylase in Promyelocytic Leukaemia," Nature, Vol. 391, 1998, pp. 815–818.
[46] Archer, S. Y., and R. A. Hodin, "Histone Acetylation and Cancer," Curr. Opin. Genet. Dev., Vol. 9, 1999, pp. 171–174.
[47] Roth, S. Y., J. M. Denu, and C. D. Allis, "Histone Acetyltransferases," Ann. Rev. Biochem., Vol. 70, 2001, pp. 81–120.
[48] Gregory, R. I., and R. Shiekhattar, "Chromatin Modifiers and Carcinogenesis," Trends Cell Biol., Vol. 14, 2004, pp. 695–702.
[49] Roberts, C. W., and S. H. Orkin, "The SWI/SNF Complex—Chromatin and Cancer," Nat. Rev. Cancer, Vol. 4, 2004, pp. 133–142.
[50] Kumar, R., R. A. Wang, and R. Bagheri-Yarmand, "Emerging Roles of MTA Family Members in Human Cancers," Semin. Oncol., Vol. 30, 2003, pp. 30–37.
[51] Esquela-Kerscher, A., and F. J. Slack, "Oncomirs—MicroRNAs with a Role in Cancer," Nat. Rev. Cancer, Vol. 6, 2006, pp. 259–269.
[52] Shivdasani, R. A., "MicroRNAs: Regulators of Gene Expression and Cell Differentiation," Blood, 2006.
[53] Das, P. M., et al., "Chromatin Immunoprecipitation Assay," Biotechniques, Vol. 37, 2004, pp. 961–969.
[54] Brenner, S., et al., "Gene Expression Analysis by Massively Parallel Signature Sequencing (MPSS) on Microbead Arrays," Nat. Biotechnol., Vol. 18, 2000, pp. 630–634.
[55] Margulies, M., et al., "Genome Sequencing in Microfabricated High-Density Picolitre Reactors," Nature, Vol. 437, 2005, pp. 376–380.
[56] Smiraglia, D. J., et al., "Differential Targets of CpG Island Hypermethylation in Primary and Metastatic Head and Neck Squamous Cell Carcinoma (HNSCC)," J. Med. Genet., Vol. 40, 2003, pp. 25–33.
[57] Yu, L., et al., "A NotI-EcoRV Promoter Library for Studies of Genetic and Epigenetic Alterations in Mouse Models of Human Malignancies," Genomics, Vol. 84, 2004, pp. 647–660.
[58] Ulyanova, N. P., and G. R. Schnitzler, "Human SWI/SNF Generates Abundant, Structurally Altered Dinucleosomes on Polynucleosomal Templates," Mol. Cell Biol., Vol. 25, 2005, pp. 11156–11170.
[59] Li, Q., and O. Wrange, "Assays for Transcription Factors Access to Nucleosomal DNA," Methods, Vol. 12, 1997, pp. 96–104.
[60] Yan, P. S., et al., "Applications of CpG Island Microarrays for High-Throughput Analysis of DNA Methylation," J. Nutr., Vol. 132, 2002, pp. 2430S–2434S.
[61] Yan, P. S., S. H. Wei, and T. H. Huang, "Differential Methylation Hybridization Using CpG Island Arrays," Methods Mol. Biol., Vol. 200, 2002, pp. 87–100.
[62] Hatada, I., et al., "Genome-Wide Profiling of Promoter Methylation in Human," Oncogene, Vol. 25, 2006, pp. 3059–3064.
[63] Lippman, Z., et al., "Role of Transposable Elements in Heterochromatin and Epigenetic Control," Nature, Vol. 430, 2004, pp. 471–476.
[64] Nouzova, M., et al., "Epigenomic Changes During Leukemia Cell Differentiation: Analysis of Histone Acetylation and Cytosine Methylation Using CpG Island Microarrays," J. Pharmacol. Exp. Ther., Vol. 311, 2004, pp. 968–981.
[65] Stewart, F. J., et al., "Methyl-Specific DNA Binding by McrBC, a Modification-Dependent Restriction Enzyme," J. Mol. Biol., Vol. 298, 2000, pp. 611–622.
[66] Chen, C. M., et al., "Methylation Target Array for Rapid Analysis of CpG Island Hypermethylation in Multiple Tissue Genomes," Am. J. Pathol., Vol. 163, 2003, pp. 37–45.
[67] Bibikova, M., et al., "High-Throughput DNA Methylation Profiling Using Universal Bead Arrays," Genome Res., Vol. 16, 2006, pp. 383–393.
[68] Cheng, Y. W., et al., "Multiplexed Profiling of Candidate Genes for CpG Island Methylation Status Using a Flexible PCR/LDR/Universal Array Assay," Genome Res., Vol. 16, 2006, pp. 282–289.
[69] Adorjan, P., et al., "Tumour Class Prediction and Discovery by Microarray-Based DNA Methylation Analysis," Nucleic Acids Res., Vol. 30, 2002, p. e21.
[70] Gitan, R. S., et al., "Methylation-Specific Oligonucleotide Microarray: A New Potential for High-Throughput Methylation Analysis," Genome Res., Vol. 12, 2002, pp. 158–164.
[71] Weber, M., et al., "Chromosome-Wide and Promoter-Specific Analyses Identify Sites of Differential DNA Methylation in Normal and Transformed Human Cells," Nat. Genet., Vol. 37, 2005, pp. 853–862.
[72] Rauch, T., et al., "MIRA-Assisted Microarray Analysis, a New Technology for the Determination of DNA Methylation Patterns, Identifies Frequent Methylation of Homeodomain-Containing Genes in Lung Cancer Cells," Cancer Res., Vol. 66, 2006, pp. 7939–7947.
[73] Rauch, T., and G. P. Pfeifer, "Methylated-CpG Island Recovery Assay: A New Technique for the Rapid Detection of Methylated-CpG Islands in Cancer," Lab Invest., Vol. 85, 2005, pp. 1172–1180.
[74] Gebhard, C., et al., "Genome-Wide Profiling of CpG Methylation Identifies Novel Targets of Aberrant Hypermethylation in Myeloid Leukemia," Cancer Res., Vol. 66, 2006, pp. 6118–6128.
[75] Ballestar, E., et al., "Methyl-CpG Binding Proteins Identify Novel Sites of Epigenetic Inactivation in Human Cancer," EMBO J., Vol. 22, 2003, pp. 6335–6345.
[76] Shingara, J., et al., "An Optimized Isolation and Labeling Platform for Accurate MicroRNA Expression Profiling," RNA, Vol. 11, 2005, pp. 1461–1470.
[77] Calin, G. A., et al., "MicroRNA Profiling Reveals Distinct Signatures in B Cell Chronic Lymphocytic Leukemias," Proc. Natl. Acad. Sci. USA, Vol. 101, 2004, pp. 11755–11760.
[78] Thomson, J. M., et al., "A Custom Microarray Platform for Analysis of MicroRNA Gene Expression," Nat. Methods, Vol. 1, 2004, pp. 47–53.
[79] Crawford, G. E., et al., "DNase-Chip: A High-Resolution Method to Identify DNase I Hypersensitive Sites Using Tiled Microarrays," Nat. Methods, Vol. 3, 2006, pp. 503–509.
[80] Sabo, P. J., et al., "Genome-Scale Mapping of DNase I Sensitivity In Vivo Using Tiling DNA Microarrays," Nat. Methods, Vol. 3, 2006, pp. 511–518.
[81] Velculescu, V. E., et al., "Serial Analysis of Gene Expression," Science, Vol. 270, 1995, pp. 484–487.
[82] Wei, C. L., et al., "5′ Long Serial Analysis of Gene Expression (LongSAGE) and 3′ LongSAGE for Transcriptome Characterization and Genome Annotation," Proc. Natl. Acad. Sci. USA, Vol. 101, 2004, pp. 11701–11706.
[83] Hu, M., et al., "Distinct Epigenetic Changes in the Stromal Cells of Breast Cancers," Nat. Genet., Vol. 37, 2005, pp. 899–905.
[84] Polyak, K., "Profiling the Epigenome Using MSDK (Methylation-Sensitive Digital Karyotyping)," AACR Educational Book 2006, 2006, pp. 199–201.
[85] Wei, C. L., et al., "A Global Map of p53 Transcription-Factor Binding Sites in the Human Genome," Cell, Vol. 124, 2006, pp. 207–219.
[86] Berezikov, E., E. Cuppen, and R. H. Plasterk, "Approaches to MicroRNA Discovery," Nat. Genet., Vol. 38, Suppl., 2006, pp. S2–S7.
[87] Aravin, A. A., et al., "The Small RNA Profile During Drosophila Melanogaster Development," Dev. Cell, Vol. 5, 2003, pp. 337–350.
[88] Xu, H., et al., "Identification of MicroRNAs from Different Tissues of Chicken Embryo and Adult Chicken," FEBS Lett., Vol. 580, 2006, pp. 3610–3616.
[89] Luo, Y. C., et al., "Rice Embryogenic Calli Express a Unique Set of MicroRNAs, Suggesting Regulatory Roles of MicroRNAs in Plant Post-Embryogenic Development," FEBS Lett., Vol. 580, 2006, pp. 5111–5116.
[90] Fraga, M. F., and M. Esteller, "DNA Methylation: A Profile of Methods and Applications," Biotechniques, Vol. 33, 2002, pp. 632, 634, 636–649.
[91] Ronaghi, M., M. Uhlen, and P. Nyren, "A Sequencing Method Based on Real-Time Pyrophosphate," Science, Vol. 281, 1998, pp. 363–365.
[92] Shaw, R. J., et al., "Promoter Methylation of P16, RARbeta, E-Cadherin, Cyclin A1 and Cytoglobin in Oral Cancer: Quantitative Evaluation Using Pyrosequencing," Br. J. Cancer, Vol. 94, 2006, pp. 561–568.
[93] Xinarianos, G., et al., "Frequent Genetic and Epigenetic Abnormalities Contribute to the Deregulation of Cytoglobin in Non-Small Cell Lung Cancer," Hum. Mol. Genet., Vol. 15, 2006, pp. 2038–2044.
[94] Lu, C., et al., "MicroRNAs and Other Small RNAs Enriched in the Arabidopsis RNA-Dependent RNA Polymerase-2 Mutant," Genome Res., 2006.
[95] Crawford, G. E., et al., "Genome-Wide Mapping of DNase Hypersensitive Sites Using Massively Parallel Signature Sequencing (MPSS)," Genome Res., Vol. 16, 2006, pp. 123–131.
[96] Good, D. M., and J. J. Coon, "Advancing Proteomics with Ion/Ion Chemistry," Biotechniques, Vol. 40, 2006, pp. 783–789.
[97] Le Guezennec, X., et al., "MBD2/NuRD and MBD3/NuRD, Two Distinct Complexes with Different Biochemical and Functional Properties," Mol. Cell Biol., Vol. 26, 2006, pp. 843–851.
[98] Beck, H. C., et al., "Quantitative Proteomic Analysis of Post-Translational Modifications of Human Histones," Mol. Cell Proteomics, Vol. 5, 2006, pp. 1314–1325.
[99] Hake, S. B., et al., "Expression Patterns and Post-Translational Modifications Associated with Mammalian Histone H3 Variants," J. Biol. Chem., Vol. 281, 2006, pp. 559–568.
[100] Schatz, P., et al., "Novel Method for High Throughput DNA Methylation Marker Evaluation Using PNA-Probe Library Hybridization and MALDI-TOF Detection," Nucleic Acids Res., Vol. 34, 2006, p. e59.
[101] Schatz, P., D. Dietrich, and M. Schuster, "Rapid Analysis of CpG Methylation Patterns Using RNase T1 Cleavage and MALDI-TOF," Nucleic Acids Res., Vol. 32, 2004, p. e167.
[102] Muller, S., et al., "Retention of Imprinting of the Human Apoptosis-Related Gene TSSC3 in Human Brain Tumors," Hum. Mol. Genet., Vol. 9, 2000, pp. 757–763.
[103] Ehrich, M., et al., "Quantitative High-Throughput Analysis of DNA Methylation Patterns by Base-Specific Cleavage and Mass Spectrometry," Proc. Natl. Acad. Sci. USA, Vol. 102, 2005, pp. 15785–15790.
[104] Balch, C., et al., "New Anti-Cancer Strategies: Epigenetic Therapies and Biomarkers," Front Biosci., Vol. 10, 2005, pp. 1897–1931.
[105] Gius, D., et al., "The Epigenome as a Molecular Marker and Target," Cancer, Vol. 104, 2005, pp. 1789–1793.
[106] Rush, L. J., and C. Plass, "Restriction Landmark Genomic Scanning for DNA Methylation in Cancer: Past, Present, and Future Applications," Anal. Biochem., Vol. 307, 2002, pp. 191–201.
[107] Smiraglia, D. J., and C. Plass, "The Development of CpG Island Methylation Biomarkers Using Restriction Landmark Genomic Scanning," Ann. NY Acad. Sci., Vol. 983, 2003, pp. 110–119.
[108] Callinan, P. A., and A. P. Feinberg, "The Emerging Science of Epigenomics," Hum. Mol. Genet., Vol. 15, Spec. No. 1, 2006, pp. R95–R101.
[109] Ching, T. T., et al., "Epigenome Analyses Using BAC Microarrays Identify Evolutionary Conservation of Tissue-Specific Methylation of SHANK3," Nat. Genet., Vol. 37, 2005, pp. 645–651.
[110] Wilson, I. M., et al., "Epigenomics: Mapping the Methylome," Cell Cycle, Vol. 5, 2006, pp. 155–158.
[111] Khulan, B., et al., "Comparative Isoschizomer Profiling of Cytosine Methylation: The HELP Assay," Genome Res., Vol. 16, 2006, pp. 1046–1055.
[112] Frigola, J., et al., "Methylome Profiling of Cancer Cells by Amplification of Inter-Methylated Sites (AIMS)," Nucleic Acids Res., Vol. 30, 2002, p. e28.
[113] Gonzalgo, M. L., et al., "Identification and Characterization of Differentially Methylated Regions of Genomic DNA by Methylation-Sensitive Arbitrarily Primed PCR," Cancer Res., Vol. 57, 1997, pp. 594–599.
[114] Tryndyak, V., O. Kovalchuk, and I. P. Pogribny, "Identification of Differentially Methylated Sites Within Unmethylated DNA Domains in Normal and Cancer Cells," Anal. Biochem., Vol. 356, 2006, pp. 202–207.
[115] Clark, S. J., et al., "High Sensitivity Mapping of Methylated Cytosines," Nucleic Acids Res., Vol. 22, 1994, pp. 2990–2997.
[116] Zhou, Y., et al., "Use of a Single Sequencing Termination Reaction to Distinguish Between Cytosine and 5-Methylcytosine in Bisulfite-Modified DNA," Biotechniques, Vol. 22, 1997, pp. 850–854.
[117] Azhikina, T., et al., "Non-Methylated Genomic Sites Coincidence Cloning (NGSCC): An Approach to Large Scale Analysis of Hypomethylated CpG Patterns at Predetermined Genomic Loci," Mol. Genet. Genomics, Vol. 271, 2004, pp. 22–32.
[118] Toyota, M., et al., "Identification of Differentially Methylated Sequences in Colorectal Cancer by Methylated CpG Island Amplification," Cancer Res., Vol. 59, 1999, pp. 2307–2312.
[119] Tost, J., H. El Abdalaoui, and I. G. Gut, "Serial Pyrosequencing for Quantitative DNA Methylation Analysis," Biotechniques, Vol. 40, 2006, pp. 721–722, 724, 726.
[120] Horak, C. E., and M. Snyder, "ChIP-Chip: A Genomic Approach for Identifying Transcription Factor Binding Sites," Methods Enzymol., Vol. 350, 2002, pp. 469–483.
[121] Wu, J., et al., "ChIP-Chip Comes of Age for Genome-Wide Functional Analysis," Cancer Res., Vol. 66, 2006, pp. 6899–6902.
[122] Weinmann, A. S., and P. J. Farnham, "Identification of Unknown Target Genes of Human Transcription Factors Using Chromatin Immunoprecipitation," Methods, Vol. 26, 2002, pp. 37–47.
[123] Lee, H. R., et al., "Chromatin Immunoprecipitation Cloning Reveals Rapid Evolutionary Patterns of Centromeric DNA in Oryza Species," Proc. Natl. Acad. Sci. USA, Vol. 102, 2005, pp. 11793–11798.
[124] Ng, P., et al., "Multiplex Sequencing of Paired-End Ditags (MS-PET): A Strategy for the Ultra-High-Throughput Analysis of Transcriptomes and Genomes," Nucleic Acids Res., Vol. 34, 2006, p. e84.
[125] Liang, G., et al., "Distinct Localization of Histone H3 Acetylation and H3-K4 Methylation to the Transcription Start Sites in the Human Genome," Proc. Natl. Acad. Sci. USA, Vol. 101, 2004, pp. 7357–7362.
[126] Roh, T. Y., S. Cuddapah, and K. Zhao, "Active Chromatin Domains Are Defined by Acetylation Islands Revealed by Genome-Wide Mapping," Genes Dev., Vol. 19, 2005, pp. 542–552.
[127] Roh, T. Y., et al., "High-Resolution Genome-Wide Mapping of Histone Modifications," Nat. Biotechnol., Vol. 22, 2004, pp. 1013–1016.
[128] Impey, S., et al., "Defining the CREB Regulon: A Genome-Wide Analysis of Transcription Factor Regulatory Regions," Cell, Vol. 119, 2004, pp. 1041–1054.
[129] Siddiqui, A. S., et al., "Sequence Biases in Large Scale Gene Expression Profiling Data," Nucleic Acids Res., Vol. 34, 2006, p. e83.
[130] Gu, J., and V. R. Iyer, "PI3K Signaling and miRNA Expression During the Response of Quiescent Human Fibroblasts to Distinct Proliferative Stimuli," Genome Biol., Vol. 7, 2006, p. R42.
[131] Bentwich, I., "Prediction and Validation of MicroRNAs and Their Targets," FEBS Lett., Vol. 579, 2005, pp. 5904–5910.
[132] Yuan, G. C., et al., "Genome-Scale Identification of Nucleosome Positions in S. Cerevisiae," Science, Vol. 309, 2005, pp. 626–630.
[133] Krogan, N. J., et al., "Regulation of Chromosome Stability by the Histone H2A Variant Htz1, the Swr1 Chromatin Remodeling Complex, and the Histone Acetyltransferase NuA4," Proc. Natl. Acad. Sci. USA, Vol. 101, 2004, pp. 13513–13518.
13 Comparative Sequencing, Assembly, and Anchoring
Aleksandar Milosavljevic and Cristian Coarfa
An increasing variety of new applications of comparative genomics are being enabled by the increasing throughput of next generation sequencing technologies and by the growing availability of sequenced genomes. The applications typically employ genome-scale comparisons involving assembled genomes and unassembled genome fragments or anchorable tags in various combinations. Comparison algorithms have evolved over the past decades to serve changing sequence analysis needs. The increasing volume of genome-sequence information requires algorithms for extremely high-volume anchoring. More specifically, the following four new circumstances will shape the requirements for the next generation of similarity search algorithms. First, the next generation sequencing technologies will provide unprecedented sequencing throughputs while the sequence data will likely have different error characteristics. The dramatic increase in sequence volume will put a premium on speed and scalability while requiring robustness to sequencing error and the ability to compare short reads and mappable tags. Second, genome assemblies of many closely related organisms, and even of different individuals from the same species, will become available, allowing for variation analysis across individuals and for comparative mapping and assembly. The number of meaningful comparisons will grow proportionately to the square of the amount of available sequence.
Parallelism and cluster computing are likely to play an important role in the design and optimization of comparison algorithms. Third, a multitude of next generation sequencing technologies will likely coexist, each occupying a niche. The increasing availability of sequence information and the decreasing cost of sequencing will gradually but predictably enable new applications within specific niches. The similarity search algorithms will have to meet the individual needs, such as specificity, sensitivity, and speed, of specific sequencing technologies and specific applications. Fourth, the applications will increasingly depend on accurate sequence-based inference of orthology relations between pairs of genome assemblies or between genome assemblies and large numbers of sequence reads or sequence tags. We use the term "anchoring" to refer broadly to the inference of orthology relations between DNA sequences within or across species. In the following sections, we first survey a sampling of new applications enabled by high-throughput sequencing. We argue that almost all of these applications rely on a key sequence-anchoring step. We then examine the requirements for anchoring and how they are addressed by the currently dominant "seed-and-extend" paradigm of similarity search embodied in algorithms such as BLAST [1, 2] and BLAT [3]. We review positional hashing [4], a novel, inherently parallelizable anchoring method developed in our laboratory to address the specific requirements of high-volume anchoring. Finally, we propose a simulated, parametrized benchmark for comparing the performance of anchoring algorithms.
13.1 Comparing an Assembled Genome with Another Assembled Genome
Understanding the function and the natural history of the human genome are two key genome-era challenges. Comparison of assembled mammalian genomes with the human genome reveals signals of purifying selection (where detrimental random mutations are sieved out by natural selection), thereby identifying evolutionarily conserved functional elements. The first application of this method on the scale of a whole mammalian genome involved comparison of the genomes of human and mouse [5], and more recent applications extend the comparison to numerous mammalian and other genomes. Comparisons of primate genomes provide the closest view of the dynamic history of the human genome, revealing the evolutionary forces shaping it and shedding light on the genetic basis of uniquely human characteristics. One recent example is the reconstruction of the recent history of the human genome from the genomes of chimpanzee and rhesus macaque [6] using the genomic triangulation method [7].
As illustrated in Figure 13.1, comparison starts with an anchoring step in which homologous loci are identified. In order to filter out nonorthologous loci, a reciprocal best-match heuristic is usually applied. In other words, locus a in genome A and locus b in genome B are matched if locus b appears among the top k matches when a is used as a query in a similarity search against genome B, and vice versa. Typically, k is set to 1, but sometimes a larger k may be applied to improve sensitivity. The heuristic is based on the assumption that orthologous loci will share the largest amount of similarity; a minimal sketch of this filtering step appears below. Anchors are merged into colinear blocks [8, 9] that generally correspond to orthologous DNA sequences. The colinearity is sometimes relaxed to simple collocation, as shown in Figure 13.2. A set of anchors is said to be collocated if the anchored loci fall within a predefined distance of each other, even if they do not occur in exactly the same order in the two compared sequences (a sketch of this criterion follows Figure 13.2). The more relaxed criterion enables the study of rearrangements at lower levels of resolution, ignoring small local rearrangements such as small inversions of genomic sequence. An inconsistency may occur if two blocks connect one locus within one genome to two different loci within the second genome. This is typically resolved by assuming that the largest block points to the true orthology, as illustrated in Figure 13.3. For example, the UCSC browser displays comparative sequence information in the form of Nets, which are obtained by selecting the largest blocks, referred to as Chains, and then recursively picking blocks that fit within the gaps of the already selected Chains [10].
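The reciprocal best-match filter can be stated compactly in code. The following is a minimal sketch, not the implementation of any particular browser or pipeline; the hit-list representation (query, target, score triples) and the function name are our own illustrative assumptions.

```python
from collections import defaultdict

def reciprocal_best_matches(hits_ab, hits_ba, k=1):
    """Keep locus pair (a, b) only if b is among the top-k hits of a
    in the A-vs-B search AND a is among the top-k hits of b in the
    reciprocal B-vs-A search."""
    def top_k(hits):
        by_query = defaultdict(list)
        for query, target, score in hits:
            by_query[query].append((score, target))
        return {q: {t for _, t in sorted(pairs, reverse=True)[:k]}
                for q, pairs in by_query.items()}

    best_ab = top_k(hits_ab)   # query in genome A -> top-k targets in B
    best_ba = top_k(hits_ba)   # query in genome B -> top-k targets in A
    return [(a, b) for a, targets in best_ab.items()
            for b in targets if a in best_ba.get(b, set())]
```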
Figure 13.1 Comparisons of assembled genomes to detect evolutionary rearrangements or base-pair-level evolutionary conservation indicative of function.
Figure 13.2 Orthologous sequences are obtained by merging either (a) colinear anchors or (b) colocated anchors.
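The collocation criterion itself reduces to a simple proximity test. Below is a toy, quadratic-time sketch under our own simplifying assumptions (anchors as coordinate pairs, a single max_gap threshold applied in both sequences); a production implementation would typically use sorting plus union-find rather than this greedy scan.

```python
def merge_collocated(anchors, max_gap):
    """Greedily group anchors into collocated blocks: an anchor joins a
    block if, in BOTH sequences, it lies within max_gap of some anchor
    already in that block. Local order differences within a block are
    tolerated, unlike strict colinear chaining."""
    blocks = []
    for a in sorted(anchors):                # scan by position in sequence 1
        for block in blocks:
            if any(abs(a[0] - b[0]) <= max_gap and
                   abs(a[1] - b[1]) <= max_gap for b in block):
                block.append(a)
                break
        else:                                # no nearby block found
            blocks.append([a])
    return blocks
```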
Figure 13.3 The similarity between blocks 1b and 2b is considered to point to the true orthology, rather than the one between blocks 1a and 2a.
The blocks are used for two purposes. First, they serve as input to methods for reconstructing chromosome-level rearrangements, such as GRIMM [9], CAR [11], and genomic triangulation [7]. Second, they serve as a starting point for base-pair-level alignments using programs such as LAGAN [12] and numerous others [13–16]. The model described here does not perfectly describe all situations and algorithms. For example, deviating somewhat from the simplified model presented here, certain algorithms [17] combine anchoring and alignment more intimately on a local level but still require anchoring at the genome level.
13.2 Mutual Comparison of Genome Fragments
Construction of an overlap graph is the first step in the classical overlap-layout-consensus strategy for genome assembly (reviewed in [18]). Read overlaps are detected using what is essentially an anchoring algorithm. As illustrated in Figure 13.4, anchoring is a key step in the process of genome assembly. One should note that, as opposed to cross-species anchoring, read-overlap detection is an intraspecies anchoring step. For cross-species anchoring, the objective is to identify sequence fragments that are derived from the same ancestral fragment through species divergence. In contrast, the objective of read-overlap detection is to identify read fragments that originate from the same genomic locus. Despite this semantic difference, the algorithmic problem is similar. Issues confounding assembly, such as intragenome duplications, pose similar obstacles for the two tasks. In contrast to the genomes of human and mouse, which are both being finished at a high level of accuracy and completeness, numerous other mammalian genomes are slated in the near term for whole-genome assembly at relatively low coverage. Consequently, in the near future the genomes of most mammals will likely consist of hundreds of thousands of relatively short contigs.
Figure 13.4 Comparisons of genome fragments. The path on the left indicates assembly of an individual genome. The path on the right indicates "bootstrapping" of two assemblies by taking advantage of the fact that the genomes share genomic sequence due to shared ancestry.
A method has been proposed to "bootstrap" assemblies of individual genomes by taking advantage of the fact that mammalian genomes share common ancestry. The key idea is that shared ancestry implies common structure (i.e., shared order and orientation of contigs). An implementation of this idea is illustrated in the right part of Figure 13.4: contig-level assemblies of related genomes are compared to identify overlaps between contigs across the two related genomes. The contig overlaps are then used to infer the order and orientation of contigs within one species using "bridges" consisting of contigs from the other species, as sketched below. This high-level description of the approach somewhat hides the complex structure of the underlying combinatorial problem, which is described in more detail elsewhere [19]. For the purpose of our discussion, it suffices to say that in order for the method to work properly, contig overlaps should correspond to orthologous sequence fragments. In this context, contig overlap detection is therefore an anchoring problem.
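To make the bridging idea concrete, here is a toy sketch under our own simplifying assumptions: it only counts how often a contig from the second species anchors onto two different contigs of the first species, ignoring orientation and the conflict resolution that the full combinatorial formulation [19] must handle.

```python
from collections import defaultdict

def count_bridges(cross_overlaps):
    """cross_overlaps: (species1_contig, species2_contig) anchoring pairs.
    A species-2 contig overlapping two species-1 contigs is a "bridge"
    suggesting those two contigs are adjacent in the species-1 genome.
    Returns bridge-support counts per candidate adjacency."""
    linked_by_bridge = defaultdict(set)
    for c1, c2 in cross_overlaps:
        linked_by_bridge[c2].add(c1)
    support = defaultdict(int)
    for contigs in linked_by_bridge.values():
        for a in contigs:
            for b in contigs:
                if a < b:                    # count each unordered pair once
                    support[(a, b)] += 1
    return dict(support)
```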
13.3 Comparing an Assembled Genome with Genome Fragments
Anchoring of genome fragments onto orthologous loci in already assembled genomic sequences will become increasingly ubiquitous for two reasons. First, the number of assembled genomes is increasing, thus creating a substrate for anchoring. Second, the next generation sequencing technologies will be producing extremely large numbers of relatively short but anchorable reads. The interpretation of such reads will in many cases require anchoring onto already assembled genomes. For the purpose of organizing this section, we classify applications of anchoring into three categories, based on the anchoring approach: (1) anchoring of individual reads; (2) anchoring of paired-end reads, also referred to as mate-pairs; and (3) anchoring of reads sampled from a clone such as a bacterial artificial chromosome (BAC). Specific applications employing each of the three types of anchoring are illustrated in Figure 13.5. In the following, we briefly discuss each of the applications.
13.3.1 Applications Using Read Anchoring
Digital karyotyping [20] measures the copy number of loci within a genome. A large number of reads or short mappable tags are obtained using Sanger sequencing or a next generation sequencing technology. The reads are anchored onto the reference human genome, and the copy number is measured by the density of mapped reads. The genome is computationally segmented into loci, each locus receiving a specific copy-number estimate.
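The density computation behind this approach can be sketched in a few lines. This is a minimal illustration assuming uniquely anchored read start coordinates and fixed-size windows; the published method's tag handling and segmentation are more involved.

```python
def copy_number_profile(read_starts, genome_length, window=100_000):
    """Bin anchored read start positions into fixed windows and report
    each window's read density normalized by the genome-wide mean, so
    values near 1.0 suggest normal copy number and higher values
    suggest amplification."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window] += 1
    mean = sum(counts) / n_windows
    return [c / mean if mean else 0.0 for c in counts]
```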
Figure 13.5 Comparisons of assembled genomes and genome fragments.
One method for detecting polymorphisms within a species whose genome is sequenced is to obtain whole-genome shotgun reads of different individuals and map them onto the reference genome for the species. This method has been employed to detect many SNPs in the dbSNP database [21, 22] and, more recently, to detect indel polymorphisms in humans [23, 24]. Correct anchoring of reads onto orthologous loci is particularly challenging in humans due to the large number of duplicated regions and paralogous genes. When anchoring reads sequenced from a certain individual to a reference genome, incorrect anchoring onto such regions may be interpreted as apparent polymorphisms, while the apparent allele differences are in fact differences between duplicated or paralogous loci.
Anchoring is a key step in the phylogenetic footprinting [25, 26] and phylogenetic shadowing [27] methods. In phylogenetic footprinting, short DNA sequences that are highly conserved among multiple related species are used to identify evolutionarily conserved functional regions. As part of a major large-scale application of the phylogenetic footprinting method, an increasing number of mammalian species are being sequenced to draft level, with the goal of delineating functional regions in the human genome by observing patterns of base-pair-level evolutionary conservation of functional regions due to purifying selection. Due to the relatively low coverage of the sequenced genomes, the draft assemblies of many sequenced mammals will consist of large numbers of relatively short contigs and even unassembled reads. Correct anchoring of such reads or short contigs is obviously a key step in the assembly [28, 29]. This will be particularly relevant for the planned low-coverage sequencing of numerous eutherian mammals [30], which will produce millions of reads and short contigs. While phylogenetic footprinting considers multiple, including distantly related, species, the phylogenetic shadowing method [27] is specifically designed to detect evolutionary conservation across primates. It looks for sequences that are highly conserved between human and apes, Old World monkeys, and New World monkeys in order to identify human functional regions. While the evolutionary distances may be closer, the anchoring problem may in fact be harder due to the abundance of relatively recent segmental duplications and retroposon insertions in the human and other primate lineages. Yet another application of read anchoring is comparative genome assembly [31, 32], where one assembled genome is used as a reference to guide the assembly of the genome of another related species. In one embodiment of comparative assembly, the reads from one species are anchored onto the genome of a related species in order to partition the reads by corresponding loci and then assemble each locus independently and locally. The information in the reference genome provides more extensive and accurate assembly at lower read coverage for the regions that are similar across the two species. In addition to the anchoring of genomic reads, anchoring of ESTs, SAGE tags, and transcript tag sequences obtained using next-generation sequencing technologies also has obvious value.
13.3.2 Applications Employing Anchoring of Paired Ends
Sequenced ends of a small-insert or a long-insert clone such as a BAC may be anchored onto a related reference genomic sequence. If the distance and orientation of the anchored ends are consistent with the size and orientation of the clone insert, one may infer the absence of large-scale breakpoints within the clone relative to the reference genome.
One example is the detection of cross-mammalian conservation of chromosome structure using mapping of sequenced BAC-end sequences [33–36]. On the other hand, if the distance and orientation of anchoring are not consistent, one may infer that the clone spans a breakpoint induced by a genomic insertion, deletion, or another rearrangement. Typically, two independent clone mappings that are mutually consistent are required to reduce the false positive rate of breakpoint detection. This general methodology has been employed by the so-called end-sequence profiling (ESP) method to infer chromosomal aberrations in breast cancer using BAC clones [37, 38]. Anchoring of sequenced fosmid ends has been employed to detect human polymorphisms [39] and differences between chimpanzee and humans [40]. End-sequencing of cDNA clones using 454 sequencing technology has been employed to detect chimeric transcripts in cancer [41]. If two pairs of clone ends anchor onto the reference genome consistently and in an overlapping fashion, conservation of the whole so-called mate-pair chain may be inferred [7], as illustrated in Figure 13.6. Next generation sequencing technologies hold the promise of providing a particularly economical and fast method of delineating conserved and rearranged regions within compared genomes using this method; a minimal sketch of the end-consistency test follows.
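The consistency test can be written as follows; the anchor representation (chromosome, position, strand) and the tolerance parameter are our own illustrative assumptions, not part of the ESP method's published interface.

```python
def ends_consistent(end1, end2, insert_size, tolerance=0.1):
    """Check whether two anchored clone ends are consistent with the
    clone insert: same chromosome, ends facing each other (+/-), and
    spanned distance within `tolerance` of the expected insert size.
    An inconsistent pair suggests the clone spans a breakpoint."""
    chrom1, pos1, strand1 = end1
    chrom2, pos2, strand2 = end2
    if chrom1 != chrom2:
        return False
    (lpos, lstrand), (rpos, rstrand) = sorted([(pos1, strand1), (pos2, strand2)])
    if (lstrand, rstrand) != ('+', '-'):   # ends must point toward each other
        return False
    return abs((rpos - lpos) - insert_size) <= tolerance * insert_size
```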
13.3.3 Applications Utilizing Mapping of Clone Reads
Each genome project confronts the choice between the more economical whole-genome shotgun strategy and the more accurate clone-by-clone strategy. A hybrid genome-sequencing strategy, first demonstrated on the rat genome [42], combines the benefits of both by combining shotgun sequencing of BAC clones with whole-genome shotgun sequencing. Whole-genome reads are localized to specific BACs using a "BAC fishing" procedure, in which the reads obtained directly from individual BACs are compared to whole-genome reads in order to anchor the whole-genome reads to specific BACs.
Figure 13.6 Overlapping mate pairs with consistent mappings are merged into mate-pair chains.
Once the whole-genome reads are localized, the BACs are independently assembled, thus alleviating assembly problems due to the genome-wide repetitive structures that abound in mammalian genomes. While BAC fishing employs BAC reads to localize whole-genome reads to specific BACs, the BAC reads can also be employed to anchor the BAC onto an existing genome assembly. The mapped BACs can then be selected for targeted sequencing of orthologous regions of biomedical interest. The direct BAC sequencing method is more sensitive than the BAC-end sequencing method, but its cost is prohibitive for mapping purposes due to the large cost of preparing small-insert shotgun libraries from individual BACs. The pooled genomic indexing method overcomes this problem by pooling BACs prior to library preparation, thus enabling comparative mapping of BACs of one species onto orthologous loci in the genomes of related species [43, 44]. The method has enabled recent targeted sequencing of BACs containing fragments of the genome of rhesus macaque [6]. Anchoring of reads or mappable tags obtained from the BAC pools is a key step in the process.
13.4 Anchoring by Seed-and-Extend Versus Positional Hashing Methods
The seed-and-extend paradigm currently dominates the field of sequence similarity search [2, 3, 16, 45–49], and for this reason it was the first tool used for anchoring, despite the fact that the paradigm emerged decades ago to address a different problem. The seed-and-extend paradigm originally emerged to address the key problem of searching a large database using a relatively short query to detect remote homologies. A homology match to a gene of known function was typically used to derive a hypothesis about the function of the query sequence. One key requirement for this application is sensitivity when comparing sequences across large evolutionary distances. A second key requirement is speed when searching a large database using a short query. The first generation of seed-and-extend algorithms, such as BLAST [1, 2] and FASTA [49], employed preprocessing of the query to speed up the database search, while second generation seed-and-extend algorithms such as BLAT [3] and SSAHA [48] employed in-memory indexing of genome-sized databases for another order of magnitude of speed increase, required for interactive lookup of genome loci in human genome browsers using genomic DNA sequence queries. It is important to note that the anchoring problem poses a new and unique set of requirements. First, the detection of remote homologies is less relevant for anchoring than the discrimination of true orthology relations when comparing closely related genomes. Second, with the growth of the genome databases and the emergence of next generation sequencing technologies, the query itself may
now contain tens of millions of fragments or several gigabases of assembled sequence. To address the requirements specific to the anchoring problem, we recently developed the positional hashing method and implemented it in the software package Pash [4]. The method avoids costly base-pair-level matching and instead employs faster and scalable k-mer matching. The k-mer matching is performed using distributed position-specific hash tables that are constructed from both compared sequences. To better formulate the difference between positional hashing and the classical seed-and-extend paradigm, we first introduce a few definitions. A "seed" pattern P is defined by offsets {x_1, ..., x_w}. We say that a seed match is detected between sequences S and T in respective positions i and j if S[i + x_1] = T[j + x_1], ..., and S[i + x_w] = T[j + x_w]. To further simplify notation, we define the pattern function f_P at position i in sequence S as f_P(S, i) = S[i + x_1] ... S[i + x_w]. Using this definition, we say that a seed match is detected between sequences S and T in respective positions i and j if f_P(S, i) = f_P(T, j). A seed-and-extend method extends each seed match by local base-pair alignment. The alignments that do not produce scores above a threshold of significance are discarded. The tradeoff between the sensitivity and the speed of a seed-and-extend method is to a large degree determined by the weight of the employed pattern, defined as the number of offsets in it. To a lesser degree, sensitivity is also a function of the pattern span, defined as the distance x_w − x_1 + 1. The sensitivity of a pattern is the probability of detecting a seed match between orthologous positions. Specificity is the fraction of seed matches that reach significance upon local alignment. Patterns of higher weight are more specific but less sensitive. Patterns of lower weight are more sensitive but less specific, potentially causing excessive waste of computing time on local alignments that are either spurious or irrelevant for anchoring. In contrast to the seed-and-extend paradigm, positional hashing groups all collinear matches (i.e., those falling along the same diagonal in the comparison matrix) to produce a score. The score calculated by grouping the matches suffices for a wide range of anchoring applications, while providing significant speedup by eliminating the time-consuming local alignment at the base-pair level. In further contrast to the seed-and-extend paradigm, which creates an index from one of the two compared sequences and then "streams" the other sequence against the index to look up matches in consecutive positions, the positional hashing method simultaneously hashes both compared sequences. The hashing involves numerous position-specific hash tables, thus allowing extreme scalability through parallel computing. The positional hashing scheme breaks the anchoring problem along its natural diagonal structure, as illustrated in Figure 13.7.
Figure 13.7 Positional hashing. The positional hashing scheme breaks the anchoring problem along the diagonals of the comparison matrix (a). Each cluster node detects and groups matches along a subset of diagonals (b). Short bold lines in the left panel of (c) indicate positions used to calculate hash keys for positional hash table H(0,0) while the short bold lines in the right panel of (c) indicate positions used to calculate hash keys for table H(0, L−1). While the figure illustrates comparison of assembled genomes, the method is also applicable with minor modifications to the comparison of genomic fragments.
Each cluster node detects and groups matches along a subset of diagonals, as illustrated in Figure 13.7(b). More precisely, matches along diagonal d = 0, 1, ..., L−1, of the form f_P(S, i) = f_P(T, j), where i = j + d (mod L), are detected and grouped in parallel on individual nodes of a computer cluster. Position-specific hash tables are defined by conceptually dividing each alignment diagonal into nonoverlapping segments of length L, as indicated by dashed lines in Figure 13.7(c). A total of L^2 positional hash tables H(d, k) are constructed, one for each diagonal d = 0, 1, ..., L−1 and diagonal position k = 0, 1, ..., L−1. Matches are detected by using the values of f_P(S, i) and f_P(T, j) as keys for storing the indices ⌊i/L⌋ and ⌊j/L⌋ into specific hash table bins. A match of the form f_P(S, i) = f_P(T, j), where i = j + d (mod L) and j = k (mod L), is detected whenever ⌊i/L⌋ and ⌊j/L⌋ occur in the same bin of hash table H(d, k). Further implementation details are described in [4]; a much-simplified sketch of one positional hash table follows.
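The following single-process sketch spells out the pattern function f_P directly and fills one table H(d, k); each of the L^2 tables can be filled independently, for example on separate cluster nodes. The key packing, match grouping, and distribution machinery of Pash [4] are omitted, so this is an illustration of the scheme rather than its implementation.

```python
from collections import defaultdict

def pattern_key(seq, i, offsets):
    """Pattern function f_P(seq, i): the bases of seq at the seed offsets."""
    return ''.join(seq[i + x] for x in offsets)

def table_matches(S, T, offsets, d, k, L):
    """Matches collected by one positional hash table H(d, k): pairs (i, j)
    with f_P(S, i) = f_P(T, j), j = k (mod L), and i - j = d (mod L).
    Hits are grouped by segment pair (i//L, j//L), so collinear runs can
    be scored without base-pair-level alignment."""
    span = max(offsets) + 1
    bins = defaultdict(list)                     # k-mer key -> positions j in T
    for j in range(k, len(T) - span + 1, L):     # all j with j = k (mod L)
        bins[pattern_key(T, j, offsets)].append(j)
    grouped = defaultdict(int)                   # (i//L, j//L) -> match count
    for i in range((k + d) % L, len(S) - span + 1, L):  # i = k + d (mod L)
        for j in bins.get(pattern_key(S, i, offsets), []):
            grouped[(i // L, j // L)] += 1
    return dict(grouped)
```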
13.5 The UD-CSD Benchmark for Anchoring
The choice of a program for an anchoring application depends on a number of data parameters, the data volume, and the computational resources available for the task. To facilitate selection of the most suitable program, it would therefore be useful to test candidates on a benchmark. Toward this end, we developed a benchmark that includes segmental duplications, an important feature of mammalian genomes, and particularly of the genome of humans and other primates. The duplications are particularly challenging because they limit the sequence uniqueness necessary for anchoring. The UD-CSD benchmark is named after five key aspects: the unique fraction of the genome; the duplicated fraction; coevolution of the duplicated fraction, during which uniqueness is gradually developed; speciation; and divergence of orthologous reads. As illustrated in Figure 13.8, the UD-CSD benchmark is parametrized by four parameters: the number of unique reads k; the number of duplicated reads n; the coevolution parameter x; and the divergence parameter y.

Using the UD-CSD benchmark, we evaluated the sensitivity and specificity of Pash against an established seed-and-extend comparison algorithm, BLAT. We first generated k + 1 random reads of size m base pairs, then duplicated the last read n − 1 times, as illustrated in Figure 13.8(a), obtaining seed reads s_i, i = 1, ..., n + k. This corresponds to a genome in which the k reads represent unique regions and the n duplicated reads represent duplicated regions. Next, we evolved each read s_i such that each base had a mutation probability of x and each base was mutated at most once, obtaining reads r_i, i = 1, ..., n + k. Of the mutations, 1% were indels, half insertions and half deletions; the indel lengths were chosen using a geometric probability distribution with parameter p = 0.9, imposing a maximum length of 10. The remaining mutations were substitutions. This process approximates a period of coevolution of two related species during which duplicated regions acquire the uniqueness (parametrized by x) necessary for anchoring. Next, two copies of each read were generated, and one was assigned to each of two simulated genomes of descendant species, as shown in Figure 13.8(c); this corresponds to a speciation event. Subsequently, each read evolved independently such that each base had a mutation probability of y, as illustrated in Figure 13.8(d); this corresponds to a period of divergence between the two related species. Finally, we obtained the read sets {r_{1,1}, ..., r_{n+k,1}} and {r_{1,2}, ..., r_{n+k,2}}. We then employed Pash and BLAT to anchor the read set {r_{1,1}, ..., r_{n+k,1}} onto {r_{1,2}, ..., r_{n+k,2}}, running each program and then filtering its output so that only the top ten matches for each read were kept. Any time a read r_{i,1} is matched onto r_{i,2}, we consider this a true positive; we count how many true positives are found to evaluate the accuracy of the anchoring program.
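The read-evolution step of the benchmark can be sketched as follows. This is a minimal simulation under the stated parameters (1% indels, geometric lengths with p = 0.9 capped at 10); the "each base mutated at most once" bookkeeping of the actual benchmark generator is simplified away here.

```python
import random

def geometric(p):
    """Number of Bernoulli trials up to the first success
    (p = 0.9 makes short indels most likely)."""
    n = 1
    while random.random() > p:
        n += 1
    return n

def evolve(read, rate, indel_frac=0.01, p_geom=0.9, max_indel=10):
    """Mutate each base with probability `rate`; 1% of mutations are
    indels (half insertions, half deletions), the rest substitutions."""
    out, i = [], 0
    while i < len(read):
        if random.random() >= rate:              # base unchanged
            out.append(read[i]); i += 1
        elif random.random() >= indel_frac:      # substitution (99%)
            out.append(random.choice([b for b in 'ACGT' if b != read[i]]))
            i += 1
        else:                                    # indel (1%)
            length = min(geometric(p_geom), max_indel)
            if random.random() < 0.5:            # insertion before this base
                out.extend(random.choice('ACGT') for _ in range(length))
                out.append(read[i]); i += 1
            else:                                # deletion of `length` bases
                i += length
    return ''.join(out)
```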
Figure 13.8 The UD-CSD (unique, duplicated; coevolution, speciation, divergence) anchoring benchmark. (a) Randomly generate k unique reads and n duplicated reads. (b) Coevolution: each base mutates with probability x. (c) Speciation: each read duplicates. (d) Divergence: each base mutates with probability y.
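The following is a rough, self-contained sketch of the read-generation procedure just described; the function and parameter names are ours, and the mutation model is simplified to the stated settings (1% of mutations are indels, half insertions and half deletions, geometric lengths with p = 0.9 capped at 10, each base mutating at most once).

import random

BASES = "ACGT"

def geom_length(p=0.9, cap=10):
    # geometric indel length: P(len = n) = p * (1 - p)^(n - 1), capped at 10
    n = 1
    while random.random() > p and n < cap:
        n += 1
    return n

def mutate(read, rate):
    """Return a copy of read in which each base mutates with probability rate."""
    out = []
    for base in read:
        if random.random() >= rate:
            out.append(base)                           # base survives unchanged
        elif random.random() < 0.01:                   # 1% of mutations are indels
            if random.random() < 0.5:                  # insertion before this base
                out.append("".join(random.choice(BASES)
                                   for _ in range(geom_length())) + base)
            # else: deletion, so the base is simply dropped
        else:                                          # substitution
            out.append(random.choice([b for b in BASES if b != base]))
    return "".join(out)

def ud_csd(k, n, x, y, m=200):
    """Generate the two descendant read sets {r_i,1} and {r_i,2}."""
    seeds = ["".join(random.choice(BASES) for _ in range(m)) for _ in range(k + 1)]
    seeds += [seeds[-1]] * (n - 1)          # duplicate the last read n - 1 times
    coevolved = [mutate(s, x) for s in seeds]   # coevolution: uniqueness develops
    genome1 = [mutate(r, y) for r in coevolved] # speciation, then independent
    genome2 = [mutate(r, y) for r in coevolved] # divergence of each copy
    return genome1, genome2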
One may object to our considering the top 10 matches and insist that only the top match should count. Our more relaxed criterion is justified by the fact that anchoring typically involves a reciprocal-best-match step. For example, a 10-reciprocal-best-match step would sieve out false matches and achieve specific anchoring as long as the correct match is among the 10 top-scoring reads. Assuming random error, one may show that the expected number of false matches remains constant (10 in our case) irrespective of the total number of reads matched.

For our experiment, we chose a read length of 200 bases and varied the total number of reads from 5,000 to 2 million. The values of k and n were always chosen such that 90% of the starting reads were unique and 10% were duplicated; for example, for the 2-million-read case, we started with k = 1.8 million unique reads and n = 200,000 duplicated reads. In Figure 13.9, we present execution time as a function of the number of reads for Pash, using a gapped pattern of weight 13 and length 21 and a word offset of 8, and for BLAT, using medium-sensitivity settings (minScore=20, minIdentity=0, tileSize=11, minMatch=2). For a coevolution of 25% and divergences of 1% and 5%, Pash and BLAT achieve comparable sensitivity (the numbers of true positives found are within 1% of each other). This result is significant because it indicates that the time-consuming base-pair-level alignments performed by BLAT are not necessary for accurate anchoring; the word matching performed by Pash suffices. Consequently, when comparing closely related species, such as human and chimpanzee, Pash is as sensitive as BLAT and has an execution-time advantage. For a divergence of 5%, however, such as when comparing human and rhesus, the speed advantage of Pash diminishes.

Figure 13.9 Anchoring times for BLAT and Pash for a coevolution of 25% and for divergences of 1% and 5%.
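As a small sketch of the accuracy bookkeeping described above, assuming each program's filtered output has been parsed into a mapping from the index i of each read ri,1 to the indices of its ten best-scoring candidate matches among {rj,2}:

def count_true_positives(top_matches):
    """top_matches maps read index i (for r_i,1) to the list of read indices j
    (for r_j,2) of its ten best-scoring matches, best first. A read counts as
    a true positive when its own index appears among its candidates."""
    return sum(1 for i, candidates in top_matches.items() if i in candidates)

# Example: read 7 is anchored correctly, read 8 is not, so the count is 1.
# Sensitivity is the count divided by the total number of reads, n + k.
# count_true_positives({7: [42, 7, 13], 8: [91, 55]})  ->  1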
13.6 Conclusions

In this chapter we discussed the importance of effective, scalable, and parallelizable anchoring techniques for enabling high-throughput applications of comparative genomics. We surveyed a variety of applications in which anchoring plays a key role, including comparisons of assembled genomes to determine rearrangements and conservation, and mutual comparisons of fragmentary genomes to bootstrap genome assembly. A rich area of interest relies on comparisons of assembled genomes to fragmentary genomes: anchoring of reads onto genomes to determine copy-number variation and polymorphism, to perform phylogenetic footprinting and phylogenetic shadowing, and to drive comparative assembly. Anchoring of paired ends is employed to detect cross-mammalian conservation of chromosome structure and to infer chromosomal aberrations and chimeric transcripts in cancer. We reviewed existing methods that rely on the seed-and-extend strategy,
and discussed a novel anchoring approach, Positional hashing (implemented in the Pash program), that is inherently parallelizable, enables a user to specify gapped or contiguous seeds, and offers a practical tradeoff between sensitivity and speed. It is unlikely that a single program will be optimal for all anchoring applications. It is therefore important to develop benchmarks and evaluation procedures so that the best available program can be selected for the parameters of a specific anchoring task. Toward this end, we proposed UD-CSD, a simulated benchmark with tunable parameters, and evaluated the anchoring performance of Pash and BLAT on it.
References

[1] Altschul, S. F., et al., “Basic Local Alignment Search Tool,” J. Mol. Biol., Vol. 215, No. 3, 1990, pp. 403–410.
[2] Altschul, S. F., et al., “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs,” Nucleic Acids Research, Vol. 25, No. 17, 1997, pp. 3389–3402.
[3] Kent, W. J., “BLAT—The BLAST-Like Alignment Tool,” Genome Res., Vol. 12, 2002, pp. 656–664.
[4] Kalafus, K. J., A. R. Jackson, and A. Milosavljevic, “Pash: Efficient Genome-Scale Sequence Anchoring by Positional Hashing,” Genome Res., Vol. 14, 2004, pp. 672–678.
[5] Waterston, R. H., et al., “Initial Sequencing and Comparative Analysis of the Mouse Genome,” Nature, Vol. 420, No. 6915, 2002, pp. 520–562.
[6] The Rhesus Macaque Genome Sequencing and Analysis Consortium, “Evolutionary and Biomedical Insights from the Rhesus Macaque Genome,” Science, Vol. 316, No. 5822, 2007, pp. 222–234.
[7] Harris, R. A., J. Rogers, and A. Milosavljevic, “Human-Specific Changes of Genome Structure Detected by Genomic Triangulation,” Science, Vol. 316, No. 5822, 2007, pp. 235–237.
[8] Mural, R. J., et al., “A Comparison of Whole-Genome Shotgun-Derived Mouse Chromosome 16 and the Human Genome,” Science, Vol. 296, No. 5573, 2002, pp. 1661–1671.
[9] Pevzner, P., and G. Tesler, “Genome Rearrangements in Mammalian Evolution: Lessons from Human and Mouse Genomes,” Genome Res., Vol. 13, No. 1, 2003, pp. 37–45.
[10] Kent, W. J., et al., “Evolution’s Cauldron: Duplication, Deletion, and Rearrangement in the Mouse and Human Genomes,” PNAS, Vol. 100, No. 20, 2003, pp. 11484–11489.
[11] Ma, J., et al., “Reconstructing Contiguous Regions of an Ancestral Genome,” Genome Res., Vol. 16, No. 12, 2006, pp. 1557–1565.
[12] Brudno, M., et al., “LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale Multiple Alignment of Genomic DNA,” Genome Res., Vol. 13, No. 4, 2003, pp. 721–731.
[13] Bray, N., and L. Pachter, “MAVID: Constrained Ancestral Alignment of Multiple Sequences,” Genome Res., Vol. 14, No. 4, 2004, pp. 693–699.
[14] Morgenstern, B., “DIALIGN 2: Improvement of the Segment-to-Segment Approach to Multiple Sequence Alignment,” Bioinformatics, Vol. 15, No. 3, 1999, pp. 211–218.
[15] Raphael, B., et al., “A Novel Method for Multiple Alignment of Sequences with Repeated and Shuffled Elements,” Genome Res., Vol. 14, No. 11, 2004, pp. 2336–2346.
[16] Schwartz, S., et al., “Human-Mouse Alignments with BLASTZ,” Genome Res., Vol. 13, No. 1, 2003, pp. 103–107.
[17] Brudno, M., et al., “Glocal Alignment: Finding Rearrangements During Alignment,” Bioinformatics, Vol. 19, Suppl. 1, 2003, pp. i54–i62.
[18] Batzoglou, S., “Algorithmic Challenges in Mammalian Genome Sequence Assembly,” in Encyclopedia of Genomics, Proteomics, and Bioinformatics, M. Dunn, et al., (eds.), New York: John Wiley & Sons, 2005.
[19] Veeramachaneni, V., P. Berman, and W. Miller, “Aligning Two Fragmented Sequences,” Discrete Applied Mathematics, Vol. 127, 2003, pp. 119–143.
[20] Wang, T. L., et al., “Digital Karyotyping,” Proc. Natl. Acad. Sci. USA, Vol. 99, No. 25, 2002, pp. 16156–16161.
[21] Smigielski, E. M., et al., “dbSNP: A Database of Single Nucleotide Polymorphisms,” Nucl. Acids Res., Vol. 28, No. 1, 2000, pp. 352–355.
[22] The International HapMap Consortium, “The International HapMap Project,” Nature, Vol. 426, No. 6968, 2003, pp. 789–796.
[23] Ning, Z., M. Caccamo, and J. C. Mullikin, “ssahaSNP—A Polymorphism Detection Tool on a Whole Genome Scale,” IEEE Computational Systems Bioinformatics Conference, 2005.
[24] Mills, R. E., et al., “An Initial Map of Insertion and Deletion (INDEL) Variation in the Human Genome,” Genome Res., Vol. 16, No. 9, 2006, pp. 1182–1190.
[25] Gumucio, D. L., et al., “Phylogenetic Footprinting Reveals a Nuclear Protein Which Binds to Silencer Sequences in the Human Gamma and Epsilon Globin Genes,” Mol. Cell Biol., Vol. 12, No. 11, 1992, pp. 4919–4929.
[26] Blanchette, M., B. Schwikowski, and M. Tompa, “Algorithms for Phylogenetic Footprinting,” J. Comput. Biol., Vol. 9, No. 2, 2002, pp. 211–223.
[27] Boffelli, D., et al., “Phylogenetic Shadowing of Primate Sequences to Find Functional Regions of the Human Genome,” Science, Vol. 299, No. 5611, 2003, pp. 1391–1394.
[28] Bouck, J. B., M. L. Metzker, and R. A. Gibbs, “Shotgun Sample Sequence Comparisons Between Mouse and Human Genomes,” Nature Genetics, Vol. 25, No. 1, 2000, pp. 31–33.
[29] Chen, R., et al., “Comparing Vertebrate Whole-Genome Shotgun Reads to the Human Genome,” Genome Res., Vol. 11, No. 11, 2001, pp. 1807–1816.
[30] Margulies, E. H., et al., “An Initial Strategy for the Systematic Identification of Functional Elements in the Human Genome by Low-Redundancy Comparative Sequencing,” Proc. Natl. Acad. Sci. USA, Vol. 102, No. 13, 2005, pp. 4795–4800.
[31] Pop, M., et al., “Comparative Genome Assembly,” Brief Bioinform., Vol. 5, No. 3, 2004, pp. 237–248.
[32] Milosavljevic, A., “DNA Sequence Similarity Recognition by Hybridization to Short Oligomers,” U.S. Patent No. 6001562, 1999.
[33] Fujiyama, A., et al., “Construction and Analysis of a Human-Chimpanzee Comparative Clone Map,” Science, Vol. 295, No. 5552, 2002, pp. 131–134.
[34] Larkin, D. M., et al., “A Cattle-Human Comparative Map Built with Cattle BAC-Ends and Human Genome Sequence,” Genome Res., Vol. 13, No. 8, 2003, pp. 1966–1972.
[35] Poulsen, T. S., and H. E. Johnsen, “BAC End Sequencing,” Methods Mol. Biol., Vol. 255, 2004, pp. 157–161.
[36] Zhao, S., et al., “Human, Mouse, and Rat Genome Large-Scale Rearrangements: Stability Versus Speciation,” Genome Res., Vol. 14, No. 10, 2004, pp. 1851–1860.
[37] Volik, S., et al., “End-Sequence Profiling: Sequence-Based Analysis of Aberrant Genomes,” Proc. Natl. Acad. Sci. USA, Vol. 100, No. 13, 2003, pp. 7696–7701.
[38] Volik, S., et al., “Decoding the Fine-Scale Structure of a Breast Cancer Genome and Transcriptome,” Genome Res., 2006.
[39] Tuzun, E., et al., “Fine-Scale Structural Variation of the Human Genome,” Nat. Genet., Vol. 37, No. 7, 2005, pp. 727–732.
[40] Newman, T. L., et al., “A Genome-Wide Survey of Structural Variation Between Human and Chimpanzee,” Genome Res., Vol. 15, No. 10, 2005, pp. 1344–1356.
[41] Ng, P., et al., “Multiplex Sequencing of Paired-End Ditags (MS-PET): A Strategy for the Ultra-High-Throughput Analysis of Transcriptomes and Genomes,” Nucl. Acids Res., Vol. 34, No. 12, 2006, p. e84.
[42] Rat Genome Sequencing Project Consortium, “Genome Sequence of the Brown Norway Rat Yields Insights into Mammalian Evolution,” Nature, Vol. 428, 2004, pp. 493–521.
[43] Csuros, M., and A. Milosavljevic, “Pooled Genomic Indexing (PGI): Analysis and Design of Experiments,” Journal of Computational Biology, Vol. 11, No. 2, 2004, pp. 1001–1021.
[44] Milosavljevic, A., et al., “Pooled Genomic Indexing of Rhesus Macaque,” Genome Res., Vol. 15, No. 2, 2005, pp. 292–301.
[45] WU-BLAST, 2007, http://blast.wustl.edu.
[46] Li, M., et al., “PatternHunter II: Highly Sensitive and Fast Homology Search,” Journal of Bioinformatics and Computational Biology, Vol. 2, No. 3, 2004, pp. 417–440.
[47] Ma, B., J. Tromp, and M. Li, “PatternHunter: Faster and More Sensitive Homology Search,” Bioinformatics, Vol. 18, No. 3, 2002, pp. 440–445.
[48] Ning, Z., A. J. Cox, and J. C. Mullikin, “SSAHA: A Fast Search Method for Large DNA Databases,” Genome Res., Vol. 11, No. 10, 2001, pp. 1725–1729.
[49] Pearson, W. R., and D. J. Lipman, “Improved Tools for Biological Sequence Comparison,” Proc. Natl. Acad. Sci. USA, Vol. 85, No. 8, 1988, pp. 2444–2448.
About the Authors

Sun Kim is an associate professor in the School of Informatics, the associate director and a founding faculty member of the bioinformatics program, an associated faculty at the Center for Genomics and Bioinformatics, and an affiliated faculty at the Biocomplexity Institute at Indiana University–Bloomington. He also worked at DuPont Central Research from 1998 to 2001 and at the University of Illinois at Urbana from 1997 to 1998. Professor Kim received a B.S., an M.S., and a Ph.D. in computer science from Seoul National University, the Korea Advanced Institute of Science and Technology (KAIST), and the University of Iowa, respectively. His research interests lie in bioinformatics and its related areas, such as string-pattern matching, data mining, and combinatorial search techniques. He was a recipient of the Outstanding Junior Faculty Award at Indiana University in 2004, a CAREER Award (2003–2008) from the U.S. National Science Foundation, and the Achievement Award at DuPont Central Research in 2000. His e-mail address is [email protected].

Haixu Tang is an assistant professor in the School of Informatics and an affiliated faculty of the Center for Genomics and Bioinformatics (CGB) at Indiana University, Bloomington, where he has worked since 2004. Professor Tang received a Ph.D. in molecular biology from the Shanghai Institute of Biochemistry in 1998; between 1999 and 2001, he was a postdoctoral associate in the Department of Mathematics at the University of Southern California; between 2001 and 2004, he was an assistant project scientist in the Department of Computer Science and Engineering at the University of California at San Diego. Professor Tang was a recipient of the NSF CAREER Award in 2007. His e-mail address is [email protected].
Elaine R. Mardis is an associate professor of genetics and molecular microbiology and the codirector of the Genome Sequencing Center at the Washington University School of Medicine in St. Louis, Missouri. She has a Ph.D. in biochemistry and chemistry and a B.S. in zoology from the University of Oklahoma. Dr. Mardis has been at the Washington University School of Medicine’s Genome Sequencing Center (GSC) since 1993, playing a pivotal role in the evaluation, optimization, and application of novel sequencing instrumentation, chemistry, and molecular biology. As the director of technology development, she helped develop automation and pipelines for sequencing the human genome. She currently orchestrates the GSC’s efforts to explore next generation sequencing technologies and to transition them into the GSC’s production sequencing efforts. Dr. Mardis also serves on several NIH study sections and private scientific advisory boards. Her e-mail address is [email protected].

Curt Balch is a research assistant professor in the Department of Cellular and Integrative Physiology at the Indiana University School of Medicine. He received a Ph.D. from the University of Cincinnati and was employed in the biotechnology industry for 5 years prior to his current position. His research interests include cancer epigenetics, systems biology approaches to cancer, and epigenetic therapies. His e-mail address is [email protected].

Paola Bonizzoni is a professor of computer science at the University di Milano-Bicocca in Milan. She received an M.S. in computer science from the University di Milano in 1988 and a Ph.D. in computer science from the University di Milano-Torino in 1993. She has been a visiting research associate at the University of Colorado at Boulder. Her e-mail address is [email protected].

Jiacheng Chen received an M.S. in computer science from Stony Brook University in 2006. His current interests include the design and application of computational algorithms and software engineering. His e-mail address is [email protected].

George M. Church is a professor of genetics at Harvard Medical School and the director of the Center for Computational Genetics. With degrees from Duke University in chemistry and zoology, he coauthored research on 3D software and RNA structure with Sung-Hou Kim. His Ph.D. from Harvard in biochemistry and molecular biology with Wally Gilbert included the first direct genomic sequencing method in 1984. He initiated the Human Genome Project, first as a research scientist at the newly formed Biogen Inc. and then as a Monsanto Life Sciences Research Fellow at UCSF. Professor Church invented the broadly applied concepts of molecular multiplexing and tags, homologous recombination methods, and array DNA synthesizers. The technology transfer of automated sequencing and software to the Genome Therapeutics Corporation resulted in the first commercial genome sequence (the human pathogen H. pylori, 1994). He has served in advisory roles for 12 journals, 5 granting agencies, and 22 biotech companies. His current research focuses on integrating
biosystems modeling with personal genomics and synthetic biology. His e-mail address is http://arep.med.harvard.edu/gmc/email.html.

Cristian Coarfa is a postdoctoral research associate in the Bioinformatics Research Laboratory at the Human Genome Sequencing Center within the Department of Molecular and Human Genetics at the Baylor College of Medicine. He received a B.Sc. in computer science from the Politehnica University of Bucharest and an M.S. and a Ph.D. in computer science from Rice University. His current research interests include high-throughput genomic discovery pipelines, parallel computing, and software optimization. His e-mail address is [email protected].

Colin C. Collins is an associate professor in the Department of Urology at the University of California at San Francisco and a visiting professor at the Beijing Genome Institute. Dr. Collins’ research is best described as translational genomics, where mathematics, genomics, computer science, and clinical science ultimately converge in diagnostics and therapeutics. His current research is aimed at sequencing tumor genomes and developing broad-spectrum DNA-based biomarkers for cancer. His e-mail address is [email protected].

Gianluca Della Vedova is an associate professor in the Department of Statistics, University di Milano-Bicocca. He holds a Ph.D. and an M.Sc. in computer science from the University di Milano. His research interests focus on the design of combinatorial algorithms in bioinformatics and graph theory. His e-mail address is [email protected].

Riccardo Dondi is an assistant professor at the University of Bergamo. His research area is computational biology, in particular the design of algorithms and the study of the computational complexity of biological problems, mainly the construction and comparison of evolutionary trees. His more recent focus is on haplotype inference problems. His e-mail address is [email protected].

Lewis J. Frey is an assistant professor in the Department of Biomedical Informatics at the University of Utah. His research interests include work on the cancer biomedical informatics grid (caBIG™) for the National Cancer Institute and how it can be applied to personalized medicine. He conducts research in computational biomedical informatics, particularly in methods of combining multiple data sets of different modalities for the purpose of knowledge discovery. His e-mail address is [email protected].

Baback Gharizadeh is a scientist at the Stanford Genome Technology Center at Stanford University. He is one of the pioneers of the pyrosequencing technology and has improved its chemistry and read length since its advent, applying the improvements to a range of relevant applications. His e-mail address is [email protected].

Tim H.-M. Huang is a professor in the Division of Human Cancer Genetics at The Ohio State University. He has pioneered the development of epigenetic microarrays for the global analysis of epigenetic alterations in cancer. Dr.
Huang completed his Ph.D. study in genetics at the University of California, Davis in 1989. His postdoctoral training was in the area of clinical cytogenetics at Baylor College of Medicine, and he was subsequently certified by the American Board of Medical Genetics in 1993. Dr. Huang began to develop an independent research program in cancer epigenetics at the University of Missouri (1991–2003). His epigenetics laboratory relocated to The Ohio State University in 2003. Currently, Dr. Huang is a member of the Cancer Genetics Study Section and has served in numerous ad hoc reviews for NIH’s intramural programs and for investigator-initiated, cancer center, program project, and SPORE grants. He is an associate editor of Cancer Research, Cancer Informatics, and Cancer Genomics and Proteomics and is a fellow of the American Association for the Advancement of Science. His e-mail address is [email protected].

Roxana Jalili is a research associate at the Stanford Genome Technology Center at Stanford University. She received an M.S. in chemical engineering with a concentration in biotechnology from San Jose State University. Her recent research endeavors include developing a novel method for protein detection using pyrosequencing, as well as high-throughput DNA sequencing employing 454 pyrosequencing technology. Her e-mail address is [email protected].

Jing Li is an assistant professor in the Department of Electrical Engineering and Computer Science at Case Western Reserve University. He obtained a Ph.D. in computer science from the University of California, Riverside in 2004 and a B.S. in statistics from Peking University, China, in 1995. His research interest is in the area of computational biology. More specifically, he focuses on the design and implementation of efficient computational and statistical algorithms for the characterization of DNA variation in human populations and for the identification of correlations between DNA variation and phenotypic variation. His e-mail address is [email protected].

Victor Maojo is a professor of computer science in the Department of Artificial Intelligence, Universidad Politecnica de Madrid, where he is also the director of the Biomedical Informatics Group. He is currently on the editorial boards of the journals Methods of Information in Medicine and Journal of Intelligent and Fuzzy Systems. He works on different projects in the biomedical informatics area, funded by national agencies and the European Commission. His research interests include methods for database integration, ontologies, text and data mining, clinico-genomic trials, and, more recently, nanomedicine. His e-mail address is [email protected].

Aleksandar Milosavljevic is an associate professor of molecular and human genetics at Baylor College of Medicine (BCM). He also directs the Bioinformatics Research Laboratory and is affiliated with the Human Genome Sequencing Center at BCM. He received a Dipl.Ing. in electrical engineering from the University of Belgrade and a Ph.D. in computer and information sciences from the University of California at Santa Cruz. His current research
interests include the development of genomic and informatic methods and software systems for genomic biomedicine. His e-mail address is [email protected].

Joyce A. Mitchell is a professor and the chair of the Department of Biomedical Informatics at the University of Utah. She developed the Genetics Home Reference (http://ghr.nlm.nih.gov) to bridge genomics research results with consumer health interests in genetic diseases. Her current research interests are focused on clinical bioinformatics and translational informatics. She is particularly interested in the form in which pieces of genomic information will be stored in electronic health systems and used in clinical decision support. She is currently working on several projects on how to use data from tests of drug-metabolizing genes in EMR decision support and on how the cyberinfrastructure of health systems needs to evolve to accommodate personalized medicine. She continues to work as the senior advisor on the Genetics Home Reference at the NLM. Her e-mail address is [email protected].

Kenneth P. Nephew is a professor of cellular and integrative physiology and an adjunct professor of obstetrics and gynecology in the Medical Sciences Program at the Indiana University School of Medicine. He is the assistant director for basic science research at the Indiana University Cancer Center and the program leader for the Walther Cancer Institute. He received a Ph.D. from The Ohio State University in 1991. He serves as a regular member of various grant review committees, including the Cancer Biomarkers review group at the National Institutes of Health (NIH) and the Tumor Biochemistry and Endocrinology panel of the American Cancer Society (ACS). His current research interests, supported by awards from the NIH, the ACS, and the Department of Defense Breast and Ovarian Cancer Research Programs, include cancer epigenetics (DNA methylation and histone modifications), cancer biomarkers, high-throughput technology for analyzing the epigenome, drug resistance, ovarian and breast cancers, and steroid hormone action (estrogen receptor signaling). His e-mail address is [email protected].

David Okou holds an M.S. in biochemistry from the University of Abidjan and a Ph.D. in biochemistry from Clark Atlanta University. He is a senior postdoctoral fellow in the Department of Human Genetics at the Emory University School of Medicine. His research focuses on the use of next generation resequencing technologies to rapidly and inexpensively detect genetic variation that contributes to human diseases in general, and autism in particular. His e-mail address is [email protected].

Gregory J. Porreca is a researcher in the Department of Genetics at Harvard Medical School. His research interests focus on the development of technology for nucleic acid sequence analysis. He holds a Ph.D. in genetics from Harvard University and a B.S. in biology and computer science from the College of New Jersey. His e-mail address is [email protected].
Benjamin J. Raphael is an assistant professor in the Department of Computer Science and the Center for Computational Molecular Biology at Brown University. He received an S.B. from the Massachusetts Institute of Technology and a Ph.D. from the University of California at San Diego, both in mathematics. His research interests include computational cancer genomics and the design of algorithms for biological sequence analysis. His e-mail address is [email protected].

Mostafa Ronaghi is a principal investigator at Stanford University focused on developing novel tools for molecular diagnostics. He has a Ph.D. from The Royal Institute of Technology in Sweden. Dr. Ronaghi has written more than 50 peer-reviewed publications in journals and books. Among the technologies he has developed are pyrosequencing and the molecular inversion probe assay, both of which have been widely used. His e-mail address is [email protected].

Jay Shendure is an assistant professor in the Department of Genome Sciences at the University of Washington. He received a Ph.D. in 2005 and an M.D. in 2007 from Harvard Medical School. His e-mail address is [email protected].

Steven Skiena is a professor of computer science at Stony Brook University. His research interests include the design of graph, string, and geometric algorithms and their applications (particularly to biology). He is the author of four books, including The Algorithm Design Manual and Calculated Bets: Computers, Gambling, and Mathematical Modeling to Win. He was a recipient of the ONR Young Investigator Award and the IEEE Computer Science and Engineering Undergraduate Teaching Award. His e-mail address is [email protected].

Stas Volik has been working in the field of genomics for 15 years, specializing in oncogenomics for the last 10 years. Together with Dr. Collins, he has invented, published, and patented end sequence profiling technology for the comprehensive analysis of the structure of tumor genomes. His e-mail address is [email protected].

Michael E. Zwick is an assistant professor in the Department of Human Genetics at the Emory University School of Medicine in Atlanta, Georgia. His laboratory’s research is centered on the application of cutting-edge technologies that rapidly identify genomic variation. His research ultimately aims to characterize the effects of genomic variation in systems ranging broadly from the genetic mechanisms of human disease, including autism and mental retardation, to Drosophila population genomics and the resequencing of biodefense pathogens, such as anthrax. His e-mail address is [email protected].
Index

454 GS-20 sequencer, 7 454 Life Sciences, 214 454 sequencing, 17–20 advantages/challenges, 20 chemistry, 18–19 defined, 17 ultrabroad, 20 ultradeep, 20 whole-genome, 19 ABACUS algorithmic improvements and, 39 base calling, 32 calls, 30 defined, 30 grid alignment, 32 quality score, 30, 31 quality score distribution, 31 resequencing array data analysis with, 29–33 ABI capillary sequencing (ACS), 37 ACGT (Advancing Clinico-Genomics Trials on Cancer), 60 Adaptive Background genotype calling scheme. See ABACUS Affymetrix GeneChip DNA Analysis software, 33–34 Affymetrix RAs, 27 Alignment ABACUS grid, 32 mRNA, 92 multiple sequence, 107 overlap, 110 Alleles, 152 AMASS, 87, 117–19 defined, 87, 117 fragment identification, 119 fragment representation, 118 overview, 117–18 satellite matching, 119 Amplification of intermethylated sites (AIMS), 206 Ampligase, 49 Amplisomes, 191 Anchoring in colinear blocks, 227 of paired ends, 232–33 in phylogenetic footprinting, 232 read, 230–32 by seed-and-extend, 234–36 times for BLAT, 240 UD-CSD benchmark, 237–39 Anthrax resequencing, 35–37 Arachne, 110–11 contig assembly, 111 defined, 110 operation, 110–11 repeat contig/supercontig detection, 111 See also Fragment assembly algorithms
Array-based pyrosequencing, 15–21 454 sequencing chemistry, 18–19 advantages/challenges, 20 defined, 17 technology applications, 19 ultrabroad sequencing, 20 ultradeep amplicon sequencing, 20 whole-genome sequencing, 19 See also Pyrosequencing Array comparative genome hybridization (aCGH), 148, 185–86 defined, 185 limitations, 186 normal/tumor DNA, 186 use in genomic analysis of cancer, 185 Assembly validation, 85–87 clone coverage analysis, 87 compression/expansion statistics, 86 information theoretic probabilistic approach, 87 input, 85 output, 85–86 TAMPA, 86 ATP, 16–17 dATP, 17 dATP- -S, 17 BACs, 95, 188, 230 chimeric, 189 clones, 189 fishing, 233, 234 mapped, 234 Bacterial artificial chromosomes. See BACs Bactig-ordering problem, 110 Bambus, 87, 89–92 contig ordering, 90–91 contig orientation, 90 edge bundling, 89–90 hierarchical scaffolding, 91 mRNA, ESTs, BAC ends, paired reads alignment, 92 untangling, 91 See also Scaffold generation Bayesian Frequencies Haplotype Inference (BFHI) problem, 176 Bayesian inference, 175 BEAMing, 44 Biomedical database information, 56–57 BioMOBY, 62 BLAST, 108, 226
BLAT, 188, 226, 239 Blocks anchors submerged in, 227 defined, 154 inconsistency, 227 partitioning into, 155 purposes, 228 similarities between, 228 Bridge graphs, 92, 93 Bridge PCR, 44 CaIntegrator, 66 Cancer aCGH and, 185 epigenetics and, 200–201 genomic alterations, 183–92 Cancer Biomedical Informatics Grid (caBIG), 62 Cancer Data Standards Repository (caDSR), 63–64 Cancer Genome Atlas, 185 Cancer Text Information Extraction System (caTIES), 64 Cancer Translational Research Informatics Platform (caTRIP), 64–66 defined, 64 Federated Query Builder, 67 CAP, 104 CAP2, 104 CAP3, 104–5 automatics clipping, 104 defined, 104 mate-pair constraints, 105 overlays, 104 See also Fragment assembly algorithms CAR, 228 CaTISSUE Clinical Annotation Engine (CAE), 64 CaTISSUE Core, 64 Celera Scaffolder, 109–10 bactig graph construction, 109 defined, 109 path-merging algorithm, 110 Celera Whole Genome Assembler (WGA), 105–10 A-statistics, 108 defined, 105 design principle, 107–8 Kececioglu and Myers approach, 105–7 overlapper, 108
Index unitigger, 108–9 U-unitigs, 108 See also Fragment assembly algorithms ChAP, 207 Chromatin accessibility complex (CHRAC), 199 Chromatin immunoprecipitation (ChIP), 149 ChIP-cloning, 207 ChIP-on-chip, 207 ChIP-paired-end-tagging, 207 coupled to MPSS, 208 Clone coverage analysis, 87 Clone reads, mapping, 233–34 Cluster computing, 225–26 CMOS image sensor, 21 Coalescent model, 158–61, 162 Cohybridization, 212 Comparative assembly, 146 Comparisons, 226–28 algorithms, 225 assembled genome and genome fragments, 230–34 assembled genomes, 227 genome fragments, 229–30 primate genome, 226 Compression/expansion statistics, 86 Consensus sequences, 102 Contigs assembly, 110, 111 building, 118–19 ordering, 90–91 orientation, 90 repeat, detecting, 111 Copy number variation (CNV), 39 Coulter counter, 10 Cyclic-array methods, 48 Data grids, 59 De Bruijn subgraphs, 133–34 Differential methylation hybridization (DMH), 203, 212 Digital karyotyping, 230 DiseaseCard, 62, 63 DNA footprinting, 212 DNAse chip, 210 DNAsel array, 210 DNA sequencing in nucleic acid sequence determination, 15
resequencing arrays in, 25–40 DNA sequencing technology background, 3–4 dideoxynucleotide termination, 3 massively parallel approach goals, 6 massively parallel scale development rationale, 4–5 overview, 3–6 at RNA level, 4 Dynamic programming algorithm, 174 Emulsion 454, 18 Emulsion PCR, 44 concentration, lowering, 47 protocols, 47, 48 template amplification with, 46–48 See also Polymerase chain reactions (PCR) End sequence profiling (ESP), 187–88, 233 application, 188 data analysis, 188–91 defined, 187 end sequence pair (ES pair), 188 illustrated, 187 MCF7, 190 tumor genomes analysis, 191 Enterprise Vocabulary Services (EVS), 58 Epigenetics, 148 cancer and, 200–201 cloning/sequencing, 213–14 defined, 197 developmental/neurological diseases and, 200 disease and, 200–201 gel-based approaches, 201–12 high-throughput analysis, 201–15 histone modifications, 198–99 map, 197 mass spectrometry, 215 methylation of deoxycytosine, 198 microarrays, 212–13 miRNAs, 199 nucleosome remodeling, 198–99 phenomena regulating gene expression, 198–99 processes, 202 Epigenomics, 148–49 defined, 148 high-throughput assessments, 197–216 modifications of histones, 149 Error correction problem, 113
EULER, 95, 112–15 defined, 112 error correction and data corruption, 113 EULER-DB, 115 Eulerian-superpath problem, 113, 114–15 Idury-Waterman algorithm, 112–13 mate-pair information, 115 overview, 113 See also Fragment assembly algorithms Eulerian-superpath problem, 113, 114–15 Extensible Markup Language (XML), 61 Finite site model, 158 Fitness function, 117 Foundation Model of Anatomy (FMA), 59 Founders, 164 Fragment assembly, 84–85 Fragment assembly algorithms, 101–20 Arachne, 110–11 CAP3, 104–5 Celera Whole Genome Assembler (WGA), 105–10 EULER, 112–15 generic approach, 116–17 input, 101 Phrap, 103–4 structured pattern-matching approach, 117–19 TIGR, 102–3 Fragment conflict graphs, 171 Fragments errors in, 113 gapless, 172 identification, 119 inferring haplotypes from, 169–75 layout, 107 merging with assemblies, 102 mutual comparison, 229–30 orientation, 106 Frequencies Haplotype Inference (FHI) problem, 176 Gapless fragments, 172 Gel-based approaches, 201–12 General Minimum Recombination Haplotype Inference (GMRHI), 168 Generic fragment assembly algorithm, 116–17 fitness function, 117
operations, 117 sequence assembly representation, 116–17 See also Fragment assembly algorithms Genetics Home Reference, 55 Genome assembly techniques, 79–97 Genome characterization, 145–49 epigenomics, 148–49 genotyping versus haplotyping, 147 large-scale variations, 147–48 in post-HGP, 145–49 resequencing and comparative assembly, 146 Genome rearrangement measurement, 187–88 Genome repeat masking, 28 Genome sequencing, 53–72 assembly validation, 85–87 finishing, 94 fragment assembly, 84–85 framework, 96–97 heterogeneous data sources, 56–57 information modeling, 57–58 ontologies/terminologies, 58–59 personalized medicine, 53, 55–56 scaffold generation, 87–94 by shotgun-sequencing strategy, 79–82 strategies, 94–95 trimming vector and low-quality sequences, 83–84 very large scale, 95 Genome sequencing applications, 59–71 ACGT, 60 caBIG, 62 caDSR, 63–64 caIntegrator, 66 caTRIP, 64–66 DiseaseCard, 62, 63 LOINC Committee, 69–70, 71 NoE, 68 OntoFusion, 60–62 Genome-wide mapping technique (GMAT), 208 Genomic alterations (cancer), 183–92 aCGH, 185–86 combination of techniques, 191 ESP, 187–88 ESP data analysis, 188–91 future directions, 191–92 Genomic DNA (gDNA), 44, 45
Index Genomics and Personalized Medicine Act of 2006, 54 Genomic triangulation method, 226 Genotypes defined, 153 resolving, 156–57 xor, 161–62 Genotyping haplotyping versus, 147 SNP, 16 Gibbs sampler, 177 GigAssembler, 87 bridge graphs, 92, 93 defined, 91 raft-ordering graphs, 92 rafts, 92, 93 scaffold-generation process, 91–92 GO, 59 Greedy approximation algorithm, 106 GRIMM, 228 G-valid genotyped pedigree graph, 167 Haplotyped pedigree graphs, 166 Haplotypes common, 152 compatible, 154 defined, 151, 152 inferring, 153 inferring, in population, 154–63 inferring from fragments, 169–75 inferring in pedigrees, 163–69 maps, 152 partial completion, 162–63 statistical inference, 175 Haplotyping defined, 151 genotyping versus, 147 with missing data, 162–63 problem, 151–78 Pure Parsimony problem, 158 HapMap project, 4, 147 Helicos, 127 HELP, 205 Heterozygous, 153 Hierarchical scaffolding, 91 High-throughput methods, 203–11 AIMS, 206 cloning/sequencing, 213–14 DMH, 203
HELP, 205 histone modifications, 207–11 mass spectrometry, 215 McIP, 204 MeDIP, 204 methylation profiling coupled to LDR, 204 MethylScope, 205 MethylScreen, 205 MIAMI, 204 microarrays, 212–13 MIRA, 203 MSO, 206 MTA, 206 Not1 digestion coupled to BAC array hybridization, 203 promoter methylation array, 203 RLGS, 203 universal bead array, 205 Histone modifications, 198–99 ChAP, 207 ChIP-cloning, 207 ChIP-on-chip, 207 ChIP-paired-end-tagging, 207 GMAT, 208 high-throughput methods, 207–11 MPSS, 208 SACO, 208 Human Genome Project (HGP), 25, 43, 53, 145 completion, 145 information generated by, 197 Human Genome Variation Society, 69 Human resequencing, 33 Hybridization, sequencing by, 9 computational methods, 112 as gel-based sequencing alternative, 146 with resequencing arrays, 26–28 Idury-Waterman algorithm, 112–13 Infectious agent detection, 37–38 Inference problem by coalescent model, 158–61 general rule, 156–58 Information modeling, 57–58 Information theoretic probabilistic approach, 87 Isolated target DNA, 28–29
Kececioglu and Myers approach, 105–7 fragment layout, 107 fragment orientation, 106 greedy approximation algorithm, 106 multiple sequence alignment, 107 overlap graph construction, 106 steps, 105–6 See also Celera Whole Genome Assembler (WGA) LAGAN, 228 Large-scale genome variations, 147–48 Linkage disequilibrium (LD), 147 Litigation-based sequencing, 8–9 Log-likelihood ratio (LLR), 103 LOINC Committee (Logical Observation Identifiers Names and Codes), 69–70, 71 Longest Haplotype Reconstruction (LHR) problem, 172 Long-interspersed elements (LINEs), 198 Long PCR (LPCR), 28–29 Massively parallel sequencing development rationale, 4–5 future methods survey, 9–11 goals, 6 by hybridization, 9 litigation-based, 8–9 nanopore approaches, 10–11 by synthesis pyrosequencing, 6–8 by synthesis with reversible terminators, 8 within zero-mode waveguide, 9–100 Massively parallel signature sequencing (MPSS), 44, 214 Mass spectrometry, 215 Mate-pair constraints, 105 identification, 110 information, 103, 115 Mating loop, 164 Matrix-assisted laser desorption/ionization coupled to time-of-flight (MALDI-TOF), 215 Maximum-likelihood inference, 175 MegaBlast, 188 Mendelian law, 155, 163, 168 Methylated CpG island recovery assay (MIRA), 203
Methylated DNA immunoprecipitation (MeDIP), 204 Methylation of deoxycytosine, 198 Methylation-specific oligonucleotide array (MSO), 206 Methylation target array (MTA), 206 Methyl-binding protein immunoprecipitation (McIP), 204 MethylScope, 205 MethylScreen, 205 MFISH, 184 Microarray-based integrated analysis of methylation (MIAMI), 204 Microarrays, 209, 212–13 Microbial pathogen resequencing, 35–38 anthrax resequencing, 35–37 infectious agent detection, 37–38 SARS resequencing, 37 Micrococcal nuclease laddering, 212 MicroRNAs (miRNAs), 199 cloning, 209–10 custom, 209 custom arrays, 213 discovery, 209, 214 high-throughput analysis, 213 mirVana, 209 probes, 213 Minimum entropy model, 162 Minimum Error Correction (MEC) problem, 172 Minimum Fragment Removal (MFR) problem, 171 Minimum Perfect Phylogeny Haplotyping (MPPH) problem, 160 Minimum SNP Removal (MSR) problem, 172 Min Vertex Cover problem, 173 MIRA (methylated-CpG-island recovery), 213 MirVana, 209 MitoChip, 33, 34 Mitochondrial DNA resequencing, 33–34 MRNA alignment, 92 transcripts, 199 Multimating Pedigree Tree Minimum Recombination Haplotype Inference (MPT-MRHI) problem, 169
Index Mutation, 152 Nanopore sequencing, 10–11 National Institute of Health (NIH), 145 Network of Excellence (NoE), 68 Next generation sequencing technologies, 225, 226 NimbleGen Systems RAs, 28 Nuclear family, 164 Nucleosome positioning, 210, 213 Nucleosome remodeling, 198–99 Oncomirs, 201 OntoFusion, 60–62 defined, 60 domain independence, 60 screen shot, 61 Orthologous sequences, 228 Overlap-layout-consensus approach, 84–85 Paired ends anchoring, 232–33 sequencing, 185 Parallelism, 225–26 Parents-offspring trio, 164 Parsimonious principle, 156 Path-merging algorithm, 110 PCAP, 104 Pedigree Graph Haplotype Interface (PHI) problem, 167 Pedigree graphs, 164 defined, 163 definitions, 164–65 g-valid genotyped, 167 haplotyped, 166 Perfect Phylogeny Haplotyping (PPH) problem input, 159 linear time algorithm, 161 Minimum (MPPH), 160 output, 160 Xor (XPPH), 162 Personalized medicine, 53, 55–56 defined, 54 pharmacogenetics, 55–56 realization, 54 Pharmacogenetics, 55–56 Phrap, 103–4 defined, 103 online documentation, 103
steps, 103–4 See also Fragment assembly algorithms Phylogenetic footprinting, 232 PicoTiterPlate (PTP) device, 7 Polony sequencing, 43–50, 126–27 cyclic-array methods, 48–49 future, 49–50 methodologies, 44 overview, 44–45 sequencing libraries construction, 45–46 template amplification, 46–48 Polyacrylamide gel electrophoresis (PAGE), 46 Polymerase chain reactions (PCR), 34 bridge, 44 clonal reaction, 46 emulsion, 44, 46–48 long (LPCR), 28–29 multiplex amplification, 46 in polyacrylamide gels, 44 Polymorphism Markup Language (PML), 66 Polymorphisms, 231 Pooled genomic indexing, 95 Positional hashing, 234–36 colinear matches, 235 illustrated, 236 seed-and-extend versus, 234–36 Promoter methylation array, 203 Pseudo Gibbs sampler (PGS), 176 Pure Parsimony Haplotyping problem, 158, 163 Pyrosequencing, 6–8 454 GS-20 sequencer, 7 array-based, 15–21 chemistry, 16–17 defined, 214 future, 21 method principle, 6 in microtiter plate format, 7 novel applications enabled by, 8 in short-read sequencing, 126 Raft-ordering graphs, 92 Rafts, 92, 93 Read anchoring, 230–32 Recombination events, 152 example, 155 occurrence, 155–56 Repetitive sequences, TIGR handling of, 103
Resequencing anthrax, 35–37 genome, 146 human, 33 microbial pathogen, 35–38 mitochondrial DNA, 33–34 SARS, 37 Resequencing arrays (RAs), 25–40 Affymetrix, 27 applications, 33–38 challenges, 38–40 concept and design, 27 data analysis with ABACUS, 29–33 data reproduction, 30 DNA sequencing by hybridization, 26–28 efficient use of space, 28 experimental protocols, 28–29 hybridization, 29 Nimblecen Systems, 28 protocol overview, 27 queries, 26 role in DNA sequencing, 25–40 target DNA isolation, 39 Restriction landmark genome scanning (RLGS), 201–12 Reversible terminators, 8 Rolling-circle amplification (RCA), 44, 48 SAGE (serial analysis of gene expression), 4 SARS resequencing, 37 accuracy, 38 cell rate, 38 SARS-CoV, 37 See also Resequencing; Resequencing arrays (RAs) Satellite matching, 119 Scaffold generation, 87–94 Bambus, 87, 89–92 GigAssembler, 87, 92–94 hierarchical, 91 input, 89 output, 89 packages, 87 problem, 89 process, 91–92 Seed-and-extend, 234–36 index, 235 positional hashing versus, 234–36 speed, 235
Sequencing by hybridization (SBH), 9, 26–28, 112, 146 Sequencing libraries, 45–46 Serial analysis of chromatin occupancy (SACO), 208 Short-read sequencing, 123–40 algorithmic methods, 129 analysis, 135–40 assembler development, 132–40 assembly, 128–32 assembly process phases, 132–33 cleaning input read-pairs, 132 contigs selection, 134–35 de Bruijn subgraphs, 133–34 double-ended, 123–40 multiple window lengths and, 140 postassembly contig sizes, 138–39 post contig extension, 135 pyrosequencing basis, 126 read correction, 133 read frequency analysis, 132–33 simulation results, 129–32 technologies, 125–28 Shotgun-sequencing strategy, 79–82 shotgun reads, 80 WGS, 80–82 Simple Object Access Protocol (SOAP), 61 Single-Mating Pedigree Tree Minimum Recombination Haplotype Inference (SPT-MRHI) problem, 169 Single-molecule molecular-motion sequencing, 127 Single-nucleotide polymorphism. See SNPs Smith-Waterman algorithm, 103 SNP conflict graphs, 173 SNPs, 4, 147 alleles, 152 biallelic, 152 defined, 151 discovery, 29 genome-wide profiling, 5 genotyping, 16 identification, 29, 147 large-scale screening, 152 multiallelic, 152 sites, 153 subset, 31 Solexa, 127
Index Spectral karyotyping (SKY), 184 Statistical haplotype inference, 175 Statistical methods, 175–77 concept, 175 introduction, 175 Structured pattern-matching approach, 117–19 AMASS, 117–18 contigs, building, 118–19 defined, 117 repeats, handling, 119 See also Fragment assembly algorithms Supercontigs, 110–11 filling gaps in, 111 repeat, detecting, 111 SWAT, 103 TAMPA, 86 Taq ligase, 49 Taverna, 62 TIGR assembler, 102–3, 135 consensus sequence building, 102 defined, 102 handling repetitive sequences, 103 mate-pair information, 103 merging fragments with assemblies, 102 See also Fragment assembly algorithms Trimming vector and low-quality sequences, 83–84 contaminant detection, 84 quality region determination, 83 vector splice site trimming, 83 UCSC Genome Bioinformatics site, 28, 39 UD-CSD benchmark, 237–39
defined, 237 illustrated, 238 key aspects, 237 using, 237 Ultrabroad sequencing, 20 Ultradeep sequencing, 20 Unified Modeling Language (UML), 57 binding models, 64 caBIG models, 62 domain models, 62 UMLS, 58, 59 Universal bead array, 205 Universal Description, Discovery, and Integration (UDDI), 61 U-unitigs, 108 Web Services Description Language (WSDL), 61 Whole-genome amplification, 28 Whole-genome sequencing, 19 hierarchical approach, 94–95 hybrid approach, 95 shotgun strategy, 94 strategies, 94–95 Whole-genome shotgun (WGS) strategy, 80–82 Xor-genotype, 161–62 Xor Perfect Phylogeny Haplotyping (XPPH), 162 Zero-mode waveguide, sequencing within, 9–10 Zero-Recombinant Haplotype Inference Problem (ZRHI), 168
Related Artech House Titles

Advanced Materials and Techniques for Radiation Dosimetry, Khalil Arshak and Olga Korostynska, editors
Advanced Methods and Tools in ECG Data Analysis, Gari D. Clifford, Francisco Azuaje, and Patrick E. McSharry, editors
Biomolecular Computation for Bionanotechnology, Jian-Qin Liu and Katsunori Shimohara
Database Modeling in Computational Biology, Jake Chen and Amandeep S. Sidhu, editors
Electrotherapeutic Devices: Principles, Design, and Applications, George D. O'Clock
Intelligent Systems Modeling and Decision Support in Bioengineering, Mahdi Mahfouf
Life Science Automation Fundamentals and Applications, Mingjun Zhang, Bradley Nelson, and Robin Felder, editors
Matching Pursuit and Unification in EEG Analysis, Piotr Durka
Microfluidics for Biotechnology, Jean Berthier and Pascal Silberzan
Systems Bioinformatics: An Engineering Case-Based Approach, Gil Alterovitz and Marco F. Ramoni, editors
Text Mining for Biology and Biomedicine, Sophia Ananiadou and John McNaught, editors
For further information on these and other Artech House titles, including previously considered out-of-print books now available through our In-Print-Forever® (IPF®) program, contact:

Artech House Publishers
685 Canton Street
Norwood, MA 02062
Phone: 781-769-9750
Fax: 781-769-6334
e-mail: [email protected]
Artech House Books
46 Gillingham Street
London SW1V 1AH UK
Phone: +44 (0)20 7596 8750
Fax: +44 (0)20 7630 0166
e-mail: [email protected]
Find us on the World Wide Web at: www.artechhouse.com