Next Generation Sequencing and Whole Genome Selection in Aquaculture


Author: Zhanjiang (John) Liu



Next Generation Sequencing and Whole Genome Selection in Aquaculture © 2011 Blackwell Publishing Ltd. ISBN: 978-0-813-80637-2

Edited by
Zhanjiang (John) Liu
Auburn University

A John Wiley & Sons, Ltd., Publication

Edition first published 2011
© 2011 Blackwell Publishing Ltd.

Blackwell Publishing was acquired by John Wiley & Sons in February 2007. Blackwell’s publishing program has been merged with Wiley’s global Scientific, Technical, and Medical business to form Wiley-Blackwell.

Editorial Office
2121 State Avenue, Ames, Iowa 50014-8300, USA

For details of our global editorial offices, for customer services, and for information about how to apply for permission to reuse the copyright material in this book, please see our Website at www.wiley.com/wiley-blackwell.

Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Blackwell Publishing, provided that the base fee is paid directly to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For those organizations that have been granted a photocopy license by CCC, a separate system of payments has been arranged. The fee code for users of the Transactional Reporting Service is ISBN-13: 978-0-8138-0637-2/2011.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data
Next generation sequencing and whole genome selection in aquaculture / [edited by] Zhanjiang (John) Liu.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-8138-0637-2 (hardcover : alk. paper)
1. Gene mapping. 2. Fishes–Breeding. 3. Shellfish–Breeding. I. Liu, Zhanjiang.
QH445.2.N49 2011
639.8–dc22
2010030977

A catalog record for this book is available from the U.S. Library of Congress.

Set in 10 on 12 pt Dutch 801 BT by Toppan Best-set Premedia Limited
Printed in ••

Disclaimer
The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read.

Contents

Preface
List of Contributors

Chapter 1. Genomic Variations and Marker Technologies for Genome-based Selection
Zhanjiang (John) Liu

Chapter 2. Copy Number Variations
Jianguo Lu and Zhanjiang (John) Liu

Chapter 3. Next Generation DNA Sequencing Technologies and Applications
Qingshu Meng and Jun Yu

Chapter 4. Library Construction for Next Generation Sequencing
Huseyin Kucuktas and Zhanjiang (John) Liu

Chapter 5. SNP Discovery through De Novo Deep Sequencing Using the Next Generation of DNA Sequencers
Geoffrey C. Waldbieser

Chapter 6. SNP Discovery through EST Data Mining
Shaolin Wang and Zhanjiang (John) Liu

Chapter 7. SNP Quality Assessment
Shaolin Wang, Hong Liu, and Zhanjiang (John) Liu

Chapter 8. SNP Genotyping Platforms
Eric Peatman

Chapter 9. SNP Analysis with Duplicated Fish Genomes: Differentiation of SNPs, Paralogous Sequence Variants, and Multisite Variants
Cecilia Castaño Sánchez, Yniv Palti, and Caird Rexroad

Chapter 10. Genomic Selection for Aquaculture: Principles and Procedures
Anna K. Sonesson

Chapter 11. Genomic Selection in Aquaculture: Methods and Practical Considerations
Ashok Ragavendran and William M. Muir

Chapter 12. Comparison of Index Selection, BLUP, MAS, and Whole Genome Selection
Zhenmin Bao

Index

Color plates appear between pages 108 and 109.

Preface

Over the last 25 years of genomics development, molecular markers have been a major limiting factor. That was true for human genomics and animal genomics, as well as for aquaculture genomics. As a result, the goals of genomic research have been a moving target based on the availability of molecular markers. Scientists celebrated at each stage of marker development, from the classical restriction fragment length polymorphism (RFLP), microsatellites, random amplified polymorphic DNA (RAPD), and amplified fragment length polymorphism (AFLP) to the most recent marker type, single-nucleotide polymorphisms (SNPs). The demand for molecular markers has kept increasing, from thousands to tens of thousands, to the current level of hundreds of thousands or millions of polymorphic markers per species needed to fully mark and map the genomes. Such limitations were imposed mostly by the lack of whole genome sequences in many species, especially aquaculture species.

In the last few years, this bottleneck has finally begun to be relieved by advances in next generation sequencing technologies. Now, with the powerful second generation and third generation sequencing technologies, many gigabases of nucleotide sequences can be generated in just a few hours, and thousands upon thousands of SNPs, among other types of polymorphisms, can be discovered. Since the start of this book project, sequencing technologies have evolved and matured to such a level that they are now widely used, even with aquaculture species. Huge numbers of SNPs are being discovered, validated, and applied to aquaculture genome research. This brings aquaculture genome research to the same level as terrestrial livestock genomics, where whole genome-based selection can be conducted. As a result, this book is focused on providing a basic description of next generation sequencing technologies, genomic copy number variations, SNP discovery, validation, and applications to whole genome-based selection.
It can be said that whole genome selection is a direct result of genome research, and it perhaps represents one of the most powerful genome-based technologies. Since its proposal in 2001 by Meuwissen et al. (Genetics 157:1819–1829), whole genome-based selection has become the center and future direction of animal breeding. It will certainly find its way into application in aquaculture.

This book has 12 chapters: genome variations and traits; copy number variations; next generation sequencing technologies; methods and protocols for library construction for next generation sequencing; SNP discovery through sequencing reduced representation libraries; SNP mining from expressed sequence tag (EST) databases; SNP quality assessment; SNP genotyping platforms; complexities of SNP analysis in duplicated teleost fish genomes; whole genome-based selection: principles and procedures; whole genome-based selection: methods and practical considerations; and comparative analysis of conventional index selection, best linear unbiased prediction (BLUP) selection, marker-assisted selection, and whole genome-based selection.

The last three chapters each address the theory and principles of whole genome-based selection, but from different perspectives. These chapters were intentionally included from authors with different experiences. As genome selection is still in its infancy, its theories are still evolving, and its practical effectiveness still needs to be validated by future experimentation. The inclusion of chapters written by experts with different perspectives should give readers some comfort as to where genome selection is going in aquaculture. Chapter 10 was written by Anna Sonesson, who is a member of the group in Norway that proposed the theory of whole genome selection; Chapter 11 was written by Ashok Ragavendran and Bill Muir, the latter of whom has worked on a whole genome selection project in poultry in the United States and has a good knowledge of aquaculture; and Chapter 12 was written by Zhenmin Bao, who is an expert in aquaculture and aquaculture breeding programs in China.

This book was written to bridge genome-based technologies with aquaculture breeding programs. It should be useful to academic professionals, research scientists, graduate students and college students in agriculture, as well as to students of aquaculture and fisheries.

I am grateful to all the contributors to this book. It is their great experience and efforts that made this book possible. I am grateful to the postdoctoral fellows and graduate students in my laboratory and in the Aquatic Genomics Unit at Auburn University for their proofreading and technical assistance. I have had a year of pleasant experience interacting with Susan Engelken, Editorial Program Coordinator, and with Justin Jeffryes, Commissioning Editor for Plant Science, Agriculture, and Aquaculture with Wiley-Blackwell of John Wiley & Sons. During the course of writing and editing this book, I have worked extremely hard as the Associate Dean for Research while also fulfilling my duty and passion as a professor and graduate adviser. As a consequence, I could not possibly work as hard as I wished to fulfill my responsibility as the father of my three lovely daughters: Elise, Lisa, and Lena Liu. I wish to express my appreciation for their independence and great progress.
Finally, this book is a product of the encouragement of my lovely wife, Dongya Gao. As I always say, my mother always expects a lot of me, and my wife always makes sure that I deliver on those high expectations. This book, therefore, is dedicated to my extremely supportive wife.

Zhanjiang (John) Liu

List of Contributors

Zhenmin Bao
Key Lab of Marine Genetics and Breeding, Ministry of Education
College of Marine Life Science, Ocean University of China
Qingdao, China

Cecilia Castaño Sánchez
United States Department of Agriculture/Agricultural Research Service
National Center for Cool and Cold Water Aquaculture
Kearneysville, WV 25430 USA

Huseyin Kucuktas
The Fish Molecular Genetics and Biotechnology Laboratory
Department of Fisheries and Allied Aquacultures and Program of Cell and Molecular Biosciences
Aquatic Genomics Unit, Auburn University
Auburn, AL 36849 USA

Hong Liu
The Fish Molecular Genetics and Biotechnology Laboratory
Department of Fisheries and Allied Aquacultures and Program of Cell and Molecular Biosciences
Aquatic Genomics Unit, Auburn University
Auburn, AL 36849 USA

Zhanjiang (John) Liu
The Fish Molecular Genetics and Biotechnology Laboratory
Department of Fisheries and Allied Aquacultures and Program of Cell and Molecular Biosciences
Aquatic Genomics Unit, Auburn University
Auburn, AL 36849 USA

Jianguo Lu
The Fish Molecular Genetics and Biotechnology Laboratory
Department of Fisheries and Allied Aquacultures and Program of Cell and Molecular Biosciences
Aquatic Genomics Unit, Auburn University
Auburn, AL 36849 USA

Qingshu Meng
CAS Key Laboratory of Genome Science and Information
Beijing Institute of Genomics, Chinese Academy of Sciences
Beijing 100029, China

William M. Muir
Pulse Molecular Evolutionary Genetics Program and Department of Animal Sciences
Room G406 Lilly Hall, 915 West State Street
Purdue University
West Lafayette, IN 47907 USA

Yniv Palti
USDA/ARS National Center for Cool and Cold Water Aquaculture
Kearneysville, WV 25430 USA

Eric Peatman
The Fish Molecular Genetics and Biotechnology Laboratory
Department of Fisheries and Allied Aquacultures and Program of Cell and Molecular Biosciences
Aquatic Genomics Unit, Auburn University
Auburn, AL 36849 USA

Ashok Ragavendran
Pulse Molecular Evolutionary Genetics Program and Department of Animal Sciences
Room G406 Lilly Hall, 915 West State Street
Purdue University
West Lafayette, IN 47907 USA

Caird Rexroad III
United States Department of Agriculture/Agricultural Research Service
National Center for Cool and Cold Water Aquaculture
Kearneysville, WV 25430 USA

Anna K. Sonesson
Nofima Marine AS
PO Box 5010, 1432 Ås
Norway

Geoffrey C. Waldbieser
USDA, Agricultural Research Service
Catfish Genetics Research Unit
141 Experiment Station Road
Stoneville, MS 38776 USA

Shaolin Wang
The Fish Molecular Genetics and Biotechnology Laboratory
Department of Fisheries and Allied Aquacultures and Program of Cell and Molecular Biosciences
Aquatic Genomics Unit, Auburn University
Auburn, AL 36849 USA

Jun Yu
CAS Key Laboratory of Genome Science and Information
Beijing Institute of Genomics, Chinese Academy of Sciences
Beijing 100029, China

[Figure labels: genomic DNA; reference DNA (Cy3 label); test DNA (Cy5 label); array with evenly spaced features designed from genome sequences; hybridization; detection of CNV by Cy3 and Cy5 ratio.]

Figure 2.1 Principles of array comparative genome hybridization (array CGH). A large number of evenly spaced features are designed from the reference genome sequence and placed on an array. Equal amounts of reference genome (normal genome) and test genome DNA are labeled with different fluorescent dyes, for example, Cy3 and Cy5, and hybridized to the array. The ratios of Cy3 and Cy5 define CNV. If red fluorescence is observed, the feature on the array has more copies in the test genome than in the normal genome.
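The ratio-based call described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `call_cnv`, the use of a log2 ratio, and the ±0.3 thresholds are hypothetical choices made here and are not taken from the book; real array CGH analysis also involves normalization and segmentation steps that are omitted.

```python
import math

def call_cnv(cy5_test, cy3_ref, gain=0.3, loss=-0.3):
    """Classify one array feature from its test (Cy5) and reference (Cy3) signal.

    The gain/loss thresholds are illustrative assumptions, not published values.
    """
    if cy5_test <= 0 or cy3_ref <= 0:
        raise ValueError("intensities must be positive")
    log_ratio = math.log2(cy5_test / cy3_ref)
    if log_ratio >= gain:
        return "gain", log_ratio      # more copies in the test genome
    if log_ratio <= loss:
        return "loss", log_ratio      # fewer copies in the test genome
    return "neutral", log_ratio

# Hypothetical intensities: (test Cy5, reference Cy3) per feature.
features = {
    "feature_1": (2000.0, 1000.0),
    "feature_2": (1000.0, 1000.0),
    "feature_3": (500.0, 1000.0),
}
calls = {name: call_cnv(test, ref)[0] for name, (test, ref) in features.items()}
print(calls)
```

Working on the log2 scale makes a doubling and a halving of copy number symmetric around zero, which is why it is a common choice for two-color ratios.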

Figure 2.2 An example of using array CGH for the detection of chromosomal segment duplications in cancer.


Figure 2.3 Principles of paired-end mapping-based CNV detection. Genomic DNA is sheared into approximately 3-kb fragments. The genomic fragments are then ligated to biotinylated adaptors to mark the orientation. The segments are circularized, followed by linearization at random sites. Next generation sequencing is used to massively sequence the segments. Bioinformatic mapping by in silico positioning of the sequences to the reference genome would detect any size difference or orientation difference, which suggests genome structural variations including CNVs.
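The mapping logic in this caption can be illustrated with a small Python sketch. It assumes each read pair has already been mapped and reduced to a span on the reference plus an orientation flag; the function name `classify_pair`, the 1,000-bp tolerance, and the category labels are hypothetical and chosen only for illustration. The expected span of 3,000 bp follows the approximately 3-kb fragment size mentioned above.

```python
def classify_pair(span_bp, orientation_ok=True, expected_bp=3000, tolerance_bp=1000):
    """Label one mapped read pair by comparing its span on the reference
    with the expected fragment size (tolerance is an illustrative assumption)."""
    if not orientation_ok:
        # Ends map in an unexpected relative orientation.
        return "inversion candidate"
    if span_bp > expected_bp + tolerance_bp:
        # Ends lie farther apart on the reference than in the sequenced fragment,
        # so the test genome appears to lack sequence present in the reference.
        return "deletion candidate"
    if span_bp < expected_bp - tolerance_bp:
        # Ends lie closer together on the reference, so the test genome
        # appears to carry extra sequence.
        return "insertion candidate"
    return "concordant"

# Hypothetical mapped pairs: (span on reference in bp, orientation as expected?)
pairs = [(3100, True), (6000, True), (800, True), (3000, False)]
labels = [classify_pair(span, ok) for span, ok in pairs]
print(labels)
```

In practice a single discordant pair is weak evidence; callers usually require several independent pairs supporting the same event, which this sketch does not model.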

Figure 3.1 Outline of the Roche/454 sequencer workflow. (A) Single-strand template DNA library preparation; (B) emulsion-based clonal amplification; (C) depositing DNA beads into the PicoTiterPlate device; (D) sequencing by synthesis. (Figure was adapted from www.454.com.)

Figure 3.2 The Illumina sequencing-by-synthesis approach. (1) Prepare genome DNA sample; (2) attach DNA to surface; (3) bridge amplification; (4) fragments become double-stranded; (5) denaturation of the double-stranded molecules; (6) complete amplification; (7) determine first base; (8) image first base; (9) determine second base; (10) image second chemistry cycle; (11) sequencing over multiple chemistry cycles; (12) align data. (From the Genome Analyzer brochure, with permission from Illumina Inc.)

Figure 3.3 The ligase-mediated sequencing approach of the Applied Biosystems SOLiD sequencer. (A) Library preparation; (B) emulsion PCR/bead enrichment; (C) bead deposition; (D) sequencing by ligation; (E) primer reset and two-base encoding. (Adapted from www.appliedbiosystems.com.)

Figure 4.3 Schematic presentation of paired-end read library preparation.

Figure 5.1 Diagram of an electrophoretic gel used to isolate reduced representation libraries. Genomic DNA from multiple individuals (green tubes at top) is pooled and digested with a restriction enzyme. The DNA fragments are separated by electrophoresis (shaded green box) alongside size standards (green lines, sizes in bp are denoted at left). White boxes represent size fractions that are excised for deep sequencing.

Figure 5.2 Multiple sequence alignment and chromatograms from a single SNP locus. The asterisk denotes the SNP site. On the left, the consensus sequence is at the top of the alignment and alternate alleles are denoted in green. On the right are chromatograms from an A/A homozygote, an A/G heterozygote, and a G/G homozygote.

Figure 6.3 SNP visualization from POLYBAYES. The SNP identified at position 364 is a C/T SNP, which was generated from SNP screening based on multiple ESTs using POLYBAYES.

Figure 6.4 SNP visualization from POLYPHRED. The SNP identified at position 192 is a C/G SNP, which was generated from SNP screening based on individual fish. Samples E09 and A12 are homozygous for allele C, sample B11 is homozygous for allele G, and sample A11 is heterozygous, carrying both allele C and allele G.

Figure 6.5 SNP visualization from AUTOSNP. The left upper panel displays sequence information, for example, GenBank accession numbers and putative gene identities. The left lower panel displays SNP summary information; for example, the SNP at position 310 is a T/G SNP and the SNP at position 377 is a T/A SNP, each with a sequence ratio of 3:2. The right panel displays the sequence alignment information, highlighting the SNP at position 435, an A/G SNP with a sequence ratio of 3:2.

Figure 6.6 SNP visualization from the CLC Genomics Workbench. The left panel is a navigation area including all the files and results information. The right upper panel displays SNP summary information, for example, contig number, consensus sequence length, consensus base at the SNP site (majority rule), SNP allele bases, sequence count (count) and ratio (frequency), and total number of sequences (coverage). The right lower panel displays the sequence alignment information and the SNP sites of the selected contig. In this example, sequence alignments of contig 38 are given, with forward sequences shown in red and reverse sequences shown in green.

Figure 6.7 SNP visualization from NextGENe. The upper panel displays a global view of the project, and the lower panel displays the sequence alignment and sequence variation with SNPs highlighted in blue.

Figure 7.1 Importance of the minor sequence frequency and the number of sequences in the contig. Note that the number of sequences at the SNP site is the most important. For instance, more sequences are available at the 5′ and 3′ of a transcript, providing a greater level of reliability of sequences. In contrast, fewer sequences are available in the middle of transcripts.

Number of    Minor sequence   Major sequence   Sequence
sequences    frequency        frequency        heterozygosity

10 seq       1                9                0.18
9 seq        1                8                0.20
8 seq        1                7                0.22
7 seq        1                6                0.24
6 seq        1                5                0.28
5 seq        1                4                0.32
4 seq        1                3                0.38
3 seq        1                2                0.44
2 seq        1                1                0.50

10 seq       2                8                0.32
9 seq        2                7                0.35
8 seq        2                6                0.38
7 seq        2                5                0.41
6 seq        2                4                0.44
5 seq        2                3                0.48
4 seq        2                2                0.50

10 seq       3                7                0.42
9 seq        3                6                0.44
8 seq        3                5                0.47
7 seq        3                4                0.49
6 seq        3                3                0.50

10 seq       4                6                0.48
9 seq        4                5                0.49
8 seq        4                4                0.50

10 seq       5                5                0.50

Figure 7.2 SNP quality assessment based on EST contig size and sequence frequency of the alleles. Arrows indicate the trend of increases of heterozygosity and the trend of increases in SNP quality.
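The heterozygosity values tabulated for Figure 7.2 are consistent with the expression 2pq, where p and q are the proportions of minor-allele and major-allele sequences at the SNP site. That formula is inferred here from the numbers and is not stated in the caption, so the short Python sketch below should be read as a plausible reconstruction; the function name `sequence_heterozygosity` is hypothetical.

```python
def sequence_heterozygosity(minor_count, total_sequences):
    """Return 2pq for a biallelic site, with p taken as the minor-allele proportion."""
    if not 0 < minor_count <= total_sequences / 2:
        raise ValueError("minor_count must be positive and at most half of the total")
    p = minor_count / total_sequences
    q = 1.0 - p
    return 2.0 * p * q

# Recompute a few rows: (minor sequence count, number of sequences in the contig).
rows = [(1, 10), (2, 10), (3, 9), (5, 10)]
for minor, total in rows:
    print(total, minor, total - minor, round(sequence_heterozygosity(minor, total), 2))
```

The value peaks at 0.50 when the two alleles are equally represented, which matches the direction of the heterozygosity trend described in the caption.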

Figure 7.4 Schematic illustration of the effect of introns involved in SNP genotyping. See text for full legend.

Figure 8.1 Differential hybridization utilizing multiple perfect match (PM) and mismatch (MM) probes per SNP allele and shifting the nucleotide context of the SNP provides the ability to differentiate homozygous and heterozygous signals as well as screening out signal resulting from nonspecific hybridization, as shown in the idealized, simplified array image.

Figure 8.2 Principles of the Illumina SNP genotyping platform. In Illumina’s SNP assays, allele discrimination at each SNP locus is achieved by using three oligos (P1, P2, and P3), of which P1 and P2 are allele-specific and are Cy3- and Cy5-labeled as indicated by red and green colors. P3 is locus-specific and is designed several bases downstream from the SNP site. Upon allele-specific extension and ligation, an artificial, allele-specific template is created for PCR using universal primers. If the template DNA is homozygous, either P1 or P2 will be extended to meet P3; if the template is heterozygous, both P1 and P2 will be extended to meet P3, allowing ligation to happen. P3 contains a unique address sequence that targets a particular bead type carrying the sequence complementary to the address sequence. After downstream processing, the single-stranded, dye-labeled DNAs are hybridized to their complementary bead type through their unique address sequences. After hybridization, the BeadArray Reader is used to read the fluorescence signal on the Sentrix Array Matrix or BeadChip, which is in turn analyzed using software for automated genotype clustering and calling. (Figure adapted from Illumina [www.illumina.com/].)

Figure 8.3 Common features of single-base extension genotyping. Sample genomic DNA is amplified, fragmented, and allowed to hybridize to locus-specific oligos (LSOs) in solution or bound to beads. Enzymatic incorporation of a single, fluorescently labeled dideoxynucleotide (ddNTP) allows specific base calling for each sample and locus.

Figure 9.1 Incorrect assembly of paralogous sequences. Genomic DNA sequencing of SNP flanking regions revealed incorrect assemblies of paralogous sequences during the SNP discovery process. The SNaPshot primer had been designed in a region with distinct paralogous differentiation.

Figure 9.2 Presence of introns in the amplicons in rainbow trout. Genomic DNA sequencing of SNP flanking regions revealed presence of introns in the amplicon sequences. The SNaPshot primer had been designed in an intron–exon boundary region. The first sequence in the figure is the EST sequence used to design the SNaPshot primer.

Figure 9.3 ABI’s GeneMapper SNP graphs. Columns represent homozygote and heterozygote genotypes for six SNPs. The genotypes in the first five SNPs were G/A (blue and green peaks), while the last one (V5666) was G/C (blue/black).

Figure 9.4 Illumina BeadStudio Atlantic salmon SNP graphs (figures modified from Kent et al. 2008 with permission). Genotypes in these graphs are represented in clusters. Each dot represents one genotyped individual: red clusters symbolize homozygote individuals for allele A (A/A) and blues for allele B (B/B); purple dots represent heterozygotes (A/B); and black dots unscored individuals. See text for full legend.

Figure 9.5 Illumina BeadStudio sample graphs. (A) Rainbow trout sample graph. Sample graphs represent the genotypes of all analyzed SNPs for one sample. Red and blue dots represent all the SNPs for which this particular individual (USDA04_M) was homozygous for allele A (A/A) and B (B/B), respectively; purple dots symbolize the heterozygous SNPs. (B) Double haploid rainbow trout sample graph. Validated SNPs should all be homozygous in double haploid organisms, as reflected by the absence of purple dots.

Chapter 1

Genomic Variations and Marker Technologies for Genome-based Selection

Zhanjiang (John) Liu

Genetic Variations at the Genomic Level

The genomes of individuals of the same species are similar but differ at the level of DNA sequence and encoding capacity (sometimes in terms of which genes are transcribed, but perhaps more often in terms of how much of each gene product is made). As a result, individuals differ in transcriptional activity, encode similar but not identical proteins, or encode the same or similar proteins in different quantities, leading to different biological characteristics and performance.

When the genomes of individuals within a population are compared with the reference genome sequence of the species, several general types of genetic variation can be found (Figure 1.1): (1) deletion due to the loss of one or more bases; (2) insertion due to the gain of one or more bases; (3) base substitution at various positions; (4) inversion of a DNA segment in its orientation; (5) rearrangements of multiple DNA segments within both small and larger regions of the genome; and (6) copy number variation (CNV) due to insertions, deletions, and duplication or multiplication of one or more DNA segments.

A deletion mutation and an insertion mutation can be viewed as the same phenomenon, depending on what is used as the reference. Deletions/insertions at random genomic locations probably do not have much impact on biological activity unless they occur within a gene or within its regulatory elements. Insertion/deletion of a single base or two bases within a protein coding sequence would cause a frameshift in the encoded protein, leading to a completely different amino acid sequence downstream of the mutation.
However, deletion/insertion of three bases or multiple of three bases (e.g., 6 base pair [bp], 9 bp) within a protein coding sequence would cause a deletion of one amino acid or multiple amino acids depending on the extent of the deletion/ insertion. In the first case of deletion/insertion of one or two bases into a protein coding sequences, the biological impact could be highly significant. Such mutations could cause total loss of functions of the protein. In the later case, deletion/insertion of three or multiple of three bases would lead to a protein missing one or a few amino acids but the upstream and downstream amino acid sequences should still be the same. In this case, the protein function may or may not be altered depending

Next Generation Sequencing and Whole Genome Selection in Aquaculture © 2011 Blackwell Publishing Ltd. ISBN: 978-0-813-80637-2

Edited by Zhanjiang (John) Liu

3

1. Indels (a) Insertions

(b) Deletions

2. Base substitutions/single nucleotide polymorphisms ACTGCAGTTTGCTCCAGTCTTTGAGAATTTACAGCTCACCACCAAAAAGACGAAAGAGCT |||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||| ACTGCAGTTTGCTCCAGTCTTTGAGAATCTACAGCTCACCACCAAAAAGACGAAAGAGCT

3. Rearrangements

4. Segmental inversions

5. Copy number variations

Figure 1.1 Types of genome variations. In principle, five types of genomic variations exist: 1. Indels that involve insertions or deletions of a segment, as indicated by the shaded boxes. 2. Base substitutions or single-nucleotide polymorphisms (SNPs) are simply differences of bases at a given DNA location. In the example, a T/C SNP is highlighted by the oval. 3. Rearrangements are genomic difference that resulted from the relocation of certain genomic segments of various sizes. Shown are three DNA segments that are present in both genomes, but they are located in different genome locations. Practically, such rearrangements can be intrachromosomal or extrachromosomal. 4. Segment inversions are changes of DNA segments in their orientation in the genome, as indicated by the change of the arrow direction. 5. Copy number variations are differences in copies of DNA segments (genes or just genomic segments) within genomes. In the example, one open box segment is in the first genome, but two open boxes are in the second genome; similarly, different numbers of segments exist between the two genomes as indicated by different sketched boxes.


on the position of the mutation and the amino acids involved. Serious biological impact can still result. For instance, in the case of cystic fibrosis (CF), a 3-bp deletion that removes the phenylalanine at amino acid position 508 of the cystic fibrosis transmembrane conductance regulator (CFTR) protein leads to a serious form of CF, even though the resulting protein loses just one amino acid. Genome variations also include a wide range of segmental inversions and rearrangements. Much as with deletions and insertions, such sequence changes could have a large biological impact, depending on the location of the mutation and the genes or gene regulatory sequences involved.

The most widespread genomic variation among individuals within a population is base substitution. Such base substitutions along the DNA chain are defined as single-nucleotide polymorphisms (SNPs). Inversion of a DNA segment in its orientation can be quite widespread in the genome, but this type of variation has not been well studied and probably will not be very useful for large-scale genomic studies.

CNV due to insertions, deletions, and duplication or multiplication of a DNA segment is widespread, and this type of genomic variation has only recently caught the attention of genome researchers. CNV can involve large or small genome segments that are duplicated or multiplied in one genome but not in another. Such CNVs can involve genes or genomic segments that do not harbor genes. When genes are involved, the duplicated or multiplied copies can affect the expression of those genes. CNV could potentially be used in whole genome selection programs once correlation or causation between certain genome segments and performance traits has been identified.
The importance of CNV in teleost fish is further underscored by the fact that teleost fish underwent an additional round of genome duplication followed by random gene loss, resulting in various CNV situations involving various genes. Because of this significance, CNV is covered in a separate chapter of this book (Chapter 2).
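The contrast between a 1- or 2-bp indel and a 3-bp indel in a coding sequence can be made concrete with a small sketch. The coding sequence below is invented for illustration and the helper names (`translate`, `delete`) are hypothetical; only the standard genetic code is taken as given.

```python
# Illustrative sketch: compare a 1-bp deletion (frameshift) with a 3-bp
# in-frame deletion. The coding sequence is made up for this example.

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, built by enumerating codons in T, C, A, G order.
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def delete(dna, start, length):
    """Return dna with `length` bases removed beginning at 0-based `start`."""
    return dna[:start] + dna[start + length:]


wild_type = "ATGGCTAAATTTGGTGAACTGTAA"        # hypothetical coding sequence
one_bp_deletion = delete(wild_type, 6, 1)     # shifts the reading frame
three_bp_deletion = delete(wild_type, 9, 3)   # removes exactly one codon (TTT)

print(translate(wild_type))          # MAKFGEL*
print(translate(one_bp_deletion))    # MANLVNC
print(translate(three_bp_deletion))  # MAKGEL*
```

In this toy case the 1-bp deletion changes every downstream amino acid and loses the stop codon, whereas the 3-bp deletion removes a single residue and leaves the flanking sequence intact.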

A Review of DNA Marker Technologies

The task of DNA marker technologies is to provide the means to reveal DNA-level differences between the genomes of individuals of the same species, as well as among related taxa. Historically, such measurements relied on phenotypic or qualitative markers. Morphological differences such as body dimensions, size, and pigmentation are examples of phenotypic markers. Genetic diversity measurements based on phenotypic markers are often indirect and are inferred through controlled breeding and performance studies (Parker et al., 1998; Okumuş and Çiftci, 2003). Because these markers are polygenically inherited and have low heritability, they may not represent the true genetic differences (Smith and Chesser, 1981). Only when the genetic basis of a phenotypic marker is known can it be used to measure genetic diversity. Molecular markers, including protein markers and DNA markers, were developed to overcome the problems associated with phenotypic markers.


Allozyme Markers

Long before the discovery of DNA markers, allozyme markers were used to identify broodstocks in fish and other aquaculture species (Kucuktas and Liu, 2007). Allozymes are different allelic forms of the same enzyme encoded at the same locus (Hunter and Markert, 1957; Parker et al., 1998; May, 2003). Genetic variations detected in allozymes may be the result of point mutations, insertions, or deletions (indels). Allozymes have had a wide range of applications in fisheries and aquaculture, including population analysis, mixed stock analysis, and hybrid identification (May, 2003). However, they are becoming a marker type of the past because of the limited number of loci, which in turn precludes genome-wide coverage for the analysis of complex traits (Kucuktas and Liu, 2007). In addition, a mutation at the DNA level that replaces an amino acid with a similarly charged one may not be detected by allozyme electrophoresis. Another drawback is that the tissues most commonly used in allozyme electrophoresis are muscle, liver, eye, and heart, the collection of which is lethal.

Restriction Fragment Length Polymorphism (RFLP)

Two specific technological advances, the discovery and application of restriction enzymes in 1973 and the development of DNA hybridization techniques in 1975, set the foundation for the development of the first type of DNA markers, RFLP (for a recent review, see Liu, 2007a). Restriction endonucleases cut DNA wherever their recognition sequences are encountered. Therefore, changes in the DNA sequence due to insertions/deletions (indels), base substitutions, or rearrangements involving the restriction sites can result in the gain, loss, or relocation of a restriction site. Digestion of DNA with restriction enzymes results in fragments whose number and size can vary among individuals, populations, and species.

Two approaches are widely used for RFLP analysis. The first involves Southern blot hybridization (Southern, 1975), while the second involves the polymerase chain reaction (PCR). Traditionally, RFLPs were detected by Southern blot analysis, in which genomic DNA is digested, subjected to electrophoresis through an agarose gel, transferred to a solid support such as a nylon membrane, and visualized by hybridization to specific probes. More recent analyses replace the tedious Southern blot with PCR-based techniques. If flanking sequences are known for a locus, the segment containing the RFLP region is amplified via PCR. If the length polymorphism is caused by a deletion or insertion, gel electrophoresis of the PCR products should reveal the size difference. However, if the length polymorphism is caused by a base substitution at a restriction site, the PCR products must be digested with the restriction enzyme to reveal the RFLP.

The major strength of RFLP markers is that they are codominant; that is, both alleles in an individual are observed in the analysis. The major disadvantage of RFLP is the relatively low level of polymorphism.
In addition, either sequence information (for PCR analysis) or a molecular probe (for Southern blot analysis) is required, making it difficult and time-consuming to develop markers in species


lacking known molecular information. Due to these disadvantages, the application of RFLP markers in aquaculture and fisheries has been, and will be, limited.
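How a single base substitution at a restriction site changes the fragment pattern can be sketched with an in silico digest. This is a minimal illustration, not a laboratory protocol: the two alleles are artificial sequences, the function name `digest` is hypothetical, and EcoRI is assumed to cut after the first base of its GAATTC recognition site.

```python
# Minimal in silico RFLP sketch with artificial alleles.

ECORI_SITE = "GAATTC"


def digest(seq, site=ECORI_SITE, cut_offset=1):
    """Return fragment lengths (left to right) after cutting at every site."""
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)   # EcoRI cuts between G and AATTC
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


left, middle, right = "A" * 100, "C" * 200, "T" * 50
allele_a = left + ECORI_SITE + middle + ECORI_SITE + right
# Allele B carries a base substitution (A -> G) that destroys the second site.
allele_b = left + ECORI_SITE + middle + "GAGTTC" + right

fragments_a = digest(allele_a)
fragments_b = digest(allele_b)
print(fragments_a)  # [101, 206, 55]
print(fragments_b)  # [101, 261]

# A heterozygote shows the fragments of both alleles (codominance).
heterozygote = sorted(set(fragments_a) | set(fragments_b), reverse=True)
print(heterozygote)  # [261, 206, 101, 55]
```

Because the two alleles give distinguishable patterns and a heterozygote shows both, the marker is scored codominantly, as described above.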

Mitochondrial Markers

The mitochondrial genome evolves more rapidly than the nuclear genome. This rapid evolution makes mitochondrial DNA (mtDNA) highly polymorphic within a given species. The polymorphism is especially high in the control region (D-loop region), making the D-loop region highly useful in population genetic analysis. Mitochondrial markers are mostly analyzed by RFLP or by direct sequencing (Liu and Cordes, 2004). Because of its high level of polymorphism and ease of analysis, mtDNA has been widely used as a marker in aquaculture and fisheries settings. However, mtDNA is maternally inherited in most cases, and this non-Mendelian inheritance greatly limits the applications of mtDNA in genome research. In addition, most aquaculture-related traits are controlled by nuclear genes. For most aquaculture finfish species, the nuclear genome is on the order of a billion base pairs, while the mitochondrial genome is usually tens of thousands of times smaller. Clearly, in spite of their usefulness for the identification of aquaculture stocks, mtDNA markers will be of limited use for aquaculture genome research and genetic improvement programs. However, some recent studies suggested that mtDNA could influence performance traits such as growth (Steele et al., 2008).

Microsatellite Markers

When the Human Genome Project was conceived in the mid-1980s, the capacity and capabilities of available DNA marker technologies seriously limited genome research. These limits created pressure to develop more efficient marker systems for the analysis of complex traits and genome organization. At the end of the 1980s, simple sequence repeats (SSRs), or microsatellites, were discovered; they have since been among the most preferred marker types because of their high levels of polymorphism, abundance, roughly even genome distribution, codominant inheritance, and small locus size, which facilitates PCR-based genotyping (Tautz, 1989).

Microsatellites are tandem repeats of short motifs of 1–6 bp. They can be viewed as special cases of insertions or deletions: the addition of one dinucleotide repeat unit can be viewed as an insertion of 2 bp into the genome. They are perhaps the most abundant type of insertions and deletions. Microsatellites are highly abundant in various eukaryotic genomes, including those of all aquaculture species studied to date. In most vertebrate genomes, microsatellites make up a few percent of the genome in terms of base pairs, depending on the compactness of the genome. Generally speaking, more compact genomes tend to contain a smaller proportion of repeats, including SSRs, but this generality does not always hold. For example, the highly compact genome of the Japanese pufferfish contains 1.29% microsatellites, but the genome of the closely related Tetraodon nigroviridis contains 3.21% microsatellites (Crollius et al., 2000).
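As a rough illustration of how repeats with 1–6 bp units might be located in a sequence, the sketch below scans for tandem repeats with a regular expression. The function names, the default threshold of five repeats, and the test sequence are assumptions made for this example; dedicated SSR-mining tools apply more careful rules (for example, different minimum lengths for each motif size).

```python
import re


def is_primitive(unit):
    """True if the unit is not itself a repeat of a shorter unit (e.g. not 'CACA')."""
    n = len(unit)
    return not any(
        n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n)
    )


def find_microsatellites(seq, min_unit=1, max_unit=6, min_repeats=5):
    """Return sorted (0-based start, repeat unit, number of repeats) tuples."""
    hits = []
    for k in range(min_unit, max_unit + 1):
        # A k-base unit followed by at least (min_repeats - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if is_primitive(unit):
                hits.append((match.start(), unit, len(match.group(0)) // k))
    return sorted(hits)


# Artificial sequence: a (CA)8 and an (AAT)5 repeat separated by unique flanks.
example = "TTGGC" + "CA" * 8 + "GGTTC" + "AAT" * 5 + "GGCTT"
for start, unit, copies in find_microsatellites(example):
    print(start, unit, copies)
```

The `is_primitive` check keeps a dinucleotide run such as (CA)8 from being reported a second time as a tetranucleotide (CACA)4 repeat.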


During a genomic sequencing survey of channel catfish, microsatellites were found to represent 2.58% of the catfish genome (Xu et al., 2006; Liu et al., 2009). In fugu, one microsatellite was found for every 1.87 kb of DNA. For comparison, in the human genome, one microsatellite was found for every 6 kb of DNA (Beckmann and Weber, 1992). It is reasonable to predict that in most aquaculture fish species, one microsatellite should exist, on average, in every 10 kb or less of genomic sequence.

Dinucleotide repeats are the most abundant form of microsatellites. For instance, in channel catfish, 67.9% of all microsatellites are dinucleotide repeats, 18.5% are trinucleotide repeats, and 13.5% are tetranucleotide repeats. Among the dinucleotide repeats, (CA)n is the most common type, followed by (AT)n and then (CT)n (Toth et al., 2000; Xu et al., 2006). (CG)n repeats are relatively rare in vertebrate genomes, partly because vertebrate genomes are often A/T-rich. Among the trinucleotide and tetranucleotide repeats, relatively A/T-rich repeat types are generally more abundant than G/C-rich repeat types. Microsatellites with repeat units longer than tetranucleotides (penta- and hexanucleotides) are much less abundant and are therefore less important as molecular markers (Toth et al., 2000). It is important to point out that limiting the definition of microsatellites to repeat units of up to six bases is quite arbitrary. Technically speaking, repeats with units of seven bases or longer are also microsatellites, but because they become rarer as the repeat unit becomes longer, they are less relevant as molecular markers.

Microsatellites are distributed on all chromosomes and in all regions of the chromosomes. They have been found inside gene coding regions (e.g., Liu et al., 2001), in introns, and in nongenic sequences (Toth et al., 2000).
The best known examples of microsatellites within coding regions are those causing genetic diseases in humans, such as the CAG repeats that encode polyglutamine tracts, whose expansion results in neurological disorders. In spite of their wide distribution in genes, microsatellites are predominantly located in noncoding regions (Metzgar et al., 2000). Only about 10%–15% of microsatellites reside within coding regions (Moran, 1993; Van Lith and Van Zutphen, 1996; Edwards et al., 1998; Serapion et al., 2004). This distribution can be explained by negative selection against frameshift mutations in the translated sequences (Metzgar et al., 2000; Li and Guo, 2004). Because the majority of microsatellites are dinucleotide repeats, any mutation by expansion or contraction would cause a frameshift in the protein-encoding open reading frame if the repeat resides within the coding region. This also explains why the majority of microsatellites residing within coding regions have been found to be trinucleotide repeats, although dinucleotide repeats and their mutations do occur within coding regions.

Most microsatellite loci are relatively small, ranging from a few to a few hundred repeats. The relatively small size of microsatellite loci is important for PCR-facilitated genotyping. Generally speaking, within a certain range, microsatellites containing a larger number of repeats tend to be more polymorphic, although polymorphism has been observed in microsatellites with as few as five repeats (Karsi et al., 2002). For practical applications, microsatellite loci must be amplified using PCR. For the best separation of related alleles, which often differ from one another by as little as one repeat unit, it is desirable to have small PCR amplicons, most often within 200 bp. However, due to the repetitive nature of microsatellites, their flanking sequences can be quite a


simple sequence as well, prohibiting the design of PCR primers for the amplification of microsatellite loci within a small size limit.

Microsatellites are highly polymorphic as a result of their hypermutability, which leads to the accumulation of various forms in the population of a given species. Microsatellite polymorphism is based on size differences due to the varying numbers of repeat units contained in alleles at a given locus. Microsatellite mutation rates have been reported to be as high as 10⁻² per generation (Weber and Wong, 1993; Crawford and Cuthbertson, 1996; Ellegren, 2000), which is several orders of magnitude greater than that of nonrepetitive DNA (10⁻⁹; Li, 1997). In several fish species, the mutation rates of microsatellites were reported to be on the order of 10⁻³ per locus per generation: 1.3 × 10⁻³ in common carp (Zhang et al., 2008), 2 × 10⁻³ in pipefish (Jones et al., 1999), 3.9–8.5 × 10⁻³ in salmon (Steinberg et al., 2002), and 2 × 10⁻³ in dollar sunfish (MacKiewicz et al., 2002).

Microsatellites are inherited in a Mendelian fashion as codominant markers. This is one of the strengths of microsatellite markers, in addition to their abundance, even genomic distribution, small locus size, and high polymorphism. Genotyping of microsatellite markers is usually straightforward. However, complications do arise from null alleles (alleles that cannot be amplified using the primers designed). As a result, caution should be exercised to ensure that the patterns of microsatellite genotypes fit the genetic model being applied. The disadvantages of microsatellites as markers include the requirement for existing molecular genetic information, the large amount of up-front work for microsatellite development, and the tedious and labor-intensive nature of primer design, testing, and optimization of PCR conditions. Each microsatellite locus has to be identified and its flanking region sequenced for the design of PCR primers.
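Since alleles at a microsatellite locus differ in the number of repeat units, allele size and repeat number are related by simple arithmetic. The sketch below assumes a hypothetical dinucleotide locus whose primers and flanking sequence contribute 120 bp to the amplicon; the function names and numbers are illustrative only.

```python
def amplicon_size(n_repeats, unit_len=2, flank_bp=120):
    """Expected PCR product size for an allele with n_repeats repeat units."""
    return flank_bp + n_repeats * unit_len


def call_genotype(peak_sizes_bp, unit_len=2, flank_bp=120):
    """Convert one or two observed fragment sizes into repeat-number alleles."""
    alleles = []
    for size in sorted(set(peak_sizes_bp)):
        extra = size - flank_bp
        if extra < 0 or extra % unit_len != 0:
            raise ValueError("size %d bp does not fit the locus model" % size)
        alleles.append(extra // unit_len)
    if len(alleles) == 1:
        # Scored as homozygous; a heterozygote carrying a null allele
        # would give the same single band.
        return (alleles[0], alleles[0])
    if len(alleles) == 2:
        return (alleles[0], alleles[1])
    raise ValueError("more than two alleles observed at a diploid locus")


print(amplicon_size(12))           # 144
print(call_genotype([144, 150]))   # (12, 15)
print(call_genotype([146]))        # (13, 13)
```

The single-band case shows why null alleles complicate scoring: an apparent homozygote may in fact carry one allele that the primers fail to amplify.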
Technically, the simplest way to identify and characterize a large number of microsatellites is through the construction of microsatellite-enriched small-insert genomic libraries (Ostrander et al., 1992; Lyall et al., 1993; Kijas et al., 1994; Zane et al., 2002). Although the techniques for constructing microsatellite-enriched libraries vary, enrichment usually involves selective hybridization of fragmented genomic DNA with a tandem repeat-containing oligonucleotide probe, followed by PCR amplification of the hybridization products. In spite of the simplicity of constructing microsatellite-enriched libraries, and thereby of identifying and characterizing microsatellite markers, direct microsatellite marker development may not be the wisest approach for a large genome project. Recent progress in sequencing technologies with the next generation of sequencers will allow large numbers of genomic sequence tags to be generated, and these will include numerous microsatellites. Microsatellites can be identified directly from genome sequence surveys such as bacterial artificial chromosome (BAC)-end sequencing (Xu et al., 2006; Somridhivej et al., 2008; Liu et al., 2009), and from expressed sequence tag (EST) analysis, from which many microsatellites can be developed into type I markers (Liu et al., 1999; Serapion et al., 2004). Caution has to be exercised, however, with microsatellites developed from ESTs. First, because of the presence of introns in the genomic sequence, one has to be careful not to design primers across exon–intron boundaries. Second, the presence of introns makes allele sizes unpredictable. Finally, many microsatellites are located in the 5′- or 3′-UTR, where the flanking sequences can be insufficient for the design of PCR primers. While introns are not a problem for


microsatellites derived from BAC-end sequencing, sequencing reactions often terminate immediately after the microsatellite repeats, which also leaves flanking sequences insufficient for the design of PCR primers.

Microsatellites have been an extremely popular marker type in a wide variety of genetic investigations. Over the past decade, microsatellite markers have been used extensively in fisheries research, including studies of genome mapping, parentage, kinship, and stock structure. The major application of microsatellite markers is the construction of genetic linkage and quantitative trait locus (QTL) maps, because of their high level of polymorphism. When a resource family is produced, the male and female parents are likely to be heterozygous at most microsatellite loci. The high polymorphism of microsatellites makes it possible to map many markers using a minimal number of resource families. There are other reasons for the popularity of microsatellites. One is that microsatellites are sequence-tagged markers, which allows them to be used as probes for the integration of different maps, including genetic linkage and physical maps. Communication using microsatellite markers across laboratories is easy, and the use of microsatellites across species borders is sometimes possible if the flanking sequences are conserved (Fitzsimmons et al., 1995; Rico et al., 1996; Cairney et al., 2000; Leclerc et al., 2000). As a result, microsatellites can also be used for comparative genome analysis. If microsatellites can be tagged to gene sequences, their potential for use in comparative mapping is greatly enhanced.

In spite of the popularity and wide use of microsatellites, several major limitations prevent them from rising to the top of all marker systems:

1. In spite of their abundance, development of hundreds of thousands or millions of microsatellite markers is practically almost impossible.
2. Automation has not been possible for microsatellite genotyping. Multiplexing has been limited to about a dozen loci at most.
3. For the most part, microsatellites are merely associated with traits and are not usually the causes of the phenotypic variation.

On top of these limitations, recent advances in molecular markers will have a major impact on the choice of DNA markers. In particular, rapid progress in SNP identification and in the automation of SNP genotyping makes SNPs by far the preferred marker system for genome studies.

Random Amplified Polymorphic DNA (RAPD) Markers

At the beginning of the 1990s, efforts were also devoted to developing multilocus, PCR-based fingerprinting techniques. These efforts resulted in two marker types that were highly popular for a while: RAPD (Welsh and McClelland, 1990; Williams et al., 1990) and amplified fragment length polymorphism (AFLP; Vos et al., 1995). RAPD is a multilocus DNA fingerprinting technique that uses PCR to randomly amplify anonymous segments of nuclear DNA with a single short PCR primer (8–10 bp in length) (for a recent review, see Liu, 2007b). Because the primers are short, relatively low annealing temperatures (often 36–40°C) must be used. Once different


bands are amplified from related species, populations, or individuals, RAPD markers are produced. RAPD markers are thus bands differentially amplified from random genome sites using a short PCR primer. Genetic variation and divergence within and between the taxa of interest are assessed by the presence or absence of each product, which is dictated by changes in the DNA sequence at each locus. RAPD polymorphisms can arise from base substitutions at the primer binding sites or from insertions or deletions (indels) in the region between two closely spaced primer binding sites. The potential power for detection of polymorphism is reasonably high compared with RFLP, but much lower than that of microsatellites; typically, 5–20 bands can be produced using a given primer, and multiple sets of random primers can be used to scan the entire genome for differential RAPD bands. Because each band is considered a biallelic locus (presence or absence of an amplified product), polymorphic information content (PIC) values for RAPDs fall below those for microsatellites.

The major advantages of RAPD markers are their applicability to all species regardless of available genetic, molecular, or sequence information; a relatively high level of polymorphism; a simple procedure; and minimal requirements for both equipment and technical skills. RAPD has been widely used in the genetic analysis of aquaculture species, but its further application in genome studies is limited by its lack of reproducibility and reliability. In addition, RAPD markers are inherited as dominant markers, and transfer of information with dominant markers among laboratories and across species is difficult.
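Why a biallelic locus caps the PIC value can be seen from the widely used definition of PIC, usually attributed to Botstein and colleagues (1980), which is assumed here because the text does not give a formula: PIC = 1 − Σ p_i² − Σ_{i<j} 2 p_i² p_j², where p_i are the allele frequencies. The sketch below is illustrative only. The allele frequencies are invented, and for dominant markers such as RAPD the frequencies themselves must first be estimated under additional assumptions.

```python
def pic(allele_freqs):
    """Polymorphic information content from a list of allele frequencies."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    n = len(allele_freqs)
    homozygosity = sum(p * p for p in allele_freqs)
    cross_term = sum(
        2 * allele_freqs[i] ** 2 * allele_freqs[j] ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )
    return 1.0 - homozygosity - cross_term


# Best case for a biallelic locus (band present/absent): equal frequencies.
print(round(pic([0.5, 0.5]), 3))   # 0.375
# A hypothetical microsatellite with 10 equally frequent alleles.
print(round(pic([0.1] * 10), 3))   # 0.891
```

Under this definition a two-allele locus cannot exceed 0.375, whereas a multiallelic microsatellite can approach 1.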

AFLP Markers

Alternatives to RAPD that overcome its major problems, such as low reproducibility, were actively sought in the early 1990s. AFLP (Vos et al., 1995) was the outcome of such efforts. AFLP is based on the selective amplification of a subset of genomic restriction fragments using PCR (for a recent review, see Liu, 2007c). Genomic DNA is digested with restriction enzymes, and double-stranded DNA adaptors with known sequences are ligated to the ends of the DNA fragments to generate primer binding sites for amplification. The sequence of the adaptors and the adjacent restriction site serve as primer binding sites for subsequent amplification of the restriction fragments by PCR. Selective nucleotides extending into the restriction fragments are added to the 3′ ends of the PCR primers such that only a subset of the restriction fragments is recognized. Only restriction fragments in which the nucleotides flanking the restriction site match the selective nucleotides will be amplified. The subset of amplified fragments is then analyzed by denaturing polyacrylamide gel electrophoresis to generate the fingerprints.

AFLP analysis is an advanced form of RFLP. Therefore, the molecular bases of RFLP and AFLP are similar. First, any deletions and/or insertions between the two restriction sites, for example, between the EcoRI and MseI sites that are most often used in AFLP analysis, will cause shifts in fragment size. Second, base substitutions at the restriction sites will lead to loss of restriction sites, and thus a size change. However, only base substitutions in all EcoRI sites and roughly 1 of 8 MseI sites are detected


by AFLP, since only the EcoRI primer is labeled and AFLP is designed to analyze only the EcoRI-MseI fragments. Third, base substitutions leading to new restriction sites may also produce AFLP. Once again, gaining an EcoRI site always produces an AFLP, whereas a gained MseI site must lie within an EcoRI-MseI fragment to produce a new AFLP. In addition to the mechanisms of polymorphism common to RFLP and AFLP, AFLP also scans for any base substitutions at the first three bases immediately after the two restriction sites. Considering the large numbers of restriction sites for the two enzymes (250,000 EcoRI sites and 500,000 MseI sites immediately next to EcoRI sites for a typical fish genome of 1 billion bp), a complete AFLP scan would also examine over 2 million bases immediately adjacent to the restriction sites.

The potential power of AFLP in the study of genetic variation is enormous. In principle, any combination of a 6-bp cutter with a 4-bp cutter in the first step can be used to determine potential fragment length polymorphism. For each pair of restriction enzymes used in the analysis, for example, EcoRI and MseI, a total of approximately 500,000 EcoRI-MseI fragments would exist for a genome with a size of 1 × 10⁹ bp. Theoretically, 4096 primer combinations compose a complete genome-wide scan of the fragment length polymorphism using the two restriction enzymes if three bases are used for selective amplification. As hundreds of restriction endonucleases are commercially available, the total power of AFLP for analysis of genetic variation cannot be exhausted. However, it is probably never necessary to perform such exhaustive analysis. Since over 100 loci can be analyzed by a single primer combination, a few primer combinations should display thousands of fingerprints.
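The figures quoted above can be reproduced approximately with back-of-the-envelope arithmetic. The sketch below assumes a random sequence with equal base frequencies, so that a 6-bp recognition site occurs about once every 4⁶ = 4096 bp, and assumes that each EcoRI site is flanked by one MseI site on either side. The function name and the returned keys are hypothetical, and the count of scanned bases follows the text's own way of counting.

```python
def aflp_expectations(genome_bp=1_000_000_000, selective_bases=3):
    """Rough expected counts for an EcoRI/MseI AFLP scan of a random genome."""
    eco_sites = genome_bp // 4 ** 6           # 6-bp cutter: one site per 4096 bp
    eco_mse_fragments = 2 * eco_sites         # nearest MseI site on each side
    # Each primer carries `selective_bases` selective bases: 4^3 x 4^3 = 4096.
    primer_combinations = 4 ** (2 * selective_bases)
    per_combination = eco_mse_fragments / primer_combinations
    # Selective bases examined next to EcoRI sites and adjacent MseI sites.
    scanned_bases = selective_bases * (eco_sites + eco_mse_fragments)
    return {
        "eco_sites": eco_sites,
        "eco_mse_fragments": eco_mse_fragments,
        "primer_combinations": primer_combinations,
        "fragments_per_combination": per_combination,
        "scanned_bases": scanned_bases,
    }


for name, value in aflp_expectations().items():
    print(name, round(value))
```

With these assumptions the estimates come out at roughly 244,000 EcoRI sites, 488,000 EcoRI-MseI fragments, 4096 primer combinations, about 119 fragments per primer combination, and about 2.2 million scanned bases, in line with the rounded values given in the text.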
For genetic resource analysis, the number of primer combinations required for the construction of phylogenetic trees/dendrograms depends on the level of polymorphism in the populations, but probably no more than 5–10 primer combinations are needed.

AFLP combines the strengths of RFLP and RAPD. It is a PCR-based approach requiring only a small amount of starting DNA; it does not require any prior genetic information or probes; and it overcomes the problem of low reproducibility inherent to RAPD. AFLP is capable of producing far greater numbers of polymorphic bands than RAPD in a single analysis, significantly reducing costs and making possible the genetic analysis of closely related populations. It is particularly well suited to stock identification because of the robust nature of its analysis. The other advantage of AFLP is its ability to reveal genetic conservation as well as genetic variation. In this regard, it is superior to microsatellites for applications in stock identification. Microsatellites often possess large numbers of alleles, too many to obtain a clear picture with small numbers of samples. Identification of stocks using microsatellites, therefore, would require large sample sizes. For instance, if 10 fish are analyzed, each of the 10 fish may exhibit a distinct genotype at a few microsatellite loci, making it difficult to determine relatedness without any commonly conserved genotypes. In closely related populations, AFLP can readily reveal commonly shared bands that define the common roots in a phylogenetic tree, and polymorphic bands that define its branches.

The major weakness of AFLP markers is their dominant nature of inheritance. Genetic information is limited with dominant markers because, essentially, only one allele is scored; at the same time, since the true alternative allele is scored as a different locus, AFLP also inflates the number of loci under study. As dominant


markers, information transfer across laboratories is difficult. In addition, AFLP is more technically demanding, requiring special equipment such as automated DNA sequencers for optimal operation.

AFLP has been widely used in aquaculture, for example in the analysis of population structure, migration, hybrid identification, strain identification, parentage identification, genetic resources, genetic diversity, reproductive contribution, and endangered species protection (Jorde et al., 1999; Seki et al., 1999; Sun et al., 1999; Cardoso et al., 2000; Chong et al., 2000; Kai et al., 2002; Mickett et al., 2003; Whitehead et al., 2003; Campbell and Bernatchez, 2004; Mock et al., 2004; Simmons et al., 2006). AFLP has also been widely used in genetic linkage analysis (Kocher et al., 1998; Liu et al., 1998, 1999; Griffiths and Orr, 1999; Agresti et al., 2000; Robison et al., 2001; Rogers et al., 2001; Li et al., 2003; Liu et al., 2003; Felip et al., 2005), and in the analysis of parental genetic contribution involving interspecific hybridization (Young et al., 2001) and meiogynogenesis (Felip et al., 2000). In a study of the black rockfish (Sebastes inermis), Kai et al. (2002) used AFLP to distinguish three color morphotypes, in which diagnostic AFLP loci were identified as well as loci with significant frequency differences. In such reproductively isolated populations, it is likely that “fixed markers” of AFLP can be identified to serve as diagnostic markers. Fixed markers are most often associated with relatively less migratory, reproductively isolated populations (Kucuktas et al., 2002). With highly migratory fish species, fixed markers may not be available. However, distinct populations are readily differentiated by differences in allele frequencies. For instance, Chong et al.
(2000) used AFLP for the analysis of five geographical populations of the Malaysian river catfish (Mystus nemurus) and found that AFLP was more efficient than RAPD for the differentiation of subpopulations and for the identification of genotypes within the populations, although both analyses produced similar clusters of the populations.

In spite of its popularity, AFLP has two fundamental flaws that will limit its wider application in the future: its dominant inheritance and the lack of information linking it to genome sequence information. In some cases, AFLP can be used as a rapid screening tool, and useful markers can then be converted to sequence-characterized amplified region (SCAR) markers. However, genome-scale applications of SCAR markers are unlikely.

SNP

SNP describes polymorphisms caused by point mutations that give rise to different alleles containing alternative bases at a given nucleotide position within a locus (for a recent review, see Liu, 2007d). Such sequence differences due to base substitutions have been well characterized since the beginning of DNA sequencing in 1977, but genotyping SNPs for large numbers of samples was not possible until several major technological advances in the late 1990s. SNPs are again becoming a focal point of molecular markers because they are the most abundant polymorphism in any organism, are adaptable to automation, and reveal hidden polymorphism not detected with other markers and methods. SNP markers have been regarded by many as the markers of choice for the future.


Theoretically, a SNP within a locus can produce as many as four alleles, each containing one of the four bases at the SNP site: A, T, C, and G. In practice, however, most SNPs have only two alleles (quite often either the two pyrimidines C/T or the two purines A/G) and are therefore regarded as biallelic. They are inherited as codominant markers in a Mendelian fashion.
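A short sketch can show how a SNP is identified between two aligned sequences and classified by base type. It uses the two 60-bp sequences shown in Figure 1.1; the function name `find_snps` is hypothetical, and the labels "transition" (purine to purine or pyrimidine to pyrimidine) and "transversion" (purine to pyrimidine or the reverse) are standard terms not used in the text itself.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def find_snps(seq_a, seq_b):
    """Return (1-based position, base in seq_a, base in seq_b, class) tuples."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    snps = []
    for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1):
        if a != b:
            same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
            kind = "transition" if same_class else "transversion"
            snps.append((pos, a, b, kind))
    return snps


# The two aligned sequences from Figure 1.1.
genome_1 = "ACTGCAGTTTGCTCCAGTCTTTGAGAATTTACAGCTCACCACCAAAAAGACGAAAGAGCT"
genome_2 = "ACTGCAGTTTGCTCCAGTCTTTGAGAATCTACAGCTCACCACCAAAAAGACGAAAGAGCT"

snps = find_snps(genome_1, genome_2)
print(snps)  # [(29, 'T', 'C', 'transition')]
```

The single difference is the T/C SNP of the figure, a pyrimidine-to-pyrimidine substitution of the kind described above as the most common.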

Trend of DNA Marker Technologies

DNA marker technologies have become essential for aquaculture genetics research and for the genetic improvement of aquaculture species. In fact, both the quality and the quantity of DNA markers have always been a limiting factor for in-depth genome research. Over the years, aquaculture geneticists have used various markers, including allozyme markers, mitochondrial markers, RFLP markers, RAPD, AFLP, microsatellites, and SNPs. The overall trend has been driven by (1) the need for large numbers of markers for high-density coverage of genomes and (2) the need for sequence-tagged markers for comparative genome analysis. Such demands have driven aquaculture genetic research away from systems that do not offer a great number of markers, such as RFLP and allozyme markers, and away from anonymous dominant markers such as RAPD and AFLP. Microsatellites, being codominant and sequence-tagged, have recently become very popular. However, with draft genome sequences soon to become available for major aquaculture species, the limitations of microsatellites are becoming apparent. Their genotyping can be multiplexed, but the extent of multiplexing is limited. Automation of microsatellite genotyping is also limited, which prohibits large-scale genome-wide applications. Mapping thousands of microsatellites to the genome is a lot of work, and repeated analysis using tens or hundreds of thousands of microsatellites would be a daunting task, if not technically impossible. This leaves SNPs as the only viable marker system. SNPs are the most abundant type of marker in genomes; they are sequence-tagged and therefore allow comparative mapping analysis; and SNP genotyping is highly automated and therefore adaptable to large-scale genome-wide analysis. It is therefore clear that SNPs are the markers of choice for the future.
In spite of the current lack of draft whole genome sequences for many aquaculture species, it is anticipated that they will soon become available for major aquaculture species. In addition, the availability of next generation sequencing technologies makes it unnecessary to have the whole genome draft sequences in order to develop a large number of SNP markers.

Assessment of the Usefulness of Various Markers for Genome-based Selection

The following are the characteristics of markers suitable for genome-wide applications and genome-based selection:

1. The markers should provide the genome coverage desired for the traits, whether that means using a very large number of markers across the entire genome or a subset of markers previously identified as relevant to the traits.
2. The markers should provide uniform coverage of the genome in terms of intermarker distances.
3. The markers should be amenable to automated genotyping, so that whole genome analysis is possible with just one or a limited number of genotyping analyses.

SNPs are the marker type most suitable for genome-based selection. They meet the marker number test, as large numbers of SNPs should be available for almost any species; they meet the genome distribution and spacing test, as SNPs are very abundant and appropriately spaced SNPs can be selected; and they meet the automation test, as many genotyping platforms are available for SNPs.
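The spacing criterion can be illustrated with a small sketch: given candidate SNP positions on one chromosome, keep a subset whose intermarker distances are at least a target spacing, then inspect the resulting distances. This is a simple greedy pass under assumed inputs; the function names and the positions are hypothetical and not taken from any published marker panel.

```python
# Illustrative sketch only: positions are invented base-pair coordinates on
# a single chromosome, and both helpers are hypothetical.
def pick_spaced_markers(positions, target_spacing):
    """Greedily keep markers at least target_spacing bp apart."""
    chosen = []
    for pos in sorted(positions):
        if not chosen or pos - chosen[-1] >= target_spacing:
            chosen.append(pos)
    return chosen


def intermarker_distances(positions):
    """Distances between neighbouring markers, in bp."""
    ordered = sorted(positions)
    return [right - left for left, right in zip(ordered, ordered[1:])]


candidates = [100, 150, 1_200, 1_250, 2_300, 3_500, 3_550]
panel = pick_spaced_markers(candidates, target_spacing=1_000)
print(panel)
print(intermarker_distances(panel))
```

Abundant candidate SNPs make it more likely that such a subset exists at the desired spacing; sparse marker types leave gaps that no selection rule can fill.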

Acknowledgments

Research in my laboratory is supported by grants from the United States Department of Agriculture (USDA)’s Agriculture and Food Research Initiative Animal Genome and Genetic Mechanisms Program, USDA National Research Initiative (NRI) Basic Genome Reagents and Tools Program, Mississippi–Alabama Sea Grant Consortium, Alabama Department of Conservation, United States Agency for International Development, National Science Foundation, and US-Israel Binational Agricultural Research and Development Fund (BARD). The author would like to thank Dr. Huseyin Kucuktas for helping with drawings of the figures, and Dr. Hong Liu, Dr. Donghong Liu, Ms. Tingting Feng, and Ms. Hao Zhang for their assistance with the references.

References

Agresti JJ, Seki S, Cnaani A, Poompuang S, Hallerman EM, Umiel N, Hulata G, Gall GAE, and May B. 2000. Breeding new strains of tilapia: Development of an artificial center of origin and linkage map based on AFLP and microsatellite loci. Aquaculture, 185:43–56.
Beckmann JS and Weber JL. 1992. Survey of human and rat microsatellites. Genomics, 12:627–631.
Cairney M, Taggart JB, and Hoyheim B. 2000. Characterization of microsatellite and minisatellite loci in Atlantic salmon (Salmo salar L.) and cross-species amplification in other salmonids. Mol Ecol, 9:2175–2178.
Campbell D and Bernatchez L. 2004. Generic scan using AFLP markers as a means to assess the role of directional selection in the divergence of sympatric whitefish ecotypes. Mol Biol Evol, 21:945–956.
Cardoso SRS, Eloy NB, Provan J, Cardoso MA, and Ferreira PCG. 2000. Genetic differentiation of Euterpe edulis Mart. populations estimated by AFLP analysis. Mol Ecol, 9:1753–1760.
Chong LK, Tan SG, Yusoff K, and Siraj SS. 2000. Identification and characterization of Malaysian river catfish, Mystus nemurus (C&V): RAPD and AFLP analysis. Biochem Genet, 38:63–76.

Crawford AM and Cuthbertson RP. 1996. Mutations in sheep microsatellites. Genome Res, 6:876–879.
Crollius HR, Jaillon O, Dasilva C, Ozouf-Costaz C, Fizames C, Fischer C, Bouneau L, Billault A, Quetier F, Saurin W, et al. 2000. Characterization and repeat analysis of the compact genome of the freshwater pufferfish Tetraodon nigroviridis. Genome Res, 10:939–949.
Edwards YJK, Elgar G, Clark MS, and Bishop MJ. 1998. The identification and characterization of microsatellites in the compact genome of the Japanese pufferfish, Fugu rubripes: Perspectives in functional and comparative genomic analyses. J Mol Biol, 278:843–854.
Ellegren H. 2000. Microsatellite mutations in the germline: Implications for evolutionary inference. Trends Genet, 16:551–558.
Felip A, Martinez-Rodriguez G, Piferrer F, Carrillo M, and Zanuy S. 2000. AFLP analysis confirms exclusive maternal genomic contribution of meiogynogenetic sea bass (Dicentrarchus labrax L.). Mar Biotechnol, 2:301–306.
Felip A, Young WP, Wheeler PA, and Thorgaard GH. 2005. An AFLP-based approach for the identification of sex-linked markers in rainbow trout (Oncorhynchus mykiss). Aquaculture, 247:35–43.
Fitzsimmons NN, Moritz C, and Moore SS. 1995. Conservation and dynamics of microsatellite loci over 300-million years of marine turtle evolution. Mol Biol Evol, 12:432–440.
Griffiths R and Orr K. 1999. The use of amplified fragment length polymorphism (AFLP) in the isolation of sex-specific markers. Mol Ecol, 8:671–674.
Hunter RL and Markert CL. 1957. Histochemical demonstration of enzymes separated by zone electrophoresis in starch gels. Science, 124:1294–1295.
Jones AG, Rosenqvist E, Berglund A, and Avise JC. 1999. Clustered microsatellite mutations in the pipefish Syngnathus typhle. Genetics, 152:1057–1063.
Jorde PE, Palm S, and Ryman N. 1999. Estimating genetic drift and effective population size from temporal shifts in dominant gene marker frequencies. Mol Ecol, 8:1171–1178.
Kai Y, Nakayama K, and Nakabo T. 2002. Genetic differences among three colour morphotypes of the black rockfish, Sebastes inermis, inferred from mtDNA and AFLP analyses. Mol Ecol, 11:2591–2598.
Karsi A, Cao D, Li P, Patterson A, Kocabas A, Feng J, Ju Z, Mickett KD, and Liu Z. 2002. Transcriptome analysis of channel catfish (Ictalurus punctatus): Initial analysis of gene expression and microsatellite-containing cDNAs in the skin. Gene, 285:157–168.
Kijas JMH, Fowler JCS, Garbett CA, and Thomas MR. 1994. Enrichment of microsatellites from the citrus genome using biotinylated oligonucleotide sequences bound to streptavidin-coated magnetic particles. Biotechniques, 16:656–662.
Kocher TD, Lee WJ, Sobolewska H, Penman D, and McAndrew B. 1998. A genetic linkage map of a cichlid fish, the tilapia (Oreochromis niloticus). Genetics, 148:1225–1232.
Kucuktas H and Liu Z. 2007. Allozyme and mitochondrial markers. In: Aquaculture Genome Technologies, edited by Z Liu. Blackwell Publishing, Ames, IA, pp. 73–85.
Kucuktas H, Wagner BK, Shopen R, Gibson M, Dunham RA, and Liu ZJ. 2002. Genetic analysis of Ozark hellbenders (Cryptobranchus alleganiensis bishopi) utilizing RAPD markers. Proc Ann Conf SEAFWA, 55:126–137.
Leclerc D, Wirth T, and Bernatchez L. 2000. Isolation and characterization of microsatellite loci in the yellow perch (Perca flavescens), and cross-species amplification within the family Percidae. Mol Ecol, 9:995–997.
Li WH. 1997. Genome organization and evolution. In: Molecular Evolution, edited by WH Li. Sinauer Associates, Inc, Sunderland, MA.
Li L and Guo XM. 2004. AFLP-based genetic linkage maps of the Pacific oyster Crassostrea gigas Thunberg. Mar Biotechnol, 6:26–36.

Li YT, Byrne K, Miggiano E, Whan V, Moore S, Keys S, Crocos P, Preston N, and Lehnert S. 2003. Genetic mapping of the kuruma prawn Penaeus japonicus using AFLP markers. Aquaculture, 219:143–156.
Liu H, Jiang Y, Wang S, Ninwichian P, Somridhivej B, Xu P, Abernathy J, Kucuktas H, and Liu ZJ. 2009. Comparative analysis of catfish BAC end sequences with the zebrafish genome. BMC Genomics, 10:592.
Liu Z, Nichols A, Li P, and Dunham RA. 1998. Inheritance and usefulness of AFLP markers in channel catfish (Ictalurus punctatus), blue catfish (I. furcatus), and their F1, F2, and backcross hybrids. Mol Gen Genet, 258:260–268.
Liu ZJ. 2007a. Marking the genome: Restriction fragment length polymorphism (RFLP). In: Aquaculture Genome Technologies, edited by ZJ Liu. Blackwell Publishing, Ames, IA, pp. 11–20.
Liu ZJ. 2007b. Random amplified polymorphic DNA (RAPD). In: Aquaculture Genome Technologies, edited by ZJ Liu. Blackwell Publishing, Ames, IA, pp. 21–28.
Liu ZJ. 2007c. Amplified fragment length polymorphism (AFLP). In: Aquaculture Genome Technologies, edited by ZJ Liu. Blackwell Publishing, Ames, IA, pp. 29–42.
Liu ZJ. 2007d. Single nucleotide polymorphism (SNP). In: Aquaculture Genome Technologies, edited by ZJ Liu. Blackwell Publishing, Ames, IA, pp. 59–72.
Liu ZJ and Cordes JF. 2004. DNA marker technologies and their applications in aquaculture genetics (vol 238, pg 1, 2004). Aquaculture, 242:735–736.
Liu ZJ, Li P, Kucuktas H, Nichols A, Tan G, Zheng XM, Argue BJ, and Dunham RA. 1999. Development of amplified fragment length polymorphism (AFLP) markers suitable for genetic linkage mapping of catfish. Trans Am Fish Soc, 128:317–327.
Liu ZJ, Li P, Kocabas A, Karsi A, and Ju ZL. 2001. Microsatellite-containing genes from the channel catfish brain: Evidence of trinucleotide repeat expansion in the coding region of nucleotide excision repair gene RAD23B. Biochem Biophys Res Commun, 289:317–324.
Liu ZJ, Karsi A, Li P, Cao DF, and Dunham R. 2003. An AFLP-based genetic linkage map of channel catfish (Ictalurus punctatus) constructed by using an interspecific hybrid resource family. Genetics, 165:687–694.
Lyall JEW, Brown GM, Furlong RA, Ferguson-Smith MA, and Affara NA. 1993. A method for creating chromosome-specific plasmid libraries enriched in clones containing [CA]n microsatellite repeat sequences directly from flow-sorted chromosomes. Nucleic Acids Res, 21:4641–4642.
Mackiewicz M, Fletcher DE, Wilkins SD, DeWoody JA, and Avise JC. 2002. A genetic assessment of parentage in a natural population of dollar sunfish (Lepomis marginatus) based on microsatellite markers. Mol Ecol, 11:1877–1883.
May B. 2003. Allozyme variation. In: Population Genetics: Principles and Applications for Fisheries Scientists, edited by EM Hallerman. American Fisheries Society, Bethesda, MD, pp. 23–36.
Metzgar D, Bytof J, and Wills C. 2000. Selection against frameshift mutations limits microsatellite expansion in coding DNA. Genome Res, 10:72–80.
Mickett K, Morton C, Feng J, Li P, Simmons M, Cao D, Dunham RA, and Liu Z. 2003. Assessing genetic diversity of domestic populations of channel catfish (Ictalurus punctatus) in Alabama using AFLP markers. Aquaculture, 228:91–105.
Mock KE, Brim-Box JC, Miller MP, Downing ME, and Hoeh WR. 2004. Genetic diversity and divergence among freshwater mussel (Anodonta) populations in the Bonneville Basin of Utah. Mol Ecol, 13:1085–1098.
Moran C. 1993. Microsatellite repeats in pig (Sus domestica) and chicken (Gallus domesticus) genomes. J Hered, 84:274–280.

Okumuş İ and Çiftci Y. 2003. Fish population genetics and molecular markers: II. Molecular markers and their applications in fisheries and aquaculture. Turk J Fish Aquat Sci, 3:51–79.
Ostrander EA, Jong PM, Rine J, and Duyk G. 1992. Construction of small insert genomic DNA libraries highly enriched for microsatellite repeat sequences. Proc Natl Acad Sci U S A, 89:3419–3423.
Parker PG, Snow AA, Schug MD, Booton GC, and Fuerst PA. 1998. What molecules can tell us about populations: Choosing and using a molecular marker. Ecology, 79:361–382.
Rico C, Rico I, and Hewitt G. 1996. 470 million years of conservation of microsatellite loci among fish species. Proc Biol Sci, 263:549–557.
Robison BD, Wheeler PA, Sundin K, Sikka P, and Thorgaard GH. 2001. Composite interval mapping reveals a major locus influencing embryonic development rate in rainbow trout (Oncorhynchus mykiss). J Hered, 92:16–22.
Rogers SM, Campbell D, Baird SJ, Danzmann RG, and Bernatchez L. 2001. Combining the analyses of introgressive hybridisation and linkage mapping to investigate the genetic architecture of population divergence in the lake whitefish (Coregonus clupeaformis Mitchill). Genetica, 111:25–41.
Seki S, Agresti JJ, Gall GAE, Taniguchi N, and May B. 1999. AFLP analysis of genetic diversity in three populations of ayu Plecoglossus altivelis. Fish Sci, 65:888–892.
Serapion J, Kucuktas H, Feng J, and Liu Z. 2004. Bioinformatic mining of type I microsatellites from expressed sequence tags of channel catfish (Ictalurus punctatus). Mar Biotechnol (NY), 6:364–377.
Simmons M, Mickett K, Kucuktas H, Li P, Dunham R, and Liu ZJ. 2006. Comparison of domestic and wild channel catfish (Ictalurus punctatus) populations provides no evidence for genetic impact. Aquaculture, 252:133–146.
Smith MH and Chesser RK. 1981. Rationale for conserving genetic-variation of fish gene pools. Ecol Bull, 13–20.
Somridhivej B, Wang SL, Sha ZX, Liu H, Quilang J, Xu P, Li P, Hue ZL, and Liu ZJ. 2008. Characterization, polymorphism assessment, and database construction for microsatellites from BAC end sequences of channel catfish (Ictalurus punctatus): A resource for integration of linkage and physical maps. Aquaculture, 275:76–80.
Southern EM. 1975. Detection of specific sequences among DNA fragments separated by gel electrophoresis. J Mol Biol, 98:503–517.
Steele CA, Wheeler PA, and Thorgaard GH. 2008. Mitochondrial and maternal effects on growth in clonal rainbow. Plant and Animal Genome Conference XVI, San Diego, CA.
Steinberg EK, Lindner KR, Gallea J, Maxwell A, Meng J, and Allendorf FW. 2002. Rates and patterns of microsatellite mutations in pink salmon. Mol Biol Evol, 19:1198–1202.
Sun Y, Song W-Q, Zhong Y-C, Zhang R-S, Abatzopoulos TJ, and Chen R-Y. 1999. Diversity and genetic differentiation in Artemia species and populations detected by AFLP markers. Int J Salt Lake Res, 8:341–350.
Tautz D. 1989. Hypervariability of simple sequences as a general source for polymorphic DNA markers. Nucleic Acids Res, 17:6463–6471.
Toth G, Gaspari Z, and Jurka J. 2000. Microsatellites in different eukaryotic genomes: Survey and analysis. Genome Res, 10:967–981.
Van Lith HA and Van Zutphen LF. 1996. Characterization of rabbit DNA microsatellites extracted from the EMBL nucleotide sequence database. Anim Genet, 27:387–395.
Vos P, Hogers R, Bleeker M, Reijans M, van de Lee T, Hornes M, Frijters A, Pot J, Peleman J, Kuiper M, et al. 1995. AFLP: A new technique for DNA fingerprinting. Nucleic Acids Res, 23:4407–4414.
Weber JL and Wong C. 1993. Mutation of human short tandem repeats. Hum Mol Genet, 2:1123–1128.

Welsh J and McClelland M. 1990. Fingerprinting genomes using PCR with arbitrary primers. Nucleic Acids Res, 18:7213–7218.
Whitehead A, Anderson SL, Kuivila KM, Roach JL, and May B. 2003. Genetic variation among interconnected populations of Catostomus occidentalis: Implications for distinguishing impacts of contaminants from biogeographical structuring. Mol Ecol, 12:2817–2833.
Williams JG, Kubelik AR, Livak KJ, Rafalski JA, and Tingey SV. 1990. DNA polymorphisms amplified by arbitrary primers are useful as genetic markers. Nucleic Acids Res, 18:6531–6535.
Xu P, Wang S, Liu L, Peatman E, Somridhivej B, Thimmapuram J, Gong G, and Liu Z. 2006. Channel catfish BAC-end sequences for marker development and assessment of syntenic conservation with other fish species. Anim Genet, 37:321–326.
Young WP, Ostberg CO, Keim P, and Thorgaard GH. 2001. Genetic characterization of hybridization and introgression between anadromous rainbow trout (Oncorhynchus mykiss irideus) and coastal cutthroat trout (O. clarki clarki). Mol Ecol, 10:921–930.
Zane L, Bargelloni L, and Patarnello T. 2002. Strategies for microsatellite isolation: A review. Mol Ecol, 11:1–16.
Zhang Y, Liang L, Jiang P, Li D, Lu C, and Sun X. 2008. Genome evolution trend of common carp (Cyprinus carpio L.) as revealed by the analysis of microsatellite loci in a gynogenetic family. J Genet Genomics, 35:97–103.

Chapter 2

Copy Number Variations

Jianguo Lu and Zhanjiang (John) Liu

Copy number variation (CNV) refers to a segment of DNA whose copy number differs when two or more genomes are compared. The segment may vary in size, ranging from one kilobase (kb) to several megabases (Cook and Scherer, 2008). Although copy number differences involving segments smaller than 1 kb can also technically be viewed as CNVs, the research methods and applications involved in the study of smaller tandem segments are quite different. Therefore, they are not included in the CNV discussions here (see Table 2.1). CNVs can be caused by changes in genomic architecture, including deletions, insertions, and duplications. Low copy repeats are region-specific repeat sequences that are susceptible to genomic rearrangements that result in CNVs. The size, sequence similarity, orientation, and distance between the copies of repeated sequences are important factors in causing CNVs (Lee and Lupski, 2006).

Although CNVs have been known for a long time, serious research on CNVs and their impact on genomes and genome expression is recent. Structural variations in the human genome have been studied intensely in recent years. In some cases, human structural variations in copy number, translocations, and rearrangements were found to be associated with disease (Iafrate et al., 2004; Sebat et al., 2004; Tuzun et al., 2005; Redon et al., 2006; de Smith et al., 2007). As information on CNVs accumulates, it is becoming clear that CNVs are important for genome expression and function.

In aquaculture species, research on CNVs is essentially lacking, but it is important to understand the impact of CNVs and their significance in aquaculture, in particular because many aquaculture species are teleost fish. In teleost fish, another major mechanism could also account for many instances of CNV: teleost fish went through an additional round of whole genome duplication followed by gene loss during evolution. This means that teleost genomes contain various levels of CNVs involving coding genes, ranging from almost completely tetraploid fish to almost diploid fish with duplicated genes in small proportions of the genome. Because CNVs can influence genome expression and function, it is almost certain that CNV polymorphism can influence performance traits and is therefore highly relevant to whole genome selection. However, as CNV research in aquaculture species is in its infancy, we are not able to provide information concerning applications of CNVs in whole genome-based selection. Rather, in this chapter, we provide an introduction to CNV research, summarize methods for CNV discovery, review the different approaches for CNV detection, and discuss the potential application of CNVs in aquaculture genome research.

Table 2.1 Methods summary for the detection of structural variations in the human genome.

| Types | Definitions | References |
| --- | --- | --- |
| Single-nucleotide polymorphism (SNP) | Base substitution involving only a single nucleotide | Gibbs et al. (2003) |
| Structural variant | A genomic alteration (e.g., a CNV, an insertion) that involves segments of DNA >1 kb | Feuk et al. (2006) |
| Duplication | A duplicated genomic segment >1 kb in length with >90% similarity between copies | Feuk et al. (2006) |
| Indels | Variation from insertion or deletion event involving <1 kb of DNA | Feuk et al. (2006) |
| Intermediate-sized structural variant (ISV) | A structural variant that is ∼8–40 kb in size; this can refer to a CNV or a balanced structural rearrangement | Tuzun et al. (2005) |
| Low copy repeat (LCR) | Similar to segmental duplication | Lupski (1998) |
| Multisite variant (MSV) | Complex polymorphic variation that is neither a PSV nor a SNP | Fredman et al. (2004) |
| Paralogous sequence variant (PSV) | Sequence difference between duplicated copies (paralogs) | Eichler (2001) |
| Segmental duplication | Duplicated region ranging from 1 kb upward with a sequence identity of >90% | Eichler (2001); Sharp et al. (2005) |
| Interchromosomal duplication | Duplications distributed among nonhomologous chromosomes | Eichler (2001) |
| Intrachromosomal duplication | Duplications restricted to a single chromosome | Eichler (2001) |
| Copy number variant (CNV) | A segment of DNA that is 1 kb or larger and is present at a variable copy number in comparison with a reference genome | Feuk et al. (2006) |
| Copy number polymorphism (CNP) | A CNV that occurs in more than 1% of the population; originally, this definition was used to refer to all CNVs | Sebat et al. (2004) |
| Inversion | A segment of DNA that is reversed in orientation with respect to the rest of the chromosome; pericentric inversions include the centromere, whereas paracentric inversions do not | Feuk et al. (2006) |
| Translocation and rearrangement | A change in position of a chromosomal segment within a genome that involves no change to the total DNA content; translocations can be intra- or interchromosomal | Feuk et al. (2006) |

Characteristics of CNVs

As the term itself suggests, CNV refers to any change in chromosome structure that results in a change of copy number. This includes the insertion or deletion of segments in some, but not all, genomes in the population of a species, as well as translocations or rearrangements that join two formerly separated DNA sequences and lead to a net difference in copy number among genomes. For scientific communication, however, the definition of CNVs has been quite dynamic. The original definition is that CNVs are intra- or interchromosomal duplications or deletions of segments larger than 1 kb, not including high copy number repetitive sequences such as long interspersed nuclear elements (LINEs) or pericentromeric tandemly repeated DNA sequences (Feuk et al., 2006; Freeman et al., 2006). However, CNVs smaller than 1 kb, and complex structures within these CNVs, have been reported among humans using high-resolution genome maps (Korbel et al., 2007; Kidd et al., 2008). Thousands of insertion and deletion polymorphisms less than 1 kb in length have been detected and also referred to as CNVs (Mills et al., 2006). Hence, the broad-sense definition of CNVs is often expanded to include gains and losses of DNA segments of a few hundred bases and larger (Gokcumen and Lee, 2009). It was suggested that CNVs should not cover insertions or deletions of transposable elements, in order to reduce the complexity of CNV analysis (“The Effects of Genomic Structural Variation on Gene Expression and Human Disease Workshop,” The Wellcome Trust Sanger Institute, Hinxton, UK, November 27–28, 2005). Therefore, CNVs include copy number polymorphisms (CNPs; Sebat et al., 2004), large-scale copy number variants (LCVs; Iafrate et al., 2004), and intermediate-sized variants (ISVs; Tuzun et al., 2005), but do not encompass retroposon insertions (Table 2.1).
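The competing size-based definitions can be made concrete with a small sketch that labels a gained or lost segment by its length. The 1 kb threshold comes from the definitions above; the 200 bp lower bound for the broad-sense definition is an assumption chosen only for illustration, since the text says no more than "a few hundred bases". The function name and labels are hypothetical.

```python
# Illustrative sketch only. ORIGINAL_MIN_BP follows the 1 kb definition in
# the text; BROAD_SENSE_MIN_BP = 200 is an assumed stand-in for "a few
# hundred bases" and is not a published cutoff.
ORIGINAL_MIN_BP = 1_000
BROAD_SENSE_MIN_BP = 200


def classify_segment(length_bp: int, is_transposable_element: bool = False) -> str:
    """Label a gained or lost DNA segment according to its length."""
    if length_bp <= 0:
        raise ValueError("segment length must be positive")
    if is_transposable_element:
        # Transposable element insertions/deletions are left out of CNVs.
        return "excluded (transposable element)"
    if length_bp >= ORIGINAL_MIN_BP:
        return "CNV (original definition)"
    if length_bp >= BROAD_SENSE_MIN_BP:
        return "CNV (broad-sense definition only)"
    return "small indel"


for length in (50, 400, 5_000):
    print(length, classify_segment(length))
```

The same segment can therefore be counted or ignored depending on which definition a study adopts, which matters when CNV totals are compared across reports.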

Impact of CNVs on Gene Expression and Phenotypes

CNVs are an important source of variation in evolution and have been found to be involved in many human diseases, such as developmental disorders, mental diseases, and cancer. In human populations, CNV was shown to represent a major type of polymorphism; approximately 12% of the human genome is subject to CNV (Redon et al., 2006). Although the extent of CNV is at present unknown for many other species, it is reasonable to assume that CNVs are a huge source of genome variation in most, if not all, species. In teleost fish species, because of the additional round of genome duplication followed by gene loss, CNVs could be one of the largest sources of genome variation, and their impact on phenotypes could be tremendous. CNVs can arise during both meiotic and somatic division processes. While the meiotic origin of CNVs is well documented, good examples of CNVs derived from somatic processes also exist. For example, monozygotic (identical) twins display different DNA CNV profiles (Bruder et al., 2008), and CNVs even vary among differentiated human tissues and organs from the same individual (Piotrowski et al., 2008); both observations demonstrate the mitotic origin of some CNVs.

CNVs can have great phenotypic impact by adjusting gene dosage, disturbing coding sequences, or regulating long-range gene expression (Kleinjan and van Heyningen, 2005). Gene expression levels can be positively correlated with copy number increment (Somerville et al., 2005; McCarroll et al., 2006) or negatively correlated with copy number increment (Lee et al., 2006). For example, the deletion of a transcriptional repressor can increase gene expression. CNVs account for at least 17.7% of heritable variation in gene expression in humans (Stranger et al., 2007). Gain and loss of gene functions can have both beneficial and detrimental effects on the organism. This is particularly true for dosage-sensitive genes. Most CNV research to date has been conducted in humans. However, because of the importance of CNV polymorphism, CNV research has recently been conducted in other species, including agriculturally important species such as cattle (Liu et al., 2008, 2010) and chicken (Völker et al., 2010). As CNV research in agricultural species is still at an early stage, the phenotypic impact of CNVs awaits further elucidation.
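The positive and negative relationships between copy number and expression described above can be illustrated with a Pearson correlation. The sketch below uses invented numbers purely for illustration; they are not measurements from any of the cited studies, and the variable names are hypothetical.

```python
# Illustrative sketch only: all values are made up to show the two
# directions of correlation; none come from the cited studies.
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


copy_number = [1, 2, 2, 3, 4]              # copies of a segment per individual
dosage_gene = [0.6, 1.0, 1.1, 1.4, 2.1]    # expression rising with dosage
repressed_gene = [1.9, 1.0, 1.1, 0.7, 0.4] # target of a repressor in the segment

print(round(pearson(copy_number, dosage_gene), 3))
print(round(pearson(copy_number, repressed_gene), 3))
```

In the second case the segment is imagined to carry a repressor, so more copies mean lower expression of its target, mirroring the repressor-deletion example in the text.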

Methods for CNV Detection and Analysis

Microscopic Level Analysis of Structural Variation

CNVs can be detected at the microscopic level through karyotyping. At the molecular level, CNVs can be caused by translocations, inversions, deletions, and duplications. If the chromosomal segments involved are large, such chromosomal alterations can be detected by cytogenetic techniques such as karyotype analysis. With improved chromosome banding techniques, many structural variations and structural abnormalities have been identified, especially in disease samples (Jacobs et al., 1978, 1992; Coco and Penchaszadeh, 1982; Warburton, 1991; Barber et al., 1998; Kim et al., 1999). Moreover, with fluorescence in situ hybridization (FISH), structural variations can be discerned even when a small chromosomal segment is involved. Chromosome banding also allows detection of a great variety of heteromorphisms. The most commonly detected heteromorphisms involve increases in length or inversions in human chromosome 9 (Verma et al., 1978). The structural variations of this region may involve unequal exchanges and repetitive sequences at recombination positions near the centromere (Starke et al., 2002). It should be noted that although cytogenetic techniques such as chromosome banding and FISH can detect CNVs, they are insensitive and lack the ability to detect genome-wide CNVs of various sizes.

Array Comparative Genome Hybridization (CGH) In recent years, a number of experimental approaches and computational strategies were used to detect human genome structural variations with different resolutions

Copy Number Variations

25

(Table 2.1). The most popular approach for the analysis of CNV is the array-based CGH (array CGH or a-CGH). Array CGH is also called molecular karyotyping. It is a technique to scan the genome for gains and losses of chromosomal segments to discover CNVs (SolinasToldo et al., 1997; Pinkel et al., 1998; Lucito et al., 2003; Iafrate et al., 2004; Sebat et al., 2004; Selzer et al., 2005; Tyson et al., 2005). It is a hybridization-based approach using array as a platform. The use of array allowed for the placement of a large number of features (target sequences, sometimes also referred to as probes; but we will use features to avoid confusion with hybridization probes), which in this case are short sequence oligos based on the reference genome sequence. For instance, if the genome of interest is 1 billion base pairs (bp) in size, various numbers of features can be designed to provide the desired resolution. For instance, if one would like to know the copy number situation across the entire genome with one feature every 100 kb, a total of 10,000 short oligos would be required, with each of them designed based on the reference genome sequence with a spacing of 100 kb among them. Array CGH is the most widely used approach for the analysis of CNVs. The first step of making an array CGH is to place short oligo features representing very short genomic DNA segments spanning the entire genome on arrays (sometimes also referred to as microarrays because of the high density of features). The number of probes depends on the level of the resolution. For example, for a genome with a size of 1 billion bp, a set of 10,000 evenly spaced features would allow detection of CNVs at a resolution of one probe per 100 kb. The higher the resolution desired, the more target sequence features are needed. In an ideal situation, if short oligos of 100 bp is used with no spacing among them, 10 million features would cover the entire 1 Gb genome. 
Such a design would provide a complete “scan” of the entire genome for any possible CNVs. In practice, however, the number of features is dictated by a balance between resolution and cost: more features give greater resolution but also cost more. Most often, an interfeature spacing of 50–100 kb is used. The tens of thousands of features can be derived from gene coding regions or from noncoding regions of the genome, depending on the purpose of the experiment.

The second step is to fluorescently label the genomic DNA from a test sample and from a normal reference sample using different fluorophores, for example, Cy3 and Cy5. The idea is that when equal amounts of genomic DNA from the test and reference samples are used, hybridization of the Cy3-labeled (say, reference) and Cy5-labeled (say, test) DNA will generate equal signals, and thereby yellow fluorescence, if there are no CNVs. If there are CNVs between the reference and test samples, the hybridization signals will not be equal, generating red or green fluorescence depending on the ratio of the Cy5 and Cy3 hybridization signals. If the test sample has more copies, the Cy5 label will generate a stronger signal than the Cy3 label, and the corresponding features will appear red (Figure 2.1).

Genomes often contain highly repetitive elements that interfere with hybridization. Repetitive elements should therefore be avoided when designing the features. Nonetheless, the highly repetitive elements in the labeled genomic DNA can still cause problems. Their hybridization should therefore be blocked by competitive hybridization with nonlabeled repetitive sequences such as the COT-1 sequences of human and mouse, which are commercially available. COT-1 DNA is made of highly repetitive sequences based on genome information of the species. The term is derived from reassociation studies using Cot analysis, in which repetitive DNA reassociates rapidly. COT-1 DNA contains DNA elements with a Cot value of 1.0. In humans, COT-1 DNA is composed of highly repetitive DNA sequences, such as the Alu, LINE-1, and THE repeats. COT-1 DNA can block the repetitive sequences before the reference and test samples are hybridized to the arrays.

Figure 2.1 Principles of array comparative genome hybridization (array CGH). A large number of evenly spaced features are designed from the reference genome sequence and placed on an array. Equal amounts of reference genome (normal genome) and test genome DNA are labeled with different fluorophores, for example, Cy3 and Cy5, and hybridized to the array. The ratios of the Cy3 and Cy5 signals define CNVs. If red fluorescence is observed, the feature on the array has more copies in the test genome than in the normal genome. See color insert.

The third step of array CGH is the analysis of hybridization data based on fluorescence ratios. After hybridization, the ratio of the fluorescence intensity of the test DNA to that of the reference DNA is calculated for each feature. The ratio, upon calibration, reveals the copy number differences between the genomes. The hybridization result is measured using a microarray scanner, and feature extraction software is then used to quantify the hybridization images. Finally, the text file outputs are used for CNV analysis with CNV detection software.
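The ratio analysis in the third step can be illustrated with a minimal sketch. It assumes that calibration is done by centring the log2 ratios on the median feature (on the assumption that most features are unchanged) and that a feature is called when its centred log2 ratio exceeds a fixed threshold. The function names, the threshold of 0.3, and the intensities are hypothetical and are not taken from any particular CNV detection software.

```python
import math
from statistics import median


def log2_ratios(test_signal, reference_signal):
    """Per-feature log2(test / reference), centred on the median feature."""
    raw = [math.log2(t / r) for t, r in zip(test_signal, reference_signal)]
    centre = median(raw)  # assumed calibration: most features are unchanged
    return [value - centre for value in raw]


def call_features(ratios, threshold=0.3):
    """Label each feature as gain, loss, or neutral (threshold is hypothetical)."""
    calls = []
    for value in ratios:
        if value >= threshold:
            calls.append("gain")
        elif value <= -threshold:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls


# Hypothetical scanner intensities for seven consecutive features.
cy5_test = [1000, 980, 1520, 1490, 1010, 510, 990]
cy3_reference = [1000, 1000, 1000, 1000, 1000, 1000, 1000]

calls = call_features(log2_ratios(cy5_test, cy3_reference))
print(calls)
```

For a diploid reference, a gain from two to three copies corresponds to a log2 ratio of about 0.58 and a loss from two copies to one corresponds to a log2 ratio of -1, which is why a threshold well below 0.58 is used in this sketch.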

Figure 2.2 An example of using array CGH for the detection of chromosomal segment duplications in cancer. (Panel labels: reference DNA, cancer DNA, hybridization, array CGH.) See color insert.

Typical applications of array CGH are in cancer studies, because chromosome aberrations usually occur during tumor progression (Albertson et al., 2003), and in human genetic disease research (Albertson and Pinkel, 2003; Shaw-Smith et al., 2004). In many cancers, the malignant genome is unstable, and segmental duplications can occur in certain genomic regions depending on the cancer type. With array CGH, it is relatively easy to detect the regional duplications that lead to CNVs (Figure 2.2). The target sequences on the arrays can be designed based on the needs of the experiment. The targets can be bacterial artificial chromosomes (BACs), cDNAs, polymerase chain reaction (PCR) products, or oligonucleotides (Figure 2.1). Array CGH with BACs has also been widely used recently (Kauraniemi et al., 2001; Ishkanian et al., 2004) because it can provide comprehensive coverage of the genome, low-noise hybridization, reliable mapping data, and accessible clones. However, BACs are usually around 80–200 kb, so it is very difficult to reliably detect single-copy differences smaller than 50 kb, even when hybridization noise is low. cDNA clones have been used for array CGH to increase the resolution for analysis of single genes or partial genes (Pollack et al., 1999; Kauraniemi et al., 2001; Porkka et al., 2002; Squire et al., 2003). However, this method has two shortcomings: (1) introns are present in genomic DNA but not in cDNA, which can affect the Cy5 : Cy3 ratio during hybridization; and (2) the uneven distribution of genes in the genome (Carter, 2007) results in uneven resolution of the CNV analysis.
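The trade-off between resolution and the number of features discussed earlier in this section is simple arithmetic, sketched below for the 1-Gb example genome. The function name is hypothetical, and the calculation assumes evenly spaced features with one feature per spacing interval.

```python
def features_needed(genome_size_bp, spacing_bp):
    """Number of evenly spaced features, one per spacing_bp of genome."""
    if genome_size_bp <= 0 or spacing_bp <= 0:
        raise ValueError("genome size and spacing must be positive")
    # Integer ceiling division, so a partial last interval still gets a feature.
    return -(-genome_size_bp // spacing_bp)


genome_size = 1_000_000_000  # the 1-Gb example genome used in the text

for spacing in (100_000, 50_000, 100):
    print(spacing, features_needed(genome_size, spacing))
```

Under these assumptions, a spacing of 100 kb needs 10,000 features, a spacing of 50 kb needs 20,000, and tiling 100-bp oligos with no gaps needs 10 million.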

Multiplex Amplifiable Probe Hybridization (MAPH)

MAPH is a recently developed procedure for the analysis of CNVs in targeted genomic regions based on previously known information (Armour et al., 2000; Patsalis et al., 2005). For instance, certain genes can undergo duplication in malignant tumors, and patient DNA can be subjected to MAPH analysis to detect whether these genes are duplicated. In MAPH, target genomic DNA, along with controls in parallel, is immobilized on nylon membranes. Specific genomic segments previously known to be involved in duplications are used as probes. After hybridization and washing away of all unbound probes, the hybridized probes are released and then quantified by PCR in comparison with the control DNA samples. This method is highly useful for cancer studies, but its application in aquaculture is limited because the genomic regions subject to duplication are not yet known in aquaculture species.

Multiplex Ligation-dependent Probe Amplification (MLPA)

MLPA is another recently developed method for the analysis of CNVs (Schouten et al., 2002). In the MLPA technique, two “half probes” are designed to hybridize adjacent to each other on the target, each carrying a universal primer sequence at its outer end. Upon hybridization to the target DNA template, the two half probes are brought into proximity, which allows ligation. Ligation generates a joined molecule that can be amplified by PCR using the known universal primer sequences. The key measurement is the number of half-probe pairs hybridizing to the target: the amount of ligated probe produced is proportional to the target copy number. Through quantitative PCR of the ligated products, the copy number of the targets is quantified. This approach, in spite of its high specificity, also depends on prior knowledge of duplicated regions for the design of the half probes. Therefore, its application in aquaculture species is limited. Similar approaches, such as quantitative multiplex PCR of short fluorescent fragments (QMPSF), suffer from the same limitation, namely the need for prior knowledge to design the fluorescent primers (Charbonnier et al., 2000), and are therefore not highly useful for aquaculture.
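Because the amount of ligated product is proportional to the target copy number, the quantification step can be sketched as a ratio of normalized probe signals. The sketch below is an assumption about how such a comparison might be set up, not the published MLPA protocol: each probe signal is divided by the mean of reference probes assumed to be present in two copies, and the sample is then compared with a control to give what is commonly called a dosage quotient. All probe names and signal values are hypothetical.

```python
def dosage_quotients(sample, control, reference_probes):
    """Ratio of normalized probe signals in a sample versus a control."""

    def normalize(signals):
        # Divide every probe by the mean signal of the reference probes.
        ref_mean = sum(signals[p] for p in reference_probes) / len(reference_probes)
        return {probe: value / ref_mean for probe, value in signals.items()}

    sample_norm = normalize(sample)
    control_norm = normalize(control)
    return {probe: sample_norm[probe] / control_norm[probe] for probe in sample}


def estimated_copies(quotient, normal_copies=2):
    """Convert a dosage quotient to a copy number, assuming a diploid control."""
    return round(quotient * normal_copies)


# Hypothetical peak signals; ref1 and ref2 are assumed two-copy reference probes.
control = {"ref1": 1000, "ref2": 1200, "geneA": 800, "geneB": 900}
sample = {"ref1": 500, "ref2": 600, "geneA": 600, "geneB": 225}

quotients = dosage_quotients(sample, control, ["ref1", "ref2"])
copies = {probe: estimated_copies(q) for probe, q in quotients.items()}
print(copies)
```

Normalizing to reference probes first removes the effect of different amounts of input DNA, so that only probe-specific differences remain in the quotient.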

Computational Approach for CNV Detection

Although the array-based CNV detection methods are powerful, their applications are limited by array density (array CGH) as well as by cost. CNVs can also be detected computationally if genome sequences are available. In that case, there is no limitation on resolution, and CNVs can be detected at the nucleotide level. The limitation is the current unavailability of genome sequences: with the exception of humans, genome sequences from multiple individuals are not available for any species at present. However, with the capacity of next generation sequencing, genome sequences from multiple individuals of the same species will soon become available for many species, including agriculturally important species and perhaps even some major aquaculture species.

Paired-end Mapping (PEM) Based on Next Generation Sequencing

A large-scale CNV detection strategy, PEM, was developed recently (Korbel et al., 2007) based on next generation sequencing (Figure 2.3). With PEM, genomic DNA is first sheared into ∼3-kb fragments. The 3-kb fragments are ligated to biotinylated adaptors, circularized, and then linearized by random shearing. The biotinylated adaptors mark the ends of the genomic fragments and allow the researchers to trace the orientation of the sequences. The resulting paired ends are then sequenced massively using next generation sequencing.

Figure 2.3 Principles of paired-end mapping-based CNV detection. Genomic DNA is sheared into approximately 3-kb fragments. The genomic fragments are then ligated to biotinylated adaptors to mark the orientation. The segments are circularized, followed by linearization at random sites. Next generation sequencing is used to massively sequence the segments. Bioinformatic mapping by in silico positioning of the sequences on the reference genome detects any size or orientation difference, which suggests genome structural variations including CNVs. (The figure includes a histogram of the span of paired ends against count.) See color insert.

The sequences generated with next generation sequencing are then mapped to the reference genome sequence in silico. A deviation of the paired reads from the reference genome sequence in size (maximal 3.0 kb, as dictated by the fragment size) or in orientation (as marked by the orientation of the adaptors) provides evidence of structural variation. If a deletion or insertion is involved, a size difference is expected; if an inversion is involved, the sequence orientation is expected to be different (Figure 2.3). Any internal segmental duplication within the sequenced segments would increase the size of the sequenced segments.
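The size and orientation test described above can be sketched as a small classifier. This is a minimal illustration, not the published PEM pipeline. It assumes each pair has already been mapped, that a span on the reference larger than expected is read as a deletion in the sample and a smaller span as an insertion (the usual interpretation, although the text above only states that a size difference is expected), and that a tolerance of 500 bp absorbs the natural variation in fragment size. The class name, field names, and tolerance are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MappedPair:
    span: int                   # distance between the two ends on the reference, in bp
    expected_orientation: bool  # True if the ends map in the expected relative orientation


def classify_pair(pair, expected_span=3000, tolerance=500):
    """Classify one mapped pair by its span and orientation on the reference."""
    if not pair.expected_orientation:
        return "inversion"
    if pair.span > expected_span + tolerance:
        # Ends lie farther apart on the reference than in the sample fragment.
        return "deletion"
    if pair.span < expected_span - tolerance:
        # Ends lie closer together on the reference than in the sample fragment.
        return "insertion"
    return "concordant"


pairs = [
    MappedPair(3100, True),
    MappedPair(6000, True),
    MappedPair(1200, True),
    MappedPair(3000, False),
]
labels = [classify_pair(p) for p in pairs]
print(labels)
```

In practice a single discordant pair is weak evidence, so several independent pairs supporting the same event would normally be required before a variant is reported.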

Cross-species Computational Analysis

The computational approach is also useful for the analysis of gene CNVs among species for which reference genome sequences are available. For instance, reference genome sequences are available for zebrafish, medaka, stickleback, fugu, and Tetraodon. Analysis of gene CNVs among these species would provide perspectives on how duplicated genomes become diploidized (Lu et al., manuscript in preparation), providing insight into genome evolution.

The advantage of the computational approach for the analysis of CNVs is its economy: no large investment of resources is needed if the genome sequences are already available. It also has the advantage of detecting all kinds of structural variations, including translocations, inversions, large-scale CNVs (>50 kb), insertions or deletions (1–50 kb), and small sequence variants (<1 kb) (Feuk et al., 2006). The major disadvantage is that computational analysis requires the availability of genome sequences.

Genome-wide Association Studies Using CNV

Genome-wide association studies have successfully linked genetic variants with susceptibility to a wide range of common polygenic diseases. Such studies, however, have focused almost exclusively on single-nucleotide polymorphisms (SNPs). Recent studies have suggested that CNVs may contribute significantly to genetic predisposition to several common diseases (Gonzalez et al., 2005; Aitman et al., 2006). Initially, CNV effects were most often approached by tagging CNVs with SNPs, but this approach is not without problems, which has led to the development of association studies that target CNVs directly. Direct CNV association analysis requires maps of the common CNV polymorphisms in the genome of interest, which are not available for most aquaculture species at the moment. In the future, it is almost certain that CNV polymorphisms will be found to be important in aquaculture species. Readers are reminded that the teleost fish have undergone an additional round of genome duplication followed by various levels of diploidization. CNVs are therefore expected to prove more important in teleost fish species than in other vertebrate species such as humans.

References Aitman TJ, Dong R, Vyse TJ, et al. 2006. Copy number polymorphism in FCGR3 predisposes to glomerulonephritis in rats and humans. Nature, 439:851–855. Albertson DG and Pinkel D. 2003. Genomic microarrays in human genetic disease and cancer. Human Molecular Genetics, 12:R145–R152. Albertson DG, Collins C, McCormick F, et al. 2003. Chromosome aberrations in solid tumors. Nature Genetics, 34:369–376. Armour JAL, Sismani C, Patsalis PC, et al. 2000. Measurement of locus copy number by hybridisation with amplifiable probes. Nucleic Acids Research, 28:605–609. Barber JCK, Joyce CA, Collinson MN, et al. 1998. Duplication of 8p23.1: A cytogenetic anomaly with no established clinical significance. Journal of Medical Genetics, 35:491–496. Bruder CEG, Piotrowski A, Gijsbers AACJ, et al. 2008. Phenotypically concordant and discordant monozygotic twins display different DNA copy-number-variation profiles. American Journal of Human Genetics, 82:763–771. Carter NP. 2007. Methods and strategies for analyzing copy number variation using DNA microarrays. Nature Genetics, 39:S16–S21. Charbonnier F, Raux G, Wang Q, et al. 2000. Detection of exon deletions and duplications of the mismatch repair genes in hereditary nonpolyposis colorectal cancer families using

multiplex polymerase chain reaction of short fluorescent fragments. Cancer Research, 60:2760–2763. Coco R and Penchaszadeh VB. 1982. Cytogenetic findings in 200 children with mental retardation and multiple congenital anomalies of unknown cause. American Journal of Medical Genetics, 12:155–173. Cook EH and Scherer SW. 2008. Copy-number variations associated with neuropsychiatric conditions. Nature, 455:919–923. Eichler EE. 2001. Recent duplication, domain accretion and the dynamic mutation of the human genome. Trends in Genetics, 17:661–669. The Effects of Genomic Structural Variation on Gene Expression and Human Disease Workshop, The Wellcome Trust Sanger Institute, Hinxton, UK, November 27–28, 2005. Feuk L, Carson AR, and Scherer SW. 2006. Structural variation in the human genome. Nature Reviews Genetics, 7:85–97. Fredman D, White SJ, Potter S, et al. 2004. Complex SNP-related sequence variation in segmental genome duplications. Nature Genetics, 36:861–866. Freeman JL, Perry GH, Feuk L, et al. 2006. Copy number variation: New insights in genome diversity. Genome Research, 16:949–961. Gibbs RA, Belmont JW, Hardenbol P, et al. 2003. The International HapMap Project. Nature, 426:789–796. Gokcumen O and Lee C. 2009. Copy number variants (CNVs) in primate species using arraybased comparative genomic hybridization. Methods, 49:18–25. Gonzalez E, Kulkarni H, Bolivar H, et al. 2005. The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science, 307:1434–1440. Iafrate AJ, Feuk L, Rivera MN, et al. 2004. Detection of large-scale variation in the human genome. Nature Genetics, 36:949–951. Ishkanian AS, Malloff CA, Watson SK, et al. 2004. A tiling resolution DNA microarray with complete coverage of the human genome. Nature Genetics, 36:299–303. Jacobs PA, Matsuura JS, Mayer M, et al. 1978. A cytogenetic survey of an institution for the mentally retarded: I. Chromosome abnormalities. Clinical Genetics, 13:37–60. 
Jacobs PA, Browne C, Gregson N, et al. 1992. Estimates of the frequency of chromosomeabnormalities detectable in unselected newborns using moderate levels of banding. Journal of Medical Genetics, 29:103–108. Jones MR, Maydan JS, Flibotte S, et al. 2007. Oligonucleotide array comparative genomic hybridization (oaCGH) based characterization of genetic deficiencies as an aid to gene mapping in Caenorhabditis elegans. BMC Genomics, 8:402. Kauraniemi P, Barlund M, Monni O, et al. 2001. New amplified and highly expressed genes discovered in the ERBB2 amplicon in breast cancer by cDNA microarrays. Cancer Research, 61:8235–8240. Kidd JM, Cooper GM, Donahue WF, et al. 2008. Mapping and sequencing of structural variation from eight human genomes. Nature, 453:56–64. Kim SS, Jung SC, Kim HJ, et al. 1999. Chromosome abnormalities in a referred population for suspected chromosomal aberrations: A report of 4117 cases. Journal of Korean Medical Science, 14:373–376. Kleinjan DA and van Heyningen V. 2005. Long-range control of gene expression: Emerging mechanisms and disruption in disease. American Journal of Human Genetics, 76:8–32. Korbel JO, Urban AE, Grubert F, et al. 2007. Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proceedings of the National Academy of Sciences of the United States of America, 104:10110–10115. Lee JA and Lupski JR. 2006. Genomic rearrangements and gene copy-number alterations as a cause of nervous system disorders. Neuron, 52:103–121.

Lee JA, Madrid RE, Sperle K, et al. 2006. Spastic paraplegia type 2 associated with axonal neuropathy and apparent PLP1 position effect. Annals of Neurology, 59:398–403. Liu GE, Van Tassel CP, Sonstegard TS, et al. 2008. Detection of germline and somatic copy number variations in cattle. Developmental Biology, 132:231–237. Liu GE, Hou Y, Zhu B, et al. 2010. Analysis of copy number variations among diverse cattle breeds. Genome Research, 20:693–703. Lucito R, Healy J, Alexander J, et al. 2003. Representational oligonucleotide microarray analysis: A high-resolution method to detect genome copy number variation. Genome Research, 13:2291–2305. Lupski JR. 1998. Genomic disorders: Structural features of the genome can lead to DNA rearrangements and human disease traits. Trends in Genetics, 14:417–422. McCarroll SA, Hadnott TN, Perry GH, et al. 2006. Common deletion polymorphisms in the human genome. Nature Genetics, 38:86–92. Mills RE, Luttig CT, Larkins CE, et al. 2006. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Research, 16:1182–1190. Patsalis PC, Kousoulidou L, Sismani C, et al. 2005. MAPH: From gels to microarrays. European Journal of Medical Genetics, 48:241–249. Pinkel D, Seagraves R, Sudar D, et al. 1998. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nature Genetics, 20:207–211. Piotrowski A, Bruder CEG, Andersson R, et al. 2008. Somatic mosaicism for copy number variation in differentiated human tissues. Human Mutation, 29:1118–1124. Pollack JR, Perou CM, Alizadeh AA, et al. 1999. Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nature Genetics, 23:41–46. Ponchel F, Toomes C, Bransfield K, et al. 2003. Real-time PCR based on SYBR-Green I fluorescence: an alternative to the TaqMan assay for a relative quantification of gene rearrangements, gene amplifications and micro gene deletions. BMC Biotechnology, 3:18. 
Porkka K, Saramaki O, Tanner M, et al. 2002. Amplification and overexpression of elongin C gene discovered in prostate cancer by cDNA microarrays. Laboratory Investigation, 82:629–637. Redon R, Ishikawa S, Fitch KR, et al. 2006. Global variation in copy number in the human genome. Nature, 444:444–454. Schouten JP, McElgunn CJ, Waaijer R, et al. 2002. Relative quantification of 40 nucleic acid sequences by multiplex ligation-dependent probe amplification. Nucleic Acids Research, 30:e57. Sebat J, Lakshmi B, Troge J, et al. 2004. Large-scale copy number polymorphism in the human genome. Science, 305:525–528. Selzer RR, Richmond TA, Pofahl NJ, et al. 2005. Analysis of chromosome breakpoints in neuroblastoma at sub-kilobase resolution using fine-tiling oligonucleotide array CGH. Genes Chromosomes & Cancer, 44:305–319. Sharp AJ, Locke DP, McGrath SD, et al. 2005. Segmental duplications and copy-number variation in the human genome. American Journal of Human Genetics, 77:78–88. Shaw-Smith C, Redon R, Rickman L, et al. 2004. Microarray based comparative genomic hybridisation (array-CGH) detects submicroscopic chromosomal deletions and duplications in patients with learning disability/mental retardation and dysmorphic features. Journal of Medical Genetics, 41:241–248. de Smith AJ, Tsalenko A, Sampas N, et al. 2007. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: Implications for association studies of complex diseases. Human Molecular Genetics, 16:2783–2794. Solinas-Toldo S, Lampel S, Stilgenbauer S, et al. 1997. Matrix-based comparative genomic hybridization: Biochips to screen for genomic imbalances. Genes Chromosomes & Cancer, 20:399–407.

Somerville MJ, Mervis CB, Young EJ, et al. 2005. Severe expressive-language delay related to duplication of the Williams-Beuren locus. New England Journal of Medicine, 353:1694–1701. Squire JA, Pei JM, Marrano P, et al. 2003. High-resolution mapping of amplifications and deletions in pediatric osteosarcoma by use of CGH analysis of cDNA microarrays. Genes Chromosomes & Cancer, 38:215–225. Starke H, Seidel J, Henn W, et al. 2002. Homologous sequences at human chromosome 9 bands p12 and q13-21.1 are involved in different patterns of pericentric rearrangements. European Journal of Human Genetics, 10:790–800. Stranger BE, Forrest MS, Dunning M, et al. 2007. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science, 315:848–853. Tuzun E, Sharp AJ, Bailey JA, et al. 2005. Fine-scale structural variation of the human genome. Nature Genetics, 37:727–732. Tyson C, Harvard C, Locker R, et al. 2005. Submicroscopic deletions and duplications in individuals with intellectual disability detected by array-CGH. American Journal of Medical Genetics Part A, 139A:173–185. Verma RS, Dosik H, and Lubs HA. 1978. Size and pericentric inversion heteromorphisms of secondary constriction regions (H) of chromosomes 1, 9, and 16 as detected by Cbg technique in Caucasians—Classification, frequencies, and incidence. American Journal of Medical Genetics, 2:331–339. Völker M, Backström N, Skinner BM, et al. 2010. Copy number variation, chromosome rearrangement, and their association with recombination during avian evolution. Genome Research, 20:503–511. Warburton D. 1991. De-novo balanced chromosome rearrangements and extra marker chromosomes identified at prenatal-diagnosis—Frequency, clinical-significance and distribution of breakpoints. American Journal of Human Genetics, 49:18–18.

Chapter 3

Next Generation DNA Sequencing Technologies and Applications

Qingshu Meng and Jun Yu

Introduction

DNA sequencing is undoubtedly the most important technology in biology. Over the past three decades, DNA sequencing technology has made remarkable advances and gone through several phase changes: from radioactive to fluorescent labeling in chemistry, from slab gel to capillary in geometry, and from electrophoresis to ordered arrays in signal detection (Tettelin and Feldblyum, 2009). There is one concerted force behind all these changes: the pressure to produce more, faster, and cheaper DNA sequences that provide basic information pivotal for scientific research and personalized medicine. As a result, both parameters of DNA sequencing, per-instrument yield and unit cost, have changed dramatically, with a growth curve that outpaces the growth in computing capacity described by Moore’s law (Shendure and Ji, 2008). The “gold rush” is on for the “virtual gold”: the digital information of the genetic code of all life forms on earth.

It is therefore important for the beneficiaries of the rapidly advancing DNA technology to understand a few critical points in order to take advantage of the situation. First, one should always try to understand the potential and limits of an emerging sequencing technology, since it may soon replace the old one regardless of whether the old one dominates the current market (Ansorge, 2009). We have seen the complete disappearance of slab gel instruments and the decline of capillary instruments in the market. We also have to be aware of the market competition among the technologies and their commercial products. A wrong decision can be very costly, since the unit prices of these sequencing machines are in the range of half a million U.S. dollars. We have seen one capillary sequencer squeezed out of the market completely (even before some of its customers had opened their packages) and others terminated at early development stages. The reasons may be complicated, but the result is simple and harsh.

Second, one should always try to understand both the market and the technology niches of a particular technology and its instrumentation. For instance, a capillary sequencer such as the ABI 3730 (Applied Biosystems) is still viable and good for sequencing polymerase chain reaction (PCR) products and for other small-scale sequencing projects (Mardis, 2008). However, one may not need more than a couple of them, as the workload of an academic institution may not exceed a few hundred samples per day.

Third, one should always try to gain first-hand experience and to do one’s own accounting for the cost of operation and data production. Often, the most popular machines are not the best: they may have entered the market earlier but later prove less competitive than other platforms, especially across different applications.

Fourth, one should always try to grasp the opportunity, especially when the window of opportunity is narrow. We have encountered researchers who are always shopping around for better deals but miss the best opportunity for initiating a good project and for an excellent publication.

Fifth, once a machine is purchased, it should by all means be used as much as possible. If one does not have enough projects, one should invite others to use it, because the half-life of sequencers is very short, a few years at most. Some are upgraded every year, and the upgrades are not always free.

Finally, the development of new protocols that build upon the new technologies is critical. Usually, reagent kits are introduced late relative to the emergence of a new technology. It is therefore a great advantage to work on new protocols at the same time as one starts to adopt the new sequencing technology. After all, new protocols and applications deserve their own attention.

In this chapter, we first review the next generation DNA sequencing platforms, explaining briefly how they work as well as their relative strengths and limitations, and then devote most of the space to their applications.

Next Generation Sequencing: The Current Platforms

The Roche (454) Genome Sequencer FLX System

The GS FLX System, based on sequencing-by-synthesis (pyrosequencing) technology, was developed by 454 Life Sciences as the first next generation sequencing platform available on the market (Margulies et al., 2005). In this system, the DNA sample is first sheared into fragments. Two short adaptors, an A-adaptor and a B-adaptor, are then ligated to the fragments. The adaptors provide priming sites for amplification and sequencing, as well as a special key sequence. The B-adaptor also contains a 5′-biotin tag that allows the immobilization of the library onto streptavidin-coated magnetic beads. The double-stranded products bound to the beads are then denatured to release the complementary nonbiotinylated strands, which contain both an A- and a B-adaptor sequence. These denatured strands form the single-stranded template DNA library (Figure 3.1A).

For DNA amplification, the Genome Sequencer FLX System employs emulsion-based clonal amplification, called emPCR (Dressman et al., 2003). The single-stranded DNA library is immobilized by hybridization onto primer-coated capture beads. The process is optimized to produce beads to which a single library fragment is bound. The bead-bound library is emulsified along with the amplification reagents in a water-in-oil mixture. Each bead, with a single library fragment, is captured within its own emulsion microreactor, where independent clonal amplification takes place. After amplification, the microreactors are broken, releasing the DNA-positive beads for enrichment (Figure 3.1B).

Figure 3.1 Outline of the Roche/454 sequencer workflow. (A) Single-strand template DNA library preparation; (B) emulsion-based clonal amplification; (C) depositing DNA beads into the PicoTiterPlate device; (D) sequencing by synthesis. (Figure was adapted from www.454.com.) See color insert.

For sequencing, the DNA beads are layered onto a PicoTiterPlate device, depositing the beads into the wells, followed by enzyme beads and packing beads. The enzyme beads contain sulfurylase and luciferase, key components of the sequencing reaction, while the packing beads ensure that the DNA beads remain positioned in the wells during the sequencing reaction (Figure 3.1C). The fluidics subsystem delivers sequencing reagents, containing buffers and nucleotides, by flowing them across the wells of the plate. Nucleotides are flowed sequentially in a specific order over the PicoTiterPlate device. When a nucleotide is complementary to the next base of the template strand, it is incorporated into the growing DNA chain by the polymerase. The incorporation of a nucleotide releases a pyrophosphate molecule. The sulfurylase enzyme converts the pyrophosphate molecule into ATP using adenosine phosphosulfate. The ATP is used by the luciferase enzyme to oxidize luciferin, producing oxyluciferin and light. The light emission is detected by a CCD camera, which is coupled to the PicoTiterPlate device. The intensity of light from a particular well indicates the incorporation of nucleotides (Figure 3.1D). Across multiple cycles, the pattern of detected incorporation events reveals the sequence of the templates represented by individual beads. The sequencing is “asynchronous” in that some features may get ahead of or fall behind other features, depending on their sequence relative to the order of base addition.

Raw reads processed by the 454 platform are screened by various quality filters to remove poor-quality sequences, mixed sequences (more than one initial DNA fragment per bead), and sequences without the initiating key sequence. For downstream analysis, three different bioinformatic tools are available: GS De Novo Assembler, GS Reference Mapper, and GS Amplicon Variant Analyzer (http://454.com/products-solutions/analysis-tools/index.asp). Using these graphical analysis tools, researchers can obtain biologically meaningful results from sequence data quickly.

A major limitation of the 454 technology relates to the resolution of homopolymer-containing DNA segments, such as AAA and GGG (Rothberg and Leamon, 2008). Because there is no terminating moiety preventing multiple consecutive incorporations in a given cycle, pyrosequencing relies on the magnitude of the light emitted to determine the number of repeated bases. This is prone to a greater error rate than the discrimination of incorporation versus nonincorporation. As a consequence, the dominant error type for the 454 platform is insertion–deletion, rather than substitution. Another disadvantage of the 454 sequencing platform is that the per-base cost of sequencing is much greater than that of other next generation platforms (e.g., SOLiD and Solexa) (Rothberg and Leamon, 2008). It is therefore unsuited for sequencing targeted fragments from small numbers of DNA samples, such as those for phylogenetic analysis.

Relative to other next generation platforms, the key advantage of the 454 platform is its long read length (Metzker, 2009). The 454 system can generate more than 1,000,000 individual reads, with an improved Q20 read length of 400 bases, per 10-h instrument run, that is, a total of 400,000,000 bp per run. It may be the method of choice for certain applications where long reads are critical, such as de novo assembly and metagenomics.
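The homopolymer limitation follows directly from how the light signal is read, which the sketch below illustrates. It assumes that intensities have already been normalized so that 1.0 corresponds to one incorporated base, and that nucleotides are flowed in the repeating order T, A, C, G. The flow order, function name, and signal values are assumptions made for illustration and do not describe the instrument software.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Convert per-flow normalized light intensities into a base sequence."""
    bases = []
    for index, signal in enumerate(signals):
        nucleotide = flow_order[index % len(flow_order)]
        # Round to the nearest whole number of incorporated bases.
        count = max(0, int(signal + 0.5))
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical intensities for two rounds of the assumed T, A, C, G flow order.
signals = [1.02, 0.03, 0.97, 1.05, 0.05, 2.90, 0.10, 0.98]
read = decode_flowgram(signals)
print(read)

# A small change in a homopolymer signal changes the called length.
short_call = decode_flowgram([0.0, 2.45])
long_call = decode_flowgram([0.0, 2.55])
print(short_call, long_call)
```

A signal near 2.5 can be called as either two or three bases, so noise in the light measurement produces an insertion or deletion rather than a substitution, consistent with the dominant error type described above.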

The Illumina (Solexa) Genome Analyzer

The Solexa sequencing platform was commercialized in 2006. It is based on sequencing-by-synthesis chemistry (Figure 3.2). Input DNA is fragmented by hydrodynamic shearing to generate fragments of <800 bp. The fragments are blunt-ended and phosphorylated, and a single “A” nucleotide is added to the 3′ ends of the fragments. The DNA fragments are then ligated at both ends to adapters that have a single-base “T” overhang. After denaturation, the DNA fragments are immobilized at one end on a solid support, the flow cell. The surface of the flow cell is densely coated with the adapters and the complementary adapters. Each single-stranded fragment, immobilized at one end on the surface, creates a “bridge” structure by hybridizing with its free end to the complementary adapter on the surface of the flow cell. The adapters on the surface also act as primers for the subsequent PCR amplification.

Figure 3.2 The Illumina sequencing-by-synthesis approach. (1) Prepare genome DNA sample; (2) attach DNA to surface; (3) bridge amplification; (4) fragments become double-stranded; (5) denaturation of the double-stranded molecules; (6) complete amplification; (7) determine first base; (8) image first base; (9) determine second base; (10) image second chemistry cycle; (11) sequencing over multiple chemistry cycles; (12) align data. (From the Genome Analyzer brochure, with permission from Illumina Inc.) See color insert.

When a mixture containing the PCR amplification reagents is added to the flow cell surface, the DNA fragments are amplified by “bridge PCR” (Adessi et al., 2000; Fedurco et al., 2006). After several PCR cycles, about 1000 copies of each single-stranded DNA fragment are created on the surface, forming a surface-bound colony (the cluster). The reaction mixture for the sequencing reactions and DNA synthesis, which contains four reversible terminator nucleotides each labeled with a different fluorescent dye, is then supplied onto the surface. After incorporation into the DNA strand, the terminator nucleotide, as well as its position on the support surface, is detected and identified via its fluorescent dye by the CCD camera. The terminator group at the 3′ end of the base and the fluorescent dye are then removed from the base and the synthesis cycle is repeated. This series of steps continues for a specific number of cycles, as determined by user-defined instrument settings. A base-calling algorithm assigns sequences and associated quality values to each read, and a quality checking pipeline evaluates the Illumina data from each run, removing poor-quality sequences.

In 2008, Illumina introduced an upgrade, the Genome Analyzer II. It offered a combination of the cBot and the Paired-End Module (www.illumina.com/systems/genome_analyzer.ilmn). cBot is an automated system that creates clonal clusters from single-molecule DNA templates, preparing them for sequencing by synthesis on the Genome Analyzer. The Paired-End Module is a fluidics station that attaches to the Genome Analyzer. After completion of the first read, the templates can be regenerated in situ to prepare for the second round of sequencing from the opposite end of the fragments. First, the newly sequenced strands are stripped off and the complementary strands are bridge-amplified to form clusters. Once the original templates are cleaved and removed, the reverse strands undergo sequencing by synthesis. The Paired-End Module enables paired-end sequencing of up to 2 × 100 bp for fragments ranging from 200 bp to 5 kb. For the Genome Analyzer II, the run time is greatly reduced and the output per paired-end run can reach 45–50 gigabases (Gb).

Compared to Sanger sequencing, the Illumina system is able to produce more data in less time and at lower cost; however, error rates are higher (often resulting in false positives when identifying sequence variations) and reads are shorter (Metzker, 2009). The error rate can usually be overcome by higher coverage, but contiguity remains limited by the read length, as the Lander–Waterman model describes (Lander and Waterman, 1988).
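The relationship between coverage, read length, and contiguity mentioned above can be sketched numerically. The following is a minimal illustration of the standard simplified Lander–Waterman expressions, assuming reads are placed uniformly at random on the genome: coverage c = NL/G, expected uncovered fraction e^(-c), and expected number of contigs N·e^(-cσ) with σ = 1 − T/L, where T is the minimum overlap needed to join two reads. The function name, the 100-Mb genome, and the read counts are hypothetical values chosen for illustration.

```python
import math


def lander_waterman(num_reads, read_len, genome_size, min_overlap=0):
    """Idealized Lander-Waterman expectations for a shotgun project."""
    coverage = num_reads * read_len / genome_size
    sigma = 1 - min_overlap / read_len  # usable fraction of each read
    return {
        "coverage": coverage,
        "fraction_uncovered": math.exp(-coverage),
        "expected_contigs": num_reads * math.exp(-coverage * sigma),
    }


# Two hypothetical projects at the same 10x coverage of a 100-Mb genome:
# many short reads versus fewer long reads, both needing a 20-bp overlap.
short = lander_waterman(10_000_000, 100, 100_000_000, min_overlap=20)
long_ = lander_waterman(1_250_000, 800, 100_000_000, min_overlap=20)

for name, result in (("100-bp reads", short), ("800-bp reads", long_)):
    print(name, round(result["coverage"], 1), round(result["expected_contigs"]))
```

At equal coverage the uncovered fraction is the same for both projects, but the short-read project is expected to break into far more contigs. This is the sense in which extra coverage can compensate for errors but not fully for read length. Real genomes violate the uniformity assumption (repeats, sequencing bias), so these figures are lower bounds on fragmentation rather than predictions.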

The Applied Biosystems SOLiD System

The AB SOLiD system is based on sequencing-by-ligation technology. This platform has its origins in the system described by Shendure et al. (2005) and in work by McKernan et al. (2006) at Agencourt Personal Genomics (acquired by Applied Biosystems in 2006). The generation of a DNA fragment library and the sequencing process by successive ligation steps are shown in Figure 3.3. In this technology, two types of libraries (fragment or mate-paired) can be constructed, depending on the researcher’s purposes. DNA fragments are ligated to adapters and bound to beads. DNA fragments on the beads are amplified by emPCR. After PCR, the templates are denatured and bead enrichment is performed to select beads with extended templates. The template on the selected beads undergoes a 3′ modification to allow covalent bonding to the slide. After modification, the beads are deposited onto a glass slide.

The sequencing methodology is based on sequential ligation of dye-labeled oligonucleotides (Housby and Southern, 1998). In the first step, a primer is hybridized to the adapter sequence within the library template. Next, a set of four fluorescently labeled oligonucleotide octamers compete for ligation to the sequencing primer. In these octamers, the dinucleotide at the first and second positions is identified by one of four fluorescent labels at the end of the octamer. Detection of the fluorescence from the label thus determines bases 1 and 2 in the sequence. The ligated octamer oligonucleotides are cleaved off after the fifth base, removing the fluorescent label; hybridization and ligation cycles are then repeated, this time determining bases 6 and 7 in the sequence; in the subsequent cycle bases 11 and 12 are determined, and so on. Progressive rounds of octamer ligation thus interrogate two bases in every five. Following a series of ligation cycles, the extension product is removed and the template is reset with a primer complementary to the n-1 position for a second round of ligation cycles. After five rounds of primer reset have been completed for each sequence tag, each base has been interrogated in two independent ligation reactions by two different primers. This method is called “two-base encoding” (McKernan et al., 2006). Two-base encoding is designed to discriminate measurement errors from true sequence differences. The combination of ligase enzymology, primer reset, and two-base encoding all contribute to the low error rate and reduced systemic noise.

As of December 2009, Applied Biosystems had updated the platform to the SOLiD™ 3 Plus. The SOLiD™ 3 Plus System can generate more than 60 Gb of mappable sequence, or more than 1 billion reads, per run. The cost of the instrument is substantially lower than that of other next generation sequencing platforms. The current read length, however, significantly limits its applications (Metzker, 2009).

Figure 3.3 The ligase-mediated sequencing approach of the Applied Biosystems SOLiD sequencer. (A) Library preparation; (B) emulsion PCR/bead enrichment; (C) bead deposition; (D) sequencing by ligation; (E) primer reset and two-base encoding. (Adapted from www.appliedbiosystems.com.) See color insert.

A standard DNA sequencing workflow has traditionally included three key steps: library preparation, sequencing, and data analysis. Although these next generation sequencing platforms are quite diverse in sequencing biochemistry as well as in how the array is generated, their workflows are conceptually similar. Library preparation is accomplished by random fragmentation of DNA, followed by in vitro ligation of common adaptor sequences (see Chapter 4). The generation of clonally clustered amplicons to serve as sequencing features can be achieved by emulsion PCR (454 and SOLiD) or bridge PCR (Solexa).
The sequencing process consists of alternating cycles of enzyme-driven biochemistry and imaging-based data acquisition. Pyrosequencing (454) uses chemiluminescence-based detection of the pyrophosphate released upon incorporation of a nucleotide by the DNA polymerase. Like 454/Roche, the Illumina Genome Analyzer uses sequencing by synthesis, albeit with a different detection chemistry. In contrast to the polymerase-based approaches, the SOLiD system uses a sequencing-by-ligation approach in which the sequence is inferred indirectly via successive rounds of hybridization and ligation events. A comparison of the next generation sequencing platforms is given in Table 3.1.

Global advantages of next generation sequencing strategies, relative to Sanger sequencing, include the following. (1) In vitro construction of a sequencing library, followed by in vitro clonal amplification to generate sequencing features, circumvents several bottlenecks that restrict the parallelism of conventional sequencing (i.e., transformation of Escherichia coli and colony picking). (2) Array-based sequencing enables a much higher degree of parallelism than conventional capillary-based sequencing. As the effective size of sequencing features can be on the order of 1 μm, hundreds of millions of sequencing reads can potentially be obtained in parallel by rastered imaging of a reasonably sized surface area. (3) Because array features are immobilized on a planar surface, they can be enzymatically manipulated by a single reagent volume. Although microliter-scale reagent volumes are used in practice, these are essentially amortized over the full set of sequencing features on the array, dropping the effective reagent volume per feature to the scale of picoliters or femtoliters. Collectively, these differences translate into dramatically lower costs for DNA sequence production.

Table 3.1 Comparison of next generation sequencing platforms as of January 1, 2010.

Platform | Sequencing chemistry | Template amplification | Read length (bp) | Run time (days) | Gb per run
Roche 454 | Pyrosequencing | Emulsion PCR | 400 | 0.4 | 0.4–0.6
Illumina | Reversible terminator, sequencing by synthesis | Bridge PCR | 100 | 4 (a), 9.5 (b) | 22.5–25 (a), 45–50 (b)
ABI SOLiD | Sequencing by ligation | Emulsion PCR | 50 | 6–7 (a), 12–14 (b) | 25–30 (a), 50–60 (b)

(a) Single-end library. (b) Paired-end library.

The advantages of next generation DNA sequencing strategies are currently offset by several disadvantages. The most prominent of these are read length and raw accuracy. Although these limitations create important algorithmic challenges for the immediate future, we should bear in mind that these technologies will continue to improve with respect to these parameters, much as conventional sequencing progressed gradually over three decades to reach its current level of technical performance. As both Illumina and Life Technologies (Applied Biosystems) have already announced a new phase of throughput increase, to hundreds of gigabases per instrument run, we should anticipate that sequencing will keep getting cheaper and faster, and that its applications will become increasingly important.
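The two-base encoding used by the SOLiD system, described earlier in this section, can be illustrated with a short sketch. It assumes the commonly described color scheme in which each of the four dyes represents four of the sixteen possible dinucleotides, so that a color is equivalent to the XOR of 2-bit codes for the two adjacent bases (A=0, C=1, G=2, T=3). The numeric color labels 0 to 3, the function names, and the example sequences are illustrative assumptions, not part of the instrument software.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colors(seq):
    """Encode a base sequence as colors, one color per adjacent base pair."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and its colors."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


reference = "ATGGCA"
variant = "ATGTCA"  # hypothetical single-base change at the fourth position

ref_colors = encode_colors(reference)
var_colors = encode_colors(variant)
changed = [i for i, (r, v) in enumerate(zip(ref_colors, var_colors)) if r != v]

print(ref_colors)  # [3, 1, 0, 3, 1]
print(var_colors)  # [3, 1, 1, 2, 1]
print(changed)     # [2, 3]
print(decode_colors("A", ref_colors))  # ATGGCA
```

Because every base contributes to two adjacent colors, a true single-base difference changes two neighboring colors, whereas an isolated single-color difference is more likely to be a measurement error. This is the property that allows errors to be distinguished from real variants.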

Application of Next Generation Sequencing Technologies

Next generation sequencing technologies have been applied in a variety of areas during the past several years. Important applications are summarized in Table 3.2 and include the following:

1. de novo genome sequencing, whole genome resequencing, or more targeted discovery of mutations or polymorphisms;
2. transcriptome analysis, where shotgun libraries derived from mRNA or small RNAs are deeply sequenced;
3. large-scale analysis of DNA methylation, by deep sequencing of bisulfite-treated DNA;
4. genome-wide mapping of DNA–protein interactions, by deep sequencing of DNA fragments pulled down by chromatin immunoprecipitation (ChIP).

As mentioned previously, the next generation platforms differ in their advantages and limitations with respect to specific applications.

Table 3.2 Applications of next generation sequencing technologies.

Category | Examples of applications | References
Genome | De novo sequencing: the initial generation of large eukaryotic genomes | Velasco et al. (2007); DiGuistini et al. (2009); Li et al. (2010); Huang et al. (2009)
Genome | Whole genome resequencing: comprehensive SNP, indels, copy number, and structural variations in genomes | Bentley (2006); Ossowski et al. (2008); Xia et al. (2009); Denver et al. (2009)
Genome | Targeted resequencing: targeted polymorphism and mutation discovery | Harismendy et al. (2009); Hodges et al. (2007); Porreca et al. (2007)
Transcriptome | Quantification of gene expression and alternative splicing; transcript annotation; discovery of transcribed SNPs or somatic mutations | Jacquier (2009); Sultan et al. (2008); Sugarbaker et al. (2008)
Transcriptome | Small RNA profiling | Axtell et al. (2006); Berezikov et al. (2006); Houwing et al. (2007)
Epigenome | Transcription factor with its direct targets | Johnson et al. (2007); Robertson et al. (2007)
Epigenome | Genomic profiles of histone modifications | Impey et al. (2004); Mikkelsen et al. (2007)
Epigenome | DNA methylation | Cokus et al. (2008); Costello et al. (2009)
Epigenome | Genomic profiles of nucleosome positions | Johnson et al. (2006)

Table 3.3 Statistics of eukaryotic genome sequencing projects as of March 1, 2010.

Organism group | Complete | Assembly | In progress | Sum
Animals | 4 | 137 | 146 | 287
Plants | 3 | 23 | 85 | 111
Fungi | 10 | 120 | 93 | 223
Protists | 6 | 49 | 64 | 119
Total | 23 | 329 | 388 | 740

Genome De Novo Sequencing and Assembly

The initial generation of the primary genetic sequence of a particular organism is called de novo sequencing. A detailed genetic analysis of any organism is possible only after de novo sequencing has been performed (Goldberg et al., 2006; Durfee et al., 2008; Reinhardt et al., 2009). As of March 1, 2010, 740 eukaryotic genome sequencing projects had been submitted to the National Center for Biotechnology Information (NCBI) (Table 3.3). Only 23 of these genomes have been sequenced to completion; most are draft assemblies or works in progress (http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi). Four organisms have been sequenced using next generation sequencing technologies, either independently or in combination with the traditional Sanger method (Velasco et al., 2007; DiGuistini et al., 2009; Huang et al., 2009; Li et al., 2010) (Table 3.4).

Table 3.4 De novo eukaryotic genomes sequencing using next generation technologies.

Organism | Common name | Genome size (Mb) | Method | Platform | Depth | Reference
Ailuropoda melanoleuca | Panda | 246 | WGS | Solexa | 56× | Li et al. (2010)
Grosmannia clavigera | Blue stain fungus | 32.5 | WGS | Sanger, 454, and Solexa | 50× | DiGuistini et al. (2009)
Vitis vinifera | Grape | 500 | WGS | Sanger and 454 | 11× | Velasco et al. (2007)
Cucumis sativus | Cucumber | 367 | WGS&C | Sanger and Solexa | 72.2× | Huang et al. (2009)

WGS, whole genome shotgun; WGS&C, whole genome shotgun combined with clone-based method.

A Sanger/pyrosequencing hybrid approach resolved the complex, heterozygous grape genome. A consensus sequence of the genome and a set of mapped marker loci were generated. This was the first project to use both long Sanger reads and short sequencing-by-synthesis (SBS) reads to assemble the sequence of a large eukaryotic genome (Velasco et al., 2007). A draft sequence of the giant panda genome was successfully generated and assembled using next generation sequencing technology alone, taking advantage of the excellent colinearity of mammalian genomes. The assembled contigs (2.25 Gb) cover approximately 94% of the whole genome, and the remaining gaps (0.05 Gb) seem to contain carnivore-specific repeats and tandem repeats (Li et al., 2010). When taking on large genomes, that is, those over 1 Gb in total length, one should be more cautious in designing a sequencing project, because polyploidy and a large repetitive fraction can each severely hamper the effort. Nevertheless, successful sequencing projects demonstrate the feasibility of using next generation sequencing technologies for accurate, cost-effective, and rapid de novo assembly of large eukaryotic genomes (Imelfort and Edwards, 2009; Turner et al., 2009a).

With its long read lengths and high accuracy, capillary electrophoresis-based sequencing has been the gold standard technology for de novo genome sequencing projects in the past decades. However, the throughput of these systems makes de novo assembly of most organisms a lengthy and costly endeavor. Next generation sequencing technologies hold great promise for reducing the time and cost of sequencing. Compared with the technology of just a few years ago, it is now much easier and cheaper to sequence entire genomes, and a wide variety of species are being studied using these advanced tools.

Whole Genome or Targeted Resequencing

To identify single-nucleotide polymorphisms (SNPs), indels, copy number variations, and structural variations, multiple individuals or strains (a population-based sample of a species) have to be resequenced (Bentley, 2006; Ossowski et al., 2008; Denver et al., 2009; Xia et al., 2009; Pleasance et al., 2010). In humans, such an endeavor has already commenced with the publication of six complete genomes (Table 3.5). The first is that of J. Craig Venter, obtained using traditional Sanger sequencing methods (Levy et al., 2007), and the second is that of James D. Watson, which was sequenced using the Roche 454 technology to 7.4× genome coverage. The reads were aligned to the NCBI reference sequence by using a combination of the BLAT (Kent, 2002) and Smith–Waterman algorithms. The sequence differs from the reference at 3.32 million single-nucleotide positions, of which 2.7 million were previously known (Wheeler et al., 2008). The other four human genome sequences are from a Chinese individual (Wang et al., 2008), an African individual (Pushkarev et al., 2009), and two Korean individuals (Ahn et al., 2009; Kim et al., 2009); all were generated on the Illumina Genome Analyzer platform, at between roughly 28× and 41× read coverage (Table 3.5). For all four genomes, reads covered more than 99% of the NCBI human reference genome, revealing 3 to 4 million SNPs in each.

Target region resequencing refers to sequencing a targeted region of a species’ genome from multiple individuals; it enables scientists to investigate variation in genomic regions or genes of interest at high coverage and lower cost (Harismendy et al., 2009). Two methods of target region resequencing are widely used: PCR-based candidate gene approaches (Dracatos et al., 2009; Goossens et al., 2009; Harismendy and Frazer, 2009; Tewhey et al., 2009) and whole exome approaches (Hodges et al., 2007; Porreca et al., 2007; Choi et al., 2009; Turner et al., 2009b). Ji and colleagues developed a procedure for massively parallel resequencing of multiple human genes by combining a highly multiplexed and target-specific amplification process with a parallel sequencing technology (Dahl et al., 2007). They demonstrated parallel resequencing of 10 cancer genes covering 177 exons, with an average sequence coverage per sample of 93%. Seven cancer cell lines and one normal genomic DNA sample were studied, and multiple mutations and polymorphisms were identified among the 10 genes.

Applying exome sequencing, Bamshad and colleagues discovered the gene for a rare Mendelian disorder of unknown cause, Miller syndrome (Ng et al., 2010). For four affected individuals in three independent kindreds, they captured and sequenced coding regions to a mean coverage of 40×, a depth sufficient to call variants at approximately 97% of each targeted exome. Filtering against public SNP databases and eight HapMap exomes for genes with two previously unknown variants in each of the four individuals enabled the authors to identify a single candidate gene, DHODH, which encodes a key enzyme in the pyrimidine de novo biosynthesis pathway. They demonstrated that exome sequencing of a small number of unrelated affected individuals is a powerful, efficient strategy for identifying the genes underlying rare Mendelian disorders and will likely transform the genetic analysis of monogenic traits.
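The filtering logic behind the exome study described above can be sketched in a few lines. This is a simplified illustration, not the authors’ pipeline: variants already present in a set of known variants are discarded, and only genes carrying at least two remaining (novel) variants in every affected individual are kept. The gene names, variant identifiers, and the function name `candidate_genes` are hypothetical.

```python
from collections import defaultdict


def candidate_genes(variants_by_individual, known_variants, min_novel=2):
    """Return genes with >= min_novel novel variants in every individual.

    variants_by_individual maps an individual ID to a list of
    (gene, variant_id) pairs; known_variants is a set of variant IDs
    (for example, from public SNP databases and control exomes).
    """
    candidates = None
    for variants in variants_by_individual.values():
        novel_per_gene = defaultdict(set)
        for gene, variant in variants:
            if variant not in known_variants:
                novel_per_gene[gene].add(variant)
        genes = {g for g, v in novel_per_gene.items() if len(v) >= min_novel}
        # Keep only genes that pass the filter in all individuals so far.
        candidates = genes if candidates is None else candidates & genes
    return candidates or set()


# Hypothetical toy data for three affected individuals.
known = {"rs1", "rs2"}
calls = {
    "ind1": [("GENE_A", "v1"), ("GENE_A", "v2"), ("GENE_B", "rs1"), ("GENE_B", "v3")],
    "ind2": [("GENE_A", "v4"), ("GENE_A", "v5"), ("GENE_B", "v6"), ("GENE_B", "v7")],
    "ind3": [("GENE_A", "v1"), ("GENE_A", "v8"), ("GENE_C", "rs2")],
}

result = candidate_genes(calls, known)
print(result)  # {'GENE_A'}
```

Requiring two novel variants per gene reflects a recessive model, in which each affected individual is expected to carry two disrupted copies. Intersecting across unrelated individuals is what narrows a long list of private variants to a handful of genes.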

Table 3.5 Sequencing statistics of six individual human genomes.

Individual | Platform | Read length (bases) | No. of reads (millions) | Genome coverage (%) | Read coverage | SNPs (millions) | No. of runs | Estimated costs ($1000) | Reference
J. Craig Venter | Sanger | 800 | 31.9 | N/A | 7.5× | 3.21 | >340,000 | 70,000 | Levy et al. (2007)
James D. Watson | Roche 454 | 250 | 93.2 | 95 | 7.4× | 3.32 | 234 | 1000 | Wheeler et al. (2008)
Yoruba male (NA18507) | Illumina Solexa | 35 | 3681 | 99.9 | 40.6× | 4.0 | 40 | 250 | Pushkarev et al. (2009)
Han Chinese male (YH) | Illumina Solexa | 35 | 2950 | 99.9 | 36× | 3.07 | 35 | 500 | Wang et al. (2008)
Korean male (SJK) | Illumina Solexa | 35, 74 | 1647 | 99.9 | 29.0× | 3.44 | 15 | 250 | Ahn et al. (2009)
Korean male (AK1) | Illumina Solexa | 36, 88, 106 | 1910 | 99.8 | 27.8× | 3.45 | 30 | 200 | Kim et al. (2009)

Transcriptome Sequencing

The transcriptome is the complete set of transcripts in a cell, and their quantity, for a specific developmental stage or physiological condition (Jacquier, 2009). Understanding the transcriptome is essential for interpreting the functional elements of the genome and revealing the molecular constituents of cells and tissues, and also for understanding development and disease. The specific aims of transcriptomics are (1) to catalog all transcripts of a species across cell types, including mRNAs, noncoding RNAs, and small RNAs; (2) to determine the transcriptional structure of genes, in terms of their start sites, 5′ and 3′ ends, splicing patterns, and other posttranscriptional modifications; and (3) to quantify the expression levels of each transcript during development or under different physiological and pathological conditions.

RNA Sequencing

RNA-Seq is a recently developed approach for transcriptome profiling that uses deep sequencing technologies (Wang et al., 2009). Studies using this method have already altered our view of the extent and complexity of eukaryotic transcriptomes (Cloonan et al., 2008; Mortazavi et al., 2008; Sugarbaker et al., 2008; Sultan et al., 2008; Tang et al., 2010). The current gold standard for protein-coding gene annotation is expressed sequence tag (EST) or full-length cDNA sequencing followed by alignment to a reference genome, but it has been estimated that most EST studies using Sanger sequencing detect only about 60% of the transcripts in the cell, missing poorly expressed or long transcripts (Brent, 2008). This information gap can be addressed by using next generation sequencing technologies.

To date, next generation sequencing technologies have been used to generate transcriptomes for many species and tissues (Mortazavi et al., 2008; Nagalakshmi et al., 2008; Sultan et al., 2008). For instance, one study used the 454 technology to generate 391,157 EST reads from the brain transcriptome of the wasp Polistes metricus (Toth et al., 2007). The reads were then aligned to the genome sequence and EST resources of the honeybee, Apis mellifera, to annotate P. metricus transcripts. Interestingly, the study found wasp EST matches to 39% of the honeybee mRNAs and observed a strong correlation between the expression levels of the corresponding transcripts from the two species.

The short reads produced by next generation technologies, particularly Illumina and SOLiD, are arguably well suited to gene expression profiling, which can draw on tens of millions of short reads rather than the tens of thousands of reads obtainable with the Sanger method. RNA-Seq has been used to accurately monitor gene expression during yeast vegetative growth (Nagalakshmi et al., 2008), yeast meiosis (Wilhelm et al., 2008), and mouse embryonic stem (ES) cell differentiation (Cloonan et al., 2008), to track gene expression changes during development, and to provide a “digital measurement” of gene expression differences between tissues.

Before the advent of RNA-Seq, the starts and ends of most transcripts had not been precisely resolved and the extent of splicing heterogeneity remained poorly understood. RNA-Seq, with its high resolution and sensitivity, has revealed many novel transcribed regions and splicing isoforms of known genes, and has mapped 5′ and 3′ boundaries for many genes. Using RNA-Seq, the 5′ and 3′ boundaries of 80% and 85% of all annotated genes, respectively, were mapped in Saccharomyces cerevisiae (Nagalakshmi et al., 2008). Similarly, in Schizosaccharomyces pombe (Wilhelm et al., 2008), many boundaries were defined by RNA-Seq data in combination with tiling array data. In humans, 31,618 known splicing events were confirmed (11% of all known splicing events) and 379 novel splicing events were discovered (Morin et al., 2008a). In mice, extensive alternative splicing was observed for 3462 genes (Mortazavi et al., 2008). In addition, results from RNA-Seq suggest the existence of a large number of novel transcribed regions in every genome surveyed, including those of Arabidopsis thaliana (Lister et al., 2008), mouse (Cloonan et al., 2008; Mortazavi et al., 2008), human (Morin et al., 2008a), S. cerevisiae (Nagalakshmi et al., 2008), and S. pombe (Wilhelm et al., 2008). These novel transcribed regions, combined with many undiscovered novel splicing variants, suggest that there is considerably more transcriptomic complexity than previously appreciated.
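The “digital measurement” of expression mentioned above rests on counting reads per gene. Raw counts depend on gene length and sequencing depth, so they are usually normalized before comparison. The sketch below uses one widely used normalization, reads per kilobase of transcript per million mapped reads (RPKM); the text does not name a specific measure, so this choice is an assumption. The gene names and counts are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical library with 20 million mapped reads.
total_reads = 20_000_000
genes = {
    "geneX": {"count": 500, "length": 2000},
    "geneY": {"count": 500, "length": 500},
}

for name, g in genes.items():
    print(name, rpkm(g["count"], g["length"], total_reads))
```

Both hypothetical genes receive the same number of reads, but the shorter one has the higher normalized value, because the same read count over fewer bases implies more transcript copies.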

Small RNA Analysis

A related application of next generation sequencing technologies to the analysis of transcriptomes is small RNA discovery and profiling. High-throughput sequencing of small RNAs provides great potential for the identification of novel small RNAs as well as for profiling known and novel small RNA genes. To date, small RNA profiling studies involving the 454 technology have been widely reported. These include studies in the moss Physcomitrella patens (Axtell et al., 2006), A. thaliana (Henderson et al., 2006; Lu et al., 2006; Rajagopalan et al., 2006), Triticum aestivum (Yao et al., 2007), the basal eudicot species Eschscholzia californica (Barakat et al., 2007), the lycopod Selaginella moellendorffii (Axtell et al., 2006), the unicellular alga Chlamydomonas reinhardtii (Zhao et al., 2007), Marek disease virus (Burnside et al., 2006), and some primates (Berezikov et al., 2006). Importantly, small RNA sequencing studies with the 454 technology contributed to the discovery of a novel class of small RNAs, termed Piwi-interacting RNAs, which are expressed in mammalian testes and are presumably required for germ cell development in mammals and other species (Girard et al., 2006; Lau et al., 2006; Houwing et al., 2007).

The higher throughput of the Illumina and SOLiD technologies enables the generation of deeper small RNA libraries. Using Illumina sequencing, Morin et al. (2008b) identified 334 known plus 104 novel miRNA genes expressed in human ES cells, while Glazov et al. (2008) detected 449 novel and all known chicken miRNAs in the chicken embryo. In addition, small RNA profiling in locust, Xenopus tropicalis (Armisen et al., 2009), Caenorhabditis elegans embryos (Stoeckius et al., 2009), and Gossypium hirsutum L. (Pang et al., 2009) has been reported.

Characterization of the Epigenome

Epigenetics is the study of heritable gene regulation that does not involve changes to the DNA sequence itself. The next generation sequencing technologies offer the potential to accelerate epigenomic research substantially. To date, these technologies have been applied in several epigenomic areas, including the characterization of DNA methylation patterns, posttranslational modifications of histones, the interaction between transcription factors and their direct targets, and nucleosome positioning on a genome-wide scale. These areas are covered in the following two sections.


Methylome

DNA cytosine methylation is a central epigenetic modification that has essential roles in cellular processes including genome regulation, development, and disease. Single-base resolution analysis of DNA methylation sites can be achieved by sodium bisulfite treatment of genomic DNA, which converts cytosines, but not methylcytosines, to uracil. Subsequent sequencing of PCR-amplified bisulfite-converted DNA allows determination of the methylation state of the cytosines in the sequenced region of the genome, as methylcytosine will be sequenced as cytosine, and unmethylated cytosine as thymine. Taylor et al. (2007) improved the bisulfite DNA sequencing procedure by combining it with the 454 technology. The approach was applied to analyze methylation patterns in 25 gene-related CpG-rich regions from >40 cases of primary cells. The study generated >1600 individual sequences, far beyond the few clones (<20) typically analyzed by traditional bisulfite sequencing. Using the Illumina Genome Analyzer, Cokus et al. (2008) and Lister et al. (2008) generated 2–3 Gb of uniquely aligned bisulfite sequence to comprehensively identify sites of DNA methylation throughout the Arabidopsis genome at single-base resolution, including previously unidentified sites of cytosine methylation and local sequence motifs associated with DNA methylation. Bisulfite DNA sequencing is now widely used for DNA methylation profiling in various organisms (Costello et al., 2009; Smith et al., 2009; Chung et al., 2010).
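The read-out rule of bisulfite sequencing described above (methylcytosine is read as C, unmethylated cytosine as T) can be expressed as a minimal sketch. It assumes a single read already aligned to the forward strand of the reference without gaps; the sequences and the function name `call_methylation` are hypothetical, and real pipelines must additionally handle both strands, incomplete conversion, and sequencing errors.

```python
def call_methylation(reference, bisulfite_read):
    """Classify each reference cytosine from one gap-free aligned read.

    Returns a dict mapping 0-based reference positions of cytosines to
    "methylated" (read shows C), "unmethylated" (read shows T), or
    "ambiguous" (any other base).
    """
    calls = {}
    for pos, (ref, obs) in enumerate(zip(reference, bisulfite_read)):
        if ref != "C":
            continue
        if obs == "C":
            calls[pos] = "methylated"
        elif obs == "T":
            calls[pos] = "unmethylated"
        else:
            calls[pos] = "ambiguous"
    return calls


reference = "ACGTCCGA"
read = "ACGTTCGA"  # the cytosine at position 4 was converted and reads as T

calls = call_methylation(reference, read)
print(calls)  # {1: 'methylated', 4: 'unmethylated', 5: 'methylated'}
```

With many reads covering the same cytosine, the fraction of reads showing C at that position gives an estimate of the methylation level at the site, which is what deep sequencing adds over analyzing a few clones.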

Analysis of DNA–Protein Interactions

The association between DNA and proteins is a fundamental biological interaction that plays a key role in regulating gene expression and controlling the availability of DNA for transcription, replication, and other processes. These interactions can be studied using a technique called chromatin immunoprecipitation followed by sequencing (ChIP–seq). ChIP–seq is a technique for genome-wide profiling of DNA-binding proteins, histone modifications, and nucleosomes. The precedent-setting paper for ChIP–seq was published by Johnson and colleagues, who used C. elegans and the Roche platform to elucidate nucleosome positioning on genomic DNA (Johnson et al., 2006). This study established that sequencing the micrococcal nuclease–derived digestion products of genomic DNA carefully isolated from mixed-stage hermaphrodite populations of C. elegans was sufficient to generate a genome-wide, highly precise positional profile of chromatin. Subsequent studies used a ChIP-based approach and the Illumina platform to provide insights into transcription factor binding sites in the human genome, such as those of neuron-restrictive silencer factor (NRSF) (Johnson et al., 2007) and signal transducer and activator of transcription 1 (STAT1) (Robertson et al., 2007). The first applications of ChIP–seq to profile histone modifications were done in CD4+ T cells (Impey et al., 2004) and mouse ES cells (Mikkelsen et al., 2007). In a landmark study, Mikkelsen et al. (2007) explored the connection between chromatin packaging of DNA and differential gene expression using mouse ES cells and lineage-committed mouse cells (neural progenitor cells and embryonic fibroblasts), providing a next generation sequencing-based framework for using genome-wide chromatin profiling to characterize cell populations.

Owing to the tremendous progress in next generation sequencing technology, ChIP–seq offers higher resolution, less noise, and greater coverage than its array-based predecessor ChIP–chip. With the decreasing cost of sequencing, ChIP–seq has become an indispensable tool for studying gene regulation and epigenetic mechanisms.

Next-Next Generation Sequencing

Next generation DNA sequencing has started a revolution in genomics and created the opportunity for large-scale sequencing projects, but in the future, “next-next” generation sequencing using single molecules might take the genomics community even further. Efforts have been made to develop single-molecule sequencing platforms, in which sequencing by synthesis is performed on an array of single DNA molecules. Working with single molecules increases the number of DNA fragments that can be independently analyzed in a given surface area and therefore achieves a much higher level of throughput. It also means that no costly cluster amplification step is required, further reducing sequencing cost. A few platforms have been developed or are currently under development. Helicos Biosciences was the first company to offer a “next-next” generation DNA sequencing system for single molecules. Its HeliScope uses a fluorescence microscopy technique called total internal reflection microscopy (TIRM) to detect signals: only fluorophores within a very thin layer of reaction volume on the surface of a flow cell are excited by an evanescent wave to produce fluorescence (Harris et al., 2008). VisiGen Biotechnologies, now part of Life Technologies, has also been working on an implementation of single-molecule sequencing by synthesis. In a nutshell, they engineered a protein nanodevice to observe and record DNA synthesis by DNA polymerase in real time. This is achieved through fluorescence resonance energy transfer (FRET) between a fluorescence donor and an acceptor (Blow, 2008). Pacific Biosciences is another company developing a new generation of sequencing technology, the single-molecule real-time (SMRT) technology. Its single-molecule sequencing by synthesis relies on a nanostructure called the zero-mode waveguide (ZMW) for real-time observation of DNA polymerization (Eid et al., 2009).

Perspectives

As we close this chapter, news and publications about next generation sequencers and their applications are appearing almost daily. Some of the new technologies and protocols may well replace the old ones completely, but we hope this chapter gives its readers an overview and a useful starting point for understanding DNA sequencing technology and its applications. A couple of decades down the road, we will be able to sequence not only everyone’s genome but also the genomes of all life on Earth. The challenges will shift toward understanding the vast information generated by the sequencing machines, and bioinformatics will take center stage. In addition, the design of experiments for validating the information, and of hypothesis-driven new sequencing projects, will become more important. What we are doing now is building a solid foundation for scientific practice and developments in the future.

References

Adessi C, Matton G, Ayala G, Turcatti G, Mermod JJ, Mayer P, and Kawashima E. 2000. Solid phase DNA amplification: Characterisation of primer attachment and amplification mechanisms. Nucleic Acids Res, 28:e87.
Ahn SM, Kim TH, Lee S, Kim D, Ghang H, Kim DS, Kim BC, Kim SY, Kim WY, Kim C, et al. 2009. The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group. Genome Res, 19:1622–1629.
Ansorge WJ. 2009. Next-generation DNA sequencing techniques. N Biotechnol, 25:195–203.
Armisen J, Gilchrist MJ, Wilczynska A, Standart N, and Miska EA. 2009. Abundant and dynamically expressed miRNAs, piRNAs, and other small RNAs in the vertebrate Xenopus tropicalis. Genome Res, 19:1766–1775.
Axtell MJ, Jan C, Rajagopalan R, and Bartel DP. 2006. A two-hit trigger for siRNA biogenesis in plants. Cell, 127:565–577.
Barakat A, Wall K, Leebens-Mack J, Wang YJ, Carlson JE, and Depamphilis CW. 2007. Large-scale identification of microRNAs from a basal eudicot (Eschscholzia californica) and conservation in flowering plants. Plant J, 51:991–1003.
Bentley DR. 2006. Whole-genome re-sequencing. Curr Opin Genet Dev, 16:545–552.
Berezikov E, Thuemmler F, van Laake LW, Kondova I, Bontrop R, Cuppen E, and Plasterk RH. 2006. Diversity of microRNAs in human and chimpanzee brain. Nat Genet, 38:1375–1377.
Blow N. 2008. DNA sequencing: Generation next-next. Nat Methods, 5:267–274.
Brent MR. 2008. Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nat Rev Genet, 9:62–73.
Burnside J, Bernberg E, Anderson A, Lu C, Meyers BC, Green PJ, Jain N, Isaacs G, and Morgan RW. 2006. Marek’s disease virus encodes microRNAs that map to meq and the latency-associated transcript. J Virol, 80:8778–8786.
Choi M, Scholl UI, Ji WZ, Liu TW, Tikhonova IR, Zumbo P, Nayir A, Bakkaloglu A, Ozen S, Sanjad S, et al. 2009. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci U S A, 106:19096–19101.
Chung CAB, Boyd VL, McKernan KJ, Fu YT, Monighetti C, Peckham HE, and Barker M. 2010. Whole methylome analysis by ultra-deep sequencing using two-base encoding. PLoS One, 5:e9320.
Cloonan N, Forrest ARR, Kolle G, Gardiner BBA, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, et al. 2008. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods, 5:613–619.
Cokus SJ, Feng SH, Zhang XY, Chen ZG, Merriman B, Haudenschild CD, Pradhan S, Nelson SF, Pellegrini M, and Jacobsen SE. 2008. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature, 452:215–219.
Costello JF, Krzywinski M, and Marra MA. 2009. A first look at entire human methylomes. Nat Biotechnol, 27:1130–1132.

Dahl F, Stenberg J, Fredriksson S, Welch K, Zhang M, Nilsson M, Bicknell D, Bodmer WF, Davis RW, and Ji HL. 2007. Multigene amplification and massively parallel sequencing for cancer mutation discovery. Proc Natl Acad Sci U S A, 104:9387–9392.
Denver DR, Dolan PC, Wilhelm LJ, Sung W, Lucas-Lledo JI, Howe DK, Lewis SC, Okamoto K, Thomas WK, Lynch M, et al. 2009. A genome-wide view of Caenorhabditis elegans base-substitution mutation processes. Proc Natl Acad Sci U S A, 106:16310–16314.
DiGuistini S, Liao NY, Platt D, Robertson G, Seidel M, Chan SK, Docking TR, Birol I, Holt RA, Hirst M, et al. 2009. De novo genome sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data. Genome Biol, 10:R94.
Dracatos PM, Cogan NOI, Sawbridge TI, Gendall AR, Smith KF, Spangenberg GC, and Forster JW. 2009. Molecular characterisation and genetic mapping of candidate genes for qualitative disease resistance in perennial ryegrass (Lolium perenne L.). BMC Plant Biol, 9:62.
Dressman D, Yan H, Traverso G, Kinzler KW, and Vogelstein B. 2003. Transforming single DNA molecules into fluorescent magnetic particles for detection and enumeration of genetic variations. Proc Natl Acad Sci U S A, 100:8817–8822.
Durfee T, Nelson R, Baldwin S, Plunkett G, Burland V, Mau B, Petrosino JF, Qin X, Muzny DM, Ayele M, et al. 2008. The complete genome sequence of Escherichia coli DH10B: Insights into the biology of a laboratory workhorse. J Bacteriol, 190:2597–2606.
Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P, Bettman B, et al. 2009. Real-time DNA sequencing from single polymerase molecules. Science, 323:133–138.
Fedurco M, Romieu A, Williams S, Lawrence I, and Turcatti G. 2006. BTA, a novel reagent for DNA attachment on glass and efficient generation of solid-phase amplified DNA colonies. Nucleic Acids Res, 34:e22.
Girard A, Sachidanandam R, Hannon GJ, and Carmell MA. 2006. A germline-specific class of small RNAs binds mammalian Piwi proteins. Nature, 442:199–202.
Glazov EA, Cottee PA, Barris WC, Moore RJ, Dalrymple BP, and Tizard ML. 2008. A microRNA catalog of the developing chicken embryo identified by a deep sequencing approach. Genome Res, 18:957–964.
Goldberg SMD, Johnson J, Busam D, Feldblyum T, Ferriera S, Friedman R, Halpern A, Khouri H, Kravitz SA, Lauro FM, et al. 2006. A Sanger/pyrosequencing hybrid approach for the generation of high-quality draft assemblies of marine microbial genomes. Proc Natl Acad Sci U S A, 103:11240–11245.
Goossens D, Moens LN, Nelis E, Lenaerts AS, Glassee W, Kalbe A, Frey B, Kopal G, De Jonghe P, De Rijk P, et al. 2009. Simultaneous mutation and copy number variation (CNV) detection by multiplex PCR-based GS-FLX sequencing. Hum Mutat, 30:472–476.
Harismendy O and Frazer KA. 2009. Method for improving sequence coverage uniformity of targeted genomic intervals amplified by LR-PCR using Illumina GA sequencing-by-synthesis technology. Biotechniques, 46:229–231.
Harismendy O, Ng PC, Strausberg RL, Wang XY, Stockwell TB, Beeson KY, Schork NJ, Murray SS, Topol EJ, Levy S, et al. 2009. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol, 10:R32.
Harris TD, Buzby PR, Babcock H, Beer E, Bowers J, Braslavsky I, Causey M, Colonell J, Dimeo J, Efcavitch JW, et al. 2008. Single-molecule DNA sequencing of a viral genome. Science, 320:106–109.
Henderson IR, Zhang X, Lu C, Johnson L, Meyers BC, Green PJ, and Jacobsen SE. 2006. Dissecting Arabidopsis thaliana DICER function in small RNA processing, gene silencing and DNA methylation patterning. Nat Genet, 38:721–725.
Hodges E, Xuan Z, Balija V, Kramer M, Molla MN, Smith SW, Middle CM, Rodesch MJ, Albert TJ, Hannon GJ, et al. 2007. Genome-wide in situ exon capture for selective resequencing. Nat Genet, 39:1522–1527.

Housby JN and Southern EM. 1998. Fidelity of DNA ligation: A novel experimental approach based on the polymerisation of libraries of oligonucleotides. Nucleic Acids Res, 26:4259–4266.
Houwing S, Kamminga LM, Berezikov E, Cronembold D, Girard A, van den Elst H, Filippov DV, Blaser H, Raz E, Moens CB, et al. 2007. A role for Piwi and piRNAs in germ cell maintenance and transposon silencing in zebrafish. Cell, 129:69–82.
Huang S, Li R, Zhang Z, Li L, Gu X, Fan W, Lucas WJ, Wang X, Xie B, Ni P, et al. 2009. The genome of the cucumber, Cucumis sativus L. Nat Genet, 41:1275–1281.
Imelfort M and Edwards D. 2009. De novo sequencing of plant genomes using second-generation technologies. Brief Bioinform, 10:609–618.
Impey S, McCorkle SR, Cha-Molstad H, Dwyer JM, Yochum GS, Boss JM, McWeeney S, Dunn JJ, Mandel G, and Goodman RH. 2004. Defining the CREB regulon: A genome-wide analysis of transcription factor regulatory regions. Cell, 119:1041–1054.
Jacquier A. 2009. The complex eukaryotic transcriptome: Unexpected pervasive transcription and novel small RNAs. Nat Rev Genet, 10:833–844.
Johnson SM, Tan FJ, McCullough HL, Riordan DP, and Fire AZ. 2006. Flexibility and constraint in the nucleosome core landscape of Caenorhabditis elegans chromatin. Genome Res, 16:1505–1516.
Johnson DS, Mortazavi A, Myers RM, and Wold B. 2007. Genome-wide mapping of in vivo protein-DNA interactions. Science, 316:1497–1502.
Kent WJ. 2002. BLAT: The BLAST-like alignment tool. Genome Res, 12(4):656–664.
Kim JI, Ju YS, Park H, Kim S, Lee S, Yi JH, Mudge J, Miller NA, Hong D, Bell CJ, et al. 2009. A highly annotated whole-genome sequence of a Korean individual. Nature, 460:1011–1015.
Lander ES and Waterman MS. 1988. Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics, 2:231–239.
Lau NC, Seto AG, Kim J, Kuramochi-Miyagawa S, Nakano T, Bartel DP, and Kingston RE. 2006. Characterization of the piRNA complex from rat testes. Science, 313:363–367.
Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, et al. 2007. The diploid genome sequence of an individual human. PLoS Biol, 5:e254.
Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q, Cai Q, Li B, Bai Y, et al. 2010. The sequence and de novo assembly of the giant panda genome. Nature, 463:311–317.
Lister R, O’Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, and Ecker JR. 2008. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell, 133:523–536.
Lu C, Kulkarni K, Souret FF, MuthuValliappan R, Tej SS, Poethig RS, Henderson IR, Jacobsen SE, Wang W, Green PJ, et al. 2006. MicroRNAs and other small RNAs enriched in the Arabidopsis RNA-dependent RNA polymerase-2 mutant. Genome Res, 16:1276–1288.
Mardis ER. 2008. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet, 9:387–402.
Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437:376–380.
McKernan K, Blanchard A, Kotler L, and Costa G. 2006. Reagents, methods, and libraries for bead-based sequencing. US patent application 20080003571.
Metzker ML. 2009. Sequencing technologies: The next generation. Nat Rev Genet, 11:31–46.
Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP, et al. 2007. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature, 448:553–560.

Morin R, Bainbridge M, Fejes A, Hirst M, Krzywinski M, Pugh T, McDonald H, Varhol R, Jones S, and Marra M. 2008a. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques, 45:81–94.
Morin RD, O’Connor MD, Griffith M, Kuchenbauer F, Delaney A, Prabhu AL, Zhao Y, McDonald H, Zeng T, Hirst M, et al. 2008b. Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res, 18:610–621.
Mortazavi A, Williams BA, McCue K, Schaeffer L, and Wold B. 2008. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods, 5:621–628.
Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, and Snyder M. 2008. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science, 320:1344–1349.
Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, Huff CD, Shannon PT, Jabs EW, Nickerson DA, et al. 2010. Exome sequencing identifies the cause of a Mendelian disorder. Nat Genet, 42:30–35.
Ossowski S, Schneeberger K, Clark RM, Lanz C, Warthmann N, and Weigel D. 2008. Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res, 18:2024–2033.
Pang M, Woodward AW, Agarwal V, Guan X, Ha M, Ramachandran V, Chen X, Triplett BA, Stelly DM, and Chen ZJ. 2009. Genome-wide analysis reveals rapid and dynamic changes in miRNA and siRNA sequence and expression during ovule and fiber development in allotetraploid cotton (Gossypium hirsutum L.). Genome Biol, 10:R122.
Pleasance ED, Stephens PJ, O’Meara S, McBride DJ, Meynert A, Jones D, Lin ML, Beare D, Lau KW, Greenman C, et al. 2010. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature, 463:184–190.
Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL, LeProust EM, Peck BJ, Emig CJ, Dahl F, et al. 2007. Multiplex amplification of large sets of human exons. Nat Methods, 4:931–936.
Pushkarev D, Neff NF, and Quake SR. 2009. Single-molecule sequencing of an individual human genome. Nat Biotechnol, 27:847–852.
Rajagopalan R, Vaucheret H, Trejo J, and Bartel DP. 2006. A diverse and evolutionarily fluid set of microRNAs in Arabidopsis thaliana. Genes Dev, 20:3407–3425.
Reinhardt JA, Baltrus DA, Nishimura MT, Jeck WR, Jones CD, and Dangl JL. 2009. De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae. Genome Res, 19:294–305.
Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, et al. 2007. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods, 4:651–657.
Rothberg JM and Leamon JH. 2008. The development and impact of 454 sequencing. Nat Biotechnol, 26:1117–1124.
Shendure J and Ji H. 2008. Next-generation DNA sequencing. Nat Biotechnol, 26:1135–1145.
Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, Rosenbaum AM, Wang MD, Zhang K, Mitra RD, and Church GM. 2005. Accurate multiplex polony sequencing of an evolved bacterial genome. Science, 309:1728–1732.
Smith ZD, Gu H, Bock C, Gnirke A, and Meissner A. 2009. High-throughput bisulfite sequencing in mammalian genomes. Methods, 48:226–232.
Stoeckius M, Maaskola J, Colombo T, Rahn HP, Friedlander MR, Li N, Chen W, Piano F, and Rajewsky N. 2009. Large-scale sorting of C. elegans embryos reveals the dynamics of small RNA expression. Nat Methods, 6:745–751.

Sugarbaker DJ, Richards WG, Gordon GJ, Dong L, De Rienzo A, Maulik G, Glickman JN, Chirieac LR, Hartman ML, Taillon BE, et al. 2008. Transcriptome sequencing of malignant pleural mesothelioma tumors. Proc Natl Acad Sci U S A, 105:3521–3526.
Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, et al. 2008. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science, 321:956–960.
Tang F, Barbacioru C, Nordman E, Li B, Xu N, Bashkirov VI, Lao K, and Surani MA. 2010. RNA-Seq analysis to capture the transcriptome landscape of a single cell. Nat Protoc, 5:516–535.
Taylor KH, Kramer RS, Davis JW, Guo J, Duff DJ, Xu D, Caldwell CW, and Shi H. 2007. Ultradeep bisulfite sequencing analysis of DNA methylation patterns in multiple gene promoters by 454 sequencing. Cancer Res, 67:8511–8518.
Tettelin H and Feldblyum T. 2009. Bacterial genome sequencing. Methods Mol Biol, 551:231–247.
Tewhey R, Warner JB, Nakano M, Libby B, Medkova M, David PH, Kotsopoulos SK, Samuels ML, Hutchison JB, Larson JW, et al. 2009. Microdroplet-based PCR enrichment for large-scale targeted sequencing. Nat Biotechnol, 27:1025–1031.
Toth AL, Varala K, Newman TC, Miguez FE, Hutchison SK, Willoughby DA, Simons JF, Egholm M, Hunt JH, Hudson ME, et al. 2007. Wasp gene expression supports an evolutionary link between maternal behavior and eusociality. Science, 318:441–444.
Turner DJ, Keane TM, Sudbery I, and Adams DJ. 2009a. Next-generation sequencing of vertebrate experimental organisms. Mamm Genome, 20:327–338.
Turner EH, Lee C, Ng SB, Nickerson DA, and Shendure J. 2009b. Massively parallel exon capture and library-free resequencing across 16 genomes. Nat Methods, 6:315–316.
Velasco R, Zharkikh A, Troggio M, Cartwright DA, Cestaro A, Pruss D, Pindo M, Fitzgerald LM, Vezzulli S, Reid J, et al. 2007. A high quality draft consensus sequence of the genome of a heterozygous grapevine variety. PLoS One, 2:e1326.
Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Guo Y, et al. 2008. The diploid genome sequence of an Asian individual. Nature, 456:60–65.
Wang Z, Gerstein M, and Snyder M. 2009. RNA-Seq: A revolutionary tool for transcriptomics. Nat Rev Genet, 10:57–63.
Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, et al. 2008. The complete genome of an individual by massively parallel DNA sequencing. Nature, 452:872–876.
Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett CJ, Rogers J, and Bahler J. 2008. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature, 453:1239–1243.
Xia Q, Guo Y, Zhang Z, Li D, Xuan Z, Li Z, Dai F, Li Y, Cheng D, Li R, et al. 2009. Complete resequencing of 40 genomes reveals domestication events and genes in silkworm (Bombyx). Science, 326:433–436.
Yao Y, Guo G, Ni Z, Sunkar R, Du J, Zhu JK, and Sun Q. 2007. Cloning and characterization of microRNAs from wheat (Triticum aestivum L.). Genome Biol, 8:R96.
Zhao T, Li G, Mi S, Li S, Hannon GJ, Wang XJ, and Qi Y. 2007. A complex system of small RNAs in the unicellular green alga Chlamydomonas reinhardtii. Genes Dev, 21:1190–1203.

Chapter 4

Library Construction for Next Generation Sequencing

Huseyin Kucuktas and Zhanjiang (John) Liu

Several major new sequencing platforms have been adopted recently, and they are collectively referred to as the next generation of DNA sequencers. A common feature of the new sequencing procedures is that they eliminate the need to clone DNA fragments into vectors, amplify the cloned DNA in transformed Escherichia coli cells, and purify DNA templates prior to sequencing. Instead, sequence templates are handled in bulk, and massively parallel sequencing generates numerous sequences simultaneously. Nonetheless, samples must be prepared in a form suited to the sequencing platform. In this chapter, we describe principles and methods of sample preparation for the next generation sequencing platforms. Our focus is on planning library construction, whether in-house or through outsourcing, rather than on detailed procedures. For detailed protocols, readers are referred to the documentation for each sequencing platform.

Sample Preparation for Illumina Sequencing

For the Illumina sequencing platform, sample preparation depends on the source of DNA or RNA and the purpose of the sequencing project. For instance, the purpose could be genomic sequencing, genomic sequencing of multiple samples using bar-coded primers, genomic mate pair sequencing, mRNA sequencing, or sequencing for the discovery of small RNAs. The DNA can be in different forms, including genomic DNA or pooled PCR products. For PCR products, a minimum fragment size of 2.5 kb is required; alternatively, restriction fragments such as reduced representation libraries can be used. For RNA samples, first- and second-strand cDNA must be synthesized; from the cDNA onward, the procedure is similar to that for DNA templates. Here we describe the considerations for using genomic DNA as the sample. The general steps for library construction are illustrated in Figure 4.1.

Figure 4.1 The general procedures of library preparation for next generation sequencing.

DNA Fragmentation

The first step of DNA sample preparation is fragmentation of the DNA to generate fragments of various sizes. Three approaches are frequently used: nebulization, sonication, and enzymatic fragmentation.

Nebulization uses a nebulizer that creates a fine mist of DNA by forcing a DNA solution through a small hole in the nebulizer unit. Several factors determine the size of the resulting fragments, including the speed at which the DNA solution passes through the hole, the pressure of the gas blowing through the nebulizer, the viscosity of the solution, and the temperature. The advantage of nebulization is that it is easy, quick, and requires only small amounts of DNA (0.5–5 μg). The disadvantage is that the resulting DNA fragments fall within a narrow range of sizes (700–1300 bp), so small fragments in the range of 200 bp are difficult to obtain.

Sonication shears DNA into small fragments by hydrodynamic force applied with a sonicator. A general sonication protocol can be found in Sambrook and Russell (2001), although shearing conditions should be adjusted for each sample. For generating small fragments, a single high-power pulse of 1–2 s is usually sufficient to produce a smear covering a broad range of fragment sizes. Very small and very large fragments are both difficult to generate by sonication. For instance, fragments smaller than 400 bp can be very difficult to achieve, as can large fragment sizes such as 8 kb or 10 kb.

DNA fragmentation can also be achieved enzymatically, for example with New England Biolabs (NEB) fragmentase. The NEBNext™ dsDNA Fragmentase™ generates dsDNA breaks in a time-dependent manner to yield 100–800-bp DNA fragments depending on the reaction time. It contains two enzymes: one randomly generates nicks on dsDNA, and the other recognizes the nicked site and cuts the opposite DNA strand across from the nick, producing dsDNA breaks. The resulting DNA fragments contain short overhangs, 5′-phosphates, and 3′-hydroxyl groups. According to the NEB Web description (www.neb.com/nebecomm/products/productM0348.asp), a comparison of sequencing results between genomic DNA prepared with the NEBNext dsDNA Fragmentase and with mechanical shearing shows that the fragmentase does not introduce any detectable bias during sequencing library preparation and that sequence coverage does not differ between the two methods. A major advantage of the fragmentase approach is control over fragment size: the quantity of fragmentase and the reaction time can be tested to achieve optimal results. For instance, 15–30-min treatment of HeLa cell genomic DNA with fragmentase generated a good smear in the range of 100 bp to 2 kb. To generate larger DNA fragments, the fragmentase can be diluted, for example 1 : 10 with the storage buffer, and the diluted enzyme tested with different reaction times until the desired fragment sizes are obtained. The ability to generate DNA fragments of various sizes is a major strength of the enzymatic approach.
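The size ranges quoted above for the three fragmentation approaches can be gathered into a small lookup, sketched here in Python. The bounds are taken from the text (the sonication limits are the sizes described as difficult to reach, and the fragmentase upper bound is the HeLa smear); the dictionary layout and function name are hypothetical, and real ranges depend on instrument and sample.

```python
# Approximate size ranges in bp, as quoted in the text. These are rough
# planning figures, not specifications.
METHOD_RANGES_BP = {
    "nebulization": (700, 1300),
    "sonication": (400, 7999),               # <400 bp and >=8 kb are hard to reach
    "enzymatic (fragmentase)": (100, 2000),  # 100-800 bp typical; smear up to 2 kb
}


def candidate_methods(target_bp):
    """Return the methods whose quoted size range contains target_bp."""
    return sorted(
        name
        for name, (low, high) in METHOD_RANGES_BP.items()
        if low <= target_bp <= high
    )


for target in (200, 1000, 10000):
    print(target, candidate_methods(target))
```

An empty result, as for the 10-kb target, simply means that none of the ranges quoted here covers that size; it does not rule out reaching it with diluted fragmentase or other methods.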

DNA End Repair

Fragmented DNA usually needs end polishing before ligation. Polishing is required when (1) the DNA fragments carry overhangs rather than blunt ends; and (2) in some cases, the 5′ end is not phosphorylated, which prevents the fragment from being ligated. An enzymatic step is therefore required to convert the ends of the DNA fragments to blunt ends with phosphate at their 5′ ends. Blunt ends can be achieved through the use of T4 DNA polymerase and E. coli DNA polymerase I Klenow fragment. The 3′ to 5′ exonuclease activity of these enzymes removes 3′ overhangs, and the polymerase activity fills in the 5′ overhangs. The 5′-phosphate can be added by a phosphorylation reaction using polynucleotide kinase. All these functions can be achieved in a single combined reaction (Figure 4.2). For instance, one can use end repair reagents from a commercial source such as NEB or Illumina.

Figure 4.2 Schematic presentation of end repair of fragmented DNA during library preparation for next generation sequencing.

Here is a typical end repair reaction with a total volume of 100 μL:

1. Assemble the following into an Eppendorf tube:
   30 μL fragmented DNA sample
   10 μL phosphorylation buffer
   4 μL dNTP solution mix
   5 μL T4 DNA polymerase
   1 μL DNA polymerase I, Klenow fragment
   5 μL T4 polynucleotide kinase
   45 μL dH2O
2. Mix gently and incubate for 30 min at 20°C.
3. Purify DNA by phenol/chloroform extraction and ethanol precipitation or with a DNA purification column. Resuspend DNA in 32 μL TE buffer.

After this reaction is completed, the DNA fragments have been repaired to harbor blunt ends with phosphate at their 5′ ends.
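When several samples are processed in parallel, the component volumes listed for the end repair reaction can be checked and scaled into a master mix. The Python sketch below uses the volumes from the recipe; the 10% pipetting overage and the function name are assumptions made for illustration, not part of the protocol.

```python
# Component volumes (uL) for one 100-uL end repair reaction, from the recipe.
END_REPAIR_UL = {
    "fragmented DNA sample": 30,
    "phosphorylation buffer": 10,
    "dNTP solution mix": 4,
    "T4 DNA polymerase": 5,
    "DNA polymerase I, Klenow fragment": 1,
    "T4 polynucleotide kinase": 5,
    "dH2O": 45,
}


def master_mix(recipe_ul, n_reactions, template, overage=0.10):
    """Scale every component except the template for n reactions.

    overage is the extra fraction prepared to cover pipetting loss
    (10% here is an assumed, commonly used value).
    """
    factor = n_reactions * (1 + overage)
    return {
        name: round(volume * factor, 1)
        for name, volume in recipe_ul.items()
        if name != template
    }


total = sum(END_REPAIR_UL.values())
mix = master_mix(END_REPAIR_UL, n_reactions=4, template="fragmented DNA sample")
print(total)
print(mix)
```

Each tube would then receive 70 μL of master mix plus 30 μL of its own fragmented DNA, preserving the 100 μL reaction volume.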

Addition of an Extra Base (Adenine) at the 3′ Ends

A single base, adenine (A), needs to be added at the 3′ ends of the end-repaired DNA fragments because the adaptors have a single T base overhang at their 3′ end (Figure 4.2). This can be achieved by using Klenow fragment (3′ to 5′ exo−). Here is a typical reaction of 50 μL:

1. Assemble the following into an Eppendorf tube:
   32 μL blunted, phosphorylated DNA
   5 μL NEBuffer 2 for Klenow exo−
   10 μL dATP solution
   3 μL Klenow fragment (3′ to 5′ exo−)
2. Mix gently and incubate for 30 min at 37°C.
3. Purify DNA by phenol/chloroform extraction and ethanol precipitation or with a DNA purification column. Resuspend DNA in 32 μL TE buffer.

The DNA fragments, now carrying a 3′-A, are ready to be ligated to adaptors.
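The pairing logic behind A-tailing can be sketched in a few lines of Python: a 3′-A overhang on the fragment is complementary to the 3′-T overhang on the adaptor, whereas two A-tailed fragments present non-complementary overhangs to each other. The function names and the toy sequence are hypothetical, and the model ignores everything about ligation except single-base complementarity.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def a_tail(top_strand):
    """Return both strands (5'->3') of a blunt duplex after a 3' A is added to each."""
    bottom_strand = reverse_complement(top_strand)
    return top_strand + "A", bottom_strand + "A"


def overhangs_pair(overhang_1, overhang_2):
    """Single-base 3' overhangs can anneal only if they are complementary."""
    return COMPLEMENT[overhang_1] == overhang_2


top, bottom = a_tail("GATTACAGG")                   # toy blunt-ended fragment
fragment_to_adaptor = overhangs_pair(top[-1], "T")  # adaptor carries a 3' T
fragment_to_fragment = overhangs_pair(top[-1], bottom[-1])
print(top, bottom)
print(fragment_to_adaptor, fragment_to_fragment)
```

In this simplified view the same overhang that enables adaptor ligation also disfavors fragment-to-fragment joining, which is commonly cited as a benefit of the A-tailing step.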

Ligation to Adaptors

The adaptors can be obtained from Illumina or custom made. Custom-made adaptors are required if one is attempting to bar-code the samples in order to sequence multiple samples in a single lane (see the section on “Indexed Libraries for Sequencing Multiple Samples in a Single Lane”). To ensure adaptor ligation, a 10 : 1 molar excess of adaptors over DNA fragments is used. Here is a typical ligation reaction of 50 μL:

1. Mix the following in an Eppendorf tube:
   10 μL DNA sample
   25 μL DNA ligase buffer (2×)
   10 μL adaptor oligo mix
   5 μL DNA ligase
2. Incubate for 15 min at room temperature.
3. Purify the ligation product. An agarose gel is run to separate the DNA fragments ligated to adaptors, and the regions of interest, for example, a 150–200 bp range for short templates and 300–650 bp for long templates, are excised from the gel. The Qiagen Gel Extraction Kit (Qiagen part # 28704) and MinElute Extraction Kit (Qiagen part # 28604) work well for the purification of DNA fragments from the gel.
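The 10 : 1 molar excess is a ratio of molecules, not of mass, so the amount of adaptor depends on the fragment length. The Python sketch below converts a mass of double-stranded DNA to picomoles using an approximate average of 660 g/mol per base pair; that constant, the function names, and the example quantities are assumptions for illustration.

```python
AVG_BP_MASS_G_PER_MOL = 660.0  # approximate average mass of one base pair (assumption)


def dsdna_pmol(mass_ng, length_bp):
    """Convert a mass of double-stranded DNA (ng) to picomoles of molecules."""
    return mass_ng * 1000.0 / (length_bp * AVG_BP_MASS_G_PER_MOL)


def adaptor_pmol_needed(fragment_ng, fragment_bp, molar_excess=10.0):
    """Picomoles of adaptor giving the stated molar excess over fragments."""
    return molar_excess * dsdna_pmol(fragment_ng, fragment_bp)


# Example: 1 ug of fragments averaging 300 bp (hypothetical quantities).
fragments = dsdna_pmol(1000, 300)
adaptors = adaptor_pmol_needed(1000, 300)
print(round(fragments, 2), round(adaptors, 2))
```

For the same mass of DNA, shorter fragments mean more molecules and therefore more adaptor; halving the mean fragment length doubles the adaptor requirement.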

PCR Amplification of the Library

After the DNA fragments of the desired size are purified, they need to be enriched by PCR. This step enriches the DNA fragments that carry adaptors on both ends. The PCR is conducted using primers that anneal to the ends of the adaptors. Usually, a low number of PCR cycles, for example, 18 cycles, is used to avoid skewing the representation of the library. A typical PCR reaction is the following:

1. Mix the following in a PCR tube (50 μL total):
   1 μL purified DNA fragments
   25 μL Phusion DNA polymerase
   1 μL PCR primer #1
   1 μL PCR primer #2
   22 μL water
2. Run 18 cycles of 98°C for 10 s, 65°C for 30 s, and 72°C for 30 s, followed by a final incubation at 72°C for 5 min, and then hold at 4°C.
3. Purify the PCR products using the QIAquick PCR Purification Kit, and elute in 50 μL elution buffer.

The library is now ready to be sequenced. However, it is recommended to check the quality and quantity of the library by

• measuring its absorbance at 260 nm;
• checking the 260/280 ratio;
• running a gel to confirm that the isolated fraction matches the size range of the original gel slice; and
• cloning a fraction into a sequencing vector, and then sequencing by Sanger sequencing.
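The absorbance checks listed above translate into numbers with a few standard conversions, sketched here in Python. The sketch assumes a 1-cm path length, the usual approximation that one A260 unit corresponds to about 50 ng/μL of double-stranded DNA, an acceptance window of 1.7–2.0 for the 260/280 ratio, and 660 g/mol per base pair; the thresholds and function names are assumptions, not values from the protocol.

```python
def dsdna_conc_ng_per_ul(a260, dilution_factor=1.0):
    """Estimate dsDNA concentration, assuming A260 of 1.0 ~ 50 ng/uL at 1 cm."""
    return a260 * 50.0 * dilution_factor


def purity_ok(a260, a280, low=1.7, high=2.0):
    """Check the 260/280 ratio against an assumed acceptance window."""
    return low <= a260 / a280 <= high


def library_nanomolar(conc_ng_per_ul, mean_fragment_bp):
    """Convert ng/uL to nM, assuming ~660 g/mol per base pair."""
    return conc_ng_per_ul * 1e6 / (660.0 * mean_fragment_bp)


# Hypothetical readings for an undiluted library averaging 250 bp.
conc = dsdna_conc_ng_per_ul(a260=0.2)
print(round(conc, 2), purity_ok(0.2, 0.11), round(library_nanomolar(conc, 250), 1))
```

Absorbance measures all nucleic acid in the tube, including leftover primers and adaptor dimers, which is one reason the gel and Sanger checks remain useful alongside it.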

Indexed Libraries for Sequencing Multiple Samples in a Single Lane Sequencing multiple biological samples in a single lane is of great interest to biologists, and particularly so for aquaculture researchers, as this can significantly reduce the sequencing costs, allowing biological questions to be answered while staying within the budget. In order to sequence multiple biological samples in a single lane, obviously, sample pooling is required. However, before the samples are pooled, different sequence tags can be ligated to different samples to allow bioinformatic tracing of the sources of the samples during sequence analysis. For instance, if one is considering analysis of genetic differences of strains using SNPs, DNA from each strain can be bar-coded with sequence tags in the adaptors before pooling. Obviously, such sequences for bar coding purposes need to be positioned downstream of the sequencing primer so they will be sequenced during sequencing. Indexes (adapter “bar codes”) can be added to genomic DNA or PCR products for sequencing multiple samples (multiplexing up to 12 samples for Illumina libraries) in a single sequencing deliverable. Customers can choose to prepare their own libraries using Illumina’s sample prep kits, or sample preparation can be completed at Illumina or other sources. For instance, Hudson/Alpha and Lucigen both offer indexed library services at a cost of $300–$400 per tagged library. With the Illumina’s Multiplexing Sample Preparation Oligonucleotide Kit, 12 unique oligonucleotides are included to “tag” libraries for pooling in a single lane of a flow cell, or up to 96 samples can be sequenced on a single flow cell using the Genome Analyzer. Selection of proper sequences for bar coding is important because any sequencing errors may tamper the ability to differentiate the sequences if the tags are similar. As a rule of thumb, the more the sequence differences, the better the bar coding tags are. 
Table 4.1 provides a guide for selecting the bar-coding adaptors. Each tagging oligo is six bases in length, which allows accurate differentiation between tags and tolerates any single-base error that may inadvertently be introduced during PCR (www.illumina.com/products/multiplexing_sample_preparation_oligonucleotide_kit.ilmn). Indexed libraries have various applications, particularly where research costs are a consideration. For instance, one can obtain strain-specific SNPs by sequencing DNA from 12 strains in a single lane, with the samples from each strain tagged with an index adaptor. Indexed libraries are also extremely useful for the analysis of tissue expression profiles. RNA from each tissue can be converted to cDNA, tagged with indexed adaptors, and pooled for DNA sequencing. In a similar fashion, RNA samples from various treatments can be tagged and then pooled for sequencing in a single lane. Unless very deep sequencing is required, the construction of indexed libraries costs much less than a lane of sequencing. Therefore, pooling samples tagged with indexed adaptors is an effective way to reduce costs.
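The rule of thumb above can be checked numerically. The following is a minimal Python sketch, not part of the Illumina workflow, that computes the number of differing bases (the Hamming distance) between index tags and assigns an observed index read to a tag while tolerating one mismatch. The function names and the one-mismatch threshold are illustrative assumptions. In general, a tag set whose minimum pairwise distance is d can correct up to (d − 1) // 2 substitution errors; the smallest off-diagonal entry in Table 4.1 is 3, which is consistent with tolerating a single-base error.

```python
from itertools import combinations

# The 12 six-base index tags listed in Table 4.1.
TAGS = [
    "ACTGTG", "AGCCAT", "ATCTCG", "CAGTGT", "CGAATG", "CTATGC",
    "GCTAGT", "GTACTG", "GTTGCA", "TCGCAA", "TGTAGC", "TTGCTC",
]


def hamming(a, b):
    """Number of positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("tags must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(tags):
    """Smallest Hamming distance between any two tags in the set."""
    return min(hamming(a, b) for a, b in combinations(tags, 2))


def assign_tag(observed, tags, max_mismatches=1):
    """Return the single tag within max_mismatches of the observed index read.

    Returns None when no tag, or more than one tag, qualifies
    (hypothetical policy chosen for this sketch).
    """
    hits = [t for t in tags if hamming(observed, t) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    print("minimum pairwise distance:", min_pairwise_distance(TAGS))
    # "ACTGTC" carries one substitution relative to the tag ACTGTG.
    print("assigned tag:", assign_tag("ACTGTC", TAGS))
```

Reads whose index cannot be assigned unambiguously are simply left unassigned in this sketch; discarding them is usually cheaper than mis-assigning a read to the wrong strain or tissue.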

Table 4.1 A guide for selecting the bar coding adaptors.

 1  ACTGTG  0
 2  AGCCAT  5      0
 3  ATCTCG  4      4      0
 4  CAGTGT  6      6      5      0
 5  CGAATG  4      5      5      5      0
 6  CTATGC  6      6      4      3      4      0
 7  GCTAGT  4      5      6      4      5      5      0
 8  GTACTG  4      5      4      6      3      4      5      0
 9  GTTGCA  4      6      4      6      6      5      4      4      0
10  TCGCAA  5      4      6      5      6      6      5      5      5      0
11  TGTAGC  5      5      6      5      4      4      3      6      5      5      0
12  TTGCTC  5      5      5      5      5      4      6      3      5      3      4      0
            ACTGTG AGCCAT ATCTCG CAGTGT CGAATG CTATGC GCTAGT GTACTG GTTGCA TCGCAA TGTAGC TTGCTC

The more different the adaptor sequences, the less likely a mistake will be made in tracing the bar code. The 12 adaptor sequences shown in the first column are cross-compared with the same sequences listed in the last row of the table. Each number in the table is the number of bases that differ between the two sequences. The table was adapted from Dr. Marta Matvienko of the University of California–Davis with permission.

Sample Preparation for 454 Sequencing

The Genome Sequencer GS FLX™ System, commonly known as the 454 sequencer, provides long sequence reads (see Chapter 3), currently with an average read length

of 400–500 bp, and its read length is improving rapidly. Although not yet commercially available, Roche has recently achieved average read lengths of 700–850 bp. A single run can be completed in approximately 10 h and generates over a million reads, thereby providing quite high throughput. Compared with Illumina sequencing, the read length is much longer, but the throughput is lower. Sample preparation for 454 sequencing is similar to that for Illumina sequencing. All the protocols for 454 library construction can be found at http://454.com/my454/documentation/gs-flx-system/manuals.asp. The first step is to generate DNA fragments from genomic DNA to prepare a universal library. Library preparation can be performed by one lab technician in an afternoon without special equipment. A single library preparation can supply enough material for numerous sequencing runs of the Genome Sequencer GS FLX™ System. Like Illumina sequencing, the 454 sequencers support the sequencing of samples from a wide range of sources, including genomic DNA, PCR products, BACs, and cDNA. For genomic DNA and BACs, the first required step is fragmentation into 300–800-bp fragments. Depending on the application, libraries can be made as shotgun libraries, paired-end read libraries, amplicon libraries, or cDNA libraries. Protocols for shotgun and paired-end libraries have been developed by Margulies et al. (2005) and Ng et al. (2006), respectively. Shotgun libraries and sequencing are relatively straightforward, but the sequences so generated do not provide any physical linkage information. Paired-end libraries are much preferred in a genome sequencing project or in any project that requires a level of sequence assembly. Because two reads are generated from each DNA segment, one from each end, these reads are “physically linked” at a spacing determined by the segment size, which gives the paired reads scaffolding capability.
A detailed general library construction protocol is available from Roche 454 at http://454.com/downloads/my454/documentation/gs-flx/method-manuals/GS-FLX-Titanium-General-Library-Preparation-Method-Manual-%28Apr2009%29.pdf. This protocol is suitable for a general library, that is, any library other than a paired-end or an amplicon library. A protocol for making amplicon libraries from PCR products is available from Roche at http://454.com/downloads/my454/documentation/gs-flx/method-manuals/GS-FLX-Titanium-Amplicon-Library-Prep-Method-Manual.pdf. Two protocols are currently available for the construction of paired-end libraries, depending on the fragment size. For 3-kb fragment libraries, the protocol is available at http://454.com/downloads/my454/documentation/gs-flx/method-manuals/GS-FLX-Titanium-Paired-End-Library-Prep-3kbSpan-Method-Manual.pdf. For 8–20-kb fragment libraries, the protocol is available at http://454.com/downloads/my454/documentation/gs-flx/method-manuals/GS-FLX-Titanium-Paired-End-Library-Prep-20-8kbSpan-Method-Manual.pdf. Construction of paired-end read libraries involves additional steps compared with general libraries. The relatively large (3 kb, or 8–20 kb) fragments are ligated to adaptors that mark sequence orientation and are then circularized. The circularized molecules, with adaptors at the ligation junctions, are fragmented by nebulization. Afterward, the fragmented DNA needs to be end-polished and ligated to library adaptors. The remaining steps are similar to the procedures for general libraries.


Figure 4.3 Schematic presentation of paired-end read library preparation. See color insert.

It is important to note that only a fraction of the paired reads are true paired reads. True paired reads are identified by the presence of the adaptor sequence at the circularization junction; the orientation of the sequences is resolved by the adaptor sequences, and the distance between them is estimated from the size of the library (Figure 4.3). Roche 454 sequencing is based on pyrosequencing. Long homopolymers, such as the poly(A) tract carried over from oligo(dT) priming, therefore release a very large amount of light that may ruin the recording of the sequencing reaction on the sequencer. For this reason, cDNA libraries are made differently for 454 sequencing: instead of oligo(dT) priming, random priming is used. The detailed protocol for making cDNA libraries for 454 sequencing is available at http://454.com/downloads/my454/documentation/gs-flx/method-manuals/GS-FLX-Titanium-cDNA-Rapid-Library-Preparation-Method-Manual-%28Jan2010%29.pdf. The major difference lies in the first steps, which involve fragmentation of the RNA followed by synthesis of first-strand cDNA using random primers. From there, second-strand cDNA synthesis and the subsequent steps are similar to the construction of general libraries.

Sample Preparation for Sequencing Multiplexed Samples Using SOLiD Sequencing

One of the advantages of SOLiD sequencing is its high level of multiplexing, up to 96 samples in one sequencing reaction. Each sample can be bar-coded by a bar code sequence linked to one of the adaptors. The protocol for SOLiD sequencing provided by Applied Biosystems is very detailed, so we do not attempt to provide a detailed protocol here, but rather concentrate on various considerations for aquaculture applications. We will use the SOLiD™ Fragment Library Barcoding Kit Module 1–16 (PN 4444836) to illustrate the concept and the procedures. Approximately 5 µg of DNA is sheared into 100–150-bp fragments; these fragments are then ligated to the adaptors, one of which carries a unique sequence identifier, or bar code. The bar code tag enables multiplexed sequencing of multiple samples in a single sequencing reaction. The DNA fragments are ligated with a truncated Multiplex P1 Adaptor and a Multiplex P2 Adaptor that carries the bar code. The Multiplex P2 Adaptor consists of three segments: (1) an internal adaptor sequence (derived from the sequence used for mate-paired libraries); (2) a bar code decamer sequence; and (3) a standard P2 adaptor sequence. Because the Multiplex P2 Adaptor is longer than the standard P2 adaptor, the Multiplex P1 Adaptor is truncated to keep the total length of the adaptors approximately the same as for a non-bar-coded library. The steps for making an ABI SOLiD library are similar to those described above for the Illumina and 454 sequencing platforms. The procedure involves DNA fragmentation, end repair, and adaptor ligation (Figure 4.3). The key to multiplexed sample sequencing is the use of bar codes, one for each sample, contained in the P2 adaptors. After adaptor ligation, the libraries are amplified using the adaptor sequences as primers. Each library is quantified using qPCR, and equimolar amounts of each library are then pooled for multiplex sequencing. Detailed protocols for the library construction can be found at ABI’s Web site: http://www3.appliedbiosystems.com/cms/groups/mcb_support/documents/generaldocuments/cms_059675.pdf.
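The equimolar pooling step amounts to simple arithmetic once each library has been quantified. The sketch below is an illustration only, not part of the Applied Biosystems protocol: it assumes that qPCR concentrations are expressed in nM (1 nM = 1 fmol/µL), and the library names, concentrations, and target amount are hypothetical.

```python
def equimolar_volumes(conc_nM, fmol_per_library):
    """Volume (µL) of each library that contains fmol_per_library femtomoles.

    conc_nM maps a library name to its concentration in nM.
    Because 1 nM equals 1 fmol/µL, volume = amount / concentration.
    """
    volumes = {}
    for name, conc in conc_nM.items():
        if conc <= 0:
            raise ValueError(f"library {name!r} has no measurable concentration")
        volumes[name] = fmol_per_library / conc
    return volumes


def pooled_concentration(volumes, fmol_per_library):
    """Total molar concentration (nM) of the pool made from those volumes."""
    total_fmol = fmol_per_library * len(volumes)
    return total_fmol / sum(volumes.values())


if __name__ == "__main__":
    # Hypothetical qPCR results for three bar-coded libraries (nM).
    libraries = {"strain_A": 10.0, "strain_B": 4.0, "strain_C": 2.5}
    vols = equimolar_volumes(libraries, fmol_per_library=20.0)
    for name, vol in vols.items():
        print(f"{name}: {vol:.2f} uL")
    print(f"pool: {pooled_concentration(vols, 20.0):.2f} nM")
```

The most dilute library determines the largest pipetting volume, so in practice the target amount per library is chosen to keep that volume within what is available.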

Acknowledgments

Research in my laboratory is supported by grants from the United States Department of Agriculture, Agriculture and Food Research Initiative (USDA AFRI) Animal Genome and Genetic Mechanisms Program, the USDA National Research Initiative (NRI) Basic Genome Reagents and Tools Program, the Mississippi–Alabama Sea Grant Consortium, the Alabama Department of Conservation, the United States Agency for International Development, the National Science Foundation, and the Binational Agricultural Research and Development Fund. The authors would like to thank Dr. Marta Matvienko for allowing us to use the adapter sequences shown in Table 4.1.

References

Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, and Rothberg JM. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437(7057):376–380.
Ng P, Tan JJ, Ooi HS, Lee YL, Chiu KP, Fullwood MJ, Srinivasan KG, Perbost C, Du L, Sung WK, Wei CL, and Ruan Y. 2006. Multiplex sequencing of paired-end ditags (MS-PET): A strategy for the ultra-high-throughput analysis of transcriptomes and genomes. Nucleic Acids Research, 34(12):e84.
Sambrook J and Russell D. 2001. Molecular Cloning: A Laboratory Manual, 3rd edn. Cold Spring Harbor Press, Cold Spring Harbor, NY.

Chapter 5

SNP Discovery through De Novo Deep Sequencing Using the Next Generation of DNA Sequencers

Geoffrey C. Waldbieser

Introduction

Whole genome selection in aquatic species is generally aimed toward the identification of DNA sequence variation throughout the genome that predicts phenotype. Ideally, breeders would compare the whole genome sequences of all potential broodstock to identify the individuals with the greatest breeding potential. Until whole genome sequence comparison becomes cost-efficient, breeders can utilize a collection of polymorphic loci that serve as proxy markers for genomic variation. In recent years, single-nucleotide polymorphisms (SNPs) have become widely used in many vertebrate species due to their density in the genome and their amenability to highly parallel genotyping assays. Large numbers of SNP loci are generally not available for most aquacultured species. Discovery of SNP loci using second generation DNA sequencing technologies is cost-efficient compared with traditional Sanger sequencing. As reviewed in Chapter 3 of this volume, the new DNA sequencing technologies provide massive amounts of sequencing data compared with traditional Sanger sequencing platforms. The new platforms differ from each other in the length and number of sequences produced per run (see Chapter 3). Thus, for a given cost, there is generally a tradeoff between the length of individual sequences and the depth of coverage of the DNA sample. This chapter provides guidance on the concepts of high-throughput SNP discovery, along with a practical example, to help investigators choose an optimal experimental design for their goals.

SNP Discovery Overview

The goal of a high-throughput SNP discovery experiment is the identification of loci that are truly polymorphic in relevant individuals or populations and that also contain sufficient DNA sequence flanking the polymorphism to support subsequent design of unique DNA primers for high-throughput genotyping platforms. In general, individual sequences from a genomic DNA library are mapped (aligned) to a reference sequence, and computational algorithms identify mismatched bases at any particular location (Altshuler et al., 2000; Van Tassell et al., 2008; Wiedmann et al., 2008; Kerstens et al., 2009; Ramos et al., 2009; Sanchez et al., 2009). Longer flanking sequence improves the specificity of alignment of each sequence. The level of polymorphism of putative SNP loci within each alignment is then validated in relevant populations.
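The idea of identifying mismatched bases at each reference position can be illustrated with a deliberately naive Python sketch. It is not the algorithm of any of the cited tools: it assumes ungapped alignments supplied as (start, sequence) pairs, and the thresholds for depth, supporting reads, and flanking sequence are hypothetical values chosen only for illustration.

```python
from collections import Counter


def call_snps(reference, aligned_reads, min_depth=4, min_alt_reads=2, min_flank=5):
    """Report candidate SNPs from ungapped read alignments.

    aligned_reads: iterable of (start, sequence), start being the 0-based
    reference position of the first base of the read.
    Returns a list of (position, reference_base, alternate_base, frequency).
    """
    # One allele counter per reference position (the "pileup").
    piles = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                piles[pos][base] += 1

    snps = []
    for pos, counts in enumerate(piles):
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        # Non-reference alleles seen in enough reads to be unlikely errors.
        alts = [(b, n) for b, n in counts.items()
                if b != reference[pos] and n >= min_alt_reads]
        if not alts:
            continue
        # Require flanking reference sequence on both sides for primer design.
        if pos < min_flank or len(reference) - pos - 1 < min_flank:
            continue
        alt_base, alt_count = max(alts, key=lambda item: item[1])
        snps.append((pos, reference[pos], alt_base, alt_count / depth))
    return snps


if __name__ == "__main__":
    ref = "ACGTACGTACGTACGTACGT"
    reads = [
        (5, "CGTACGTACG"), (5, "CGTACGTACG"),   # match the reference
        (5, "CGTACATACG"), (5, "CGTACATACG"),   # G>A at position 10
        (5, "CGCACGTACG"),                      # single-read mismatch at position 7
    ]
    for snp in call_snps(ref, reads):
        print(snp)
```

Requiring the alternate allele in at least two reads is what separates a recurring variant from an isolated sequencing error in this sketch; real pipelines also weigh base and mapping quality.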

Reference Sequence

The reference sequence can be a whole genome sequence assembly, or an existing normalized sequence collection such as bacterial artificial chromosome (BAC)-end or expressed sequence tag (EST) sequences. Alternatively, individual reads from the SNP discovery sequencing project can be assembled into contigs, and the contig consensus sequences can serve as a pseudoreference. Longer reference sequences will increase the probability of detecting a SNP with sufficient flanking DNA sequence for design of genotyping primers. Additionally, greater sequence depth will improve the identification of SNP alleles, aid the identification of rare alleles, and provide a measurable estimate of allele frequency in the reference population.

Source of Genomic DNA for SNP Discovery

The ultimate goal of the aquaculture researcher will determine the best source of genomic DNA for sequence production (Table 5.1). All second generation sequencing platforms are useful for high-throughput SNP discovery, although certain types of sequencing reads are more efficiently used.

Table 5.1 Experimental design of high-throughput SNP discovery.

SNP inference           DNA library preferred                   Type of reads preferred
                                                                Reference available                     No reference
Species or population   Pooled reduced representation library,  Short reads for greater depth of        Long reads (pseudoreference)
                        pooled sheared DNA library              coverage in population
High-value individual   Restriction digest library,             Short reads for greater breadth of      Long reads (pseudoreference)
                        random shear library                    coverage
Subgenomic region       Pooled PCR amplicons                    Long or short reads, depends on the     (Assumes available reference)
                                                                desired depth of coverage

If one desired to identify polymorphism throughout the genome of a high-value individual, then one would choose a wide but shallow, perhaps 1–3×, coverage of the genome for comparison against a reference genome. Thus, the source DNA library could be fragments produced by restriction digestion or random shearing.

For efficient SNP discovery within a population, a source DNA library would consist of DNA samples pooled from a sample of individuals representative of the population. Restriction enzyme digestion of a pooled DNA sample produces DNA fragments of identical size from all individuals. The restriction fragments are separated by electrophoresis, and specific size fractions are then isolated and sequenced to provide a reduced representation library (RRL; Altshuler et al., 2000). Second generation DNA sequencing technology can provide great depth of sequence coverage, so that most or all genomic fragments in the RRL will be sampled from all contributing individuals. Sequences from identical regions of multiple individuals are then coaligned to permit population-wide SNP detection within this subfraction of the genome.

Researchers may wish to identify SNPs from multiple individuals in only targeted regions of the genome (Lorenz et al., 2010). In this case, it may be more efficient to sequence a set of DNA fragments that are polymerase chain reaction (PCR)-amplified from multiple individuals and pooled for high-throughput sequencing. This assumes that a reference is available for PCR primer design. Again, the goal is to obtain sequencing reads of sufficient depth of sequence coverage in the population, or breadth of coverage in an individual, that can be computationally aligned to the reference.
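The reduced representation principle, digesting and then keeping only one size fraction, can be sketched in silico. The Python example below is an illustration under stated assumptions rather than a laboratory protocol: the recognition site GGCC with a blunt cut after the second base (the pattern of HaeIII) is chosen only as an example because the chapter does not name an enzyme, and the toy sequence and size window are hypothetical.

```python
import re


def digest(sequence, site, cut_offset):
    """Cut a sequence at every occurrence of a recognition site.

    cut_offset is the number of bases into the site at which the enzyme
    cuts (2 for a blunt GG^CC cut). Overlapping sites are found by using
    a zero-width lookahead.
    """
    pattern = f"(?={re.escape(site)})"
    cuts = [m.start() + cut_offset for m in re.finditer(pattern, sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def size_select(fragments, min_len, max_len):
    """Keep fragments whose length falls within the excised size fraction."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


if __name__ == "__main__":
    genome = "AAGGCCTTTTGGCCAA"  # toy sequence, for illustration only
    fragments = digest(genome, "GGCC", cut_offset=2)
    print(fragments)
    print(size_select(fragments, 5, 10))
```

Because every individual in the pool carries the same recognition sites, apart from polymorphisms within the sites themselves, the same size window recovers the same loci from each genome, which is what allows reads from different individuals to be coaligned.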

Algorithms for DNA Sequence Alignment

The design of sequence alignment algorithms is a dynamic area of research because both sequencing and computational platforms are continually evolving. While not an inclusive list, programs such as SSAHA2, MAQ, Bowtie, BWA, and SOAP2 (Ning et al., 2001; Li et al., 2008, 2009; Langmead et al., 2009; Li and Durbin, 2009) map individual reads to a reference sequence. The limiting factor in a SNP discovery project may be the computational resources required to handle ever-increasing amounts of sequence data. However, projects aimed to produce a modest number of SNP markers can utilize a fraction of a full high-throughput sequencing run.

High-throughput Approaches to SNP Discovery

Aquaculture encompasses a large number of species, and few will have a reference genome sequence in the near future. However, BAC-end sequences, EST sequences, or the reference genome sequence of a closely related species can be used as the reference. Short (<150 bp), long (>400 bp), or both types of sequences could be mapped to the reference. While the choice of sequencing platforms may depend on availability or cost, the short reads can generally be produced in higher numbers to provide greater depth of coverage given limited financial resources.


If a reference sequence is not available, one may produce a pseudoreference through assembly of randomly sheared genomic DNA from an individual, where 2–3× genome coverage with longer reads (>400 bp) could be sufficient to represent >80% of the genome and provide sufficient sequence flanking the SNP locus. Long or short reads from pooled RRLs could then be mapped to provide depth of coverage for SNP discovery. Alternatively, a pseudoreference can be produced by assembly of the sequencing reads obtained from the pooled RRL. Contig sequences produced from longer reads would maximize the probability of obtaining sequence that flanks the SNP, while supplementing them with short reads from the same library would add depth of coverage as described above.
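The statement that 2–3× coverage can represent more than 80% of a genome is consistent with a simple idealized model. The sketch below assumes that reads are placed uniformly at random along the genome (a Lander–Waterman-type idealization that ignores repeats and cloning or amplification bias), in which case the expected fraction of bases covered at least once at coverage c is 1 − e^(−c). The genome size and read numbers are hypothetical.

```python
import math


def coverage_from_reads(n_reads, read_length, genome_size):
    """Average fold coverage c = (number of reads x read length) / genome size."""
    return n_reads * read_length / genome_size


def expected_fraction_covered(coverage):
    """Expected fraction of bases sampled at least once, assuming reads
    start uniformly at random along the genome (idealized Poisson model)."""
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    # Hypothetical example: 5 million 400-bp reads on a 1-Gb genome.
    c = coverage_from_reads(5_000_000, 400, 1_000_000_000)
    print(f"coverage: {c:.1f}x")
    for fold in (1, 2, 3):
        print(f"{fold}x -> {expected_fraction_covered(fold):.3f} of genome covered")
```

Under this model, 2× coverage gives roughly 86% and 3× roughly 95% of bases sampled at least once; real libraries usually fall somewhat short of these values because coverage is never perfectly uniform.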

An Example of SNP Discovery via Pyrosequencing of RRLs

The example below describes the production of pyrosequence data from a pooled RRL; the assembled sequence contigs serve as a pseudoreference against which the individual reads are mapped for SNP discovery. The approach and criteria presented are meant to illustrate the concepts and are provided as a starting point for consideration.

Production of Genomic DNA Libraries

Test digestions of genomic DNA with restriction enzymes will demonstrate the size distribution of fragments on an agarose gel. Usually, enzymes with a 4-bp recognition sequence will maximize the distribution of fragments from 200 to 800 bp, which is the range of template sizes currently preferred in second generation sequencing platforms. The fragments may be end-repaired if the sequencing protocol requires blunt ends. Repetitive DNA fractions, which appear as more intensely stained DNA bands, should be avoided. The goal of the experiment is deep sequencing of a defined size fraction to maximize the contribution of all genome donors. Equimolar amounts of genomic DNA from multiple individuals are pooled and digested overnight using 5–10 units of restriction enzyme per microgram of DNA. The restriction fragments are separated by agarose or polyacrylamide gel electrophoresis (Figure 5.1). If resources are limited to a few sequencing runs, the probability of sampling all contributing genomes increases as the size fraction becomes smaller. Therefore, it is useful to include size standards for resolving smaller size fractions on the gel. The example in Figure 5.1 shows size standards spaced every 10 bp from 200 to 260 bp, and every 20 bp from 400 to 480 bp, plus additional standards at 330, 520, and 600 bp for orientation. The size standards can be purified PCR amplicons of defined size from a standard template such as the plasmid pUC19. Digestion of ≥1 mg of total genomic DNA will ensure that sufficient quantities of DNA are isolated from the smaller size fractions for pyrosequencing. If less starting DNA is to be used, one could select a larger size fraction. The restriction fragments can be purified from the gels by elution, electroelution, or silica-based affinity methods. The purified DNA fragments are processed for high-throughput sequencing according to the manufacturer’s protocols.

Figure 5.1 Diagram of an electrophoretic gel used to isolate reduced representation libraries. Genomic DNA from multiple individuals (labeled 1, 2, 3, …, n; green tubes at top) is pooled and digested with a restriction enzyme. The DNA fragments are separated by electrophoresis (shaded green box) alongside size standards (green lines; the labeled sizes of 600, 520, 480, 400, 330, 260, 240, 220, and 200 bp are denoted at left). White boxes represent size fractions that are excised for deep sequencing. See color insert.

The sequence data are provided as a text file in FASTA format. The processed sequencing reads do not absolutely require further trimming of low-quality bases prior to mapping to the pseudoreference. The accuracy of a lower-quality base call is improved by the depth of sequence coverage, and the probability is low that the same nucleotide in a particular contig will be miscalled in more than one read. When producing and analyzing massive amounts of DNA sequence data, one may have to trade some base-calling accuracy against the available computational resources and time.
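The claim that the same nucleotide is unlikely to be miscalled in more than one read can be made concrete with a small binomial calculation. The sketch below rests on simplifying assumptions that are not from the chapter: base-calling errors are independent between reads, and an erroneous call is equally likely to be any of the three wrong bases. The error rates and depths shown are hypothetical.

```python
from math import comb


def prob_same_miscall(depth, error_rate, min_reads=2):
    """Probability that at least min_reads of depth reads show the same
    specific wrong base at one position.

    Assumes independent errors and that a miscall is equally likely to be
    any of the three incorrect bases, so the per-read probability of one
    particular wrong base is error_rate / 3.
    """
    p = error_rate / 3.0
    return sum(
        comb(depth, k) * p**k * (1 - p) ** (depth - k)
        for k in range(min_reads, depth + 1)
    )


if __name__ == "__main__":
    for depth in (5, 10, 20):
        prob = prob_same_miscall(depth, error_rate=0.01)
        print(f"depth {depth:2d}: P(same wrong base in >= 2 reads) = {prob:.2e}")
```

The probability stays small at moderate depth but grows as depth increases, which suggests raising the number of supporting reads required for a SNP call when coverage is very deep.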


Production of a Pseudoreference Sequence

In this example, the pseudoreference is produced de novo through alignment of the new pyrosequences using the gsAssembler software. The stringency of alignment can be determined empirically, but a suggested starting point is a minimum contig length of 100 bp, a minimum overlap length of 50 bp, and a minimum match identity of 95%. The goal is to maximize the number of unique contigs and minimize the incidence of false alignments. This is more straightforward if the sequencing reads are chosen from one individual. However, if the sequencing reads are produced from a pool of individuals, then separate alleles of a single locus could be placed into separate contigs if the stringency is too high. Multiple assemblies can be produced at varied stringency and compared through sequence similarity matching (e.g., BlastN) to determine an optimum. While pyrosequencing is capable of producing sequences of 400 bp or longer, this example utilizes shorter sequencing reads for brevity.
The gsAssembler FASTA sequence output has the following format:

>contig00001 length=233 numreads=31
CCGGTCCACATTTTACTTACACCACACAAGGATAGAGCGATCTAG
AGGGTCATGGGGAATGGTATGAGAACCACTGCTTTTAGTATTTTG
CAAGTCCATTTGCTTTGATTCAATATTATTCAGCGTTCATTAATTTA
TTCATCATCAATAATTGCTTGATGCTGGTCCATTGATATGGATCCA
GAGCCTATCCTGGGGAACACTAGGCATGATATGGGAATACACA
CTGGAGG
>contig00002 length=242 numreads=15
CCATACATTTAAACTCCTCATAATATGAATTTTCCTGACATCGCTT
ATTGGTGTAACTACGCCGTAGCAAGCACCCCCCAAGCCAAGGCGT
TCTTTTATAGGCATGGGGTTTTTCTTTTTTAGGGTCACTTTCATTT
GGCATACTTCATGCAGGCAATCAACAAACATTCAAGGTCGCACTA
TTTTTATTCAAGCCAACCAAAAATGTAATGCTTTAATGACATAC
TTGAAATAATTACAGG

For downstream applications, it is useful to simplify the header line with the command

#Linux Command 1
sed -e '/>/s/^\(.\{12\}\).*$/\1/g' infile > outfile

Command 1 retains the first 12 characters of each line that contains a “>” (the FASTA format header) in the file “infile” and outputs the results to the file “outfile.” The FASTA output file will then have a simplified FASTA format as shown in Table 5.2. For researchers without expertise in scripting languages or without access to a Unix-type operating system, manipulation of the header line can be performed in Microsoft Word using the “Find/Replace” command. The sequences may have to be processed in batches so as not to overwhelm a desktop computer. Some sequence manipulation and trimming can be performed in Microsoft Excel 2007, which can hold up to 1,048,576 rows, and therefore sequences, per worksheet. The following Linux commands, performed sequentially, will prepare the FASTA-formatted file for import into Microsoft Excel:
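For readers who prefer a scripting language to sed, the same header simplification, together with a one-record-per-line, semicolon-delimited layout suitable for spreadsheet import, can be sketched in Python. This is an illustrative alternative rather than part of the original workflow; the function names are hypothetical, and header lines are assumed to begin with “>”.

```python
def simplify_fasta(lines, keep=12):
    """Truncate each FASTA header line to its first `keep` characters,
    mirroring the effect of Command 1; sequence lines are left unchanged."""
    return [line[:keep] if line.startswith(">") else line for line in lines]


def fasta_to_rows(lines):
    """Collapse each FASTA record into a single 'header;sequence' string."""
    rows, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                rows.append(f"{header};{''.join(chunks)}")
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        rows.append(f"{header};{''.join(chunks)}")
    return rows


if __name__ == "__main__":
    example = [
        ">contig00001 length=233 numreads=31",
        "CCGGTCCACATTTTACTTACACCACACAAGGATAGAGCGATCTAG",
        "AGGGTCATGGGGAATGGTATGAGAACCACTGCTTTTAGTATTTTG",
    ]
    for row in fasta_to_rows(simplify_fasta(example)):
        print(row)
```

To process a file, the lines can be read with `open("infile").read().splitlines()` and the resulting rows written out one per line; “infile” here is simply the placeholder name used in the text.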

Table 5.2 Example of pseudoreference sequences in FASTA format.

>contig00001
CCGGTCCACATTTTACTTACACCACACAAGGATAGAGCGATCTAGAGGGTCAT
GGGGAATGGTATGAGAACCACTGCTTTTAGTATTTTGCAAGTCCATTTGCTTT
GATTCAATATTATTCAGCGTTCATTAATTTATTCATCATCAATAATTGCTTGATG
CTGGTCCATTGATATGGATCCAGAGCCTATCCTGGGGAACACTAGGCATGATAT
GGGAATACACACTGGAGG
>contig00002
CCATACATTTAAACTCCTCATAATATGAATTTTCCTGACATCGCTTATTGGTGTA
ACTACGCCGTAGCAAGCACCCCCCAAGCCAAGGCGTTCTTTTATAGGCATGGGG
TTTTTCTTTTTTAGGGTCACTTTCATTTGGCATACTTCATGCAGGCAATCAACAA
ACATTCAAGGTCGCACTATTTTTATTCAAGCCAACCAAACATGTGATGCTTTA
ATGACATACTTGAAATAATTACAGG
>contig00003
CCAACATTCCCCGACACATGCAATTTCACTATGGTTTAGTACATAATATGTATA
ACTCAACATCATGGTTTAAGTACATATTATGTATAATATTACATCATGGTTTAA
TTCATTACATGGTATATCAACATACAACCTACATTAAACATTTTTGTTTACAATA
TCAAAATAAGCCGTACATAAACCATATTAATTCAAACTCATAAATAATATATCT
TAAAATGGGCTATTGCATAATTCCTAAT
>contig00004
TATAACCAACATTCCCCGACACATGCAATTTCACTATGGTTTAGTACATAATAT
GTATAACTCAACATCATGGTTTAAGTACATATTATGTATAATATTACATCATCG
TTTAATTCATTACATGGTATATCAACATACAACCTACATTAAACATTTTTGTTTA
CAATATCAAAATAAGCCGTACATAAACCATATTAATTCAAACTCATAAATAATA
TATCTTAAAATGGGCTATTGCATAATTCCTAAT
>contig00005
CCTCAATCCCCTGCCCGGGGACGAGGAGCAGGCATCAGGCACACTTTCTACCC
CCGCCCAAGACGCCTTGCTACGCCACACCCCCAAGGGAACTCAGCAGTAATAG
ACATTAAGCCATAAGTGTAAACTTGACTTAGTTAGGGCTATTAGGGCCGGTA
AAATTCGTGCCAGCCACCGCGGTTATACGAAAGACCCTAGTTGCTAGCCACGG
CGTAAAGGGTGGTTAAGGACA
>contig00006
CCACTTCTTGTTTATCCCGCCTATATACCGCCGTCGTCAGCTTACCCTGTGAAG
GCCTAACAGTAAGCAAAATGGGCCCGCCCAAAAACGTCAGGTCGAGGTGTAG
CGTACGAAGTGGGAAGAAATGGGCTACATTTTCTATACCTAGAATATTACGAA
TGGCACCATGAAAATAATGCCTGAAGGTGGATTTAGTAGTAAAAAGCAAATA
GAGTGTCCTTTTGAATTAGGCTCTGAGACGCGCACACACCG
>contig00007
CCGAGCAGTCGCCCAAACTATCTCCTATGAAGTCAGTCTCGGCCTAATCCTTCT
ATCAATTATTATCTTTACAGGAGGTTTTACTCTCCAAATATTTAACATGACACA
AGAAGCTATCTGACTTCTAATCCCTGCCTGACCTCTAGCCGCCATATGATATAT
TTCTACCCTCGCAGAAACAAACCGAGCCCCCTTTGATCTCACAGAAGGCGAA
TCAGAATTAGTATCGGGGTTTAACGTAGAATACGCCGGAGGTCCTTTCGCACT
CTTTTTCCTAGCCGAATACGCCAACA
>contig00010
CCCAGCTCCTTAGAAAGAAGGGATTTGAACCCATATTATGGAGATCAAAACT
CCAAGTGTTTCCATTACACCACTTCCTAGTAAGATCAGCTAAATTAAGCTTTTGG
GCCCATACCCCAAAAATGTAGGTTAAAACCCTTCTCTTACCAATGAGCCCCTAC
GTCATCACAATTTTATTATCAAGCCTAGGTCTAGGCACCGCTCTTACCTTCATA
AGCTCCCACTGGCTGCTAGCATGGATAGGACTAGAAATTAATACCCTAGCGA
TCCTTCCCCTAATAGCTCAACACCATCACC
>contig00012
CCTGTTATACAGGGCTTAACCCTAACCACCGGACTAATTATGGCTACCTGACAA
AAACTGGCCCCATTCGCACTAATCATTCAAATGGCCCCCTTCACCCACCCCCTCC
TATTAACAACCCTAGGATTACTATCCGTTTTCATCGGGGGCTGGGGAGGTTTA
AATCAAACTCAATTACGAAAAATCTTAGCCTACTCATCCATCGCCCATCTCGG
>contig00015
CCACGACGATACTCAGACTACCCCGATGCCTACTCACTATGAAACATCATCTC
TTCAATCGGCTCCCTGGTGTCCCTAGTAGCAGTTGTAATATTCCTGTATATTTTA
TGGGAGGCCTTTACTGCCAAACGAGAAGTACTCTCCGTCGAACTCACCTCCA
CAAACGTAGAGTGGCTACACGGATGCCCCCCACCCTATCA
>contig00020
CCTCACAACTAGGATTCCAAGACGCGGCCTCCCCTGTAATAGAAGAACTTCTG
CACTTCCACGACCACGCCTTAATAATTGTTTTCCTAATTAGCACCTTAGTCCTA
TACATTATTGTTGTTATAGTAACCACCAAGCTTACCAATAAGTTTATCCTAGA
CTCCCAAGAAATTGAAATTGTCTGAACCATCCTCCCAGCAGTA
>contig00022
CCTTAGTCCTATACATTATTGTTGTTATAGTAACCACCAAGCTTACCAATAAGT
TTATCCTAGACTCCCAAGAAATTGAAATTGTCTGAACCATCCTCCCAGCAGTAA
TCCTTATTCTAGTTGCCCTTCCTTCCCTTCGAATTCTTTATCTAATGGATGAAGT
AAATGATCCCCACTTGACAGTAAAAGCCATGGGCCATCAATGGTATTGAAGC
TATGAGTATACTGACTACGAAAATTTAGCTTTCGACTCCTATATAATCCCCACA
CAAGACCTGGTCCCAGGACAATTCCGACTACT
>contig00025
GATTTGCAATCCTTGTATTCTCGTGATTAATTTTCTTGACAGTAATCCCAAACA
AAGTCTTAAACCACACCTTCACAAATGAAGTCACAGCACTTAGTGCCGAAAAA
CTTAAATCAGACACCTGAAACTGACCATGGCACTAAACCTGTTTGACCAATTT
ATAAGCCCCACACATCTCGGTATCCCCCTAATTGCTATTGCTCTCACCCTCCCT
TGAATTTTAATC
>contig00026
GCTGTCCTTAAATATAGGACTGGCCGTACCGCTATGGCTAGCCACAGTAATTA
TTGGCCTCCGAAACCAGCCCACTGCGGCCCTAGGACACCTCCTACCAGAAGG
AACTCCCGCCCTTTTAATTCCAATTCTAATTATTATTGAAACCATCAGCTTATT
TATTCGCCCTCTAGCCCTCGGAGTCCGACTCACAGCTAATCTTACAGCCGGCCA
CCTGCTAATTCAACTAATCTCAACAGCAACCATCACCCTTATGCCCATAATAA
CCACAGTAGCAACCCTTACCGCCATTCTTCTAGTGCTATTAACACTCCTAGAGG
TTGCAG
>contig00030
CGAATGCGGTTTCGACCCTTTAGGCTCTGCACGCCTACCCTTCTCCCTACGCTTC
TTTCTAGTCGCCATCCTATTCTTGCTGTTTGACCTGGAAATTGCCCTCCTGCTCC
CCCTTCCATGAGGCAATCAACTACTAACTCCCGCTTACACCCTTCTATGAGCTGC
AACCATTTTAATCCTACTCACCCTAGGCCTAATTTATGAGTGGGTACAGGGTGG
CCTAGAATGGGCCGAATAGGGGACTAGTCCAAATTAAGACCTCTGATTTCGACT
CAGAAAACCGCGGTTTAATTCCGCGGTCCCCTTATGACACC
>contig00035
TGTTATTTCCAACAATTTGACTCTCCCCTTCCAAATGAGTTTGAACTACTACGA
CCCTTCAAAGCTTAATTATCGCCCTAGTCAGCCTTAGCTGAATTAATTGGTCC
TCAGAAACAGGCTGAGCTTCCTCTAACTTATATATAGGCACGGACCCTTTGTCA
ACTCCCCTTTTAGTGCTCACTTGTTGACTACTCCCCCTCATAATTCTCGCTAGC
CAAAATCACATTAAAGCCG
>contig00040
CCCTTATATGGAGTTCACCTCTGACTACCAAAAGCCCACGTAGAAGCTCCAG
TAGCCGGATCCATGGTACTAGCAACAATTTTACTAAAACTTGGAGGCTACGGC
ATAATACGAATAATACTTATACTAGACCCCCTGTCCAAAGACATAGTATATCCT
GTTATTGCACTAGCCCTCTGAGGCGTACTAATGACAGGCTCTATCTGCTTACGA
CAAACAGACTTAAAATCATTAATTGCCTACTCATCCGTCAGCCACATAGGCC
TTGTTGCAGGCGGAATTTTAATCCAAACCCCATGAGGCTTTACCGGCGCCCTCG
TATTAATAATTGCCCATGGCCTAGTCTCGTCTGCCCTATTCTGTTTGGCCAATA
CCACTTACGAACGCACCCA
>contig00045
CCTCCTTCCAGTTGCTCTCCTCATTACAAAGCCTGAAATCATATGAGGTTGATG
GTACTGTAGATATAGTTTAACACAAAACATTAGATTGTGGTTCTAAAAATGG
AAGTTAAACCCTTCCTATCCACCGAAAGAGGCCCAGGGCAGTAGAGACTGCTA
ATCCCTATTACCACGGTTAAACTCCGTGGCTCATTCAAAGCTCCTAAAGGATAA
TAGTTCATCCGTTGGTCTTAGGAACCAAAGACTCTTGGTGCAACTCCAAGTAGC
AGCTATGGCAGATATTATAACCACCACCCTTCTTCTCACCCTAGCAATTCTAAT
GTGACCTCTTATAACAACACTAAGTCCCACCCCCTTAGACCAAAAATGGGCCC
TAAAATACGTCAAAACCGCCGTAAGCACTGCATTTTTTATTAACACTATCCCCC
TTATTATTTTCTTAGACCAAGG

#Linux Command 2 (places semicolon at the end of the header line) ‘$sed -e ’/>/s/$/;/g’ infile >outfile1’ #Linux Command 3 (removes all hard return/line feed characters in file) ‘$tr -d [:cntrl:] outfile2’ #Linux Command 4 (adds line feed before each header line) ‘$sed -e ‘s/>/\n>/g’ outfile2 >outfile3’ The semicolon delimits the header from the sequence for parsing into separate columns in the spreadsheet. >contig00001;CCGGTCCACATTTTACTTACACCACACAAGGATAGAG CGATCTAGAGGGTCATGGGGAATGGTATGAGAACCACTGCTTTTA GTATTTTGCAAGTCCATTTGCTTTGATTCAATATTATTCAGCGTTC ATTAATTTATTCATCATCAATAATTGCTTGATGCTGGTCCATTGAT ATGGATCCAGAGCCTATCCTGGGGAACACTAGGCATGATATGGGA ATACACACTGGAGG >contig00002;CCATACATTTAAACTCCTCATAATATGAATTTTCCTGAC ATCGCTTATTGGTGTAACTACGCCGTAGCAAGCACCCCCCAAGCC AAGGCGTTCTTTTATAGGCATGGGGTTTTTCTTTTTTAGGGTCAC TTTCATTTGGCATACTTCATGCAGGCAATCAACAAACATTCAAGG TCGCACTATTTTTATTCAAGCCAACCAAAAATGTAATGCTTTA ATGACATACTTGAAATAATTACAGG Occasionally, two genomic fragments may join during the ligation of sequencing adapters, which can lead to a chimeric sequencing read. Therefore, the contigs should be screened to identify those containing the restriction site recognition sequence.


Extraneous sequence 5′ or 3′ of the restriction recognition sequence should be removed.
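The reformatting and screening steps above can also be sketched in Python. This is a minimal illustration rather than part of the original workflow: the function names are hypothetical, and the Alu I site (AGCT) is used only as an example recognition sequence.

```python
def linearize_fasta(text):
    """Return one 'header;sequence' string per FASTA record."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append(header + ";" + "".join(chunks))
            header, chunks = line, []
        else:
            chunks.append(line.replace(" ", ""))
    if header is not None:
        records.append(header + ";" + "".join(chunks))
    return records


def contains_site(record, site="AGCT"):
    """True if the sequence part of a 'header;sequence' record contains the site."""
    sequence = record.split(";", 1)[1].upper()
    return site in sequence


# Toy input (hypothetical contigs, not taken from Table 5.2).
fasta = ">contigA\nCCGGTC\nCACATT\n>contigB\nCCATAGCTTT\n"
records = linearize_fasta(fasta)
flagged = [r.split(";", 1)[0] for r in records if contains_site(r)]
print(records)
print(flagged)
```

Flagged contigs are candidates for trimming or removal; whether a site at the very end of a contig should count depends on the enzyme and library design, which this sketch does not model.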

Sequence Mapping to Pseudoreference and SNP Identification

Contig sequences that match known interspersed repeats, such as transposons, should also be identified because these multicopy loci can produce false positive SNPs when reads from multiple loci map to one contig. Simple repetitive sequence in the individual reads should be masked using a program such as RepeatMasker (Smit et al., 2004). This will minimize false matches due to short tandem repeats or low-complexity sequence. Masked individual reads are then mapped to the pseudoreference sequences using SSAHA_pileup, which is a pipeline of commands within the SSAHA2 software (Ning et al., 2001). The SSAHA2 documentation explains the options used to control the stringency of alignment, but one may vary the stringency empirically to obtain optimal results. A suggested starting point is the use of “-seeds 5 -score 100 -kmer 13 -skip 4 -diff 0” in the command line. The SSAHA_pileup output is written to a space-delimited text file that can be imported into a spreadsheet for convenience.

Filtering and Analysis of SSAHA_pileup Output

The sample SNP output file shown in Table 5.3, based on the contigs from Table 5.2, contains one SNP locus per row. The report only includes contigs for which a putative SNP was detected. The columns are defined below.

Column A: “SNP_hom” or “SNP_hez” denote homozygosity or heterozygosity within the aligned reads. The SNP_hom loci (such as loci 40 and 45 in rows 25 and 26) may be useful in comparisons of one or a few individuals against a genome reference sequence. In a pseudoreference-RRL experiment, the SNP_hom loci can be discarded.
Column B: The contig names are extracted from the reference FASTA file. Contig names from Table 5.2 were truncated for brevity in this example.
Column C: The SNP confidence score is based on the uniqueness of the read alignment combined with the base quality. When multiple reads are aligned at each contig, the SNP confidence score may not be indicative of SNP quality. For example, contigs 10, 20, and 30 had a low score (20) due to the low number of reads, but the SNPs could be valid.
Column D: Position of the SNP within the contig sequence.
Column E: Number of sequencing reads mapped to the reference at the SNP position.
Column F: Nucleotide in the reference sequence at SNP position.
Column G: Dominant polymorphism at the SNP position.
Column H: Number of mapped reads with adenine at SNP position.
Column I: Number of mapped reads with cytosine at SNP position.
Column J: Number of mapped reads with guanine at SNP position.
Column K: Number of mapped reads with thymine at SNP position.
Column L: Number of mapped reads with gap at SNP position.
Column M: Number of mapped reads with ambiguous base at SNP position.

Table 5.3 Simulated output from SNP discovery software (SSAHA_pileup).

    A         B         C          D       E        F          G         H      I      J      K      L      M
1   SNP_type  Ref_name  SNP_score  Offset  N_reads  refe_base  SNP_base  N_'A'  N_'C'  N_'G'  N_'T'  N_'-'  N_'N'
2   SNP_hez:  1         99         45      42       G          G/A       28     0      14     0      0      0
3   SNP_hez:  2         90         204     32       C          T/C       0      23     0      9      0      0
4   SNP_hez:  2         80         209     32       G          C/G       0      8      24     0      0      0
5   SNP_hez:  3         99         87      38       T          G/T       0      0      13     22     3      0
6   SNP_hez:  4         99         192     66       C          T/C       0      42     0      24     0      0
7   SNP_hez:  5         99         170     45       T          A/T       19     0      0      26     0      0
8   SNP_hez:  6         99         181     33       C          T/C       0      20     0      13     0      0
9   SNP_hez:  6         70         205     32       A          T/A       25     0      0      7      0      0
10  SNP_hez:  6         70         208     32       T          A/T       7      0      0      25     0      0
11  SNP_hez:  6         70         215     32       T          A/T       7      0      0      25     0      0
12  SNP_hez:  7         99         85      42       A          G/A       24     0      17     1      0      0
13  SNP_hez:  10        20         172     4        A          G/A       2      0      2      0      0      0
14  SNP_hez:  12        99         84      45       A          G/A       28     2      15     0      0      0
15  SNP_hez:  15        60         32      21       A          G/A       15     0      6      0      0      0
16  SNP_hez:  15        60         210     21       G          T/G       0      0      15     6      0      0
17  SNP_hez:  20        20         87      6        C          A/C       2      4      0      0      0      0
18  SNP_hez:  22        99         42      53       G          T/G       0      0      30     23     0      0
19  SNP_hez:  25        99         42      29       A          G/A       18     0      11     0      0      0
20  SNP_hez:  26        99         14      60       T          A/T       15     0      0      45     0      0
21  SNP_hez:  26        11         29      60       G          A/G       27     0      33     0      0      0
22  SNP_hez:  26        46         45      60       C          T/C       0      44     0      16     0      0
23  SNP_hez:  30        20         15      6        G          C/G       0      2      4      0      0      0
24  SNP_hez:  35        20         105     3        A          T/A       1      0      0      2      0      0
25  SNP_hom:  40        80         214     8        G          T         0      0      0      8      0      0
26  SNP_hom:  45        99         8       10       T          C         0      10     0      0      0      0
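For readers who prefer scripting to a spreadsheet, one row of the report can be parsed as sketched below. This is a minimal sketch that assumes each row carries the 13 fields in the order of columns A to M of Table 5.3; the real output of the SNP caller may be laid out differently, and the names `SnpRow` and `parse_snp_row` are hypothetical.

```python
from collections import namedtuple

# Field order is an assumption taken from columns A-M of Table 5.3.
SnpRow = namedtuple(
    "SnpRow",
    "snp_type ref_name score offset n_reads ref_base snp_base "
    "n_a n_c n_g n_t n_gap n_n",
)


def parse_snp_row(line):
    """Split one space-delimited report line into a typed record."""
    fields = line.split()
    if len(fields) != 13:
        raise ValueError("expected 13 fields, got %d" % len(fields))
    return SnpRow(
        fields[0].rstrip(":"),      # SNP_hez or SNP_hom
        fields[1],                  # contig name
        int(fields[2]),             # SNP confidence score
        int(fields[3]),             # 1-based position in the contig
        int(fields[4]),             # reads covering the position
        fields[5],                  # reference base
        fields[6],                  # observed alleles, e.g. G/A
        *[int(x) for x in fields[7:]]  # counts of A, C, G, T, gap, N
    )


# Row 2 of Table 5.3.
row = parse_snp_row("SNP_hez: 1 99 45 42 G G/A 28 0 14 0 0 0")
print(row.ref_name, row.offset, row.n_a, row.n_g)
```

Keeping the counts as integers makes the later filtering steps straightforward comparisons.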

As an example, contig00001 from Table 5.2, listed as “1” in Table 5.3, contained a G/A polymorphism at position 45 in the sequence. There were only two alleles: 28 reads contained the “A” allele and 14 reads contained the “G” allele. There were no reads with gaps or ambiguous base calls at this position. All contigs contained one SNP except for contig 2 (2), contig 6 (4), contig 15 (2), and contig 26 (3).

The stringency of filtering the data to remove unlikely SNPs can vary according to the investigator. One may afford to be more aggressive in removing questionable SNPs if the experiment provides a high number of candidate high-probability SNP loci. Alternately, one may wish to keep more questionable loci if the total is low, or at least segregate loci into classes based on quality of results. Nonetheless, the removal of low-probability SNPs is more cost efficient in silico than later during SNP validation. Loci with good-quality SNPs will contain at least two reads at each of exactly two alleles. Loci with two or more reads at a third allele could indicate coalignment of multicopy loci, so these should be considered only if validated through genotyping of individuals. The investigator may also wish to remove contigs that contain two or more Ns or gaps at the SNP position. The following formulas can be used to denote higher-quality SNP loci based on the number of reads per allele. The formulas assume the data structure from Table 5.3:

Excel Cell N2 =IF(H2<2,IF(OR(I2<2,J2<2,K2<2),1,0))
Excel Cell O2 =IF(I2<2,IF(OR(H2<2,J2<2,K2<2),1,0))
Excel Cell P2 =IF(J2<2,IF(OR(H2<2,I2<2,K2<2),1,0))
Excel Cell Q2 =IF(K2<2,IF(OR(H2<2,I2<2,J2<2),1,0))
Excel Cell R2 =SUM(N2:Q2)
Excel Cell S2 =IF(L2<2,IF(M2<2,IF(R2=2,1,0)))

As shown in Table 5.4, the output in column R is “2” for probable SNP loci, “0” for loci with two or more reads at a third allele, and “3” for loci at which only one allele is supported by two or more reads. Thus, contigs 35, 40, and 45 could be culled, and contig 12 would be culled or segregated for eventual validation. The output in column S would be “1” for probable SNP loci or “FALSE” for loci with two or more gaps and/or two or more Ns (contigs 35, 40, and 45 are marked as “1” but would have been previously culled). Contig 3 could be noted for possible validation because it contained three reads with a gap at the SNP locus. The gap could represent an insertion/deletion, but the base calling algorithm sometimes produces small insertions/deletions in pyrosequences, so these polymorphisms should be considered with caution.

Table 5.4 SNP output analysis extended from Table 5.3.

    B         C          D       E      N      O      P      Q      R      S
1   Ref_name  SNP_score  Offset  Reads                              Qual1  Qual2
2   1         99         45      42     FALSE  1      FALSE  1      2      1
3   2         90         204     32     1      FALSE  1      FALSE  2      1
4   2         80         209     32     1      FALSE  FALSE  1      2      1
5   3         99         49      38     1      1      FALSE  FALSE  2      FALSE
6   4         99         107     66     1      FALSE  1      FALSE  2      1
7   5         99         12      45     FALSE  1      1      FALSE  2      1
8   6         99         38      33     1      FALSE  1      FALSE  2      1
9   6         70         85      32     FALSE  1      1      FALSE  2      1
10  6         70         191     32     FALSE  1      1      FALSE  2      1
11  6         70         192     32     FALSE  1      1      FALSE  2      1
12  7         99         213     42     FALSE  1      FALSE  1      2      1
13  10        20         194     4      FALSE  1      FALSE  1      2      1
14  12        99         85      45     FALSE  FALSE  FALSE  0      0      0
15  15        60         49      21     FALSE  1      FALSE  1      2      1
16  15        60         85      21     1      1      FALSE  FALSE  2      1
17  20        20         163     6      FALSE  FALSE  1      1      2      1
18  22        99         120     53     1      1      FALSE  FALSE  2      1
19  25        99         111     29     FALSE  1      FALSE  1      2      1
20  26        99         21      60     FALSE  1      1      FALSE  2      1
21  26        11         77      60     FALSE  1      FALSE  1      2      1
22  26        46         78      60     1      FALSE  1      FALSE  2      1
23  30        20         228     6      1      FALSE  FALSE  1      2      1
24  35        80         144     8      1      1      1      FALSE  3      1
25  40        99         209     10     1      FALSE  1      1      3      1
26  45        20         99      3      1      1      1      FALSE  3      1

The average number of reads per contig should be calculated over all contigs in the assembly because a general guide is to select SNP-containing contigs whose number of aligned reads is no more than twice the overall average. Selection of contigs with large numbers of aligned reads risks the selection of multiple-copy loci with a greater potential for false positive SNPs. However, undersampling of the DNA fraction could produce a multicopy locus contig that contains only four or five aligned sequences, thus also producing a false positive SNP.

As stated above, some sequencing platforms, for example, 454 sequencers, have difficulty resolving longer homopolymer sequences, so alignment of reads with 1-bp differences in the homopolymer sequence can lead to false positive SNPs. Therefore, it is useful to identify these sequences for further validation. The following formulas assume the original contig sequence has been inserted into column T:

Excel Cell U2 =MID(T2,(D2-3),3)
Excel Cell V2 =MID(T2,(D2+1),3)
Excel Cell W2 =IF(OR(U2="aaa",U2="ccc",U2="ggg",U2="ttt",V2="aaa",V2="ccc",V2="ggg",V2="ttt"),1,0)

Column U will contain the three bases prior to the SNP base, and column V will contain the three bases that follow the SNP base. Column W will contain a “1” if a 3-bp homopolymer is present; otherwise it is “0”. This column serves as a handle for sorting the data.
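The allele-support and read-depth rules above can also be expressed directly in code. The following is a minimal Python sketch of the intent of the filters, not a transliteration of the spreadsheet formulas; the function names, the labels, and the default thresholds are assumptions chosen for illustration.

```python
def classify_locus(n_a, n_c, n_g, n_t, n_gap=0, n_n=0, min_reads=2):
    """Label a locus by how many alleles have at least min_reads reads.

    Returns (label, clean), where clean is False when gaps or Ns reach
    min_reads at the SNP position.
    """
    supported = sum(1 for n in (n_a, n_c, n_g, n_t) if n >= min_reads)
    if supported == 2:
        label = "probable"              # exactly two well-supported alleles
    elif supported > 2:
        label = "possible multicopy"    # a third allele is also supported
    else:
        label = "insufficient support"  # fewer than two supported alleles
    clean = n_gap < min_reads and n_n < min_reads
    return label, clean


def within_depth_limit(n_reads, mean_reads_per_contig, factor=2.0):
    """True if depth is no more than factor times the assembly-wide mean."""
    return n_reads <= factor * mean_reads_per_contig


# Read counts taken from Table 5.3 (A, C, G, T, gap, N).
print(classify_locus(28, 0, 14, 0))        # contig 1
print(classify_locus(28, 2, 15, 0))        # contig 12
print(classify_locus(0, 0, 13, 22, 3, 0))  # contig 3
print(classify_locus(0, 0, 0, 8))          # contig 40
```

Counting supported alleles directly avoids the nested IF logic and makes the threshold of two reads easy to change if a data set warrants a stricter rule.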
Table 5.5 shows 3-bp homopolymers (flagged with “1” in column W) in contigs 2, 5, 6, and 45. Empirical testing can determine one’s level of confidence in these sequences for the design of successful allele-specific primers. One may wish to alter these formulas to report longer homopolymers instead.

If the reference sequence is relatively short, compared with a long chromosomal region or kilobase-sized contig, then one must consider the distance of the SNP from either end of the sequence contig. It may not be possible to design a useful primer for a SNP genotyping assay with less than 25 bp of flanking sequence. The distance from the left end is the offset (column D). Distance from the right end is obtained by

Excel Cell X2 =LEN(T2)-D2

The number of SNPs per contig will vary depending on sequence length, reference population, and even species. It is useful to utilize a pivot table within the spreadsheet to identify the number of SNPs per contig. A large number of SNPs in one contig may signal a multilocus alignment or a misalignment. Validation of a sample of these loci will determine a practical limit for the selection of the number of SNPs per contig as high-probability SNPs. In practice, only one SNP per contig will likely be used for genotyping.

Table 5.5 SNP output analysis extended from Table 5.3.

    B         C          D       T                   U          V           W            X                    Y            Z             AA
1   Ref_name  SNP_score  Offset  Sequence (partial)  Left 3 bp  Right 3 bp  Homopolymer  Dist_from right end  SNP per ctg  Modified SNP  Index
2   1         99         45      …GATCTAGAGGGTC…     CTA        AGG         0            188                  1            ;G/A;         1
3   2         90         204     …ACCAAACATGTGA…     AAA        ATG         1            38                   2            ;T/C;         1
4   2         80         209     …ACATGTGATGCTT…     TGT        ATG         0            33                   2            ;C/G;         2
5   3         99         49      …ATAATATGTATAA…     ATA        GTA         0            196                  1            ;G/T;         1
6   4         99         107     …CATCATCGTTTAA…     CAT        GTT         0            143                  1            ;T/C;         1
7   5         99         12      …ATCCCCTGCCCGG…     CCC        GCC         1            220                  1            ;A/T;         1
8   6         99         38      …CGTCGTCAGCTTA…     CGT        AGC         0            214                  4            ;T/C;         1
9   6         70         85      …CGCCCAAAAACGT…     CCA        AAA         1            167                  4            ;T/A;         2
10  6         70         191     …GGTGGATTTAGTA…     GGA        TTA         0            61                   4            ;A/T;         3
11  6         70         192     …GTGGATTTAGTAG…     GAT        TAG         0            60                   4            ;A/T;         4
12  7         99         213     …AAGGCGAATCAGA…     GCG        ATC         0            80                   1            ;G/A;         1
13  10        20         194     …AGGTCTAGGCACC…     TCT        GGC         0            103                  1            ;G/A;         1
14  12        99         85      …ATTCAAATGGCCC…     CAA        TGG         0            130                  1            ;G/A;         1
15  15        60         49      …AACATCATCTCTT…     ATC        TCT         0            151                  2            ;G/A;         1
16  15        60         85      …GTAGCAGTTGTAA…     GCA        TTG         0            115                  2            ;T/G;         2
17  20        20         163     …TAGACTCCCAAGA…     ACT        CCA         0            40                   1            ;A/C;         1
18  22        99         120     …ATTCTAGTTGCCC…     CTA        TTG         0            181                  1            ;T/G;         1
19  25        99         111     …AAACTTAAATCAG…     CTT        AAT         0            115                  1            ;G/A;         1
20  26        99         21      …TAGGACTGGCCGT…     GAC        GGC         0            305                  3            ;A/T;         1
21  26        11         77      …CCCACTGCGGCCC…     ACT        CGG         0            249                  3            ;A/G;         2
22  26        46         78      …CCACTGCGGCCCT…     CTG        GGC         0            248                  3            ;T/C;         3
23  30        20         228     …TAGAATGGGCCGA…     AAT        GGC         0            86                   1            ;C/G;         1
24  35        80         144     …TATATAGGCACGG…     ATA        GCA         0            90                   1            ;T;           1
25  40        99         209     …TCTGCTTACGACA…     GCT        ACG         0            183                  1            ;C;           1
26  45        20         99      …GGTTCTAAAAATG…     TCT        AAA         1            351                  1            ;T/A;         1

A particular benefit of high-throughput DNA sequencing and alignment of reads from pooled samples is the inference of minor allele frequency (MAF). This is based on the number of reads for each allele that map to the reference sequence. Calculation of the MAF provides a useful tool for further selection of high-likelihood SNP loci that will be useful in populations. However, putative SNPs with a predicted MAF near 0.5 could arise from alignment of reads from multicopy loci, whereas SNPs with a predicted MAF < 0.05 may not have practical use in the population.
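The homopolymer check, the flanking-distance check, and the read-count estimate of MAF described above can be sketched together in Python. This is a minimal sketch under stated assumptions: positions are 1-based as in column D, the MAF is estimated from the two best-supported alleles only, and the function names and the 25-bp default are illustrative choices rather than fixed rules.

```python
def flank_report(sequence, offset, run=3, min_flank=25):
    """Describe the sequence context of a SNP at a 1-based offset."""
    i = offset - 1
    left = sequence[max(0, i - run):i].upper()    # bases before the SNP
    right = sequence[i + 1:i + 1 + run].upper()   # bases after the SNP

    def is_run(s):
        # A full-length stretch made of a single repeated base.
        return len(s) == run and len(set(s)) == 1

    left_flank = offset - 1               # bases available 5' of the SNP
    right_flank = len(sequence) - offset  # bases available 3' of the SNP
    return {
        "left": left,
        "right": right,
        "homopolymer": is_run(left) or is_run(right),
        "left_flank": left_flank,
        "right_flank": right_flank,
        "primer_room": min(left_flank, right_flank) >= min_flank,
    }


def minor_allele_frequency(counts):
    """Estimate MAF from read counts using the two best-supported alleles."""
    top = sorted(counts, reverse=True)[:2]
    total = sum(top)
    return top[1] / total if total else 0.0


report = flank_report("ACGTAAAGCCT", 8)       # toy sequence, SNP at the G
maf = minor_allele_frequency((28, 0, 14, 0))  # contig 1 in Table 5.3
print(report)
print(round(maf, 3))
```

Because the estimate comes from pooled reads, it is only as reliable as the read depth; loci with few reads give coarse frequency estimates.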

SNP Sequence Annotation for Primer Design

The SNP locus should be annotated in the contig sequence to facilitate design of locus-specific and SNP-specific primers for genotyping. The SNP database (dbSNP) at the National Center for Biotechnology Information (NCBI) accepts a variety of SNP designations, most commonly two nucleotides separated by a slash (e.g., A/G):

SNP:contig00001
LENGTH:233
5′_FLANK:CCGGTCCACATTTTACTTACACCACACAAGGATAGAGCGATCTA
OBSERVED:A/G
3′_FLANK:AGGGTCATGGGGAATGGTATGAGAACCACTGCTTTTAGTATTTTGCAAGTCCATTTGCTTTGATTCAATATTATTCAGCGTTCATTAATTTATTCATCATCAATAATTGCTTGATGCTGGTCCATTGATATGGATCCAGAGCCTATCCTGGGGAACACTAGGCATGATATGGGAATACACACTGGAGG

Alternately, the BatchPrimer3 primer design software (You et al., 2008) requires the SNP to be designated as an ambiguity code on a separate line between the 5′ and 3′ flanking sequences:

>contig00001
CCGGTCCACATTTTACTTACACCACACAAGGATAGAGCGATCTA
R
AGGGTCATGGGGAATGGTATGAGAACCACTGCTTTTAGTATTTTGCAAGTCCATTTGCTTTGATTCAATATTATTCAGCGTTCATTAATTTATTCATCATCAATAATTGCTTGATGCTGGTCCATTGATATGGATCCAGAGCCTATCCTGGGGAACACTAGGCATGATATGGGAATACACACTGGAGG

The following commands for SNP annotation within Microsoft Excel assume that column Y contains the number of SNPs per contig and column Z contains the modified SNP text. The latter permits modification of the SNP if the original format in column G (e.g., “A/G”) is not desired. For example, changing “A/G” to “;A/G;” flanks the SNP with semicolons for downstream sequence manipulation. Alternately, one may substitute “A/G” with “;R;” in column Z while retaining the original SNP designation in column G.

The following instructions are designed to annotate up to five SNP loci per contig. Sort the contigs by column Y (SNP/contig, ascending), then column B (contig name, ascending), then column D (SNP position, ascending). Index the multiple SNP contigs in column AA: enter a “1” for single SNP contigs; for contigs that contain two or more SNPs, enter a “1” for the first SNP, then a “2” for the second SNP, and so on. The offsets of 4, 8, 12, and 16 in the formula assume that each entry in column Z is five characters long (e.g., “;G/A;”), so that each earlier replacement lengthens the sequence by four characters.

Excel Cell AB2 =IF(AA2=1,REPLACE(T2,D2,1,Z2),IF(AA2=2,REPLACE(AB1,D2+4,1,Z2),IF(AA2=3,REPLACE(AB1,D2+8,1,Z2),IF(AA2=4,REPLACE(AB1,D2+12,1,Z2),IF(AA2=5,REPLACE(AB1,D2+16,1,Z2))))))

Tables 5.5 and 5.6 demonstrate these columns and the eventual marked sequence output in column AB. The data in Table 5.6 may be copied and pasted as values, instead of formulas, to a new spreadsheet. For contigs with multiple SNPs, only the annotated sequence from the highest index should be used. One can use the semicolons in the annotated sequence to separate the 5′ and 3′ flanking sequences into separate columns for merging into an NCBI-formatted file. Alternately, the annotated sequences can be exported to a text file in which each semicolon is replaced by a hard return/line feed to produce the BatchPrimer3-type output.
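The annotation steps above can also be sketched in Python. This is an illustrative sketch, not the spreadsheet method itself: it marks SNPs from right to left so that earlier positions are not shifted by later insertions, which removes the need for offset corrections. The function names are hypothetical, and the ambiguity codes are the standard IUPAC two-base codes.

```python
# Standard IUPAC codes for two-base ambiguities.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def mark_snps(sequence, snps, left=";", right=";"):
    """Replace each SNP base with a delimited allele string.

    snps is a list of (1-based offset, alleles) pairs such as (45, "G/A").
    """
    marked = sequence
    # Work from the highest offset down so earlier offsets stay valid.
    for offset, alleles in sorted(snps, reverse=True):
        marked = marked[:offset - 1] + left + alleles + right + marked[offset:]
    return marked


def to_iupac(alleles):
    """Convert 'A/G' style alleles to a one-letter ambiguity code."""
    return IUPAC[frozenset(alleles.split("/"))]


def batchprimer3_record(name, sequence, offset, alleles):
    """Header, 5' flank, ambiguity code, and 3' flank on separate lines."""
    return "\n".join(
        [">" + name, sequence[:offset - 1], to_iupac(alleles), sequence[offset:]]
    )


# Toy sequence with SNPs at positions 5 and 10 (hypothetical data).
marked = mark_snps("AAAAACCCCCGGGGGTTTTT", [(5, "A/G"), (10, "C/T")])
record = batchprimer3_record("toy_contig", "AAAAACCCCC", 5, "A/G")
print(marked)
print(record)
```

The same `mark_snps` call handles any number of SNPs per contig, so the five-SNP limit of the nested spreadsheet formula does not apply here.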

Table 5.6 SNP output analysis extended from Table 5.3.

Row 2: Ref_name (B) 1; SNP_score (C) 99; Offset (D) 45; Index (AA) 1; Marked Sequence (AB):
CCGGTCCACATTTTACTTACACCACACAAGGATAGAGCGATCTA;G/A;AGGGTCATGGGGAATGGTATGAGAACCACTGCTTTTAGTATTTTGCAAGTCCATTTGCTTTGATTCAATATTATTCAGCGTTCATTAATTTATTCATCATCAATAATTGCTTGATGCTGGTCCATTGATATGGATCCAGAGCCTATCCTGGGGAACACTAGGCATGATATGGGAATACACACTGGAGG

Row 3: Ref_name (B) 2; SNP_score (C) 90; Offset (D) 204; Index (AA) 1; Marked Sequence (AB):
CCATACATTTAAACTCCTCATAATATGAATTTTCCTGACATCGCTTATTGGTGTAACTACGCCGTAGCAAGCACCCCCCAAGCCAAGGCGTTCTTTTATAGGCATGGGGTTTTTCTTTTTTAGGGTCACTTTCATTTGGCATACTTCATGCAGGCAATCAACAAACATTCAAGGTCGCACTATTTTTATTCAAGCCAACCAAA;T/C;ATGTAATGCTTTAATGACATACTTGAAATAATTACAGG

Row 4: Ref_name (B) 2; SNP_score (C) 80; Offset (D) 209; Index (AA) 2; Marked Sequence (AB):
CCATACATTTAAACTCCTCATAATATGAATTTTCCTGACATCGCTTATTGGTGTAACTACGCCGTAGCAAGCACCCCCCAAGCCAAGGCGTTCTTTTATAGGCATGGGGTTTTTCTTTTTTAGGGTCACTTTCATTTGGCATACTTCATGCAGGCAATCAACAAACATTCAAGGTCGCACTATTTTTATTCAAGCCAACCAAA;T/C;ATGT;C/G;ATGCTTTAATGACATACTTGAAATAATTACAGG

Row 5: Ref_name (B) 3; SNP_score (C) 99; Offset (D) 49; Index (AA) 1; Marked Sequence (AB):
CCAACATTCCCCGACACATGCAATTTCACTATGGTTTAGTACATAATA;G/T;GTATAACTCAACATCATGGTTTAAGTACATATTATGTATAATATTACATCATGGTTTAATTCATTACATGGTATATCAACATACAACCTACATTAAACATTTTTGTTTACAATATCAAAATAAGCCGTACATAAACCATATTAATTCAAACTCATAAATAATATATCTTAAAATGGGCTATTGCATAATTCCTAAT

Row 11: Ref_name (B) 6; SNP_score (C) 70; Offset (D) 192; Index (AA) 4; Marked Sequence (AB):
CCACTTCTTGTTTATCCCGCCTATATACCGCCGTCGT;T/C;AGCTTACCCTGTGAAGGCCTAACAGTAAGCAAAATGGGCCCGCCCA;T/A;AAACGTCAGGTCGAGGTGTAGCGTACGAAGTGGGAAGAAATGGGCTACATTTTCTATACCTAGAATATTACGAATGGCACCATGAAAATAATGCCTGAAGGTGGA;A/T;;A/T;TAGTAGTAAAAAGCAAATAGAGTGTCCTTTTGAATTAGGCTCTGAGACGCGCACACACCG

Row 14: Ref_name (B) 12; SNP_score (C) 99; Offset (D) 85; Index (AA) 1; Marked Sequence (AB):
CCTGTTATACAGGGCTTAACCCTAACCACCGGACTAATTATGGCTACCTGACAAAAACTGGCCCCATTCGCACTAATCATTCAA;G/A;TGGCCCCCTTCACCCACCCCCTCCTATTAACAACCCTAGGATTACTATCCGTTTTCATCGGGGGCTGGGGAGGTTTAAATCAAACTCAATTACGAAAAATCTTAGCCTACTCATCCATCGCCCATCTCGG

Row 16: Ref_name (B) 15; SNP_score (C) 60; Offset (D) 85; Index (AA) 2; Marked Sequence (AB):
CCACGACGATACTCAGACTACCCCGATGCCTACTCACTATGAAACTC;G/A;TCTCTTCAATCGGCTCCCTGGTGTCCCTAGTAGCA;T/G;TTGTAATATTCCTGTATATTTTATGGGAGGCCTTTACTGCCAAACGAGAAGTACTCTCCGTCGAACTCACCTCCACAAACGTAGAGTGGCTACACGGATGCCCCCCACCCTATCA

Row 22: Ref_name (B) 26; SNP_score (C) 46; Offset (D) 78; Index (AA) 3; Marked Sequence (AB):
GCTGTCCTTAAATATAGGAC;A/T;GGCCGTACCGCTATGGCTAGCCACAGTAATTATTGGCCTCCGAAACCAGCCCACT;A/G;;T/C;GGCCCTAGGACACCTCCTACCAGAAGGAACTCCCGCCCTTTTAATTCCAATTCTAATTATTATTGAAACCATCAGCTTATTTATTCGCCCTCTAGCCCTCGGAGTCCGACTCACAGCTAATCTTACAGCCGGCCACCTGCTAATTCAACTAATCTCAACAGCAACCATCACCCTTATGCCCATAATAACCACAGTAGCAACCCTTACCGCCATTCTTCTAGTGCTATTAACACTCCTAGAGGTTGCAG

Validation of Putative SNP Loci

The bioinformatic analyses result in a collection of putative SNP loci, but it is important to validate the accuracy of prediction. Several SNP genotyping methods are available and are discussed in Chapter 8 in this volume. Discovery of SNP loci through deep sequence alignments of pooled DNA samples provides a high level of confidence for true locus polymorphism in the reference population and, depending on the inference of the reference population, true polymorphism in other populations. Another advantage of deep sequence alignment is the prediction of allele frequency via counts of sequencing reads. This permits the identification of loci with common alternate alleles versus loci that contain rare alternate alleles that may not be practically useful for scientific investigation.

A typical contig that contains a true polymorphic SNP is shown in Figure 5.2. The sequence alignment shows 18 aligned sequences beneath the partial contig consensus sequence. The 12 G and six A alleles are denoted by an asterisk. For validation, the locus was PCR-amplified from individuals, and the amplicons were sequenced using traditional Sanger chemistry. The asterisk at right denotes the SNP position in each chromatogram. The top chromatogram is from an individual homozygous for the A allele, the middle chromatogram is from an A/G heterozygote, and the bottom chromatogram is from a G homozygote.

                   *
…GGTTGTTTCATTTCCAGAGCCTTTCATGTCTCCATGAAG…

…GGTTGTTTCATTTCCAGAGCCTTTCATGTCTCCATGAAG…
…GGTTGTTTCATTTCCAGAGCCTTTCATGTCTCCATGAAG…
…GGTTGTTTCATTTCCAGAGCCTTTCATGTCTCCATGAAG…
…GGTTGTTTCATTTCCAGAACCTTTCATGTCTCCATGAAG…
…GGTTGTTTCATTTCCAGAGCCTTTCATGTCTCCATGAAG…
…GGTTGTTTCATTTCCAGAGCCTTTCATGTCTCCATGAAG…
…GGTTGTTTCATTTCCAGAGCCTTTCATGTCTCCATGAAG…
…GGTTGTTTCATTTCCAGAACCTTTCATGTCTCCATGAAG…
…GGTTGTTTCATTTCCAGAGCCTTTCATGTCTCCATGAAG…
…GGTTGTTTCATTTCCAGAACCTTTCATGTCTCCATGAAG…
…GGTTGTTTCATTTCCAGAGCCTTTCATGTCTCCATGAAG…
…GGTTGTTTCATTTCCAGAACCTTTCATGTCTCCATGAAG…
…GGTTGTTTCATTTCCAGAGCCTTTCATGTCTCCATGAAG…
…GGTTGTTTCATTTCCAGAGCCTTTCATGTCTCCATGAAG…
…GGTTGTTTCATTTCCAGAACCTTTCATGTCTCCATGAAG…
…GGTTGTTTCATTTCCAGAGCCTTTCATGTCTCCATGAAG…
…GGTTGTTTCATTTCCAGAACCTTTCATGTCTCCATGAAG…
…GGTTGTTTCATTTCCAGAGCCTTTCATGTCTCCATGAAG…

[Chromatogram panels not reproduced: A/A, A/G, G/G]

Figure 5.2 Multiple sequence alignment and chromatograms from a single SNP locus. The asterisk denotes the SNP site. On the left, the consensus sequence is at the top of the alignment and alternate alleles are denoted in green. On the right are chromatograms from an A/A homozygote, an A/G heterozygote, and a G/G homozygote. See color insert.

SNP validation is important to assess the level of false positive SNP loci that arise from incorrect base calls or incorrect alignments (Table 5.7). For example, if 5%–10% of tested loci are heterozygous in all individuals due to alignment of sequences from multicopy loci, one should consider a higher level of stringency in the mapping algorithm. Alternately, one may choose to produce a higher stringency assembly for the pseudoreference.

Table 5.7 SNP discovery decisions.

                SNP in population
Putative SNP    Yes               No
Yes             Correct           False positive
No              False negative    Correct

False negative SNP loci can occur when a SNP exists at the restriction enzyme recognition site. The sequence alignment in Figure 5.3 demonstrates a contig from an RRL produced by digestion with Alu I (AG|CT; in box). The partial consensus sequence is at top, followed by 12 reads that end at the Alu I recognition site, then nine reads with an AGCA sequence that was not digested by Alu I. Although the algorithm did not recognize this as a SNP, it can be identified in contigs that are longer than the expected size range. False negatives could also occur if the reference population does not represent the full population due to undersampling of the DNA library or sampling of a limited genetic pool within the population. Lack of sequence depth would lead to many genomic areas covered by fewer than four sequencing reads; therefore, these loci would have been filtered from the SNP output. In general, false negatives are less of a problem than false positive loci because the high-throughput experiment produces such a large number of putative loci.
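One way to monitor the multicopy artifact described above is to compute the fraction of validated loci at which every genotyped individual is heterozygous. The following Python sketch assumes genotypes are recorded as strings such as "A/G"; the function name and data layout are hypothetical.

```python
def always_heterozygous_fraction(genotypes_by_locus):
    """Fraction of tested loci at which every individual is heterozygous.

    genotypes_by_locus maps a locus name to a list of genotype strings
    such as "A/G"; loci with no genotype calls are ignored.
    """
    def is_het(genotype):
        first, second = genotype.split("/")
        return first != second

    tested = [calls for calls in genotypes_by_locus.values() if calls]
    if not tested:
        return 0.0
    flagged = sum(1 for calls in tested if all(is_het(c) for c in calls))
    return flagged / len(tested)


# Hypothetical validation results for four loci.
validation = {
    "locus1": ["A/A", "A/G", "G/G"],
    "locus2": ["C/T", "C/T", "C/T"],  # heterozygous in all individuals
    "locus3": ["G/G", "G/G"],
    "locus4": ["A/T", "T/T"],
}
fraction = always_heterozygous_fraction(validation)
print(fraction)
```

With only a few individuals per locus, a true SNP can appear heterozygous in all of them by chance, so the fraction is most informative when each locus is genotyped in a reasonable number of individuals.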

…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAGCAGTCATCCACAAGCTGAATGTGTTTCAACAAC…

…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAG
…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAG
…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAG
…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAG
…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAG
…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAG
…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAG
…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAG
…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAG
…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAG
…AACCCGCATCTATGGATCTAGAAACTTGGAAGAGCTCGAGAG
…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAG
…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAGCAGTCATCCA
…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAGCAGTCATCCACAAGCTGAATG
…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAGCAGTCATCCACAAGCTGAATGTGTTTCAACAAC…
…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAGCAGTCATCCACAAGCTGAATGTGTTTCAACAAC…
…CAACTGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAGCAGTCATCCACAAGCTGAATGTGTTTCAACAAC…
…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAGCAGTCATCCACAAGCTGAATGTGTTTCAACAAC…
…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAGCAGTCATCCACAAGCTGAATGTGTTTCAACAAC…
                     GAAACTTGGAAGAGCTCGAGAGCAGTCATCCACAAGCTGAATGTGTTTCAACAAC…
…CAACCGCCATCTATGCATCTGAAACTTGGAAGAGCTCGAGAGCAGTCATCCACAAGCTGAATGTGTTTCAACAAC…

Figure 5.3 DNA sequence alignment from a reduced representation library produced by Alu I (AG|CT) digestion. The consensus sequence is at top, and the box denotes the Alu I restriction recognition site. The first 12 aligned sequences were from fragments digested with Alu I, and the remaining nine sequences contained a SNP at the Alu I site.

Conclusions

Alignment and analysis of deep sequencing data is an efficient method for SNP discovery in species without substantial genomic resources. The example presented in this chapter is only one method of many to facilitate SNP discovery. Short sequences can be produced in great depth, but must be aligned to reference sequences that are long enough to support primer development for genotyping assays. Alternately, long sequences can be aligned to produce a pseudoreference against which the individual sequences are realigned for SNP discovery. As sequencing platforms evolve and become more cost-efficient, genotyping by sequencing could supplant SNP genotyping as the preferred method for comparisons of genomic variation. Individual genome samples can be tagged by ligation of short sequence adapters and the tagged sequence output can be deconvoluted using bioinformatic methods. Alignment of tagged reads can provide allele-specific information for individuals or groups. It is clear that, regardless of the method used, the analysis of genomic variation will require a considerable level of bioinformatic resources and expertise.

Acknowledgments

I thank Dan Nonneman and Curt Van Tassell for helpful discussions in experimental design for SNP discovery. I also thank Linda Ballard, Beth Flanagan, and Ralph Wiedmann for contributing bioinformatic expertise. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture.


References

Altshuler D, Pollara VJ, Cowles CR, Van Etten WJ, Baldwin J, Linton L, and Lander ES. 2000. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature, 407:513–516.
Kerstens HHD, Crooijmans RPMA, Veenendaal A, Dibbits BW, Chin-A-Woeng TFC, den Dunnen JT, and Groenen MAM. 2009. Large scale single nucleotide polymorphism discovery in unsequenced genomes using second generation high throughput sequencing technology: Applied to turkey. BMC Genomics, 10:479.
Langmead B, Trapnell C, Pop M, and Salzberg SL. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10:R25.
Li H and Durbin R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25:1754–1760.
Li H, Ruan J, and Durbin R. 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 18:1851–1858.
Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, and Wang J. 2009. SOAP2: An improved ultrafast tool for short read alignment. Bioinformatics, 25:1966–1967.
Lorenz S, Brenna-Hansen S, Moen T, Roseth A, Davidson WS, Omholt SW, and Lien S. 2010. BAC-based upgrading and physical integration of a genetic SNP map in Atlantic salmon. Animal Genetics, 41:48–54.
Ning Z, Cox AJ, and Mullikin JC. 2001. SSAHA: A fast search method for large DNA databases. Genome Research, 11:1725–1729.
Ramos AM, Crooijmans RPMA, Affara NA, Amaral AJ, Archibald AL, Beever JE, Bendixen C, Churcher C, Clark R, Dehais P, et al. 2009. Design of a high density SNP genotyping assay in the pig using SNPs identified and characterized by next generation sequencing technology. PLoS One, 4:e6524.
Sanchez CC, Smith TPL, Wiedmann RT, Vallejo RL, Salem M, Yao JB, and Rexroad CE. 2009. Single nucleotide polymorphism discovery in rainbow trout by deep sequencing of a reduced representation library. BMC Genomics, 10:559.
Smit AFAH, Hubley R, and Green P. 2004. RepeatMasker Open-3.0. Available at www.repeatmasker.org.
Van Tassell CP, Smith TPL, Matukumalli LK, Taylor JF, Schnabel RD, Lawley CT, Haudenschild CD, Moore SS, Warren WC, and Sonstegard TS. 2008. SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nature Methods, 5:247–252.
Wiedmann RT, Smith TPL, and Nonneman DJ. 2008. SNP discovery in swine by reduced representation and high throughput pyrosequencing. BMC Genetics, 9:81.
You FM, Huo NX, Gu YQ, Luo MC, Ma YQ, Hane D, Lazo GR, Dvorak J, and Anderson OD. 2008. BatchPrimer3: A high throughput web application for PCR and sequencing primer design. BMC Bioinformatics, 9:253.

Chapter 6

SNP Discovery through EST Data Mining

Shaolin Wang and Zhanjiang (John) Liu

The key issue for single-nucleotide polymorphism (SNP) marker applications in aquaculture species is the availability of SNPs. Identification of large numbers of SNPs requires massive sequencing efforts and resources (Picoult-Newberg et al., 1999). Prior to the application of next generation sequencing, large-scale genome sequencing was not possible with aquaculture species. With the recent adoption of next generation sequencing, SNP discovery has been made possible through either whole genome sequencing or sequencing of reduced representation libraries (RRLs; for details, see Chapter 5). In spite of such efforts and possibilities, it is expected that in the near future, whole genome sequences will still not become available for the vast majority of aquaculture species, especially the minor aquaculture species. Therefore, SNP identification in aquaculture species will likely continue to rely on alternative available resources such as expressed sequence tags (ESTs). In this chapter, we will focus on SNP discovery through mining of EST databases.

Advantages and Disadvantages of SNP Discovery through EST Data Mining

ESTs are already available in the databases, so additional sequencing efforts are not essential. ESTs are single-pass sequence reads generated by direct sequencing of cDNA clones. They have been generated in the course of expression studies. In recent years, EST resources have become available for many aquaculture species, and a summary is provided in Table 6.1 for major aquaculture species. EST-derived SNPs are associated with genes; therefore, they are type I markers. Gene-associated SNPs can account for genomic causes of phenotypes. In this regard, gene-associated markers are superior to markers identified from anonymous genomic regions. EST-derived SNPs should colocate with genes in the genome. While the number of markers available is very important, it is even more important to have markers that are evenly distributed in the genome. At the genomic scale, genes are distributed across all chromosomes and chromosome segments, so EST-derived SNPs have the potential for the same distribution in the genome, thereby reducing the level of marker clustering.

Table 6.1 EST resources available from the major aquaculture species as of April 2, 2010 (dbEST release 040210).

Species                                            Number of ESTs
Sanger EST sequences
Salmo salar (Atlantic salmon)                      494,392
Ictalurus punctatus (channel catfish)              354,434
Oncorhynchus mykiss (rainbow trout)                287,928
Gadus morhua (Atlantic cod)                        206,649
Litopenaeus vannamei (white shrimp)                161,075
Ictalurus furcatus (blue catfish)                  139,475
Oreochromis niloticus (tilapia)                    117,193
Crassostrea gigas (Pacific oyster)                 57,139
Dicentrarchus labrax (European sea bass)           54,200
Penaeus monodon (giant tiger prawn)                35,180
Cyprinus carpio (common carp)                      34,056
Eriocheir sinensis (Chinese river crab)            16,882
Mytilus galloprovincialis (Mediterranean mussel)   15,408
Crassostrea virginica (Eastern oyster)             14,560
Fenneropenaeus chinensis (fleshy prawn)            10,446
454 EST sequences
O. mykiss (rainbow trout)                          1,298,911
C. carpio (common carp)                            242,263

However, ESTs are single-pass sequences, and the sequence quality is relatively low. Therefore, identification of SNPs from ESTs may lead to pseudo-SNPs because of sequencing errors. To overcome sequencing errors, deep sequencing and higher coverage are required, which may not be available for most aquaculture species. Without the whole genome sequence, intron positions are unknown, which can lead to genotyping failures when a primer spans or flanks an intron and amplification fails. During the assembly of an EST data set, a low-stringency assembly may bring related gene family members into the same contig, which also leads to identification of pseudo-SNPs. Efforts have been made to develop SNP markers in aquaculture species (He et al., 2003). Recently, efforts for application of EST-derived SNPs have been increasing (Hayes et al., 2007; Moen et al., 2008; Wang et al., 2008). High-throughput SNP genotyping chips have been developed for salmon and catfish for whole genome association studies. Identification of SNPs from ESTs requires extensive bioinformatics support. Here we introduce several software packages commonly used for SNP discovery.

SNP Discovery Using ESTs Generated by Sanger Sequencing

SNP discovery using ESTs requires the alignment of multiple EST sequences generated from the same transcript. The basic procedure involves (1) retrieving EST resources from the National Center for Biotechnology Information (NCBI) dbEST database; (2) cluster analysis of ESTs by sequence alignment; and (3) identification of mismatches (SNPs) based on the alignment of multiple sequences.

Retrieving EST Sequences

Retrieving ESTs from existing databases is the first step for researchers who wish to use EST resources generated by different laboratories. This step is not required if one already has the ESTs, either as the original trace files or in FASTA format. NCBI dbEST is the most useful and powerful database, including over 65 million ESTs (as of April 2010) from various species. To retrieve all the EST resources for a species of interest, go to the NCBI Web site (www.ncbi.nlm.nih.gov): click the “Search” drop-down menu and select EST, then enter the species name, such as Ictalurus punctatus (scientific names provide more accurate searches); choose the “FASTA” format from the “Display” drop-down menu; then go to the “Send” drop-down menu and select “File” to download the EST data set. The EST sequences retrieved in this way are in the FASTA file format. If you need the trace files for SNP discovery programs such as POLYBAYES and POLYPHRED (Nickerson et al., 1997; Marth et al., 1999), they must be obtained from the trace repository at NCBI, which is under construction. Currently, EST sequence traces can be downloaded from the Washington University FTP site (ftp://genome.wustl.edu/pub/gsc1/est) for ESTs produced there.
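Before moving on to alignment, it can be useful to confirm that a downloaded FASTA file is complete by counting its records. The following is a minimal Python sketch and is not part of any package described in this chapter; the function name `read_fasta` and the two example records are hypothetical.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical two-record file standing in for a dbEST download.
example = StringIO(">EST_1 example record\nACGTAC\nGGTA\n>EST_2 example record\nTTGACA\n")
records = list(read_fasta(example))
print(len(records), "sequences;", sum(len(seq) for _, seq in records), "bases")
```

For a real download, the `StringIO` object would be replaced by `open("file.fasta")`, where the file name is whatever was chosen when saving the data set.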

Multiple Sequence Alignments

A number of methods are available for multiple sequence alignment; only a few are discussed here. If a small number of genes is involved, easily accessible tools such as the Basic Local Alignment Search Tool (BLAST) and ClustalW provide a quick and convenient route to SNP identification. If a large number of ESTs is involved, EST assembly is needed.

NCBI BLAST

Computational SNP discovery, in a general sense, refers to the process of compiling and organizing DNA sequences that represent orthologous regions in samples from multiple individuals, followed by the identification of polymorphic sequence locations. The first step typically involves a similarity search with BLAST to compile groups of sequences that originate from the region under examination (Altschul et al., 1990). This is followed by the construction of a base-wise multiple alignment to determine the precise, base-to-base correspondence of residues present in each of the samples in a group. Finally, each position of the multiple alignment is scanned for nucleotide mismatches. The following is a step-by-step example of how to use BLAST to discover SNPs from EST sequences:

(1) Start with a target gene or EST sequence for SNP discovery.
(2) Go to NCBI BLAST (blast.ncbi.nlm.nih.gov/Blast.cgi) and, under the Basic BLAST category, select nucleotide BLAST.
(3) Paste or enter the query sequence; select the EST database for the species of interest, such as “est_others” for all aquaculture species; specify the species for an accurate search, such as I. punctatus; then click the BLAST button.
(4) Once the BLAST results appear, go to “format option” at the top of the BLAST result page, select “Query-anchored with dots for identities” under the alignment menu, then click “Reformat.” The alternative alleles of the putative SNPs are shown as letters in the multiple alignment view, and all identical nucleotides are shown as dots (Figure 6.1).

                                                  G/T    A/G
Query     481  CACCTGTGATTTCATCTCATATAGGACAGCTTCTTCTGGAGTCAGCTGAATGGCCTCGTC  540
GH670619  490  ....................................C......C................  549
FD181027  490  ............................................................  549
GH670618  346  ....................................C......C................  287
FD182032  490  ....................................C......C................  549
GH691336  346  ............................................................  287
FD360977  346  ............................................................  287
FD181026  330  ............................................................  271
GH684633  482  ............................................................  541
GH684632  346  ............................................................  287
FD182031  225  ....................................C......C................  166

Figure 6.1 SNP visualization from BLAST results with the option of “Query-anchored with dots for identities.” The dots represent identical nucleotides at the locus. The positions indicated by arrows are the SNP sites with C/T and A/C SNPs.

NCBI BLAST also has a stand-alone version, which runs on all platforms including Windows, Unix/Linux, and Mac OS. The BLAST programs can be downloaded from NCBI (blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download). The commands for running BLAST are identical on all platforms. To run BLAST searches on a local computer, the sequences must first be downloaded and formatted as a database. NCBI BLAST provides the command “formatdb” for setting up the database. The EST sequence file (FASTA format) can be used directly with the command “formatdb -i input_file -p F” (input_file is the sequence file; “-p F” indicates nucleotide sequences). Once the database has been set up, BLAST searches can be performed with “blastall -p blastn -i input_file -d database -o output_file -m 2” (input_file contains the target gene or sequences for SNP identification; “-m 2” formats the results as query-anchored, no identities).
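The final scan of a query-anchored alignment for mismatches can be sketched in a few lines of Python. This is an illustrative sketch rather than part of BLAST: the function name `scan_query_anchored` is hypothetical, and the input is assumed to be rows already extracted from the BLAST report, with dots marking identity to the query. The example rows follow Figure 6.1.

```python
def scan_query_anchored(query, hits, query_start=1):
    """Return (position, query_base, {alt_base: count}) for each mismatch column.

    `query` is the query row; `hits` are rows of equal length in which '.'
    means identical to the query. Gaps ('-') and padding spaces are ignored.
    """
    results = []
    for i, q in enumerate(query):
        alts = {}
        for row in hits:
            base = row[i]
            if base not in ".- " and base != q:
                alts[base] = alts.get(base, 0) + 1
        if alts:
            results.append((query_start + i, q, alts))
    return results

query = "CACCTGTGATTTCATCTCATATAGGACAGCTTCTTCTGGAGTCAGCTGAATGGCCTCGTC"
same = "." * 60                                           # identical to the query
variant = "." * 36 + "C" + "." * 6 + "C" + "." * 16       # two mismatches
hits = [variant, same, variant, variant, same, same, same, same, same, variant]

for pos, ref, alts in scan_query_anchored(query, hits, query_start=481):
    print(pos, ref, alts)
```

Keeping the per-allele counts is deliberate: a mismatch supported by a single read is more likely to be a sequencing error than one supported by several reads.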

Multiple Alignment Tool

The ClustalX/ClustalW program has been widely used for multiple sequence alignment of both protein and nucleic acid sequences and for construction of phylogenetic trees (Higgins et al., 1996; Jeanmougin et al., 1998). The program has undergone many improvements since Clustal was first described in 1988 (Higgins and Sharp, 1988) and is available for different platforms, including Unix/Linux, Mac OS, and MS Windows systems (Thompson et al., 1994, 1997). If you have multiple genes and ESTs, ClustalW can be used to construct the multiple alignments instead of running BLAST on the sequences one by one. ClustalW has a stand-alone version for all of these platforms. The European Molecular Biology Laboratory (EMBL) also provides convenient Web access to ClustalW (www.ebi.ac.uk/Tools/clustalw2/index.html). ClustalW2 currently supports seven multiple sequence formats:

• NBRF/PIR
• EMBL/UniProtKB/Swiss-Prot
• Pearson (FASTA)
• GDE
• ALN/ClustalW
• GCG/MSF
• RSF

Identification of SNPs using ClustalW is similar to that using NCBI BLAST. First, paste the sequences (FASTA format) into the input box on the ClustalW Web site. To use the default parameters, just click the “RUN” button. With only a few sequences, the results are generated after a short wait. With a large number of sequences, you can upload the sequence file through the Web site and leave your email address; a notice for retrieving the results will be sent to your mailbox once the multiple alignments are finished. The alignment results can be used directly to find SNPs (Figure 6.2).

                             G/T    A/G
gi|224240675|gb|GG670618.1  GCGGACTCCGGAAGAAGCTGTCCTATATGAGATGAAATCACAGGTGCTGA  350
gi|204151091|gb|FD182031.1  GCGGACTCCGGAAGAAGCTGTCCTATATGAGATGAAATCACAGGTGCTGA  229
gi|204288225|gb|FD181026.1  GCGGACTCCAGAAGAAGCTGTCCTATATGAGATGAAATCACAGGTGCTGA  334
gi|204247742|gb|FD360977.1  GCGGACTCCAGAAGAAGCTGTCCTATATGAGATGAAATCACAGGTGCTGA  334
gi|224282111|gb|GH691336.1  GCGGACTCCAGAAGAAGCTGTCCTATATGAGATGAAATCACAGGTGCTGA  350
gi|224255363|gb|GH684632.1  GCGGACTCCAGAAGAAGCTGTCCTATATGAGATGAAATCACAGGTGCTGA  350
                            ** ****** ****************************************

Figure 6.2 SNP visualization from ClustalW. The asterisks represent identical nucleotides at the locus. The positions indicated by arrows are the SNP sites with G/T and A/G SNPs.

The stand-alone program is available for both Windows and Linux. Under Windows, ClustalX reads sequences and profiles (a term for preexisting alignments) through the File menu item “Load Sequences.” ClustalX has two modes, selected using the switch directly above the sequence display: Multiple Alignment Mode and Profile Alignment Mode. To do a multiple alignment on a set of sequences, make sure Multiple Alignment Mode is selected. A single sequence data area is then displayed. The alignment menu then allows you to produce a guide tree for the alignment, to do a multiple alignment following the guide tree, or to do a full multiple alignment. In Profile Alignment Mode, two sequence data areas are displayed, allowing you to align two alignments (termed profiles). Profiles are also used to add a new sequence to an old alignment, or to use secondary structure to guide the alignment process.

Under Linux, the command “clustalw -infile=input_file” (input_file is in FASTA format) generates three files: the output file (a record of the analysis), the dnd file (the guide tree file used for constructing a phylogenetic tree), and the aln file (the multiple alignment file). The aln file is the most important one for identifying SNPs.
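Once the aln file is available, the variable columns can be listed programmatically instead of by eye. The sketch below is a hypothetical helper and not part of ClustalW; it assumes the usual Clustal layout of a header line followed by blocks of name and sequence pairs, and it ignores the consensus line. The example input is abbreviated from the sequences in Figure 6.2, with shortened, made-up sequence names.

```python
from collections import OrderedDict

def parse_aln(text):
    """Parse Clustal-format text into an ordered {name: aligned_sequence}."""
    seqs = OrderedDict()
    for line in text.splitlines():
        # Skip blank lines, the CLUSTAL header, and consensus lines (leading space).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()
        name, chunk = fields[0], fields[1]
        seqs[name] = seqs.get(name, "") + chunk   # blocks are concatenated
    return seqs

def variable_columns(seqs):
    """Return [(1-based column, sorted bases)] for columns where bases differ."""
    rows = list(seqs.values())
    out = []
    for i in range(min(len(r) for r in rows)):
        bases = {r[i].upper() for r in rows} - {"-"}   # gaps are ignored
        if len(bases) > 1:
            out.append((i + 1, sorted(bases)))
    return out

aln_text = """CLUSTAL 2.0 multiple sequence alignment

seq1      GCGGACTCCGGAAGAAGCTG
seq2      GCGGACTCCGGAAGAAGCTG
seq3      GCGGACTCCAGAAGAAGCTG
seq4      GCGGACTCCAGAAGAAGCTG
          ********* **********
"""
print(variable_columns(parse_aln(aln_text)))
```

Each reported column is only a candidate; whether it is a SNP or a sequencing error still depends on how many reads support each allele.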

EST Assembly

The above approaches are suitable for single-gene or small-scale SNP discovery. Large-scale SNP discovery from ESTs requires much more efficient approaches to constructing multiple sequence alignments. ClustalW can construct multiple alignments, but it takes a very long time with a large number of EST sequences. Sequence assembly programs provide the computational power needed for SNP discovery from high-throughput ESTs. Two types of assembly programs have been widely used on the Unix/Linux platform: (1) programs that require the original trace file or the quality file, such as PHRAP; and (2) programs that do not require the trace file or quality file, such as CAP3. In most scenarios, such as when sequences are retrieved directly from the NCBI dbEST database, EST trace files are not available. Therefore, CAP3 is quite useful.

POLYBAYES and POLYPHRED

The program POLYBAYES requires the original sequence trace files to detect SNPs, and the trace files must first be processed with PHRED (base calling) and PHRAP (assembly), developed at the University of Washington (the PHRED/PHRAP/CONSED architecture) (Ewing and Green, 1998; Gordon et al., 1998, 2001; Ewing et al., 1998). First, prepare the sequence FASTA files and sequence quality files before assembly by running PHRED from the command prompt. All the sequence trace files are stored in the directory “chromat_dir,” whose location is specified with the “-id” option. Sequence PHD format files are created in the “phd_dir” subdirectory, whose location is specified with the “-pd” option. The complete command for this step is “phred -id chromat_dir -pd phd_dir” (chromat_dir holds the sequence trace files; phd_dir receives the PHD output files). The program PHD2FASTA is then used to convert the PHD format files to a FASTA format file of the EST sequences (“-os” option). The program can also produce a FASTA format file of the accompanying base quality values (“-oq” option) and one listing the base positions that specify the location of each called nucleotide relative to the sequence trace (“-ob” option). The DNA sequences of the ESTs are used in the next step as the members of the cluster (group) of expressed sequences to be analyzed for polymorphic sites.

The POLYBAYES package provides two ways to construct multiple alignments for SNP identification, and all results are stored in a directory named “edit_dir,” which is required by CONSED. In the first, RPBMAP (one of the applications in the POLYBAYES package) identifies SNPs by mapping the EST sequences to anchor sequences (reference sequences) and then creating a multiple alignment of the EST sequences with the anchored alignment algorithm implemented within POLYBAYES. The CROSS_MATCH dynamic alignment program is used to compute the initial pairwise alignments between each of the ESTs and the genomic anchor sequence. In the second, RPBACE (another application in the POLYBAYES package) identifies SNPs from a de novo EST assembly, which is required when reference sequences are not available. Vector sequences must be trimmed with CROSS_MATCH before the assembly (cross_match seqs_fasta vector.seq -minmatch 12 -minscore 20 -screen > screen.out), which generates the clean sequence file “seqs_fasta.screen.” The assembly is then performed with PHRAP (phrap seqs_fasta.screen -new_ace > phrap.out), which writes the assembly results to the ace file. The ace file can be used for SNP identification with POLYBAYES (RPBACE).

The multiple alignments are scanned for polymorphic sites. At each site, the slice of the alignment composed of nucleotides contributed by every sequence that was locally aligned is examined for mismatches. The Bayesian SNP detection algorithm calculates the probability that such mismatches are the result of true polymorphism as opposed to sequencing error. Likely polymorphic sites are recorded as SNP candidates. The SNP detection feature is enabled with the “-screenSnps” option. A new ace file is generated after the SNP screening and can be opened using CONSED. Figure 6.3 shows an output file with the site of a SNP candidate in the multiple alignments. This SNP is found within members of one alternatively spliced group of EST sequences and is automatically tagged by the SNP detection algorithm implemented within POLYBAYES. A similar procedure is applicable to a wide range of scenarios in which sequence fragments (e.g., ESTs, random genomic shotgun reads, bacterial artificial chromosome (BAC)-end reads, sequenced restriction fragments) are organized with the help of a genome reference sequence and compared against each other and/or with the reference sequence in search of polymorphic sites.

Figure 6.3 SNP visualization from POLYBAYES. The SNP identified at position 364 is a C/T SNP, which was generated from SNP screening based on multiple ESTs using POLYBAYES. See color insert.

POLYPHRED is another SNP detection application similar to POLYBAYES, and SNP identification with POLYPHRED requires similar steps. The only difference is that, after base calling with PHRED, POLYPHRED requires the polymorphism output file during SNP analysis, whereas POLYBAYES does not. POLYPHRED compares sequence fluorescence across traces obtained from different individuals to identify sites that are heterozygous for single-nucleotide substitutions. Its functions are integrated with three other programs: PHRED, PHRAP, and CONSED. POLYPHRED identifies potential heterozygotes using the base calls and peak information provided by PHRED and the sequence alignments provided by PHRAP. Potential heterozygotes identified by POLYPHRED are marked for rapid inspection using the CONSED tool. POLYPHRED is very powerful for identifying heterozygous individuals because of its fluorescence-based SNP discovery algorithm (Figure 6.4). The command for PHRED in this case is “phred -id chromat_dir -pd phd_dir -dd poly_dir” (chromat_dir holds the sequence trace files, phd_dir receives the PHD output files, and poly_dir receives the polymorphism output files). All the other steps before SNP discovery are similar to those for POLYBAYES. The command for SNP discovery with POLYPHRED using the default settings is “polyphred -a Input.ace -o polyphred.out.” The output report can be viewed with a text editor, or CONSED can be used to view the tags added by POLYPHRED.

Figure 6.4 SNP visualization from POLYPHRED. The SNP identified at position 192 is a C/G SNP, which was generated from SNP screening based on individual fish. Samples E09 and A12 are homozygous for allele C, sample B11 is homozygous for allele G, and sample A11 is heterozygous with both allele C and allele G. See color insert.
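The role of base quality in separating true polymorphisms from sequencing errors can be illustrated with a short calculation. The sketch below is not the POLYBAYES algorithm; it uses only the standard Phred definition, Q = -10 log10(P_error), and it assumes, purely for illustration, that errors in different reads are independent. The function names are hypothetical.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score into the probability that the base call is wrong."""
    return 10 ** (-q / 10)

def prob_all_errors(qualities):
    """Probability that every listed base call is wrong, assuming independent errors."""
    return math.prod(phred_to_error_prob(q) for q in qualities)

# A minor allele supported by one Q20 read versus three Q20 reads.
print(phred_to_error_prob(20))         # about 0.01
print(prob_all_errors([20, 20, 20]))   # about 1e-06
```

The product overstates the chance of a pseudo-SNP, because independent errors would also have to produce the same wrong base; even so, it shows why a mismatch seen in several good-quality reads is far less likely to be an artifact than one seen in a single read.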

AUTOSNP

The CAP3 program includes a number of improvements and new features (Huang and Madan, 1999). It can clip 5′ and 3′ low-quality regions of EST sequence reads. It uses base quality values in the computation of overlaps between reads, the construction of multiple sequence alignments of reads, and the generation of consensus sequences. The program also uses forward-reverse constraints to correct assembly errors and link contigs. PHRAP often produces longer contigs than CAP3, whereas CAP3 often produces fewer errors in consensus sequences than PHRAP. It is easier to construct scaffolds with CAP3 than with PHRAP on low-pass data with forward-reverse constraints. CAP3 can also assemble ESTs without any quality files or sequence trace files, so it can be used directly with FASTA format files. CAP3 is easy to use: the command “cap3 input_file” with no other settings assembles all input sequences with the default parameters and generates contig and singleton sequences. More stringent or more relaxed assembly parameters can be specified by the user. The most commonly adjusted parameters are “-o,” which specifies the cutoff length of the sequence overlap (default 40 bp), and “-p,” which specifies the cutoff percent identity within the overlap (default 80). If a sequence quality file is available along with the sequence file, it can be used to provide a more accurate assembly. However, the disadvantage of using quality scores is that the assembly requires more system memory. A CAP3 assembly generates several files; the most important is the ace file, which includes all the assembly information and is used in the next step, SNP identification.

AUTOSNP is a program to detect SNPs and insertion/deletion polymorphisms (indels) in EST data (Barker et al., 2003).
AUTOSNP is a Perl-based program that can use d2cluster or CAP3 to cluster and align EST sequences. It can also directly read the ace file generated by CAP3 and rebuild the multiple alignments within each contig. The program uses redundancy to differentiate between candidate SNPs and sequencing errors: candidate polymorphisms are those that occur in multiple reads within an alignment. For each candidate SNP, two measures of confidence are calculated: the redundancy of the polymorphism at the SNP locus and the codetection of the candidate SNP with other SNPs in the alignment (a form of cosegregation due to close linkage, that is, the assumption that the SNPs lie within the same haplotype). Mismatches are identified from the multiple alignments based on the following default parameters. A sequence variation is declared a SNP

(1) whenever a mismatch is identified, within contigs of four or fewer sequences;
(2) when the minor allele occurs at least twice, within contigs of five to six sequences;
(3) when the minor allele occurs at least three times, within contigs of seven to eight sequences;
(4) when the minor allele occurs at least four times, within contigs of 9–12 sequences; and
(5) when the minor allele occurs at least five times, within contigs of 13–16 sequences, and so on.

Before AUTOSNP is used, CAP3 must be installed and set up properly. If a FASTA file is used as input, the command is “cap3SNP.pl -f input_file” (input_file is the FASTA file). If the assembly ace file has already been generated by CAP3, the command is “cap3SNP.pl -a input_file” (input_file is the ace file). The memory requirement depends on the number of sequences used in the assembly; 4–8 GB of memory is recommended for SNP identification with 100,000–200,000 sequences. After SNP detection, AUTOSNP generates a folder, usually named “results,” that holds all the HTML files, including the sequence information, SNP information, and sequence alignments for each contig (Figure 6.5). A summary HTML file with the assembly and SNP identification summary information is also generated. If text files are desired (“-t” option), three text files are generated: contig.txt, snps.txt, and sequences.txt. The file contig.txt includes the number of sequences and SNPs identified in each contig sequence. The file snps.txt provides detailed information for every SNP, including the location of the SNP in the contig, the minor allele frequency, SNP cosegregation information, and all the alleles identified at each SNP. The file sequences.txt provides the sequence alignment information for each contig. The text files are relatively easy to work with and can be loaded into an Access or MySQL database for data management, especially for large-scale sequence assemblies and SNP identification. The HTML output is more intuitive and combines the contig, sequence, and SNP information in one file.

Figure 6.5 SNP visualization from AUTOSNP. The left upper panel displays sequence information, for example, GenBank accession numbers and putative gene identities. The left lower panel displays SNP summary information; for example, the SNP at position 310 is a T/G SNP and the SNP at position 377 is a T/A SNP, each with a sequence ratio of 3 : 2. The right panel displays the sequence alignment information, highlighting the SNP at position 435, an A/G SNP with a sequence ratio of 3 : 2. See color insert.
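The default AUTOSNP redundancy thresholds described in this section can be written as a small lookup. The sketch below is not AUTOSNP code; it restates the quoted tiers in Python with hypothetical function names. Two points are assumptions: the text gives tiers only up to 16 sequences followed by “and so on,” so the continuation beyond 16 is extrapolated, and the number of reads covering a column is used in place of the contig size.

```python
from collections import Counter

def min_minor_allele_count(n_seqs):
    """Minimum minor-allele read count for a SNP call, per the tiers quoted in the text."""
    if n_seqs <= 4:
        return 1
    if n_seqs <= 6:
        return 2
    if n_seqs <= 8:
        return 3
    if n_seqs <= 12:
        return 4
    # 13-16 sequences require 5. Beyond 16 the text only says "and so on".
    # ASSUMPTION: the threshold rises by one for every four additional sequences.
    return 5 + (n_seqs - 13) // 4

def is_candidate_snp(column):
    """Apply the redundancy rule to one alignment column given as a string of bases."""
    counts = Counter(b for b in column.upper() if b in "ACGT")
    if len(counts) < 2:
        return False                      # no mismatch in this column
    n = sum(counts.values())              # ASSUMPTION: column depth stands in for contig size
    minor = counts.most_common(2)[1][1]   # read count of the second most common base
    return minor >= min_minor_allele_count(n)

print(is_candidate_snp("AAAAG"))   # five reads, minor allele seen once
print(is_candidate_snp("AAAGG"))   # five reads, minor allele seen twice
```

With five reads, a single mismatching read is rejected and two mismatching reads are accepted, which matches tier (2) above.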

SNP Discovery Using Transcript Sequences Generated by Next Generation Sequencing

Recently, transcript sequences have been generated using next generation sequencing platforms such as the Illumina Genome Analyzer and the Roche 454 sequencer. These platforms generate a huge number of relatively short sequence reads. NCBI therefore stores these reads in a dedicated database, the Short Read Archive (SRA). In this section, we present methods for SNP detection using SRA sequences.

GIGABAYES

To accommodate next generation sequencing data from the 454 and Illumina platforms, SNP discovery applications must be adapted to the characteristics of these platforms. A new SNP discovery package (PYROBAYES/MOSAIK/GIGABAYES) for SRA sequences has been developed based on POLYBAYES, which was originally developed for Sanger sequencing, as discussed above. The package includes three applications: PYROBAYES, for 454 pyrosequencing base calling (Quinlan et al., 2008); MOSAIK, a reference-guided read aligner/assembler; and GIGABAYES, for short-read polymorphism detection (Smith et al., 2008). The alignment visualization applications EAGLEVIEW and GAMBIT have been developed for next generation sequencing assemblies and alignments. The package does not include any de novo assembly application, so if reference sequences are not available, de novo assembly is required before SNP discovery. In addition to the assembly applications developed along with the next generation sequencing platforms, several independent de novo assembly applications have been developed, such as Celera Assembler (Rausch et al., 2009), Velvet for both 454 and Illumina (Zerbino and Birney, 2008), and SOAPdenovo for Illumina (Li et al., 2008, 2009). Next generation sequencing SNP discovery using GIGABAYES requires several steps:

1. Prepare the reference sequence; if the reference sequence is not available, de novo assembly is required to produce reference sequences.
2. Once reference sequences are available for alignment, use the MOSAIK package (MosaikBuild, MosaikAligner, MosaikSort, and MosaikAssembler) to build the alignment before SNP discovery. The MOSAIK package is briefly introduced below; detailed instructions can be found in the manual of the software package.
   a. To speed up the assembly process, MosaikBuild converts external read formats, including FASTA, FASTQ, and sequence read format (SRF), to a compressed archive format that the aligner can readily use. In addition to processing reads, the program converts reference sequences from FASTA format to an efficient binary format.
   b. MosaikAligner performs pairwise alignment between every read in the read archive and the reference sequences. MosaikSort then takes the pairwise alignment output and sorts the alignments. For single-ended reads, MosaikSort simply re-sorts the reads in the order they occur on each reference sequence. For mate-pair/paired-end reads, MOSAIK resolves the reads according to user-specified criteria (the fragment length confidence interval) before re-sorting the reads in the order they occur on each reference sequence.
   c. MosaikAssembler converts the sorted alignment file to a multiple sequence alignment that is saved in an assembly file format. At the moment, MosaikAssembler saves the assembly in the phrap ace format and the GigaBayes gig format.
3. After the alignment, GIGABAYES reads the assembly output and generates SNP discovery results in the GFF format. GIGADUMP can convert the GFF format output to ace format output, which can be used by CONSED (the -nophd option is required) for visualization. EAGLEVIEW (Huang and Marth, 2008) is an alternative visualization program for reading the results.
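Because GIGABAYES reports its results in GFF, the output of step 3 can be post-processed with a few lines of Python. The sketch below assumes only the generic nine-column, tab-separated GFF layout; the feature types and attribute keys actually written by GIGABAYES are not reproduced here, and the example line, its values, and the function name are hypothetical.

```python
from typing import Iterator, NamedTuple

class GffRecord(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: str   # kept as raw text, since its layout is tool specific

def read_gff(lines) -> Iterator[GffRecord]:
    """Yield one record per nine-column GFF line, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue   # not a standard feature line
        yield GffRecord(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]), *cols[5:])

# Hypothetical example; the values are made up for illustration.
example = [
    "##gff-version 2\n",
    "contig38\texample\tSNP\t435\t435\t0.99\t+\t.\talleles=A,G\n",
]
for rec in read_gff(example):
    print(rec.seqid, rec.start, rec.type, rec.attributes)
```

Keeping the attribute column as raw text avoids guessing at a tool-specific layout; it can be split further once the actual output has been inspected.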

SSAHA

SSAHA (Ning et al., 2001) was developed by the Sanger Institute for mapping large Sanger sequence data sets to reference sequences with a fast alignment algorithm, and it has been used in the HapMap project. The newer version, SSAHA2, adapts the program to next generation sequencing reads, including 454 and Illumina reads, and supports a variety of output formats: SSAHA2, SAM, CIGAR, GFF, PSL, and so on. SNP discovery using the SSAHA2 package also takes several steps:

1. ssaha2Build: builds a hash table for the reference sequences stored in the subject file;
2. ssaha2: maps next generation sequencing reads in the FASTQ format query file against the reference sequences;
3. ssaha2SNP: detects SNPs and indels by aligning next generation sequencing reads to the reference sequences. The quality of the variant base, as well as the quality values of the neighboring bases, is taken into account to reduce the false discovery rate.
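The idea of hashing a reference so that reads can be placed quickly can be illustrated with a toy k-mer index. This is a simplified sketch and not the SSAHA2 implementation: the word length, the function names, and the voting scheme are assumptions chosen for illustration, and only exact k-mer matches on the forward strand are considered.

```python
from collections import Counter, defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of its 0-based start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def best_offset(read, index, k):
    """Vote for the reference offset implied by each matching k-mer of the read.

    Returns (offset, votes) for the best supported placement, or None if nothing hits.
    """
    votes = Counter()
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], ()):
            votes[pos - j] += 1
    return votes.most_common(1)[0] if votes else None

reference = "CACCTGTGATTTCATCTCATATAGGACAGCTTCTTCTGGAGTCAGCTGAATGGCCTCGTC"
index = build_kmer_index(reference, k=8)

# A 20-base read copied from the reference starting at 0-based position 30,
# with one substitution (T -> C) at 0-based reference position 36.
read = reference[30:36] + "C" + reference[37:50]
print(best_offset(read, index, k=8))
```

K-mers that overlap the substitution find no match, while the remaining k-mers all vote for the same offset, so the read is still placed and the mismatch can then be examined as a candidate SNP.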

Windows-Based Platforms for EST Assembly and SNP Identification

The programs introduced above all require the use of the Linux system, with which many users are not familiar. Several Windows-based programs are available for the analysis of relatively small data sets on a 32-bit Windows system. Programs developed for a 64-bit Windows system can take advantage of larger memory and offer much more power for large data sets. As most users are familiar with Windows, these programs are likely to find wide application.

CLC Genomics Workbench

CLC Genomics Workbench is a comprehensive package that integrates analysis functions for nucleotide and protein sequences, such as sequence assembly, multiple alignment, BLAST searches, and gene expression analysis. It is especially well suited for the analysis of next generation sequencing data, including data from Roche/454, Illumina Genome Analyzer, and ABI SOLiD. Both Windows and Linux versions of CLC Genomics Workbench are commercially available (32-bit and 64-bit). SNP detection is one of its applications and can be used to identify putative SNPs from EST sequences or from other short sequences generated on next generation sequencing platforms. SNP discovery is based on sequence coverage and base quality scores, both of which can be adjusted by the user. Before SNP discovery, the sequence files must be imported through the built-in “NGS import” function, which recognizes raw sequence trace files (Sanger, Roche 454, Illumina Genome Analyzer, or ABI SOLiD), FASTA files, FASTQ files, and ace files containing preassembly information. If unassembled sequence files are used, a de novo or reference assembly is required first. Once the assembly is finished, SNP detection can be run directly from the built-in Toolbox “High-throughput sequencing.” For sequences without quality scores, SNP detection is based on the sequence coverage at the SNP position: (1) a minimum of two minor alleles for four or five sequences; (2) three for six to eight sequences; (3) four for 9–11 sequences; and (4) five for 12 or more sequences. The SNP detection application reports information on each SNP, including SNP position, allele variants, allele frequency, allele counts, sequence coverage, and alignment information (Figure 6.6).

As a demonstration project, we used CLC Genomics Workbench to analyze SNPs in carp transcript sequences generated by 454 sequencing. A total of 3157 SNPs were identified from 242,261 carp 454 sequencing reads (FASTQ format), which were downloaded from the NCBI SRA database (SRX007427). De novo assembly was conducted first, and SNP detection was then applied directly to the assembly results. The whole SNP detection process took approximately 30 min. Both sequence coverage and quality score were used for SNP detection. The quality score cutoff was set at Q20 for the central base and Q15 for the surrounding bases.
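The coverage and quality criteria described above can be combined in a simple filter. The sketch below is not CLC Genomics Workbench code; it restates the quoted coverage tiers and the Q20/Q15 settings in Python. The function names, the window of two flanking bases on each side, and the treatment of coverage below four (no call) are assumptions made for illustration.

```python
def min_minor_alleles(coverage):
    """Coverage tiers quoted in the text; None means no call (assumed for coverage < 4)."""
    if coverage < 4:
        return None
    if coverage <= 5:
        return 2
    if coverage <= 8:
        return 3
    if coverage <= 11:
        return 4
    return 5

def call_snp(minor_count, coverage):
    """True if the minor allele count reaches the tier for this coverage."""
    needed = min_minor_alleles(coverage)
    return needed is not None and minor_count >= needed

def base_passes_quality(quals, i, center_q=20, flank_q=15, flank=2):
    """Check the central base at index i and its flanking bases against quality cutoffs."""
    if quals[i] < center_q:
        return False
    lo, hi = max(0, i - flank), min(len(quals), i + flank + 1)
    return all(q >= flank_q for j, q in enumerate(quals[lo:hi], start=lo) if j != i)

print(call_snp(minor_count=2, coverage=5))
print(call_snp(minor_count=2, coverage=7))
print(base_passes_quality([30, 18, 25, 16, 30], i=2))
print(base_passes_quality([30, 12, 25, 16, 30], i=2))
```

In practice the quality check would be applied to each read at the candidate position first, and only the reads that pass would be counted toward coverage and the minor allele count.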


Figure 6.6 SNP visualization from the CLC Genomics Workbench. The left panel is a navigation area including all the files and results information. The right upper panel displays SNP summary information, for example, contig number, consensus sequence length, consensus base at the SNP site (majority rule), SNP allele bases, sequence count (count) and ratio (frequency), and total number of sequences (coverage). The right lower panel displays the sequences alignment information and the SNP sites of the selected contig. In this example, sequence alignments of contig 38 are given, with forward sequence being shown in red and reverse sequence being shown in green. See color insert.

NextGENe

NextGENe is a commercially available software package developed by SoftGenetics for the analysis of next generation sequencing data, including data from the Roche 454, Illumina Genome Analyzer, and ABI SOLiD systems. The program has both 32-bit and 64-bit versions; a 64-bit Windows system with at least a quad-core processor and at least 8 GB of RAM is recommended. Data analysis with NextGENe includes four steps: (1) select the instrument type (454, Illumina, or SOLiD) based on the next generation sequencing data, and the application type (de novo assembly, SNP/Indel discovery, transcriptome, ChIP-seq, SAGE, or other); (2) load the sequence data (FASTA format only) into the software and define the results output folder (the program also provides a tool to convert other sequence formats to FASTA format); (3) select the condensation tool based on the sequencing instrument type to correct sequencing errors and increase the assembly and alignment efficiency; and (4) select the algorithm and set the parameters for the assembly and alignment based on the sequence data and instrument type.

SNP/Indel discovery requires reference sequences. If no reference sequences are available, de novo assembly is required first. The SNP discovery process takes four steps: (1) select the instrument type and the application “SNP/Indel”; (2) load the next generation sequences and the reference sequences generated during the de novo assembly; (3) select the default condensation tool; and (4) set the parameters for SNP discovery by defining the sequence coverage, minimum minor allele count, and frequency. After SNP discovery, the results are displayed by the SequenceAlignment software (Figure 6.7); from the “Report” drop-down menu, select “Mutation Report” to export the SNP information to a text file.

Figure 6.7 SNP visualization from NextGENe. The upper panel displays a global view of the project, and the lower panel displays the sequence alignment and sequence variation with SNPs highlighted in blue. See color insert.


Table 6.2 Comparison of different SNP identification applications.

                         Input format                                              Platform
Application              Trace   FASTA   Ace   Output with SNP   Preassembly   PC    Linux   Free
BLAST                    —       Y       —     N/A               —             Y     Y       Y
CLUSTALW                 —       Y       —     N/A               —             Y     Y       Y
POLYBAYES                R       R       R     Y                 Y             —     Y       Y
POLYPHRED                R       R       R     Y                 Y             —     Y       Y
AUTOSNP                  —       Y       Y     Y                 Y                   Y       Y
CLC Genomics Workbench   Y       Y       Y     Y                 NR            Y     Y       N
NextGENe                 —       Y       —     Y                 NR            Y     —       N

Y, yes; N, no; N/A, not available; R, required; NR, not required.

Summary

Identification of SNPs from ESTs requires local alignment techniques that are unperturbed by exon–intron punctuation and alternatively spliced sequence variants. Once a multiple alignment is constructed, nucleotide differences among individual sequences can be analyzed. The programs BLAST and CLUSTALW were not originally designed for SNP identification, but they can be used for that purpose under certain scenarios. POLYBAYES, POLYPHRED, AUTOSNP, GIGABAYES, CLC Genomics Workbench, and NextGENe are powerful and efficient SNP identification platforms. POLYBAYES and AUTOSNP require a preassembly step for SNP identification. Trace files are necessary for POLYBAYES and POLYPHRED, but not for AUTOSNP. A comprehensive comparison of the applications introduced in this chapter is given in Table 6.2. Putative SNP discovery is not limited to the programs mentioned above; many customized SNP detection pipelines have been developed based on these applications. The intention of this chapter is to provide some basic methods and protocols for researchers who wish to discover SNPs. Owing to the presence of sequencing errors, not every nucleotide position with a mismatch is a SNP. The success of SNP projects depends heavily on the ability to discriminate true SNPs from sequencing errors, especially when no trace file or sequence quality score is available. This is usually accomplished by statistical considerations that take advantage of measures of sequence accuracy accompanying the analyzed sequences (Marth et al., 1999). The result, ideally, should be a set of candidate SNPs, each with an associated SNP score that indicates the confidence of the prediction. The confidence values can help researchers select which SNPs to use for follow-up studies. The issues related to SNP quality assessment are discussed in Chapter 7.

References

Altschul SF, Gish W, Miller W, Myers EW, and Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol, 215:403–410.
Barker G, Batley J, O’Sullivan H, Edwards KJ, and Edwards D. 2003. Redundancy based detection of sequence polymorphisms in expressed sequence tag data using autoSNP. Bioinformatics, 19:421–422.
Ewing B and Green P. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res, 8:186–194.
Ewing B, Hillier L, Wendl MC, and Green P. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res, 8:175–185.
Gordon D, Abajian C, and Green P. 1998. Consed: A graphical tool for sequence finishing. Genome Res, 8:195–202.
Gordon D, Desmarais C, and Green P. 2001. Automated finishing with autofinish. Genome Res, 11:614–625.
Hayes BJ, Nilsen K, Berg PR, Grindflek E, and Lien S. 2007. SNP detection exploiting multiple sources of redundancy in large EST collections improves validation rates. Bioinformatics, 23:1692–1693.
He C, Chen L, Simmons M, Li P, Kim S, and Liu ZJ. 2003. Putative SNP discovery in interspecific hybrids of catfish by comparative EST analysis. Anim Genet, 34:445–448.
Higgins DG and Sharp PM. 1988. CLUSTAL: A package for performing multiple sequence alignment on a microcomputer. Gene, 73:237–244.
Higgins DG, Thompson JD, and Gibson TJ. 1996. Using CLUSTAL for multiple sequence alignments. Methods Enzymol, 266:383–402.
Huang W and Marth G. 2008. EagleView: A genome assembly viewer for next-generation sequencing technologies. Genome Res, 18:1538–1543.
Huang X and Madan A. 1999. CAP3: A DNA sequence assembly program. Genome Res, 9:868–877.
Jeanmougin F, Thompson JD, Gouy M, Higgins DG, and Gibson TJ. 1998. Multiple sequence alignment with Clustal X. Trends Biochem Sci, 23:403–405.
Li R, Li Y, Kristiansen K, and Wang J. 2008. SOAP: Short oligonucleotide alignment program. Bioinformatics, 24:713–714.
Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, and Wang J. 2009. SOAP2: An improved ultrafast tool for short read alignment. Bioinformatics, 25:1966–1967.
Marth GT, Korf I, Yandell MD, Yeh RT, Gu Z, Zakeri H, Stitziel NO, Hillier L, Kwok PY, and Gish WR. 1999. A general approach to single-nucleotide polymorphism discovery. Nat Genet, 23:452–456.
Moen T, Hayes B, Baranski M, Berg PR, Kjoglum S, Koop BF, Davidson WS, Omholt SW, and Lien S. 2008. A linkage map of the Atlantic salmon (Salmo salar) based on EST-derived SNP markers. BMC Genomics, 9:223.
Nickerson DA, Tobe VO, and Taylor SL. 1997. PolyPhred: Automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Res, 25:2745–2751.
Ning Z, Cox AJ, and Mullikin JC. 2001. SSAHA: A fast search method for large DNA databases. Genome Res, 11:1725–1729.
Picoult-Newberg L, Ideker TE, Pohl MG, Taylor SL, Donaldson MA, Nickerson DA, and Boyce-Jacino M. 1999. Mining SNPs from EST databases. Genome Res, 9:167–174.
Quinlan AR, Stewart DA, Stromberg MP, and Marth GT. 2008. Pyrobayes: An improved base caller for SNP discovery in pyrosequences. Nat Methods, 5:179–181.
Rausch T, Koren S, Denisov G, Weese D, Emde AK, Doring A, and Reinert K. 2009. A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads. Bioinformatics, 25:1118–1124.
Smith DR, Quinlan AR, Peckham HE, Makowsky K, Tao W, Woolf B, Shen L, Donahue WF, Tusneem N, Stromberg MP, et al. 2008. Rapid whole-genome mutational profiling using next-generation sequencing technologies. Genome Res, 18:1638–1642.
Thompson JD, Higgins DG, and Gibson TJ. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22:4673–4680.
Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, and Higgins DG. 1997. The CLUSTAL_X windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res, 25:4876–4882.
Wang S, Sha Z, Sonstegard TS, Liu H, Xu P, Somridhivej B, Peatman E, Kucuktas H, and Liu Z. 2008. Quality assessment parameters for EST-derived SNPs from catfish. BMC Genomics, 9:450.
Zerbino DR and Birney E. 2008. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res, 18:821–829.

Chapter 7

SNP Quality Assessment

Shaolin Wang, Hong Liu, and Zhanjiang (John) Liu

Identification of single-nucleotide polymorphisms (SNPs) relies on the alignment of multiple sequences derived from different genomes that include genomes of different individuals or the two haploid sets of chromosomes of a single diploid individual. Traditionally, large-scale SNP identification depends on the availability of whole genome sequences. Genome sequencing projects using traditional Sanger sequencing usually involve a single individual with two sets of chromosomes from which SNPs can be identified. The situation is quite different, however, with aquaculture species where no genomes have been entirely sequenced using the traditional Sanger sequencing, with the exception of the Atlantic salmon genome that is being sequenced using Sanger sequencing for the first phase of the sequencing project. As a result, SNP identification in aquaculture depends on various resources including expressed sequence tag (EST) databases, genomic survey sequences, and very recently, sequences generated using next generation sequencing technologies. Advances in next generation sequencing have allowed the generation of whole genome sequences in aquaculture species. Currently, whole genome sequences have been generated or are being generated from several aquaculture species including Atlantic cod, channel catfish, carp, rainbow trout, tilapia, Pacific oyster, and shrimp. In some of these cases, many SNPs will be generated through the whole genome sequencing project; but in other cases, no SNPs will be expected from the whole genome sequencing project because of the use of doubled haploid as sequencing templates. This is particularly true for several teleost fish species. Teleost fish are believed to have gone through one extra round of whole genome duplication and therefore are believed to harbor duplicated genes or duplicated genome regions. 
In order to reduce complications in whole genome sequence assembly, doubled haploids are used as sequencing templates in several cases, such as channel catfish, carp, and rainbow trout. Therefore, SNPs are not expected from whole genome sequencing in these species. Instead, additional efforts will have to be devoted to SNP discovery in these species, and subsequent SNP quality assessment becomes an issue of interest. The recent use of next generation sequencing technologies has allowed sequencing of the same genomic segments from multiple individuals by sequencing reduced representation libraries (RRLs; see Chapter 5), greatly enhancing the power of SNP discovery. SNPs can also be identified from genome survey sequences such as BAC-end sequences if multiple sequences exist for the same genomic location, or from EST databases when multiple sequences are available from the same transcripts. While a high level of sequence redundancy is involved in whole genome sequencing because genome coverage is usually high (at least 5–10× genome coverage, as required for whole genome assembly), sequence redundancy for genome survey sequences or ESTs varies depending on the depth of the sequencing effort. As such, putative SNPs are identified from sequence alignments involving a variable number of sequences, sometimes as few as two. When only two sequences are aligned, a sequence discrepancy could mean either a true SNP or a sequencing error. Such situations necessitate quality assessment of SNPs. In addition, not all SNPs are equal in terms of their genomic location and distribution, whether they lie within genes, or whether they are amenable to efficient genotyping in follow-up genomic studies. In this chapter, we provide some general considerations for SNP quality assessment.

Quality Assessment Parameters for EST-derived SNPs

SNPs derived from ESTs are important because they reside within genes and thus may represent differences in protein-coding capacity. Biologically, gene-associated SNPs are more important because they could be the cause of phenotypic differences, whereas SNPs in intergenic regions are associated with phenotypes only by location. In addition, gene-associated SNPs offer two further benefits: (1) gene sequences are complex sequences, so genotyping should in general be more straightforward than for SNPs from unknown genomic locations, which often involve simple or repetitive sequences; (2) as genes are distributed throughout the genome, SNPs derived from ESTs, taken collectively at the genome scale, could have a wide distribution in the genome, alleviating some problems of SNP clustering. However, as ESTs are single-pass sequences of cDNAs, they can carry a significantly higher level of sequencing errors, leading to the identification of “pseudo-SNPs.” Use of EST-derived SNPs therefore requires extensive consideration of quality assessment.

Sequence Base Quality Scores

If sequences were 100% correct, SNPs identified by multiple sequence alignment would represent true SNPs. However, no matter what method is used to generate the sequences, 100% sequence accuracy is not achievable. In practice, therefore, the base quality score derived from sequencing trace files is a primary source for SNP quality assessment. Base calling is one of the most important factors in distinguishing true SNPs from sequencing errors. During base calling, the program PHRED scans the trace files and generates a sequence FASTA file along with a base quality file. Q20 (99% accuracy) is the most common cutoff quality score. However, in some cases, in an effort to obtain longer and more sequences, the cutoff may have been lowered to Q15 (about 96.8% accuracy) or Q13 (about 95% accuracy), which may significantly increase the number of sequencing errors. In order to reduce the chances of pseudo-SNPs, it is recommended that only high-quality sequences be used unless absolutely required otherwise. Several applications for SNP identification have been developed based on evaluation of sequence trace files or quality scores, such as POLYBAYES and POLYPHRED. These methods have been shown to provide high validation rates of SNPs (Hayes et al., 2007). However, these applications require sequence quality scores or original sequence trace files, which may not be available to many researchers using GenBank sequences as resources.
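The relation between a Phred quality score and base-call accuracy follows from the standard definition Q = −10 log10(P), where P is the probability that the base call is wrong. The short sketch below applies that definition to the cutoffs discussed above; the function names are our own and are not part of PHRED.

```python
import math


def phred_to_error(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def phred_to_accuracy(q):
    """Probability that a base call with Phred score q is correct."""
    return 1 - phred_to_error(q)


def error_to_phred(p):
    """Phred score corresponding to an error probability p."""
    return -10 * math.log10(p)


for q in (13, 15, 20):
    print(f"Q{q}: accuracy = {phred_to_accuracy(q):.1%}")
```

With this definition, Q20 corresponds to 99.0% accuracy, Q15 to about 96.8%, and Q13 to about 95.0%, so lowering the cutoff from Q20 to Q13 raises the expected error rate roughly fivefold.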

Sequence Redundancy

To assess SNP quality after base calling, a major criterion is the sequence redundancy at the SNP site. This is determined, for the most part, by the number of sequences in the contig, but we stress that it is the redundancy at the SNP site that matters, not just the number of sequences in the contig. The first factor influencing EST sequence redundancy is the expression level of the transcripts. For instance, in the dbEST database, over 6 million ESTs have been generated for human transcriptomes, with an average of over 150 ESTs per transcript. However, some of the most highly expressed genes are represented tens of thousands of times, whereas some transcripts are represented only a few times. ESTs are sequenced from cDNA libraries made under various conditions. The abundance of ESTs is proportional to the expression level of the genes if the cDNA libraries were not normalized. Even with normalized libraries, highly expressed genes tend to be overrepresented in the EST databases. Sequence redundancy at the SNP site can be significantly lower than the number of sequences in the contig. Given that transcripts vary in size from several hundred base pairs (bp) to multiple kilobases (kb), sequence redundancy is usually high at the 5′ end and/or the 3′ end. This is because ESTs are sequenced from either the 5′ end or the 3′ end of the transcripts, and single-pass sequencing usually generates 600–800 bp. Therefore, the middle of a transcript tends to have lower sequence redundancy (Figure 7.1).

Figure 7.1 Importance of the minor sequence frequency and the number of sequences in the contig. Note that the number of sequences at the SNP site is the most important. For instance, more sequences are available at the 5′ and 3′ of a transcript, providing a greater level of reliability of sequences. In contrast, fewer sequences are available in the middle of transcripts. See color insert.
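The difference between contig size and redundancy at the SNP site can be made concrete with a minimal sketch. The read coordinates below are invented for illustration (a hypothetical 1,500-bp transcript covered by three 5′ ESTs and three 3′ ESTs), and `depth_at` is our own helper, not a function of any assembler.

```python
def depth_at(position, reads):
    """Count reads whose aligned interval [start, end) covers `position`."""
    return sum(1 for start, end in reads if start <= position < end)


# Hypothetical contig of six ESTs: three read from the 5' end of the
# transcript and three read from the 3' end, each roughly 700-800 bp long.
reads = [(0, 700), (0, 750), (20, 800),
         (700, 1500), (780, 1500), (820, 1500)]

for pos in (100, 750, 1400):
    print(f"position {pos}: {depth_at(pos, reads)} of {len(reads)} sequences")
```

In this example the contig contains six sequences, yet a SNP near the middle of the transcript (position 750) is covered by only two of them, while sites near either end are covered by three.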


Large EST resources can be a tremendous source for SNP identification. For instance, the large number of human ESTs provided a tremendous resource for SNP identification from ESTs, which was proven to be sufficient for SNP identification for the HapMap project. Such a high level of sequence redundancy or coverage, however, may not be available for most species, especially not for the aquaculture species. The higher the sequence redundancy at the SNP sites, the more likely each of the alternative alleles can be observed more than one time, alleviating the problem of sequencing errors. For most aquaculture species, large sequence redundancy at the SNP sites may not be available. Therefore, quality assessment standards should be established for the SNPs identified through alignment of ESTs.

The Number of Sequences in the Contigs

Mining SNPs from the much smaller EST resources of aquaculture species (usually fewer than 500,000 ESTs per species) is still a highly productive approach. However, SNP quality becomes a greater challenge because of the much smaller number of sequences in the contigs. Researchers face the dilemma of sacrificing either the number of SNPs or SNP quality. Resolving this dilemma wisely can be difficult but extremely useful. With this dilemma in mind, we conducted a pilot project (Wang et al., 2008). The objective of the pilot project was to develop a strategy for rapid and reliable identification and quality evaluation of EST-derived SNPs, in order to reduce the rate of pseudo-SNPs resulting from the sequence errors typically found in single-pass EST data sets, especially those deposited in the National Center for Biotechnology Information (NCBI), where sequence trace files or quality files may or may not be available. In this pilot project, about 55,000 catfish ESTs were downloaded from the NCBI dbEST database. The ESTs were assembled using CAP3, and putative SNPs were identified using AUTOSNP. Obviously, the number of sequences representing a specific transcript is the primary resource from which SNPs are discovered. As such, the more sequences in a contig, the more likely one is to detect SNPs in it. Consider a contig with two ESTs: any sequence discrepancy would be identified as a putative SNP, with no clue as to whether it is a true SNP or a sequencing error. Now consider a contig with three sequences: a potential SNP would appear in the alignment as 1 : 2 at the SNP site. While calling the same base twice increases the chance that that allele is correct, the single call of the alternative allele provides no assurance as to whether that base represents a SNP or a sequencing error.

Similarly, when four sequences are involved in a contig, there are two possibilities: a 2 : 2 situation or a 1 : 3 situation. Obviously, one would feel more comfortable calling the SNP in the 2 : 2 situation than in the 1 : 3 situation, as the single sequence could still be a sequencing error. By the same token, when 10 sequences are involved in a contig, one can have allele ratios of 1 : 9, 2 : 8, 3 : 7, 4 : 6, or 5 : 5, and the confidence in a true SNP increases in that order, with 5 : 5 being the most likely to represent a true SNP. Technically, this involves both the chance of sequencing errors and the probability that both alleles are sequenced, given a fixed SNP allele frequency in the population. Clearly, it is not primarily the contig size (the number of sequences in the contig) but the allele frequency that is most important for the assessment of SNP quality; however, it is only when the number of sequences in the contig is large that it becomes possible to detect sequence allele distributions and thereby make some assessment of SNP quality.
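As a rough illustration of why a minor allele seen twice is more convincing than one seen once, the sketch below computes the probability that the same wrong base appears at least k times among n reads by sequencing error alone. The model is our own simplification and is not taken from the pilot study: errors are assumed independent, every read has the same per-base error rate (0.01 by default, i.e., Q20), and each of the three wrong bases is assumed equally likely.

```python
from math import comb


def prob_error_only(n, k, error_rate=0.01):
    """P(at least k of n reads show one particular wrong base by error alone).

    Binomial tail probability under the simplifying assumptions stated in
    the accompanying text (independent errors, uniform substitution).
    """
    p = error_rate / 3  # chance of one specific wrong base in a single read
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


for k in (1, 2, 3):
    print(f"10 reads, minor allele seen >= {k} times: {prob_error_only(10, k):.2e}")
```

Under these assumptions, a single discordant read among 10 arises by error with a probability of about 0.03, whereas two reads carrying the same discordant base arise by error with a probability below 0.001, which is consistent with the greater confidence placed in 2 : 8 than in 1 : 9.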

Minor Allele Frequency

The sequence allele distribution is the most important indicator of SNP quality. The higher the minor allele frequency, the higher the chance that a putative SNP is a real SNP. Given a fixed number of sequences in a contig, the more equal the minor and major allele frequencies, the more likely the putative SNP is a true SNP (Figure 7.2). Theoretically, a combination of sequencing quality and real SNP allele distribution determines the chance that a detected SNP is real. For sequencing errors, the chance of getting the same sequencing error at the same sequence location multiple times becomes smaller and smaller as the number of sequences increases. For allele frequencies, the greater the minor allele frequency in the population, the more likely one is to detect the SNP. Because SNPs are regarded as biallelic markers (even though there are, theoretically, four possible alleles at any given SNP site), the chance of detecting the SNP is greatest when the two alleles have an equal frequency of 50% each (Figure 7.2). These theoretical predictions were validated in our pilot studies (Wang et al., 2008).

Figure 7.2 SNP quality assessment based on EST contig size and sequence frequency of the alleles. Arrows indicate the trend of increases of heterozygosity and the trend of increases in SNP quality. See color insert.

Number of sequences   Minor sequence frequency   Major sequence frequency   Sequence heterozygosity
10                    1                          9                          0.18
9                     1                          8                          0.20
8                     1                          7                          0.22
7                     1                          6                          0.24
6                     1                          5                          0.28
5                     1                          4                          0.32
4                     1                          3                          0.38
3                     1                          2                          0.44
2                     1                          1                          0.50

10                    2                          8                          0.32
9                     2                          7                          0.35
8                     2                          6                          0.38
7                     2                          5                          0.41
6                     2                          4                          0.44
5                     2                          3                          0.48
4                     2                          2                          0.50

10                    3                          7                          0.42
9                     3                          6                          0.44
8                     3                          5                          0.47
7                     3                          4                          0.49
6                     3                          3                          0.50

10                    4                          6                          0.48
9                     4                          5                          0.49
8                     4                          4                          0.50

10                    5                          5                          0.50

The presence of the minor allele sequence in relation to the contig size is important. For instance, if the minor allele sequence is present only once, then the smaller the contig, the more likely the SNP is real. This is because the contig size of ESTs is simply a reflection of expression abundance. If a rarely expressed gene was sequenced twice, with each alternative allele present once, one can still expect the allele frequencies to be equal or close to equal when the transcript is sequenced 10 times. However, if the transcript has already been sequenced 10 times, with the minor allele sequence present only once, it is more likely that the minor allele was derived from a sequencing error (Figure 7.2). This relation is obvious when sequence heterozygosity is considered, as shown in Figure 7.2. A contig of two sequences with one each of the alternative alleles has a sequence heterozygosity of 0.5, while a contig of 10 sequences with major allele : minor allele = 9 : 1 has a sequence heterozygosity of only 0.18. Although the above reasoning is correct, in practice one does not have to increase SNP quality at the extreme expense of the number of SNPs. That is to say, once the quality is high enough, one does not need to push for the highest SNP quality. In our pilot study, we attempted to determine the optimal level of quality in terms of the minimal number of sequences in the contig, the minimal allele frequency, and the SNP validation rate.
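The sequence heterozygosity values quoted above (0.5 for a 1 : 1 contig and 0.18 for a 9 : 1 contig) are consistent with the usual biallelic expression 2pq, where p and q are the fractions of sequences carrying the minor and major alleles. The formula is inferred here from the quoted values rather than stated in the text, so the sketch below should be read as an assumption; the function name is ours.

```python
def sequence_heterozygosity(minor, major):
    """Return 2pq for a site with `minor` and `major` allele sequence counts."""
    total = minor + major
    p = minor / total
    return 2 * p * (1 - p)


for minor, major in [(1, 1), (1, 9), (2, 8), (5, 5)]:
    h = sequence_heterozygosity(minor, major)
    print(f"{minor}:{major} -> {h:.2f}")
```

This gives 0.50 for 1 : 1, 0.18 for 1 : 9, 0.32 for 2 : 8, and 0.50 for 5 : 5, so heterozygosity rises toward its maximum of 0.5 as the two allele counts become more equal.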
Our results indicated that a minimum of four sequences in the contig, with the minor sequence allele represented at least twice, provided a relatively high SNP validation rate as tested in a single resource family (Table 7.1).

Table 7.1 SNP polymorphic rates as a function of contig size and minor sequence allele frequency as determined from a pilot study in catfish (Wang et al., 2008).

Number of sequences   Minimal minor        Number of         Polymorphic   Sequence ratio
in the contig         sequence frequency   successful loci   rate (%)
2                     50%                  24                33.3          1:1
3                     33.3%                37                45.9          1:2
4                     25%                  26                15.4          1:3
Subtotal                                   87                33.3
4                     50%                  44                70.5          2:2
5–6                   33.3%                60                60.0          2:3 & 2:4 & 3:3
7–8                   37.5%                17                64.7          3:4 & 3:5 & 4:4
9–12                  33.3%                21                76.2          4:5 & 4:6 & 4:7 & 4:8 & 5:5 & 5:6 & 5:7 & 6:6
>12                   17.4%                37                89.2          5:7 & 6:6 & 5:8 & 6:7 … & 12:57
Subtotal                                   179               70.9
Total                                      266               58.6
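The selection rule suggested by the pilot study (at least four sequences in the contig, with the minor allele represented at least twice) is easy to apply as a filter over candidate SNPs. The sketch below is a hypothetical illustration; the candidate names and counts are invented, and the function is not part of AUTOSNP or any other program mentioned here.

```python
def passes_pilot_threshold(minor_count, major_count, min_depth=4, min_minor=2):
    """True if a site has enough sequences and enough minor-allele sequences."""
    depth = minor_count + major_count
    return depth >= min_depth and minor_count >= min_minor


# Invented candidates: name -> (minor allele count, major allele count).
candidates = {
    "snp_a": (1, 3),   # four sequences, minor allele seen once
    "snp_b": (2, 2),   # four sequences, minor allele seen twice
    "snp_c": (2, 8),   # ten sequences, minor allele seen twice
    "snp_d": (1, 1),   # only two sequences
}

kept = [name for name, (minor, major) in candidates.items()
        if passes_pilot_threshold(minor, major)]
print(kept)  # ['snp_b', 'snp_c']
```

Keeping the two thresholds as parameters makes it straightforward to trade the number of retained SNPs against their expected validation rate.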

Figure 7.3 Distribution of minor allele frequency (MAF) in domestic and wild channel catfish strains. The label at the top of each panel refers to the name of the populations. Panels: Black Warrior River, Black Belt Farm, Guntersville Reservoir, Geneva Hatchery, Weiss Reservoir, and Petit Farm. In each panel, the x-axis is MAF (0.1–0.5) and the y-axis is % of SNPs.

It is noteworthy to point out that the sequence allele frequencies generated from EST studies by no means reflect the real allele distribution in various populations as revealed by SNP analysis with various domestic and wild populations of catfish (Mickett et al., 2003; Simmons et al., 2006; Figure 7.3).

Sequences Flanking SNPs and Their Sequence Quality

In addition to sequence redundancy at the SNP site and minor allele frequency, several other factors are also important for SNP genotyping and validation. Because Illumina SNP genotyping technology is one of the most popular high-throughput SNP genotyping platforms, we introduce here several important factors that can affect the success rate of EST-derived SNP genotyping. Illumina genotyping technology, and perhaps several other genotyping technologies as well, requires highly reliable SNP flanking sequences for efficient base extension, polymerase chain reaction (PCR), and genotyping. This would be particularly true for Affymetrix MyGeneChip Custom Arrays, as this is a hybridization-based system that relies on the sequences surrounding SNP sites for probes. With EST-derived SNPs, the quality of the sequence flanking the SNP site was found to be important for successful SNP genotyping using Illumina’s BeadArray technology, but the flanking sequence context was less important when the Illumina quality score was above 0.5. It is probably true that SNP genotyping primers would have worked properly for the most part even if the sequence context was somewhat simple, A/T-rich, or G/C-rich. However, sequence errors in the SNP region could directly affect the base pairing of the SNP genotyping primers. Low-quality sequences can easily generate false SNPs, especially at the beginning or the end of a sequence. Therefore, sequence quality surrounding the SNP site should be used as one parameter for identifying reliable SNPs. However, many EST sequences retrieved from the NCBI do not have quality scores or trace files. In such cases, greater caution should be exercised. In particular, hot spots of SNP occurrence should be avoided if possible (Wang et al., 2008). It is worth noting that quality scores could become more important when genomic sequences are involved, as these often contain repetitive sequences (Lepoittevin et al., 2010). Flanking sequence quality greatly affected the SNP success rate.
In the pilot study (Wang et al., 2008), we identified 28 contigs with hot spots of SNP occurrence where a region of sequence was highly variable with many “SNPs” detected. Sequence quality examination suggested low quality scores in the sequencing reactions. We intentionally included these SNPs in the pilot project to determine the effect of quality of sequences flanking SNPs. Of the 28 SNPs tested, 14 (50%) failed in genotyping, suggesting that high sequence quality is required in the SNP region as they are involved in genotyping primer binding regions. In contrast to the quality of sequences flanking SNPs, the sequences themselves were not crucially important once the quality score is above 0.5. Illumina actually assigns a quality score as a reflection of the flanking sequence complexity and sequence context flanking SNPs. In our pilot study, we selected 384 SNPs with quality scores ranging from 0.5 to 1.0 to determine how Illumina quality scores affect the success rate of genotyping. Quality scores were not associated with the failures of SNP genotyping when the quality scores were above 0.5 (Wang et al., 2008).

The Presence of Introns Can Significantly Reduce the Success Rate of SNPs

With EST-derived SNPs, a major challenge is the potential presence of introns near the SNP sites. The presence of introns greatly reduced the SNP genotyping success rate. In our pilot study, among the contigs containing SNPs, four had genomic DNA information that allowed us to test whether the involvement of introns has a major effect on SNP genotyping and validation rates. All four SNPs failed to provide genotypes. The reasons for the failure of genotyping when introns are present near the SNP sites may include the following. (1) The genotyping primers are located at the exon–intron boundary, so that the primers do not base-pair with DNA amplified from genomic DNA (Figure 7.4). This is easy to understand, as the genotyping primers are designed from cDNA sequences, while the real template used in genotyping is genomic DNA. One possibility would be to genotype using RNA samples converted to cDNA, but this has not been tested. (2) The BeadArray technology depends on very short extension and subsequent ligation for success. The presence of introns requires long extension followed by ligation, which reduces the success rate; this is somewhat unexpected, as one would expect DNA polymerase to extend easily over a few hundred bases. Nonetheless, in experimental science, the guiding principle is that we should do whatever works. Selecting SNPs so that both the allele-specific and locus-specific primers are located in a single exon is the key to achieving a high success rate of SNP genotyping. In the absence of a whole genome sequence or gene structural information for the SNPs involved, comparative gene organization analysis is a useful and productive approach to resolving the problems caused by the presence of introns near SNP sites. Bioinformatics analysis using in silico comparative sequence and gene structural analysis is important when dealing with EST-derived SNPs.

Figure 7.4 Schematic illustration of the effect of introns involved in SNP genotyping. In the first case, all the genotyping primers (P1, P2, and P3) are located nearby in the same exon, leading to successful genotyping (+); in the second case (middle), one of the genotyping primers (P3 as shown) is located at the exon–intron border, causing nonbase pairing that leads to failure of genotyping (–); and in the third case, all primers are located in exon regions, but an intron is involved that demands PCR extension across the intron. Apparently, the BeadArray technology provides very limited extension capability, leading to genotyping failure (–) as well. See color insert.


We conducted comparative sequence analysis of catfish ESTs with corresponding zebrafish genes as references. The rationale is that if the gene organization is similar in catfish and zebrafish, then sequence similarity comparison would allow the location of SNP sites to be aligned to the zebrafish genome. If the SNP sites are close to the exon–intron junction, then that could have caused the genotyping failures, assuming conservation of gene structure and organization between catfish and zebrafish. In our pilot study, 92 of the 99 catfish SNP loci had significant BLAST hits with the zebrafish genome, but of these, only 50 allowed sequence alignment in the region containing the involved SNPs. Sequence alignment and gene structure in zebrafish indicated that 32 (64%) of the 50 SNPs were located at the exon–intron border, suggesting that the presence of the presumed introns was the major cause for the failures of the SNP genotyping.
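The screening logic described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that exon sizes have already been inferred for the transcript (for example, from an alignment to zebrafish gene structure); it flags SNPs that fall within a chosen distance of an inferred exon–intron junction. The function name `near_junction`, the 25-base window, and the example coordinates are hypothetical and are not taken from the study.

```python
def near_junction(snp_pos, exon_lengths, window=25):
    """Return True if a SNP (1-based transcript position) lies within
    `window` bases of an inferred exon-exon junction.

    exon_lengths: exon sizes in transcript order, e.g. inferred from the
    gene structure of a reference species (an assumption, not measured).
    """
    junction = 0
    # Internal junctions sit after each exon except the last one.
    for length in exon_lengths[:-1]:
        junction += length
        # Distance from the SNP to the last base before the junction.
        if abs(snp_pos - junction) <= window:
            return True
    return False


# Hypothetical transcript with three exons of 120, 200, and 150 bases,
# giving inferred junctions after bases 120 and 320.
exons = [120, 200, 150]
flags = {pos: near_junction(pos, exons) for pos in (60, 110, 330, 400)}
print(flags)
```

SNPs flagged in this way would be candidates for exclusion or redesign, on the assumption that gene structure is conserved between the two species.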

Assessment of SNP Distribution in the Genome

Even if the identified SNPs are real and can support successful genotyping, their utility for genetic studies depends on their genomic location and distribution. It is not only the total number of SNPs that determines the power of genetic analysis but also their genomic distribution. Obviously, the more evenly the SNPs are distributed, the greater their power for genetic analysis. Highly clustered SNPs will not provide additional power, as they may not represent distinct haplotypes or there may be no recombination among them. The best scenario is to achieve as even a distribution as possible. Theoretically, for a genome with a size of 1 × 10⁹ bp such as that of the catfish, if 20,000 SNPs are used, one would have one SNP per 50,000 bp, assuming all SNPs are equally spaced throughout the genome. If this level of coverage can be achieved, the power for genetic analysis would be great, as one can analyze association of SNPs with traits within a 50-kb region, which is highly manageable in most laboratories. The problem is the inability to achieve this level of distribution, leading to the presence of large gaps in the genome without any SNP coverage (Figure 7.5). The most effective way to assess SNP genomic location and distribution is through in silico mapping of SNPs to a whole genome sequence assembly. This will soon be possible for several aquaculture species. One can select SNPs at roughly equal spacing in large scaffolds, and select at least one SNP in small scaffolds. In the absence of whole genome sequences, comparative genome analysis may lend some insight into genomic distribution of SNPs, assuming a colinearity of gene

Figure 7.5 Relationship of SNP distribution in the genome and their abilities to provide good genome coverage. The best scenario is the even genomic distribution of SNP markers as shown in A; SNP clusters can significantly reduce the power of genome coverage as shown in B.


sequence arrangement at the genome scale. The whole genome sequence assembly of a closely related species can be used as a reference for the analysis of SNP distribution. Complications arise when genomic sequences cannot be placed on the related genome because of insufficient sequence similarity. Such an approach should, in theory, work well for gene-associated SNPs.
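The spacing arithmetic and the effect of clustering can be made concrete with a short Python sketch. It assumes SNP positions have already been mapped in silico to a scaffold or chromosome; the function names and the example positions are hypothetical.

```python
def mean_spacing(genome_size_bp, n_snps):
    """Average distance between SNPs if they were perfectly evenly spaced."""
    return genome_size_bp / n_snps


def largest_gaps(positions, seq_length, top=3):
    """Return the `top` largest intervals without SNPs on one sequence.

    Each gap is reported as (gap_size, start, end).
    """
    pts = [0] + sorted(positions) + [seq_length]
    gaps = [(b - a, a, b) for a, b in zip(pts, pts[1:])]
    return sorted(gaps, reverse=True)[:top]


# Figures from the text: a 1 x 10^9 bp genome and 20,000 SNPs.
print(mean_spacing(1_000_000_000, 20_000))

# Hypothetical SNP positions on a 1-Mb scaffold: one cluster, two singletons.
clustered = [100_000, 101_000, 102_000, 103_000, 600_000, 950_000]
print(largest_gaps(clustered, 1_000_000, top=2))
```

In this made-up example, six SNPs on a 1-Mb scaffold still leave a gap of almost 500 kb, because four of them sit within 3 kb of one another.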

Quality Issues of SNPs Generated from Sequencing RRLs

As detailed in Chapter 5, large numbers of SNPs can be rapidly identified through sequencing of RRLs by using next generation sequencing. SNPs so identified should not be much different from those identified through whole genome sequencing projects except that (1) the contig size (total length in base pairs) may be short, thereby limiting the utility of the identified SNPs; (2) the sequence context could be quite simple or repetitive in nature, which limits the utility of the identified SNPs; and (3) the contig assembly provides no information as to the genomic location and distribution of the identified SNPs. Small contig length can significantly reduce the utility of the identified SNPs. For instance, if SNPs are located at or close to the beginning or end of contigs, there would be insufficient flanking sequence for the design of genotyping primers. The sequence context can also have a significant effect on the utility of the SNPs. Given the short length of the contigs constructed using next generation sequencing technology, the flanking sequences near the SNP sites may consist of very simple sequences or repetitive sequences. Teleost fish genomes are rich in repetitive sequences such as Tc-1/mariner transposons. SNPs within such sequences may appear useful, but genotyping them may prove difficult. For instance, Tc-1/mariner repetitive elements represent 4.6% of the catfish genome (Xu et al., 2006; Nandi et al., 2007; Liu et al., 2009). Genomic distribution of SNPs from sequencing the RRLs should be random because the genomic segments are randomly chosen, assuming that repetitive elements are avoided when the RRLs are constructed.
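Two of the issues above, insufficient flanking sequence and overly simple sequence context, lend themselves to a simple screen. The Python sketch below is illustrative only: the 50-base flank requirement, the 0.8 single-base threshold, and the function names are assumptions, and the crude composition check would not detect transposon-derived repeats, which require comparison against a repeat library.

```python
def has_enough_flank(snp_index, contig_length, min_flank=50):
    """True if at least `min_flank` bases lie on both sides of the SNP.

    snp_index is 0-based; min_flank=50 is an illustrative value only.
    """
    left = snp_index
    right = contig_length - snp_index - 1
    return left >= min_flank and right >= min_flank


def low_complexity(seq, max_fraction=0.8):
    """Crude screen: flag a flank dominated by a single base."""
    seq = seq.upper()
    if not seq:
        return True
    top = max(seq.count(base) for base in "ACGT")
    return top / len(seq) > max_fraction


# A SNP 10 bases into a 200-base contig lacks flanking sequence on one side.
print(has_enough_flank(10, 200), has_enough_flank(100, 200))
print(low_complexity("AAAAAAAAAT"), low_complexity("ACGTACGTAC"))
```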

Conclusions

As compared with SNPs identified from genomic sequences, EST-derived SNPs have several advantages. Since ESTs are transcribed sequences, EST-derived SNPs are associated with actual genes, allowing use of gene-associated SNPs for mapping and subsequent use in comparative genome studies (Sarropoulou et al., 2008). This is particularly important for species without a genome sequence such as aquaculture species. In addition to being used as markers for mapping, SNPs are also considered a rich source of candidate polymorphisms underlying important traits, leading to the identification of causative genes or quantitative trait nucleotides (QTN) (Jalving et al., 2004). However, there are several important factors to be considered when using EST-derived SNPs. The major issue for development of SNPs from EST resources is not whether SNPs can readily be identified, but to what degree these SNPs would be reliable because parameters for quality assessment of EST-derived


SNPs simply do not exist. This reliability issue was mostly due to sequence errors; apparent sequence variation within assembled contigs could simply reflect sequencing errors. Additionally, since SNPs derived from ESTs can only be identified from EST contigs where the same gene transcripts were sequenced at least twice and sequencing frequency of ESTs is not random, large-scale sequencing is required to identify SNPs from rarely expressed genes. Moreover, SNP rates could be lower in coding regions because of evolutionary constraints imposed by selection pressure. The contig size (number of sequences in the contig) and minor sequence allele frequency were the two major factors affecting the validation rates of EST-derived SNPs. Small contigs had much lower SNP validation rates. Obviously, in small contigs with two or three sequences, the alternative base is represented only once, and this could be due to sequencing errors. Similarly, in contigs with four sequences in which the minor sequence allele is represented only once, it is highly likely that the minor allele is due to sequencing errors. Contigs of four or more sequences, with the minor sequence allele present at least twice in the contig, provided high SNP validation rates (averaging 70.9% up to 89.2%). This makes good sense because it is highly unlikely for sequencing errors in two independently sequenced ESTs to occur at the same base location. When at least two ESTs exhibit an alternative base at the putative SNP site, it is highly likely that such sequence variation is real. Even with true SNPs, a key to the success of SNP genotyping using EST-derived SNPs is the avoidance of introns. Genotyping using cDNAs as templates could, in theory, reduce genotyping complications due to the presence of introns, but such an approach has yet to be tested. SNP location and genomic distribution are equally important for the power of genetic analysis. 
The best scenario is even distribution of all SNPs, avoiding SNP clustering. If even genomic distribution can be achieved, a large genome can be effectively covered with a reasonable number of total SNPs. The best approach for the analysis of genome distribution of SNPs is in silico mapping if the whole genome sequence is available. If the whole genome sequence assembly is not available, cross-species in silico comparative analysis is highly useful. In spite of the power of SNP identification using next generation sequencing with RRLs, SNPs so identified are associated with potential complications of being located at the ends of sequences, flanked with simple sequences or repetitive elements, and without information on genomic location and distribution.
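The contig-size and minor-allele rule summarized in these conclusions can be expressed as a small filter. The Python sketch below is an illustration of that rule, not the pipeline used in the original study; the function name and the input representation (one observed base per EST at the putative SNP site) are assumptions.

```python
from collections import Counter


def passes_filter(bases, min_contig_size=4, min_minor_count=2):
    """Apply the contig-size and minor-allele rule to one putative SNP.

    bases: the base observed at the SNP site in each EST of the contig.
    """
    if len(bases) < min_contig_size:
        return False
    counts = Counter(bases).most_common()
    if len(counts) < 2:
        return False  # no alternative base, so not a putative SNP
    minor_count = counts[1][1]  # count of the second most frequent base
    return minor_count >= min_minor_count


# Three ESTs: too small. Four ESTs with one alternative base: likely error.
# Four or more ESTs with the alternative base seen twice: retained.
for site in ("AAG", "AAAG", "AAGG", "AAAAGG"):
    print(site, passes_filter(site))
```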

References

Hayes BJ, Nilsen K, Berg PR, Grindflek E, and Lien S. 2007. SNP detection exploiting multiple sources of redundancy in large EST collections improves validation rates. Bioinformatics, 23:1692–1693.
Jalving R, van’t Slot R, and van Oost BA. 2004. Chicken single nucleotide polymorphism identification and selection for genetic mapping. Poult Sci, 83:1925–1931.
Lepoittevin C, Frigerio J-M, Garnier P, Salin F, Cervera T, Vornam B, Harvengt L, and Plomion C. 2010. In vitro vs in silico detected SNPs for the development of a genotyping array: What can we learn from a non-model species? PLoS ONE, 5:e11034.


Liu H, Jiang Y, Wang S, Ninwichian P, Somridhivej B, Xu P, Abernathy J, Kucuktas H, and Liu Z. 2009. Comparative analysis of catfish BAC end sequences with the zebrafish genome. BMC Genomics, 10:592.
Mickett K, Morton C, Feng J, Li P, Simmons M, Cao D, Dunham RA, and Liu Z. 2003. Assessing genetic diversity of domestic populations of channel catfish (Ictalurus punctatus) in Alabama using AFLP markers. Aquaculture, 228:91–105.
Nandi S, Peatman E, Xu P, Wang S, Li P, and Liu Z. 2007. Repeat structure of the catfish genome: A genomic and transcriptomic assessment of Tc1-like transposon elements in channel catfish (Ictalurus punctatus). Genetica, 131:81–90.
Sarropoulou E, Nousdili D, and Magoulas AGK. 2008. Linking the genomes of nonmodel teleosts through comparative genomics. Mar Biotechnol (NY), 10:227–233.
Simmons M, Mickett K, Kucuktas H, Li P, Dunham R, and Liu ZJ. 2006. Comparison of domestic and wild channel catfish (Ictalurus punctatus) populations provides no evidence for genetic impact. Aquaculture, 252:133–146.
Wang S, Sha Z, Sonstegard TS, Liu H, Xu P, Somridhivej B, Peatman E, Kucuktas H, and Liu Z. 2008. Quality assessment parameters for EST-derived SNPs from catfish. BMC Genomics, 9:450.
Xu P, Wang S, Liu L, Peatman E, Somridhivej B, Thimmapuram J, Gong G, and Liu Z. 2006. Channel catfish BAC-end sequences for marker development and assessment of syntenic conservation with other fish species. Anim Genet, 37:321–326.

Chapter 8

SNP Genotyping Platforms

Eric Peatman

Single-nucleotide polymorphisms (SNPs) are abundant molecular markers that are easily multiplexed and can be automatically scored. These features lend themselves to high-throughput genotyping with the aid of sophisticated commercial platforms. As genetic and genomic research in aquaculture species has advanced, SNP genotyping has moved to the fore as the technique of choice for high-density linkage mapping, quantitative trait loci (QTL) analysis, fine mapping, and whole genome-based selection. DNA variations (or polymorphisms) have been studied intensively for the last quarter century in mammalian genetics (Nakamura, 2009). The increasing power of genomic tools has allowed a steady progression in mammalian researchers’ ability to discover, quantify, and genotype polymorphic markers, from using single microsatellites one marker at a time with limited numbers of individuals in the late 1980s (Litt and Luty, 1989) to whole genome association studies with hundreds of thousands to millions of SNP markers genotyped on thousands of individuals in 2010 (Wang et al., 2010a). As detailed elsewhere in this book, the abundance, distribution, and adaptability of SNP markers to high-throughput analysis have led to their near-exclusive use in human genetics and genomic studies. Genetic research in aquaculture species has taken the same path as seen in mammalian species, albeit in a compressed time frame of less than 15 years. A rapid increase in large-scale marker discovery has been seen in fish in the last 5 years, mainly concurrent with the establishment of large expressed sequence tag (EST) resources (Liu and Cordes, 2004; Li et al., 2007; Moen et al., 2008; Wang et al., 2010b). 
The recent advent of next generation sequencing technologies and the rapid progression of whole genome sequencing in multiple aquaculture species have exponentially increased the availability of sequence data from which polymorphic markers can be identified. As with mammalian species, this has necessitated a focus on SNP discovery and genotyping to efficiently utilize these resources. Currently, depending on the progress within a given species, aquaculture genome researchers may seek to (1) identify a maximal number of polymorphic SNP markers both from gene and genomic (noncoding) regions; (2) place these markers on linkage maps constructed by genotyping reference families; (3) utilize dense SNP maps to identify QTL regions associated with important production traits such as growth, disease resistance, or cold tolerance; and (4) identify and utilize SNP subsets to fine-map QTL regions to find underlying genes and polymorphisms directly contributing to phenotypes of interest.


To achieve these goals, several different strategies are needed, each with significantly different requirements for SNP marker number, organism sample size, cost, and automation. In this regard, this chapter will consider present SNP genotyping platform technologies and their appropriateness for meeting the varied needs of the aquaculture genomics community. The chapter, while covering the principles behind SNP genotyping technologies, will focus less on details of assay chemistry and more on considerations of throughput, sample number, platform flexibility, and costs.

SNP Genotyping Platforms: Chemistries and Detection Methods

While a multitude of methods have historically been utilized to genotype small numbers of SNPs, such as restriction fragment length polymorphism, single-strand conformation polymorphism, and high-resolution melting, current SNP genotyping platforms utilize a relatively small number of strategies that lend themselves to higher sample throughputs and multiplexing capabilities. Table 8.1 outlines the chemistries and detection methods driving the dominant SNP genotyping platforms currently suited for aquaculture genetic and genomic applications. With the exception of Affymetrix GeneChip custom arrays, which employ differential hybridization as the basis for SNP detection, all other platforms rely on enzyme-based methods: the 5′ nuclease activity of TaqMan assays, single-base extension (SBE), or allele-specific primer extension. Below, I will briefly describe the key principles and procedures undergirding these platforms.

Table 8.1 Comparison of chemistry and detection of various currently popular SNP genotyping platforms.

| Platform | Company | Chemistry | Detection |
|---|---|---|---|
| iSelect HD Custom | Illumina | Single-base extension | Fluorescence |
| GoldenGate | Illumina | Allele-specific primer extension | Fluorescence |
| MyGeneChip Custom Arrays | Affymetrix | Differential hybridization | Fluorescence |
| MassArray | Sequenom | Single-base extension | Mass spectrometry |
| SNPstream | Beckman Coulter | Single-base extension | Fluorescence |
| TaqMan OpenArray | Applied Biosystems | TaqMan (5′ nuclease) | Fluorescence |
| Dynamic Array | Fluidigm | TaqMan (5′ nuclease) | Fluorescence |

Affymetrix MyGeneChip™ Custom Arrays

Affymetrix is best known for its development of high-density microarrays for mammalian and model species. It has capitalized on its expertise in hybridization to create chips with the ability to screen over 1.8 million SNPs and copy number variations (CNVs) in humans. These densities are allowed by differential hybridization, which relies on the use of several matched and mismatched probes per locus (as with gene expression arrays). Twelve or more probes (25mers) per allele are represented on the chip. Following amplification and complexity reduction of genomic DNA, samples are fragmented, biotin-end labeled, and hybridized to SNP arrays. Comparison of hybridization signals (fluorescent intensities) between redundant probes for a given SNP allows differentiation of homozygous and heterozygous signals as well as screening out signal resulting from nonspecific hybridization (Figure 8.1). Beyond its popular genome-wide SNP platform for humans, Affymetrix makes available its differential hybridization-based platform for high-density custom SNP arrays for any species under its MyGeneChip Custom Array Program. Affymetrix also produces smaller (5–25 K), targeted SNP kits that use molecular inversion probe technology for SNP calling. However, these kits currently do not target aquaculture species.

Figure 8.1 Differential hybridization utilizing multiple perfect match (PM) and mismatch (MM) probes per SNP allele and shifting the nucleotide context of the SNP provides the ability to differentiate homozygous and heterozygous signals as well as screening out signal resulting from nonspecific hybridization, as shown in the idealized, simplified array image. See color insert.

Probe sequences shown in Figure 8.1 (genomic DNA with alleles G and A):

| Quartet | Probe | Sequence |
|---|---|---|
| Central SNP | PM-A | ATCAATAGCCATCATGAGTTAGTAG |
| Central SNP | MM-A | ATCAATAGCCATGATGAGTTAGTAG |
| Central SNP | PM-B | ATCAATAGCCATTATGAGTTAGTAG |
| Central SNP | MM-B | ATCAATAGCCATAATGAGTTAGTAG |
| –4 offset | PM-A | TGCCATCAATAGCCATCATGAGTTA |
| –4 offset | MM-A | TGCCATCAATAGGCATCATGAGTTA |
| –4 offset | PM-B | TGCCATCAATAGCCATTATGAGTTA |
| –4 offset | MM-B | TGCCATCAATAGGCATTATGAGTTA |
| +4 offset | PM-A | ATAGCCATCATGAGTTAGTAGTTCA |
| +4 offset | MM-A | ATAGCCATCATGTGTTAGTAGTTCA |
| +4 offset | PM-B | ATAGCCATTATGAGTTAGTAGTTCA |
| +4 offset | MM-B | ATAGCCATTATGTGTTAGTAGTTCA |

The figure also indicates opposite-strand probes and an idealized array image for Sample 1 (AA), Sample 2 (AB), and Sample 3 (BB).
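The comparison of perfect-match and mismatch signals described in this section can be illustrated with a deliberately simplified Python sketch. It is not the calling algorithm used by Affymetrix software; the function name, the subtraction of mismatch signal, and the 0.3 and 0.7 thresholds are assumptions chosen only to show the idea.

```python
def call_genotype(pm_a, mm_a, pm_b, mm_b, het_low=0.3, het_high=0.7):
    """Call AA, AB, or BB from probe intensities for one SNP.

    Each argument is a list of intensities from redundant probes.
    Mismatch signal is subtracted as a proxy for nonspecific hybridization.
    The thresholds are illustrative, not vendor values.
    """
    signal_a = sum(max(p - m, 0.0) for p, m in zip(pm_a, mm_a))
    signal_b = sum(max(p - m, 0.0) for p, m in zip(pm_b, mm_b))
    total = signal_a + signal_b
    if total == 0:
        return "no call"
    frac_a = signal_a / total
    if frac_a > het_high:
        return "AA"
    if frac_a < het_low:
        return "BB"
    return "AB"


# Made-up intensities for three samples.
print(call_genotype([900, 850, 880], [100, 120, 90], [110, 130, 100], [100, 125, 95]))
print(call_genotype([500, 520], [100, 110], [480, 510], [90, 100]))
print(call_genotype([110, 130], [100, 125], [900, 850], [100, 120]))
```

Production software typically clusters signals across many samples rather than applying fixed thresholds to one sample at a time.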

Illumina SNP Genotyping Platforms

Illumina offers two viable options for aquaculture researchers interested in SNP genotyping: the GoldenGate Assay and the iSelect HD Custom BeadChip.


Figure 8.2 Principles of the Illumina SNP genotyping platform. In Illumina’s SNP assays, allele discrimination at each SNP locus is achieved by using three oligos, P1, P2, and P3, of which P1 and P2 are allele-specific and are Cy3- and Cy5-labeled, as indicated by red and green colors. P3 is a locus-specific oligo designed several bases downstream from the SNP site. Upon allele-specific extension and ligation, an artificial, allele-specific template is created for PCR using universal primers. If the template DNA is homozygous, either P1 or P2 will be extended to meet P3; if the template is heterozygous, both P1 and P2 will be extended to meet P3, allowing ligation to happen. P3 contains a unique address sequence that targets a particular bead type carrying the complementary sequence. After downstream processing, the single-stranded, dye-labeled DNAs are hybridized to their complementary bead type through their unique address sequences. After hybridization, the BeadArray Reader is used to read the fluorescence signal on the Sentrix Array Matrix or BeadChip, which is in turn analyzed using software for automated genotype clustering and calling. (Figure adapted from Illumina [www.illumina.com/].) See color insert.

GoldenGate Assays rely on allele-specific primer extension for SNP calling. In the GoldenGate Assay, DNA samples are first bound to paramagnetic particles. Three oligonucleotides are designed for each SNP locus: two allele-specific oligos and a locus-specific oligo that hybridizes several bases downstream from the SNP site and contains a bead-specific address (Figure 8.2). Following hybridization between genomic DNA and assay oligonucleotides, the template–primer complex is extended with DNA polymerase. Only when extension occurs is the allele-specific primer brought into close proximity with the locus-specific primer for ligation. The ligation joins the appropriate allele-specific product (genotype) with the locus-specific primer (address) to form a full-length product that serves as a template for polymerase chain reaction (PCR) using Cy3- and Cy5-labeled allele-specific primers. The single-stranded, dye-labeled DNAs are hybridized to their complementary bead type contained on a BeadChip through their locus-specific primer address, the fluorescent signal is captured, and the SNP is called.

The Illumina iSelect BeadChip utilizes a related technique, SBE, for SNP calling (Figure 8.3). In this approach, a two-step allele detection strategy is employed. Amplified, fragmented genomic DNA is first hybridized to bead-bound 50-mer oligos, providing locus specificity. Then SBE is carried out, allowing for the incorporation of a fluorescently labeled dideoxynucleotide for assay readout and SNP calling.


Figure 8.3 Common features of single-base extension genotyping. Sample genomic DNA is amplified, fragmented, and allowed to hybridize to locus-specific oligos (LSOs) in solution or bound to beads. Enzymatic incorporation of a single, fluorescently labeled dideoxynucleotide (ddNTP) allows specific base calling for each sample and locus. See color insert.

Sequenom’s MassArray

Sequenom’s MassArray platform also relies on SBE for SNP genotyping, albeit with several variations from that described above. Sequenom’s iPLEX Gold assay first requires PCR amplification of the region containing the SNP of interest. Next, primers are extended by a single base to generate allele-specific DNA products. Finally, chip-based mass spectrometry is utilized for separation, analysis, and base calling of the SNP loci based on unique mass values. Because the alternative bases at the SNP site differ in molecular weight, the resulting mass differences between extension products allow the SNP to be called.
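The principle of calling alleles from mass differences can be sketched as a simple peak-matching step. The Python example below is a hypothetical illustration, not Sequenom's software: the function name, the 2-Da tolerance, and the expected masses are invented, and in practice the expected masses would come from the assay design.

```python
def call_by_mass(observed_peaks, expected_masses, tolerance=2.0):
    """Match observed peak masses (Da) to expected allele product masses.

    expected_masses: dict mapping allele -> expected mass of the extended
    primer. The values used below are made up for illustration.
    Returns a tuple of two alleles for a diploid call, or None.
    """
    alleles = sorted(
        allele
        for allele, mass in expected_masses.items()
        if any(abs(peak - mass) <= tolerance for peak in observed_peaks)
    )
    if not alleles:
        return None  # no peak near any expected product mass
    if len(alleles) == 1:
        return (alleles[0], alleles[0])  # homozygote
    return tuple(alleles)  # heterozygote


# Hypothetical expected masses for the two extension products of one SNP.
expected = {"C": 5500.0, "T": 5580.0}
print(call_by_mass([5500.6], expected))
print(call_by_mass([5499.2, 5580.9], expected))
print(call_by_mass([6000.0], expected))
```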

Beckman Coulter’s SNPstream Platform

Beckman Coulter’s SNPstream platform again utilizes SBE chemistry for SNP calling. PCR amplification is carried out using primers specific to each SNP-containing locus. After PCR cleanup, tagged SNP primers are added to allow for SBE with fluorescently labeled dideoxynucleotides. The products are captured and sorted on 384-well microarray plates by hybridization to complementary tag sequences. Fluorescent signals are captured by a CCD camera and SNP genotypes are called.

OpenArray and Dynamic Array Systems

Both the Applied Biosystems OpenArray system and Fluidigm’s Dynamic Array (EP1/BioMark System) utilize TaqMan chemistry for SNP genotyping. Two allele-specific TaqMan probes are designed for each SNP locus, each with a different 5′ fluorophore color and a 3′ quencher molecule. In the native condition, the quencher suppresses the fluorophore’s signal. Additional PCR primers are designed flanking the probe/SNP region. During PCR amplification, the perfectly complementary allele-specific TaqMan probe binds to the target DNA strand. Extension of the flanking PCR primers results in degradation of the TaqMan probe by the 5′ nuclease activity of the Taq polymerase and the concurrent separation of fluorophore and quencher. A detectable, allele-specific fluorescent signal results.

SNP Genotyping Platforms: Throughput and Multiplexing

Human SNP studies have usually been divided between whole genome association studies and fine mapping (high-density genotyping analysis in specific genomic regions). Human whole genome association studies take advantage of Affymetrix GeneChip or Illumina HD chip platforms to genotype several thousand samples for at least half a million SNPs. Fine-mapping studies follow up whole genome association results, confirming and/or refining findings by scanning a custom SNP subset (e.g., 5000 SNPs) with a larger sample size (e.g., 10,000 individuals). The majority of SNP genotyping platforms were originally targeted at human fine-mapping requirements. Given the current disparities in SNP marker availability and funding levels between model species and aquaculture species, it is not surprising that project requirements also differ significantly between these groups. A large “whole genome” SNP project for catfish or salmonid species may have throughput and sample size specifications typical of a human fine-mapping experiment. In this regard, Table 8.2 provides an overview of the suitability of SNP platforms depending on SNP numbers and sample sizes relevant for aquaculture genomics. Except at the higher end of sample size (>5000) and SNP numbers (>10,000), many of the existing SNP genotyping platforms are capable of handling a wide range of SNP and sample requirements. However, several of the technologies are clearly focused on specific research scenarios.

Table 8.2 An overview of SNP genotyping platforms in regard to SNP number and sample size.

| SNP number | Sample size 50 | Sample size 200 | Sample size 2000 | Sample size >5000 |
|---|---|---|---|---|
| 50 | OpenArray, Dynamic Array | OpenArray, Dynamic Array, MassArray, SNPstream, GoldenGate | Dynamic Array, MassArray | Dynamic Array, MassArray |
| 200 | MassArray, SNPstream, GoldenGate | MassArray, SNPstream, GoldenGate | MassArray, SNPstream, GoldenGate | MassArray, SNPstream, GoldenGate |
| 2000 | MassArray, SNPstream | MassArray, SNPstream | MassArray, SNPstream | MassArray, SNPstream |
| >10,000 | NA | NA | iSelectHD, Affymetrix GeneChip | iSelectHD, Affymetrix GeneChip |

Platforms were selected based on reasonable running costs, multiplexing capabilities, and historical patterns of use. Numbers were chosen to reflect likely experimental parameters as relevant to aquaculture species as well as potential divisions in platform utility. NA indicates that genotyping large numbers of SNPs on small numbers of individuals is currently cost-prohibitive.

The Applied Biosystems OpenArray system and the Fluidigm Dynamic Array system are both targeted at researchers with smaller numbers of validated SNPs for use in rapid genotyping on large numbers of samples. As with all SNP genotyping platforms, multiplexing strategies dictate throughput in these TaqMan-based platforms. The OpenArray system is capable of up to 3072 genotyping reactions on a single microscope slide (e.g., 12 samples × 256 TaqMan assays; a layout of 144 samples × 16 TaqMan assays uses 2304 of them). Fluidigm’s Dynamic Array uses nanofluidics to expand this capacity to 9216 reactions in a single run (96 samples × 96 TaqMan assays). Researchers planning to conduct projects aimed at genotyping several hundred SNPs have a wide array of platform options including MassArray, SNPstream, and Illumina’s GoldenGate. MassArray allows multiplexing of up to 40 SNPs in a single well and processing of 384 samples in parallel (15,360 reactions per plate; >100,000 genotypes per day). The SNPstream system supports 48-plex PCR and SBE in a 384-well format (18,432 reactions per plate; >3,000,000 genotypes per day). Illumina’s GoldenGate Assay offers the highest multiplexing capabilities of this group: 1536-plex on 96 samples for 147,456 genotypes per plate (>300,000 genotypes per day). Projects requiring genotyping of greater than 10,000 SNPs will likely utilize Illumina’s iSelect HD Custom chips or Affymetrix custom GeneChips. Both companies’ custom chips are built with the same chemistries and multiplexing capacities as their human equivalents used for whole genome association studies, meaning that potential capacity and throughput are more than sufficient for aquaculture applications. 
Sample sizes of 2000 or greater are recommended to bring per sample cost down to acceptable levels, although Affymetrix offers greater flexibility for users with lower sample needs. Next generation sequencing approaches for SNP discovery and genotyping (described elsewhere in this book) may soon fill the void of options available for genotyping large numbers of SNPs on relatively small sample sets.
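The per-run capacities quoted in this section follow directly from multiplying the number of samples by the number of assays multiplexed per sample. The short Python sketch below reproduces that arithmetic for the configurations named in the text; the function name and dictionary labels are illustrative.

```python
def reactions_per_run(samples, assays_per_sample):
    """Genotyping reactions in one run: samples x multiplexed assays."""
    return samples * assays_per_sample


# Configurations quoted in the text.
runs = {
    "OpenArray (144 x 16)": reactions_per_run(144, 16),
    "OpenArray (12 x 256)": reactions_per_run(12, 256),
    "Dynamic Array (96 x 96)": reactions_per_run(96, 96),
    "MassArray (384 x 40-plex)": reactions_per_run(384, 40),
    "SNPstream (384 x 48-plex)": reactions_per_run(384, 48),
    "GoldenGate (96 x 1536-plex)": reactions_per_run(96, 1536),
}
for name, n in runs.items():
    print(f"{name}: {n:,}")
```

Note that the 144 × 16 OpenArray layout gives 2304 reactions, whereas the 12 × 256 layout reaches the 3072 maximum.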

SNP Genotyping Platforms: Cost and Flexibility

SNP genotyping costs can vary considerably based on platform choice and availability, labor and processing costs, desired SNP throughput, sample sizes, multiplexing, design costs, success rates, custom versus off-the-shelf components, the emergence of new technologies, and so on. It is, therefore, beyond the scope of this chapter to attempt to compare detailed pricing among SNP genotyping platforms. However, some general considerations of cost and flexibility are discussed here. Current high-throughput options (>10,000 SNPs) offer little flexibility after initial chip design but provide the lowest costs per genotype (less than $0.01/genotype) of any platform, assuming an adequate sample size. Start-up times are also longest for high-density chip platforms (4–8 weeks for synthesis/fabrication alone). Affymetrix MyGeneChip projects allow additional flexibility in some areas not currently available from Illumina. Illumina iSelect projects emphasize high sample numbers to bring down per sample costs. These sample numbers and the short reagent shelf life (12 months from date of bead set synthesis) may prove daunting to small aquaculture research communities. Affymetrix users have options for smaller batch, longer time frame projects that may be more appropriate for some groups. GoldenGate Assays also require considerable start-up time and provide intermediate flexibility and intermediate costs per reaction ($0.05–$0.20/genotype). In comparison with GoldenGate Assays, MassArray and SNPstream platforms offer streamlined start-up and design but similar costs ($0.05–$0.20/reaction). TaqMan-based assays require extensive primer/probe optimization prior to multiplexing. Nanofluidic systems (such as Fluidigm’s EP1 platform) have greatly reduced reaction costs for TaqMan SNP assays, but costs are often the highest among the platforms considered once TaqMan probe set costs and labor are included. One aspect that must be considered when weighing SNP genotyping costs is the cost of access to potential genotyping platforms. A platform available in a local core lab with minimal overhead costs may provide lower costs and greater flexibility for small projects than a “better” platform elsewhere with sizeable assay setup and processing fees.
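To see how the per-genotype figures above translate into project budgets, the following Python sketch multiplies SNP number, sample number, and unit cost. It is a rough reagent-level estimate only: the project size is hypothetical, and design, setup, labor, and access fees, which the text notes can dominate small projects, are deliberately ignored.

```python
def project_cost(n_snps, n_samples, cost_per_genotype):
    """Reagent-level estimate: genotypes x unit cost (ignores setup fees)."""
    return n_snps * n_samples * cost_per_genotype


# Hypothetical mid-sized project: 384 SNPs genotyped on 1000 fish,
# priced at the $0.05 to $0.20 per genotype range quoted in the text.
low = project_cost(384, 1000, 0.05)
high = project_cost(384, 1000, 0.20)
print(f"${low:,.0f} to ${high:,.0f}")
```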

SNP Genotyping Platforms: Matching Platforms with Project Goals

Ultimately, project goals will help to dictate the appropriate platform chosen for SNP genotyping. While medium- to high-throughput SNP genotyping is still in its infancy in aquaculture species, several projects have been carried out and several more are currently underway. A review of applications of SNP genotyping platforms to real-life research aims in aquaculture may be helpful in guiding those seeking to begin similar research (Table 8.3). The Illumina GoldenGate Assay was used by Sanchez et al. (2009) to evaluate a subset of 384 rainbow trout SNPs from a larger set discovered by deep sequencing of a reduced representation library (RRL). Genotyping was run on 192 samples, and 48% of the tested SNPs were validated. Their work added 167 SNP markers to the rainbow trout linkage map.

Table 8.3 Examples of aquaculture-/fisheries-related research utilizing SNP genotyping platforms.

Project                     Species                Platform        Reference
SNP validation              Rainbow trout          GoldenGate      Sanchez et al. (2009)
SNP validation              Catfish                GoldenGate      Wang et al. (2008)
QTL analysis                Atlantic salmon        MassArray       Boulding et al. (2008)
Linkage mapping             Pacific white shrimp   MassArray       Du et al. (2010)
Linkage mapping             Atlantic cod           MassArray       Moen et al. (2009)
Map integration             Atlantic salmon        MassArray       Lorenz et al. (2010)
Mapping and QTL             Atlantic salmon        iSelect HD      NA
Mapping and QTL             Catfish                Affymetrix      NA
Population differentiation  Salmon and steelhead   Dynamic Array   NA

NA refers to currently unpublished studies.


The GoldenGate Assay was also used by Wang et al. (2008) to genotype 192 catfish for 384 SNPs. The research provided parameters for SNP quality assessment based on EST contig size and minor allele frequency. Several recent studies have utilized Sequenom’s MassArray platform for genotyping in the context of linkage mapping, map integration, or QTL analysis. Boulding et al. (2008) examined associations between SNPs and QTLs for adaptive traits in juvenile Atlantic salmon. They genotyped 980 fish for 129–320 SNP loci using the MassArray system. They were able to achieve high levels of multiplexing (up to 34-plex) and successfully detected 79 significant associations between SNP markers and quantitative traits. A gene-based SNP linkage map was recently published for Pacific white shrimp by Du et al. (2010). A total of 825 SNPs were genotyped on 144 individual animals using the Sequenom MassArray platform. After mapping analysis, 418 SNPs were incorporated into the linkage map. An Atlantic cod genetic linkage map was also created by genotyping 1146 offspring from 12 full-sib families for 257 SNPs on the MassArray system (Moen et al., 2009). In this case, 174 SNPs were successfully placed on the cod linkage map. Lorenz et al. (2010) took an interesting approach to Atlantic salmon physical map-linkage map integration. They resequenced bacterial artificial chromosome (BAC) ends from 14 individuals and detected 180 SNPs. They then genotyped 376 individuals for these SNPs using the MassArray system and positioned 110 SNPs on their existing linkage map. In doing so, they also anchored 73 BAC contigs to the Atlantic salmon linkage groups. Several other, currently unpublished, large SNP projects are ongoing in aquaculture species. Work is currently underway in Atlantic salmon and catfish to utilize Illumina iSelect and Affymetrix chips for high-throughput genotyping.
A 16.5 K Illumina SNP chip has been constructed in Atlantic salmon from 9240 EST-derived SNPs and 7019 SNPs from RRLs. Construction of a high-density (40 K) SNP chip for catfish is underway, also utilizing a combination of EST-derived SNPs and SNPs from RRLs. Lastly, Fluidigm’s EP1 system is being utilized to genotype more than 7500 samples of steelhead and Chinook salmon for 75–96 SNPs. Results will be used to differentiate wild fish populations (Narum et al., 2009). To summarize, a variety of platforms have been utilized for SNP genotyping in aquaculture-relevant species. For midlevel mapping and QTL throughputs (∼100 K genotyping reactions), the MassArray system has enjoyed the greatest popularity, followed by Illumina’s GoldenGate Assay. Several studies using Illumina’s iSelect high-density chips are underway. As SNP genotyping projects become more common in aquaculture genomics research over the next several years, new technologies may emerge that are ideally suited to the sample sizes and throughput requirements of researchers in nonmodel species.
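To put the "∼100 K genotyping reactions" scale in context, the short sketch below tabulates the nominal number of reactions and the fraction of SNPs placed on a map for three of the MassArray studies described above. The sample and SNP counts are those quoted in the text; the reaction counts assume every sample was genotyped at every SNP, which is a simplification that the studies themselves may not satisfy.

```python
# (study, samples genotyped, SNPs genotyped, SNPs placed on the map), as quoted in the text
STUDIES = [
    ("Du et al. (2010), Pacific white shrimp", 144, 825, 418),
    ("Moen et al. (2009), Atlantic cod", 1146, 257, 174),
    ("Lorenz et al. (2010), Atlantic salmon", 376, 180, 110),
]


def summarize(samples, snps, mapped):
    """Return (nominal genotyping reactions, fraction of genotyped SNPs mapped)."""
    if snps <= 0:
        raise ValueError("snps must be positive")
    return samples * snps, mapped / snps


for name, samples, snps, mapped in STUDIES:
    reactions, fraction = summarize(samples, snps, mapped)
    print(f"{name}: {reactions:,} reactions, {fraction:.1%} of SNPs mapped")
```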

Conclusions

SNP genotyping platforms provide significant power to advance aquaculture research in genetics and genomics. A wide range of SNP numbers and sample sizes can be interrogated using current commercial solutions, and further advances in nanofluidics, multiplexing, and decreases in reagent costs are expected in the near future. Researchers interested in SNP genotyping must weigh aspects of cost, flexibility, and throughput alongside project goals to select the most appropriate technology. Aquaculture genomics researchers, in many cases, have unique project requirements for low-cost, high-SNP-number, low- to medium-sample-size projects that are currently poorly matched with existing technologies. As SNP genotyping moves beyond human health and model species, these needs may begin to be met.

References

Boulding EG, Culling M, Glebe B, Berg PR, Lien S, and Moen T. 2008. Conservation genomics of Atlantic salmon: SNPs associated with QTLs for adaptive traits in parr from four trans-Atlantic backcrosses. Heredity, 101:381–391.
Du ZQ, Ciobanu DC, Onteru SK, et al. 2010. A gene-based SNP linkage map for Pacific white shrimp, Litopenaeus vannamei. Anim Genet, 41:286–294.
Li P, Peatman E, Wang S, Feng J, He C, Baoprasertkul P, Xu P, Kucuktas H, Nandi S, Somridhivej B, Serapion J, Simmons M, Turan C, Liu L, Muir W, Dunham R, Brady Y, Grizzle J, Liu Z. 2007. Towards the ictalurid catfish transcriptome: Generation and analysis of 31,215 catfish ESTs. BMC Genomics, 8:177.
Litt M and Luty JA. 1989. A hypervariable microsatellite revealed by in vitro amplification of a dinucleotide repeat within the cardiac muscle actin gene. Am J Hum Genet, 44:397–401.
Liu ZJ and Cordes JF. 2004. DNA marker technologies and their applications in aquaculture genetics. Aquaculture, 238:1–37.
Lorenz S, Brenna-Hansen S, Moen T, et al. 2010. BAC-based upgrading and physical integration of a genetic SNP map in Atlantic salmon. Anim Genet, 41:48–54.
Moen T, Hayes B, Baranski M, et al. 2008. A linkage map of the Atlantic salmon (Salmo salar) based on EST-derived SNP markers. BMC Genomics, 9:223.
Moen T, Delghandi M, Wesmajervi MS, Westgaard JI, and Fjalestad KT. 2009. A SNP/microsatellite genetic linkage map of the Atlantic cod (Gadus morhua). Anim Genet, 40:993–996.
Nakamura Y. 2009. DNA variations in human and medical genetics: 25 years of my experience. J Hum Genet, 54:1–8.
Narum S, Campbell N, and Yi Y. 2009. High throughput genotyping in salmon and steelhead (Abstract). Plant and Animal Genomics Conference, San Diego, CA.
Sánchez CC, Smith TP, Wiedmann RT, et al. 2009. Single nucleotide polymorphism discovery in rainbow trout by deep sequencing of a reduced representation library. BMC Genomics, 10:559.
Wang K, Baldassano R, Zhang H, et al. 2010a. Comparative genetic analysis of inflammatory bowel disease and type 1 diabetes implicates multiple loci with opposite effects. Hum Mol Genet, 19:2059–2067.
Wang S, Sha Z, Sonstegard TS, et al. 2008. Quality assessment parameters for EST-derived SNPs from catfish. BMC Genomics, 9:450.
Wang S, Peatman E, Abernathy J, et al. 2010b. Assembly of 500,000 inter-specific catfish expressed sequence tags and large scale gene-associated marker development for whole genome association studies. Genome Biol, 11:R8.

Chapter 9

SNP Analysis with Duplicated Fish Genomes: Differentiation of SNPs, Paralogous Sequence Variants, and Multisite Variants

Cecilia Castaño Sánchez, Yniv Palti, and Caird Rexroad

Applications of SNP Markers

Most economically important traits in agriculture are quantitative phenotypes; their variation is continuous rather than falling into discrete classes. Disease resistance, growth, meat quality, and stress are examples of quantitative traits. The genetic basis of those phenotypic variations lies in the combined effects of several genes or loci, named quantitative trait loci (QTL) (Massault et al., 2008). The detection of linkage between QTL and genetic markers in a segregating population can be a powerful method to track the QTL. Genetic loci that are physically close in the genome tend to segregate together during meiosis; therefore, their recombination fractions are close to zero (Ott, 1991). Linkage maps are composed of genetic markers that are grouped based on their recombination frequencies and represented according to their position in the genome. Those maps provide the tools with which to study the association of genetic markers with QTL. Once identified, the genetic markers associated with the QTL can be used in breeding programs to identify and select individuals carrying desired traits. Such a selection tool is called marker-assisted selection (MAS) (Liu and Cordes, 2004; Liu, 2007; Rothschild and Ruvinsky, 2007; Editorial, 2009). There are different kinds of genetic markers, including restriction fragment length polymorphisms (RFLPs), random amplified polymorphic DNAs (RAPDs), amplified fragment length polymorphisms (AFLPs), microsatellites, and single-nucleotide polymorphisms (SNPs) (Ruane and Sonnino, 2007).
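The relationship between recombination fraction and map position can be made concrete with a small sketch. It estimates a recombination fraction from offspring counts and converts it to a map distance with Haldane's mapping function, d = -50 ln(1 - 2r) centimorgans. Haldane's function is not named in this chapter and is used here as an assumption (it presumes no crossover interference); the function names and the example counts are hypothetical.

```python
import math


def recombination_fraction(n_recombinant, n_total):
    """Fraction of offspring that are recombinant between two markers."""
    if n_total <= 0 or not 0 <= n_recombinant <= n_total:
        raise ValueError("counts must satisfy 0 <= n_recombinant <= n_total, n_total > 0")
    return n_recombinant / n_total


def haldane_distance_cm(r):
    """Map distance in centimorgans under Haldane's function (assumes no interference)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5); r = 0.5 means the loci appear unlinked")
    return -50.0 * math.log(1.0 - 2.0 * r)


# Hypothetical example: 10 recombinants among 100 informative offspring
r = recombination_fraction(10, 100)
print(f"r = {r:.2f}, distance = {haldane_distance_cm(r):.2f} cM")
```

Tightly linked loci (r close to zero) give distances close to zero, which matches the statement above that physically close loci have recombination fractions near zero.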

SNP Markers in Agriculture

Advances in DNA sequencing technologies have recently propelled the discovery of thousands of SNP markers (Barbazuk et al., 2007; Novaes et al., 2008; Satkoski et al., 2008; Van Tassell et al., 2008; Wiedmann et al., 2008; Amaral et al., 2009; Castaño Sánchez et al., 2009; Kerstens et al., 2009; Hyten et al., 2010). Additionally, novel SNP genotyping technologies have facilitated the simultaneous analysis of great numbers of markers and, consequently, the construction of high-density genetic maps, giving new impetus to the search for polymorphisms underlying variations in complex traits through the use of genome-wide association (GWA) studies (Goddard and Hayes, 2009). Based on genome-wide dense maps, it is possible to estimate animals’ breeding values by evaluating and summing the effects of genes or chromosomal segments simultaneously (Meuwissen et al., 2001). When using SNP panels covering whole genomes, QTL are in linkage disequilibrium with at least one marker, maximizing the proportion of genetic variance explained by the SNPs (Verbyla et al., 2009). Affymetrix (www.affymetrix.com/estore/) and Illumina (www.illumina.com/agriculture) have made commercially available SNP platforms that multiplex thousands of SNPs and can be used to perform GWA studies; for some agricultural species, those platforms include thousands of markers. The Affymetrix Bovine GeneChip 10 and 25 K have a SNP panel of approximately 25,000 markers that have been validated on Bos taurus and Bos indicus; they have been used in GWA studies of linkage disequilibrium, twinning rates, and other important traits (Daetwyler et al., 2008; Sargolzaei et al., 2008; Kim et al., 2009). The Illumina Bovine SNP50 chip contains 54,001 SNPs that were validated in 19 dairy and beef breeds (Matukumalli et al., 2009). In a very short time, it has been used to demonstrate that genome-enhanced estimates of genetic merit increase reliability of predictions in Holstein dairy cattle (Vanraden et al., 2009; Wiggans et al., 2009).
It has also been used in a growth study in crossbred beef cattle (Snelling et al., 2010), in a bovine Johne’s disease QTL study in Holstein cattle (Pant et al., 2010), and in other large-scale and high-impact studies (e.g., Decker et al., 2009; Hayes et al., 2009). Illumina has recently announced the release of a next generation high-density SNP chip (BovineHD) that will interrogate more than 500,000 loci. Other studies have used Illumina’s custom GoldenGate SNP arrays to study population structure and disease in cattle breeds (Mckay et al., 2008; Murdoch et al., 2010) as well as feed intake and efficiency (Sherman et al., 2010). The Illumina Porcine SNP60 has 62,103 SNPs evenly spaced to offer a comprehensive coverage of the porcine genome (Ramos et al., 2009). The Equine SNP50 contains 54,602 SNPs uniformly distributed across the equine genome and validated in 15 horse breeds, and it has been used in a whole genome association study of the lavender foal syndrome (Gabreski et al., 2009). The Ovine SNP50 features 54,241 SNPs validated in 23 breeds and has been used in congenital abnormality studies (Becker et al., 2010); in addition, custom high-throughput SNP assays have been developed to study the genetic structure of sheep breeds (Kijas et al., 2009). The maize SNP50 BeadChip contains 56,110 markers derived from the B73 maize line reference sequence, and custom SNP platforms are available for maize genotyping (Yan et al., 2010). For other agricultural species, high-throughput custom assays have also been used, for example in linkage mapping and a genetic diversity study in commercial chicken breeds (Muir et al., 2008; Groenen et al., 2009) and in genotyping and mapping of polyploid wheat, barley, soybean, and spruce (Pavy et al., 2008; Akhunov et al., 2009; Close et al., 2009; Hyten et al., 2010; Paux et al., 2010).
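The idea described in this section, that a breeding value can be estimated by summing the effects of many markers simultaneously (Meuwissen et al., 2001), can be sketched in a few lines. This is only an illustration of the summation step: the marker effects below are made-up numbers, genotypes are coded as allele dosages (0, 1, or 2 copies of a reference allele), and the statistical estimation of the effects themselves is not shown. The function name is hypothetical.

```python
def genomic_breeding_value(dosages, effects):
    """Sum of (allele dosage x estimated additive effect) over all SNPs.

    dosages: allele counts (0, 1, or 2) for one animal, one entry per SNP.
    effects: estimated additive effect per SNP (illustrative values here).
    """
    if len(dosages) != len(effects):
        raise ValueError("dosages and effects must have the same length")
    if any(d not in (0, 1, 2) for d in dosages):
        raise ValueError("dosages must be 0, 1, or 2")
    return sum(d * e for d, e in zip(dosages, effects))


# Made-up effects for three SNPs and two hypothetical animals
effects = [0.5, -0.2, 0.1]
animals = {"animal_1": [2, 1, 0], "animal_2": [0, 2, 2]}
for name, dosages in animals.items():
    print(name, round(genomic_breeding_value(dosages, effects), 3))
```

With tens of thousands of SNPs, as on the chips listed above, the same sum simply runs over more markers; ranking animals by this value is what genome-enabled selection relies on.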


SNP Markers in Aquaculture Species

Domestication of aquaculture species has occurred more recently than that of agricultural plants and livestock, and genetic selection programs are at their initial stages for most of them. Therefore, MAS could be of great importance in developing those programs. However, even though some QTL have been identified, MAS is not being used in aquaculture breeding, mainly due to the lack of dense molecular maps (Sonesson, 2007). Genetic linkage maps are available for several aquaculture species and were constructed predominantly with AFLP and microsatellite markers; in most cases, marker densities are rather low and the markers are unevenly spread over the genome (Sonesson, 2007). Most QTL have been identified in salmonids, which are the aquaculture species with better developed linkage maps and a longer history of domestication. Several QTL studies have been conducted to identify loci associated with disease resistance in salmonid species (Ozaki et al., 2001; Nichols et al., 2003; Rodriguez et al., 2004; Moen et al., 2007, 2009; Barroso et al., 2008; Houston et al., 2008a, b; Gharbi et al., 2009). Other traits such as thermal tolerance (Danzmann et al., 1999; Perry et al., 2001), spawning time (Sakamoto et al., 1999; Araneda et al., 2009), maturation (Haidle et al., 2008), development rate (Sundin et al., 2005; Nichols et al., 2007), and albinism (Nakamura et al., 2001) have been studied in trout and salmon species. QTL studies have also been conducted in other aquaculture species, such as tilapia (Palti et al., 2002; Shirak et al., 2002; Cnaani et al., 2003, 2004; Lee et al., 2003, 2004; Moen et al., 2004), shrimp (Li et al., 2006; Lyons et al., 2007), and oyster (Yu and Guo, 2006). Microsatellites have been the most frequently used markers in these QTL studies. SNPs found within or near a coding sequence are more likely to alter the biological function of a protein.
Gene-associated SNPs are suitable markers for mapping in comparative genome studies and in MAS (Sarropoulou et al., 2008; Wang et al., 2008). Currently, only a few publications describe the use of SNPs for QTL analyses in aquatic species, including a conservation genomics QTL study in Atlantic salmon (Boulding et al., 2008) and QTL studies in shrimp and prawns (Zeng et al., 2008; Ciobanu et al., 2009; Thanh et al., 2010). SNPs have also been incorporated in linkage maps (Moen et al., 2008a; Kucuktas et al., 2009; Lorenz et al., 2009). However, unlike in other agricultural species, the availability of large numbers of SNPs in aquaculture species is still very limited. SNPs have been identified from alignments of expressed sequence tag (EST) data in several species: bivalves (Tanguy et al., 2008); abalone (Bester et al., 2008); shrimp (Gorbach et al., 2009); Atlantic cod (Moen et al., 2008b); catfish (He et al., 2003; Wang et al., 2008, 2010); and salmonids (Smith et al., 2005; Hayes et al., 2007b). Even though a large number of SNPs have been reported for catfish (Wang et al., 2010), few studies have reported high-throughput SNP discovery in aquaculture species (Kent et al., 2008; Castaño Sánchez et al., 2009). Currently, there are no commercial high-throughput SNP genotyping assays available, but a 16 K Illumina Infinium BeadChip was recently developed for Atlantic salmon (Kent et al., 2008). High-throughput SNP assays of thousands of markers could facilitate QTL studies, whole genome association studies, whole genome-enabled selection, and MAS programs in aquaculture. In addition, smaller scale assays of tens or hundreds of SNPs can successfully replace microsatellites for parentage assignment and pedigree identification, as genetic marker tagging is often used in aquaculture breeding programs (e.g., Palti et al., 2006; Pierce et al., 2008).

Genome Duplication in Fish

The 2R genome duplication hypothesis identifies two rounds of genome duplication in ancestral vertebrates, one immediately before and one immediately after the divergence of the lamprey lineage 500–800 million years ago (MYA) (Wolfe, 2001). The first tetraploidy event separated cephalochordates from early agnathan (jawless) vertebrates. The second event apparently occurred within the subphylum Vertebrata and coincided with the development of jaws from the second gill arch that took place in Ordovician vertebrates. Evidence of these events can be seen in certain gene families. Amphioxus (a cephalochordate) has a single Hox gene cluster, while lampreys (jawless vertebrates) are equipped with two independent Hox gene clusters. Jawed vertebrates such as birds or humans show evidence of the second tetraploidization event by presenting three to four separate clusters of Hox genes. The presence of four related sets of genes, on different chromosomes, in vertebrate genomes is not confined to Hox genes (reviewed by Ohno, 1999). Analysis of bony fish genomes has indicated that Hox clusters and many other genome loci are present at a higher copy number than in mammals, providing evidence for an ancient whole genome duplication of the teleost lineage after it split from the lobe-finned lineage 325–350 MYA. For example, seven Hox gene clusters were identified in the zebrafish and other ray-finned fish genomes compared with only four in mammals (Amores et al., 1998, 2004). The fish-specific genome duplication hypothesis, or the so-called 3R hypothesis, has been validated in recent years by the identification of hundreds of duplicated loci orthologous to single-copy mammalian loci as more fish genomes have been sequenced (Christoffels et al., 2004; Jaillon et al., 2004). Some fish species are believed to have had an additional (4R) round of genome duplication late in the evolution of the teleosts that might have led to their speciation.
Among these are the catostomid fishes (Uyeno and Smith, 1972), salmonids (Allendorf and Thorgaard, 1984), common carp (Larhammar and Risinger, 1994), and goldfish. Genome duplication in catostomid fishes and common carp is supported by karyotype observation; the Cyprinidae, from which these groups of fish diverged, have 2n = 48 to 50 chromosomes, while catostomid fishes and common carp have chromosome numbers of 2n = 100 (Ohno et al., 1967; Uyeno and Smith, 1972). Further molecular evidence has supported the 4R genome duplication theory in common carp and in the family Salmonidae (David et al., 2003; Phillips et al., 2003; Koop et al., 2008). In salmonids, at least 13 Hox clusters have been identified, compared with seven to eight clusters in the “diploid” or 3R teleost species (Moghadam et al., 2005a, b). The salmonid genome duplication event is thought to be the result of tetraploidization within the same ancestral species (autotetraploidy) that occurred 25–100 MYA (Allendorf and Thorgaard, 1984). The common carp genome duplication event is thought to have occurred in the lineage that led to common carp and goldfish 11–21 MYA, and it is likely the result of hybridization between two distinct genomes (allotetraploidy) (David et al., 2003).


Identifying and Genotyping SNPs in Duplicated Fish Genomes

SNP Discovery in Duplicated Fish Genomes

High-throughput SNP discovery in species having nonduplicated genomes is a relatively straightforward process. It involves alignment of sequences to identify different alleles of the same locus. SNP discovery in duplicated genomes is much more complex. Three types of sequence variation can be found in alignments of related DNA sequences in duplicated genomes: (1) paralogous sequence variants (PSVs) are fixed sites and not polymorphisms; they are evidence of sequence differences between paralogous copies diverged from a common ancestral gene; (2) SNPs are polymorphic variations and differ between allelic copies of a single gene; and (3) multisite variants (MSVs) are polymorphisms (SNPs) found across paralogous sequences (Lindsay et al., 2006). The big challenge in the discovery of SNP markers in duplicated genomes is the ability to identify and discriminate the different kinds of sequence variants. When analyzing sequence data, all variants appear to be polymorphic sites and it is difficult to identify true SNPs. In aquaculture species, where little DNA sequence data is available, SNP discovery strategies have included analyzing EST libraries (see Chapter 6). The use of ESTs for SNP discovery in salmonids presented particular challenges, resulting in a high number of putative EST-based SNPs actually being PSVs and MSVs rather than true SNPs (Ryynanen and Primmer, 2006; Hayes et al., 2007a, b; Castaño Sánchez et al., 2009). In addition, it is difficult to distinguish sequencing errors in EST libraries from true polymorphisms, which also contributes to low validation rates for putative SNPs.
A study in Atlantic salmon (Hayes et al., 2007a) that identified over 2000 putative SNPs from EST data found a high proportion of putative markers showing significant heterozygote excess (likely MSVs) or complete absence of homozygotes, which indicated, most likely, that those variants were PSVs. Ryynanen and Primmer (2006), in their efforts to overcome problems related to genome duplication in Atlantic salmon, introduced a new DNA sequencing strategy to be used for SNP discovery. Frequently, in species with little genomic information, SNP identification processes involve the design of sequencing primers in conserved regions of gene sequences from closely related species to amplify conserved DNA regions of the targeted species; these approaches are termed “exon-primed intron-crossing” (EPIC). This study investigated two different EPIC approaches. In the first approach, primers were designed either on flanking exonic sequences of salmonid genes or flanking exonic sequences of other teleost species. In the second strategy, in an attempt to avoid amplification of potential paralogs, a new intron-primed exon-crossing (IPEC) method was introduced where at least one primer was designed in more variable (intronic) regions of salmonid genes. Exon-targeted primers were relatively unsuccessful compared with the proposed IPEC approach. The proportion of successful amplification of loci in which polymorphisms were identified was 36% in IPEC versus 6% in the other strategies. Even though the success rate of this strategy is higher than that of previous sequencing strategies, it is also time-consuming and not suitable for high-throughput SNP discovery.
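The genotype patterns described above (complete absence of homozygotes, or a significant excess of heterozygotes) can be screened for with a simple Hardy-Weinberg test. The sketch below is an illustration under stated assumptions, not the procedure used by Hayes et al. (2007a): it uses a one-degree-of-freedom chi-square test with the conventional 5% critical value of 3.841, applies no multiple-testing correction, and the function names, labels, and genotype counts are hypothetical.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Return (chi-square statistic, expected counts) under Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return chi2, expected


def flag_putative_snp(n_aa, n_ab, n_bb, critical=3.841):
    """Heuristic label for a putative SNP in a duplicated genome (assumed thresholds)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    if n_ab == n:
        return "likely PSV"  # every individual appears heterozygous
    if n_aa == n or n_bb == n:
        return "monomorphic in this panel"
    chi2, expected = hwe_chi_square(n_aa, n_ab, n_bb)
    if chi2 > critical and n_ab > expected[1]:
        return "possible MSV"  # significant heterozygote excess
    return "candidate SNP"


# Hypothetical genotype counts (AA, AB, BB)
for counts in [(25, 50, 25), (5, 90, 5), (0, 100, 0)]:
    print(counts, "->", flag_putative_snp(*counts))
```

A marker that passes this screen is not thereby validated; it only fails to show the two warning patterns discussed in the text.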

Figure 9.1 Incorrect assembly of paralogous sequences. Genomic DNA sequencing of SNP flanking regions revealed incorrect assemblies of paralogous sequences during the SNP discovery process. The SNaPshot primer had been designed in a region with distinct paralogous differentiation. [Figure labels: PCR forward, SNaPshot primer, SNP.] See color insert.

Figure 9.2 Presence of introns in the amplicons in rainbow trout. Genomic DNA sequencing of SNP flanking regions revealed presence of introns in the amplicon sequences. The SNaPshot primer had been designed in an intron–exon boundary region. The first sequence in the figure is the EST sequence used to design the SNaPshot primer. [Figure labels: SNP, SNaPshot primer (part a), SNaPshot primer (part b), Intron.] See color insert.

Castaño Sánchez et al. (2009), in their first attempt to find SNPs in rainbow trout, evaluated multiple bioinformatic pipelines for their ability to detect SNPs from EST data. The genome duplication resulted in assemblies of paralogous sequences, which led to the identification of a large proportion of false positives. In addition, a high number of the putative markers were not validated due to PCR amplification problems caused by incorrect assemblies of paralogous sequences (Figure 9.1), presence of introns in the amplicons (Figure 9.2), and nonspecific binding of primers due to amplification of multiple genes. A customized SNP discovery process using only 3′ ESTs and stringent assembly parameters was then developed to overcome issues associated with the genome duplication. The 3′ EST sequences were selected to avoid intron–exon boundaries, increase primer binding specificity, and raise the probability of finding SNPs in untranslated regions (UTRs). However, the results were not improved and no markers were validated. The authors then employed a high-throughput sequencing strategy, using pyrosequencing technology to generate data from a genomic DNA reduced representation library. Over 20,000 putative SNPs were identified; however, the validation rate (48%) was substantially lower than in livestock animals where the same methodology of SNP discovery was used (>90%) (Van Tassell et al., 2008; Wiedmann et al., 2008).

SNP Validation and Genotyping in Duplicated Fish Genomes

Current SNP genotyping technologies present further challenges in the use of SNPs in species with duplicated genomes since they have not been designed to differentiate true SNPs from PSVs or MSVs. ABI’s SNaPshot technology, for example, uses fluorescent dyes to identify the four possible alleles of each SNP (A: green; C: black; G: blue; T: red); genotyping chromatograms are represented as colored peaks, one peak for each allele (Figure 9.3). In this case, it is not feasible to discriminate true SNPs from other types of variants since they would look the same on the chromatograms. One possible solution to overcome this problem would be the use of double haploid organisms for SNP validation, as double haploids are nearly 100% homozygous and marker assays showing heterozygous genotypes in double haploids are likely detecting false SNPs.

Figure 9.3 ABI’s GeneMapper SNP graphs. Columns represent homozygote and heterozygote genotypes for six SNPs. The genotypes in the first five SNPs were G/A (blue and green peaks), while the last one (V5666) was G/C (blue/black). See color insert.

Medium- and high-throughput SNP genotyping platforms (see Chapter 8) have not taken into consideration the existence of duplicated genomes either, and the algorithms used for genotype calling are not adequate to identify the different variations. In their attempts to validate putative SNPs in Pacific salmon, Smith et al. (2005) used ABI’s TaqMan assay-by-design technology, which discriminates the different alleles in each SNP using fluorescent probes. The sequence detection software then plots the fluorescence emitted by the two allele-specific probes against each other as a scatter plot. The authors suggest that assays in which more than three genotype clusters (two homozygous plus one heterozygous genotype cluster) were detected may represent PSVs or MSVs rather than true SNPs. The multiple clusters observed in such assays were likely caused by assay primers or probes that were not locus specific. Similarly, Illumina’s BeadStudio software, which was designed to analyze data from GoldenGate or Infinium assays, uses a custom clustering algorithm to identify and call the different genotypes. Genotypes are represented in graphs that plot intensities of allele A (green fluorescence) over allele B (red); cluster A/A being homozygous samples for allele A, cluster B/B being allele B homozygous, and cluster A/B (purple) being heterozygous samples (Figure 9.4A). Kent et al. (2008) used Infinium assays to validate SNPs in Atlantic salmon. Putative SNPs that were heterozygous for all samples were considered PSVs (Figure 9.4B).
When five distinctive clusters were identified, variants were considered to be MSVs in which both paralogous sequences were polymorphic (Figure 9.4C). In MSVs for which only one paralog locus is polymorphic, the genotype graph would be shifted toward one of the alleles, like the example in Figure 9.4D, where approximately 10% of the samples were considered A/A-B/B genotype. This approach enables SNP validation in species with genome duplication if a large number of samples (approximately 3000) are included in the validation panel for genotyping. This not only allows distinct clusters to form for MSVs but also increases the cost of validation compared with nonduplicated species. Another caveat is that the MSV SNPs would need to be analyzed individually, which is time-consuming and less feasible for validating SNPs identified using high-throughput discovery approaches. Double haploids can be used as controls in Illumina assay validation as well. BeadStudio presents graphs for each sample in which it is possible to identify the SNPs that were homozygous or heterozygous (Figure 9.5A). As discussed above, all the validated SNPs should appear homozygous in double haploids (Figure 9.5B).

Figure 9.4 Illumina BeadStudio Atlantic salmon SNP graphs (figures modified from Kent et al. 2008 with permission). Genotypes in these graphs are represented in clusters. Each dot represents one genotyped individual: red clusters symbolize homozygote individuals for allele A (A/A) and blue clusters for allele B (B/B); purple dots represent heterozygotes (A/B); and black dots unscored individuals. (A) SNP. In this SNP (ESTNV_30477_550), 156 individuals were homozygotes for allele A, 1155 were heterozygotes (A/B), and 1800 were homozygotes for allele B. (B) Paralogous sequence variants (PSVs). Since all genotyped individuals (3114) presented both alleles (A/B), this putative SNP (ESTNV_20430_540) was verified as a PSV. (C) Multisite sequence variants (MSVs): Both paralogs are polymorphic. In this SNP (AY388579_a), both paralogs are polymorphic. Cluster A/A-A/B represents individuals who are homozygous for one paralog (A/A) and heterozygous in the other one (A/B), and cluster A/B-B/B is the opposite. (D) MSVs: One paralog is fixed, the other is polymorphic. One paralog sequence is fixed (B/B-A/B), while the other one is polymorphic (A/B-A/B). See color insert.

Figure 9.5 Illumina BeadStudio sample graphs. (A) Rainbow trout sample graph. Sample graphs represent the genotypes of all analyzed SNPs for one sample. Red and blue dots represent all the SNPs for which this particular individual (USDA04_M) was homozygous for allele A (A/A) and B (B/B), respectively; purple dots symbolize the heterozygous SNPs. (B) Double haploid rainbow trout sample graph. Validated SNPs should be all homozygous in double haploid organisms as reflected by the absence of purple dots. [Panel titles: (A) USDA04_M; (B) Whale_Rock_Female_Line. Axes: Intensity (A) versus Intensity (B).] See color insert.
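The two checks discussed in this section can be sketched as simple filters: flagging putative SNPs that show heterozygous calls in double haploid controls, and interpreting the number of genotype clusters seen in an assay. This is a minimal illustration, assuming genotype calls are already available as strings ("AA", "AB", "BB", or None for no call); the function names, SNP identifiers, and labels are hypothetical, and the cluster rule does not detect the shifted-cluster case in which one paralog is fixed.

```python
def heterozygous_in_double_haploids(calls_by_snp):
    """Return sorted IDs of SNPs with at least one heterozygous call in double haploids.

    calls_by_snp: dict mapping SNP ID -> list of calls ("AA", "AB", "BB", or None)
    obtained from double haploid samples only.
    """
    flagged = [snp_id for snp_id, calls in calls_by_snp.items()
               if any(call == "AB" for call in calls)]
    return sorted(flagged)


def interpret_cluster_count(n_clusters, all_heterozygous=False):
    """Rule-of-thumb reading of a genotype cluster plot (assumed simplification)."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    if all_heterozygous:
        return "likely PSV"
    if n_clusters > 3:
        return "possible PSV or MSV"
    return "consistent with a true SNP"


# Hypothetical calls from two double haploid control samples
controls = {
    "snp_001": ["AA", "AA"],
    "snp_002": ["AB", "BB"],   # heterozygous call in a double haploid: suspect
    "snp_003": ["BB", None],
}
print(heterozygous_in_double_haploids(controls))
print(interpret_cluster_count(5))
```

Markers flagged by either check would still need to be inspected individually, which is the labor cost noted above.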

Concluding Remarks

As all ray-finned fish share an additional (3R) round of ancestral genome duplication in their evolutionary history compared with mammals and birds, researchers working with them are likely to encounter a higher frequency of PSVs and MSVs than is known from using SNPs in other livestock animals. However, this problem for SNP validation and for developing high-throughput SNP assays is even more pronounced in salmonids (of which rainbow trout, Atlantic salmon, and Pacific salmon are considered important aquaculture commodities) and the common carp, which have undergone more recent and independent 4R whole genome duplications. Therefore, the authors of this chapter believe that a reference genome sequence will be required to enable validation of SNPs from high-throughput discovery projects, and thus to enable the design and development of high-throughput SNP genotyping assays for whole genome association assays and whole genome-enabled selection in aquaculture species with duplicated genomes such as salmonids and the common carp.

Acknowledgment

We thank Dr. Thomas Kent for contributing the images in Figures 9.4 and 9.5 showing SNP genotyping assays in Atlantic salmon. The use of trade, firm, or corporation names in this publication is for the information and convenience of the reader. Such use does not constitute an official endorsement or approval by the United States Department of Agriculture (USDA), Agricultural Research Service, of any product or service to the exclusion of others that may be suitable.

References

Akhunov E, Nicolet C, and Dvorak J. 2009. Single nucleotide polymorphism genotyping in polyploid wheat with the Illumina GoldenGate assay. Theor Appl Genet, 119:507–517. Allendorf FW and Thorgaard GH. 1984. Tetraploidy and the evolution of salmonid fishes. In: Evolutionary Genetics of Fishes, edited by BJ Turner. Plenum Press, New York. Amaral AJ, Megens HJ, Kerstens HH, Heuven HC, Dibbits B, Crooijmans RP, Den Dunnen JT, and Groenen MA. 2009. Application of massive parallel sequencing to whole genome SNP discovery in the porcine genome. BMC Genomics, 10:374.


Amores A, Force A, Yan Y-L, Joly L, Amemiya C, Fritz A, Ho RK, Langeland J, Prince V, Wang Y-L, Westerfield M, Ekker M, and Postlethwait J. 1998. Zebrafish hox clusters and vertebrate genome evolution. Science, 282:1711–1714. Amores A, Suzuki T, Yan Y-L, Pomeroy J, Singer A, Amemiya C, and Postlethwait JH. 2004. Developmental roles of pufferfish hox clusters and genome evolution in ray-fin fish. Genome Res, 14:1–10. Araneda C, Lam N, Diaz NF, Cortez S, Perez C, Neira R, and Iturra P. 2009. Identification, development and characterization of three molecular markers associated to spawning date in Coho salmon (Oncorhynchu kisutch). Aquaculture, 296:21–26. Barbazuk WB, Emrich SJ, Chen HD, Li L, and Schnable PS. 2007. SNP discovery via 454 transcriptome sequencing. Plant J, 51:910–918. Barroso RM, Wheeler PA, Lapatra SE, Drew RE, and Thorgaard GH. 2008. QTL for IHNV resistance and growth identified in a rainbow trout (Oncorhynchus mykiss) × Yellowstone cutthorat (Oncorhynchus clarki bouvieri) trout cross. Aquaculture, 277:156–163. Becker D, Tetens J, Brunner A, Burstel D, Ganter M, Kijas J, and Drogemuller C. 2010. Microphthalmia in Texel sheep is associated with a missense mutation in the paired-like homeodomain 3 (PITX3) gene. PLoS ONE, 5:e8689. Bester AE, Roodt-Wilding R, and Whitaker HA. 2008. Discovery and evaluation of single nucleotide polymorphisms (SNPs) for Haliotis midae: A targeted EST approach. Anim Genet, 39:321–324. Boulding EG, Culling M, Glebe B, Berg PR, Lien S, and Moen T. 2008. Conservation genomics of Atlantic salmon: SNPs associated with QTLs for adaptive traits in parr from four transAtlantic backcrosses. Heredity, 101:381–391. Castaño Sánchez C, Smith TP, Wiedmann RT, Vallejo RL, Salem M, Yao J, and Rexroad CE, 3rd. 2009. Single nucleotide polymorphism discovery in rainbow trout by deep sequencing of a reduced representation library. BMC Genomics, 10:559. Christoffels A, Koh EGL, Chia J-M, Brenner S, Aparicio S, and Venkatesh B. 2004. 
Fugu genome analysis provides evidence for a whole-genome duplication early during the evolution of ray-finned fishes. Mol Biol Evol, 21:1146–1151. Ciobanu DC, Bastiaansen JW, Magrin J, Rocha JL, Jiang DH, Yu N, Geiger B, Deeb N, Rocha D, Gong H, Kinghorn BP, Plastow GS, Van Der Steen HA, and Mileham AJ. 2009. A major SNP resource for dissection of phenotypic and genetic variation in Pacific white shrimp (Litopenaeus vannamei). Anim Genet, 41:39–47. Close TJ, Bhat PR, Lonardi S, Wu Y, Rostoks N, Ramsay L, Druka A, Stein N, Svensson JT, Wanamaker S, Bozdag S, Roose ML, Moscou MJ, Chao S, Varshney RK, Szucs P, Sato K, Hayes PM, Matthews DE, Kleinhofs A, Muehlbauer GJ, Deyoung J, Marshall DF, Madishetty K, Fenton RD, Condamine P, Graner A, and Waugh R. 2009. Development and implementation of high-throughput SNP genotyping in barley. BMC Genomics, 10:582. Cnaani A, Hallerman EM, Ron M, Weller JI, Indelman M, Kashi Y, Gall GAE, and Hulata G. 2003. Detection of a chromosomal region with two quantitative trait loci, affecting cold tolerance and fish size, in an F2 tilapia hybrid. Aquaculture, 223:117–128. Cnaani A, Zilberman N, Tinman S, Hulata G, and Ron M. 2004. Genome-scan analysis for quantitative trait loci in an F 2 tilapia hybrid. Mol Genet Genomics, 272:162–172. Daetwyler HD, Schenkel FS, Sargolzaei M, and Robinson JA. 2008. A genome scan to detect quantitative trait loci for economically important traits in Holstein cattle using two methods and a dense single nucleotide polymorphism map. J Dairy Sci, 91:3225–3236. Danzmann RG, Jackson TR, and Ferguson MM. 1999. Epistasis in allelic expression at upper temperature tolerance QTL in rainbow trout. Aquaculture, 173:45–58. David L, Blum S, Feldman MW, Lavi U, and Hillel J. 2003. Recent duplication of the common carp (Cyprinus carpio L.) genome as revealed by analyses of microsatellite loci. Mol Biol Evol, 20:1425–1434.


Decker JE, Pires JC, Conant GC, Mckay SD, Heaton MP, Chen K, Cooper A, Vilkki J, Seabury CM, Caetano AR, Johnson GS, Brenneman RA, Hanotte O, Eggert LS, Wiener P, Kim J-J, Kim KS, Sonstegard TS, Van Tassell CP, Neibergs HL, Mcewan JC, Brauning R, Coutinho LL, Babar ME, Wilson GA, Mcclure MC, Rolf MM, Kim J, Schnabel RD, and Taylor JF. 2009. Resolving the evolution of extant and extinct ruminants with high-throughput phylogenomics. Proc Natl Acad Sci U S A, 106:18644–18649. Editorial NB. 2009. The genome-assisted barnyard. Nat Biotechnol, 27:487–487. Gabreski N, Brooks S, Miller D, and Anczak D. 2009. Mapping of lavender foal syndrome using the EquineSNP50 chip. J Equine Vet Sci, 29:321–322. Gharbi K, Glover KA, Stone LC, Macdonald ES, Matthews L, Grimholt U, and Stear MJ. 2009. Genetic dissection of MHC-associated susceptibility to Lepeophtheirus salmonis in Atlantic salmon. BMC Genet, 10:20. Goddard ME and Hayes BJ. 2009. Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat Rev Genet, 10:381–391. Gorbach DM, Hu ZL, Du ZQ, and Rothschild MF. 2009. SNP discovery in Litopenaeus vannamei with a new computational pipeline. Anim Genet, 40:106–109. Groenen MA, Wahlberg P, Foglio M, Cheng HH, Megens HJ, Crooijmans RP, Besnier F, Lathrop M, Muir WM, Wong GK, Gut I, and Andersson L. 2009. A high-density SNP-based linkage map of the chicken genome reveals sequence features correlated with recombination rate. Genome Res, 19:510–519. Haidle L, Janssen JE, Gharbi K, Moghadam HK, Ferguson MM, and Danzmann RG. 2008. Determination of quantitative trait loci (QTL) for early maturation in rainbow trout (Oncorhynchus mykiss). Mar Biotechnol (NY), 10:579–592. Hayes B, Laerdahl JK, Lien S, Moen T, Berg P, Hindar K, Davidson WS, Koop BF, Adzhubei A, and Hoyhem B. 2007a. An extensive resource of single nucleotide polymorphism markers associated with Atlantic salmon (Salmo salar) expressed sequences. Aquaculture, 265:82–90. 
Hayes BJ, Nilsen K, Berg PR, Grindflek E, and Lien S. 2007b. SNP detection exploiting multiple sources of redundancy in large EST collections improves validation rates. Bioinformatics, 23:1692–1693. Hayes BJ, Bowman PJ, Chamberlain AJ, Savin K, Van Tassell CP, Sonstegard TS, and Goddard ME. 2009. A validated genome wide association study to breed cattle adapted to an environment altered by climate change. PLoS ONE, 4:e6676. He C, Chen L, Simmons M, Li P, Kim S, and Liu ZJ. 2003. Putative SNP discovery in interspecific hybrids of catfish by comparative EST analysis. Anim Genet, 34:445–448. Houston RD, Gheyas A, Hamilton A, Guy DR, Tinch AE, Taggart JB, Mcandrew BJ, Haley CS, and Bishop SC. 2008a. Detection and confirmation of a major QTL affecting resistance to infectious pancreatic necrosis (IPN) in Atlantic salmon (Salmo salar). Dev Biol (Basel), 132:199–204. Houston RD, Haley CS, Hamilton A, Guy DR, Tinch AE, Taggart JB, Mcandrew BJ, and Bishop SC. 2008b. Major quantitative trait loci affect resistance to infectious pancreatic necrosis in Atlantic salmon (Salmo salar). Genetics, 178:1109–1115. Hyten DL, Cannon SB, Song Q, Weeks N, Fickus EW, Shoemaker RC, Specht JE, Farmer AD, May GD, and Cregan PB. 2010. High-throughput SNP discovery through deep resequencing of a reduced representation library to anchor and orient scaffolds in the soybean whole genome sequence. BMC Genomics, 11:38. Jaillon O, Aury JM, Brunet F, Petit JL, Stange-Thomann N, Mauceli E, Bouneau L, Fischer C, Ozouf-Costaz C, Bernot A, Nicaud S, Jaffe D, Fisher S, Lutfalla G, Dossat C, Segurens B, Dasilva C, Salanoubat M, Levy M, Boudet N, Castellano S, Anthouard V, Jubin C, Castelli V, Katinka M, Vacherie B, Biemont C, Skalli Z, Cattolico L, Poulain J, De Berardinis V, Cruaud C, Duprat S, Brottier P, Coutanceau JP, Gouzy J, Parra G, Lardier G, Chapple


C, Mckernan KJ, Mcewan P, Bosak S, Kellis M, Volff JN, Guigo R, Zody MC, Mesirov J, Lindblad-Toh K, Birren B, Nusbaum C, Kahn D, Robinson-Rechavi M, Laudet V, Schachter V, Quetier F, Saurin W, Scarpelli C, Wincker P, Lander ES, Weissenbach J, and Roest Crollius H. 2004. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature, 431:946–957. Kent MP, Moen T, Hayes B, Gibbs RA, Weinstock GM, Omholt S, and Lien S. 2008. SNP discovery and linkage mapping in Atlantic salmon. Plant and Animal Genomes XVI Conference, San Diego, CA. Kerstens HH, Crooijmans RP, Veenendaal A, Dibbits BW, Chin AWT, Den Dunnen JT, and Groenen MA. 2009. Large scale single nucleotide polymorphism discovery in unsequenced genomes using second generation high throughput sequencing technology: Applied to turkey. BMC Genomics, 10:479. Kijas JW, Townley D, Dalrymple BP, Heaton MP, Maddox JF, Mcgrath A, Wilson P, Ingersoll RG, Mcculloch R, Mcwilliam S, Tang D, Mcewan J, Cockett N, Oddy VH, Nicholas FW, and Raadsma H. 2009. A genome wide survey of SNP variation reveals the genetic structure of sheep breeds. PLoS ONE, 4:e4668. Kim ES, Berger PJ, and Kirkpatrick BW. 2009. Genome-wide scan for bovine twinning rate QTL using linkage disequilibrium. Anim Genet, 40(3):300–307. Koop B, Von Schalburg K, Leong J, Walker N, Lieph R, Cooper G, Robb A, Beetz-Sargent M, Holt R, Moore R, Brahmbhatt S, Rosner J, Rexroad C, Mcgowan C, and Davidson W. 2008. A salmonid EST genomic study: Genes, duplications, phylogeny and microarrays. BMC Genomics, 9:545. Kucuktas H, Wang S, Li P, He C, Xu P, Sha Z, Liu H, Jiang Y, Baoprasertkul P, Somridhivej B, Wang Y, Abernathy J, Guo X, Liu L, Muir W, and Liu Z. 2009. Construction of genetic linkage maps and comparative genome analysis of catfish using gene-associated markers. Genetics, 181:1649–1660. Larhammar D and Risinger C. 1994. Molecular genetic aspects of tetraploidy in the common carp Cyprinus carpio. 
Mol Phylogenet Evol, 3:59–68. Lee BY, Penman DJ, and Kocher TD. 2003. Identification of a sex-determining region in Nile tilapia (Oreochromis niloticus) using bulked segregant analysis. Anim Genet, 34:379–383. Lee BY, Hulata G, and Kocher TD. 2004. Two unlinked loci controlling the sex of blue tilapia (Oreochromis aureus). Heredity, 92:543–549. Li Y, Dierens M, Byrne K, Miggiano E, Lehnert S, Preston N, and Lyons R. 2006. QTL detection of production traits for the Kuruma prawn Penaeus japonicus (Bate) using AFLP markers. Aquaculture, 258:198–210. Lindsay SJ, Khajavi M, Lupski JR, and Hurles ME. 2006. A chromosomal rearrangement hotspot can be identified from population genetic variation and is coincident with a hotspot for allelic recombination. Am J Hum Genet, 79:890–902. Liu ZJ, ed. 2007. Aquaculture Genome Technologies. Blackwell Publishing, Ames, IA. Liu ZJ and Cordes JF. 2004. DNA marker technologies and their applications in aquaculture genetics. Aquaculture, 238:1–37. Lorenz S, Brenna-Hansen S, Moen T, Roseth A, Davidson WS, Omholt SW, and Lien S. 2009. BAC-based upgrading and physical integration of a genetic SNP map in Atlantic salmon. Anim Genet, 41:48–54. Lyons RE, Dierens LM, Tan SH, Preston NP, and Li Y. 2007. Characterization of AFLP markers associated with growth in the Kuruma prawn, Marsupenaeus japonicus, and identification of a candidate gene. Mar Biotechnol (NY), 9:712–721. Massault C, Bovenhius H, Haley C, and De Koning DJ. 2008. QTL mapping designs for aquaculture. Aquaculture, 285:23–29. Matukumalli LK, Lawley CT, Schnabel RD, Taylor JF, Allan MF, Heaton MP, O’Connell J, Moore SS, Smith TP, Sonstegard TS, and Van Tassell CP. 2009. Development and characterization of a high density SNP genotyping assay for cattle. PLoS ONE, 4:e5350.


Mckay SD, Schnabel RD, Murdoch BM, Matukumalli LK, Aerts J, Coppieters W, Crews D, Dias Neto E, Gill CA, Gao C, Mannen H, Wang Z, Van Tassell CP, Williams JL, Taylor JF, and Moore SS. 2008. An assessment of population structure in eight breeds of cattle using a whole genome SNP panel. BMC Genet, 9:37. Meuwissen TH, Hayes BJ, and Goddard ME. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157:1819–1829. Moen T, Agresti JJ, Cnaani A, Moses H, Famula TR, Hulata G, Gall GAE, and May B. 2004. A genome scan of a four way tilapia cross supports the existence of a quantitative trait loci for cold tolerance on linkage group 23. Aquac Res, 35:893–904. Moen T, Sonesson AK, Hayes B, Lien S, Munck H, and Meuwissen TH. 2007. Mapping of a quantitative trait locus for resistance against infectious salmon anaemia in Atlantic salmon (Salmo salar): Comparing survival analysis with analysis on affected/resistant data. BMC Genet, 8:53. Moen T, Hayes B, Baranski M, Berg PR, Kjoglum S, Koop BF, Davidson WS, Omholt SW, and Lien S. 2008a. A linkage map of the Atlantic salmon (Salmo salar) based on EST-derived SNP markers. BMC Genomics, 9:223. Moen T, Hayes B, Nilsen F, Delghandi M, Fjalestad KT, Fevolden SE, Berg PR, and Lien S. 2008b. Identification and characterisation of novel SNP markers in Atlantic cod: Evidence for directional selection. BMC Genet, 9:18. Moen T, Baranski M, Sonesson AK, and Kjoglum S. 2009. Confirmation and fine-mapping of a major QTL for resistance to infectious pancreatic necrosis in Atlantic salmon (Salmo salar): Population-level associations between markers and trait. BMC Genomics, 10:368. Moghadam HK, Ferguson MM, and Danzmann RG. 2005a. Evidence for Hox gene duplication in rainbow trout (Oncorhynchus mykiss): A tetraploid model species. J Mol Evol, 61:804–818. Moghadam HK, Ferguson MM, and Danzmann RG. 2005b. 
Evolution of Hox clusters in Salmonidae: A comparative analysis between Atlantic salmon (Salmo salar) and rainbow trout (Oncorhynchus mykiss). J Mol Evol, 61:636–649. Muir WM, Wong GK, Zhang Y, Wang J, Groenen MA, Crooijmans RP, Megens HJ, Zhang H, Okimoto R, Vereijken A, Jungerius A, Albers GA, Lawley CT, Delany ME, Maceachern S, and Cheng H. 2008. Genome-wide assessment of worldwide chicken SNP genetic diversity indicates significant absence of rare alleles in commercial breeds. Proc Natl Acad Sci U S A, 105:17312–17317. Murdoch BM, Clawson ML, Laegreid WW, Stothard P, Settles M, Mckay S, Prasad A, Wang Z, Moore SS, and Williams JL. 2010. A 2cM resolution genome-wide scan of European Holstein cattle affected by classical BSE. BMC Genet, 11:20. Nakamura Y, Ozaki A, Akutsu T, Iwai K, Sakamoto T, Yoshizaki G, and Okamoto N. 2001. Genetic mapping of the dominant albino locus in rainbow trout (Oncorhynchus mykiss). Mol Genet Genomics, 265:687–693. Nichols KM, Bartholomew J, and Thorgaard GH. 2003. Mapping multiple genetic loci associated with Ceratomyxa shasta resistance in Oncorhynchus mykiss. Dis Aquat Organ, 56:145–154. Nichols KM, Broman KW, Sundin K, Young JM, Wheeler PA, and Thorgaard GH. 2007. Quantitative trait loci x maternal cytoplasmic environment interaction for development rate in Oncorhynchus mykiss. Genetics, 175:335–347. Novaes E, Drost DR, Farmerie WG, Pappas GJ, Jr., Grattapaglia D, Sederoff RR, and Kirst M. 2008. High-throughput gene and SNP discovery in Eucalyptus grandis, an uncharacterized genome. BMC Genomics, 9:312. Ohno S. 1999. Gene duplication and the uniqueness of vertebrate genomes circa 1970–1999. Semin Cell Dev Biol, 10:517–522. Ohno S, Muramoto J, Christian L, and Atkin NB. 1967. Diploid-tetraploid relationship among Old World members of the fish family Cyprinidae. Chromosoma, 23:1–9.


Ott J. 1991. Analysis of Human Genetic Linkage. The John Hopkins University Press, Baltimore, MD. Ozaki A, Sakamoto T, Khoo SK, Nakamura Y, Coimbra MR, Akutsu T, and Okamoto N. 2001. Quantitative trait loci (QTLs) associated with resistance/susceptibility to infectious pencreatic necorsis virus (IPNV) in rainbow trout (Oncorhynchus mykiss). Mol Genet Genomics, 265:23–31. Palti Y, Shirak A, Cnaani A, Hulata G, Avtalion RR, and Ron M. 2002. Detection of genes with deleterious alleles in an inbred line of tilapia (Oreochromis aureus). Aquaculture, 206:151–164. Palti Y, Silverstein JT, Wieman H, Phillips JG, Barrows FT, and Parsons JE. 2006. Evaluation of family growth response to fishmeal and gluten-based diets in rainbow trout (Oncorhynchus mykiss). Aquaculture, 255:548–556. Pant SD, Schenkel FS, Verschoor CP, You Q, Kelton DF, Moore SS, and Karrow NA. 2010. A principal component regression based genome wide analysis approach reveals the presence of a novel QTL on BTA7 for MAP resistance in Holstein cattle. Genomics, 95:176–182. Paux E, Faure S, Choulet F, Roger D, Gauthier V, Martinant JP, Sourdille P, Balfourier F, Le Paslier MC, Chauveau A, Cakir M, Gandon B, and Feuillet C. 2010. Insertion site-based polymorphism markers open new perspectives for genome saturation and marker-assisted selection in wheat. Plant Biotechnol J, 8:196–210. Pavy N, Pelgas B, Beauseigle S, Blais S, Gagnon F, Gosselin I, Lamothe M, Isabel N, and Bousquet J. 2008. Enhancing genetic mapping of complex genomes through the design of highly-multiplexed SNP arrays: Application to the large and unsequenced genomes of white spruce and black spruce. BMC Genomics, 9:21. Perry GM, Danzmann RG, Ferguson MM, and Gibson JP. 2001. Quantitative trait loci for upper thermal tolerance in outbred strains of rainbow trout (Oncorhynchus mykiss). Heredity, 86:333–341. Phillips RB, Zimmerman A, Noakes MA, Palti Y, Morasch MR, Eiben L, Ristow SS, Thorgaard GH, and Hansen JD. 2003. 
Physical and genetic mapping of the rainbow trout major histocompatibility regions: Evidence for duplication of the class I region. Immunogenetics, 55:561–569. Pierce LR, Palti Y, Silverstein JT, Barrows FT, Hallerman EM, and Parsons JE. 2008. Family growth response to fishmeal and plant-based diets shows genotype × diet interaction in rainbow trout (Oncorhynchus mykiss). Aquaculture, 278:37–42. Ramos AM, Crooijmans RP, Affara NA, Amaral AJ, Archibald AL, Beever JE, Bendixen C, Churcher C, Clark R, Dehais P, Hansen MS, Hedegaard J, Hu ZL, Kerstens HH, Law AS, Megens HJ, Milan D, Nonneman DJ, Rohrer GA, Rothschild MF, Smith TP, Schnabel RD, Van Tassell CP, Taylor JF, Wiedmann RT, Schook LB, and Groenen MA. 2009. Design of a high density SNP genotyping assay in the pig using SNPs identified and characterized by next generation sequencing technology. PLoS ONE, 4:e6524. Rodriguez MF, Lapatra S, Williams S, Famula T, and May B. 2004. Genetic markers associated with resistance to infectious necrosis in rainbow trout and steelhead trout (Oncorhynchus mykiss) backcrosses. Aquaculture, 241:93–115. Rothschild MF and Ruvinsky A. 2007. Marker-assisted selection for aquaculture species. In: Aquaculture Genome Technologies, edited by Z. Liu. Blackwell Publishing Professional, Ames, IA, pp. 199–214. Ruane J and Sonnino A. 2007. Chapter 1. Marker-assisted selection as a tool for genetic improvement of crops, livestock, forestry and fish in developing countries: An overview of the issues. In: Marker Assisted Selection: Curent Status and Future Perspectives in Crops, Livestock, Forestry and Fish, edited by J Ruane, BD Schert, A Sonnino, and JR Dorgie. Food and Agriculture of United Nations, Rome.


Ryynanen HJ and Primmer CR. 2006. Single nucleotide polymorphism (SNP) discovery in duplicated genomes: Intron-primed exon-crossing (IPEC) as a strategy for avoiding amplification of duplicated loci in Atlantic salmon (Salmo salar) and other salmonid fishes. BMC Genomics, 7:192. Sakamoto T, Danzmann RG, Okamoto N, Ferguson MM, and Ihssen PE. 1999. Linkage analysis of quantitative trait loci associated with spawning time in rainbow trout (Oncorhynchus mykiss). Aquaculture, 173:33–43. Sargolzaei M, Schenkel FS, Jansen GB, and Schaeffer LR. 2008. Extent of linkage disequilibrium in Holstein cattle in North America. J Dairy Sci, 91:2106–2117. Sarropoulou E, Nousdili D, Magoulas A, and Kotoulas G. 2008. Linking the genomes of nonmodel teleosts through comparative genomics. Mar Biotechnol (NY), 10:227–233. Satkoski JA, Malhi R, Kanthaswamy S, Tito R, Malladi V, and Smith D. 2008. Pyrosequencing as a method for SNP identification in the rhesus macaque (Macaca mulatta). BMC Genomics, 9:256. Sherman EL, Nkrumah JD, and Moore SS. 2010. Whole genome single nucleotide polymorphism associations with feed intake and feed efficiency in beef cattle. J Anim Sci, 88:16–22. Shirak A, Palti Y, Cnaani A, Korol A, Hulata G, Ron M, and Avtalion RR. 2002. Association between loci with deleterious alleles and distorted sex ratios in an inbred line of tilapia (Oreochromis aureus). J Hered, 93:270–276. Smith CT, Elfstrom CM, Seeb LW, and Seeb JE. 2005. Use of sequence data from rainbow trout and Atlantic salmon for SNP detection in Pacific salmon. Mol Ecol, 14:4193–4203. Snelling WM, Allan MF, Keele JW, Kuehn LA, Mcdaneld T, Smith TP, Sonstegard TS, Thallman RM, and Bennett GL. 2010. Genome-wide association study of growth in crossbred beef cattle. J Anim Sci, 88:837–848. Sonesson AK. 2007. Within-family marker-assisted selection for aquaculture species. Genet Sel Evol, 39:301–317. Sundin K, Brown KH, Drew RE, Nichols KM, Wheeler PA, and Thorgaard GH. 2005. 
Genetic analysis of a development rate QTL in backcrosses of clonal rainbow trout, Onchorhynchus mykiss. Aquaculture, 247:75–83. Tanguy A, Bierne N, Saavedra C, Pina B, Bachere E, Kube M, Bazin E, Bonhomme F, Boudry P, Boulo V, Boutet I, Cancela L, Dossat C, Favrel P, Huvet A, Jarque S, Jollivet D, Klages S, Lapegue S, Leite R, Moal J, Moraga D, Reinhardt R, Samain JF, Zouros E, and Canario A. 2008. Increasing genomic information in bivalves through new EST collections in four species: development of new genetic markers for environmental studies and genome evolution. Gene, 408:27–36. Thanh NM, Barnes AC, Mather PB, Li Y, and Lyons RE. 2010. Single nucleotide polymorphisms in the actin and crustacean hyperglycemic hormone genes and their correlation with individual growth performance in giant freshwater prawn Macrobrachium rosenbergii. Aquaculture, 301:7–15. Uyeno T and Smith GR. 1972. Tetraploid origin of the karyotype of catostomid fishes. Science, 175:644–646. Van Tassell CP, Smith TP, Matukumalli LK, Taylor JF, Schnabel RD, Lawley CT, Haudenschild CD, Moore SS, Warren WC, and Sonstegard TS. 2008. SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat Methods, 5:247–252. Vanraden PM, Van Tassell CP, Wiggans GR, Sonstegard TS, Schnabel RD, Taylor JF, and Schenkel FS. 2009. Invited review: Reliability of genomic predictions for North American Holstein bulls. J Dairy Sci, 92:16–24. Verbyla KL, Bowman PJ, Hayes B, and Goddard ME. 2009. Sensitivity of genomic selection to using different prior distributions. 13th European Workshop on QTL Mapping and Marker Assisted Selection. Wageningen, The Netherlands.


Wang S, Sha Z, Sonstegard TS, Liu H, Xu P, Somridhivej B, Peatman E, Kucuktas H, and Liu Z. 2008. Quality assessment parameters for EST-derived SNPs from catfish. BMC Genomics, 9:450. Wang S, Peatman E, Abernathy J, Waldbieser G, Lindquist E, Richardson P, Lucas S, Wang M, Li P, Thimmapuram J, Liu L, Vullaganti D, Kucuktas H, Murdock C, Small BC, Wilson M, Liu H, Jiang Y, Lee Y, Chen F, Lu J, Wang W, Xu P, Somridhivej B, Baoprasertkul P, Quilang J, Sha Z, Bao B, Wang Y, Wang Q, Takano T, Nandi S, Liu S, Wong L, Kaltenboeck L, Quiniou S, Bengten E, Miller N, Trant J, Rokhsar D, and Liu Z. 2010. Assembly of 500,000 inter-specific catfish expressed sequence tags and large scale gene-associated marker development for whole genome association studies. Genome Biol, 11:R8. Wiedmann RT, Smith TP, and Nonneman DJ. 2008. SNP discovery in swine by reduced representation and high throughput pyrosequencing. BMC Genet, 9:81. Wiggans GR, Sonstegard TS, Vanraden PM, Matukumalli LK, Schnabel RD, Taylor JF, Schenkel FS, and Van Tassell CP. 2009. Selection of single-nucleotide polymorphisms and quality of genotypes used in genomic evaluation of dairy cattle in the United States and Canada. J Dairy Sci, 92:3431–3436. Wolfe KH. 2001. Yesterday’s polyploids and the mystery of diploidization. Nat Rev Genet, 2:333–341. Yan J, Yang X, Shah T, Sanchez-Villeda H, Li J, Warburton M, Zhou Y, Crouch JH, and Xu Y. 2010. High-throughput SNP genotyoing with the GoldenGate assay in maize. Mol Breed, 25:441–451. Yu Z and Guo X. 2006. Identification and mapping of disease-resistance QTLs in the Easter oyster, Crassostrea virginica Gmelin. Aquaculture, 254:160–170. Zeng D, Chen X, Li Y, Peng M, Ma N, Jiang W, Yang C, and Li M. 2008. Analysis of HSP70 in Litopenaeus vannamei and detection of SNPs. J Crustacean Biol, 28:727–730.

Chapter 10

Genomic Selection for Aquaculture: Principles and Procedures

Anna K. Sonesson

This chapter is about the application of genomics in aquaculture breeding. More specifically, it deals with how dense genetic marker information can be used to estimate the breeding values on which the selection of parents is based, and how this information affects the design of breeding programs. Because the dense genetic marker information covers the whole genome, such selection is called genomic selection, or whole genome-based selection. It is based on the utilization of population-wide linkage disequilibrium (LD) between quantitative trait loci (QTL; genes that code for quantitative traits, i.e., traits that are affected by many genes and the environment) and markers, as the markers are used to predict unmapped QTL. When QTL are in LD with the markers, some combinations of QTL and marker alleles are more common than others; for example, QTL allele Q is more often linked to marker allele M than to m, and QTL allele q is more often linked to marker allele m than to M.

Common haplotypes: Q–M and q–m
Rare haplotypes: Q–m and q–M

LD can be measured as $R^2$, and is affected by (1) the (finite) effective population size (genetic drift), $N_e$, where recent $N_e$ affects LD over long distances and old $N_e$ affects LD over short distances (Hayes et al., 2003); (2) the recombination rate between markers, $c$; (3) selection; (4) mating; (5) mutation; and (6) migration. When assuming no selection, random mating, constant $N_e$ over time, and neither mutation nor migration, $R^2 = 1/(1 + 4N_e c)$ (Sved, 1971). $R^2$ is underpredicted by this equation if the population was created by a recent crossbreeding of divergent lines/stocks, which is often the case for aquaculture populations. Similarly, selection can increase $R^2$ in regions of the genome with important QTL. Family and population structures also affect LD. For other measures of LD, see, for example, Hill and Robertson (1968), Hill (1981), Hayes et al. (2003), and Zhao et al. (2005).

One of the products of whole genome sequencing is the identification of thousands to millions of single-nucleotide polymorphisms (SNPs), the most common type of genetic marker. These have been put on high-density genome-wide SNP chips for humans (>1 million SNPs), cattle, pigs, and poultry (∼50,000 SNPs). Also, for Atlantic salmon, a SNP chip containing approximately 15,000 SNPs has been developed (Kent et al., 2009). SNP chips are under development for several other aquaculture species.

For genomic selection, all SNPs on the chip are used without testing their significance. One assumes that the dense SNP genotyping ensures that there is LD between markers and all individual QTL, so that even small QTL may be picked up by the markers. This gives genomic selection the potential to explain 100% of the total genetic variance for the trait. This differs from the situation in traditional marker-assisted selection, where information from QTL mapping experiments is included in the evaluation of the breeding values (see Chapter 12); these experiments use high significance thresholds, such that many small QTL remain undetected and only a limited fraction of the total genetic variance is explained.
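As a small numerical illustration of the LD measures discussed in this section, the sketch below computes LD from haplotype frequencies and evaluates the Sved (1971) expectation 1/(1 + 4Nec). It assumes the standard definition of the squared correlation between two biallelic loci, D²/(pQ(1 − pQ)pM(1 − pM)); the function names and all numerical values are hypothetical and not taken from the chapter.

```python
def ld_r2(f_QM, f_Qm, f_qM, f_qm):
    """Squared correlation between a QTL (alleles Q/q) and a marker (alleles M/m).

    Arguments are the four haplotype frequencies, which must sum to 1.
    """
    if abs(f_QM + f_Qm + f_qM + f_qm - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_Q = f_QM + f_Qm            # frequency of QTL allele Q
    p_M = f_QM + f_qM            # frequency of marker allele M
    denom = p_Q * (1.0 - p_Q) * p_M * (1.0 - p_M)
    if denom == 0.0:
        raise ValueError("both loci must be polymorphic")
    d = f_QM - p_Q * p_M         # disequilibrium coefficient D
    return d * d / denom


def expected_r2(ne, c):
    """Expected LD under the Sved (1971) approximation: 1 / (1 + 4 * Ne * c)."""
    if ne <= 0 or c < 0:
        raise ValueError("Ne must be positive and c non-negative")
    return 1.0 / (1.0 + 4.0 * ne * c)


if __name__ == "__main__":
    # Q-M and q-m common, Q-m and q-M rare (hypothetical frequencies).
    print("r2 from haplotypes:", round(ld_r2(0.4, 0.1, 0.1, 0.4), 3))
    for ne in (100, 1000):
        for c in (0.001, 0.01, 0.1):
            print(f"Ne={ne:5d}  c={c:5.3f}  expected r2={expected_r2(ne, c):.3f}")
```

The second function makes the practical consequence visible: for a given recombination rate, a smaller effective population size gives higher expected LD, so fewer markers would be needed to track the QTL.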

The Steps for Genomic Selection

The steps for genomic selection are the following:

1. Estimate the effects of the SNPs on each trait in a training data set. Here, test individuals are genotyped and phenotyped, and the SNP effects are estimated using these data.
2. Evaluate genome-wide estimated breeding values (GWEBVs) of the candidates. Here, the selection candidates are only genotyped, and the marker effects estimated in the training set provide the information on the marker–trait association.

In general, the candidates do not need to be closely related to the training individuals because the LD can be due to a common ancestor many generations ago. Hence, genomic selection decouples the calculation of estimated breeding values (EBVs) from phenotype recording. However, later in the chapter, it will be shown that higher relationships between the candidates and training individuals improve $R^2$.
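The two steps can be sketched in a few lines of code. This is only a toy illustration under stated assumptions: SNP effects are estimated by ridge regression (a simple stand-in, not necessarily the estimation method used in practice), genotypes are centered allele counts, and every function name, parameter value, and data point is hypothetical.

```python
import random


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def estimate_snp_effects(X, y, lam):
    """Step 1: estimate SNP effects from genotyped AND phenotyped training fish.

    Solves (X'X + lam * I) a = X'y, i.e. ridge regression (assumed stand-in).
    """
    p = len(X[0])
    lhs = [[sum(row[j] * row[k] for row in X) + (lam if j == k else 0.0)
            for k in range(p)] for j in range(p)]
    rhs = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve_linear(lhs, rhs)


def gwebv(X_candidates, effects):
    """Step 2: GWEBV of genotyped-only candidates = sum over SNPs of X_ij * a_hat_j."""
    return [sum(x * a for x, a in zip(row, effects)) for row in X_candidates]


def simulate_genotypes(n, freqs, rng):
    """Centered allele counts (0/1/2 copies minus 2q) for n fish; toy data only."""
    return [[(rng.random() < q) + (rng.random() < q) - 2 * q for q in freqs]
            for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(1)
    freqs = [0.2, 0.5, 0.3, 0.4, 0.1]           # hypothetical allele frequencies
    true_effects = [0.5, 0.0, -0.3, 0.0, 0.8]   # hypothetical SNP effects
    X_train = simulate_genotypes(200, freqs, rng)
    y_train = [sum(x * a for x, a in zip(row, true_effects)) + rng.gauss(0, 1)
               for row in X_train]
    a_hat = estimate_snp_effects(X_train, y_train, lam=1.0)
    X_cand = simulate_genotypes(5, freqs, rng)  # candidates: genotypes only
    for i, ebv in enumerate(gwebv(X_cand, a_hat)):
        print(f"candidate {i}: GWEBV = {ebv:.3f}")
```

Note that the candidates never contribute a phenotype: their GWEBVs depend only on their genotypes and on effects learned from the training fish, which is the decoupling of breeding value estimation from recording described above.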

Traits

Similar to traditional marker-assisted selection (Georges and Massey, 1991; Meuwissen and Goddard, 1996), genomic selection has the highest potential for traits that cannot be measured on the selection candidates. This makes genomic selection an interesting selection method for aquaculture species, as many traits in the breeding goals of aquaculture species cannot (easily) be measured on the selection candidates, but are measured on sibs of the candidates. One example of these sib traits is challenge test traits for specific diseases, for example, viral (infectious salmon anemia, white spot), bacterial (furunculosis), or parasitic (salmon louse) diseases. Another example of sib traits is the different fillet quality traits, for example, texture and color traits. Many of these traits have moderate to high heritability (see review by Gjedrem and Olesen, 2005), which is advantageous for genomic selection because the marker effects are then more accurately estimated.

Genomic selection has the potential to include new traits in the breeding goal that are either expensive or difficult to record, as long as the individuals on which they are recorded (the training individuals) are (near) relatives of the selection candidates. For example, information from individuals on a natural disease outbreak can be used to estimate the marker effects. Similarly, if sorting on fillet quality can be done at the slaughter line, this information can be used for selection in the breeding nucleus.

The sib traits are interesting candidates for traditional marker-assisted selection and genomic selection because with conventional breeding values, all candidates within a family receive the same (between-family) breeding value. Hence, only between-family selection can be practiced, which utilizes only 50% of the total genetic variation. With traditional marker-assisted selection and genomic selection, the within-family variance term is also predicted, such that 100% of the total genetic variation is addressed.
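The 50% figure follows from the standard decomposition of a full sib's breeding value into a parent average and a Mendelian sampling term. The following is a brief sketch, assuming unrelated, non-inbred parents and purely additive gene action; the symbols $g_s$, $g_d$ (sire and dam breeding values), $m_i$ (Mendelian sampling term), and $\sigma_A^2$ (additive genetic variance) are introduced here for illustration only.

```latex
\begin{align}
g_i &= \tfrac{1}{2} g_s + \tfrac{1}{2} g_d + m_i, \\
\operatorname{Var}\!\left(\tfrac{1}{2} g_s + \tfrac{1}{2} g_d\right)
    &= \tfrac{1}{4}\sigma_A^2 + \tfrac{1}{4}\sigma_A^2 = \tfrac{1}{2}\sigma_A^2
    \quad \text{(between families)}, \\
\operatorname{Var}(m_i) &= \sigma_A^2 - \tfrac{1}{2}\sigma_A^2 = \tfrac{1}{2}\sigma_A^2
    \quad \text{(within families)}.
\end{align}
```

Sib records alone inform only the first term, the family mean, whereas marker genotypes recorded on the candidate itself also carry information on $m_i$.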

Breeding Value Estimation for Genomic Selection

In an additive model, the true breeding values can be modeled as (Falconer and Mackay, 1996):

$$g_i = \sum_j X_{ij} a_j,$$

where summation is over all QTL $j$; $X_{ij}$ denotes the genotype of individual $i$ for QTL $j$, with values $2(1-q)$, $1-2q$, and $-2q$ for genotypes “1 1,” “1 0,” and “0 0,” where $q$ is the frequency of allele “1” at that QTL; and $a_j$ is the allele substitution effect of QTL $j$. The genotypes $X_{ij}$ are standardized so that the population mean breeding value is 0. For simplicity we will assume here that the QTL genotypes $X_{ij}$ are known, but will relax this assumption later. Then the “best” estimate of the breeding values is (Goddard, 2009)

$$E(g_i \mid y_i, X_{ij}) = \sum_j X_{ij}\, E(a_j \mid y_i, X_{ij}) = \sum_j X_{ij}\, \hat{a}_j,$$

where the “best” estimate of the QTL effect is (omitting the subscripts for simplicity)

$$\hat{a} = E(a \mid y, X) = \frac{\int a\, p(y \mid a, X)\, p(a)\, da}{\int p(y \mid a, X)\, p(a)\, da},$$

where $p(y \mid a, X)$ is the likelihood of the data, given the QTL effect and genotype; $p(a)$ is the prior distribution of the QTL effect, $a$; and “best” means that this estimate maximizes the correlation between the true and estimated breeding values.
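The ratio of integrals above can be evaluated numerically for a single QTL. The following is a minimal sketch, not the chapter's implementation: it assumes normally distributed residuals with a known standard deviation, no mean term, and a simulated data set; the function name `posterior_mean_effect` and all numerical values are hypothetical. With a normal prior the integral has a closed form, which serves as a check.

```python
import numpy as np

def posterior_mean_effect(y, x, prior_density, sigma_e, grid):
    """Posterior mean E(a | y, x) of one QTL effect, by quadrature on a uniform grid."""
    # Log-likelihood of the records for every candidate effect on the grid
    # (normal residuals with known standard deviation sigma_e; no mean term).
    residuals = y[:, None] - x[:, None] * grid[None, :]
    log_lik = -0.5 * np.sum(residuals ** 2, axis=0) / sigma_e ** 2
    # Subtracting the maximum avoids underflow; the constant cancels in the ratio.
    weights = np.exp(log_lik - log_lik.max()) * prior_density(grid)
    return float(np.sum(grid * weights) / np.sum(weights))

rng = np.random.default_rng(1)
q = 0.3                                   # frequency of allele "1" (assumed value)
x = rng.binomial(2, q, size=200) - 2 * q  # genotypes coded 2(1-q), 1-2q, -2q
sigma_e, sigma_a = 1.0, 0.3               # assumed known standard deviations
y = 0.5 * x + rng.normal(0.0, sigma_e, size=200)  # simulated records, true effect 0.5

def normal_prior(a):
    # Unnormalised N(0, sigma_a^2) density; the constant cancels in the ratio.
    return np.exp(-0.5 * (a / sigma_a) ** 2)

grid = np.linspace(-3.0, 3.0, 6001)
a_hat = posterior_mean_effect(y, x, normal_prior, sigma_e, grid)

# With a normal prior the posterior mean has a closed form (the BLUP of the effect).
a_blup = float(x @ y / (x @ x + sigma_e ** 2 / sigma_a ** 2))
print(a_hat, a_blup)
```

Swapping `normal_prior` for a thicker-tailed or mixture density changes only the prior term, which is the sense in which the methods discussed next differ.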


The above derivation shows that the “best” estimate requires knowledge about the prior distribution of QTL/marker effects. The original Meuwissen et al. (2001) paper proposed three alternative assumptions for the prior distribution of QTL/marker effects that are connected to the method used to estimate the QTL/marker effects:

1. GWBLUP (genome-wide BLUP): A normal prior distribution with the same variance for every SNP. This implies that the SNP effects are estimated by best linear unbiased prediction (BLUP).
2. BayesA: A t-distribution, which is more thick-tailed than the normal distribution. Due to the thicker tails, it allows for more extreme QTL effects; that is, most QTL effects are close to zero but some are extremely big.
3. BayesB: A prior distribution where with probability π, the QTL effects follow a t-distribution, and with probability (1 − π) they have effect 0.

A fourth prior distribution was later developed and used by Kizilkaya et al. (2010):

4. BayesC: A prior where with probability π, the QTL effects are normally distributed and with probability (1 − π) they are 0.

GWBLUP can be calculated in an iterative way, while BayesA/B/C require Markov chain Monte Carlo (MCMC) sampling. BayesA and C can be implemented by Gibbs sampling and are therefore easier to implement than BayesB, which requires a Metropolis–Hastings sampling step.

In the above, it was assumed that the QTL genotypes were known, but in real life only SNP marker genotypes are known. If we assume a very dense SNP marker map, we expect that every QTL is in perfect LD with at least one SNP. Hence, most SNPs have zero effect, but some pick up the QTL effect, which suggests methods BayesB and BayesC. If the marker map is less dense, then not every QTL will be picked up by a marker that is in perfect LD. However, in this case, several nearby markers can pick up the effect of the QTL, and still explain all of its variance. Although BayesB and C are still the most appropriate models, π needs to be increased in order to account for more SNPs picking up the effect. In an extreme situation, nearly all of the markers pick up an effect, and BayesB and C increasingly resemble BayesA and GWBLUP, respectively. If the marker map becomes even less dense, the QTL effects are only partly explained by the SNPs due to the imperfect LD, and the maximum accuracy that can be achieved by genomic selection becomes smaller than 1.

In all the above models, the data, y, are modeled by

$$y_i = \mu + \sum_j X_{ij} a_j + e_i,$$

and the models only differ with respect to their assumption about the prior distribution of a. In the case of GWBLUP, the variance of the data is modeled by

$$V(y) = XX'\sigma_{snp}^2 + I\sigma_e^2,$$

where X is the matrix of SNP genotypes ($X_{ij}$), and $\sigma_{snp}^2$ and $\sigma_e^2$ are the SNP and residual variances, respectively. This model compares with the traditional animal model (Henderson, 1984) where


$$V(y) = A\sigma_a^2 + I\sigma_e^2,$$

and A is the relationship matrix of the animals (based on pedigree data), and $\sigma_a^2$ is the additive genetic variance. Since GWBLUP assumes that all additive genetic variance is spread evenly over the SNPs, $\sigma_{snp}^2 = \sigma_a^2 / M$, where M is the number of SNPs, the model for V(y) using GWBLUP becomes

$$V(y) = (XX'/M)\,\sigma_a^2 + I\sigma_e^2,$$

where element (i, k) of XX′/M is $\sum_j X_{ij} X_{kj} / M$, which is the covariance between the SNP genotypes of animals i and k over all SNPs (since the mean of $X_{ij}$ is 0). The covariance between the genotypes of animals i and k is a marker-based measure of the relationship between them; that is, it is an estimate of the genomic relationship. Thus, when moving from the traditional animal model toward GWBLUP, we are replacing the pedigree-based relationship matrix, A, by the genomic relationship matrix XX′/M. This also suggests an alternative way to calculate GWBLUP breeding value estimates: instead of estimating all the SNP effects, one can directly estimate the EBV of the animals by using the traditional animal model equations (Henderson, 1984) and replacing the A matrix by XX′/M (Goddard and Hayes, 2007). The latter is thus an equivalent model for calculating GWBLUP breeding value estimates.

GWEBV methods can also deal with nonadditive effects. In the BayesA and BayesB methods, SNP × SNP interaction terms can be added, but the total number of terms to estimate becomes huge. Gianola et al. (2006) presented Reproducing Kernel Hilbert Space Regression methods, which automatically take account of all interactions. However, the additive effects remain the most important for breeding purposes because dominance interactions are not inherited, and with linear models, interactions are largely included into the additive effects (Hill et al., 2008). Finally, additive interactions are only partly inherited due to recombinations between the genes.
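The equivalence between estimating SNP effects and using the genomic relationship matrix can be checked numerically. The sketch below assumes simulated genotypes, known variance components, and a known mean of zero (so no fixed effects are fitted); all sizes and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_animals, n_snp = 30, 200

# Simulated genotypes, coded as allele counts minus twice the allele frequency.
freq = rng.uniform(0.1, 0.9, n_snp)
X = rng.binomial(2, freq, size=(n_animals, n_snp)) - 2 * freq

sigma_a2, sigma_e2 = 1.0, 1.0      # assumed known variance components
sigma_snp2 = sigma_a2 / n_snp      # all genetic variance spread evenly over SNPs

# Simulated records with mean zero (assumption: no fixed effects to estimate).
a_true = rng.normal(0.0, np.sqrt(sigma_snp2), n_snp)
y = X @ a_true + rng.normal(0.0, np.sqrt(sigma_e2), n_animals)

# Route 1: estimate every SNP effect by BLUP, then sum them per animal.
lam_snp = sigma_e2 / sigma_snp2
a_hat = np.linalg.solve(X.T @ X + lam_snp * np.eye(n_snp), X.T @ y)
gebv_from_snps = X @ a_hat

# Route 2: animal model with A replaced by the genomic relationship matrix XX'/M.
G = X @ X.T / n_snp
lam = sigma_e2 / sigma_a2
gebv_from_G = G @ np.linalg.solve(G + lam * np.eye(n_animals), y)

print(np.max(np.abs(gebv_from_snps - gebv_from_G)))
```

Route 2 solves a system of the size of the number of animals rather than the number of SNPs, which is the practical attraction when SNPs greatly outnumber animals.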

Within-Family Genomic Selection

In aquaculture breeding, it may be tempting to apply genomic selection within families for three reasons. First, marker densities in aquaculture species may not be sufficiently high to pick up population-wide LD, which holds across families. Within a full-sib family, however, LD is high even between quite distant markers. Following Solberg et al. (2008), marker density is expressed as D × Ne SNPs per Morgan, where D is the density parameter. Hence, a density of D = 1 requires 100 (1000) SNPs per Morgan if Ne = 100 (Ne = 1000), and thus the number of SNPs required may be very high for aquaculture species where Ne is high. Second, the sizes of full-sib families can be very large in aquaculture breeding, which favors within-family selection. Third, the


costly separate rearing of families is no longer needed since the markers can be used to identify the family to which an individual belongs. The latter requires, however, that one can still guarantee a sufficiently large number of individuals per family.

In within-family genomic selection, the GWEBVs are estimated as described in the section on “Breeding Value Estimation for Genomic Selection,” except that the methodology is applied within every family. However, in order to apply between-family selection, the between-family component needs to be added to the GWEBV:

$$\mathrm{EBV}_{tot} = \mathrm{EBV}_{betw} + b \times \mathrm{GWEBV}_{within}, \tag{10.1}$$

where $\mathrm{EBV}_{betw}$ is the family mean as estimated by traditional BLUP-EBV estimation (Henderson, 1984), $\mathrm{GWEBV}_{within}$ is the genome-wide EBV as estimated within the full-sib family (see next paragraph), and b is a regression factor that accounts for the bias of GWEBV (Meuwissen et al., 2001). The regression factor, b, can be estimated by a cross-validation approach. For example, set 20% of the within-family phenotypes to missing and estimate GWEBV for these fish. Next, estimate b by regressing the masked records on the predicted GWEBV:

$$y = \mu + b \times \mathrm{GWEBV} + e,$$

where y is the records that were set to missing (which are known in this regression) and GWEBV are the estimates of genome-wide breeding values obtained when the records y were set to missing.
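The regression step and Equation 10.1 can be sketched as follows. The helper names `estimate_b` and `total_ebv` and the toy numbers are hypothetical; in practice the GWEBV of the masked fish would come from a within-family analysis rather than from a hand-written array.

```python
import numpy as np

def estimate_b(masked_records, gwebv_pred):
    """Slope of the regression y = mu + b * GWEBV + e on the masked records."""
    slope, _intercept = np.polyfit(gwebv_pred, masked_records, deg=1)
    return float(slope)

def total_ebv(ebv_between, gwebv_within, b):
    """Equation 10.1: family mean EBV plus the regressed within-family GWEBV."""
    return ebv_between + b * gwebv_within

# Toy data: GWEBV predicted for five fish whose records had been masked.
gwebv = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
# Their actual records; in this toy case the GWEBV are inflated by a factor of two.
records = 10.0 + 0.5 * gwebv + np.array([0.1, -0.1, 0.0, -0.1, 0.1])

b = estimate_b(records, gwebv)
print(b, total_ebv(ebv_between=1.2, gwebv_within=0.8, b=b))
```

A value of b below 1 shrinks within-family GWEBV that are too dispersed, so that the between- and within-family components are on the same scale before they are added.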

Effect on Genetic Gain

Genetic gain per unit of time, G, can be expressed as

$$G = \frac{r\, i\, \sigma_g}{s},$$

where r is the accuracy of the breeding values, i is the intensity of selection, $\sigma_g$ is the genetic standard deviation, and s is the generation interval (Falconer and Mackay, 1996). For aquaculture species, genomic selection mainly affects the accuracy of the breeding values and the intensity of selection (increased within-family selection intensity). Both $\sigma_g$ and s remain constant, assuming that the trait under selection is the same (although the change in genetic variance over time may be different for genomic selection than for traditional BLUP selection; Sonesson and Meuwissen, 2009) and that selection and mating occur as soon as the animals are mature.
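A small numerical sketch of this expression follows. The selection intensity is computed under the assumption of truncation selection on a normally distributed criterion in a large population; the accuracies of 0.5 and 0.8 are round numbers in the range quoted in this chapter, and the genetic standard deviation and generation interval are hypothetical.

```python
from statistics import NormalDist

def selection_intensity(proportion_selected):
    """Standardised selection differential for truncation selection on a normal trait."""
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - proportion_selected)
    return std_normal.pdf(threshold) / proportion_selected

def genetic_gain(accuracy, intensity, sigma_g, generation_interval):
    """G = r * i * sigma_g / s, in units of the trait per unit of time."""
    return accuracy * intensity * sigma_g / generation_interval

i = selection_intensity(0.10)  # top 10% selected
gain_blup = genetic_gain(0.5, i, sigma_g=1.0, generation_interval=4.0)
gain_genomic = genetic_gain(0.8, i, sigma_g=1.0, generation_interval=4.0)
print(round(i, 3), gain_genomic / gain_blup)
```

With intensity, genetic standard deviation, and generation interval held fixed, gain scales in direct proportion to accuracy.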

Accuracy of GWEBV

The accuracy of the breeding values, r, is the correlation between the true and estimated breeding values. Overall, the accuracy of the GWEBV depends on the accuracy with which the marker effects are estimated, which in turn depends on the heritability of the trait, the prior distribution of marker effects, the numbers of QTL and training records,


and the relationship between training individuals and selection candidates. The accuracy of the breeding values also depends on the proportion of the genetic variance of each QTL that is explained by the markers due to LD, which in relation to genomic selection depends on genome size, marker density, gaps in the marker map, and effective population size (Dekkers, 2007).

The amount of LD is the driving parameter for the accuracy of the GWEBV. LD measured as $R^2$ was shown to depend on $LN_e/M$, where L is the length of the genome and M is the number of markers (Goddard, 2009). The numerator $LN_e$ can be seen as the number of independent chromosome segments that segregate in the population and whose effect on the trait must be estimated (Goddard, 2009). The number of independent chromosome segments can be identified as follows. The accuracy of GWEBV above that of conventional BLUP breeding values is due to the variation in relationships between the animals (Goddard, 2009). For instance, when using pedigree, full sibs are expected to have a relationship of 0.5, but the true genomic relationship may deviate from this. The number of independent segments is now the number of such segments that could explain the variability of the relationships. If there are $N_{segm}$ segments, the variability in relationships is $1/N_{segm}$, where $N_{segm}$ is chosen such that $1/N_{segm}$ equals the variance in relationship (Goddard, 2009). The squared accuracy of the GWEBV may be approximated by the equation for squared accuracy from traditional BLUP, that is,

$$r^2 = \frac{N}{N + V_e / V_{segm}} = \frac{N h^2}{N h^2 + N_{segm}},$$

where N is the number of training individuals, $V_e$ is the residual variance (not due to the segment; approximately 1 here), and $V_{segm}$ is the variance due to a chromosome segment, $V_{segm} = h^2 / N_{segm}$. Assuming that the number of segments can be modeled by $N_{segm} = 4 N_e L \nu$,

$$r^2 = \frac{N h^2}{N h^2 + 4 N_e L \nu},$$

which shows that the accuracy of GWEBV, r, depends on the number of training individuals N, the heritability of the trait $h^2$, the effective population size $N_e$, the genome size L, and the ratio of effective/actual numbers of segments $\nu$ (Daetwyler et al., 2007; Goddard and Hayes, 2007). The equation holds for GWBLUP. When a more accurate prior distribution of the QTL effects and number of QTL is used, for example, BayesB, accuracy of the GWEBV may be increased. This equation shows that the required number of training individuals doubles in order to keep a constant accuracy of the GWEBV if

• the genome size, L, doubles; this assumes a constant marker density;
• the heritability, $h^2$, halves; or
• the effective population size, $N_e$, doubles.

As shown in the section above on “Breeding Value Estimation for Genomic Selection,” it is important to use the right prior distribution for the QTL effects when estimating


the marker effects to get high accuracy. Some traits will probably be affected by many QTLs with effects that are normally distributed. Others will be affected by one or several large QTLs. One example from aquaculture is the QTL for infectious pancreatic necrosis (IPN) resistance that has been mapped in Atlantic salmon (Houston et al., 2008; Moen et al., 2009) and that explains ∼80% of the genetic variation of this trait. For this type of trait, fewer markers will be needed to pick up the QTL and achieve the same accuracy of the GWEBV.

In practice, cross-validation is used to test which prior distribution of marker effects gives the highest accuracy of the GWEBV. This implies testing the different GWEBV methods because they differ in their prior assumptions. In cross-validation, the data set is split at random into, for example, 10 equal-sized subsets, called Si. Then the marker effects are estimated 10 times, every time leaving one of the subsets Si out of the training data and using the estimated marker effects to predict GWEBV for the left-out individuals in Si. This yields a GWEBV prediction for every individual, made while its own data were left out of the training set. Next, these GWEBV are correlated with the phenotypes of the individuals, resulting in the correlation γ. If we assume that phenotypes can be modeled as y = μ + a + e, and the error terms are uncorrelated, then the phenotypes can only be predicted by predicting a, that is, the true genetic effects. Even if the true genetic effect were perfectly predicted, the correlation between a and y would be smaller than 1 and would actually equal h, the square root of the heritability. Thus, an unbiased cross-validation estimate of the accuracy of GWEBV equals γ/h (Luan et al., 2009). In breeding schemes, the predicted individuals may not be a random sample from the training population, but a specific group of individuals, for example, the latest generation of individuals. 
In this case, the latest generation can be used as the validation individuals (assuming records are available), and the accuracy of GWEBV can be based on the correlation between the GWEBV and the phenotypes of these individuals.
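The deterministic accuracy formula and the cross-validation correction γ/h from this section can be written as three small functions. This is a sketch under assumptions: the genome length L is taken in Morgans, ν defaults to 1.0 only as a placeholder, and the function names and example values are hypothetical.

```python
import math

def gwebv_accuracy(n_train, h2, ne, genome_morgans, nu=1.0):
    """r = sqrt(N h^2 / (N h^2 + 4 Ne L nu)); nu = 1.0 is a placeholder value."""
    n_segments = 4.0 * ne * genome_morgans * nu
    return math.sqrt(n_train * h2 / (n_train * h2 + n_segments))

def required_training_size(target_accuracy, h2, ne, genome_morgans, nu=1.0):
    """Solve the same equation for N at a given target accuracy."""
    r2 = target_accuracy ** 2
    return r2 * 4.0 * ne * genome_morgans * nu / (h2 * (1.0 - r2))

def accuracy_from_cross_validation(gamma, h2):
    """Correlation between GWEBV and phenotype, divided by h (Luan et al., 2009)."""
    return gamma / math.sqrt(h2)

n_base = required_training_size(0.7, h2=0.4, ne=100, genome_morgans=10)
n_low_h2 = required_training_size(0.7, h2=0.2, ne=100, genome_morgans=10)
n_big_ne = required_training_size(0.7, h2=0.4, ne=200, genome_morgans=10)
n_big_l = required_training_size(0.7, h2=0.4, ne=100, genome_morgans=20)
print(n_base, n_low_h2, n_big_ne, n_big_l)
print(accuracy_from_cross_validation(gamma=0.4, h2=0.25))
```

Each of the three changes (halving the heritability, doubling Ne, doubling L) doubles the required N, in line with the bullet list above.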

Quality of SNP Chip

The SNP chip used for genomic selection needs specific qualities. Above all, the average $R^2$ between adjacent SNP markers must be at least moderate (>∼0.2; Calus et al., 2008), but this will depend on the relationship between training individuals and selection candidates. Also, the SNP chip must cover the genome adequately. The GenCall score measures the accuracy of the Illumina SNP chip and should be higher than 0.7 (Oliphant et al., 2002). Fredman et al. (2004) set up a list of quality parameters related to duplication of the genome, which is relevant for salmonids. Dominik et al. (2009) used the SNP chip for Atlantic salmon (Kent et al., 2009) and found that for the Tasmanian Atlantic salmon population, among 15,525 SNPs, 2991 were single polymorphisms, 952 were duplicated polymorphisms, and LD > 0.175 for 6389 pairs of the 2991 single polymorphism SNPs; 2234 SNPs had a GenCall Score >0.7. To indicate whether the density of the SNP chip is sufficiently high for use in several (sub)populations, Goddard et al. (2006) calculated the correlation of R values (using the method of Hill and Robertson, 1968), which measures how far the same marker phase is likely to persist in these populations. If the correlation is high, marker


effects estimated in one population should persist in the other. Assuming that age classes in aquaculture species resemble different breeds in cattle, a higher accuracy will be obtained when individuals from all age classes are included in the training set. A chip that is used across several age classes needs a higher SNP density. If the correlation between the R values is low, marker effects must be estimated within each age class (de Roos et al., 2008; Toosi et al., 2010).
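The correlation of R values between populations can be sketched as follows. This is an illustration only: it assumes phased 0/1 haplotypes, uses adjacent marker pairs (a choice made here, not taken from the cited studies), and relies on a toy simulator, `simulate_haplotypes`, that is entirely hypothetical.

```python
import numpy as np

def signed_r(hap_a, hap_b):
    """Signed LD correlation between two biallelic loci from 0/1 haplotype vectors."""
    p_a, p_b = hap_a.mean(), hap_b.mean()
    d = np.mean(hap_a * hap_b) - p_a * p_b
    return d / np.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))

def adjacent_r(haplotypes):
    """Signed r for every pair of adjacent markers (rows = haplotypes, columns = SNPs)."""
    n_snp = haplotypes.shape[1]
    return np.array([signed_r(haplotypes[:, j], haplotypes[:, j + 1])
                     for j in range(n_snp - 1)])

def persistence_of_phase(haps_pop1, haps_pop2):
    """Correlation, across marker pairs, of the signed r values in two populations."""
    return float(np.corrcoef(adjacent_r(haps_pop1), adjacent_r(haps_pop2))[0, 1])

def simulate_haplotypes(n_hap, signed_copy_prob, rng):
    """Toy simulator: locus j copies locus j-1 (or its complement when the value is
    negative) with probability |signed_copy_prob[j-1]|, otherwise draws a fresh allele."""
    n_snp = len(signed_copy_prob) + 1
    haps = np.empty((n_hap, n_snp), dtype=int)
    haps[:, 0] = rng.integers(0, 2, n_hap)
    for j, c in enumerate(signed_copy_prob, start=1):
        previous = haps[:, j - 1] if c >= 0 else 1 - haps[:, j - 1]
        linked = rng.random(n_hap) < abs(c)
        haps[:, j] = np.where(linked, previous, rng.integers(0, 2, n_hap))
    return haps

rng = np.random.default_rng(3)
shared_phase = rng.uniform(-1.0, 1.0, 29)
pop1 = simulate_haplotypes(400, shared_phase, rng)   # two populations sharing
pop2 = simulate_haplotypes(400, shared_phase, rng)   # the same marker phase
pop3 = simulate_haplotypes(400, rng.uniform(-1.0, 1.0, 29), rng)  # unrelated phase

same_phase = persistence_of_phase(pop1, pop2)
different_phase = persistence_of_phase(pop1, pop3)
print(same_phase, different_phase)
```

A high value, as for the two populations sharing the same phase, suggests that marker effects estimated in one could carry over to the other; a value near zero suggests they need to be estimated separately.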

Design of Genomic Selection Schemes for Aquaculture

This section considers two types of design of genomic selection schemes: one based on the traditional family-based breeding programs and one based on communal rearing of families and selective genotyping.

The design of the traditional family-based breeding programs (Gjedrem, 2005) can easily be used for within-family marker-assisted selection (Sonesson, 2007) as well as genomic selection (Nielsen et al., 2009; Sonesson and Meuwissen, 2009). However, for within-family marker-assisted selection schemes, the phase of marker and QTL alleles needs to be established for all families because the linkage between the markers and QTL is not sufficiently close to ensure that marker–QTL allele relationships are consistent across the population.

For the sib test design, there are two groups of offspring, one with test fish and one with selection candidates. The test fish are both phenotyped and genotyped, and the candidates are only genotyped. In Sonesson and Meuwissen (2009), this design was studied using computer simulation. With 3000 selection candidates and 3000 test individuals (sibs of the candidates) split over 100 families, a heritability of 0.4, and a marker density of 0.5 Ne/M (here M denotes Morgan), the accuracy of GWEBV was 0.823 in generation 5 when the sib testing was performed every generation. The accuracy was then hardly reduced by selection. In order to reduce genotyping and phenotypic recording costs, genotyping and phenotypic recording were omitted for one generation, and the marker effects that had been estimated in the previous generation were used instead. This, however, reduced the accuracy of GWEBV in the generations without recording, such that the accuracy of GWEBV fluctuated over generations. When recording only in the first generation, accuracy of GWEBV decreased rapidly over generations and was only 0.304 in generation 5. This reduction in accuracy is mainly because of changes in LD between marker and QTL. In particular, spurious LD, which is not due to linkage, changes quickly over generations (Goddard and Meuwissen, 2005). In general, the results of Sonesson and Meuwissen (2009) showed the importance of updating the marker effects for genomic selection.

One way to reduce the genotyping costs is to genotype the parents with a dense SNP chip and the offspring (selection candidates and test individuals) at lower density, and to infer the missing genotypes of the offspring from the genotypes of their parents (Habier et al., 2009). Sonesson and Meuwissen (2009) also showed that rates of inbreeding were reduced by 81% in these schemes compared with traditional selection schemes due to within-family selection.

For within-family genomic selection, this design is also suitable because of the large families in aquaculture species. Results from Sonesson et al. (2010) show that accuracy was 0.664, 0.848, and 0.877 with 20, 200, and 500 test individuals/family using a marker density of 1.0 Ne/M and heritability of the trait of 0.4.


[Figure 10.1, flowchart. Test individuals (2000 or 20,000): kept in groups of 1, 10, or 100 families/group; selective genotyping within group based on phenotypic extremes; estimation of marker effects for the sib trait (h2 = 0.4). Selection candidates (10,000): 2000 individuals preselected for growth (h2 = 0.4) are genotyped. Final step: selection of 100 sires and 100 dams on their GW-EBV for the sib trait over groups.]

Figure 10.1 Design based on communal rearing of families and selective genotyping of pooled samples from phenotypic extremes of each group.

As discussed in the section above on “Traits,” information from natural disease outbreaks can be used to estimate marker effects. The design for these traits has not been optimized but will probably resemble a case-control design as used in human disease studies. As shown in Sonesson and Meuwissen (2009), the relationship between the training individuals in the natural disease outbreak and the selection candidates will be critical for the accuracy of genomic selection.

Sonesson et al. (2010) presented a design for aquaculture species that takes advantage of the large family sizes of these species (Figure 10.1). It uses communal rearing of families for both one group of candidates and one group of training individuals (sibs of the candidates) in order to reduce housing costs compared with traditional family-based breeding schemes. The selection candidates are recorded for growth and the sibs are recorded for one sib trait, but the design can easily be extended to several traits. Selective genotyping on pooled DNA (Darvasi and Soller, 1994) of the individuals with the highest and lowest phenotypic values for the sib trait within each group is performed to estimate marker effects. This greatly reduces genotyping costs, which are the main limitation of genomic selection using a traditional family-based breeding design. The marker effects are used to calculate genome-wide breeding values of the selection candidates. The selection candidates have been preselected for growth and are also genotyped to estimate their GWEBV.

The results of the computer simulations show that with 20,000 test individuals (200/family) grouped in 10 families/group with a marker density of 1.0 Ne/M, accuracy of selection was 0.817. With 100 families/group, accuracy of selection was 0.808 because the phenotypic family mean, which is one of the terms in the estimation of the GWEBV, was then more difficult to estimate using the markers. With only 2000 test individuals, accuracy of selection dropped to 0.603. Similarly, with a marker density of 0.5 Ne/M, the accuracy of selection dropped from 0.817 to 0.608. With only one family/pool, which is a possible application of genomic selection in traditional family-based breeding schemes, accuracy increased to 0.848 because the marker and


phenotypic information coincide to estimate the phenotypic family means. Overall, this design gives higher accuracy of selection than a traditional BLUP scheme, which has accuracies around 0.5 for a typical design used today (Nielsen et al., 2009). However, this design assumes that genotyping pools of phenotypically extreme individuals yields accurate estimates of allele frequencies within the pools. A pooling strategy for this design requires that sorting of the extreme families is possible. The groups must be large enough to obtain sufficiently extreme phenotypes, such that different QTL alleles are (almost) fixed in the two groups. Baranski et al. (2009) showed a correlation of 0.98 between pool and individual values of allele frequencies. In that study, one individual per family from 60 families was pooled into three pools for each of the susceptible and resistant groups for infectious salmon anemia in Atlantic salmon. Fifteen individuals/family had been tested. This pooling strategy was set up for QTL detection and aimed to keep the number of false positive QTL low by using exactly the same number of animals per family. The latter resembles the Transmission Disequilibrium Test (TDT) for QTL mapping (Spielman et al., 1993). However, for genomic selection, another pooling design that pools extremes over families should probably be used in order to also estimate the between-family effects.
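The idea behind pooling phenotypic extremes can be illustrated with a toy simulation. The sketch below is not the design of Sonesson et al. (2010) or Baranski et al. (2009): it assumes unrelated individuals, equal DNA contribution of every individual to its pool, and error-free pool genotyping; the function name and all parameter values are hypothetical.

```python
import numpy as np

def pool_allele_frequencies(genotypes, phenotypes, fraction=0.1):
    """Allele frequencies in the low and high phenotypic tails.

    genotypes: (n_individuals, n_snp) array of allele counts (0, 1, 2).
    Assumes every individual contributes equally to its pool.
    """
    n_pool = max(1, int(round(fraction * len(phenotypes))))
    order = np.argsort(phenotypes)
    low, high = order[:n_pool], order[-n_pool:]
    freq_low = genotypes[low].mean(axis=0) / 2.0
    freq_high = genotypes[high].mean(axis=0) / 2.0
    return freq_low, freq_high

rng = np.random.default_rng(7)
n_individuals = 2000
# Two SNPs at frequency 0.5: the first affects the trait, the second is neutral.
genotypes = rng.binomial(2, 0.5, size=(n_individuals, 2))
phenotypes = 1.0 * genotypes[:, 0] + rng.normal(0.0, 1.0, n_individuals)

freq_low, freq_high = pool_allele_frequencies(genotypes, phenotypes, fraction=0.1)
contrast = freq_high - freq_low   # large for the causal SNP, near zero for the neutral one
print(contrast)
```

The frequency contrast between the two pools carries the information about the marker effect, so only two pooled samples per group need to be genotyped instead of every individual.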

References

Baranski M, Gidskehaug L, Hayes B, and Bakke H. 2009. Empirical evaluation of selective DNA pooling for genome-wide association analysis of ISA resistance using the Atlantic salmon 16.5K SNP array. In: Proceedings of International Symposium on Genetics in Aquaculture X, edited by U Na-Nakorn. IAGA, Bangkok, Thailand, p. 34.
Calus MPL, Meuwissen THE, de Roos APW, and Veerkamp RF. 2008. Accuracy of genomic selection using different methods to define haplotypes. Genetics, 178:553–561.
Daetwyler HD, Villanueva B, Bijma P, and Woolliams JA. 2007. Inbreeding in genome-wide selection. J Anim Breed Genet, 124:369–376.
Darvasi A and Soller M. 1994. Selective DNA pooling for determination of linkage between a molecular marker and a quantitative trait locus. Genetics, 138:1365–1373.
Dekkers JCM. 2007. Prediction of response to marker-assisted and genomic selection using selection index theory. J Anim Breed Genet, 124:331–341.
Dominik S, Henshall JM, Kube P, King H, Lien S, Kent M, and Elliott N. 2009. Can whole genome selection be applied in the Tasmanian Atlantic salmon population? In: Proceedings of International Symposium on Genetics in Aquaculture X, edited by U Na-Nakorn. IAGA, Bangkok, Thailand, p. 34.
Falconer DS and Mackay TFC. 1996. Introduction to Quantitative Genetics. Longman Sci Tech, Harlow, UK.
Fredman D, White SJ, Potter S, Eichler EE, Den Dunnen JT, and Brookes AJ. 2004. Complex SNP-related sequence variation in segmental genome duplications. Nat Genet, 36:861–866.
Georges M and Massey JM. 1991. Velogenetics, or the synergistic use of marker assisted selection and germ-line manipulation. Theriogenology, 35:151–159.
Gianola D, Fernando RL, and Stella A. 2006. Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics, 173:1761–1776.
Gjedrem T. 2005. Improvement of productivity through breeding schemes. GeoJournal, 10:233–241.
Gjedrem T and Olesen I. 2005. Basic statistical parameters. In: Selection and Breeding Programs in Aquaculture, edited by T Gjedrem. Springer, Dordrecht, The Netherlands, pp. 45–72.


Goddard M. 2009. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica, 136:245–257.
Goddard ME and Hayes BJ. 2007. Genomic selection. J Anim Breed Genet, 124:323–330.
Goddard ME and Meuwissen THE. 2005. The use of linkage disequilibrium to map quantitative trait loci. Aust J Exp Agric, 45:837–845.
Goddard ME, Chamberlain AC, and Hayes BJ. 2006. Can the same markers be used in multiple breeds? Proceedings of the 8th World Congress of Genetics Applied to Livestock Production, Belo Horizonte, Brazil.
Habier D, Fernando RL, and Dekkers JCM. 2009. Genomic selection using low-density marker panels. Genetics, 182:343–353.
Hayes BJ, Visscher PM, McPartlan HC, and Goddard ME. 2003. Novel multilocus measure of linkage disequilibrium to estimate past effective population size. Genome Res, 13:635–643.
Henderson C. 1984. Applications of Linear Models in Animal Breeding. Guelph University Press, Guelph, Canada.
Hill WG. 1981. Estimation of effective population-size from data on linkage disequilibrium. Genetical Res, 38:209–216.
Hill WG and Robertson A. 1968. Linkage disequilibrium in finite populations. Theor Appl Genet, 38:226–231.
Hill WG, Goddard ME, and Visscher PM. 2008. Data and theory point to mainly additive genetic variance for complex traits. PLoS Genet, 4:e1000008.
Houston RD, Haley CS, Hamilton A, Guyt DR, Tinch AE, Taggart JB, McAndrew BJ, and Bishop SC. 2008. Major quantitative trait loci affect resistance to infectious pancreatic necrosis in Atlantic salmon (Salmo salar). Genetics, 178:1109–1115.
Kent MP, Hayes B, Xiang Q, Berg PR, Gibbs RA, and Lien S. 2009. Development of 16.5K SNP chip for Atlantic salmon. Proceedings of the 17th Plant and Animal Genome Conference, San Diego, CA.
Kizilkaya K, Fernando RL, and Garrick DJ. 2010. Genomic prediction of simulated multibreed and purebred performance using observed fifty thousand single nucleotide polymorphism genotypes. J Anim Sci, 88:544–551.
Luan T, Woolliams JA, Lien S, Kent M, Svendsen M, and Meuwissen THE. 2009. The accuracy of genomic selection in Norwegian red cattle assessed by cross-validation. Genetics, 183:1119–1126.
Meuwissen THE and Goddard ME. 1996. The use of marker haplotypes in animal breeding schemes. Genet Sel Evol, 28:161–176.
Meuwissen THE, Hayes BJ, and Goddard ME. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157:1819–1829.
Moen T, Baranski M, Sonesson AK, and Kjoglum S. 2009. Confirmation and fine-mapping of a major QTL for resistance to infectious pancreatic necrosis in Atlantic salmon (Salmo salar): Population-level associations between markers and trait. BMC Genomics, 10:368.
Nielsen HM, Sonesson AK, Yazdi H, and Meuwissen THE. 2009. Comparison of accuracy of genome-wide and BLUP breeding value estimates in sib based aquaculture breeding schemes. Aquaculture, 289:259–264.
Oliphant A, Barker DL, Stuelpnagel JR, and Chee MS. 2002. BeadArray (TM) technology: Enabling an accurate, cost-effective approach to high throughput genotyping. Biotechniques, 32:56–61.
de Roos APW, Hayes BJ, Spelman RJ, and Goddard ME. 2008. Linkage disequilibrium and persistence of phase in Holstein-Friesian, Jersey and Angus cattle. Genetics, 179:1503–1512.
Solberg TR, Sonesson AK, Woolliams JA, and Meuwissen THE. 2008. Genomic selection using different marker types and densities. J Anim Sci, 86:2447–2454.


Sonesson AK. 2007. Within-family marker-assisted selection for aquaculture species. Genet Sel Evol, 39:301–317.
Sonesson AK and Meuwissen THE. 2009. Testing strategies for genomic selection in aquaculture breeding programs. Genet Sel Evol, 41:37.
Sonesson AK, Goddard ME, and Meuwissen THE. 2010. The use of communal rearing of families and DNA pooling in multi-trait aquaculture genomic selection schemes. Genet Sel Evol (submitted).
Spielman RS, McGinnis RE, and Ewens WJ. 1993. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet, 52:506–516.
Sved JA. 1971. Linkage disequilibrium and homozygosity of chromosome segments in finite populations. Theor Popul Biol, 2:125–141.
Toosi A, Fernando RL, and Dekkers JC. 2010. Genomic selection in admixed and crossbred populations. J Anim Sci, 88:32–46.
Zhao H, Nettleton D, Soller M, and Dekkers JCM. 2005. Evaluation of linkage disequilibrium measures between multi-allelic markers as predictors of linkage disequilibrium between markers and QTL. Genet Res, 86:77–87.

Chapter 11

Genomic Selection in Aquaculture: Methods and Practical Considerations

Ashok Ragavendran and William M. Muir

Introduction Management of genetic resources in aquaculture has been highlighted in recent years with reviews in selection programs (Gjedrem, 2005) and the application of current advances in genomics (Liu, 2007). Progress in large-scale genomic technologies has led to many more organisms being sequenced and includes some economically important species in aquaculture. The recently initiated Genome 10K (G10K) project (Genome 10K Community of Scientists, 2009) proposed a comprehensive representation of vertebrates, including fish species. Specifically, for fish populations, based on a phylogenetic perspective, 62 out of 62 possible orders (100%) are under plan to be sequenced, along with 424 out of 532 families (80%), 1777 out of 4956 genera (36%), and 4246 of about 31,564 named species (13%), as well as the incorporation of a minimum of 2500 additional species from partner institutions (Genome 10K Community of Scientists, 2009). Genome sequencing projects have discovered large numbers of single-nucleotide polymorphisms (SNPs; e.g., Wong et al., 2004, in poultry), and cost-effective methods to genotype these SNPs are now commercially available under a variety of platforms (Dove, 2005). Current technologies have greatly facilitated the detection of SNPs for different species. For example, significant numbers of SNPs have been reported in channel catfish (He et al., 2003; Wang et al., 2008, 2010), salmon (Hayes et al., 2007), and sea bream (Cenadelli et al., 2007). Recent developments in genomic technology will also increase the availability of SNPs in other species. An example is the Atlantic cod, where next generation sequencing efforts are already under way (Johansen et al., 2009). A recent Food and Agriculture Organization (FAO) report on sustainable aquaculture emphasized the importance of breeding as a key component in meeting the projected fingerling requirement of 2020. 
This working group delineated the following main genetic aspects: (1) genetic management of domesticated stocks and (2) development of improved broodstock (Bondad-Reantaso, 2007). Aquaculture breeding programs are still in their nascent stages compared with selection programs for other livestock species. For many species, production still relies on catching wild broodstock and/or fry to augment existing technologies for
genetic gain (Gjedrem, 2005). Domestication and genetic improvement of fish may offer an advantage in comparison with other domesticated animals. Selection response can, theoretically, be higher in fish and shellfish because of the higher genetic variance in these populations. Additionally, the higher fecundity of aquatic animals allows for higher selection intensities. Aquaculture presents challenges to the application of advanced breeding technologies because pedigrees are hard to maintain; that is, single-pair matings are rarely made. Thus, mass selection, that is, selection based on an individual’s own performance, has been the normative breeding scheme. Even if individual matings occurred, keeping track of individual fish is difficult since individuals cannot be tagged until they are large enough. An alternative would be family selection, but this scheme can be cost prohibitive because different families must be reared separately until tagging is possible (see Fjalestad, 2005, for a detailed review of different schemes). However, molecular tools can alleviate the problems of classical breeding, in which pedigrees are a necessity for breeding evaluations. Recent innovations in molecular technology allow relationships to be estimated from genomic information. Further, a genomic relationship matrix generated with this technology has been shown to be superior to traditional pedigree-based relationships, provided the density of markers is high enough. In the following, we explore the technology and examine how it could be implemented in aquaculture, along with its potential benefits and pitfalls.

Genomic Selection: Definition and Theory

Genomic selection is a variation on marker-assisted selection (MAS) in which genetic markers covering the whole genome, over all chromosomes of the organism, are used. Given this dense coverage, it is possible to assume that all quantitative trait loci (QTL) are in population-wide linkage disequilibrium (LD) with the markers. Therefore, the markers can be used to select for the favorable allele at each QTL without actually identifying the QTL or the functional polymorphism. Conceptually, the denser the markers and the greater the population-wide LD, the better the method works. Without any prior knowledge of QTL positions, equal spacing of markers is considered to be optimal. While genomic selection can be considered a variant of MAS, the goals differ in that research on MAS tends to focus on mapping a few QTL precisely, in the hope that the gene underlying the QTL can be identified (reviewed in Rothschild and Ruvinsky, 2007; for aquaculture, see also Chapter 12). Genomic selection, on the other hand, focuses on prediction of estimated breeding value (EBV) using a random set of markers. Currently, SNPs covering the entire genome are readily available, making genomic selection feasible for many animal species. Meuwissen et al. (2001) pioneered statistical methods to conduct genomic selection using markers covering the whole genome so that all genetic variance can be associated with the markers. Using simulations, they showed that the breeding value of an individual with only marker data could be estimated with an accuracy as high as 0.85. They compared four different methods, namely (1) least squares regression; (2)
a genomic equivalent of best linear unbiased prediction (GS-BLUP); (3) BayesA; and (4) BayesB. BayesA and BayesB are two alternative Bayesian specifications. Using simulated data, they found that the Bayesian approaches provided the best accuracy among the four methods. Application of genomic selection involves prediction of estimated genomic breeding values (GEBVs) for individuals in future generations of selection candidates (CG) based on statistical relationships estimated in the generations in which data on both genotypes and phenotypes were collected, termed training generations (TG). Based on the collected phenotypic data, a two- or three-step procedure is applied to calculate the genomic predictions for the CG using two different modeling approaches. As detailed below, a variety of statistical models can be used to obtain predictions. Most of these methods are based on estimating effects associated with the markers and then summing over all loci. Alternatively, at least one of the methods is based directly on estimating breeding values using classical BLUP mixed-model equations (MMEs), but with a genomic relationship matrix G (VanRaden, 2008; Goddard, 2009) instead of one based on the pedigree. Whatever the statistical methodology, the differences in accuracy among predictive models depend on the true but unknown distribution of SNP effects.

Statistical Analysis to Calculate GEBV from Genome-wide DNA Markers

Most current statistical models employed for genomic selection are based on the methods described by Meuwissen et al. (2001). The basic model can be written as

$$ y = X\beta + Zg + \epsilon, \qquad Zg = M\gamma \quad \text{or} \quad Zg = Zu \tag{11.1} $$

where y is the n × 1 vector of trait phenotypes; β is a vector of systematic effects, usually included to control for macroenvironmental factors such as sex; X is the associated design matrix; and ε is the vector of random residual effects, with Var(ε) ∼ RI, where I is the identity matrix and R is diagonal in a single-trait setting. As shown above, the Zg part of the model depends on the method of analysis and is discussed in detail below. In general, the Zg part captures the effects due to the loci and the genotypes of the animals and is used to generate predictions. With respect to the Zg effects, conceptually, the current methods to estimate the SNP effects can be resolved into two simple approaches:

1. Estimating SNP effects and summing approach (eSNP method): estimating the effect γk of each genotype and summing the effects over all k loci for an individual i:

$$ \mathrm{GEBV} = \hat{u}_i = \sum_k M_k \gamma_k \tag{11.2} $$

2. The genomic relationship matrix (GRM) approach: estimating the relationship among individuals using the genotypes at each locus and calculating the breeding value for any individual; this is equivalent to the “animal” model of traditional breeding:

$$ \mathrm{GEBV} = \hat{u}_i \tag{11.3} $$

The differences between the two methods can be easily visualized when both are written in BLUP form using Henderson’s MMEs, as shown below.

eSNP method: estimating SNP effects and summing approach

$$
\begin{bmatrix}
X^{T}_{1,n} X_{n,1} & X^{T}_{1,n} M_{n,p} \\
M^{T}_{p,n} X_{n,1} & M^{T}_{p,n} M_{n,p} + I\,\dfrac{\sigma_e^2}{\sigma_\gamma^2}
\end{bmatrix}
\begin{bmatrix} \beta_{1,1} \\ \gamma_{p,1} \end{bmatrix}
=
\begin{bmatrix} X^{T}_{1,n}\, y_{n,1} \\ M^{T}_{p,n}\, y_{n,1} \end{bmatrix}
\tag{11.4}
$$

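GRM method: genomic relationship approach

$$
\begin{bmatrix}
X^{T}_{1,n} X_{n,1} & X^{T}_{1,n} Z_{n,n} \\
Z^{T}_{n,n} X_{n,1} & Z^{T}_{n,n} Z_{n,n} + \left( M_{n,p} M^{T}_{p,n} \right)^{-1} I\,\dfrac{\sigma_e^2}{\sigma_u^2}
\end{bmatrix}
\begin{bmatrix} \beta_{1,1} \\ u_{n,1} \end{bmatrix}
=
\begin{bmatrix} X^{T}_{1,n}\, y_{n,1} \\ Z^{T}_{n,n}\, y_{n,1} \end{bmatrix}
\tag{11.5}
$$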

The system of equations in Eq. (11.4) has dimension p, the number of loci, whereas that in Eq. (11.5) has dimension n, the number of individuals. Using Eq. (11.4), we estimate individual SNP effects (γp,1), and with the current availability of genomic information, the matrices can be of the order >50K × 50K, posing a huge computational burden (VanRaden, 2008; Stranden and Garrick, 2009; Misztal et al., 2009). Under the eSNP method (Eq. 11.4), the QTL effect over the entire genome is captured through the marker matrix M, and the result depends on the statistical assumptions regarding the SNP or haplotype effects γk. In the GRM method (Eq. 11.5), the assumption made is that all QTL are of small effect and make equal contributions, and the GRM replaces the usual pedigree relationship matrix A (Habier et al., 2007; VanRaden, 2008; Goddard, 2009). Implementing the eSNP method (Eq. 11.4) usually involves a multistage procedure of two to three steps:

1. evaluation by restricted maximum likelihood (REML) to estimate the variances $\sigma_e^2$ and $\sigma_u^2$ or $\sigma_\gamma^2$ using a traditional “animal” model;
2. estimation of the genomic effects γk for the TG; and finally,
3. prediction of GEBVs for the candidates (which can be combined with the previous step).

While there are advantages to the multistage procedure using Eq. (11.4), such as no changes in the regular evaluations and simple steps for predicting genomic values for CG animals, disadvantages include the need to estimate parameters in the steps, such as prior variances and weights (e.g., deregressed proofs; Garrick et al., 2009), which
can also be computationally expensive (Legarra and Misztal, 2008; VanRaden, 2008). One major advantage of this method is that the marker effects are estimated, allowing one to delineate regions that may contain QTL (e.g., Xu, 2003). However, a major disadvantage is that current methods do not support multitrait evaluations, so analyses have to be undertaken trait by trait. The GRM method (Eq. 11.5) provides a single-step procedure to calculate GEBV, and recently, Misztal and colleagues have proposed a methodology to modify the traditional A matrix to incorporate genomic information (Legarra et al., 2009; Misztal et al., 2009; Aguilar et al., 2010). Additionally, this approach can be implemented using existing methodology and conventional mixed-model software, which allows for simultaneous REML estimation, and it can be extended to multitrait evaluations. The main disadvantages are the restrictive assumption $\gamma_k \sim N(0, \sigma_\gamma^2)$ and the fact that estimates of QTL effects are not directly available.
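
The eSNP route of Eq. (11.4) can be sketched numerically as follows. This is only a minimal illustration under stated assumptions: the function name `esnp_blup`, the simulated genotypes and phenotypes, and the variance ratio `lam` (standing in for the ratio of residual to marker variance, which in practice would come from a REML step) are all hypothetical and are not taken from the studies cited above.

```python
import numpy as np

def esnp_blup(y, X, M, lam):
    """Assemble and solve the eSNP mixed-model equations (Eq. 11.4).

    lam is the assumed-known ratio sigma_e^2 / sigma_gamma^2.
    Returns (beta_hat, gamma_hat).
    """
    q = X.shape[1]                      # number of fixed effects
    p = M.shape[1]                      # number of marker loci
    lhs = np.block([
        [X.T @ X, X.T @ M],
        [M.T @ X, M.T @ M + lam * np.eye(p)],
    ])
    rhs = np.concatenate([X.T @ y, M.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:q], sol[q:]

# Hypothetical toy data: 8 animals, 20 SNPs coded as 0/1/2 allele counts.
rng = np.random.default_rng(1)
n, p = 8, 20
M = rng.integers(0, 3, size=(n, p)).astype(float)
X = np.ones((n, 1))                     # overall mean as the only fixed effect
y = rng.normal(size=n)

beta_hat, gamma_hat = esnp_blup(y, X, M, lam=5.0)
gebv = M @ gamma_hat                    # sum of marker effects per animal (Eq. 11.2)
```

Note that the coefficient matrix grows with the number of markers (here 21 × 21), which is the computational burden the text refers to when p is in the tens of thousands.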

eSNP Method

Conceptually, the eSNP method (Eq. 11.4) is equivalent to regression methodology where the effects of each QTL are estimated from the data, and consequently, many of the effects are zero or close to it. Following Habier et al. (2007), we can define the Zg part in Eq. (11.1) as

$$ Zg = M\gamma = \sum_k m_k \gamma_k \delta_k \tag{11.6} $$

Here, mk is a column vector of marker genotypes at locus k, γk is the marker effect, and δk is a 0/1 indicator variable. In mk, the marker genotype of an individual is coded as the number of copies of one SNP allele it carries, that is, 0, 1, or 2. Using the classical approach for parameterization, we can arbitrarily assign the value $+\tfrac{1}{2}\gamma_k$ to allele 1 and the value $-\tfrac{1}{2}\gamma_k$ to allele 2, resulting in γk being half the difference between the two homozygotes, and the effects of the different genotypes are +γk for genotype 0, 0 for genotype 1, and −γk for genotype 2 (e.g., Lynch and Walsh, 1998). The matrix M of marker values can also account for the breeding structure in its construction, with the dummy variables specified for each locus depending on the type of cross used. For example, in a doubled haploid population, an individual can take only one of the two genotypes, A1A1 and A2A2, at any locus, and the dummy variable is defined as mij = 1 for A1A1 and mij = −1 for A2A2, for an individual i. Defining the genetic effects associated with A1A1 and A2A2 by G11 and G22, respectively, the regression coefficient for a locus j is γj = G11 − G22. The partial regression coefficient γj is the effect of marker j associated with the trait and will partly absorb the effects of all QTL located between markers j − 1 and j + 1 (Xu, 2003). If γk is considered a fixed effect, this is equivalent to a least squares regression, M is a matrix containing columns that represent substitution effects for the genotype at each marker locus, and δk can be 0 or 1. In this setting, n << p, and some kind of variable selection procedure needs to be used to select the best subset of markers. An alternative approach is to fit simple linear regressions for every locus and then use a threshold to select coefficients (e.g., Habier et al., 2007; Meuwissen et al., 2001).
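
The last alternative, fitting a simple linear regression at every locus and keeping the loci that pass a threshold, might look like the sketch below. The function name `single_marker_scan`, the use of a t statistic as the selection criterion, the cutoff of 4, and the simulated data (one QTL placed at locus 7) are assumptions made purely for illustration.

```python
import numpy as np

def single_marker_scan(y, M, threshold):
    """Regress y on each marker column separately and keep the loci whose
    absolute t statistic exceeds `threshold` (a hypothetical cutoff)."""
    n, p = M.shape
    yc = y - y.mean()
    slopes = np.zeros(p)
    tstats = np.zeros(p)
    for k in range(p):
        mk = M[:, k] - M[:, k].mean()       # centered genotype codes at locus k
        ssx = mk @ mk
        if ssx == 0.0:                      # monomorphic marker: no information
            continue
        slopes[k] = (mk @ yc) / ssx         # least squares slope
        resid = yc - slopes[k] * mk
        se = np.sqrt((resid @ resid) / (n - 2) / ssx)
        tstats[k] = slopes[k] / se if se > 0 else np.inf
    selected = np.flatnonzero(np.abs(tstats) > threshold)
    return slopes, tstats, selected

# Hypothetical data: 200 individuals, 50 loci coded 0/1/2, one true QTL.
rng = np.random.default_rng(3)
n, p = 200, 50
M = rng.binomial(2, 0.5, size=(n, p)).astype(float)
y = 1.5 * M[:, 7] + rng.normal(size=n)

slopes, tstats, selected = single_marker_scan(y, M, threshold=4.0)
```

Because each locus is tested on its own, this scan sidesteps the n << p problem, at the price of ignoring correlations among markers.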

If γ is considered random, then the γk’s are assumed to be drawn from a common distribution, the simplest being $\gamma_k \sim N(0, \sigma_\gamma^2)$, and this leads to shrinkage of most of the γk’s toward zero. The total genetic variance, Var(g), of the sum g = Mγ has the variance-covariance structure

$$ \mathrm{Var}(g) = \mathrm{Var}(M\gamma \mid M) = \sigma_\gamma^2\, M M^{T} \tag{11.7} $$

conditional on the observed marker genotypes. The vector of γk’s contains the additive genetic effects that correspond to allele substitution effects for each marker or haplotype and the indicator variable δk = 1 for all marker loci. This model can also be readily extended to include observations based on progeny test merit rather than phenotypes (Garrick et al., 2009) or a general diagonal variance structure for the genetic effects γk, as in BayesA or BayesB (Meuwissen et al., 2001, explained below), wherein each locus can have its own variance.

The GRM Method

The genomic relationship matrix (G) plays an important role as it provides an avenue to implement genomic selection using the traditional “animal” model with a relationship matrix estimated from the markers. This is advantageous in that only one breeding value must be estimated per animal instead of one QTL effect per marker. From Eq. (11.5), the GEBVs are estimated using the traditional BLUP equation (e.g., Mrode, 1996)

$$ \hat{u} = G Z^{T} V^{-1} \left( y - X\hat{\beta} \right) \tag{11.8} $$

or equivalently from (11.5)

$$ \hat{u} = M M^{T} Z^{T} \left( Z M M^{T} Z^{T} + R\,\frac{\sigma_e^2}{\sigma_u^2} \right)^{-1} \left( y - X\hat{\beta} \right) \tag{11.9} $$

Here $G = \sigma_\gamma^2 M M^{T}$. This model is computationally tractable as the matrix dimensions are n × n and, under genomic selection, n << p, where n is the number of individuals and p is the number of SNPs. With breeding programs in aquaculture still in their nascent stages, another attractive feature of this approach is that it is readily applicable to current programs without the development of extensive pedigrees. The variance components can be estimated simultaneously, and software for this kind of analysis, already developed for other livestock, is readily available. Recent developments have shown that, where a pedigree is already available, the G matrix and the A matrix can be combined to utilize information from both the pedigree and the markers (Legarra et al., 2009; Misztal et al., 2009; Aguilar et al., 2010), and accuracies using this approach have been shown to be equivalent to those of the Bayesian implementations BayesA and BayesB (Dr. I. Misztal, pers. comm.). With this new formulation, the GRM approach can also be used when many animals do not have marker genotypes but a pedigree is available.
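
As a hedged numerical sketch of Eq. (11.9), the code below computes breeding values from the n × n marker cross-product matrix and checks them against the sum of marker effects obtained from the marker-level equations of Eq. (11.4). It assumes R = I, one record per animal (Z = I), and a known variance ratio `lam`; the function name `grm_blup` and the simulated data are hypothetical.

```python
import numpy as np

def grm_blup(y, X, Z, M, lam):
    """Breeding values from Eq. (11.9), assuming R = I.

    lam is the variance ratio multiplying R in Eq. (11.9); it is treated
    as known here, whereas in practice it would be estimated (e.g., by REML).
    """
    K = M @ M.T                                  # n x n marker cross-products
    V = Z @ K @ Z.T + lam * np.eye(len(y))       # bracketed term of Eq. (11.9)
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    beta_hat = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)  # generalized least squares
    u_hat = K @ Z.T @ np.linalg.solve(V, y - X @ beta_hat)
    return beta_hat, u_hat

# Hypothetical toy data: 6 animals, 30 SNPs, so n << p.
rng = np.random.default_rng(7)
n, p = 6, 30
M = rng.integers(0, 3, size=(n, p)).astype(float)
X = np.ones((n, 1))
Z = np.eye(n)
y = rng.normal(size=n)
lam = 4.0

beta_hat, u_hat = grm_blup(y, X, Z, M, lam)

# Marker-effect route (Eq. 11.4) with the same variance ratio, for comparison.
lhs = np.block([[X.T @ X, X.T @ M],
                [M.T @ X, M.T @ M + lam * np.eye(p)]])
rhs = np.concatenate([X.T @ y, M.T @ y])
gamma_hat = np.linalg.solve(lhs, rhs)[1:]
u_from_markers = M @ gamma_hat

max_abs_diff = np.max(np.abs(u_hat - u_from_markers))
```

Under these assumptions the two routes give the same predictions up to rounding error, but the GRM route only ever inverts n × n matrices.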

Construction of the Genomic Relationship Matrix (G)

The genomic relationship matrix (G) is constructed from the matrix of marker genotypes M and the allelic frequencies at each marker. VanRaden (2008) presented three different methods of constructing G and evaluated their performance. When allele frequencies are known in the base population, the simplest way to construct the G matrix is as follows (from VanRaden, 2008): Let M be the matrix that specifies which marker alleles each individual inherited. The dimensions of M are n × k, the number of individuals by the number of loci. Construct a matrix P containing allele frequencies expressed as a difference from 0.5 and multiplied by 2, such that column i of P is 2(pi − 0.5). Subtracting P from M centers the genotype codes (the centered matrix is still written M here), which, under the random effects setting, profiles out the mean values of the γk’s. Then the genomic relationship matrix G is

$$ G = \frac{M M^{T}}{2 \sum_k p_k (1 - p_k)} \tag{11.10} $$

as given in VanRaden (2008). Division by $2\sum_k p_k(1 - p_k)$ makes G analogous to A, the numerator relationship matrix, under the assumption that $\gamma \sim N(0, \sigma_\gamma^2 I)$. Alternative methods for constructing the G matrix include a method based on regression on the A matrix and another using a different weighting scheme in Eq. (11.10) (VanRaden, 2008). Matrix G is positive semidefinite but can be singular if the number of loci is limited. If elements of M are set to −1, 0, and 1 for the homozygote, heterozygote, and the other homozygote, respectively, diagonals of $MM^{T}$ count the number of homozygous loci for each individual, and off-diagonals measure the number of alleles shared by relatives. In contrast, diagonals of $M^{T}M$ count the number of homozygous individuals for each locus, and off-diagonals measure the number of times alleles at different loci were inherited by the same individual. A more complicated construction for the GRM method uses a pedigree–genomic relationship matrix H, which allows a joint evaluation using all phenotypic, pedigree, and genomic information (see Legarra et al., 2009; Misztal et al., 2009 for details).
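
The construction in Eq. (11.10) can be sketched in a few lines. The function name `vanraden_g` and the toy genotype table are hypothetical, and, as a stated assumption, allele frequencies are estimated from the sample itself because no base-population frequencies are available in this example.

```python
import numpy as np

def vanraden_g(allele_counts, freqs):
    """Genomic relationship matrix following Eq. (11.10).

    allele_counts: n x k array of 0/1/2 copies of the counted allele.
    freqs: length-k frequencies of that allele (ideally base-population values).
    """
    M = allele_counts - 1.0              # recode genotypes to -1, 0, 1
    P = 2.0 * (freqs - 0.5)              # column k of P holds 2(p_k - 0.5)
    centered = M - P                     # subtract P from every row of M
    return (centered @ centered.T) / (2.0 * np.sum(freqs * (1.0 - freqs)))

# Hypothetical genotypes: 5 individuals (rows) by 4 loci (columns).
counts = np.array([
    [0, 1, 2, 2],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
    [1, 0, 1, 2],
    [0, 0, 2, 2],
], dtype=float)

freqs = counts.mean(axis=0) / 2.0        # assumption: sample frequencies
G = vanraden_g(counts, freqs)
eigenvalues = np.linalg.eigvalsh(G)      # non-negative up to rounding error
```

When sample frequencies are used for centering, each row of G sums to zero, so G is necessarily singular; this is one concrete instance of the singularity mentioned in the text.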

Current Methods of Prediction for Genomic Selection

Currently, the most common methods used for prediction in genomic selection have their basis in the methodologies proposed by Meuwissen et al. (2001). They proposed a mixed-model approach equivalent to ridge regression (GS-BLUP; Eq. 11.4), as well as two hierarchical Bayesian modeling approaches, BayesA and BayesB, using different prior distributions. BayesA and BayesB are analogous to the eSNP method (Eq. 11.4), except that the γk’s are assumed to be random effects distributed according to a specified prior distribution. Current Bayesian approaches are based on modifications of their approach, and a brief overview of some of these methods is given below.

Bayesian Approaches

The rationale behind Bayesian approaches is to enable the incorporation of prior information that can make the model more biologically relevant. Details of the fundamentals of Bayesian methodology are not given here; rather, readers are referred to Sorensen and Gianola (2002) for applications in quantitative genetics, or to any of the recent reviews available in the literature (e.g., Shoemaker et al., 1999; Beaumont and Rannala, 2004). The first application of Bayesian methodology stems from Meuwissen et al. (2001), who proposed two methods, BayesA and BayesB, differing in their assumptions regarding the underlying biology. The differences between BayesA and BayesB, as well as subsequent methods, lie primarily in the prior specification and elicitation. The rationale underlying their proposal was to extend the distributional assumptions on γk in GS-BLUP (Eq. 11.4), where, under the assumption of equal contributions from all loci, each γk is shown to have variance $\mathrm{Var}(g)/K$, where K is the number of loci. They suggested that a more realistic assumption is that each marker locus makes a different contribution and proposed two different prior specifications. The first assumes that the variance can vary across loci with unequal contributions, that is, $\sigma_\gamma^2 = \sum_i \sigma_i^2$, where each $\sigma_i^2$ is different. The $\sigma_i^2$ were assumed to be drawn from a common scaled inverse chi-square distribution, $\chi^{-2}(\nu, S)$, the parameters (ν and S) of which were elicited from assumptions about the genetics underlying the population. This is the basic principle underlying the BayesA methodology. A further extension, BayesB (Eq. 11.11), instead assumes that the allelic contributions (σa) arise from a mixture distribution, with a proportion of loci having differing effects and the other loci having zero effects, that is,

$$
\sigma_a =
\begin{cases}
\sigma_i^2 = 0 & \text{with probability } \pi \\
\sigma_i^2 \sim \chi^{-2}(\nu, S) & \text{with probability } (1 - \pi)
\end{cases}
\tag{11.11}
$$

A key difference between the BayesA and BayesB methods is that in the former all loci are assumed to have some effect, however small, while in the latter many loci contribute zero effects, and among those that do have effects, the effects are not necessarily similar. For a detailed theoretical exposition of the population genetics assumptions that lead to the BayesA and BayesB models, as well as the assumptions underlying the methodology, the reader is referred to Gianola et al. (2009). Subsequent improvements have been based on modifying the assumptions about the prior distributions used. For example, Xu (2003) proposed a modification of the parameters of the $\chi^{-2}$ distribution that is close to an exponential distribution of effects, which was further improved upon by ter Braak et al. (2005) to result in a proper posterior distribution. De Los Campos et al. (2009b) proposed a Bayesian least absolute shrinkage and selection operator (LASSO) approach that has better shrinkage properties and provided a theoretical exposition of the relationship among the above methodologies based on their prior specification.
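
To make the contrast between the two priors concrete, the sketch below draws locus variances and marker effects from a BayesA-type prior (every locus has a nonzero variance) and a BayesB-type prior (Eq. 11.11). It is only an illustration of the prior, not of the full sampling algorithm. Two assumptions are worth flagging: the scaled inverse chi-square is parameterized here as S divided by a chi-square variate with ν degrees of freedom (some texts use νS instead of S), and the values of π, ν, and S are arbitrary illustrative numbers, as is the function name `sample_locus_variances`.

```python
import numpy as np

def sample_locus_variances(n_loci, pi, nu, S, rng):
    """Draw locus variances that are 0 with probability pi and otherwise
    S / chi2(nu) (assumed parameterization of the scaled inverse chi-square)."""
    nonzero = rng.random(n_loci) >= pi
    variances = np.zeros(n_loci)
    variances[nonzero] = S / rng.chisquare(nu, size=int(nonzero.sum()))
    return variances

rng = np.random.default_rng(0)
n_loci = 10_000
nu, S = 4.2, 0.04                        # hypothetical prior parameters

# BayesA-type prior: pi = 0, so every locus receives a nonzero variance.
var_a = sample_locus_variances(n_loci, 0.0, nu, S, rng)

# BayesB-type prior: most loci (here 95%) have exactly zero variance.
var_b = sample_locus_variances(n_loci, 0.95, nu, S, rng)

# Marker effects given their variances: gamma_k ~ N(0, sigma_k^2).
gamma_a = rng.normal(0.0, np.sqrt(var_a))
gamma_b = rng.normal(0.0, np.sqrt(var_b))

share_zero_b = np.mean(gamma_b == 0.0)   # close to pi by construction
```

Plotting histograms of `gamma_a` and `gamma_b` would show the spike at zero that distinguishes the mixture prior.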

Other Approaches

The methods prescribed by Meuwissen et al. (2001) and their extensions can be broadly classified under regression methodologies (de Los Campos et al., 2009b),
whether Bayesian or frequentist, and they assume that most of the underlying genetic variance is additive in nature. However, with ongoing debate on the importance of epistasis and pleiotropy in the architecture of quantitative traits (Mackay, 2001; Hill et al., 2008), semiparametric and nonparametric statistical approaches have been proposed. These include reproducing kernel Hilbert spaces (Gianola et al., 2006; Gianola and van Kaam, 2008; Gonzalez-Recio et al., 2008; de Los Campos et al., 2009a), generalized additive models (GAMs; Bennewitz et al., 2009), and machine learning techniques (Long et al., 2007, 2008a). These models can accommodate the multitude of interactions that would otherwise need to be fitted explicitly, which can result in the “curse of dimensionality” given the large number of marker effects. For a detailed exposition of these methods and their interrelationships, readers are referred to de Los Campos et al. (2009a, b) for a Bayesian perspective and to Piepho (2009) for a review based on the frequentist approach.

Advantages

Genetic Perspective

From a genetic perspective, the implication of the equivalence between the GRM method, with the relationship matrix estimated from the markers, and the “animal model” is that genomic selection works, in part, by estimating the pedigree of each animal. Genomic selection is more accurate than using the pedigree because it uses detected departures of the realized relationship from its pedigree expectation, which arise from random segregation of chromosome segments at meiosis, and the markers track this segregation (Nejati-Javaremi et al., 1997). The accuracy of selection based on GEBV, and the persistence of that accuracy, remain higher than with traditional BLUP, and more so the more generations of data in which both genotypes and phenotypes were collected (Muir, 2007; Sonesson and Meuwissen, 2009). GEBV excels for traits of low heritability regardless of initial equilibrium conditions, as opposed to traditional MAS, which is not useful for traits of low heritability (Muir, 2007). For example, VanRaden et al. (2009) demonstrated the possibility of enhancing genetic progress using genomic information in the North American bovine industry, with gains in empirical reliability for each trait ranging from 5% to 34%. For aquaculture, Nielsen et al. (2009) demonstrated, using simulations of a typical sib-based breeding scheme, that genomic selection can improve the accuracy of selection over traditional EBV estimation by up to 33%, and differences were similar for traits of low heritability. They concluded that this was due to both between- and within-family genetic variances being utilized when estimating GEBVs. The scarcity of breeding programs in aquaculture has been attributed to deterioration of the stock because of a rapid buildup of inbreeding. This can be a result of using few broodstock each generation without identification to prevent reuse, and is especially a problem in species with high fecundity (Gjedrem, 2005).
One main advantage of genomic selection is the expectation of a lower rate of inbreeding (ΔF) relative to BLUP selection. The main reason for this reduction has been shown to be the increased accuracy with which the Mendelian sampling term is estimated. Accurate estimation
of Mendelian sampling allows for better differentiation within families and leads to lower coselection of sibs, which reduces ΔF (Daetwyler et al., 2007).

Production Perspective

From a production perspective, genomic selection affords a few advantages that are specific to aquaculture. With breeding programs still under development in many aquaculture species, adoption of genomic selection can provide immediate avenues of increased genetic gain without the necessity of developing pedigrees. Additionally, benefits also accrue from reduced maintenance costs, as relationships can be captured by the G matrix or even estimated from genotypes. Genetic gain (ΔG) can be quantified by the breeder’s equation

$$ \Delta G = \frac{\text{Intensity of selection} \times \text{Accuracy of selection} \times \text{Genetic standard deviation}}{\text{Generation interval}} \tag{11.12} $$
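
A small worked example of Eq. (11.12) shows how accuracy and generation interval combine. All input values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not estimates for any real breeding program, and the function name `genetic_gain` is likewise illustrative.

```python
def genetic_gain(intensity, accuracy, genetic_sd, generation_interval):
    """Expected genetic gain per unit of time from the breeder's equation (Eq. 11.12)."""
    if generation_interval <= 0:
        raise ValueError("generation_interval must be positive")
    return intensity * accuracy * genetic_sd / generation_interval

# Hypothetical conventional scheme: accuracy 0.5, generation interval of 4 years.
conventional = genetic_gain(intensity=2.0, accuracy=0.5,
                            genetic_sd=1.0, generation_interval=4.0)

# Hypothetical genomic scheme: higher accuracy and a halved generation interval.
genomic = genetic_gain(intensity=2.0, accuracy=0.75,
                       genetic_sd=1.0, generation_interval=2.0)

ratio = genomic / conventional
```

With these made-up inputs the gain per year is 0.25 genetic standard deviations for the conventional scheme and 0.75 for the genomic scheme, a threefold difference that comes from both the higher accuracy and the shorter interval.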

Genomic selection can reduce generation intervals by allowing progeny to be selected on the basis of their genotypes, thus increasing ΔG. This can result not only in increased gain but also in reduced time in the production cycle. Schaeffer (2006) provides an example of a realistic calculation of the estimated costs of genomic selection versus traditional progeny testing and shows that the associated costs can be reduced by as much as 90% of the original cost. As mentioned above, genomic selection also has a high accuracy of selection, and this can lead to rapid genetic gain in aquaculture, where breeding schemes are in their infancy. With the high reproductive capacity of species used in aquaculture, a nucleus breeding scheme is an attractive option for transferring genetic gain to producers (Refstie and Gjedrem, 2005). Crossbreeding is successful for some species such as carp varieties, sea bream, rainbow trout, and oyster, and hybridization techniques have also been used in rainbow trout and African catfishes (Hulata, 2001). Therefore, prediction of crossbred performance from purebred data, and thus the potential to use training analyses from one breed to predict the performance of other breeds, can reduce production costs. Simulations have shown that, with genomic selection, training in purebreds can be used to predict crossbred performance (Ibanez-Escriche et al., 2009; Hayes et al., 2009a; Kizilkaya et al., 2010; Toosi et al., 2010), and this can be of high utility when breeding goals are aimed at producing infertile offspring, as can be necessary for some aquaculture species (Knibb, 2000).

Current Unknowns

Genomic selection shows great promise in revolutionizing the current schemes of breeding, with the potential for increased genetic gain by reducing generation intervals and increasing the accuracy of selection. However, genomic selection is in its infancy
and, along with the great strides being made in the underlying genomic technology, still has some questions unanswered. Genomic selection as introduced by Meuwissen et al. (2001) was shown to be efficient under a number of restrictive assumptions, such as equally spaced QTL centered between two markers, populations in mutation–drift equilibrium (MDE), and a trait with a high level of heritability (h2 = 0.5). Muir (2007) outlined other unresolved issues and presented a preliminary assessment of them. These included traits with low heritability, distributional assumptions regarding marker and QTL frequencies, the number of generations of data needed on phenotype and genotype for accurate prediction, the importance of new versus existing LD between markers and QTL, marker density and number, the number of animals genotyped per generation, and the impact of selection on new and existing LD. A burgeoning body of literature has dealt with these issues (Avendano et al., 2005; Daetwyler et al., 2007; Calus et al., 2008; Legarra and Misztal, 2008; Solberg et al., 2008, 2009a, b; Garrick et al., 2009; VanRaden et al., 2009; Kizilkaya et al., 2010); a major caveat underlying these assessments is that they are based primarily on simulated scenarios. Genomic selection has been shown to work in principle, and experimental implementation is currently under way in most livestock species, for example, poultry (www.worldpoultry.net/news/2410-mln-dna-technology-selection-project-in-poultrybreeding-3150.html), dairy cattle (Hayes et al., 2009a), and beef cattle (VanRaden et al., 2009). Therefore, we currently know that genomic selection works mainly from simulation and a few limited experimental results, and more validation is required. Current limitations can be broadly classified into those that are methodological and those that involve optimization of data resources for implementation.
We briefly consider both aspects and highlight some of the important issues.

Methodological

A major consideration with regard to genomic selection is the accuracy of predictions in the following contexts:

• Long-term gain: GEBVs are based on the predictions of effects of SNPs that are in LD with the QTL or are the QTL themselves. Selection can change the LD structure in subsequent generations by genetic drift as a result of reduction in population size. Additionally, allele frequencies across generations are also modified by selection, and this can impact predictions, especially for methods using the genomic relationship matrix. Proposed methodological developments to overcome these limitations include adding a polygenic component to current models (Muir, 2007) or finding an optimal index to maximize the long-term selection response with variable marker weights based on allele frequency (Goddard, 2009). Including a polygenic effect showed a reduction in bias rather than an improvement in accuracy (Solberg et al., 2009b). This can be important when estimates from genome-wide markers are used to estimate breeding values over more than a single generation and requires more study.
• Prediction across families and lines: Genomic selection is contingent on the LD structure in the population, and it is well known that linkage phase can be affected
by population structure. LD has also been implicated in the difficulty of implementing MAS in breeding programs, as separate marker sets are usually needed for each trait, and linkage phase variants cause the markers to be incorrect in some families or breeds. For commercial applications, there is interest in both selection of purebred lines based on crossbred performance and vice versa. This can be important in the context of aquaculture, where hybridization strategies are important to improve production (e.g., rainbow trout) or for the maintenance of nucleus populations for broodstock. Using simulations, it has been shown that model training in purebreds can be used for prediction of crossbred performance (Ibanez-Escriche et al., 2009; Kizilkaya et al., 2010), and that a multibreed reference population (Hayes et al., 2009a) or admixed populations (Toosi et al., 2010) can be used to predict performance of the purebred nucleus. However, there was considerable reduction in accuracies when the QTL in the training populations were not included in the candidate populations. This was shown to be related to the density of markers across the genome, and it is suggested that haplotyping strategies might merit more consideration in this context (Kizilkaya et al., 2010; Toosi et al., 2010).
• Statistical and computational: As reviewed earlier, considerable attention has been devoted to the application of a wide variety of statistical approaches and to the examination of their performance on the basis of accuracy. However, only a few studies have focused on the computational demands of these methodologies in realistic situations, for both the eSNP method (Legarra and Misztal, 2008; VanRaden, 2008) and the GRM approach (Misztal et al., 2009). Bayesian methods can be computationally complex and demanding, and thus far there has been only a single study on their computational demands, an improved implementation of BayesB (Meuwissen et al., 2009).
With the broad scope of statistical approaches available, the choice of methodology for implementation in a production setting will depend on both the accuracy of prediction as well as the computational overhead, which demands further investigation.

Data Requirements

Since the introduction of genomic selection, most research has been devoted to comparing predictive statistical models in search of the greatest accuracy of selection. This body of work, principally based on simulations, demonstrated that genomic selection has great potential to increase rates of genetic gain by increasing accuracy of selection. With experimental data still becoming available, several key questions remain regarding data requirements that can affect these gains in conjunction with the methods of analysis.

• Density of markers: The density to be employed in the SNP chip panel is a function of the cost of genotyping and the number of animals to be genotyped based on the breeding scheme. For instance, under a sib testing scheme in aquaculture, it was shown that there was a reduction in genetic gain with reduced sib testing when training was not conducted every generation (Sonesson and Meuwissen, 2009), which can potentially be offset by lower genotyping costs. As expected, accuracy of selection increases with increasing marker density, and when markers are in high LD there is no benefit to using haplotype information for traits with low heritabilities (Calus et al., 2008; Solberg et al., 2008). In production, this implies that using direct marker effects may be more advantageous, as it avoids the errors associated with estimation of marker phases. The number of markers needed for any specific genome will ultimately depend on the LD and therefore on the effective population size (Ne) in the base generation. The accuracy of GEBV was shown to be higher for populations with smaller Ne, and the density of SNPs to be employed will be contingent on this factor. Solberg et al. (2008) estimated that ∼24,000 SNPs would be needed in a 30M genome with Ne = 100 and for aquaculture species, where broodstock is collected from wild fry, this may be an order of magnitude higher.

• Individuals for the TG: Prediction accuracies depend on the amount of variation captured by the markers, which in turn depends on the LD, the heritabilities, and Ne in the target population. Most comparisons on the efficacy of genomic selection have been based on approximately 1000 individuals in the TG (e.g., Meuwissen et al., 2001) and this is shown to be highly effective, with accuracies close to 80%. While no single study has extensively covered the number of phenotypes required in the TG, it is understood that Ne and the heritability of the trait play key roles in this regard. For example, if the heritability of the trait is 0.3 and Ne = 100, it is estimated that ∼12,474 individuals with genotypes and phenotypes are required in order to predict GEBVs of candidates in the same population with an accuracy of 0.7, while increasing Ne can also increase the number of required individuals (Hayes et al., 2009b).
However, with a higher heritability (h2 = 0.5) and a larger Ne = 1000, it was shown that accuracies are fairly consistent across numbers ranging from 500 to 100,000 under simplifying analytical considerations (table 1 of Goddard, 2009).
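The qualitative dependence of accuracy on training-set size and heritability can be illustrated with a simple deterministic approximation, r = sqrt(N h² / (N h² + Me)), where N is the number of individuals in the TG and Me is the number of independent chromosome segments. This approximation is not taken from the studies cited above; it is a commonly used rule of thumb, and the value of Me below is purely hypothetical (Me is expected to grow with Ne and genome length). The sketch is therefore not expected to reproduce the specific figures quoted from Hayes et al. (2009b) or Goddard (2009).

```python
import math

def expected_accuracy(n_records, h2, m_e):
    """Approximate accuracy of GEBV: sqrt(N*h2 / (N*h2 + Me))."""
    signal = n_records * h2
    return math.sqrt(signal / (signal + m_e))

def records_needed(target_accuracy, h2, m_e):
    """Invert the approximation: training-set size for a target accuracy."""
    r2 = target_accuracy ** 2
    return m_e * r2 / (h2 * (1.0 - r2))

# Hypothetical value of Me, chosen only for illustration.
M_E = 1000

for n in (500, 1000, 5000, 20000):
    print(n, round(expected_accuracy(n, h2=0.3, m_e=M_E), 3))

print("records for r = 0.7:", round(records_needed(0.7, h2=0.3, m_e=M_E)))
```

Under this approximation, accuracy rises with N and with heritability and falls as Me increases, which is consistent with the roles of heritability and Ne described above.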

Future Directions Genomic selection is poised to make significant impacts on livestock breeding evaluations in the next few years. The concurrent rapid advances in genomic technology will lead to the adoption of this approach in most prevailing breeding programs in livestock production. In principle, genomic selection has been shown to increase genetic gain and selection accuracies and the methodologies will be refined in the near future with experimental validation. The next step, after refinement of the methodology, will be the implementation of this approach in production systems and this requires integration of genetic performance with economic considerations (Harris and Newman, 1994). One important aspect of the economics of genomic selection is the cost of genotyping, and thus the genotyping strategies to be employed, especially when genotyping many individuals in the production setting. For example, in sib testing scenarios utilized in aquaculture, the optimal number of sibs to be genotyped depends on a balance achieved between the savings in genotyping costs and reduction in the genetic


gain (Sonesson and Meuwissen, 2009). An alternative might be to employ low-cost smaller chips, with a low density of SNPs, providing approximately similar gains. As shown in Holstein sires, a low-cost assay with 750 to 1000 SNP out of ∼33,000 SNPs can provide roughly two-thirds of the gain in reliability relative to the high-density assay (Weigel et al., 2009). Many questions remain regarding these alternative approaches, primarily concerned with the subsets of SNPs used as well as how to combine the low-density with the high-density assay. For instance, in determining broiler mortality under differing environments, it was found that SNP subsets differed and there was a low level of LD between the two subsets (Long et al., 2008b). This suggests that the low-density chips developed might be trait- and/or environment-specific and may not result in the reduction of genotyping costs under a multitrait production setting. Additionally, the location of the SNPs used in the chip can be of importance and depends on the information available through dense genotyping. In Holstein sires, it was noted that the equally spaced SNPs may be optimal when both parents have dense genotypic data, whereas SNPs of large effects are optimal when only sires have dense genotypic data (Weigel et al., 2009). Another alternative approach is using a combination of high-density SNPs in the TG and imputation of unobserved SNP markers for low-density SNPs, evenly spaced, for the candidate generation. In simulations, this was shown to be independent of the number of QTLs as well as the methods used to estimate the trait (Habier et al., 2009). However, a recent study using experimental data from layers suggests that incorporation of LD as well as minor allele frequencies to favor SNPs in regions with lower fixation or LD can improve predictions (Vereijken et al., 2010). 
In aquaculture, where sib testing is a primary strategy, it is shown that the cost of genotyping can be an impediment to realize the benefits of increased genetic gain (Sonesson and Meuwissen, 2009). Some of the above strategies can be applied to reduce the cost of sib testing and thus maximize the benefits of using genomic selection.

Workflow for Genomic Selection

The overall workflow for conducting genomic selection in a target species can be broadly divided into (1) a single initial setup phase and (2) a continual phase that runs throughout the breeding program. The workflow is outlined in Figure 11.1 and briefly detailed below.

Phase 1

This is the initial phase for setting up the program of genomic selection and involves the following steps for the species under consideration. We begin with the assumption that this species has not been sequenced yet, although some of the steps may not be necessary depending on the current state of the target genome:


[Figure 11.1: flowchart titled "Workflow for Genomic Selection". Phase I boxes: Bioinformatics of Target Genome; SNP Chip Development; Design Breeding Scheme. Phase II boxes: Genotyping Individuals; Model Training/Statistical Analysis; GEBV Prediction; Selection; Production.]

Figure 11.1 A conceptual workflow for genomic selection. Phase I is the setup phase for genomic selection wherein the bioinformatics and the design of the genomic selection program is conducted. Phase II represents a continual cycle of model training and prediction. Information can flow across components of both phases to enhance the breeding program.

1. Bioinformatics of the target genome: Depending on the current status of the genome, de novo sequencing and assembly may be necessary. The next step is to find the relevant SNPs across the entire genome, and this may require multiple sequences, depending on the quality of the assembly.
2. SNP chip development: When SNPs across the genome are known, the next step is the development of a SNP chip for genotyping individuals, for which an array of platforms is available (see Jenkins and Gibson, 2002, for a review of methodological platforms). The technology is still improving in both the number of SNPs available and the cost, and >30 solution providers were identified as early as 2005 (see table 1 of Dove, 2005).
3. Design of breeding schemes: This step involves choosing the type of selection to be used (mass selection or family selection); the number and types of chips; the number of TGs and the number of individuals in each TG; and so on.


Phase 2

This phase involves a continual cycle of analysis including genotyping individuals; model training/statistical analysis; generating breeding value predictions (GEBVs) for the candidates; selection of individuals; and finally, incorporating the selected individuals into production schemes. The flow of information to improve genetic gains can be bidirectional based on the breeding scheme utilized (Figure 11.1).
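One pass through the Phase 2 cycle can be sketched as follows. This is a minimal illustration rather than a prescribed implementation: genotypes and phenotypes are simulated, ridge regression on marker effects stands in for the training model (one of several possible choices), and all function names, population sizes, and the shrinkage parameter are assumptions made for the example.

```python
import numpy as np

def train_marker_effects(genotypes, phenotypes, shrinkage):
    """Ridge-regression estimates of marker effects from the training group."""
    marker_means = genotypes.mean(axis=0)
    centered = genotypes - marker_means          # center allele counts per marker
    mean_pheno = phenotypes.mean()
    lhs = centered.T @ centered + shrinkage * np.eye(centered.shape[1])
    effects = np.linalg.solve(lhs, centered.T @ (phenotypes - mean_pheno))
    return marker_means, effects

def predict_gebv(genotypes, marker_means, effects):
    """GEBV of candidates: centered genotypes times estimated marker effects."""
    return (genotypes - marker_means) @ effects

def select_top(gebv, n_selected):
    """Indices of the candidates with the highest GEBV."""
    return np.argsort(gebv)[::-1][:n_selected]

# Simulated data (all values hypothetical).
rng = np.random.default_rng(1)
n_train, n_candidates, n_markers, h2 = 200, 50, 50, 0.5
true_effects = rng.normal(size=n_markers)
train_geno = rng.binomial(2, 0.5, size=(n_train, n_markers)).astype(float)
cand_geno = rng.binomial(2, 0.5, size=(n_candidates, n_markers)).astype(float)
train_tbv = train_geno @ true_effects
cand_tbv = cand_geno @ true_effects
residual_sd = np.sqrt(train_tbv.var() * (1 - h2) / h2)
train_pheno = train_tbv + rng.normal(scale=residual_sd, size=n_train)

# Model training: shrinkage set from h2 and allele frequencies (an assumed choice).
allele_freq = train_geno.mean(axis=0) / 2
shrinkage = (1 - h2) / h2 * np.sum(2 * allele_freq * (1 - allele_freq))
marker_means, effects = train_marker_effects(train_geno, train_pheno, shrinkage)

# GEBV prediction for candidates, then selection of the top ten.
gebv = predict_gebv(cand_geno, marker_means, effects)
selected = select_top(gebv, n_selected=10)
accuracy = np.corrcoef(gebv, cand_tbv)[0, 1]
print("selected candidates:", selected)
print("correlation of GEBV with simulated true breeding value:", round(accuracy, 2))
```

In a real program the selected individuals would become parents, and their offspring would feed the next round of genotyping and, when phenotypes are available, retraining.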

Summary Emerging molecular technologies have led to the improvement of selection methodologies using a combination of traditional breeding and molecular tools. One such approach is genomic selection, and for aquaculture breeding schemes, genomic selection may prove very useful. The utility of genomic selection is multifold; the primary advantage being the potential for immediate implementation without an established breeding program. In aquaculture, where breeding programs are still in their infancy, genomic selection provides avenues for increased genetic gain. Additionally, genomic selection is of high utility where breeding goals include many traits that are based on information from the sibs and not from the candidates, as when using family selection schemes in aquaculture. Furthermore, in programs with mass selection, the genomic selection approach can lead to lowered inbreeding relative to phenotypic selection.

References Aguilar I, Misztal I, Johnson DL, Legarra A, Tsuruta S, and Lawlor TJ. 2010. Hot topic: A unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score. J Dairy Sci, 93:743–752. Avendano S, Woolliams JA, and Villanueva B. 2005. Prediction of accuracy of estimated Mendelian sampling terms. J Anim Breed Genet, 122:302–308. Beaumont M and Rannala B. 2004. The Bayesian revolution in genetics. Nat Rev Genet, 5:251–261. Bennewitz J, Solberg T, and Meuwissen THE. 2009. Genomic breeding value estimation using nonparametric additive regression models. Genet Sel Evol, 41:20. Bondad-Reantaso M. 2007. Assessment of freshwater fish seed resources for sustainable aquaculture. FAO Fish Tech Paper, 501:1–669. ter Braak CJF, Boer MP, and Bink MCAM. 2005. Extending Xu’s Bayesian model for estimating polygenic effects using markers of the entire genome. Genetics, 170:1435– 1438. Calus MPL, Meuwissen THE, de Roos APW, and Veerkamp RF. 2008. Accuracy of genomic selection using different methods to define haplotypes. Genetics, 178:553–561. Cenadelli S, Maran V, Bongioni G, Fusetti L, Parma P, and Aleandri R. 2007. Identification of nuclear SNPs in gilthead seabream. J Fish Biol, 70:399–405. Daetwyler HD, Villanueva B, Bijma P, and Woolliams JA. 2007. Inbreeding in genome-wide selection. J Anim Breed Genet, 124:369–376. Dove A. 2005. The SNPs are down: genotyping for the rest of us. Nat Methods, 2:989.


Fjalestad KT. 2005. Selection methods. In: Selection and Breeding Programs in Aquaculture, edited by T Gjedrem. Springer, Dordrecht, pp. 159–170. Garrick DJ, Taylor JF, and Fernando RL. 2009. Deregressing estimated breeding values and weighting information for genomic regression analyses. Genet Sel Evol, 41:55. Genome 10K Community of Scientists. 2009. Genome 10k: A proposal to obtain wholegenome sequence for 10,000 vertebrate species. J Hered, 100:659–674. Gianola D, and van Kaam JBCHM 2008. Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics, 178:2289– 2303. Gianola D, Fernando RL, and Stella A. 2006. Genomic-assisted prediction of genetic value with semi-parametric procedures. Genetics, 173:1761–1776. Gianola D, de los Campos G, Hill WG, Manfredi E, and Fernando RL. 2009. Additive genetic variability and the Bayesian alphabet. Genetics, 183:347–363. Gjedrem T. 2005. Selection and Breeding Programs in Aquaculture. Springer, Dordrecht. Goddard M. 2009. Genomic selection: Prediction of accuracy and maximization of long term response. Genetica, 136:245–257. Gonzalez-Recio O, Gianola D, Long N, Weigel KA, Rosa GJM, and Avendano S. 2008. Nonparametric methods for incorporating genomic information into genetic evaluations: an application to mortality in broilers. Genetics, 178:2305–2313. Habier D, Fernando RL, and Dekkers JCM. 2007. The impact of genetic relationship information on genome-assisted breeding values. Genetics, 177:2389–2397. Habier D, Fernando RL, and Dekkers JCM. 2009. Genomic selection using low-density marker panels. Genetics, 182:343–353. Harris DL and Newman S. 1994. Breeding for profit: Synergism between genetic improvement and livestock production (A review). J Anim Sci, 72:2178–2200. Hayes BJ, Laerdahl J, Lien S, Moen T, Berg P, Hindar K, Davidson W, Koop B, Adzhubei A, and Hoyheim B. 2007. 
An extensive resource of single nucleotide polymorphism markers associated with Atlantic salmon (Salmo salar) expressed sequences. Aquaculture, 265:82–90. Hayes BJ, Bowman PJ, Chamberlain AJ, and Goddard ME. 2009a. Invited review: Genomic selection in dairy cattle: Progress and challenges. J Dairy Sci, 92:433–443. Hayes BJ, Visscher PM, and Goddard ME. 2009b. Increased accuracy of artificial selection by using the realized relationship matrix. Genet Res, 91:47–60. He C, Chen L, Simmons M, Li P, Kim S, and Liu Z. 2003. Putative SNP discovery in interspecific hybrids of catfish by comparative EST analysis. Anim Genet, 34:445–448. Hill WG, Goddard ME, and Visscher PM. 2008. Data and theory point to mainly additive genetic variance for complex traits. PLoS Genet, 4:e1000008. Hulata G. 2001. Genetic manipulations in aquaculture: A review of stock improvement by classical and modern technologies. Genetica, 111:155–173. Ibanez-Escriche N, Fernando RL, Toosi A, and Dekkers JCM. 2009. Genomic selection of purebreds for crossbred performance. Genet Sel Evol, 41:12. Jenkins S and Gibson N. 2002. High-throughput SNP genotyping. Comp Funct Genomics, 3:57–66. Johansen SD, Coucheron DH, Andreassen M, Karlsen BO, Furmanek T, Jorgensen TE, Emblem A, Breines R, Nordeide JT, Moum T, Nederbragt AJ, Stenseth NC, and Jakobsen KS. 2009. Large-scale sequence analyses of Atlantic cod. Nat Biotechnol, 25:263–271. Kizilkaya K, Fernando RL, and Garrick DJ. 2010. Genomic prediction of simulated multibreed and purebred performance using observed fifty thousand single nucleotide polymorphism genotypes. J Anim Sci, 88:544–551. Knibb W. 2000. Genetic improvement of marine fish—Which method for industry? Aquacult Res, 31:11–23.


Legarra A and Misztal I. 2008. Technical note: Computing strategies in genome-wide selection. J Dairy Sci, 91:360–366. Legarra A, Aguilar I, and Misztal I. 2009. A relationship matrix including full pedigree and genomic information. J Dairy Sci, 92:4656–4663. Liu Z, ed. 2007. Aquaculture Genome Technologies. Blackwell Publishing, Ames, IA. Long N, Gianola D, Rosa GJM, Weigel KA, and Avendano S. 2007. Machine learning classification procedure for selecting SNPs in genomic selection: Application to early mortality in broilers. J Anim Breed Genet, 124:377–389. Long N, Gianola D, Rosa GJM, Weigel KA, and Avendano S. 2008a. Machine learning classification procedure for selecting SNPs in genomic selection: application to early mortality in broilers. Dev Biol (Basel), 132:373–376. Long N, Gianola D, Rosa GJM, Weigel KA, and Avendano S. 2008b. Marker-assisted assessment of genotype by environment interaction: A case study of single nucleotide polymorphismmortality association in broilers in two hygiene environments. J Anim Sci, 86:3358–3366. de Los Campos G, Gianola D, and Rosa GJM. 2009a. Reproducing kernel Hilbert spaces regression: A general framework for genetic evaluation. J Anim Sci, 87:1883–1887. de Los Campos G, Naya H, Gianola D, Crossa J, Legarra A, Manfredi E, Weigel K, and Cotes JM. 2009b. Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics, 182:375–385. Lynch M and Walsh B. 1998. Genetics and Analysis of Quantitative Traits. Sinauer, Sunderland, MA. Mackay T. 2001. The genetic architecture of quantitative traits. Annu Rev Genet, 35:303–339. Meuwissen THE, Hayes BJ, and Goddard ME. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157:1819–1829. Meuwissen THE, Solberg TR, Shepherd R, and Woolliams JA. 2009. A fast algorithm for BayesB type of prediction of genome-wide estimates of genetic value. Genet Sel Evol, 41:2. Misztal I, Legarra A, and Aguilar I. 2009. 
Computing procedures for genetic evaluation including phenotypic, full pedigree, and genomic information. J Dairy Sci, 92:4648–4655. Mrode R. 1996. Linear Models for the Prediction of Animal Breeding Values, 2nd edn. CABI, Wallingford, Oxfordshire, UK. Muir WM. 2007. Comparison of genomic and traditional BLUP-estimated breeding value accuracy and selection response under alternative trait and genomic parameters. J Anim Breed Genet, 124:342–355. Nejati-Javaremi A, Smith C, and Gibson JP. 1997. Effect of total allelic relationship on accuracy of evaluation and response to selection. J Anim Sci, 75:1738–1745. Nielsen HM, Sonesson AK, Yazdi H, and Meuwissen THE. 2009. Comparison of accuracy of genome-wide and BLUP breeding value estimates in sib based aquaculture breeding schemes. Aquaculture, 289:259–264. Piepho HP. 2009. Ridge regression and extensions for genomewide selection in maize. Crop Sci, 49:1165–1176. Refstie T and Gjedrem T. 2005. Reproductive traits in aquatic animals. In: Selection and Breeding Programs in Aquaculture, edited by, T Gjedrem. Springer, Dordrect, pp. 113–120. Rothschild MF and Ruvinsky A. 2007. Marker-assisted selection for aquaculture species. In: Aquaculture Genome Technologies, edited by Z. Liu. Blackwell Publishing, Ames, IA, pp. 199–214. Schaeffer LR. 2006. Strategy for applying genome-wide selection in dairy cattle. J Anim Breed Genet, 123:218–223. Shoemaker JS, Painter IS, and Weir BS. 1999. Bayesian statistics in genetics: A guide for the uninitiated. Trends Genet, 15:354–358. Solberg TR, Sonesson AK, Woolliams JA, and Meuwissen THE. 2008. Genomic selection using different marker types and densities. J Anim Sci, 86:2447–2454.


Solberg TR, Sonesson AK, Woolliams JA, and Meuwissen THE. 2009a. Reducing dimensionality for prediction of genome-wide breeding values. Genet Sel Evol, 41:29. Solberg TR, Sonesson AK, Woolliams JA, Odegard J, and Meuwissen THE. 2009b. Persistence of accuracy of genome-wide breeding values over generations when including a polygenic effect. Genet Sel Evol, 41:53. Sonesson AK and Meuwissen THE. 2009. Testing strategies for genomic selection in aquaculture breeding programs. Genet Sel Evol, 41:37. Sorensen D and Gianola D. 2002. Likelihood, Bayesian and MCMC Methods in Quantitative Genetics. Springer-Verlag, New York. Stranden I and Garrick DJ. 2009. Technical note: Derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. J Dairy Sci, 92:2971–2975. Toosi A, Fernando RL, and Dekkers JCM. 2010. Genomic selection in admixed and crossbred populations. J Anim Sci, 88:32–46. VanRaden PM. 2008. Efficient methods to compute genomic predictions. J Dairy Sci, 91:4414–4423. VanRaden PM, Tassell CPV, Wiggans GR, Sonstegard TS, Schnabel RD, Taylor JF, and Schenkel FS. 2009. Invited review: reliability of genomic predictions for North American Holstein bulls. J Dairy Sci, 92:16–24. Vereijken A, Albers GAA, and Visscher J. 2010. Imputation of SNP genotypes in chicken using a reference panel with phased haplotypes. 9th World Congress on Genetics Applied to Livestock Production, Leipzig, Germany. Wang S, Sha Z, Sonstegard TS, Liu H, Xu P, Somridhivej B, Peatman E, Kucuktas H, and Liu Z. 2008. Quality assessment parameters for EST-derived SNPs from catfish. BMC Genomics, 9:450.
Wang S, Peatman E, Abernathy J, Waldbieser G, Lindquist E, Richardson P, Lucas S, Wang M, Li P, Thimmapuram J, Liu L, Vullaganti D, Kucuktas H, Murdock C, Small BC, Wilson M, Liu H, Jiang Y, Lee Y, Chen F, Lu J, Wang W, Somridhivej B, Baoprasertkul P, Quilang J, Sha Z, Bao B, Wang Y, Wang Q, Takano T, Nandi S, Liu S, Wong L, Kaltenboeck L, Xu P, Quiniou S, Bengten E, Miller N, Trant J, Rokhsar D, Liu Z, and Catfish Genome Consortium. 2010. Assembly of 500,000 inter-specific catfish expressed sequence tags and large scale gene-associated marker development for whole genome association studies. Genome Biol., 11:R8. Weigel KA, de los Campos G, Gonzalez-Recio O, Naya H, Wu XL, Long N, Rosa GJM, and Gianola D. 2009. Predictive ability of direct genomic values for lifetime net merit of Holstein sires using selected subsets of single nucleotide polymorphism markers. J Dairy Sci, 92:5248–5257. Wong GKS, Liu B, Wang J, Zhang Y, Yang X, Zhang Z, Meng Q, Zhou J, Li D, Zhan J, Ni P, Li S, Heng Li LR, Zhang J, Li R, Li S, Zheng H, Lin W, Li G, Wang X, Zhao W, Li J, Ye C, Dai M, Ruan J, Zhou Y, Li Y, He X, Zhang Y, Xiangang Huang JW, Tong W, Chen J, Ye J, Chen C, Wei N, Li G, Dong L, Lan F, Sun Y, Zhang Z, Yang Z, Yu Y, Huang Y, He D, Xi Y, Wei D, Qi Q, Li W, Shi J, Wang M, Xie F, Wang J, Zhang X, Wang P, Zhao Y, Li N, NingYang DW, Hu S, Zeng C, Zheng W, Hao B, Hillier LW, Yang SP, Warren WC, Wilson RK, Brandstrom M, Ellegren H, Crooijmans RPMA, van der Poel JJ, Bovenhuis H, Groenen MAM, Ovcharenko I, Gordon L, Stubbs L, Lucas S, Glavina T, Aerts A, Kaiser P, Rothwell L, Young JR, Rogers S, Walker BA, van Hateren A, Kaufman J, Bumstead N, Lamont SJ, Zhou H, Hocking PM, Morrice D, de Koning DJ, Law A, Bartley N, Burt DW, Hunt H, Cheng HH, Gunnarsson U, Wahlberg P, Andersson L, Kindlund E, Tammi MT, Andersson B, Webber C, Ponting CP, Overton IM, Boardman PE, Tang H, Hubbard SJ, Wilson SA, Yu J, Wang J, Yang H, and Consortium ICPM. 2004. 
A genetic variation map for chicken with 2.8 million single-nucleotide polymorphisms. Nature, 432:717–722. Xu S. 2003. Estimating polygenic effects using markers of the entire genome. Genetics, 163:789–801.

Chapter 12

Comparison of Index Selection, BLUP, MAS, and Whole Genome Selection

Zhenmin Bao

Genetic evaluation of breeding values is the focus of quantitative genetics and breeding. It is also the basis for selection and mating plans in modern animal breeding programs. Accurate estimation of breeding values is the goal breeders tirelessly pursue. The essence of selection, whether for a single trait or for multiple traits, is to assess the true breeding values of individuals as accurately as possible. Such assessment is usually made with modern statistical analysis techniques using various data, including phenotypic and genetic information, of the individuals and their families to provide an estimated breeding value (EBV). In other words, parents are selected using EBVs to obtain maximal genetic gain in economically important traits and thus the largest economic benefit. The development of methods for predicting breeding values has passed through four phases: selection using the selection index, selection using the mixed linear model method, selection using marker-assisted selection (MAS), and most recently, whole genome-based selection. In this chapter, I will summarize the theories and procedures and, whenever possible, compare these methods.

Selection Index-based Selection

Selection Index Method

The selection index is the earliest stage of breeding value estimation. In actual animal breeding, a number of economic traits are often involved, and selection for only a single trait is rare. Selection of multiple traits can be achieved through tandem selection, independent culling, and the selection index method. The selection index method is an estimation method in which an index of breeding value is formed by appropriately weighting all existing phenotypic data, including the phenotypic information of the individuals, siblings, progenitors, and progenies. The selection index method is one of the oldest selection methods. It was first applied to the practice of animal breeding by Hazel (1943). The general formula of the comprehensive selection index is

$$I = b_1 x_1 + b_2 x_2 + \cdots + b_n x_n = \sum_{i=1}^{n} b_i x_i = \mathbf{b}'\mathbf{X},$$

where $I$ is the comprehensive selection index, $b_i$ is the partial regression coefficient of the index on the phenotypic value of trait $i$, $x_i$ is the phenotypic value of trait $i$, $\mathbf{b}$ is the vector of partial regression coefficients, and $\mathbf{X}$ is the vector of phenotypic values. The coefficients are obtained from

$$\mathbf{b} = \mathbf{P}^{-1}\mathbf{A}\mathbf{w},$$

where $\mathbf{P}$ is the phenotypic variance-covariance matrix of the traits, $\mathbf{A}$ is the breeding value variance-covariance matrix of the traits, and $\mathbf{w}$ is the vector of economic weights. The problem is to find the best values for the index coefficients $\mathbf{b}$, given the economic weights. This is done by finding the values that give the maximum correlation between the index and the aggregate breeding value. In establishing the selection index formula, trait heritabilities, phenotypic variances, economic weights, phenotypic correlations, genetic correlations, and other parameters are first obtained; then the phenotypic variance-covariance matrix and the breeding value variance-covariance matrix are established, and each partial regression coefficient is solved for according to the formula; finally, each trait phenotypic value $x_i$, or its deviation from the mean, is substituted into the formula to calculate the index value of the individual.
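The calculation of the index coefficients can be sketched numerically. The two-trait matrices, economic weights, and candidate record below are hypothetical values chosen only to show the arithmetic; the accuracy expression sqrt(b'Pb / w'Aw) is the standard correlation between the index and the aggregate breeding value and is added here as an assumption, since the text does not state it.

```python
import numpy as np

def index_coefficients(P, A, w):
    """Solve P b = A w, i.e. b = P^-1 A w, without forming the inverse."""
    return np.linalg.solve(P, A @ w)

# Hypothetical two-trait example.
P = np.array([[4.0, 1.0],
              [1.0, 9.0]])      # phenotypic variance-covariance matrix
A = np.array([[1.0, 0.5],
              [0.5, 3.0]])      # breeding value variance-covariance matrix
w = np.array([2.0, 1.0])        # economic weights

b = index_coefficients(P, A, w)

# Index value for one candidate, traits expressed as deviations from the mean.
x = np.array([1.5, -0.5])
index_value = b @ x

# Correlation between the index and the aggregate breeding value.
accuracy = np.sqrt((b @ P @ b) / (w @ A @ w))

print("b =", b)
print("I =", index_value)
print("r_IH =", accuracy)
```

Solving the linear system directly, rather than inverting P, is the usual numerical choice and gives the same coefficients.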

Application of Selection Index in Aquaculture

The selection index method has been widely applied in breeding programs, including those for aquaculture species. Friars et al. (1990) established selection index equations for Atlantic salmon at two stages:

$$I_{mkt} = 1.00P_1 + 0.03P_2 + 0.74P_3 + 0.38P_4$$

$$I_{mat} = 1.00P_1 + 0.05P_2 + 6.90P_3 + 0.50P_5 + 3.10P_6$$

The first equation is the selection index of Atlantic salmon when it can be sold in the market, and the second is the selection index when it reaches sexual maturity. In the equations, $P_1$ is the average body length of young salmon in the family, $P_2$ is the survival rate of young salmon, $P_3$ is the average body length at the market size, $P_4$ is the body length when it is sold, $P_5$ is the average body length of individuals, and $P_6$ is the average body length of those in the family at sexual maturity. Index equation 1 is used for preselection to eliminate certain individuals, and index equation 2 is then used for final selection. Genetic gains were evident after selection using these index formulas. After one generation of index selection, among the six traits, excluding the survival rate of young salmon, the genetic gain in the body length of young salmon was 0.78 cm, and the gain in weight at market size was 0.33 kg.

The index formula was improved by O’Flynn et al. (1999). The first-generation selection index equation is

$$I_{G1} = 1.00P_1 + 0.5P_2 + 6.9P_3 + 3.1P_4 + 0.5P_5,$$

where $P_1$ to $P_5$, respectively, indicate the average body length of individuals in the family, the survival rate of young salmon, the average harvest weight in the family, the average body length of individuals at sexual maturity, and the average body length in the family at sexual maturity. The second-generation selection index equation is

$$I_{G2} = 7.5P_1 + 9.2P_2 + 8.9P_3 + 1.2P_4 + 8.1P_5,$$

where $P_1$ to $P_5$, respectively, indicate the survival rate of young salmon in the family, the percentage of individual nonmigratory salmon in the family, the average harvest body length in the family, the average harvest body length of individuals and the family, and the survival rate against bacterial kidney disease (BKD).

Index selection was conducted for Atlantic salmon in Canada for two generations. In comparison with the control, the individual harvest body weight was increased by 25.07% after the selection, an actual increase of 0.88 kg. Similarly, body weight was improved by 77% in Nile tilapia through seven generations of population selection, family selection, and index selection (reviewed by Dunham et al., 2001). Compared with other multiple-trait selection methods such as tandem selection and independent culling, the selection index method is more efficient. Therefore, it has been the most common method for estimating aggregate breeding values in the last few decades.
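The two-stage use of the market-size and maturity indexes of Friars et al. (1990) can be sketched as follows. The coefficients are those quoted above; the candidate identifiers, the trait values (treated here as standardized deviations), and the numbers retained at each stage are hypothetical.

```python
# Coefficients of the two indexes as quoted in the text.
I_MKT = {"P1": 1.00, "P2": 0.03, "P3": 0.74, "P4": 0.38}
I_MAT = {"P1": 1.00, "P2": 0.05, "P3": 6.90, "P5": 0.50, "P6": 3.10}

def index_value(coefficients, traits):
    """Weighted sum of trait values for one candidate."""
    return sum(coef * traits[name] for name, coef in coefficients.items())

def rank_by_index(coefficients, records, keep):
    """Return the identifiers of the `keep` candidates with the highest index."""
    ranked = sorted(records,
                    key=lambda cid: index_value(coefficients, records[cid]),
                    reverse=True)
    return ranked[:keep]

# Hypothetical trait records (standardized units), for illustration only.
records = {
    "c1": {"P1": 1.0, "P2": 0.0, "P3": 1.0, "P4": 1.0, "P5": 0.0, "P6": 0.0},
    "c2": {"P1": 0.5, "P2": 1.0, "P3": 0.0, "P4": 0.0, "P5": 1.0, "P6": 1.0},
    "c3": {"P1": -1.0, "P2": 0.0, "P3": -1.0, "P4": 0.0, "P5": 0.0, "P6": 0.0},
    "c4": {"P1": 0.0, "P2": 0.0, "P3": 0.5, "P4": 2.0, "P5": 1.0, "P6": 2.0},
}

# Stage 1: preselect on the market-size index.
preselected = rank_by_index(I_MKT, records, keep=3)
# Stage 2: final selection among the survivors on the maturity index.
survivors = {cid: records[cid] for cid in preselected}
final = rank_by_index(I_MAT, survivors, keep=2)

print("preselected:", preselected)
print("final:", final)
```

Because the two indexes weight the traits differently, the ranking can change between stages, as it does for the two retained candidates in this example.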

The Defects of the Selection Index Method

For a long time, the main breeding method has been the selection index method based on individual performance records or progeny means. This method is simple and practical, but it has two major defects: (1) it does not effectively correct for environmental or other nongenetic factors; and (2) it cannot take full advantage of the information from all relatives. In actual animal breeding, it is extremely difficult to achieve the theoretically expected results with this method, and sometimes the difference is quite large. That is because the application of the selection index method rests on the following basic premises:

1. The breeding target traits are conspicuous; trait data for genetic evaluation are easily measurable; and there is no strongly negative genetic correlation among target traits.
2. There are no systematic environmental effects in the observed values used for index calculation; there are no fixed genetic differences among candidate individuals; and the population parameters involved are known.

These premises are clearly difficult to meet in practice, which biases the EBV and thereby reduces the effectiveness of selection. Therefore, the establishment of the selection index formula often depends on many years of practical experience of breeders.

Best Linear Unbiased Prediction (BLUP)

To overcome the shortcomings of the selection index method, Henderson (1953) proposed a mixed linear model method in the early 1950s. This method combined the strengths of the selection index method and the least-squares estimation method, and was referred to as BLUP (Henderson, 1953, 1975, 1984). Van Arendonk et al. (1999) classified the BLUP models into the polygenic model and the mixed inheritance model. The former indicates that the genetic effects of quantitative traits are controlled by many minor polygenes, while the latter implies that the quantitative traits are controlled by major and minor genes. The BLUP method has been widely used in genetic evaluation of livestock and poultry, and has attracted great attention in the breeding of aquatic animals in recent years. With aquaculture species, breeding value estimations have been mostly conducted with the polygenic model.

Basic Principles of BLUP

BLUP is a statistical method that combines linear statistical model methodology with quantitative genetics. There are quite a few models for animal genetic evaluation, but animal model BLUP uses the most information and removes the bias caused by fixed environmental and genetic effects. It is therefore capable of consistent genetic evaluation of individuals from different years, generations, stocks, and age groups, among other factors. Animal models are a family of mixed linear models with different structures whose random genetic effects are mainly the general breeding values of the individuals. The general form of such models is

$$y = Xb + Za + e,$$

where $y$ is the vector of phenotypic values; $b$ is the vector of fixed effects; $a$ is the vector of random genetic effects (breeding values) of the individuals; $X$ and $Z$ are the incidence matrices corresponding to $b$ and $a$, respectively; $e$ is the vector of random residuals; and

$$E\begin{pmatrix} a \\ e \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \mathrm{Var}\begin{pmatrix} a \\ e \end{pmatrix} = \begin{pmatrix} A\sigma_a^2 & 0 \\ 0 & I\sigma_e^2 \end{pmatrix}, \quad \mathrm{Cov}(a, e') = 0, \quad E(y) = Xb.$$

Here $A$ is the numerator relationship matrix of all individuals, $\sigma_a^2$ is the additive genetic variance (the variance of the breeding values), $\sigma_e^2$ is the random residual variance, and $I$ is an identity matrix.

If the term $Za$ is absent, the model becomes the fixed model $y = Xb + e$. If $Xb = 1\mu$, it becomes the random model $y = 1\mu + Za + e$, where $\mu$ is the overall mean and $1$ is a vector whose elements are all 1. The fixed model and the random model can therefore be considered special cases of the mixed model. The mixed-model equations are

$$\begin{bmatrix} X'X & X'Z \\ Z'X & Z'Z + kA^{-1} \end{bmatrix} \begin{bmatrix} \hat{b} \\ \hat{a} \end{bmatrix} = \begin{bmatrix} X'y \\ Z'y \end{bmatrix},$$

where $k = \sigma_e^2 / \sigma_a^2 = \sigma_y^2 (1 - h^2) / \sigma_a^2 = (1 - h^2) / h^2$, $\sigma_y^2$ is the phenotypic variance, and $h^2$ is the heritability.
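As a minimal sketch of how the mixed-model equations above can be assembled and solved, the following Python example uses NumPy on a toy data set. The pedigree, the tank assignment, the body weights, and the heritability value are all hypothetical and chosen only for illustration; the function name `solve_mme` is likewise an assumption, not part of any breeding software.

```python
import numpy as np

def solve_mme(y, X, Z, A, h2):
    """Solve the mixed-model equations for y = Xb + Za + e.

    Returns (b_hat, a_hat): fixed-effect estimates and predicted breeding values.
    """
    k = (1.0 - h2) / h2                      # k = sigma_e^2 / sigma_a^2
    A_inv = np.linalg.inv(A)                 # inverse numerator relationship matrix
    lhs = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + k * A_inv]])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    p = X.shape[1]
    return sol[:p], sol[p:]

# Hypothetical pedigree: animals 1 and 2 are unrelated parents,
# animals 3 and 4 are their full-sib offspring.
A = np.array([[1.0, 0.0, 0.5, 0.5],
              [0.0, 1.0, 0.5, 0.5],
              [0.5, 0.5, 1.0, 0.5],
              [0.5, 0.5, 0.5, 1.0]])
y = np.array([310.0, 295.0, 330.0, 305.0])   # made-up body weights (g)
X = np.array([[1.0, 0.0],                    # fixed effect: two rearing tanks
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])
Z = np.eye(4)                                # one record per animal

b_hat, a_hat = solve_mme(y, X, Z, A, h2=0.3)
print("tank effects:", b_hat)
print("predicted breeding values:", a_hat)
```

Only the ratio $k$ enters the equations, so the solution depends on the heritability rather than on the absolute variances. For realistic pedigrees $A^{-1}$ is normally built directly from the pedigree rather than by inverting $A$; the explicit inverse is used here only to keep the sketch short.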

The characteristic feature of these equations is that the same system can not only estimate the fixed environmental and genetic effects but also predict the random genetic effects. The basic steps of EBV estimation are as follows. Based on genetic knowledge and the actual production conditions, the observed values are expressed as the sum of the effects of the influential genetic and environmental factors, that is, as a linear model. Each effect in the model is then solved by computer according to the principles of linearity, unbiasedness, and minimum prediction error variance ("best"). The key is to establish an appropriate mixed model and the corresponding model equations. With the animal model BLUP method, the fixed effects of each breeding population, each farm, and each genetic group, as well as the breeding value of each animal for each trait, can be estimated. Selection and mating are then carried out according to the ranking of breeding values, and the breeding values are also used to calculate genetic progress and to analyze the genetic trend. Compared with the selection index method, the BLUP method requires a larger number of equations to be solved because it includes the relationships among all the animals to be evaluated. The BLUP method uses data from ancestors, full sibs, half sibs, and other relatives to estimate individual breeding values, which makes up for the defects of the selection index method. Henderson (1973) pointed out that when the fixed effects and the covariance matrix are known, the predicted values for each trait of a single animal can be obtained from the mixed-model equations, and the economic weights can then be applied to obtain the selection criterion. For the case in which the fixed effects and covariance matrix are unknown, Gianola (1986) proposed estimating the variance matrix with restricted maximum likelihood (REML) and substituting the estimate into the mixed-model equations.
It is clear that the BLUP method combines the strengths of the selection index method and the least-squares method and provides an effective way to estimate breeding values.
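The final step described above, in which per-trait predictions are combined with economic weights to give a single selection criterion, can be sketched as a weighted sum. The EBVs, the two traits, and the economic weights below are hypothetical numbers used only to show the arithmetic.

```python
import numpy as np

def aggregate_index(ebv, weights):
    """Combine per-trait EBVs (rows = animals, columns = traits) into one index."""
    return np.asarray(ebv) @ np.asarray(weights)

# Hypothetical EBVs for three candidates: body weight (g) and survival (proportion)
ebv = np.array([[12.0,  0.02],
                [ 8.0,  0.07],
                [15.0, -0.03]])
economic_weights = np.array([1.0, 100.0])    # assumed value per unit of each trait

index = aggregate_index(ebv, economic_weights)
ranking = np.argsort(-index)                 # candidate positions, best first
print(index, ranking)
```

With these assumed weights, the candidate with the highest body weight EBV ranks last because its survival EBV is negative, which illustrates why the weights have to reflect the actual breeding objective.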

Main Features of Animal Model BLUP

Compared with traditional breeding methods, BLUP has the following advantages:

1. It makes full use of information from relatives, with the relationships among all animals reflected in the numerator relationship matrix. The EBV of each individual is influenced by the records of its relatives and can be predicted from them when the individual has no record of its own.
2. It eliminates bias caused by environmental conditions. In theory, BLUP results are more accurate than those from progeny testing or from individual selection on phenotypic values.
3. It can adjust for bias due to nonrandom mating, such as selective mating. Therefore, it can be used in populations under selection and inbreeding. This contrasts favorably with several other methods that require random mating, which is difficult to achieve in practice. When individuals have multiple records, BLUP can minimize the bias resulting from culling.
4. It accounts for genetic differences among populations and generations and allows joint genetic evaluation of different populations (provided that there are genetic links between them). Since environmental effects are eliminated, the EBVs are more comparable.
5. It is more efficient in estimating breeding values. Studies have shown that, compared with index selection, BLUP significantly speeds up genetic progress, especially for low-heritability and sex-limited traits. Belonsky and Kennedy (1988) showed that for single-trait selection in a closed population of pigs, when trait heritability was 0.10 and 0.60, the efficiency of traditional index selection was only 64% and 91%, respectively, of that of animal model BLUP.
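The numerator relationship matrix mentioned in the first point can be built from a pedigree with the standard tabular method. The sketch below assumes animals are numbered from 0 and listed so that parents always precede their offspring, with `None` marking an unknown parent; the five-animal pedigree is hypothetical.

```python
import numpy as np

def numerator_relationship_matrix(sires, dams):
    """Tabular method; animal i has parents sires[i] and dams[i] (None if unknown)."""
    n = len(sires)
    A = np.zeros((n, n))
    for i in range(n):
        s, d = sires[i], dams[i]
        # Relationship with each earlier animal j: average of j's
        # relationships with the two parents of i
        for j in range(i):
            a_js = A[j, s] if s is not None else 0.0
            a_jd = A[j, d] if d is not None else 0.0
            A[i, j] = A[j, i] = 0.5 * (a_js + a_jd)
        # Diagonal: 1 + inbreeding coefficient (half the parents' relationship)
        f_i = 0.5 * A[s, d] if (s is not None and d is not None) else 0.0
        A[i, i] = 1.0 + f_i
    return A

# Hypothetical pedigree: 0 and 1 are founders, 2 and 3 are full sibs (0 x 1),
# and 4 is the offspring of a full-sib mating (2 x 3).
sires = [None, None, 0, 0, 2]
dams  = [None, None, 1, 1, 3]
A = numerator_relationship_matrix(sires, dams)
print(A)
```

In this example the full sibs have a relationship of 0.5, and the offspring of the full-sib mating has a diagonal element of 1.25, that is, an inbreeding coefficient of 0.25.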

Analysis Methods of Mixed Linear Models

In analysis of variance (ANOVA), for a specific set of genetic populations (specific mating combinations), the total genetic variance can be decomposed into additive, dominance, and other genetic variance components; the genotype × environment interaction variance and the covariances can be decomposed into components in the same way. Although the ANOVA method is simple, it cannot analyze unbalanced data. Henderson (1953) proposed three ANOVA-type methods for variance component estimation that are suitable for some simple experimental designs and for unbalanced data. However, for some special linear models, the variance components estimated with these methods may be biased. A series of analysis methods for mixed linear models developed since the 1970s, including maximum likelihood (ML), REML, and minimum norm quadratic unbiased estimation (MINQUE), can overcome the limitations of Henderson's methods.

The linear model can be expressed in matrix form (Searle, 1968) as

$$y = Xb + e.$$

The effects in the model can usually be divided into fixed effects ($b$ is the fixed effect vector) and random effects ($e$ is the random effect vector). If $e$ has an independent multivariate normal distribution, $e \sim MVN(0, \sigma_e^2 I)$, then the dependent variable is also an independent normal random vector, $y \sim MVN(Xb, \sigma_e^2 I)$. In specific applications, factors other than the residual may also be random. If $r$ factors are independent random effects in the experimental design, there are a total of $(r + 1)$ random effects including the residual. The random effect vector $e$ can then be decomposed into $(r + 1)$ terms:

$$y = Xb + e = Xb + U_1 e_1 + U_2 e_2 + \cdots + U_r e_r + e_{r+1} = Xb + \sum_{u=1}^{r} U_u e_u + e_{r+1},$$

where $U_u$ is the coefficient matrix of the $u$th random factor $e_u$, with $e_u \sim MVN(0, \sigma_u^2 I)$ and $e_{r+1} \sim MVN(0, \sigma_{r+1}^2 I)$.
Because all the random effects are independent,

$$\mathrm{Cov}(e_u, e_{u'}^T) = 0 \ (u \neq u'), \quad \mathrm{Cov}(e_u, e_{r+1}^T) = 0.$$

The expected value and variance of the dependent variable are as follows:

$$E(y) = E(Xb) + E(e) = XE(b) + E\Big(\sum_u U_u e_u + e_{r+1}\Big) = Xb + \sum_u U_u E(e_u) + E(e_{r+1}) = Xb,$$

$$\mathrm{Var}(y) = \mathrm{Var}(Xb) + \mathrm{Var}(e) = X\,\mathrm{Var}(b)\,X^T + \mathrm{Var}\Big(\sum_u U_u e_u + e_{r+1}\Big) = \sum_u U_u \mathrm{Var}(e_u) U_u^T + \mathrm{Var}(e_{r+1}) = \sum_u \sigma_u^2 U_u U_u^T + \sigma_{r+1}^2 I,$$

where $X\,\mathrm{Var}(b)\,X^T = 0$ because $b$ is fixed.

In this mixed linear model, the dependent variable is the normal random vector

$$y \sim MVN\Big(Xb, \ V = \sum_u \sigma_u^2 U_u U_u^T + \sigma_{r+1}^2 I\Big).$$

Analysis of the mixed linear model proceeds by first estimating each variance component $\sigma_u^2$ and then the fixed effects $b$. For balanced data, if the coefficient matrices $U_u$ consist only of 0s and 1s, the variance components can be estimated with the ANOVA method. For unbalanced data, however, the $U_u$ may contain fractional coefficients and some random factors may not be independent of each other in complex mixed linear models, so such models cannot be analyzed by the ANOVA method. Hartley and Rao (1967) first proposed analyzing unbalanced data in the mixed linear model with ML. The ML estimates are affected by the fixed effects $b$, which may lead to seriously biased estimates. To overcome this shortcoming, Patterson and Thompson (1971) proposed the REML method, in which the likelihood does not involve the fixed effects. Subsequently, Rao (1972) proposed the more convenient MINQUE method, which is based on minimizing a Euclidean norm. In this method, the estimate of each variance component depends on a priori values chosen by the analyst, and as long as the a priori values $[\alpha_u]$ do not depend on the experimental data, the MINQUE estimator is unbiased. The a priori values $[\alpha_u]$ can be selected based on either experience or the results of past analyses. The simplest way is to set the a priori value of the residual effect to 1 ($\alpha_{r+1} = 1$) and all other a priori values to 0 ($\alpha_u = 0$, $u = 1, 2, \ldots, r$). This is called the MINQUE(0) method, and the resulting estimators of the variance components are MINQUE(0) estimators. The mixed linear model is easy to estimate with MINQUE(0), which eliminates the need to invert a large matrix. However, the sampling variance of variance components estimated with MINQUE(0) tends to be larger.
Another simple and easily calculated choice is to set all a priori values to 1 ($\alpha_u = 1$, $u = 1, 2, \ldots, r + 1$). This is called the MINQUE(1) method, and the resulting estimators of the variance components are MINQUE(1) estimators. When all a priori values are set to the parameter values ($\alpha_u = \sigma_u^2$), the method is called MINQUE(θ); it is an unbiased estimation method with minimum variance (Rao, 1972) and is therefore the best of the analysis methods for mixed linear models introduced here. In addition, the MINQUE method does not require iterative calculations and does not require the linear model to be normally distributed.

Using Monte Carlo simulation, Zhu (1989) compared five analytical methods (HEND3, ML, REML, MINQUE(1), and MINQUE(θ)) in terms of their power to estimate genetic variance components. The comparison showed that REML, MINQUE(1), and MINQUE(θ) had the same analytical power, and that HEND3 gave the same estimates of additive and dominance variance as these three methods. However, for the bio-model of Cockerham and Weir (1977), the ANOVA method could not produce unbiased estimates of the maternal and paternal variance components, and neither could the Henderson method, because it imitates the ANOVA method. In the Monte Carlo comparison of nested or factorial designs, by contrast, HEND3 gave unbiased estimates of each variance component, as did REML, MINQUE(1), and MINQUE(θ). For unbalanced data, ML estimation requires that the estimates of all variance components be nonzero, so it is not an effective method of variance component estimation. The power of REML, MINQUE(1), and MINQUE(θ) for estimating variance components from unbalanced data is not always the same. If the variance components are nonzero, these three methods give unbiased estimates and have similar analytical power. However, if some variance components (such as the maternal and paternal variances) are zero, MINQUE(1) and MINQUE(θ) give estimates close to the true parameter values, but REML always overestimates them because of its constraint that the estimates cannot be negative.

When covariance components between pairs of traits are estimated, no such constraint is needed in the REML method; therefore, like MINQUE(1) and MINQUE(θ), REML gives unbiased estimates of covariance components. Another advantage of MINQUE over REML is that it does not require iteration. Therefore, in the analysis of the mixed model, MINQUE can be used to estimate both variance and covariance components.
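The MINQUE procedure described in this section can be illustrated with a short NumPy sketch. It solves the usual MINQUE normal equations, $\sum_j \mathrm{tr}(Q V_i Q V_j)\,\sigma_j^2 = y^T Q V_i Q y$ with $V_u = U_u U_u^T$, for a chosen set of a priori values. The function name `minque`, the simulated one-way data set, and the seed are assumptions made for the example; this is not a reproduction of any published program.

```python
import numpy as np

def minque(y, X, U_list, alpha):
    """MINQUE estimates of the variance components.

    U_list holds the coefficient matrices U_1..U_{r+1} (the last one is the
    identity matrix for the residual); alpha holds the a priori values.
    """
    V_parts = [U @ U.T for U in U_list]
    V_alpha = sum(a * Vp for a, Vp in zip(alpha, V_parts))
    V_inv = np.linalg.inv(V_alpha)
    VX = V_inv @ X
    Q = V_inv - VX @ np.linalg.pinv(X.T @ VX) @ VX.T
    QV = [Q @ Vp for Vp in V_parts]
    m = len(V_parts)
    F = np.array([[np.trace(QV[i] @ QV[j]) for j in range(m)] for i in range(m)])
    Qy = Q @ y
    q = np.array([Qy @ Vp @ Qy for Vp in V_parts])
    return np.linalg.solve(F, q)          # no iteration required

# Simulated balanced one-way layout (hypothetical): 6 families, 4 records each
rng = np.random.default_rng(0)
n_groups, n_per = 6, 4
U1 = np.kron(np.eye(n_groups), np.ones((n_per, 1)))
N = n_groups * n_per
y = 10.0 + U1 @ rng.normal(0.0, 1.5, n_groups) + rng.normal(0.0, 1.0, N)
X = np.ones((N, 1))

est_minque1 = minque(y, X, [U1, np.eye(N)], alpha=[1.0, 1.0])   # MINQUE(1)
est_minque0 = minque(y, X, [U1, np.eye(N)], alpha=[0.0, 1.0])   # MINQUE(0)
print(est_minque1, est_minque0)
```

For balanced data such as this example, both choices of a priori values reproduce the ANOVA estimates; the choice matters for unbalanced data, where MINQUE(0) avoids inverting a large matrix but tends to have larger sampling variance, as noted in the text.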

Analysis of Gene Effect

If regression analysis is used to estimate gene effects, the regression coefficients must be fixed effects. The genetic effects must therefore be defined as fixed effects for gene effect estimation; that is, the analyzed genetic population is treated as a set of specific materials. The genetic conclusions obtained from the analysis are then applicable only to this population, and the corresponding genetic variance components cannot be estimated at the same time as the genetic effects. In the genetic analysis of quantitative traits, the genetic material is normally a random sample from some genetic population, and the main interest is the genetic variability of that population. The variability of gene effects can be inferred from the estimates of the genetic variance components. In some experiments, the researcher not only needs to estimate variance components but also hopes to infer the values of some gene effects in the genetic model. If the necessary constraints are not placed on the environmental and gene effects, the fixed effects are generally inestimable. Although the regression parameters themselves cannot be estimated, unbiased estimates of certain functions of them can be obtained. For the mixed linear model

$$y = Xb + \sum_{u=1}^{m-1} U_u e_u + e_m \sim MVN\Big(Xb, \ V = \sum_{u=1}^{m-1} \sigma_u^2 U_u U_u^T + \sigma_m^2 I\Big),$$

the $u$th random factor follows a multivariate normal distribution, $e_u \sim MVN(0, \sigma_u^2 I)$. Henderson (1963) showed that the BLUP of a random effect can be obtained if the variance component $\sigma_u^2$ of the random factor is known. When actual experimental data are analyzed with the mixed linear model, however, $\sigma_u^2$ has to be estimated, so it is practically impossible to obtain the true BLUP of the random effects. In practice, it is common to replace the variance component $\sigma_u^2$ with its estimate and to predict the random effect with the following formula:

$$\hat{e}_u = \hat{\sigma}_u^2 U_u^T \hat{Q} y,$$

where

$$\hat{Q} = \hat{V}^{-1} - \hat{V}^{-1} X \big(X^T \hat{V}^{-1} X\big)^{+} X^T \hat{V}^{-1}, \qquad \hat{V} = \sum_{u=1}^{m-1} \hat{\sigma}_u^2 U_u U_u^T + \hat{\sigma}_m^2 I,$$

the superscript $+$ denotes a generalized inverse, and $\hat{V}^{-1}$ is the inverse of $\hat{V}$, which is assumed to exist.

The predicted value is then no longer the true BLUP, but the so-called BLUP predicted value. Since it is not a linear function of the dependent variable, it cannot be guaranteed to be unbiased. Zhu (1992) proposed using the relatively simple and convenient MINQUE method to predict the random effects. The predicted value is a linear function of the observation vector $y$ and an unbiased predictor of the random effect, provided that the a priori values are chosen independently of the analyzed data. This method is therefore known as linear unbiased prediction (LUP). Although unbiased predictions of the random effects can be obtained with LUP, their variation is often too small. Zhu (1993) therefore adopted the adjusted unbiased prediction (AUP) method. The AUP predicted value of the $u$th vector of effects is

$$\hat{e}_u = \kappa_u \big(\alpha_u U_u^T Q_\alpha y\big),$$

where $\alpha_u U_u^T Q_\alpha y$ is the LUP of $e_u$ and the adjustment coefficient is

$$\kappa_u = \sqrt{\frac{(n_u - 1)\,\hat{\sigma}_u^2}{\alpha_u^2\, y^T Q_\alpha U_u U_u^T Q_\alpha y}}.$$

Here $n_u$ is the number of columns of the coefficient matrix $U_u$, and $\hat{\sigma}_u^2$ is the estimate of the $u$th variance component; when $\hat{\sigma}_u^2 < 0$, it is set to $\hat{\sigma}_u^2 = 0$.
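The LUP and AUP predictors can be sketched in a few lines of NumPy. This example assumes that the adjustment coefficient is the square root of the ratio between $(n_u - 1)\hat{\sigma}_u^2$ and the sum of squares of the LUP values, so that the adjusted predictions have a sample variance equal to $\hat{\sigma}_u^2$. The data, the group sizes, and the value passed as `sigma2_hat_u` are hypothetical; in a real analysis the variance estimate would come from a MINQUE step.

```python
import numpy as np

def lup_and_aup(y, X, U_list, alpha, u, sigma2_hat_u):
    """Return (LUP, AUP) predictions for the u-th random factor."""
    V_alpha = sum(a * (U @ U.T) for a, U in zip(alpha, U_list))
    V_inv = np.linalg.inv(V_alpha)
    VX = V_inv @ X
    Q = V_inv - VX @ np.linalg.pinv(X.T @ VX) @ VX.T
    U = U_list[u]
    lup = alpha[u] * (U.T @ Q @ y)                 # linear unbiased prediction
    sigma2 = max(sigma2_hat_u, 0.0)                # negative estimates are set to 0
    n_u = U.shape[1]
    # Assumes the LUP values are not all zero (otherwise kappa is undefined)
    kappa = np.sqrt((n_u - 1) * sigma2 / (lup @ lup))
    return lup, kappa * lup

# Hypothetical unbalanced data: 4 families with 3, 5, 2, and 4 records
rng = np.random.default_rng(2)
labels = np.repeat(np.arange(4), [3, 5, 2, 4])
U1 = (labels[:, None] == np.arange(4)).astype(float)
N = len(labels)
y = 50.0 + U1 @ rng.normal(0.0, 1.4, 4) + rng.normal(0.0, 1.0, N)
X = np.ones((N, 1))

lup, aup = lup_and_aup(y, X, [U1, np.eye(N)], alpha=[1.0, 1.0],
                       u=0, sigma2_hat_u=2.0)      # 2.0 is an assumed estimate
print(lup, aup)
```

The adjustment only rescales the LUP values, so the ranking of the families is unchanged; what changes is the spread of the predictions.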

In practice, since the variance parameters are unknown, only the so-called BLUP values, rather than true BLUP values, can be obtained. A great number of iterative operations are needed to calculate these BLUP values, whereas the AUP values can be obtained with far fewer calculations once the genetic variance components have been estimated. The AUP prediction is unbiased and efficient, although it is not linear for the gene effect. By comparison, the AUP method is better suited to predicting the random effects of the mixed linear model. The estimation of genetic parameters is, statistically, the estimation of variance components. The classic statistical method is ANOVA, which is applicable only to balanced data or random models, but almost all data used in aquatic animal breeding are unbalanced and often need to be described with a mixed model. Nowadays, REML is the most commonly used method for estimating variance components and is generally accepted worldwide. However, the REML method involves complicated calculations. Graser et al. (1987) proposed the derivative-free restricted maximum likelihood (DFREML) algorithm, which locates the maximum of the likelihood function by varying the parameter values over the parameter space without computing derivatives. This relatively simple algorithm made the wide use of REML in animal breeding possible.
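The idea behind a derivative-free REML search can be sketched for the simplest case, a model with one random factor plus the residual. The example below is an illustration under stated assumptions, not the DFREML program itself: the residual variance is profiled out of the REML log-likelihood, and the variance ratio $\lambda = \sigma_a^2 / \sigma_e^2$ is located with a golden-section search that uses only function values. The simulated data, the search interval, and all function names are hypothetical.

```python
import numpy as np

def profile_reml_loglik(lam, y, X, Z):
    """Profile REML log-likelihood (up to a constant) for V = sigma_e^2 (I + lam ZZ')."""
    n, p = X.shape
    H = np.eye(n) + lam * (Z @ Z.T)
    H_inv = np.linalg.inv(H)
    XtHiX = X.T @ H_inv @ X
    P = H_inv - H_inv @ X @ np.linalg.solve(XtHiX, X.T @ H_inv)
    sigma2_e = (y @ P @ y) / (n - p)               # residual variance given lam
    _, logdet_H = np.linalg.slogdet(H)
    _, logdet_X = np.linalg.slogdet(XtHiX)
    loglik = -0.5 * (logdet_H + logdet_X + (n - p) * np.log(sigma2_e))
    return loglik, sigma2_e

def golden_section_max(f, lo, hi, tol=1e-6):
    """Maximize a unimodal function on [lo, hi] without derivatives."""
    g = (np.sqrt(5.0) - 1.0) / 2.0
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc > fd:                                # maximum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - g * (hi - lo)
            fc = f(c)
        else:                                      # maximum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + g * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)

# Simulated balanced data (hypothetical): 30 families, 5 records each
rng = np.random.default_rng(1)
n_groups, n_per = 30, 5
Z = np.kron(np.eye(n_groups), np.ones((n_per, 1)))
N = n_groups * n_per
y = 10.0 + Z @ rng.normal(0.0, np.sqrt(2.0), n_groups) + rng.normal(0.0, 1.0, N)
X = np.ones((N, 1))

lam_hat = golden_section_max(lambda lam: profile_reml_loglik(lam, y, X, Z)[0],
                             0.0, 50.0)
sigma2_e_hat = profile_reml_loglik(lam_hat, y, X, Z)[1]
sigma2_a_hat = lam_hat * sigma2_e_hat
print(sigma2_a_hat, sigma2_e_hat)
```

With several variance components the one-dimensional search has to be replaced by a multidimensional derivative-free method, and the dense matrix inverse used here would be far too costly for large pedigrees; the sketch only shows why no derivatives of the likelihood are needed.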

Problems Related to Estimation of Breeding Values with BLUP

The key to breeding value estimation with BLUP is to establish a proper mixed linear model and its system of equations. The model should reflect the actual situation as closely as possible. A reasonable model is designed by jointly considering the environmental and genetic factors, the practicality of the breeding operation, the traits of the target animals, and the structure of the available data. The accuracy of the estimation depends on how well the model fits the situation.

Data structure and genetic parameters have a significant impact on the accuracy of estimation when the animal model is used to estimate breeding values. First, different information sources directly influence the accuracy of breeding values. When each information source provides only a single record, the individual's own record is superior to other information. Second, the quantity of data has an important effect: the accuracy of breeding value estimation from any information source increases significantly with the quantity of data. Third, the higher the heritability of the trait, the higher the accuracy of breeding value estimation. More reliable estimates of heritability and of the environmental effects also increase the accuracy of breeding value estimation. Repeatability can be used to check the heritability estimate and the estimated breeding values of the stock. When repeated records are used to increase accuracy, the resulting longer generation interval should also be considered. Many measures that increase the accuracy of breeding value estimation do so at the expense of longer generation intervals, and the trade-off between the two must be fully considered in the breeding plan. Genetic parameters estimated from the population's own data are generally more accurate and reliable than borrowed values, but the borrowed ones may be more accurate if the population providing the data is too small and the estimation errors are large.

The base population and the additive relationship matrix must be correctly defined when the animal model is used to predict breeding values with BLUP. In practice, individuals with unknown parents are often treated as the base population, and it is assumed that they are sampled from a single population with a mean breeding value of zero and a common variance. This assumption often leads to errors, which deserves particular attention in aquatic animal breeding.

While providing accurate breeding values, BLUP can raise the rate of inbreeding in the population, because the selected individuals with similar performance usually originate from the same family. Wray and Hill (1989) carried out successive single-trait selection in a population of 26 boars and 156 sows, renewing the sample population once every year. They found that the annual inbreeding rate was 3%, 5.33 times higher than that of selection based only on the sow record. Gallardo et al. (2004) studied the selective breeding of two populations of coho (silver) salmon for four generations and obtained average inbreeding increments per generation of 2.45% and 1.10%, respectively. The first population showed a far greater inbreeding increment because it had fewer founders than the second. Inbreeding was also observed in the selective breeding of tilapia (Gall and Bakar, 2002). After three generations of selective breeding, the average inbreeding increments of the two populations were 2.04% and 2.96% per generation, respectively. To reduce the inbreeding rate, exotic wild individuals are usually introduced into the mating scheme. However, such introductions should be avoided in breeding because they reduce the genetic gain.
There are other ways to reduce the inbreeding rate, such as increasing the size of the base population, lowering the selection intensity, and restricting the number of individuals selected from the same family (Gall and Bakar, 2002). Much current BLUP breeding software can, while calculating breeding values, use the pedigree records to formulate more reasonable mating plans that meet the breeding objectives and avoid inbreeding by prohibiting sib mating, thus restricting the inbreeding rate to an acceptable level while achieving the greatest genetic gain (Meuwissen, 1997; Grundy et al., 1998, 2000; Toro and Maki-Tanila, 1999; Meuwissen and Sonesson, 2004).
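One of the measures listed above, restricting the number of individuals selected from the same family, can be sketched as a simple greedy rule: rank the candidates by EBV and skip any candidate whose family has already reached its quota. The function name, the EBVs, and the family labels are hypothetical, and this rule is only a simple stand-in for the optimized selection and mating procedures cited in the text.

```python
def select_with_family_cap(ebv, family, n_select, max_per_family):
    """Pick the top candidates by EBV, taking at most max_per_family per family."""
    order = sorted(range(len(ebv)), key=lambda i: ebv[i], reverse=True)
    chosen, counts = [], {}
    for i in order:
        if counts.get(family[i], 0) < max_per_family:
            chosen.append(i)
            counts[family[i]] = counts.get(family[i], 0) + 1
        if len(chosen) == n_select:
            break
    return chosen

# Hypothetical candidates: the three best EBVs all come from family "A"
ebv = [5.0, 4.8, 4.7, 4.1, 3.9, 3.0]
family = ["A", "A", "A", "B", "B", "C"]

unrestricted = select_with_family_cap(ebv, family, n_select=3, max_per_family=3)
restricted = select_with_family_cap(ebv, family, n_select=3, max_per_family=1)
print(unrestricted, restricted)
```

Without the cap, all three selected animals are sibs from one family; with a cap of one per family the mean EBV of the selected group is lower, which is the loss in genetic gain that is traded for a lower rate of inbreeding.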

Applications of Animal Model BLUP in Aquaculture

As a mainstream approach for predicting animal breeding values, the BLUP method has been used worldwide to assess breeding values of dairy cattle, beef cattle, sheep, horses, and pigs, as well as to estimate their genetic gain, genetic parameters, and combining abilities. It has also been widely used in aquaculture species, although not to the same extent as in livestock species (Gjoen and Gjerde, 1998). Gall and Huang (1988) suggested that phenotypic selection could be replaced by BLUP for aquaculture species, and breeding values for the body weight of 98-day-old tilapia were estimated after three generations of selection (Gall and Bakar, 2002). According to their data, the average genetic gain per generation was 2.61 ± 0.05 g, that is, a 40% improvement after the three generations of selection. This result with BLUP was clearly better than that of mass selection (Hulata et al., 1986; Huang and Liao, 1990). The selection response with BLUP was also about 20% to 30% higher than with mass selection (Hagger, 1991). Ponzoni et al. (2005) increased the body weight of Nile tilapia by 10% per generation on average over six generations of selection. In coho (silver) salmon, Neira et al. (2006) achieved an average genetic gain of 383.2 g per generation, a 13.9% increase in weight per generation, by using the animal model BLUP method. Kause et al. (2005) carried out a comprehensive selective breeding project in rainbow trout with BLUP, based on body weight, sexual maturation ratios, and several other characteristics at different stages of development, and achieved an average genetic gain of 8% per generation. At the Akvaforsk Genetics Center (Norway), the core breeding population of Atlantic salmon has been successfully improved since the 1970s. By using BLUP, seven characteristics of Atlantic salmon have been improved. The growth rate of Atlantic salmon was doubled, making Atlantic salmon culture one of the most successful forms of mariculture in the world. Luan et al. (2008) used BLUP to select and breed Penaeus chinensis for reproduction and resistance to white spot syndrome virus (WSSV), and the body weight of the selected P. chinensis increased by 13.28% in just one generation. However, the genetic progress in weight reported for selected oysters has varied (Toro et al., 1995; Ward et al., 2000). In my laboratory, selection of Chlamys farreri using BLUP gave genetic gains of 68.4% and 32.3% for growth rate and survival rate, respectively (unpublished). In addition to the species discussed above, rapid progress is being made with many other aquaculture species such as mussels, scallops, Artemia, and shrimp.
BLUP is highly recommended for breeding programs of these species, and especially for newly domesticated wild species, because the BLUP method can overcome problems encountered in conventional breeding such as self-fertilization, intrafamily competition, cannibalism, individual marking, and mating preference.

Currently, genetic progress in most fish and shellfish species comes from traditional selective breeding (Hulata, 2001). Most aquaculture species are reared in monoculture or polyculture, and their growth rate is improved by mass selection. The current situation of aquaculture breeding shows that the development and interpenetration of several fields, such as genetics, statistics, computational mathematics, and molecular biotechnology, have promoted the development of methods for breeding value estimation. Since interdisciplinary exchange can expand the available information sources, enlarge the amount of usable data, and improve the methods of breeding value calculation, improving the accuracy of breeding value estimation should be the focus in the future. Quite a few aquaculture selective breeding projects have adopted mass selection (Nell et al., 1999; Rezk et al., 2003), intrafamily and interfamily selection (Hershberger et al., 1990; Dunham et al., 2001), and other selection methods based on phenotypic values. When high-heritability traits are selected with BLUP in a small population, the additional selection effect is limited and the strengths of BLUP cannot be brought into full play. The advantages of BLUP can be fully utilized only in large-scale genetic evaluations involving complex genetic structures across different populations and years. With the development of aquaculture and the increasing scale of breeding, the BLUP method has great potential and is expected to become a rapid and effective method of estimating breeding values for aquaculture species.

MAS

Conventional breeding selection is mainly based on two types of data: pedigree information and phenotypic information. By combining these data, the EBV of an animal is predicted with BLUP. The genetic gain in quantitative traits of most animals has been achieved through selective breeding based on the EBV, which is calculated from phenotypic values, while the number of genes affecting these traits and the sizes of their effects remain unknown. In effect, the genetic architecture of the trait is treated as a "black box." It is hoped that the genetic basis of the trait can be dissected so that favorable genes and DNA segments can be selected directly. With the development of genomics, molecular data can now be collected for all individuals without phenotypic information, especially in the early stages of life, which shortens the generation interval and speeds up the breeding process. In recent years, through the application of molecular marker technology and computer technology, the linkage information between markers and quantitative trait loci (QTL) has been combined with modern animal breeding systems, resulting in marker-assisted BLUP (MBLUP), which provides more accurate estimates of breeding values. For instance, Fernando and Grossman (1989) applied MBLUP to increase the amount of useful information and make the EBV more accurate. Adding marker information to the conventional BLUP model, that is, estimating the breeding value of an individual from phenotype, pedigree, and DNA marker information, is simply an extension of conventional BLUP. Such methods, which incorporate marker-QTL information into conventional BLUP, have been collectively referred to as MAS.

Main Steps of MAS

MAS in aquatic animal breeding depends on progress in the following five areas: (1) molecular marker development and genotyping; (2) genetic linkage and QTL mapping; (3) construction of reference populations and phenotyping databases; (4) genetic evaluation, that is, integrating the phenotypic and genotypic data with statistical methods and estimating the breeding value of every individual in the selection population; and (5) application of MAS, that is, establishing the selection and mating programs with the molecular genetic information. The development of molecular markers, genotyping, and genetic linkage and QTL mapping were well documented in a recent book, Aquaculture Genome Technologies (Liu, 2007). As summarized in Table 12.1, good progress has been made in linkage mapping of aquaculture species. Instead of going through each step of MAS, I will concentrate on some relevant aspects of MAS in comparison with conventional and whole genome-based selection.

Table 12.1 Genetic linkage maps constructed in aquaculture species.

Species | Marker type | Average resolution (cM) | Reference
Atlantic salmon (Salmo salar) | AFLP, SSR | 2.52 | Moen et al. (2004)
Atlantic salmon | SSR | Unknown | Gilbey et al. (2004)
Atlantic salmon | SNP, SSR | 5 | Moen et al. (2008)
Rainbow trout (Oncorhynchus mykiss) | SSR, AFLP, allozymes, VNTR, known genes, minisatellites, RAPD, SINE | 1.8 | Nichols et al. (2003)
Rainbow trout | SSR, AFLP | 8.5 | Zimmerman et al. (2005)
Medaka (Oryzias latipes) | SSLP | 5.4 | Kimura et al. (2005)
Sea bass (Dicentrarchus labrax) | SSR | 5.0 | Chistiakov et al. (2005)
Tilapia (Oreochromis spp.) | SSR, genes | 2.4 | Lee et al. (2005)
Common carp (Cyprinus carpio) | SSR, known genes, RAPD | 18.8 | Sun and Liang (2004)
Common carp | AFLP, SSR | 7.6 | Cheng et al. (2010)
Silver carp (Hypophthalmichthys molitrix) | SSR, AFLP | Unknown | Liao et al. (2007)
Bighead carp (Aristichthys nobilis) | SSR, AFLP | Unknown | Liao et al. (2007)
Grass carp (Ctenopharyngodon idella) | SSR, SNP | 4.2 | Xia et al. (2010)
Japanese flounder (Paralichthys olivaceus) | SSR, AFLP | 7.3 | Coimbra et al. (2003)
Channel catfish (Ictalurus punctatus) | SSR | 8.7 | Waldbieser et al. (2001)
Channel catfish | Type I-SSR, SNP | 5.5 | Kucuktas et al. (2009)
Gilthead sea bream (Sparus aurata L.) | SSR | 9.7 | Franch et al. (2006)
Brown trout (Salmo trutta) | SSR, allozymes | Unknown | Gharbi et al. (2006)
Cichlid fish (Astatotilapia burtoni) | SSR | 10.7 | Sanetra et al. (2009)
Loach (Misgurnus anguillicaudatus) | SSR | 5.3 | Morishima et al. (2008)
Atlantic halibut (Hippoglossus hippoglossus) | SSR | 3.9 | Reid et al. (2007)
Barramundi (Lates calcarifer) | SSR | 5.4 | Wang C.M. et al. (2007)
Yellowtails (Seriola quinqueradiata and Seriola lalandi) | SSR | 2.7 | Ohara et al. (2004)
Patagonian pejerrey (Odontesthes hatcheri) | AFLP | Unknown | Koshimizu et al. (2010)
Half-smooth tongue sole (Cynoglossus semilaevis) | AFLP, SSR | 8.4 | Liao et al. (2009)
Zhikong scallop (Chlamys farreri) | AFLP | 20.5 | Wang et al. (2004)
Zhikong scallop | AFLP | 15.3 | Wang et al. (2005)
Zhikong scallop | AFLP | 9.7 | Li et al. (2005)
Zhikong scallop | SSR | 12.3 | Zhan et al. (2009)
Japanese scallop (Patinopecten yessoensis) | AFLP, SSR | 15.1 | Xu et al. (2009)
Japanese scallop | AFLP | 8.5 | Liu et al. (2009)
Bay scallop (Argopecten irradians irradians) | AFLP | 13.1 | Wang et al. (2007)
Bay scallop | AFLP, SSR | 7.1 | Qin et al. (2007)
Oyster (Crassostrea gigas) | SSR | 9.2 | Hubert and Hedgecock (2004)
European flat oyster (Ostrea edulis) | AFLP, SSR | 4.54 | Lallias et al. (2007a)
Blue mussel (Mytilus edulis) | AFLP | 8.0 | Lallias et al. (2007b)
Pacific abalone (Haliotis discus hannai Ino) | AFLP, RAPD, SSR | 18.2 | Liu et al. (2006)
Sea urchin (Strongylocentrotus nudus and Strongylocentrotus intermedius) | AFLP | 17.1 | Zhou et al. (2006)
Pacific white shrimp (Litopenaeus vannamei) | SSR | 22 | Warren et al. (2007)
Pacific white shrimp | AFLP, SSR | 12.5 | Zhang et al. (2007)
Pacific white shrimp | SNP | 5.0 | Du et al. (2010)
Tiger shrimp (Penaeus monodon) | SSR | 7.7 | Maneeruttanarungroj et al. (2006)
Tiger shrimp | AFLP | - | Staelens et al. (2008)
Tiger shrimp | AFLP, SSR | 12.5 | You et al. (2010)
Chinese shrimp (Fenneropenaeus chinensis) | AFLP | 12.2 | Li et al. (2006)
Chinese shrimp | AFLP | 14.5 | Tian et al. (2008)
Kuruma prawn (Penaeus japonicus) | AFLP | 10.0 | Li et al. (2003)

AFLP, amplified fragment length polymorphism; SSR, simple sequence repeat (or microsatellite); SNP, single nucleotide polymorphism; VNTR, variable number of tandem repeat; RAPD, random amplified polymorphic DNA; SINE, short interspersed nuclear element; SSLP, simple sequence length polymorphism.

The Number of Markers Needed in MAS Application

The required number of molecular markers depends on the specific application. For instance, 6–15 polymorphic simple sequence repeat (SSR) markers of relatively good quality are sufficient for parentage analysis and progeny identification. For marker-assisted introgression, one or two markers are enough if they are closely linked with the functional genes controlling the trait. For evaluating genetic diversity during the breeding process, four to six pairs of polymorphic amplified fragment length polymorphism (AFLP) primers or 6–15 SSR markers can meet the need. When MAS is used for evaluating genetic effects and EBVs, the relevant markers can be added directly to the EBV estimation, provided that the trait is dominated by a single gene or a few major genes, the QTL of the trait are finely mapped, and their genetic effects have been analyzed. Currently, the number of markers available for MAS in aquaculture species is still very low.

The real issue is the number of markers needed for QTL detection. The extent of linkage disequilibrium (LD) varies among organisms. For instance, moderate LD (r² ≥ 0.2) extends about 5 kb (∼0.005 cM) in humans, but about 100 kb in Holstein cattle. Because aquatic animals have undergone relatively little breeding and improvement and have a relatively high level of genomic variation, they may have a shorter extent of LD. The Zhikong scallop genome is approximately 1200 Mb; if moderate LD (r² ≥ 0.2) is assumed to extend 10 kb, 120,000 markers are required so that, on average, each QTL is in LD with at least one marker. Even with an LD extent of 100 kb, as in Holstein cattle, 12,000 markers are still needed for a genome-wide association study. If marker clustering and the loss of rare loci are taken into account, even more markers are needed.
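The marker-count arithmetic for the Zhikong scallop example can be sketched in a few lines of Python. This is a minimal illustration, assuming evenly spaced markers with one marker per LD block; the function name `markers_needed` is hypothetical, and the genome size and LD extents are the figures quoted in the text.

```python
def markers_needed(genome_size_bp: int, ld_extent_bp: int) -> int:
    """Minimum number of evenly spaced markers so that each LD block holds one marker."""
    return -(-genome_size_bp // ld_extent_bp)  # ceiling division


ZHIKONG_GENOME_BP = 1_200_000_000  # approximately 1200 Mb, as quoted in the text

for ld_kb in (10, 100):  # assumed extent of moderate LD, in kb
    n = markers_needed(ZHIKONG_GENOME_BP, ld_kb * 1_000)
    print(f"LD extent {ld_kb} kb: about {n:,} markers")
```

Because real markers cluster and rare loci are lost, these counts are best read as lower bounds.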

QTL Mapping

Geldermann (1975) defined a QTL as a site controlling phenotypic variation in a quantitative trait. According to the polygenic theory of quantitative traits, the number of genes that control a quantitative trait cannot be easily estimated. However, the location of a genetic effect on the trait can be detected. A QTL is therefore, statistically, a location, and QTL mapping assigns a probability to the genome segments that influence the quantitative trait.

Linkage Analysis and LD Detection of QTLs

Two major approaches exist for QTL detection: linkage analysis and association analysis. Linkage analysis examines the relationship between markers and phenotypes using a genetic linkage map. Analysis of variance, regression analysis, and likelihood ratio tests are the statistical tools commonly used. Several methods are available for linkage analysis, including (1) interval mapping (Lander and Botstein, 1989), (2) regression analysis (Haley and Knott, 1992; Martinez et al., 1999), (3) composite interval mapping (Jansen, 1993; Zeng, 1993, 1994), and (4) multiple interval mapping (Kao and Zeng, 1997, 2002), in which several QTLs and their interactions are fitted at the same time. The major drawback of multiple interval mapping is that it can detect only the interactions between major QTLs and often cannot detect QTLs with smaller effects.
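As a rough illustration of how regression and likelihood ratio testing combine in linkage analysis, the sketch below tests a single marker in a simulated backcross family and expresses the fit as a LOD score. It assumes NumPy is available; the data, the effect size, and the function name `lod_single_marker` are hypothetical, and a single-marker test is a simplification of the interval mapping methods cited above, not an implementation of them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Backcross genotypes: 0 or 1 copies of the donor allele at the marker.
genotype = rng.integers(0, 2, size=n).astype(float)
# Simulated phenotype with a marker effect of 1.0 and residual SD of 1.0.
phenotype = 10.0 + 1.0 * genotype + rng.normal(0.0, 1.0, size=n)


def lod_single_marker(g, y):
    """Return (LOD, effect) from regressing phenotype y on marker genotype g."""
    rss0 = np.sum((y - y.mean()) ** 2)            # null model: mean only
    X = np.column_stack([np.ones_like(g), g])     # full model: mean + marker
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss1 = np.sum((y - X @ beta) ** 2)
    # For normal residuals the likelihood ratio gives LOD = (n/2) log10(RSS0/RSS1).
    lod = (len(y) / 2.0) * np.log10(rss0 / rss1)
    return lod, beta[1]


lod, effect = lod_single_marker(genotype, phenotype)
print(f"LOD = {lod:.2f}, estimated marker effect = {effect:.2f}")
```

A LOD threshold of about 3 is a common rule of thumb for a single test; genome scans need thresholds adjusted for the number of positions tested.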


Another method for QTL mapping is association analysis through the detection of LD. If two loci are linked and cosegregate, parental-type gametes are more frequent than recombinant-type gametes, which produces LD. The principles of linkage analysis and association studies are therefore similar: both rely on variation at, and cosegregation of, two adjacent loci, and LD between the marker and the QTL is the basis of QTL detection. The difference is that linkage analysis considers only LD within a family, which can be broken down within a few generations by recombination, whereas association studies consider LD across the population. QTL mapping by association analysis therefore requires that LD between the marker and the QTL has persisted through many generations in the whole population, so the marker and the QTL must be very closely linked.

QTL detection with genetic markers is possible only when the marker and the QTL are in LD, so it depends on the type and extent of LD in the population analyzed. In a hybrid population, for instance, LD is widespread. Dekkers (2007) pointed out that when two strains with different genotypes are crossed, LD can be detected in the hybrid population because of the difference in haplotype frequencies. The extent of LD in the F1 population is large, and extensive LD is still found in the F2. Therefore, a limited number of markers distributed across the genome (roughly one marker every 15–20 cM) is enough for a genome scan for QTL. This is the foundation of QTL mapping with F2 or backcross progenies. However, the large extent of LD in such early hybrid populations limits the precision of QTL mapping. As the number of generations after hybridization increases, the extent of LD decreases, so fine QTL mapping requires a dense marker map. The rapid decline of LD in recombinant populations and the marked reduction of its extent can in turn affect the mapping analysis. For aquatic animals, large families can be generated to alleviate this problem, but at the expense of increased workload.

In outbred stocks, the linkage phase between the marker and the QTL varies among families, so when within-family LD is used for QTL detection, the marker effect must be fitted within each family rather than across the whole population. Most QTL detection in aquatic animals is conducted with outbred families, for example a family produced from male and female individuals of two geographical populations.

Except at tightly linked loci, a closed population is usually in linkage equilibrium after several generations. In such populations, only markers closely linked with the QTL can be associated with the phenotype, and because of random sampling effects a marker–QTL association cannot be guaranteed. Dekkers (2004) proposed two strategies to detect LD at the population level: (1) evaluating markers within or near the gene(s) associated with the trait (candidate genes); and (2) scanning the genome with a high-density linkage map with a marker spacing of 0.5–2.0 cM. The success of both strategies clearly depends on the extent of LD in the population. Studies show that the extent of LD in human populations is generally shorter than 1 cM. In livestock populations, because of selection and inbreeding, LD extends further and can reach 10 cM (Farnir et al., 2000; Heifetz et al., 2005).
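The contrast between extensive LD in a fresh cross and its decay over generations can be illustrated with the usual two-locus measures. The sketch below is a simplified illustration, assuming two biallelic loci, random mating, and no selection or drift; the function names and the example frequencies are hypothetical. It uses the standard definitions D = p_AB − p_A·p_B and r² = D² / (p_A(1 − p_A)p_B(1 − p_B)), and the standard expectation D_t = D_0(1 − c)^t for recombination fraction c.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r2) for two biallelic loci from haplotype and allele frequencies."""
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


def ld_after(d0, c, generations):
    """Expected D after the given number of random-mating generations."""
    return d0 * (1 - c) ** generations


# Hypothetical cross of two fully differentiated lines: only haplotypes AB and ab exist.
d0, r2_0 = ld_stats(p_ab=0.5, p_a=0.5, p_b=0.5)
print("initial D and r2:", d0, r2_0)

for c in (0.20, 0.01):  # loosely versus tightly linked marker-QTL pairs
    decay = [round(ld_after(d0, c, t), 4) for t in (0, 1, 5, 20)]
    print(f"c = {c}: D at generations 0, 1, 5, 20 = {decay}")
```

With loose linkage D is almost gone after 20 generations, whereas with tight linkage most of it persists, which is why population-level association needs closely linked markers.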


Figure 12.1 Schematic comparison of various methods for the identification of association of nucleotide polymorphisms with traits in terms of resolution, research time, and allele number. BC, backcross. (Figure from Yu and Buckler, 2006). [Figure not reproduced. Axes: resolution (bp), research time (years), and allele number. Approaches compared: F2/BC, near-isogenic lines, recombinant inbred lines, intermated recombinant inbreds, pedigree, positional cloning, and association mapping.]

The candidate gene approach can be effective when rich genome information is available for the species (as in humans and mice), when the effects of mutations in the candidate gene are known from other species, when a QTL region has been detected previously, and when the physiological basis of the trait is understood. Because aquatic animals have not been intensely selected and wild populations are very large, most populations are close to Hardy–Weinberg (H-W) equilibrium. Natural populations can therefore be used directly for LD analysis. Alternatively, resource populations for LD analysis can be constructed artificially.

Compared with linkage mapping, association mapping has three advantages: a higher mapping resolution, a greater number of alleles sampled from a broader population, and a shorter research time (Figure 12.1). Linkage mapping and association mapping can nevertheless complement each other to improve statistical power. For example, linkage analysis can supply prior information for cross-validation against the results of association analysis. In practice, linkage analysis can be used first to locate the approximate QTL region, and association mapping can then be applied for precise QTL localization. In recent years, with the increasing availability of large numbers of markers, association mapping has become more and more popular, especially for outcrossing species (such as humans) in which crosses and genetic manipulation cannot be controlled. Aquatic animals are mostly outcrossing and have high fecundity and genetic diversity, which makes them well suited to QTL detection by association analysis.


QTL Mapping Methods

A number of QTL mapping methods are available, including LD analysis, the transmission disequilibrium test (TDT), identity by descent (IBD), the pedigree disequilibrium test (PDT), and the pedigree transmission disequilibrium test (PTDT). Based on LD, Spelman et al. (1999) proposed the TDT method, which relates the marker to the trait by comparing marker alleles between variant and nonvariant individuals.

The basic procedure of QTL analysis by association study is as follows. A sample is drawn from the population under study. If the trait is dichotomous (for example, “with” or “without” the trait), LD between the marker and the trait is assessed from the difference in allele frequency between the two groups, or from the significance of the association across marker classes. If the trait is continuous, LD is assessed from differences in the trait mean or distribution among marker classes, or from the significance of the association across marker classes.

The PDT is based on a random variable that measures LD and was initially proposed by Martin et al. (2000). It was later revised as the quantitative pedigree disequilibrium test (QPDT) by Zhang et al. (2001). The PDT can use data from related nuclear families within extended pedigrees and is valid even when there is population substructure. Abecasis et al. (2000) and Malkin et al. (2002) developed the quantitative transmission disequilibrium test (QTDT), which provides another family-based test of linkage disequilibrium for quantitative and discrete traits.

The main principle of IBD mapping is to find a segment (the IBD segment or region) that derives from the same ancestor and is shared among the progenies. Genes that are copies of a single gene in a common ancestor of the individuals who now carry them are said to be IBD. If the population carries a specific variant, LD between the QTL and closely linked markers is used to detect the IBD segment shared among the variant individuals, and the region is then examined with tightly linked markers to localize the gene precisely. Pong-Wong et al. (2001) proposed a fast algorithm for calculating IBD matrices. Using coancestry coefficients, Meuwissen et al. (2001) constructed an IBD matrix for QTL alleles from the haplotypes of individuals, without pedigree information. In livestock and poultry, the IBD method has been applied to gene mapping. Riquet et al. (1999) finely mapped a QTL influencing milk fat percentage in an outbred population by IBD. Meuwissen et al. (2002) mapped a QTL for twinning rate in cattle to within an interval of 1 cM. Van Laere et al. (2003) localized the IGF2 QTL, which affects pig growth, to a single base.

For pedigree-based analysis, the most prominent advantage of the TDT is that it eliminates the effect of population stratification. For population samples, genomic control (GC) and structured association (SA) are the methods commonly adopted in human and plant studies to deal with population structure and stratification. GC uses a set of random markers to estimate how much the test statistics are inflated by population structure, on the assumption that the structure has the same effect at all loci (Devlin and Roeder, 1999). In contrast, SA uses a set of random markers to estimate the population structure (Q) and then incorporates this estimate into the subsequent test statistics (Pritchard and Rosenberg, 1999; Pritchard et al., 2000; Falush et al., 2003). A mixed linear model method that modifies SA with logistic regression (Yu and Buckler, 2006) has also been reported; it can be used for association mapping in samples with multiple levels of relatedness. In this method, random markers are used to estimate Q and the relative kinship matrix (K), which are then fitted in a mixed linear model to test the association between markers and traits. The method controls type I and type II errors effectively. It bridges family-based and population-based sampling and provides a useful supplement to current association mapping methods.
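The genomic control idea described above can be sketched briefly. This is a minimal illustration, assuming that each random marker yields an association statistic that follows a chi-square distribution with 1 d.f. under the null hypothesis; the inflation factor is estimated as the observed median divided by the theoretical median (about 0.455), and every statistic is then divided by it. The simulated statistics, the inflation of 1.5, and the function name `genomic_control` are hypothetical.

```python
import random
import statistics

CHI2_1DF_MEDIAN = 0.4549  # median of a chi-square distribution with 1 d.f.


def genomic_control(chi2_stats):
    """Return (inflation factor, corrected statistics); the factor is never below 1."""
    lam = max(1.0, statistics.median(chi2_stats) / CHI2_1DF_MEDIAN)
    return lam, [x / lam for x in chi2_stats]


# Hypothetical statistics at random (null) markers, uniformly inflated by structure.
random.seed(0)
null_stats = [1.5 * random.gauss(0.0, 1.0) ** 2 for _ in range(5000)]

lam, corrected = genomic_control(null_stats)
print(f"estimated inflation factor: {lam:.2f}")
```

The same factor would then be applied to the statistics at candidate markers, which is where the assumption of equal inflation at all loci matters.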

Construction of Resource Population and Genetic Mating Design

QTL mapping accuracy depends on the mapping population. For crop plants, QTL mapping populations are mainly simple segregating populations involving few parents, such as F2s, backcrosses, doubled haploids, recombinant inbred lines, and advanced intercross lines. For aquatic animals, the long generation time makes it unrealistic to develop, within a short time, the highly inbred or pure strains suitable for fine mapping. However, certain features of aquatic animals, such as their high fecundity, can be exploited to establish mapping populations through a proper mating design. For example, a nested association mapping (NAM) population can be constructed to provide a high-resolution genome scan. Yu et al. (2008) proposed this strategy: “The NAM strategy addresses complex trait dissection at a fundamental level through generating a common mapping resource that enables researchers to efficiently exploit genetic, genomic, and systems biology tools. The proposed procedure in NAM involves the following steps: (1) selecting diverse founders and developing a large set of related mapping progenies [preferably recombinant inbred lines (RILs) for robust phenotypic trait collection], (2) either sequencing completely or densely genotyping the founders, (3) genotyping a smaller number of tagging markers on both the founders and the progenies to define the inheritance of chromosome segments and to project the high-density marker information from the founders to the progenies, (4) phenotyping progenies for various complex traits, and (5) conducting genomewide association analysis relating phenotypic traits with projected high-density markers of the progenies.” NAM should find its way to aquatic animals for the analysis of QTLs and for long-term breeding programs.

In addition to NAM, the diallel cross, which is composed of all possible single crosses among a group of inbred lines, has become a common mating design in plant and animal breeding. In a full diallel, all parents are crossed to make hybrids in all possible combinations. Half diallels, which omit reciprocal crosses, can be used to reduce the number of crosses. Full diallels require twice as many crosses and entries in experiments as half diallels, but they allow maternal and paternal effects to be tested. If such reciprocal effects are negligible, a half diallel without reciprocals can be effective. Recently, Verhoeven et al. (2006) detected QTL in a diallel cross population. In aquatic animal breeding, a great number of diallel cross studies have been conducted, but the diallel cross has not been fully explored for QTL mapping.
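The difference between full and half diallels can be enumerated directly. The sketch below is a small illustration, assuming four hypothetical parental lines and excluding selfed matings (some diallel designs include them); each ordered pair is read as (dam, sire).

```python
from itertools import combinations, permutations

parents = ["P1", "P2", "P3", "P4"]  # hypothetical parental lines

# Full diallel: every ordered (dam, sire) pair, so reciprocal crosses are included.
full_diallel = list(permutations(parents, 2))
# Half diallel: every unordered pair once, so reciprocal crosses are omitted.
half_diallel = list(combinations(parents, 2))

print(len(full_diallel), "crosses in the full diallel")
print(len(half_diallel), "crosses in the half diallel")
```

For n parents this gives n(n − 1) crosses for the full design and n(n − 1)/2 for the half design, which is the factor of two mentioned above.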

Several Noteworthy Trends

Dynamic Traits and Dynamic Trait QTLs

The methods discussed above detect QTLs for the observed value of a quantitative trait (usually the final result) at a single time point, that is, static QTLs.


However, the expression of a trait is a process, so dynamic QTLs are also important. Dynamic traits, also called developmental traits, are quantitative traits that vary over time during the development of the organism. There are three approaches to the analysis of dynamic trait QTLs: (1) the phenotypic values observed at different time points (or the increments between time points) are treated as repeated records and analyzed in a repeated-measures framework; (2) the values observed at different time points are treated as different traits and analyzed by multivariate methods; and (3) a mathematical model relating time to the observed phenotypic values is fitted, and the model parameters are analyzed by multivariate methods.

The first approach is the simplest: the data at each time point are analyzed separately with conventional QTL methods, which can reveal that different genes may act at different developmental stages. The conditional QTL mapping proposed by Zhu (1995) can also be used to analyze the net effect between two time points. However, in many cases dynamic traits do not simply accumulate with time. This problem can be avoided with multitrait QTL methods (Jiang and Zeng, 1995). As the number of observation time points grows, however, both the dimension of the variables and the number of parameters increase, and so does the computational burden. If the time points are too few, the dynamic process of the trait cannot be described accurately. To reduce the dimension, principal component analysis can be used to obtain a few major composite variables. Alternatively, a mathematical model of the growth curve can be used to describe the process.

It is therefore biologically meaningful to fit the relationship between the observed phenotypic value and time with a mathematical model, such as a growth curve, and to conduct multitrait QTL analysis on the model parameters that have a biological interpretation. The main mathematical models include the growth model, the orthogonal polynomial model, and the allometric growth model. Wu and Lin (2006) suggested that both QTL mapping based on model selection and Bayesian shrinkage estimation can be used for the functional mapping of dynamic traits.
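The third approach, reducing each individual's growth record to a few model parameters, can be illustrated with the orthogonal polynomial model. The sketch below assumes NumPy is available and uses `numpy.polynomial.legendre.legfit` and `legval`; the ages, body weights, and the choice of a second-degree polynomial are hypothetical. The fitted coefficients, rather than the raw records, would then serve as the traits in a multitrait QTL analysis.

```python
import numpy as np
from numpy.polynomial import legendre

# Hypothetical body weights (g) of two individuals recorded at six ages (days).
ages = np.array([30, 60, 90, 120, 150, 180], dtype=float)
weights = np.array([
    [2.1, 6.0, 13.5, 24.0, 36.5, 50.0],
    [1.8, 5.1, 11.0, 19.5, 30.0, 41.0],
])

# Rescale age to [-1, 1], the interval on which Legendre polynomials are orthogonal.
t = 2.0 * (ages - ages.min()) / (ages.max() - ages.min()) - 1.0

# legfit accepts one column per data set, hence the transpose (time x individual).
coefs = legendre.legfit(t, weights.T, deg=2)   # shape: (3 coefficients, 2 individuals)
fitted = legendre.legval(t, coefs)             # shape: (2 individuals, 6 time points)

for i in range(weights.shape[0]):
    print(f"individual {i + 1}: coefficients = {np.round(coefs[:, i], 2)}")
```

Six records per individual are thereby compressed into three coefficients, which keeps the dimension of the multitrait analysis fixed as more time points are added.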

eQTLs

Expression profiling usually identifies target genes and genetic effects by comparing expression profiles between two or more treatments. A QTL for which the expression level of a gene is treated as the quantitative trait is called an eQTL. When the map position of an eQTL coincides with the location of a candidate gene, the gene underlying the quantitative trait can be identified (Jansen and Nap, 2001; Gibson and Weir, 2005). eQTL analysis has been conducted in yeast (Brem et al., 2002), maize, and mice (Schadt et al., 2003), and it has become an active research area in recent years.

MAS EBV Estimation

According to Dekkers (2004), MAS comprises three types: (1) MAS based on markers in linkage equilibrium (LE) with a QTL (LE-MAS); (2) MAS based on markers in LD with a QTL (LD-MAS); and (3) MAS based on the mutation responsible for the QTL effect (Gene-MAS). For LE markers, marker and QTL alleles are randomly associated across haplotypes in the population, so an individual's marker genotype alone provides no information about its QTL genotype, and selecting on the marker has a random effect on the QTL. In contrast, LD-MAS is the most useful. Applying LD-MAS in breeding requires two steps: (1) estimating the genetic effects of the molecular markers in a reference population and (2) calculating the breeding values of a group of selection candidates from their marker information.

For quantitative traits, different researchers have presented different strategies for MAS EBV estimation. Dekkers (2007) proposed that once a marker linked with a QTL has been found, the marker effect can be estimated from the association between genotype and phenotype. The estimate can be used as the “molecular score” of each candidate in three selection strategies: (1) selection on the molecular score only; (2) tandem selection (selection on the molecular score first, followed by selection on the EBV estimated from phenotypic values); and (3) selection on an index combining the molecular score and the conventional EBV. Dekkers thus treated the genetic effect of the QTL only as a supplement to the phenotype-based EBV, adding it to the conventional EBV as an additive effect. Zhang and Smith (1992) simulated and evaluated three selection strategies, in which selection was based on the molecular score of the marker effects, on the BLUP EBV estimated from phenotypic values, or on an index combining the EBVs estimated from markers and from phenotypic values (COMB, a combination of MAS and conventional BLUP). The selection response was greatest for COMB, followed by BLUP EBV selection.
Selection on the molecular score alone ignores the influence of other genes (polygenes) on the trait and has the lowest selection efficiency, unless the molecular score captures the effects of all genes influencing the trait. However, this strategy requires only the marker scores and no further phenotypic records, so it is useful when phenotypes are very difficult or too costly to record (e.g., disease traits or meat quality).

Another strategy integrates the marker information into BLUP. Such integration depends on the predictive ability of the markers. LD markers can be applied to the whole population, whereas LE markers are family-specific and so cannot be applied to the evaluation of other families. Genotypic effects can be fitted as either random or fixed effects in EBV prediction. Theoretically, all phenotypic variance can be assumed to be caused by polymorphisms at some loci in the genome. In this way, the phenotypic covariance of subpopulations that share specific genetic similarity can be evaluated from the average relationship in the population.

If the gene or QTL effect is clearly known, it is treated as a fixed effect that is the same in all populations, and the effect of the QTL variants can be estimated by simple regression analysis. The regression can be fitted on known genotypes as a class variable and used to infer unknown genotypes (Kinghorn, 1999). If few alleles are known to be segregating, the fixed model is relatively sensitive. If there are multiple QTL and gene loci, the model also assumes that effects are the same across families; these effects can be estimated separately, which gives power to account for dominance and epistasis, and the EBVs are then estimated with the polygenic model. The advantage of the fixed QTL model is that only a limited number of QTL effects need to be fitted.

The QTL effect can also be fitted as a random effect. Each individual then has a different QTL effect, and the covariance is based on the probability of genetic relationship rather than on the numerator relationship matrix of the animal model. The random model is suitable for populations with multiple parental origins and different genotypic effects, such as the selected populations of most aquatic animals. The random model does not require an assumption about the number of alleles at each QTL and automatically adjusts for possible interactions between the genetic background and the QTL. If the information is complete, all effects for each animal finally collapse into one genotypic effect value (Pong-Wong et al., 2001). Therefore, in the random model, which is an extension of the mixed linear model, it is not necessary to assume that genetic backgrounds are uniform. The estimated QTL EBV can be combined with the EBV from the polygenic model, and the total EBV is simply the sum of the individual EBVs.
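The two LD-MAS steps and the final summation of EBVs can be sketched as follows. This is a deliberately simplified illustration, assuming a single biallelic marker with a purely additive effect fitted by simple regression; the reference data, the candidate values, and the function names are hypothetical, and centering the molecular score on the reference population mean is one possible convention rather than a prescribed one.

```python
def allele_substitution_effect(genotypes, phenotypes):
    """Least-squares slope of phenotype on marker allele count (0, 1 or 2)."""
    n = len(genotypes)
    g_mean = sum(genotypes) / n
    y_mean = sum(phenotypes) / n
    sxy = sum((g - g_mean) * (y - y_mean) for g, y in zip(genotypes, phenotypes))
    sxx = sum((g - g_mean) ** 2 for g in genotypes)
    return sxy / sxx


def total_ebv(allele_count, polygenic_ebv, alpha, g_mean):
    """Sum of the molecular score (marker part) and the polygenic EBV."""
    molecular_score = alpha * (allele_count - g_mean)
    return molecular_score + polygenic_ebv


# Step 1: estimate the marker effect in a hypothetical reference population.
ref_genotypes = [0, 0, 1, 1, 1, 2, 2, 2]
ref_phenotypes = [9.8, 10.2, 11.1, 10.9, 11.0, 12.1, 11.9, 12.0]
alpha = allele_substitution_effect(ref_genotypes, ref_phenotypes)
g_mean = sum(ref_genotypes) / len(ref_genotypes)

# Step 2: score a selection candidate carrying two favourable alleles.
print(f"estimated allele substitution effect: {alpha:.3f}")
print(f"total EBV of candidate: {total_ebv(2, 0.5, alpha, g_mean):.3f}")
```

Setting the polygenic EBV to zero gives selection on the molecular score only, the first of the three strategies listed earlier.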

The Application of MAS in the Breeding of Aquatic Animals

Professor Okamoto’s research group (Fuji et al., 2007) provided an excellent case in breeding Japanese flounder, Paralichthys olivaceus, for resistance to lymphocystis disease. They mapped the disease resistance and found that it was controlled by a single locus in a dominant fashion. Using the linked microsatellite marker, they obtained a homozygous fish population. Crossing these homozygous individuals with a fast-growing strain produced a new resistant strain with high yield. This is probably the only example, or one of very few successful examples, of MAS in aquaculture species.

Genomic Selection

The advantage of MAS is its relative simplicity. Once markers are found to be linked with a performance trait, the presence of the superior alleles is used to select broodstock. However, its success depends on the proportion of genetic variation explained by the QTLs linked to the markers (Meuwissen and Goddard, 1996). In livestock species, an estimated 100–200 QTLs may be involved in an economic trait, of which 10–40 QTLs could explain 50% of the trait variation (Hayes and Goddard, 2001; Figure 12.2). Most often, QTL mapping cannot provide a full picture of all the QTLs involved and identifies only a fraction of the major ones affecting the trait. Usually, no more than 50% of the trait variation can be detected. Use of the markers identified for the major QTLs of a trait, therefore, covers no more than 50%–60% of the genetic variation affecting the trait. This is the major limitation of MAS for genetic gain.

To estimate all the genetic effects, genome-wide coverage is required. Meuwissen et al. (2001) first put forward whole genome selection, which estimates the genetic effects (QTL) with genome-wide information from numerous markers. In a sense, genome selection is a form of MAS in which genetic markers covering the whole genome are used so that all QTLs are in linkage disequilibrium with at least one marker. As shown in Figure 12.2, the power of genome selection depends on the coverage of the genome by markers to ensure that all QTLs are in LD with markers.

Figure 12.2 Proportion of genetic variance explained by QTL ranked in order of size of effect. (Figure was modified from Hayes and Goddard, 2001). [Figure not reproduced. Horizontal axis: percentage of QTL (ranked in order of size). Vertical axis: percentage of genetic variance accounted for. Curves: pigs and dairy.]

Main Steps of Whole Genome Selection

As detailed in Chapters 9 and 10, there are two major steps in whole genome selection: (1) estimating the effects of chromosome fragments across the population by using high-density markers such as single-nucleotide polymorphisms (SNPs) and (2) estimating genomic EBVs (GEBVs) for candidate animals that have been genotyped but have no phenotypic information. The marker-trait association is based on training data for which both genotypes and phenotypes are known.

Whole genome selection can be carried out by single-marker, haplotype, and IBD approaches. They differ in the number of effects to be estimated for each chromosome fragment: in single-marker analysis, each segment corresponds to one effect; in haplotype analysis, each segment corresponds to multiple effects. Currently, there are four methods for whole genome selection via single markers or haplotypes: least squares (LS), BLUP, BayesA, and BayesB. LS does not take the distribution of chromosome segment effects into consideration. The main problem with this method is that significance levels must be chosen to select the chromosome segments whose effects enter the EBV; because of multiple comparisons, the segment effects are usually overestimated. BLUP introduces ridge regression, which avoids this overestimation. However, the variance of the effects of larger chromosomal segments is easily overestimated, which lowers the accuracy of selection.

When estimating the effects of single markers or haplotypes, the Bayesian methods can take advantage of the prior knowledge that some segments contain QTLs with relatively large or small effects, or contain no QTL at all. In reality, the distribution of genetic variance across loci involves many loci with no genetic variance (not segregating) and only a few with genetic variance. BayesA takes into account only the information on QTLs with larger and smaller effects, whereas BayesB, which is based on BayesA, also takes the information on non-QTL segments into account, which makes the estimation more accurate. In BayesA, the effects of a large number of segments are considered to approach 0, while in BayesB they are considered to be exactly 0, which improves the accuracy with which the effects of the other segments are estimated. Meuwissen et al. (2001) compared the accuracy of the different methods (Table 12.2). Clearly, the Bayesian methods provide greater accuracy in predicting EBVs.

Table 12.2 Comparison of EBV versus true breeding value (TBV) in progeny with no phenotypic records (modified from Meuwissen et al., 2001).

Method | r(TBV; EBV) ± SE | b(TBV; EBV) ± SE
LS | 0.318 ± 0.018 | 0.285 ± 0.024
BLUP | 0.732 ± 0.030 | 0.896 ± 0.045
BayesA | 0.798 | 0.827
BayesB | 0.848 ± 0.012 | 0.946 ± 0.018

Mean of five replicated simulations. LS, least squares; BLUP, best linear unbiased prediction; Bayes, Bayesian method with inverse chi-square prior distribution and where the prior density of having zero QTL effects was increased; r(TBV; EBV), correlation between estimated and true breeding values (equals accuracy of selection); b(TBV; EBV), regression of true on estimated breeding value.

In addition, whole genome selection can be conducted with the IBD method. Haplotypes are superior to single markers in that they give more accurate estimates of IBD segments. In practice, different haplotypes may carry the same QTL alleles; the probability that they are IBD and carry the same QTL alleles can therefore be obtained from the covariance between haplotypes, which can be applied to whole genome selection with simultaneous calculations at different locations in the genome.
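The two steps of whole genome selection can be sketched with the BLUP (ridge regression) approach, in which all marker effects are shrunk equally. The example below is a toy simulation, assuming NumPy is available; the population sizes, the number of QTL, the heritability of about 0.5, and the ridge parameter `lam` are hypothetical choices, and the sketch does not reproduce the simulation design of Meuwissen et al. (2001). It also computes the two summary statistics defined in the note to Table 12.2.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, n_cand, n_markers = 300, 100, 500

# Simulated genotypes coded as allele counts 0/1/2; 20 markers act as QTL.
X = rng.integers(0, 3, size=(n_train + n_cand, n_markers)).astype(float)
true_effects = np.zeros(n_markers)
qtl = rng.choice(n_markers, size=20, replace=False)
true_effects[qtl] = rng.normal(0.0, 1.0, size=20)
tbv = X @ true_effects                                   # true breeding values
y = tbv + rng.normal(0.0, tbv.std(), size=tbv.size)      # heritability about 0.5

X_train, y_train = X[:n_train], y[:n_train]              # genotyped and phenotyped
X_cand, tbv_cand = X[n_train:], tbv[n_train:]            # genotyped only


def ridge_marker_effects(X, y, lam):
    """Step 1: solve (X'X + lam*I) b = X'y on centered data."""
    col_means = X.mean(axis=0)
    Xc = X - col_means
    yc = y - y.mean()
    b = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return b, col_means


effects, col_means = ridge_marker_effects(X_train, y_train, lam=300.0)

# Step 2: GEBVs of the candidates are the sums of their estimated marker effects.
gebv = (X_cand - col_means) @ effects

accuracy = np.corrcoef(tbv_cand, gebv)[0, 1]     # r(TBV; EBV)
slope = np.polyfit(gebv, tbv_cand, 1)[0]         # b(TBV; EBV)
print(f"accuracy = {accuracy:.2f}, regression of TBV on GEBV = {slope:.2f}")
```

In practice the ridge parameter is usually tied to the ratio of residual variance to marker-effect variance; a fixed value is used here only to keep the sketch short.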

Advantages of Whole Genome Selection

Compared with MAS, whole genome selection has several clear advantages:

(1) Whole genome selection can, in principle, detect and estimate all genetic effects and genetic variation, whereas MAS can detect only part of the genetic variation and can easily overestimate the genetic effects.
(2) Marker coverage of the genome is much more complete and dense with genomic selection. For whole genome selection, each single marker or haplotype must be in sufficient LD with QTLs for the QTL effects to be estimated. This requirement is also a disadvantage, because in many species the lack of sufficient marker coverage of the genome would prohibit the application of whole genome-based selection. Meuwissen et al. (2001) suggested that in dairy cattle breeding, at least 30,000 markers are required for whole genome selection across the 3000 Mb bovine genome.
(3) Genotyping individuals at an early stage and estimating their EBVs can shorten the generation interval and accelerate genetic progress.
(4) Whole genome selection has great potential for traits that are costly to measure or that have low heritability.
(5) The effective cost of whole genome selection is low. With MAS, it is difficult to select effectively for a large number of traits simultaneously without excessive cost, whereas with whole genome selection all traits of interest can be selected simultaneously.
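The marker requirement quoted in point (2) implies an average marker spacing, which shows how the requirement might scale with genome size. The following is a back-of-the-envelope sketch only. It assumes evenly spaced markers and that LD extends over similar distances in the target species as in cattle, neither of which the source states, and the 1000 Mb genome is a hypothetical example.

```python
def average_spacing_bp(genome_bp: int, n_markers: int) -> int:
    """Average distance between evenly spaced markers, in base pairs."""
    return genome_bp // n_markers


def markers_for_spacing(genome_bp: int, spacing_bp: int) -> int:
    """Number of evenly spaced markers needed to reach a given average spacing."""
    return -(-genome_bp // spacing_bp)  # ceiling division on integers


BOVINE_GENOME_BP = 3_000_000_000   # 3000 Mb, as quoted in the text
BOVINE_MARKERS = 30_000            # minimum number suggested in the text

spacing = average_spacing_bp(BOVINE_GENOME_BP, BOVINE_MARKERS)
print(f"Implied average spacing: {spacing // 1000} kb")

# Hypothetical 1000 Mb fish genome covered at the same average spacing.
HYPOTHETICAL_FISH_GENOME_BP = 1_000_000_000
fish_markers = markers_for_spacing(HYPOTHETICAL_FISH_GENOME_BP, spacing)
print(f"Markers for a 1000 Mb genome at that spacing: {fish_markers}")
```

Species in which LD decays faster than in cattle would need denser coverage than this simple scaling suggests.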

Application of Genome Selection in Breeding Programs

Since its proposal in 2001, whole genome selection has attracted much attention in many countries, including the United States, the United Kingdom, Australia, The Netherlands, Israel, and Canada (Berry, 2007). Whole genome selection has great potential in aquaculture species, which have large progeny populations. For example, in fish breeding for disease resistance, a case-control design can be adopted according to the actual breeding population. With genome selection, it is not necessary to conduct virus challenge tests or to maintain families; once the training data have been obtained, only genotyping is needed. A major challenge, however, is the large fish population (about 30,000) required to generate the training data (T.H.E. Meuwissen, pers. comm.). Using a sib-based aquaculture breeding population, Nielsen et al. (2009) compared genome selection and BLUP breeding in terms of the accuracy of EBV estimation. The experimental population consisted of five generations, with 1000 candidate individuals from 100 families per generation. Overall, the estimation accuracy of GWEBV (genome-wide estimated breeding values) was 33% higher than that of BLUP EBV. Because GWEBV uses both within-family and between-family information from full-sib families for EBV estimation, the estimation accuracy was significantly improved.
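To put the 33% gain in accuracy into perspective, the standard response-to-selection formula, annual gain = i × r × σ_A / L (selection intensity, accuracy, additive genetic standard deviation, generation interval), shows that response is proportional to accuracy. The sketch below assumes that formula. All numerical inputs (intensity, BLUP accuracy, genetic standard deviation, generation interval) are hypothetical values chosen for illustration and are not taken from Nielsen et al. (2009).

```python
import math


def annual_genetic_gain(intensity, accuracy, sigma_a, generation_interval):
    """Expected genetic gain per year: i * r * sigma_A / L."""
    return intensity * accuracy * sigma_a / generation_interval


# Hypothetical inputs, for illustration only.
INTENSITY = 1.755            # roughly the top 10% of candidates selected
SIGMA_A = 50.0               # additive genetic SD of the trait, e.g. in grams
GENERATION_INTERVAL = 3.0    # years
ACCURACY_BLUP = 0.60
ACCURACY_GWEBV = ACCURACY_BLUP * 1.33   # 33% higher, as reported in the text

gain_blup = annual_genetic_gain(
    INTENSITY, ACCURACY_BLUP, SIGMA_A, GENERATION_INTERVAL)
gain_gwebv = annual_genetic_gain(
    INTENSITY, ACCURACY_GWEBV, SIGMA_A, GENERATION_INTERVAL)

print(f"BLUP:  {gain_blup:.2f} units/year")
print(f"GWEBV: {gain_gwebv:.2f} units/year")
print(f"Ratio: {gain_gwebv / gain_blup:.2f}")
```

Because every other term is held fixed, the ratio of the two gains equals the ratio of the accuracies. Shortening the generation interval through early genotyping would raise the annual gain further.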

Summary

Conventional selection programs are based on phenotypic evaluations. In spite of their long selection cycles and the inaccuracies of phenotypic evaluation, most of the significant selection progress to date has been made with conventional selection. This is true for crops, livestock, and aquaculture species alike. Much of this success has depended on the experience of breeders and their ability to recognize superior genotypes by examining phenotypes. Although extremely successful, the experience of a highly skilled breeder can hardly be passed on to the next generation of breeders. The complexity of certain performance traits adds further difficulty to traditional selection. Tight genetic linkage between certain traits, especially between desired and undesired traits, often frustrates breeders and limits real progress in the long term. Genotype–environment interactions add yet another layer of complexity to conventional selection.

MAS is a tool that can add to the experience of breeders and enhance phenotypic evaluation. Rather than selecting on phenotype alone, MAS determines the presence of superior alleles, which gives it a great theoretical advantage. Young (1999) expressed a “cautiously optimistic vision” for MAS. However, its progress over the last decade has not been as rosy as expected. The limited progress of MAS was perhaps mainly due to the lack of a sufficient number of molecular markers. Marker technologies are no longer a major limiting factor because of the development of genomics, especially the rapidly developing technology of next generation sequencing; it is now quite easy to obtain a large number of molecular markers. However, a number of major challenges remain: genotyping a large number of markers is still very expensive; dissecting the molecular basis of each marker and trait, and its genetic effects, is still a daunting task; and the regulatory networks of genes, pathways, and traits needed to establish models linking markers, QTLs, and traits have yet to be understood in aquaculture species.

Compared with terrestrial crops and livestock species, breeding programs for aquatic animals are undoubtedly behind. Aquatic organisms also have a number of challenging biological characteristics rarely seen in livestock animals. For example, most are poikilothermic, and their biological traits are more complex and more strongly affected by the environment. Although favorable for constructing ideal populations for genetic analysis, their extremely high fecundity and immense numbers of progeny place additional demands on analytical methods. In spite of these challenges, the potential for genetic and genomic selection in aquaculture species is tremendously high.

To convert this potential into reality, studies on the genomics and functional genomics of aquatic organisms should be promoted; large-scale development of molecular markers and construction of fine linkage and QTL maps should be accelerated; and, in particular, bioinformatics efforts need to be enhanced to establish databases of both genotypes and phenotypes. Only after these initial goals have been achieved can whole genome selection become the most powerful approach for selection in aquaculture species.

Acknowledgments

The author wishes to thank Zhanjiang (John) Liu for reviewing and revising this chapter, and his students and staff for data collection and useful discussions and comments. This study was supported by the National Basic Research Program of China (973 Program, 2010CB126406), the National High-Tech R&D Program of China (863 Program, 2006AA10A408), the National Nonprofit Special Grant of China (NYHYZX07-047), and earmarked funds from the Modern Agri-Industry Technology Research System.

References

Abecasis GR, Cookson WO, and Cardon LR. 2000. Pedigree tests of transmission disequilibrium. Eur J Hum Genet, 8:545–551.

Belonsky GM and Kennedy BW. 1988. Selection on individual phenotype and best linear unbiased predictor of breeding value in a closed swine herd. J Anim Sci, 66:1124–1131.
Berry D. 2007. Report on the potential of genomic selection in Irish dairy cattle. Genomic Selection in Ireland. Draft1.
Brem RB, Yvert G, Clinton R, and Kruglyak L. 2002. Genetic dissection of transcriptional regulation in budding yeast. Science, 296:752–755.
Cheng L, Liu L, Yu X, Wang D, and Tong J. 2010. A linkage map of common carp (Cyprinus carpio) based on AFLP and microsatellite markers. Anim Genet, 41:191–198.
Chistiakov DA, Hellemans B, Haley CS, Law AS, Tsigenopoulos CS, Kotoulas G, Bertotto D, Libertini A, and Volckaert FAM. 2005. A microsatellite linkage map of the European sea bass Dicentrarchus labrax L. Genetics, 170:1821–1826.
Cockerham CC and Weir BS. 1977. Quadratic analyses of reciprocal crosses. Biometrics, 33:187–203.
Coimbra MRM, Kobayashi K, Koretsugu S, Hasegawa O, Ohara E, Ozaki A, Sakamoto T, Naruse K, and Okamoto N. 2003. A genetic linkage map of the Japanese flounder, Paralichthys olivaceus. Aquaculture, 220:203–218.
Dekkers JC. 2004. Commercial application of marker- and gene-assisted selection in livestock: strategies and lessons. J Anim Sci, 82(E-Suppl):E313–E328.
Dekkers JCM. 2007. Strategies, limitations and opportunities for marker-assisted selection in livestock. In: Marker-assisted Selection: Current Status and Future Perspectives in Crops, Livestock, Forestry and Fish, edited by EP Guimarães, J Ruane, BD Scherf, A Sonnino, and JD Dargie. Food and Agriculture Organization of the United Nations. FAO, Rome, Italy.
Devlin B and Roeder K. 1999. Genomic control for association studies. Biometrics, 55:997–1004.
Du ZQ, Ciobanu DC, Onteru SK, Gorbach D, Mileham AJ, Jaramillo G, and Rothschild MF. 2010. A gene-based SNP linkage map for Pacific white shrimp, Litopenaeus vannamei. Anim Genet, 41:286–294.
Dunham RA, Majumdar K, Hallerman E, Bartley D, Mair G, Hulata G, Liu Z, and Pongthana N. 2001. Status of aquaculture genetics and prospects for the third millennium. Proceedings Conference on Aquaculture in the Third Millennium, Bangkok, Thailand, pp. 129–157.
Falush D, Stephens M, and Pritchard JK. 2003. Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics, 164:1567–1587.
Farnir F, Coppieters W, Arranz JJ, Berzi P, Cambisano N, Grisart B, Karim L, Marcq F, Moreau L, Mni M, et al. 2000. Extensive genome-wide linkage disequilibrium in cattle. Genome Res, 10:220–227.
Fernando RL and Grossman M. 1989. Marker assisted selection using best linear unbiased prediction. Genet Sel Evol, 21:467–477.
Franch R, Louro B, Tsalavouta M, Chatziplis D, Tsigenopoulos CS, Sarropoulou E, Antonello J, Magoulas A, Patarnello T, Power DM, Kotoulas G, and Bargelloni L. 2006. A genetic linkage map of the hermaphrodite teleost fish Sparus aurata L. Genetics, 174:851–861.
Friars GW, Bailey JK, and Coombs KA. 1990. Correlated responses to selection for grilse length in Atlantic salmon. Aquaculture, 85:171–176.
Fuji K, Hasegawa O, Honda K, Kumasaka K, Sakamoto T, and Okamoto N. 2007. Marker-assisted breeding of a lymphocystis disease-resistant Japanese flounder (Paralichthys olivaceus). Aquaculture, 272:291–295.
Gall GAE and Bakar Y. 2002. Application of mixed-model techniques to fish breed improvement: Analysis of breeding-value selection to increase 98-day body weight in tilapia. Aquaculture, 212:93–113.
Gall GAE and Huang N. 1988. Heritability and selection schemes for rainbow trout: Body weight. Aquaculture, 73:43–56.

Gallardo JA, Garcia X, Lhorente JP, and Neira R. 2004. Inbreeding and inbreeding depression of female reproductive traits in two populations of Coho salmon selected using BLUP predictors of breeding values. Aquaculture, 234:111–122.
Geldmann H. 1975. Investigations on inheritance of quantitative characters in animals by gene markers. I. Methods. Theor Appl Genet, 46:319–330.
Gharbi K, Gautier A, Danzmann RG, Gharbi S, Sakamoto T, Høyheim B, Taggart JB, Cairney M, Powell R, Krieg F, Okamoto N, Ferguson MM, Holm LE, and Guyomard R. 2006. A linkage map for brown trout (Salmo trutta): Chromosome homeologies and comparative genome organization with other salmonid fish. Genetics, 172:2405–2419.
Gianola D. 1986. On selection criteria and estimation of parameters when the variance is heterogeneous. Theor Appl Genet, 72:671–677.
Gibson G and Weir B. 2005. The quantitative genetics of transcription. Trends Genet, 21:616–623.
Gilbey J, Verspoor E, McLay A, and Houlihan D. 2004. A microsatellite linkage map for Atlantic salmon (Salmo salar). Anim Genet, 35:98–105.
Gjoen HM and Gjerde B. 1998. Comparing breeding schemes using individual phenotypic values and BLUP breeding values as selection criteria. The 6th World Congress on Genetics Applied to Livestock Production, University of New England, Armidale, NSW, Australia, pp. 111–114.
Graser HU, Smith SP, and Tier B. 1987. A derivative-free approach for estimating variance components in animal models by restricted maximum likelihood. J Anim Sci, 64:1362–1370.
Grundy B, Villanueva B, and Woolliams JA. 1998. Dynamic selection procedures for constrained inbreeding and their consequences for pedigree development. Genet Res, 72:159–168.
Grundy B, Villanueva B, and Woolliams JA. 2000. Dynamic selection for maximizing response with constrained inbreeding in schemes with overlapping generations. Anim Sci, 70:373–382.
Hagger C. 1991. Effects of selecting on phenotype, on index or on breeding value, on expected response, genetic relationships and accuracy of breeding values in an experiment. J Anim Breed Genet, 108:102–110.
Haley CS and Knott SA. 1992. A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity, 69:315–324.
Hartley HO and Rao JNK. 1967. Maximum-likelihood estimation for mixed analysis of variance model. Biometrika, 54:93–108.
Hayes B and Goddard ME. 2001. The distribution of the effects of genes affecting quantitative traits in livestock. Genet Sel Evol, 33:209–229.
Hazel LN. 1943. The genetic basis for constructing selection indexes. Genetics, 28:476–490.
Heifetz EM, Fulton JE, O’Sullivan N, Zhao H, Dekkers JC, and Soller M. 2005. Extent and consistency across generations of linkage disequilibrium in commercial layer chicken breeding populations. Genetics, 171:1173–1181.
Henderson CR. 1953. Estimation of variance and covariance components. Biometrics, 9:226–252.
Henderson CR. 1963. Selection index and expected genetic advance. In: Statistical Genetics and Plant Breeding, edited by WD Hanson and HF Robinson. National Academy of Science, National Research Council Publication, 982, Washington, DC, pp. 141–163.
Henderson CR. 1973. Sire evaluation and genetic trends. J Anim Sci, 1973:10–41.
Henderson CR. 1975. Best linear unbiased estimation and prediction under a selection model. Biometrics, 31:423–447.
Henderson CR. 1984. Applications of Linear Models in Animal Breeding. Guelph University Press, Guelph, Canada.

Hershberger WK, Myers JM, Iwamoto RN, McAuley WC, and Saxton AM. 1990. Genetic changes in the growth of Coho salmon (Oncorhynchus kisutch) in marine net-pens, produced by 10 years of selection. Aquaculture, 85:187–197.
Huang CM and Liao IC. 1990. Response to mass selection for growth-rate in Oreochromis niloticus. Aquaculture, 85:199–205.
Hubert S and Hedgecock D. 2004. Linkage maps of microsatellite DNA markers for the Pacific oyster Crassostrea gigas. Genetics, 168:351–362.
Hulata G. 2001. Genetic manipulations in aquaculture: A review of stock improvement by classical and modern technologies. Genetica, 111:155–173.
Hulata G, Wohlfarth GW, and Halevy A. 1986. Mass selection for growth-rate in the Nile tilapia (Oreochromis niloticus). Aquaculture, 57:177–184.
Jansen RC. 1993. Interval mapping of multiple quantitative trait loci. Genetics, 135:205–211.
Jansen RC and Nap JP. 2001. Genetical genomics: The added value from segregation. Trends Genet, 17:388–391.
Jiang CJ and Zeng ZB. 1995. Multiple-trait analysis of genetic-mapping for quantitative trait loci. Genetics, 140:1111–1127.
Kao CH and Zeng ZB. 1997. General formulas for obtaining the MLEs and the asymptotic variance-covariance matrix in mapping quantitative trait loci when using the EM algorithm. Biometrics, 53:653–665.
Kao CH and Zeng ZB. 2002. Modeling epistasis of quantitative trait loci using Cockerham’s model. Genetics, 160:1243–1261.
Kause A, Ritola O, Paananen T, Wahlroos H, and Mantysaari EA. 2005. Genetic trends in growth, sexual maturity and skeletal deformations, and rate of inbreeding in a breeding programme for rainbow trout (Oncorhynchus mykiss). Aquaculture, 247:177–187.
Kimura T, Yoshida K, Shimada A, Jindo T, Sakaizumi M, Mitani H, Naruse K, Takeda H, Inoko H, Tamiya G, and Shinya M. 2005. Genetic linkage map of medaka with polymerase chain reaction length polymorphisms. Gene, 363:24–31.
Kinghorn BP. 1999. Use of segregation analysis to reduce genotyping costs. J Anim Breed Genet, 116:175–180.
Koshimizu E, Strüssmann CA, Okamoto N, Fukuda H, and Sakamoto T. 2010. Construction of a genetic map and development of DNA markers linked to the sex-determining locus in the Patagonian pejerrey (Odontesthes hatcheri). Mar Biotechnol (NY), 12:8–13.
Kucuktas H, Wang S, Li P, He C, Xu P, Sha Z, Liu H, Jiang Y, Baoprasertkul P, et al. 2009. Construction of genetic linkage maps and comparative genome analysis of catfish using gene-associated markers. Genetics, 181:1649–1660.
Lallias D, Beaumont AR, Haley CS, Boudry P, Heurtebise S, and Lapègue S. 2007a. A first-generation genetic linkage map of the European flat oyster Ostrea edulis (L.) based on AFLP and microsatellite markers. Anim Genet, 38:560–568.
Lallias D, Lapègue S, Hecquet C, Boudry P, and Beaumont AR. 2007b. AFLP-based genetic linkage maps of the blue mussel (Mytilus edulis). Anim Genet, 38:340–349.
Lander ES and Botstein D. 1989. Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics, 121:185–199.
Lee BY, Lee WJ, Streelman JT, Carleton KL, Howe AE, Hulata G, Slettan A, Stern JE, Terai Y, and Kocher TD. 2005. A second-generation genetic linkage map of tilapia (Oreochromis spp.). Genetics, 170:237–244.
Li L, Xiang JH, Liu X, Zhang Y, Dong B, and Zhang XJ. 2005. Construction of AFLP-based genetic linkage map for Zhikong scallop, Chlamys farreri Jones et Preston and mapping of sex-linked markers. Aquaculture, 245:63–73.
Li YT, Keren B, Emanuela M, Vicki W, Stephen M, Sandy K, Peter C, Nigel P, and Sigrid L. 2003. Genetic mapping of the kuruma prawn Penaeus japonicus using AFLP markers. Aquaculture, 219:143–156.

Li ZX, Li J, Wang QY, He YY, and Liu P. 2006. AFLP-based genetic linkage map of marine shrimp Penaeus (Fenneropenaeus) chinensis. Aquaculture, 261:463–472.
Liao M, Zhang L, Yang G, Zhu M, Wang D, Wei Q, Zou G, and Chen D. 2007. Development of silver carp (Hypophthalmichthys molitrix) and bighead carp (Aristichthys nobilis) genetic maps using microsatellite and AFLP markers and a pseudo-testcross strategy. Anim Genet, 38:364–370.
Liao X, Ma HY, Xu GB, Shao CW, Tian YS, Ji XS, Yang JF, and Chen SL. 2009. Construction of a genetic linkage map and mapping of a female-specific DNA marker in half-smooth tongue sole (Cynoglossus semilaevis). Mar Biotechnol (NY), 11:699–709.
Liu WD, Bao XB, Song WT, Zhou ZC, He CB, and Yu XJ. 2009. The construction of a preliminary genetic linkage map in the Japanese scallop Mizuhopecten yessoensis. Yi Chuan, 31:629–637.
Liu X, Liu X, Guo X, Gao Q, Zhao H, and Zhang G. 2006. A preliminary genetic linkage map of the Pacific abalone Haliotis discus hannai Ino. Mar Biotechnol (NY), 8:386–397.
Liu Z. 2007. Aquaculture Genome Technologies. Blackwell Publishing, Ames, IA.
Luan S, Kong J, and Wang QY. 2008. Methods and application of aquatic animal breeding value estimation: a review. Mar Fisher Res, 29:101–107.
Malkin I, Ginsburg E, and Elston RC. 2002. Increase in power of transmission-disequilibrium tests for quantitative traits. Genet Epidemiol, 23:234–244.
Maneeruttanarungroj C, Pongsomboon S, Wuthisuthimethavee S, Klinbunga S, Wilson KJ, Swan J, Li Y, Whan V, Chu KH, Li CP, Tong J, Glenn K, Rothschild M, Jerry D, and Tassanakajon A. 2006. Development of polymorphic expressed sequence tag-derived microsatellites for the extension of the genetic linkage map of the black tiger shrimp (Penaeus monodon). Anim Genet, 37:363–368.
Martin ER, Monks SA, Warren LL, and Kaplan NL. 2000. A test for linkage and association in general pedigrees: The pedigree disequilibrium test. Am J Hum Genet, 67:146–154.
Martinez V, Neira R, and Gall GAE. 1999. Estimation of genetic parameters from pedigreed populations: Lessons from analysis of alevin weight in Coho salmon (Oncorhynchus kisutch). Aquaculture, 180:223–236.
Meuwissen TH. 1997. Maximizing the response of selection with a predefined rate of inbreeding. J Anim Sci, 75:934–940.
Meuwissen TH and Sonesson AK. 2004. Genotype-assisted optimum contribution selection to maximize selection response over a specified time period. Genet Res, 84:109–116.
Meuwissen THE and Goddard ME. 1996. The use of marker haplotypes in animal breeding schemes. Genet Sel Evol, 28:161–176.
Meuwissen THE, Hayes BJ, and Goddard ME. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157:1819–1829.
Meuwissen THE, Karlsen A, and Lisen S. 2002. Fine mapping of a quantitative trait locus for twinning rate using combined linkage and linkage disequilibrium mapping. Genetics, 161:373–379.
Moen T, Hoyheim G, Munck H, and Gomes-Raya L. 2004. A linkage map of Atlantic salmon (Salmo salar) reveals an uncommonly large difference in recombination rate between the sexes. Anim Genet, 35:81–89.
Moen T, Hayes B, Baranski M, Berg PR, Kjoglum S, Koop BF, Davidson WS, Omholt SW, and Lien S. 2008. A linkage map of the Atlantic salmon (Salmo salar) based on EST-derived SNP markers. BMC Genomics, 15:223.
Morishima K, Nakayama I, and Arai K. 2008. Genetic linkage map of the loach Misgurnus anguillicaudatus (Teleostei: Cobitidae). Genetica, 132:227–241.
Neira R, Diaz NF, Gall GAE, Gallardo JA, Lhorente JP, and Manterola R. 2006. Genetic improvement in Coho salmon (Oncorhynchus kisutch). I: Selection response and inbreeding depression on harvest weight. Aquaculture, 257:9–17.

Nell JA, Smith IR, and Sheridan AK. 1999. Third generation evaluation of Sydney rock oyster Saccostrea commercialis (Iredale and Roughley) breeding lines. Aquaculture, 170:195–203.
Nichols KM, Young WP, Danzmann RG, Robison BD, Rexroad C, Noakes M, Phillips RB, Bentzen P, Spies I, Knudsen K, Allendorf FW, Cunningham BM, Brunelli J, Zhang H, Ristow S, Drew R, Brown KH, Wheeler PA, and Thorgaard GH. 2003. A consolidated linkage map for rainbow trout (Oncorhynchus mykiss). Anim Genet, 34:102–115.
Nielsen HM, Sonesson AK, Yazdi H, and Meuwissen THE. 2009. Comparison of accuracy of genome-wide and BLUP breeding value estimates in sib based aquaculture breeding schemes. Aquaculture, 289:259–264.
O’Flynn FM, Bailey JK, and Friars GW. 1999. Responses to two generations of index selection in Atlantic salmon (Salmo salar). Aquaculture, 173:143–147.
Ohara E, Nishimura T, Nagakura Y, Sakamoto T, Mushiake K, and Okamoto N. 2004. Genetic linkage maps of two yellowtails (Seriola quinqueradiata and Seriola lalandi). Aquaculture, 244:41–48.
Patterson H and Thompson R. 1971. Recovery of inter-block information when block sizes are unequal. Biometrika, 58:545–554.
Ponzoni RW, Hamzah A, Tan S, and Kamaruzzaman N. 2005. Genetic parameters and response to selection for live weight in the GIFT strain of Nile tilapia (Oreochromis niloticus). Aquaculture, 247:203–210.
Pritchard JK and Rosenberg NA. 1999. Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet, 65:220–228.
Pritchard JK, Stephens M, and Donnelly P. 2000. Inference of population structure using multilocus genotype data. Genetics, 155:945–959.
Qin YJ, Liu X, Zhang HB, Zhang GF, and Guo XM. 2007. Genetic mapping of size-related quantitative trait loci (QTL) in the bay scallop (Argopecten irradians) using AFLP and microsatellite markers. Aquaculture, 272:281–290.
Rao CR. 1972. Estimation of variance and covariance components in linear models. J Am Stat Assoc, 67:112–115.
Reid DP, Smith CA, Rommens M, Blanchard B, Martin-Robichaud D, and Reith M. 2007. A genetic linkage map of Atlantic halibut (Hippoglossus hippoglossus L.). Genetics, 177:1193–1205.
Rezk MA, Smitherman RO, Williams JC, Nichols A, Kucuktas H, and Dunham RA. 2003. Response to three generations of selection for increased body weight in channel catfish, Ictalurus punctatus, grown in earthen ponds. Aquaculture, 228:69–79.
Riquet J, Coppieters W, Cambisano N, Arranz JJ, Berzi P, Davis SK, Grisart B, Farnir F, Karim L, Mni M, et al. 1999. Fine-mapping of quantitative trait loci by identity by descent in outbred populations: application to milk production in dairy cattle. Proc Natl Acad Sci U S A, 96:9252–9257.
Sanetra M, Henning F, Fukamachi S, and Meyer A. 2009. A microsatellite-based genetic linkage map of the cichlid fish, Astatotilapia burtoni (Teleostei): A comparison of genomic architectures among rapidly speciating cichlids. Genetics, 182:387–397.
Schadt EE, Monks SA, Drake TA, Lusis AJ, Che N, Colinayo V, Ruff TG, Milligan SB, Lamb JR, Cavet G, et al. 2003. Genetics of gene expression surveyed in maize, mouse and man. Nature, 422:297–302.
Searle SR. 1968. Another look at Henderson’s methods of estimating variance components. Biometrics, 24:749.
Spelman RJ, Garrick DJ, and van Arendonk JAM. 1999. Utilisation of genetic variation by marker assisted selection in commercial dairy cattle populations. Livest Prod Sci, 59:51–60.
Staelens J, Rombaut D, Vercauteren I, Argue B, Benzie J, and Vuylsteke M. 2008. High-density linkage maps and sex-linked markers for the black tiger shrimp (Penaeus monodon). Genetics, 179:917–925.

Sun XW and Liang LQ. 2004. A genetic linkage map of common carp (Cyprinus carpio L.) and mapping of a locus associated with cold tolerance. Aquaculture, 238:165–172.
Tian Y, Kong J, and Wang WJ. 2008. Construction of AFLP-based genetic linkage maps for the Chinese shrimp Fenneropenaeus chinensis. Chinese Science Bulletin, 53:1205–1216.
Toro M and Maki-Tanila A. 1999. Establishing a Conservation Scheme. DLO Institute of Animal Health, Lelystad, The Netherlands.
Toro JE, Sanhueza MA, Winter JE, Senn CM, Aguila P, and Vergara AM. 1995. Environmental effects on the growth of the Chilean oyster Ostrea chilensis in five mariculture locations in the Chiloe Island, southern Chile. Aquaculture, 136:153–164.
Van Arendonk JAM, Bink MCAM, Bijma P, Bovenhuis H, de Koning D-J, and Brascamp EW. 1999. Use of Phenotypic and Molecular Data for Genetic Evaluation of Livestock. Proceedings of the “From Jay Lush to Genomics: Visions for Animal Breeding and Genetics” Conference, Iowa State University, Ames, IA, pp. 60–69.
Van Laere AS, Nguyen M, Braunschweig M, Nezer C, Collette C, Moreau L, Archibald AL, Haley CS, Buys N, Tally M, et al. 2003. A regulatory mutation in IGF2 causes a major QTL effect on muscle growth in the pig. Nature, 425:832–836.
Verhoeven KJF, Jannink JL, and McIntyre LM. 2006. Using mating designs to uncover QTL and the genetic architecture of complex traits. Heredity, 96:139–149.
Waldbieser GC, Bosworth BG, Nonneman DJ, and Wolters WR. 2001. A microsatellite-based genetic linkage map for channel catfish, Ictalurus punctatus. Genetics, 158:727–734.
Wang CM, Zhu ZY, Lo LC, Feng F, Lin G, Yang WT, Li J, and Yue GH. 2007. A microsatellite linkage map of Barramundi, Lates calcarifer. Genetics, 175:907–915.
Wang L, Song L, Chang Y, Xu W, Ni D, and Guo X. 2005. A preliminary genetic map of Zhikong scallop (Chlamys farreri Jones et Preston 1904). Aquac Res, 36:643–653.
Wang L, Song L, Zhang H, Gao Q, and Guo X. 2007. Genetic linkage map of bay scallop, Argopecten irradians irradians (Lamarck 1819). Aquac Res, 38:409–419.
Wang S, Bao Z, Pan J, Zhang L, Yao B, Zhan A, Bi K, and Zhang Q. 2004. AFLP linkage map of an intraspecific cross in Chlamys farreri. J Shellfish Res, 23:491–499.
Ward RD, English LJ, McGoldrick DJ, Maguire GB, Nell JA, and Thompson PA. 2000. Genetic improvement of the Pacific oyster Crassostrea gigas (Thunberg) in Australia. Aquac Res, 31:35–44.
Warren AA, Song LS, Meola DM, Xu ZK, Xiang JH, and Warren W. 2007. Characterization and mapping of expressed sequence tags isolated from a subtracted cDNA library of Litopenaeus vannamei injected with white spot syndrome virus. J Shellfish Res, 26:1247–1258.
Wray NR and Hill WG. 1989. Asymptotic rates of response from index selection. Anim Prod, 49:217–227.
Wu R and Lin M. 2006. Functional mapping: How to map and study the genetic architecture of dynamic complex traits. Nat Rev Genet, 7:229–237.
Xia JH, Liu F, Zhu ZY, Fu J, Feng J, Li J, and Yue GH. 2010. A consensus linkage map of the grass carp (Ctenopharyngodon idella) based on microsatellites and SNPs. BMC Genomics, 11:135.
Xu K, Li Q, Kong L, and Yu R. 2009. A first-generation genetic map of the Japanese scallop Patinopecten yessoensis-based AFLP and microsatellite markers. Aquac Res, 40:35–43.
You EM, Liu KF, Huang SW, Chen M, Groumellec ML, Fann SJ, and Yu HT. 2010. Construction of integrated genetic linkage maps of the tiger shrimp (Penaeus monodon) using microsatellite and AFLP markers. Anim Genet, 41:365–376.
Young ND. 1999. A cautiously optimistic vision for marker-assisted selection. Mol Breed, 5:505–510.

Yu JM and Buckler ES. 2006. Genetic association mapping and genome organization of maize. Curr Opin Biotechnol, 17:155–160.
Yu JM, Holland JB, McMullen MD, and Buckler ES. 2008. Genetic design and statistical power of nested association mapping in maize. Genetics, 178:539–551.
Zeng ZB. 1993. Theoretical basis for separation of multiple linked gene effects in mapping quantitative trait loci. Proc Natl Acad Sci U S A, 90:10972–10976.
Zeng ZB. 1994. Precision mapping of quantitative trait loci. Genetics, 136:1457–1468.
Zhan A, Hu J, Hu X, Hui M, Wang M, Peng W, Huang X, Wang S, Lu W, Sun C, and Bao Z. 2009. Construction of microsatellite-based linkage maps and identification of size-related quantitative trait loci for Zhikong scallop (Chlamys farreri). Anim Genet, 40:821–831.
Zhang L, Yang C, Zhang Y, Li L, Zhang X, Zhang Q, and Xiang J. 2007. A genetic linkage map of Pacific white shrimp (Litopenaeus vannamei): Sex-linked microsatellite markers and high recombination rates. Genetica, 131:37–49.
Zhang S, Zhang K, Li J, Sun F, and Zhao H. 2001. Test of association for quantitative traits in general pedigrees: The quantitative pedigree disequilibrium test. Genet Epidemiol, 21:S370–S375.
Zhang W and Smith C. 1992. Computer simulation of marker-assisted selection utilizing linkage disequilibrium. Theor Appl Genet, 83:813–820.
Zhou Z, Bao Z, Dong Y, Liu X, Song L, He C, and Wang L. 2006. AFLP analysis in populations of Strongylocentrotus intermedius, S. nudus and hybrids (S. intermedius x S. nudus). Yi Chuan, 29:443–448.
Zhu J. 1989. Estimation of Genetic Variance Components in the General Mixed Model. North Carolina State University, Raleigh, NC.
Zhu J. 1992. Mixed model approaches for estimating genetic variances and covariances. J Biomath, 7:1–11.
Zhu J. 1993. Methods of predicting genotype value and heterosis for offspring of hybrids. J Biomath, 8:32–44.
Zhu J. 1995. Analysis of conditional genetic effects and variance components in developmental genetics. Genetics, 141:1633–1639.
Zimmerman AM, Wheeler PA, Ristow SS, and Thorgaard GH. 2005. Composite interval mapping reveals three QTL associated with pyloric caeca number in rainbow trout, Oncorhynchus mykiss. Aquaculture, 247:85–95.

Index

Additive genetic variance 155, 188 Affymetrix 116, 124–125, 128–131, 134 AFLP 10–14, 133, 135, 198–200 Allozyme 6, 14, 198 Amplicon 8, 38, 42, 64, 70, 72, 85, 138 Array comparative genome hybridization (aCGH) 24–28 AutoSNP 99–100, 106, 112 BAC end sequences 71 Bacterial artificial chromosome (BAC) 9, 27, 70, 98, 131 Bar-coded primers 57 bayesA 154–155, 167, 170–172, 208–209 bayesB 154–155, 157, 167, 170–172, 176, 208–209 bayesC 154 beadChip 125–126, 134–135 Between-family selection 153, 156 Biallelic marker 114 BLAST 74, 93–95, 103, 106, 118 BLUP 154, 156–157, 161, 167–168, 170, 173, 187–190, 193–197, 206, 208–210 Breeding value estimation 153, 157, 185, 188–189, 194, 196 Bridge PCR 39, 42–43 CAP3 96, 99–100, 112 cDNA 27, 48, 57, 62, 64–65, 91, 110–111, 117, 120 Centromere 22, 24 Chromosome banding 24 CLC Genomics Workbench 103–104, 106 Composite interval mapping 200 Comprehensive selection index 185–186 Contigs 45, 70, 72, 74, 77–78, 80, 82, 85, 87, 92, 99–100, 111–112, 116, 119–120, 131 Copy number variation (CNV) 3–5, 21–22, 30, 124 Cot value 26

dbEST 92–93, 96, 111–112
De novo assembly 38, 45, 101, 103, 105
Deletion 3–7, 11, 21–24, 29–30, 38, 82, 99
Duplication 3, 5, 21–24, 27–30, 109, 136–138, 140, 143, 158
Dynamic array 124, 127–130
Dynamic QTL 195
Effective population size 151, 157, 177
Emulsion PCR 41–43
Enzymatic fragmentation 58–59
Epigenome 44, 49
EST derived SNP 91–92, 110, 112, 116–117, 119–120, 131
Estimated breeding value (EBV) 152–153, 156, 166, 185, 209
Exon 9, 46, 59, 106, 117–118, 137–138
Exon-primed intron crossing (EPIC) 137, 205
Expressed sequence tag (EST) 9, 48, 70, 91, 109, 123, 135
Family selection 153, 155–156, 159, 166, 179–180, 196
Fluorescence in situ hybridization (FISH) 34
Fluorescence resonance energy transfer (FRET) 51
Food and Agriculture Organization (FAO) 165
454 sequencing 29, 38, 62, 64–66, 103
Full-length cDNA 48
Gene associated SNP 91, 110, 119, 135
Gene effect 192–194, 206
genechip 106, 124–125, 128–129, 134
Genetic gain 156, 166, 174, 176–178, 180, 185–186, 195–197, 207
Genetic variation 3, 6, 11–12, 153, 158, 207, 209
Genome 10K project 165


Genome analyzer 38–40, 42, 46, 50, 62, 101, 103–104
Genome sequencer (GS-FLEX) 36, 62, 64
Genome-wide association studies (GWAS) 30
Genome-wide estimated breeding value (GWEBV) 152
Genomic relationship matrix 155, 166–168, 170–171, 175
Genomic selection 151–161, 165–167, 170–171, 173–180
Goldengate 124–126, 128–131, 134, 140
GW-BLUP 154–155, 157
Hardy-Weinberg Equilibrium 202
Heritability 5, 153, 156–159, 173, 175, 177, 186, 190, 194, 196, 210
Homopolymer sequence 82
Hox gene cluster 126, 136
Identical by descent (IBD) 203, 208–209
Illumina 28–40, 42–43, 46–50, 57, 59, 61–62, 64, 66, 101–104, 116, 124–126, 128–131, 134–135, 140, 142–143, 158
Illumina sequencing 49, 57, 64, 101–102
Inbreeding rate 195
Indels 4, 6, 11, 22, 44–45, 99, 102
Insertion 3–7, 11, 21–23, 29–30, 38, 82, 99
Interval mapping 200
Intrachromosomal duplication 22
Intron 8–9, 27, 92, 106, 116–118, 120, 137–138
Inversion 3–5, 22, 24, 29–30, 125
iSelect 124–126, 128–131
Karyotype 24, 136
Ligase mediated sequencing 41
Linear unbiased prediction (LUP) 154, 176, 187, 193, 209
Linkage analysis 13, 200–202
Linkage disequilibrium (LD) 124, 134, 151, 166, 200, 203, 208
Linkage equilibrium 205
Linkage mapping 123, 130–131, 134, 197, 200, 202
Long interspersed nucleotide elements (LINE) 23
Marker 3, 5–15, 45, 69, 71, 91–92, 114, 118–119, 123–124, 128, 130–131, 133–135, 137–140, 151–160, 166–167, 169–173, 175–178, 185, 197–211
Marker-assisted BLUP 197
Marker density 155, 157, 159–160, 175, 177
Marker-assisted selection 133, 152–153, 159, 166, 185
MAS selection 133, 135, 166, 173, 176, 185, 197, 200, 205–211
Massarray 124, 127–131
Methylome 50
Microarray 25–26, 124, 127
Microsatellite 7–12, 14, 123, 133, 135, 199, 207
Minor allele frequency 74, 101, 113, 115, 131
Mitochondrial marker 7, 14
Mixed linear model 185, 187–188, 190–194, 204, 207
Molecular marker 5, 8, 10, 13, 123, 197, 206, 211
Multiple interval mapping 200
Multiple sequence alignment 185, 187–188, 190–194, 204, 207
Multiplex amplifiable probe hybridization 27
Multisite variant (MSV) 22, 137, 139–140, 143
Mutation-drift equilibrium 175
NCBI 44–46, 84–85, 92–96, 101, 103, 112, 116
Nebulization 58, 64
Nested association mapping 204
Next generation sequencing 14, 28–29, 36, 42–43, 45, 48–51, 57–58, 60, 91, 101–104, 109, 119–120, 123, 129, 165, 211
Openarray 124, 127–129
Paralogous sequence variant (PSV) 22, 133, 137, 139, 140, 143
Pedigree 135, 155, 157, 166–168, 170–171, 173–174, 195, 197, 202, 203
Pedigree transmission disequilibrium 203
Phenotype 23, 69, 91, 110, 123, 133, 152, 156, 158, 167, 170, 173, 175, 177, 197, 200–201, 206, 208, 210–211
Phenotypic selection 180, 195
Phenotypic value vector 188
Phenotypic variance 186

PHRAP 96–99, 102
PHRED 93, 96, 98–99, 106, 110–111
Phylogenetic tree 12, 94, 96
POLYBAYES 93, 96–99, 101, 106, 111
Polymorphic information content (PIC) 11
Polymorphic loci 69
Polymorphism 4–13, 21–24, 30, 43–46, 69–71, 80, 82, 85, 91, 97–99, 101, 109, 119, 123–124, 133–134, 137, 143, 158, 165–166, 199–200, 202, 206, 208
Pseudoreference 70, 72–75, 78, 87, 89
Pyrosequencing 38, 42–43, 45, 65, 72, 74, 101, 139
QTL 10, 123, 130–131, 133–135, 151–154, 156–159, 161, 166, 168–170, 175–176, 178, 197, 200–209, 211
RAPD 10–14, 133, 198–199
Rearrangement 3–6, 21–23
Recombinant inbred lines 202, 204
Reduced representation library 70–71, 89, 130, 139
Reference assembly 103
Repeatmasker 78
Repetitive elements 25, 119–120
Resequencing 43–46
Resource family 10, 114
Restriction enzyme 6, 11–12, 71–73, 87
RFLP 6–7, 11–12, 14, 133
Scaffold 64, 99, 118
Segmental inversion 4–5
Segregating population 133
Selection index 185–189
Short interspersed nucleotide element (SINE) 198–199
Short tandem repeats 78
Shotgun libraries 43, 64


Simple sequence repeats (SSR) 7, 197–200
Simulation 159–160, 166, 173–176, 178, 192, 209
Single nucleotide polymorphism (SNP) 4, 30, 45, 69, 91, 109, 123, 133, 143, 152, 165, 189, 208
Single strand conformation polymorphism 124
SNP chips 142
SNP discovery 69–72, 79, 88–89, 91–96, 98–99, 101–103, 105–106, 109, 123, 129, 135, 137–139
SNPstream 124, 127–130
Solexa sequencing 38
SOLiD sequencing 43, 65
Sonication 58
Static QTL 204
Tandem repeats 45, 78
Taqman assay 124, 129, 140
Training generation 167
Transcriptome 43, 46, 48–49, 105, 111
Transcriptome profiling 48
Transmission disequilibrium 161, 203
Transposon 78, 119
Two-base encoding 41–42
Type I marker 9, 91
UTR 9, 138
Variable number tandem repeat (VNTR) 198–199
Whole genome based selection 21, 123, 151, 185, 197, 210
Whole genome duplication 21, 109, 136, 143
Whole genome selection 1, 5, 21, 69, 185, 208–211
Within-family genome selection 146