Plant Genomics. Methods and Protocols


Author: Daryl J. Somers | Peter Langridge | J.P. Gustafson


METHODS IN MOLECULAR BIOLOGY™

Series Editor John M. Walker School of Life Sciences University of Hertfordshire Hatfield, Hertfordshire, AL10 9AB, UK

For other titles published in this series, go to www.springer.com/series/7651


Plant Genomics Methods and Protocols

Edited by

Daryl J. Somers
Molecular Breeding and Biotechnology, Vineland Research and Innovation Centre, Vineland Station, Ontario, Canada

Peter Langridge
Australian Centre for Plant Functional Genomics, University of Adelaide, Glen Osmond, Australia

J. Perry Gustafson
Division of Plant Sciences, University of Missouri, Columbia, MO, USA


ISBN: 978-1-58829-997-0 e-ISBN: 978-1-59745-427-8 ISSN: 1064-3745 e-ISSN: 1940-6029 DOI: 10.1007/978-1-59745-427-8 Library of Congress Control Number: 2008940985 © Humana Press, a part of Springer Science+Business Media, LLC 2009 All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of going to press, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Printed on acid-free paper springer.com

Preface

This volume is divided into chapters which consider the primary issues and methodologies surrounding plant genomics research. Plant genomics is largely concerned with associating functional genes or gene mutations with phenotype. Therefore, chapters are included that cover the areas of gene discovery and functional analysis of genes. Further chapters focus on the primary tools and sub-disciplines of genetic mapping, mRNA, protein and metabolite profiling. Methods are included that explore gene functional analysis via transformation, mutation, protein function and gene expression. The volume includes chapters on data management which consider the expansion of plant genomics databases and bioinformatics analysis tools. The volume is concluded with chapters aimed at discussing the application and deployment of molecular plant breeding technology from the use of markers in breeding, development of genetically modified plants/crop species, analysis of existing populations for novel alleles and gene/trait associations and genome sequencing.


Contents

Preface
Contributors
1 Role of Model Plant Species
  Richard Flavell
2 New Technologies for Ultra-High Throughput Genotyping in Plants
  Nikki Appleby, David Edwards, and Jacqueline Batley
3 Genetic Maps and the Use of Synteny
  Chris Duran, David Edwards, and Jacqueline Batley
4 A Simple TAE-Based Method to Generate Large Insert BAC Libraries from Plant Species
  Bu-Jun Shi, J. Perry Gustafson, and Peter Langridge
5 Transcript Profiling and Expression Level Mapping
  Elena Potokina, Arnis Druka, and Michael J. Kearsey
6 Methods for Functional Proteomic Analyses
  Christof Rampitsch and Natalia V. Bykova
7 Stable Transformation of Plants
  Huw D. Jones and Caroline A. Sparks
8 Transient Transformation of Plants
  Huw D. Jones, Angela Doherty, and Caroline A. Sparks
9 Bridging the Gene-to-Function Knowledge Gap Through Functional Genomics
  Stephen J. Robinson and Isobel A. P. Parkin
10 Heterologous and Cell-Free Protein Expression Systems
  Naser Farrokhi, Maria Hrmova, Rachel A. Burton, and Geoffrey B. Fincher
11 Functional Genomics and Structural Biology in the Definition of Gene Function
  Maria Hrmova and Geoffrey B. Fincher
12 In situ Analysis of Gene Expression in Plants
  Sinéad Drea, Paul Derbyshire, Rachil Koumproglou, Liam Dolan, John H. Doonan, and Peter Shaw
13 Plant and Crop Databases
  David E. Matthews, Gerard R. Lazo, and Olin D. Anderson
14 Plant Genome Annotation Methods
  Shu Ouyang, Françoise Thibaud-Nissen, Kevin L. Childs, Wei Zhu, and C. Robin Buell
15 Molecular Plant Breeding: Methodology and Achievements
  Rajeev K. Varshney, Dave A. Hoisington, Spurthi N. Nayak, and Andreas Graner
16 Practical Delivery of Genes to the Marketplace
  David A. Fischhoff and Molly N. Cline
17 Ecological Genomics of Natural Plant Populations: The Israeli Perspective
  Eviatar Nevo
18 Genome Sequencing Approaches and Successes
  Michael Imelfort, Jacqueline Batley, Sean Grimmond, and David Edwards
Index

Contributors

OLIN D. ANDERSON • Western Regional Research Center, Albany, CA, USA
NIKKI APPLEBY • Australian Centre for Plant Functional Genomics, Institute for Molecular Biosciences and School of Land, Crop and Food Sciences, University of Queensland, Brisbane, Australia
JACQUELINE BATLEY • Australian Centre for Plant Functional Genomics, Institute for Molecular Biosciences and School of Land, Crop and Food Sciences, University of Queensland, Brisbane, Australia
C. ROBIN BUELL • The Institute for Genomic Research, Rockville, MD, USA
RACHEL A. BURTON • Australian Centre for Plant Functional Genomics, School of Agriculture, Food and Wine, University of Adelaide, Glen Osmond, SA, Australia
NATALIA V. BYKOVA • Agriculture and Agri-food Canada, Cereal Research Centre, Winnipeg, MB, Canada
KEVIN CHILDS • The Institute for Genomic Research, Rockville, MD, USA
MOLLY N. CLINE • Monsanto, St. Louis, MO, USA
PAUL DERBYSHIRE • John Innes Centre, Norwich, UK
ANGELA DOHERTY • CPI Division, Rothamsted Research, Harpenden, Hertfordshire, UK
LIAM DOLAN • John Innes Centre, Norwich, UK
JOHN H. DOONAN • John Innes Centre, Norwich, UK
SINÉAD DREA • Department of Molecular, Cell and Developmental Biology, Yale University, New Haven, CT, USA
CHRIS DURAN • Australian Centre for Plant Functional Genomics, Institute for Molecular Biosciences and School of Land, Crop and Food Sciences, University of Queensland, Brisbane, Australia
ARNIS DRUKA • Scottish Crop Research Institute, Invergowrie, Dundee, Scotland, UK
DAVID EDWARDS • Australian Centre for Plant Functional Genomics, Institute for Molecular Biosciences and School of Land, Crop and Food Sciences, University of Queensland, Brisbane, Australia
NASER FARROKHI • Department of Biological Sciences, California State University, Long Beach, CA, USA
GEOFFREY B. FINCHER • Australian Centre for Plant Functional Genomics, School of Agriculture, Food and Wine, University of Adelaide, Glen Osmond, SA, Australia
DAVID A. FISCHHOFF • Monsanto, St. Louis, MO, USA
RICHARD FLAVELL • Ceres, Inc., Thousand Oaks, CA, USA
ANDREAS GRANER • Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany
SEAN GRIMMOND • Institute for Molecular Biosciences, University of Queensland, Brisbane, Australia
J. PERRY GUSTAFSON • Division of Plant Sciences, University of Missouri, Columbia, MO, USA
DAVE A. HOISINGTON • International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru, India
MARIA HRMOVA • Australian Centre for Plant Functional Genomics, School of Agriculture, Food and Wine, University of Adelaide, Glen Osmond, SA, Australia
MICHAEL IMELFORT • Australian Centre for Plant Functional Genomics, School of Land, Crop and Food Sciences, University of Queensland, Brisbane, Australia
HUW D. JONES • CPI Division, Rothamsted Research, Harpenden, Hertfordshire, UK
MICHAEL J. KEARSEY • School of Biosciences, University of Birmingham, Birmingham, UK
RACHIL KOUMPROGLOU • John Innes Centre, Norwich, UK
PETER LANGRIDGE • Australian Centre for Plant Functional Genomics, University of Adelaide, Glen Osmond, Australia
GERARD R. LAZO • Western Regional Research Center, Albany, CA, USA
DAVID E. MATTHEWS • Department of Plant Breeding, Cornell University, Ithaca, NY, USA
SPURTHI N. NAYAK • International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru, India
EVIATAR NEVO • Institute of Evolution and the International Graduate Center of Evolution, University of Haifa, Mount Carmel, Haifa, Israel
SHU OUYANG • The Institute for Genomic Research, Rockville, MD, USA
ISOBEL A. P. PARKIN • Agriculture and Agri-Food, Saskatoon Research Centre, Saskatoon, SK, Canada
ELENA POTOKINA • School of Biosciences, University of Birmingham, Birmingham, UK
CHRISTOF RAMPITSCH • Agriculture and Agri-food Canada, Cereal Research Centre, Winnipeg, MB, Canada
STEPHEN J. ROBINSON • Agriculture and Agri-Food, Saskatoon Research Centre, Saskatoon, SK, Canada
BU-JUN SHI • Australian Centre for Plant Functional Genomics, University of Adelaide, Glen Osmond, Australia
PETER SHAW • John Innes Centre, Norwich, UK
CAROLINE A. SPARKS • CPI Division, Rothamsted Research, Harpenden, Hertfordshire, UK
FRANÇOISE THIBAUD-NISSEN • The Institute for Genomic Research, Rockville, MD, USA
RAJEEV K. VARSHNEY • International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru, India
WEI ZHU • The Institute for Genomic Research, Rockville, MD, USA

Chapter 1

Role of Model Plant Species

Richard Flavell

Summary

The use of model or reference species has played a major role in furthering detailed understanding of mechanisms and processes in the plant kingdom over the past 25 years. Species which have been adopted as models for dicotyledons and monocotyledons include arabidopsis and rice and, more recently, brachypodium. Such models are diploids, have few and small chromosomes, well developed genetics, rapid life cycles, are easily transformed and have extensive sets of technical resources and databases curated by international resource centres. The study of crop genomics today is deeply rooted in earlier studies on model species. Genomes of model species share reasonable genetic synteny with key crop plants, which facilitates the discovery of genes and association of genes with phenotypes. While some mechanisms and processes are conserved across the plant kingdom and so can be revealed by studies on any model species, others have diverged during evolution and so are revealed by studying only a closely related model species. Examples of processes that are conserved across the plant kingdom and others that have diverged and therefore need to be understood by studying a more closely related model species are described.

Key words: Genomes, Synteny, Comparative genomics, Genome sequence.

1. Introduction

Evolutionary and comparative genetics between plant species has validated the use of one species as a model for another, for the purpose of understanding plant biology. The process of deliberately selecting “model” species over the last two decades, suitable for amassing information rapidly and cheaply by thousands of scientists, has provided a revolution in our understanding of plants. The complete genome sequences and gene–trait associations revealed for these species have provided enormous insight into all plant species, their chromosomes, genes, pathways, evolution and hence relationships to one another and have provided an early framework for understanding the genetic and molecular diversity in plants and plant processes. Yet, it is only a beginning because of the immense diversity across the plant kingdom. Because of this diversity, the concept of one or a few species being “models” suitable for all species is flawed. The major challenges are therefore (1) to evaluate the current framework gained from the relatively few “model” species, (2) to use the framework to understand many species, recognizing both the strengths and weaknesses of the framework for comparative biology and (3) to extend the framework by studying additional, specially selected, species based on plant phylogeny. While at any one time model species are useful for providing predictions relevant to other members of the plant kingdom, they leave, of course, the need to test the predictions for any particular species, for example, the crop species that provide our food, feed, fiber and energy. However, the framework of understanding gained from selected “model” species is a wonderful starting point to evaluate any species in detail with speed and insight.

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513 © Humana Press, a part of Springer Science + Business Media, LLC 2009 DOI: 10.1007/978-1-59745-427-8_1

2. History

It was during the 1980s, when plant scientists worldwide were studying processes and traits in a very large range of plant species, especially economically important species, that it became accepted both in the scientific community and the funding agencies, in the EU and USA particularly, that much more benefit could be gained by focusing on one or two species as models for crops and processes across the plant kingdom. It was controversial because the models being touted were not economically important crops and it meant fewer funds for the favourite and important crops such as maize, tomato, wheat and barley about which a lot of information was being gathered. Yet, it had become obvious that having a large number of scientists studying Escherichia coli, yeast, Drosophila and Homo sapiens produced so much more detailed and understood information that knowledge of plants, important as they are, was being left behind. In consequence, the most talented minds were not being attracted to plant biology on the same scale as to model organisms. It had also become obvious that it was going to be possible to sequence whole plant genomes to unleash the power of genomics and so debates arose as to which genome would be sequenced and how the results would be used. The molecular genetics approaches of the models mentioned above were the most appealing, especially because plant breeding is based on genetics and genomics. Thus, the vision was adopted to learn the sequences of all the genes in some model plant and determine their function via mutational genetics and reverse genetics.


An ideal model needs to be able to be studied to give rise to relevant information more quickly and cheaply than studying other species (1, 2). Some of the key features of an initial model are shown in Table 1. Speed, cost and convenience are key features. They drive scientists and funding agencies, especially in this day and age of the competitive environments in which there is a need to demonstrate substantial progress in a very short time. With these features being fulfilled in a model, it is impossible for an equivalent number of experiments to be done on more cumbersome species. In the 1980s, fulfilling the vision appeared possible only with a diploid species that had a small genome, a rapid life cycle and that could easily be transformed with novel genes. Many other factors also held a place in the debate, including how easy it was to grow the plant in a small environment. These are the reasons why Arabidopsis became the leading contender around the world (3–6) after some debate about Petunia and some other species. Friedrich Laibach had studied Arabidopsis from the early 1900s, and Erna Rheinholz in the early 1940s, but it was Glass (7), Redei (8) and Koornneef (9) who opened up mutational genetics in the species.

While the genomics-based approaches were being developed for Arabidopsis, mainly in USA and Europe, rice genomics was being driven, especially in Asia and USA, by the importance of rice as a crop and the fact that its genome is also small and strains of rice are easily transformable. The “full” japonica genome sequence was published in 2002 (10, 11) with several updates being published subsequently from the international sequencing consortium including telomere repeats (http://rgp.dna.affrc.go.jp) and the sequence of centromeres (12).

Table 1
Preferred attributes of a model crop species

Attributes
Small genome
Rapid life cycle
Easily transformed
Diploid genetics with few chromosome/gene duplications
Well positioned in plant phylogeny
Small stature for growth in small space
Large number of seeds produced
Convenient for discovery of gene–trait linkages at low cost, high speed


Arabidopsis, classified within the eudicots lineage of flowering plants, inevitably has major limitations as both a model and a framework reference for monocots that occur in the other major lineage of flowering plants (Fig. 1). That is why rice plays such an important role for understanding monocots and monocot genomes, and complements Arabidopsis for studying angiosperms in general. While experiments with rice are not as fast and as cheap as Arabidopsis, the large volume of work being done in Asia has resulted in a lot being achieved at a fast pace. Much of the thinking behind the experimental approaches was learnt from Arabidopsis, which, in turn, was modelled after yeast, Drosophila etc. While the genomics of other species has been initiated, they have intrinsic difficulties that prevent such rapid progress in genetics, gene–trait linkages and developmental biology compared with rice and Arabidopsis. Nevertheless, poplar has been adopted as a model for trees since some strains of it are readily transformable and the US Department of Energy’s Joint Genome Institute (JGI, www.jgi.doe.gov) has completed the sequence of its genome (13–15). The sorghum genome has been recently sequenced by the JGI and that of corn is well advanced, as is that of Medicago which can serve as a model for certain legumes. The genome of Brachypodium distachyon is also being sequenced. This species, with its small genome and relative ease of transformation, has been adopted recently as a model for temperate C3 monocot grasses that will hopefully provide information particularly relevant to wheat, barley and other grasses (16, www.brachypodium.org).

Fig. 1. Angiosperm phylogeny modified from Angiosperm Phylogeny Group (65, 66). Arabidopsis is in Brassicales of the rosids, and rice is in Poales of the monocots.

The success of Arabidopsis as the leading model species and its value can be inferred from the number of publications and the databases devoted to the species since 1985. In those days just a few dozen papers per year were published on Arabidopsis. In 2006, there were more than 2,200 in peer-reviewed journals (17). The Arabidopsis Information Resource (TAIR, 18) reports that there are now ~16,000 Arabidopsis researchers in about 6,200 laboratories worldwide. They are linked together under the auspices of “The Multinational Coordinated Arabidopsis thaliana Functional Genomics Project” (MCAtFGP) that publishes an update each year. These statistics mean that Arabidopsis has attracted much competitive grant money and people to devote their research careers to the study of the model plant. The initiative has had an enormous impact on plant biology. Spending the same time and amount of money could not have led to anything like our current understanding of plant biology had we continued in the same way as prior to the early 1980s.

The 2007 report of the MCAtFGP makes the case as follows: “Research on Arabidopsis has provided most of the breakthroughs made in plant science over the last ten years and, given the continuing rapid progress, will drive the major discoveries in plant science for the next ten years. The resources and expertise are available to meet the goal of discovering a function for all the Arabidopsis genes of major significance within a reasonable timeframe. Given a high level of continuing support over several decades the ultimate goal of obtaining a working understanding of how a flowering plant functions down to a molecular level is within sight. Such a working model would be of incalculable benefit to future generations of scientists, farmers, environmentalists and society at large.” The major claim that “Arabidopsis has provided most of the breakthroughs over the past ten years” is a very bold one but accurate overall, illustrating the impact of this model on the molecular genetics of plants.

3. Genomics, Tools and Databases for Arabidopsis and Rice

The selection of Arabidopsis and rice as the principal models with which to develop, rapidly and cheaply, understanding of plant biology went hand in hand with the completion of full genome sequences (http://plantgdb.org/AtGDB, 19), collections of full length cDNAs (18, 20), descriptions of expressed genes via deep EST sequencing, development of the use of microarrays and deep signature sequencing (www.dbi.udel.edu) to study gene expression patterns in different organs and growth conditions, the production of stocks with T-DNA mutations in “every” gene, stocks with transgenes inserted, recombinant inbred lines and mapping populations, molecular markers for quantitative trait loci (QTL) mapping and much more. These are detailed on The Arabidopsis Information Resource (TAIR) website for Arabidopsis and on The Rice Genome Resource Center website for rice (http://www.rgrc.dna.affrc.go.jp/) and are described in part in other chapters of this book (see also 21, 22). The physical resources for Arabidopsis and rice have been deposited in stock centres to facilitate curation, QC and access for all (http://arabidopsis.info; www.biosci.ohio-state.edu/pcmb/facilities/abrchome.htm; http://www.rgrc.dna.affrc.go.jp/). Similarly, databases describing the compendium of genomics information have been established from the beginning (see TAIR). These open access tools and databases have been of extraordinary value to drive forward the development and use of these species as models.

For Arabidopsis, they were associated with goals set by the scientific community and the US National Science Foundation to, for example, find the function of every gene, and now micro RNA (23), by 2010 (24). The forward-looking research emphases are on the networks formed by the physical, genetic, metabolic and regulatory interactions between genes, proteins and metabolites. The very large number of experiments assessing the levels of expression of Arabidopsis genes under many different conditions in different organs (see TAIR) is a wonderful resource for addressing the functions of genes, networks and genes that are co-regulated.
These databases are also useful for selecting promoters with specific expression patterns. The complete genome sequences of different accessions of Arabidopsis and rice are also being determined to better understand mutational events and variation in populations and, in association with QTL mapping, to link variation in genes with traits. Over 250,000 high quality single nucleotide polymorphisms (SNPs) are available from sequencing several Arabidopsis accessions (see TAIR). Recently, Arabidopsis genomics research has led the way in describing a global view on methylation patterns using high resolution tiling microarrays (25) to add to the fast growing field of epigenetics.

With all this data there is special emphasis on data storage, analysis and visualization. This requires the formation of user-friendly databases and development of annotations that are adopted across species. Descriptions of genes and processes in different species must be harmonized to enable comparisons to be made with accuracy. This has not historically occurred in gene description terms. Arabidopsis descriptors based on chromosome location provide unambiguous reference points, but these are meaningless for across-species comparisons. However, the Gene Ontology terminology is an attempt to provide such terms and is being developed for plants (26, www.geneontology.org).

The combined use of genetic variation and phenotypic screens has been developed in a huge number of ways to gain a primary understanding of gene–trait relationships. Three sorts of approaches have been adopted. The first, and the most widely used, has been to screen large populations of mutants with T-DNA (see TAIR) or transposon (27) insertions to find the variant which has the desired phenotypic change and then to sequence around the T-DNA/transposon insert in the selected plant to find the gene into which it has inserted (e.g., 21, 28, 29). While the approach has been very successful, the fact that mutations often occur during transformation at sites other than where the T-DNA is inserted, and that multiple T-DNAs are frequently inserted, means that tests to check the complete linkage between the T-DNA/transposon and phenotype must be carried out. Alternatively, multiple T-DNA/transposon insertions at the same locus, causing the same phenotype, can be obtained to establish the gene–trait association. Failure of studies with T-DNA/transposon insertion mutants to identify a phenotypic change can be due to (1) the screens deployed not being appropriate or (2) the mutated gene being duplicated in the genome, so that mutations in all members of the gene family would be required to see the phenotypic effect.
The second approach has been so-called “activation tagging” (28, 30), where T-DNAs carrying a strong enhancer of expression are inserted into plant genomes at a very large number of locations, with the assumption that when an enhancer inserts close to a gene the gene will be activated and phenotypic changes will give a gene–trait association for that gene. Populations carrying the enhancers are screened, plants with desired phenotypes selected, the genomic location of the T-DNA(s) determined and nearby genes examined for altered expression. The genes can then be tested individually for their ability to cause similar phenotypic changes when expressed at higher levels and/or in different cells.

The third approach has been widely adopted, including by the companies Ceres (www.ceres.net) (2), Monsanto (www.monsanto.com), Mendel (www.mendelbio.com) and Icoria (www.icoria.com) (now Monsanto), and by Crop Design (now BASF) for rice. These companies have operated high throughput strategies, exploiting the ease of transformation of Arabidopsis, to mis-express large numbers of transgenes under the control of very active promoters and then to screen the resulting plants for changes in defined traits. Genetic variation emanating from changes in the level of expression might be equivalent to that frequently occurring in natural populations as well as in breeding (crop improvement) populations. Where the mis-expressed gene is from another species then the protein sequence is different from that in Arabidopsis and so the effects of this variation can also be scored. Failure of mis-expression to cause a detectable phenotype can be because (1) the amount of RNA and protein being expressed is not affecting the networks that link expression of the gene with the manifested trait, (2) the screen is not examining the relevant trait, or (3) changes in the levels of expression of multiple genes are required to create a phenotypic change. In this situation, no conclusions about the role of the gene in a trait can be drawn. With this approach there is the possibility that the phenotypes are due either to over-expression or to homologous gene silencing caused by the formation of double stranded RNA from the transgene insert or cluster of inserts. Typically not all transformants show the same phenotype and this opens up the possibility of multiple mechanisms for causing a change in phenotype.

Tens of thousands of full length cDNAs as well as genomic DNAs have been put through this regime and morphological phenotypes, including flowering time, scored visibly and in over 20 screens covering a wide range of stresses, including drought, salt, heat, cold tolerance, low nitrogen, high and low light, traits very important in applied plant breeding. These screens have taken advantage of the small size of Arabidopsis and the ability to evaluate the plants in growth rooms, greenhouse, in soil and on defined media in petri dishes. They could not be done easily or cheaply on this scale with larger plants. This illustrates the very special advantage of Arabidopsis for such studies.
The experiments developed on this scale also required a very efficient pipeline of gene cloning, plant transformation, seed collection and screening coupled with efficient sample tracking and data collection. All of these approaches have led to knowledge of hundreds or thousands of gene–trait linkages, some by loss of gene function and others by activation of gene function. When a gene–trait linkage has been found it can be checked by evaluating independent transgenic events and showing strict inheritance of the trait with the transgene over generations. These gene–trait linkages are clearly defined by the specific genetic background of the accession of the model species used. How useful is the genetic background of such a model species, selected for the speed and cost of doing the experiments, for predicting gene–trait linkages in other species that have diverged significantly from the models during evolution? This is a key question because the answer will determine the extent to which the use of models will be of direct utility to applied plant breeding.
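The linkage checks mentioned in this section (complete linkage between an insert and a phenotype, and strict inheritance of a trait with a transgene) can be illustrated with a minimal Python sketch. It is not taken from the chapter: the `Plant` record, the function names and the example family are hypothetical, and it assumes a single insertion locus scored as 0, 1 or 2 copies in a selfed family from a hemizygous parent, for which a 1:2:1 segregation of insert copy number would be expected.

```python
from dataclasses import dataclass

# Chi-square critical value for 2 degrees of freedom at the 5% level.
CHI2_CRITICAL_DF2_P05 = 5.991


@dataclass(frozen=True)
class Plant:
    """Hypothetical record for one plant in a segregating family."""
    insert_copies: int  # 0, 1 or 2 copies of the T-DNA/transposon/transgene
    mutant: bool        # True if the plant shows the phenotype of interest


def cosegregates(plants, recessive=True):
    """Return True if the phenotype is completely linked to the insert.

    For a recessive trait, only plants with two copies should show the
    phenotype; for a dominant trait, any plant with at least one copy should.
    A single exception breaks complete linkage.
    """
    for plant in plants:
        if recessive:
            expected = plant.insert_copies == 2
        else:
            expected = plant.insert_copies >= 1
        if plant.mutant != expected:
            return False
    return True


def chi_square_1_2_1(plants):
    """Chi-square statistic for insert copy number against a 1:2:1 ratio."""
    counts = [sum(1 for p in plants if p.insert_copies == k) for k in (0, 1, 2)]
    total = sum(counts)
    if total == 0:
        raise ValueError("no plants scored")
    expected = [total * 0.25, total * 0.5, total * 0.25]
    return sum((obs - exp) ** 2 / exp for obs, exp in zip(counts, expected))


# Illustrative (made-up) family of 20 plants for a recessive insertion mutant.
family = (
    [Plant(0, False)] * 4
    + [Plant(1, False)] * 12
    + [Plant(2, True)] * 4
)
linked = cosegregates(family, recessive=True)
fits_single_locus = chi_square_1_2_1(family) < CHI2_CRITICAL_DF2_P05
```

In this sketch a single plant that shows the phenotype without the expected genotype, or the reverse, is enough to reject complete linkage, which is the conservative reading of the test described in the text. Real screens would also need to allow for scoring errors and incomplete penetrance.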

Role of Model Plant Species

4. Evolutionary Divergence and the Utility of Model Species

There always have been, and always will be, arguments about the relevance of one species as a model for others. Evolutionary divergence will always limit the precise relevance of results from one species for another. That is why it has been suggested that species like Arabidopsis should be considered a reference species for others, that is, one with which other species can be compared, rather than one that necessarily provides relevant predictions. Thus, from this point of view, information from a model is the starting point from which to discover how the parts of the “toolkits” of evolution have been used, modified and reused within and between plant species. Often particular traits immediately suggest the limitations of a model. For example, Arabidopsis and rice do not make tubers (31), they both have a C3 mode of photosynthesis and not C4 or CAM, and they do not have perennial habits like trees (32), although Arabidopsis has been shown to produce secondary thickening and thus can serve as a model for wood formation (33). These two model plants are not known to interact with nitrogen-fixing bacteria or mycorrhizae, as some 225,000 other species do (34). Undoubtedly, at more detailed levels there will be countless differences between species that undermine the precise transfer of knowledge from a model to another species. Nevertheless, we are likely to be surprised to find that what seem like major diversifications in phenotype have their origins in relatively small changes in how parts in the “toolkits” of plant evolution become modified and reused. Protein sequences are relatively highly conserved across species, but their coding sequences are frequently reused with variant promoters and other regulatory sequences to provide functional diversity.
While it is expected that closely related proteins will carry out the same function in different species, it is also expected that mutations in coding sequences within and between species will diversify basic functions somewhat by changing affinities for substrates, binding affinities to other proteins, metabolites, DNA and RNA complexes, etc. To discover this, homologues, paralogues and potential orthologues can be screened similarly in the same model species to understand the extent to which diversity in coding sequence has led to differences in function. Promoters, introns and 5′ and 3′ untranslated regions are much less conserved than protein sequences. The effects of diversification of these regions can also be evaluated rapidly in models such as Arabidopsis or rice. Thus, comprehensive understanding can be readily gained about the relationships between gene structure, function and trait. Even if proteins from diverged species create the same phenotype when mis-expressed in the model species, what is the

Flavell

probability that mis-expression of the same gene or its ortholog will create equivalent phenotypic variation in another species? This is an extremely important question in relation to the use of models for defining gene–trait associations in other species. For this to happen, the networks from gene to trait must be reasonably conserved, and the genetic change equivalent to the mis-expression must not already be present in the recipient. Where trait improvements have been under strong selection, the mutation equivalent to the mis-expression event in the model may already be present. Thus, the absence of a phenotypic change following mis-expression of a gene in a different species can be due either to a lack of conservation of the genetic networks underlying the trait or to the equivalent genetic change already being present. It does not mean that the gene is not concerned with that trait in the new species. Because of the divergence of genetic networks and systems during plant evolution, the best way to evaluate the utility of models is in an evolutionary context and with the aid of phylogenetic trees. Species most closely related to each other phylogenetically are likely to be better models for each other. Thus, ideally, model species need to be selected at each of the key nodes of plant evolution (see Fig. 1). The US Department of Energy’s Joint Genome Institute has recognized this in deciding which genomes to sequence. Thus, it has opted to sequence the genome of Aquilegia formosa because it is a member of the basal-most eudicot clade (Ranunculales, Fig. 1) and is positioned nearly equidistant between the current models Arabidopsis and rice. The sequence of this genome, coupled with some functional understanding, should lead to a much deeper understanding of the evolution of morphological, physiological, reproductive and biochemical innovations in angiosperms.
The foxtail millet genome sequence will add to the comparisons within the C4 monocot grasses. Cotton, cassava and eucalyptus will help fill out the dicot lineages, and Arabidopsis lyrata and Capsella rubella will enable genomic changes in the Arabidopsis lineage to be understood better. Mimulus guttatus will also be sequenced to aid the NSF-funded integrated ecological and genomic analysis of M. guttatus, M. nasutus, M. lewisii and M. cardinalis, a well-known, leading model series for studying ecological and evolutionary genetics in nature. To put the flowering plants in perspective, the moss Physcomitrella patens (35) has been sequenced, as has the green alga Chlamydomonas reinhardtii (36). Studying equivalent traits in multiple species in diverse phylogenetic groups enables what is conserved and what has diverged to be determined and added to the phylogenetic trees, to further define useful models for particular groups of species–trait combinations. With respect to Arabidopsis and rice, in particular, it can be expected that some networks and gene–trait linkages will be

conserved because they are ancient, predating the separation of the dicot and monocot lineages and others are not conserved because they arose after separation of these lineages (see Fig. 1). Which developmental networks are conserved can be discovered, for example, by mis-expressing the same genes in both species and looking for an equivalent phenotypic change. There is a vast literature that illustrates the ways in which understanding from models is being tested across angiosperms, especially the Brassicas, Solanaceae, grasses and tree species. The comparative biology is being analyzed at all levels of biological complexity from the simple comparison of gene sequences through developmental and biochemical pathways to the complex effects of gene changes on a whole phenotype. In the first type of comparison, the similarities and differences in a gene sequence can be described precisely but, in the latter, all the similarities and differences in the networks essential for a plant phenotype cannot be described because they are unknown. Yet, it seems reasonable to suggest that if systems are conserved between some eudicots (Arabidopsis) and monocots (rice) then they are likely to be conserved across the majority of flowering plants and that the basic systems were established early in angiosperm evolution. Thus, focusing on similarities between rice and Arabidopsis seems, for today, of great value for assessing the utility of these models for flowering plants overall.
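The argument of this section, that phylogenetically closer species are likely to be better models for each other, can be made concrete with a toy calculation. The sketch below is an illustration under stated assumptions, not a reconstruction of Fig. 1: the lineage lists are a coarse simplification chosen for this example, the function names are hypothetical, and relatedness is scored only by the number of shared named clades, ignoring branch lengths.

```python
# Simplified, illustrative lineages (root first); an assumption for this sketch.
LINEAGE = {
    "Arabidopsis thaliana": ["land plants", "angiosperms", "eudicots",
                             "core eudicots", "Brassicaceae"],
    "Brassica napus": ["land plants", "angiosperms", "eudicots",
                       "core eudicots", "Brassicaceae"],
    "Solanum lycopersicum": ["land plants", "angiosperms", "eudicots",
                             "core eudicots", "Solanaceae"],
    "Oryza sativa": ["land plants", "angiosperms", "monocots", "Poaceae"],
    "Triticum aestivum": ["land plants", "angiosperms", "monocots", "Poaceae"],
    "Physcomitrella patens": ["land plants", "mosses"],
}

MODELS = ["Arabidopsis thaliana", "Oryza sativa"]


def shared_depth(a, b):
    """Count the clades two species share, walking down from the root."""
    depth = 0
    for clade_a, clade_b in zip(LINEAGE[a], LINEAGE[b]):
        if clade_a != clade_b:
            break
        depth += 1
    return depth


def best_model(target, models):
    """Return the model sharing the most clades with the target.

    Ties go to the model listed first, so a tie should be read as
    'no clear preference' rather than as a recommendation.
    """
    return max(models, key=lambda m: shared_depth(target, m))


for species in ["Brassica napus", "Triticum aestivum", "Solanum lycopersicum"]:
    print(species, "->", best_model(species, MODELS))
```

A score of this kind only ranks candidate models; whether a particular network is conserved still has to be tested experimentally, for example by mis-expressing the same gene in both species.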

5. Examples of Comparative Biology that Illustrate the Utility of Model Species and Extent of Conservation of Genetic Networks During Evolution

5.1. Genome Synteny

From the very large number of examples in the literature, just a few are given here to illustrate the utility of models for defining hypotheses for other species, including economically important crops. As genomes diverge over time they accumulate mutations that include not only base changes in specific genes, but also changes in the number and distribution of repeated sequences, including transposable elements. Such changes create huge numbers of chromosomal differences within and between species, resulting in major changes in DNA content but not necessarily in the order of genes along chromosomal segments. Plant breeding depends on the frequency of recombination between genes, and so knowing the order of genes in linkage blocks is very useful. Gene order tends to be conserved during evolution, and the extent of conservation reflects the phylogenetic relationships between species. Thus, knowing the order of genes along a chromosome segment of one species (the model) can be a guide to the order of genes along

the evolutionarily equivalent chromosomal segment of another related species. The earliest whole genome comparative maps were developed among species in the Solanaceae family (37). Arabidopsis exhibits extensive conserved synteny with closely related Brassica species and Capsella (38, 39). There is synteny between Arabidopsis and soybean, especially along chromosome 1, and extensive synteny between Arabidopsis and tomato (39–41) and Prunus (42). However, superimposed on this synteny are rounds of local duplication, and sometimes translocation, of genome segments that often get fixed in evolution, and this superficially undermines the microsynteny. Gene losses are often associated with these rounds of duplication (43). Gene colinearity is especially well conserved between segments of grass genomes, for example, rice and wheat, maize and barley, etc., in spite of large differences in genome size (44). On close inspection, microsynteny also often breaks down due to gene deletion, duplication and local rearrangements, for example, between rice and maize (45, 46). Gene synteny is extremely useful because it enables any one grass genome to be used as a model for any other related species with respect to gene order in segments to (1) predict the position of QTLs mapped in one species on the chromosomes of another, (2) aid orthologous gene assignments and gene–trait determinants and (3) reveal features of chromosome evolution. Rice, having the smallest genome and being completely sequenced, is serving as the primary syntenic model for all other grasses in these comparisons. The number and kind of duplications and rearrangements, etc. fixed during grass evolution can be traced based on genome synteny deviations. Synteny has also been used successfully to enable genes to be isolated from large complex chromosomes by chromosome walking using a smaller syntenic genome as a guide (47).

5.2. Gibberellin Metabolism and Plant Development

The conservation or divergence of the complex pathways behind plant traits is difficult to describe, but as more information on the details of genes, and particularly gene functions, emerges, similarities and differences between models and other species will help us understand the utility of models and the information within them. A particularly important example is provided by the conservation of gibberellin (GA) metabolism and plant development across the dicots and monocots. Height in plants is controlled in part by GAs (48, 49). The basic biosynthetic pathway appears to be similar in pea, wheat and rice, since inactivating different steps along the pathway causes loss of GAs and dwarf phenotypes (50–52). In wheat, the well-known dwarfing genes encode DELLA proteins, which act as repressors of growth; GAs promote growth by participating in a process that results in ubiquitination of the DELLA proteins so that they are

targeted for degradation by the 26S proteasome. In Arabidopsis, the equivalent proteins when lacking the conserved DELLA domain are not degraded. The mutations in wheat that cause dwarfism are in the DELLA region and are therefore thought to cause dwarfism because the mutant protein no longer responds to GAs and is therefore not degraded. The role of DELLA proteins in repressing growth is not compromised by the mutations in the DELLA domain. The information currently available implies that the GA biosynthetic and signalling pathway, and the control of growth by orthologous DELLA proteins, are conserved between Arabidopsis and cereals (51).

5.3. Flowering

Another example highlighting both conservation and some divergence between dicots and monocots in genes and developmental pathways is provided by research into the flowering process. Over 70 genes have been found to influence flowering in Arabidopsis. Some of these play a similar role in rice and other cereals (53–56). However, in Arabidopsis an extended cold period promotes flowering by epigenetically down-regulating the amount of the floral repressor FLC, a MADS-box transcription factor. The analogous dominant repressor gene in wheat, VRN2, is also down-regulated by cold conditions but is unrelated to FLC. It is a zinc-finger transcription factor related to the CONSTANS protein family (57, 58). Also, the VRN1 gene in wheat is closely related to the AP1 gene in Arabidopsis, but AP1 has not been shown to play any part in vernalization in Arabidopsis (59). Similarly, the wheat and barley vernalization gene VRN3 is an ortholog of the Arabidopsis gene FT, which controls flowering late in the pathway (60). This gene product is now believed to be the “florigen” that moves between cells to promote flowering (56, 61). Thus, it appears that different genes have evolved to play a part in the flowering control network in monocots versus dicots. Ceres, Inc., along with many others, has assayed in rice many of the genes found to control traits in Arabidopsis, using mis-expression as was done in Arabidopsis. Genes giving similar phenotypes for height, flowering time, branching, tolerance to heat, disease, drought, etc. have been found. This implies not only that the links between specific single genes and complex organs/traits are conserved, at least to some extent, but also that the relationship between the level of activity of a specific gene and the trait is conserved. This conservation is remarkable given all the opportunities for change and diversity during evolution.

5.4. Stature

Mis-expression of an Arabidopsis AP2 transcription factor leads to a reduction in height and growth of more leaves in Arabidopsis and to a similar reduction in height and more tillers in rice (Fig. 2; Ceres, unpublished). The results imply that target molecules in both species recognize the Arabidopsis transcription factor and

Fig. 2. Reductions in stature created by mis-expression of an Arabidopsis gene in Arabidopsis and rice. (Figure not reproduced; the panels are labelled Control and Transgenic/Transgenics.)

conserved downstream pathways are activated/repressed to produce similar changes in phenotype. The figure also illustrates that the extent of the altered phenotype in rice varies between different transgene integration events. This enables the degree of dwarfing required to be selected from amongst the populations of transgenic plants. The above examples illustrate that the information gained relatively cheaply and rapidly via models provides much essential information for understanding biological systems across the phylogenetic spectrum including in crops. The use of model species is therefore relevant to building a platform of information for tomorrow’s plant breeding.
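Because the degree of dwarfing varies between integration events, choosing events with the required stature amounts to a simple filter on height relative to the control. The following sketch is illustrative only: the event names, heights, and the target window of 60 to 80% of control height are hypothetical values, not data from Fig. 2.

```python
def select_events(event_heights_cm, control_height_cm, low, high):
    """Return the events whose height, as a fraction of control, lies in [low, high]."""
    return sorted(
        name
        for name, height in event_heights_cm.items()
        if low <= height / control_height_cm <= high
    )


# Hypothetical mean plant heights (cm) for four independent transgenic events.
events = {"E1": 95.0, "E2": 72.0, "E3": 55.0, "E4": 68.0}
control = 100.0

# Keep events showing moderate dwarfing: 60-80% of the control height.
moderate = select_events(events, control, low=0.60, high=0.80)
print(moderate)
```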

6. Predicting QTLs from Model Species?

One of the problems in plant breeding is the discovery of the loci that contain variable alleles for specific traits. Can information from models help? When a set of genes affecting the same trait has been uncovered in a model species, this presumably defines the genes in which variation in expression can improve or reduce the trait. The set is therefore a compendium that should mirror the (ideally near-complete) set of QTLs for the trait in a species. The more complete this set, the higher the probability that one or more of the genes will be

responsible for limiting the trait in a crop and that selection of the right one will lead to enhancement of the trait. This possible utility of gene–trait mapping in model species needs to be tested extensively because of the potential value in breeding programs. In canola, it has already been established that major variation in flowering time is located at loci equivalent to the major flowering time genes in Arabidopsis (62, 63).
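One way to test this idea is to ask whether crop orthologs of genes known from the model fall inside QTL intervals mapped in the crop. The sketch below shows such an overlap test under stated assumptions: the linkage group names, map positions, QTL names and gene labels are all hypothetical placeholders, and the positions are assumed to be on the same genetic map as the QTL intervals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QTL:
    name: str
    linkage_group: str
    start_cm: float
    end_cm: float


def candidates_in_qtls(candidates, qtls):
    """Map each QTL name to the candidate genes located inside its interval.

    candidates: {gene_name: (linkage_group, position_cm)}
    """
    hits = {}
    for qtl in qtls:
        hits[qtl.name] = sorted(
            gene
            for gene, (group, pos) in candidates.items()
            if group == qtl.linkage_group and qtl.start_cm <= pos <= qtl.end_cm
        )
    return hits


# Hypothetical crop orthologs of model-species flowering-time genes.
candidates = {
    "ortholog_A": ("LG2", 12.0),
    "ortholog_B": ("LG6", 40.5),
    "ortholog_C": ("LG9", 80.0),
}
# Hypothetical flowering-time QTL intervals mapped in the crop.
qtls = [QTL("qtl_1", "LG2", 8.0, 15.0), QTL("qtl_2", "LG6", 55.0, 70.0)]

hits = candidates_in_qtls(candidates, qtls)
print(hits)
```

Co-location of this kind only nominates a candidate; a QTL interval usually contains many genes, so the prediction still needs to be confirmed genetically.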

7. Conclusions

The use of model or reference species for plant science in general needs no defence. It has been, as expected, a resounding success. However, working out which pieces of information are precisely relevant for which plant is complex and will require much more detailed evolutionary biology to be understood. Whatever the limitations, model species will continue to be essential for developing understanding at a far greater depth than can be achieved with more difficult species for which the same tools and databases are not available. For example, with all the mutant genes available that result in changes in leaf development, it is surely the case that most of the rules and systems determining leaf development will be substantiated in Arabidopsis and that variants on these principles will account for all the other leaf morphologies in angiosperms. The models will also be essential for providing hypotheses to be tested in other species of interest. Models are ideally cheap and fast to explore. The resources built up to enable more and more, faster and cheaper work are an impressive part of a model species’ treasure chest. Since resource development will continue, it is expected that the value of using well-developed models to explore the complex problems of plant biology will become even greater. Indeed, as research moves more and more into three-dimensional, dynamic descriptions of specific cells using massive amounts of data, it is hard to believe that systems other than those in models will be on the frontiers. The extent to which developmental pathways are conserved between models and crop species has yet to be revealed. However, for species that are phylogenetically close, for example, Arabidopsis and canola, the extent of conservation must be very high.
The many examples where mis-expression of a gene produces similar phenotypes in Arabidopsis and rice suggest that Arabidopsis is indeed a useful model for many processes in monocots as well as dicots. It will be very interesting to learn which genetic networks are conserved and which are not. Those which are basically conserved will provide a framework of understanding for angiosperms, and this information will provide very important conclusions for

evolutionary studies. Even where, at first sight, Arabidopsis is not a good model for other species, for example, for monocot seed development, many genes and systems have been uncovered relating to the small amount of endosperm in Arabidopsis that are relevant to the much larger amounts of endosperm in monocot seeds (64). As crop species genomes get sequenced it will be easier to find orthologs of genes of known function amongst syntenic regions, and QTLs mapped to model genomes can be sought in the crop genomes to help in plant breeding. The plethora of gene–trait linkages known from model species provides a significant platform for generating predictions for crops. The genomic sites and genes of eQTLs behind traits discovered in models can also be explored similarly in crops. Perhaps heterosis between genotypes will be uncovered by application of methods to study gene expression at key loci in models, such that it will be possible to predict which parents will give the best heterosis in hybrids. Meanwhile, it appears that the most significant discoveries will continue to be made in the key model species. Model species provide the backbone to today’s plant research and will for the foreseeable future.

References

1. Flavell, R.B. (1992) The value of model systems for the future plant breeder, in Plant Breeding in the 1990s (Stalker, H.T. and Murphy, J.P. eds.), CAB International, Oxford, UK, pp. 409–419.
2. Flavell, R.B. (2005) Model plants with special emphasis on Arabidopsis thaliana and crop improvement, in Proceedings of the International Congress (Tuberosa, R., Phillips, R.L., and Gale, M. eds.), Avenue Media, Bologna, Italy, pp. 365–378.
3. Somerville, C. (1989) Arabidopsis blooms. Plant Cell 1, 1131–1135.
4. Meyerowitz, E.M. (1989) Arabidopsis, a useful weed. Cell 56, 263–269.
5. Somerville, C. and Koornneef, M. (2002) A fortunate choice: the history of Arabidopsis as a model plant. Nat. Rev. Genet. 3, 883–889.
6. Bevan, M.W. and Walsh, S. (2004) Positioning Arabidopsis in plant biology. A key step toward unification of plant research. Plant Physiol. 135, 602–606.
7. Glass, B. (1951) Cold Spring Harbor. Symp. Quant. Biol. 16, 281.
8. Redei, G.P. (1992) A heuristic glance at the past of Arabidopsis genetics, in Methods in Arabidopsis Research (Koncz, C., Chua, N.-H., and Schell, J. eds.), World Scientific Publishing Co., Singapore, pp. 1–15.
9. Koornneef, M., Dellaert, L., and van der Veen, J.H. (1982) EMS- and radiation-induced mutation frequencies at individual loci in Arabidopsis thaliana (L.) Heynh. Mutat. Res. 93, 109–123.
10. Yu, J., Hu, S.N., Wang, J., et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296, 79–92.
11. Goff, S.A., Ricke, D., Lan, T.H., et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296, 92–100.
12. Ma, J., Wing, R.A., Bennetzen, J.L., and Jackson, S.A. (2007) Plant centromere organization: a dynamic structure with conserved functions. Trends Genet. 23, 134–139.
13. Tuskan, G.A., DiFazio, S.P., and Teichmann, T. (2003) Poplar genomics is getting popular: the impact of the poplar genome project on tree research. Plant Biol. 5, 1–3.
14. Wullschleger, S.D., Jansson, S., and Taylor, G. (2002) Genomics and forest biology. Plant Cell 14, 2651–2655.
15. Brunner, A.M., Busov, V.B., and Strauss, S. (2004) Poplar genome sequence: functional genomics in an ecologically dominant plant species. Trends Plant Sci. 9, 49–56.

16. Draper, J., Mur, L.A.J., Jenkins, G., et al. (2001) Brachypodium distachyon: a new model system for functional genomics in grasses. Plant Physiol. 127, 1539–1555.
17. MCAt-FGP. (2007) The Multinational Coordinated Arabidopsis thaliana Functional Genomics Project Annual Report.
18. The Rice Full-length cDNA Consortium. (2003) Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science 301, 376–379.
19. Phillips, R.L., Leung, H., and Cantrell, R. (2004) An international platform for the assessment of gene function in rice. Proceedings of the 4th International Crop Science Congress, Brisbane, Australia. Published on CD-ROM. www.cropscience.org.au
20. TAIR. (2007) The Arabidopsis Information Resource. www.arabidopsis.org
21. Haas, B.J., Volfovsky, N., Town, C.D., Troukhan, M., Alexandrov, N., Feldmann, K.A., Flavell, R.B., White, O., and Salzberg, S.L. (2002) Full-length messenger RNA sequences greatly improve genome annotation. Genome Biol. 3(6), Epub May 30.
22. The Rice Genome Resource Center (RGRC). http://www.rgrc.dna.affrc.go.jp/
23. Maher, C., Stein, L., and Ware, D. (2006) Evolution of Arabidopsis microRNA families through duplication events. Genome Res. 16, 510–519.
24. Somerville, C. and Dangl, J. (2000) Genomics: plant biology in 2010. Science 290, 2077–2078.
25. Zhang, X., Yazaki, J., Sundaresan, A., et al. (2006) High resolution mapping and functional analysis of DNA methylation in Arabidopsis. Cell 126, 1189–1201.
26. Clark, J.I., Brooksbank, C., and Lomax, J. (2005) It’s all GO for plant scientists. Plant Physiol. 138, 1268–1279.
27. Miyao, A., Tanaka, K., Murata, K., Sawaki, H., Takeda, S., Abe, K., Shinozuka, V., Onosato, K., and Hirochika, H. (2003) Target site specificity of the Tos17 retrotransposon shows a preference for insertion within genes and against insertion in retrotransposon-rich regions of the genome. Plant Cell 15, 1771–1780.
28. Ichikawa, T., Nakazawa, M., Kawashima, M., et al. (2003) Sequence database of 1172 T-DNA insertion lines in Arabidopsis activation-tagging lines that showed phenotypes in T1 generation. Plant J. 36, 421–429.
29. Young, J.C., Krysan, P.J., and Sussman, M.R. (2001) Efficient screening of Arabidopsis T-DNA insertion lines using degenerate primers. Plant Physiol. 125, 513–518.
30. Weigel, D., Ahn, J.H., Blazquez, M.A., et al. (2000) Activation tagging in Arabidopsis. Plant Physiol. 122, 1003–1014.
31. Fernie, A.R. and Willmitzer, L. (2001) Molecular and biochemical triggers of potato tuber development. Plant Physiol. 127, 1459–1465.
32. Plomion, C., Leprovost, G., and Stokes, A. (2001) Wood formation in trees. Plant Physiol. 127, 1513–1523.
33. Nieminen, K.M., Kauppinen, L., and Helariutta, Y. (2004) A weed for wood? Arabidopsis as a genetic model for xylem development. Plant Physiol. 135, 653–659.
34. Gadkar, V., David-Schwartz, R., Kunik, T., and Kapulnik, Y. (2001) Arbuscular mycorrhizal fungal colonization. Factors involved in host recognition. Plant Physiol. 127, 1493–1499.
35. Schaefer, D.G. and Zryd, J.-P. (2001) The moss Physcomitrella patens, now and then. Plant Physiol. 127, 1430–1438.
36. Gutman, B.L. and Niyogi, K.K. (2004) Chlamydomonas and Arabidopsis. A dynamic duo. Plant Physiol. 135, 607–610.
37. Bonierbale, M.W., Plaisted, R.L., and Tanksley, S.D. (1988) RFLP maps based on a common set of clones reveal modes of chromosomal evolution in potato and tomato. Genetics 120, 1095–1103.
38. Acarkan, A., Rossberg, M., Koch, M., and Schmidt, R. (2000) Comparative genome analysis reveals extensive conservation of genome organization for Arabidopsis thaliana and Capsella rubella. Plant J. 23, 55–62.
39. Rossberg, M., Theres, K., Acarkan, A., Herrero, R., Schmitt, T., Schumaker, K., Schmitz, G., and Schmidt, R. (2001) Comparative sequence analysis reveals extensive microcolinearity in the Lateral Suppressor regions of the tomato, Arabidopsis, and Capsella genomes. Plant Cell 13, 979–988.
40. Grant, D., Cregan, P., and Shoemaker, R.C. (2000) Genome organization in dicots: genome duplication in Arabidopsis and synteny between soybean and Arabidopsis. Proc. Natl. Acad. Sci. USA 97, 4168–4173.
41. Ku, H.-M., Vision, T., Liu, J., and Tanksley, S.D. (2000) Comparing sequenced segments of the tomato and Arabidopsis genomes: large-scale duplication followed by selective loss creates a network of synteny. Proc. Natl. Acad. Sci. USA 97, 9121–9126.
42. Jung, S., Main, D., Staton, M., Cho, I., Zhebentyayeva, T., Arús, P., and Abbott, A. (2006) Synteny conservation between the Prunus genome and both the present and ancestral Arabidopsis genomes. BMC Genomics 7, 81.
43. Timms, L., Jimenez, R., Chase, M., Lavelle, D., McHale, L., Kozik, A., Lai, Z., Heesacker, A., Knapp, S., Rieseberg, L., Michelmore, R., and Kesseli, R. (2006) Analyses of synteny between Arabidopsis thaliana and species in the Asteraceae reveal a complex network of small syntenic segments and major chromosomal rearrangements. Genetics 173, 2227–2235.
44. Devos, K.M. and Gale, M.D. (2000) Genome relationships: the grass model in current research. Plant Cell 12, 636–646.
45. Tarchini, R., Biddle, P., Wineland, R., Tingey, S., and Rafalski, A. (2000) The complete sequence of 340 kb of DNA around the rice Adh1-Adh2 region reveals interrupted colinearity with maize chromosome 4. Plant Cell 12, 381–391.
46. Bennetzen, J.L. and Ma, J. (2003) The genetic colinearity of rice and other cereals on the basis of genomic sequence analysis. Curr. Opin. Plant Biol. 6, 128–133.
47. Griffiths, S., Sharp, R., Foote, T.N., Bertin, I., Wanous, M., Reader, S., Colas, I., and Moore, G. (2006) Molecular characterization of Ph1 as a major chromosome pairing locus in polyploid wheat. Nature 439, 749–752.
48. Peng, J.R., Richards, D.E., Hartley, N.M., Murphy, G.P., Devos, K.M., Flintham, J.E., Beales, J., Fish, L.J., Worland, A.J., Pelica, F., Sudhakar, D., Christou, P., Snape, J.W., Gale, M.D., and Harberd, N.P. (1999) “Green revolution” genes encode mutant gibberellin response modulators. Nature 400, 256–261.
49. Fu, X., Sudhakar, D., Peng, J., Richards, D.E., Christou, P., and Harberd, N.P. (2001) Expression of Arabidopsis GAI in transgenic rice represses multiple gibberellin responses. Plant Cell 13, 1791–1802.
50. Thomas, S.G. and Hedden, P. (2006) Gibberellin metabolism and signal transduction, in Plant Hormone Signalling (Hedden, P. and Thomas, S.G. eds.), Blackwell Publishing Ltd., Oxford, UK, pp. 147–184.
51. Hedden, P. (2006) Essay 20.2 Plant Physiology. 4th Edition, online. Green Revolution Genes. www.plantphys.net
52. Sakamoto, T., Miura, K., Itoh, H., Tatsumi, T., Ueguchi-Tanaka, M., Ishiyama, K., Kobayashi, M., Agrawal, G.K., Takeda, S., Abe, K., Miyao, A., Hirochika, H., Kitano, H., Ashikari, M., and Matsuoka, M. (2004) An overview of gibberellin metabolism enzyme genes and their related mutants in rice. Plant Physiol. 134, 1642–1653.

53. Izawa, T., Takahashi, Y., and Yano, M. (2003) Comparative biology comes into bloom: genomic and genetic comparison of flowering in rice and Arabidopsis. Curr. Opin. Plant Biol. 6, 113–120.
54. Hayama, R. and Coupland, G. (2004) The Molecular basis of diversity in the photoperiodic flowering responses of Arabidopsis and rice. Plant Physiol. 135, 677–684.
55. Anderson, C.H., Jensen, C.S., and Petersen, K. (2004) Similar genetic switch systems might integrate the floral inductive pathways in dicots and monocots. Trends Plant Sci. 9, 105–107.
56. Imaizumi, T. and Kay, S.A. (2006) Photoperiodic control of flowering: not only by coincidence. Trends Plant Sci. 11, 550–558.
57. Yan, L., Loukoianov, A., Blechl, A., Tranquilli, G., Ramakrishna, W., San Miguel, P., Bennetzen, J.L., Echenique, V., and Dubcovsky, J. (2004) The wheat VRN2 gene is a flowering repressor down-regulated by vernalization. Science 303, 1640–1644.
58. Griffiths, S., Dunford, R.P., Coupland, G., and Laurie, D.A. (2003) The evolution of CONSTANS-like gene families in barley, rice and Arabidopsis. Plant Physiol. 131, 1855–1867.
59. Yan, L., Loukoianov, A., Tranquilli, G., Helguera, M., Fahima, T., and Dubcovsky, J. (2003) Positional cloning of the wheat vernalization gene VRN1. Proc. Natl. Acad. Sci. USA 100, 6263–6268.
60. Yan, L., Fu, D., Li, C., Blechl, A., Tranquilli, G., Bonafede, M., Sanchez, A., Valarik, M., Yasuda, S., and Dubcovsky, J. (2006) The wheat and barley vernalization gene VRN3 is an orthologue of FT. Proc. Natl. Acad. Sci. USA 103, 19581–19586.
61. Jaeger, K. and Wigge, P. (2007) FT protein acts as a long-range signal in Arabidopsis. Curr. Biol. 17, 1050–1054.
62. Osborn, T.C., Kole, C., Parkin, I.A.P., et al. (1997) Comparison of flowering time genes in Brassica rapa, B. napus and Arabidopsis thaliana. Genetics 146, 1123–1129.
63. Okazaki, K., Sakamoto, K., Kikuchi, R., et al. (2007) Mapping and characterization of FLC homologs and QTL analysis of flowering time in Brassica oleracea. Theor. Appl. Genet. 114, 595–608.
64. Olsen, O.-A. (2004) Nuclear endosperm development in cereals and Arabidopsis thaliana. Plant Cell 16, S214–S227.
65. Daly, D.C., Cameron, K.M., and Stevenson, D.W. (2001) Plant systematics in the age of genomics. Plant Physiol. 127, 1328–1333.
66. Angiosperm Phylogeny Group. (1998) Ann. Missouri Bot. Gard. 84, 1–49.

Chapter 2

New Technologies for Ultra-High Throughput Genotyping in Plants

Nikki Appleby, David Edwards, and Jacqueline Batley

Summary

Molecular genetic markers represent one of the most powerful tools for the analysis of plant genomes and the association of heritable traits with underlying genetic variation. Molecular marker technology has developed rapidly over the last decade, with the development of high-throughput genotyping methods. Two forms of sequence-based marker, simple sequence repeats (SSRs), also known as microsatellites, and single nucleotide polymorphisms (SNPs), now predominate in modern plant genetic analysis, alongside anonymous marker systems such as amplified fragment length polymorphisms (AFLPs) and diversity array technology (DArT). The falling cost of DNA sequencing and the increasing availability of large sequence data sets permit the mining of these data for large numbers of SSRs and SNPs. These may then be used in applications such as genetic linkage analysis and trait mapping, diversity analysis, association studies and marker-assisted selection. Here, we describe automated methods for the discovery of molecular markers and new technologies for high-throughput, low-cost molecular marker genotyping. Genotyping examples include multiplexing of SSRs using Multiplex-Ready™ marker technology (MRT); DArT genotyping; and SNP genotyping using the Invader® assay, single base extension (SBE), the oligonucleotide ligation assay (OLA) SNPlex™ system, and the Illumina GoldenGate™ and Infinium™ methods.

Key words: Diversity array technology, DArT, GoldenGate™, Infinium™, Invader®, Multiplex-Ready™ marker technology, MRT, Oligonucleotide ligation assay, OLA, Simple sequence repeat, SSR, Single base extension, SBE, Single nucleotide polymorphism, SNP, SNPlex™.

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513. © Humana Press, a part of Springer Science + Business Media, LLC 2009. DOI: 10.1007/978-1-59745-427-8_2

1. Introduction

The application of molecular markers to advance plant breeding is now well established (1). Modern agricultural breeding is dependent on molecular markers for the rapid and precise analysis of germplasm, trait mapping and marker-assisted selection (MAS). Molecular markers can be used to select parental genotypes in breeding programs, eliminate linkage drag in backcrossing and select for traits that are difficult to measure using phenotypic assays. Molecular markers have many other uses in genetics, such as the detection of alleles associated with genetic diseases, paternity assessment, forensics and inferences of population history (2, 3). Furthermore, molecular markers are invaluable as a tool for genome mapping in all systems, offering the potential for generating very high-density genetic maps that can be used to develop haplotypes for genes or regions of interest (4).

Insight into the organisation of the plant genome can be obtained by calculating a genetic linkage map using molecular markers. Genetic mapping places molecular genetic markers on linkage groups based on their co-segregation in a population. Markers that are transferable between species also enable studies of synteny and genome rearrangement across taxa.

Molecular markers are complementary tools to traditional selection. They can increase our understanding of phenotypic characteristics and their genetic association, which may modify the breeding strategy. DNA-based markers have many advantages over phenotypic markers in that they are highly heritable, relatively easy to assay and are not affected by the environment. The bulk of variation at the nucleotide level is often not visible at the phenotypic level. This variation can be exploited in molecular genetic marker systems. Two sequence-based marker systems, single nucleotide polymorphisms (SNPs) and simple sequence repeats (SSRs) (see Note 1), are the principal markers utilised in plant genetic analysis. These are supplemented by anonymous systems such as amplified fragment length polymorphisms (AFLPs) and diversity array technologies (DArT).

1.1. What are SNPs?

DNA sequence differences are the basic requirement for the study of molecular genetics. SNPs are the ultimate form of molecular genetic marker, as a nucleotide base is the smallest unit of inheritance, and a SNP represents a single nucleotide difference between two individuals at a defined location. There are three different forms of SNPs: transitions (C/T or G/A), transversions (C/G, A/T, C/A, or T/G) or small insertions/deletions (indels) (5). SNPs are direct markers as the sequence information provides the exact nature of the allelic variants. Furthermore, this sequence variation can have a major impact on how the organism develops and responds to the environment. SNPs represent the most frequent type of genetic polymorphism and may therefore provide a high density of markers near a locus of interest (6). SNPs can differentiate between related sequences, both within an individual and between individuals within a population. The frequency and nature of SNPs in plants is beginning to receive considerable attention. Studies of sequence diversity have recently been performed for a range of plant species and these


have indicated that SNPs appear to be abundant in plant systems, with one SNP every 100–300 bp (7). SNPs at any particular site could in principle involve four different nucleotide variants, but in practice they are generally biallelic. This disadvantage, when compared with multiallelic markers such as SSRs, is compensated by the relative abundance of SNPs. SNPs are also evolutionarily stable, not changing significantly from generation to generation. The low mutation rate of SNPs makes them excellent markers for studying complex genetic traits and as a tool for understanding genome evolution (8). The high density of SNPs makes them valuable for genome mapping, and in particular, they allow the generation of ultrahigh-density genetic maps and haplotyping systems for genes or regions of interest and map-based positional cloning. SNPs are used routinely in crop breeding programs (1), for genetic diversity analysis, cultivar identification, phylogenetic analysis, characterisation of genetic resources and association with agronomic traits (4). The applications of SNPs in crop genetics have been extensively reviewed by Rafalski (4) and Gupta et al. (1). These reviews highlight that for several years SNPs will coexist with other marker systems. However, with the development of new technologies to increase throughput and reduce the cost of SNP assays, along with further plant genome sequencing, the use of SNPs will become more widespread.

1.2. Simple Sequence Repeats

SSRs are one of the most powerful genetic markers in biology. They have been found in all prokaryotic and eukaryotic genomes analysed to date and are widely and ubiquitously distributed throughout eukaryotic genomes (9, 10). SSRs are short stretches of DNA sequence occurring as tandem repeats of mono-, di-, tri-, tetra-, penta- and hexanucleotides. They are highly polymorphic and informative markers. The high level of polymorphism is due to mutation affecting the number of repeat units. The value of SSRs is due to their genetic co-dominance, abundance, dispersal throughout the genome, multiallelic variation and high reproducibility. These properties provide a number of advantages over other molecular markers, namely, that multiple SSR alleles may be detected at a single locus using a simple polymerase chain reaction (PCR)-based screen, very small quantities of DNA are required for screening, and analysis is amenable to automated allele detection and sizing (11). The hypervariability of SSRs among related organisms makes them excellent markers for a wide range of applications, including genetic mapping, molecular tagging of genes, genotype identification, analysis of genetic diversity, phenotype mapping and MAS (12, 13). SSRs demonstrate a high degree of transferability between species, as PCR primers designed to an SSR within one species frequently amplify a corresponding locus in related species, enabling comparative genetic and genomic analysis.


Studies of the potential biological function and evolutionary relevance of SSRs are leading to a greater understanding of genomes and genomics (14). SSRs were initially considered to be evolutionarily neutral (15); however, recent evidence suggests an important role in genome evolution (16). Early suggestions that the majority of DNA was ‘junk’ or had no biological function are being challenged by the discovery of new functions for these sequences, and various functional roles have now been attributed to SSRs. For example, SSRs are believed to be involved in gene expression, regulation and function (17, 18), and there are numerous lines of evidence suggesting that SSRs in non-coding regions may also be of functional significance (19). In addition, SSRs provide hot spots of recombination, a variety of SSRs have been found to bind nuclear proteins, and there is direct evidence that SSRs can function as transcriptional activating elements (20).

1.3. Diversity Array Technology

DArT is a generic and cost-effective genotyping method based on hybridising DNA to microarrays (21). It was invented to overcome some of the limitations of other molecular marker technologies, and in particular, it does not require prior sequence information. Other advantages of DArT include high multiplexing level for high-throughput analysis and provision of data at low cost. The main technology applications of DArT include genome profiling, genetic map construction and Quantitative Trait Loci (QTL) identification, genetic diversity analysis and cultivar identification (22–26). DArT works in a similar way to AFLP in reducing the complexity of a DNA sample to obtain a representation of the genome. The preferred method of complexity reduction relies on a combination of restriction enzyme digestion and adapter ligation, followed by PCR amplification (27) with subsequent hybridisation-based detection.

1.4. Why Are Novel Marker Technologies Required?

During the past two decades, several molecular marker technologies have been developed and applied for plant genome analysis, predominantly assessing the differences between individual plants within a species. These marker technologies have been applied to plant breeding to allow breeders to use the genetic composition or genotype of plants as a criterion for selection in the breeding process. However, because of the relatively high cost associated with the development of this technology, these methods have only been applied to a limited number of crop species, predominantly in developed countries. Even in these situations, the application of molecular markers has tended to focus on a small number of high value traits or genomic regions. The recent application of association mapping via linkage disequilibrium (LD) in plants demonstrates the requirement to be able to identify and screen large numbers of markers, rapidly and at low cost.


The development of technologies that increase marker throughput with reducing cost will broaden the uptake of MAS to include more diverse crops and a greater variety of traits.

2. New Marker Discovery Methods

2.1. In Silico SNP Discovery

Large quantities of sequence data are generated through cDNA or genome sequencing projects internationally and these provide a valuable resource for the mining of molecular markers. This will be further accelerated with the application of new sequencing technology from Roche (454), Illumina (Solexa) and Applied Biosystems (SOLiD) (see Imelfort et al. this volume). The challenge of in silico SNP discovery is not the identification of polymorphic bases, but the differentiation of true SNP polymorphisms from the often more abundant sequence errors. High-throughput sequencing remains prone to inaccuracies as frequent as one error every one hundred base pairs. This incorrect base calling impedes the electronic filtering of sequence data to identify potentially biologically relevant polymorphisms. There are several different sources of error which need to be taken into account when differentiating between sequence errors and true polymorphisms. The primary source of sequence error comes from the automated reading of raw data, due to the fine balance between the desire to obtain the greatest sequence length and the confidence that bases are called correctly. Phred is the most widely adopted software used to call bases from Sanger chromatogram data (28, 29). The primary benefit of this software is that it provides a statistical estimate of the accuracy of calling each base, and therefore provides a primary level of confidence that a sequence difference represents true genetic variation. There are several software packages that take advantage of this feature to estimate the confidence of sequence polymorphisms within alignments. Where sequence trace files are available, and nucleotide quality may be measured, software such as PolyBayes and Polyphred are the most efficient means to differentiate between true SNPs and sequence error (see Note 2). Unfortunately, complete sequence trace file archives are rarely available for data sets collated from a variety of sources. 
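The per-base accuracy estimate mentioned above is conventionally reported as a Phred quality score Q, related to the estimated probability P of a miscalled base by Q = −10 log10(P). An error rate of one in one hundred bases therefore corresponds to Q = 20. The following Python sketch illustrates the conversion and a simple quality mask; the function names and the default threshold of 20 are illustrative assumptions and are not part of the Phred software itself.

```python
import math

def phred_to_error_prob(q):
    """Estimated probability that a base with Phred quality q is miscalled."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality score corresponding to an estimated error probability p."""
    return -10 * math.log10(p)

def mask_low_quality(seq, quals, min_q=20):
    """Replace bases whose quality is below min_q with 'N'.

    min_q=20 is an assumed threshold chosen for illustration only.
    """
    return "".join(base if q >= min_q else "N" for base, q in zip(seq, quals))
```

Masking low-quality bases before alignment is one simple way to stop likely miscalls from being counted as candidate polymorphisms, although it cannot remove errors introduced before base calling.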
Furthermore, sequence quality scores do not identify errors in the sequence incorporated before the base calling process. The principal cause of these prior errors is the inherently high error rate of the reverse transcription process required for the generation of cDNA libraries for Expressed Sequence Tag (EST) sequencing. Similar errors are also inherent, though to a lesser extent, in any PCR amplification process that may be part


of a sequencing protocol. In cases where trace files are unavailable, the identification of sequence errors can be based on two further methods to determine SNP confidence: redundancy of the polymorphism in an alignment and co-segregation of SNPs to define a haplotype. The frequency of occurrence of a polymorphism at a particular locus provides a measure of confidence in the SNP representing a true polymorphism, and is referred to as the SNP redundancy score. By examining SNPs that have a redundancy score equal to or greater than two (two or more of the aligned sequences represent the polymorphism), the vast majority of sequencing errors are removed. Although some true genetic variation is also ignored due to its presence only once within an alignment, the high degree of redundancy within the data permits the rapid identification of large numbers of SNPs without the requirement for sequence trace files. However, while redundancy-based methods for SNP discovery are highly efficient, the non-random nature of sequence error may lead to certain sequence errors being repeated between runs around locations of complex DNA structure. Therefore, errors at these loci would have a relatively high SNP redundancy score and appear as confident SNPs. In order to eliminate this source of error, an additional independent SNP confidence measure is required. This can be determined by the co-segregation of SNPs to define a haplotype. True SNPs that represent divergence between homologous genes co-segregate to define a conserved haplotype, whereas sequence errors do not co-segregate with a haplotype. Thus, a co-segregation score, based on whether a SNP position contributes to defining a haplotype, is a further independent measure of SNP confidence. By using the SNP redundancy score and co-segregation score together, true SNPs may be identified with reasonable confidence.
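The two confidence measures described above can be illustrated with a short Python sketch. It is a simplified stand-in for the general idea and not the implementation used by any published tool: the input is assumed to be a list of equal-length, gap-padded aligned reads, the redundancy score is taken as the number of reads carrying the second most frequent base in a column, and the co-segregation score is taken as the number of other candidate SNPs that divide the reads into the same two groups. The function names and the handling of gaps are assumptions made for illustration.

```python
from collections import Counter

def snp_redundancy(alignment, min_redundancy=2):
    """Return {column: (major_base, minor_base, redundancy)} for candidate SNPs.

    A column is reported when its second most frequent base is present in
    at least min_redundancy reads. Gaps and ambiguous bases are ignored.
    """
    snps = {}
    for col in range(len(alignment[0])):
        counts = Counter(read[col] for read in alignment if read[col] in "ACGT")
        if len(counts) < 2:
            continue
        (major, _), (minor, n_minor) = counts.most_common(2)
        if n_minor >= min_redundancy:
            snps[col] = (major, minor, n_minor)
    return snps

def cosegregation_scores(alignment, snps):
    """For each candidate SNP, count the other candidates that split the
    reads into the same two groups (a crude haplotype consistency check)."""
    patterns = {
        col: tuple(read[col] == minor for read in alignment)
        for col, (_, minor, _) in snps.items()
    }
    return {
        col: sum(1 for other, p in patterns.items()
                 if other != col and p == patterns[col])
        for col in snps
    }

# Hypothetical toy alignment: columns 1 and 6 vary together in two reads,
# while column 4 differs in a single read only.
reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ATGTACCTAC",
    "ATGTACCTAC",
    "ACGTTCGTAC",
]
candidates = snp_redundancy(reads)
scores = cosegregation_scores(reads, candidates)
```

In this toy alignment the singleton difference is discarded by the redundancy filter, and the two retained positions support each other because they define the same pair of haplotypes.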
Three tools currently apply the methods of redundancy and haplotype co-segregation: autoSNP (30, 31), SNPServer (32) and autoSNPdb. SNPServer is based on autoSNP and provides a real time Internet-based SNP discovery tool, combining redundancy-based SNP discovery and haplotype co-segregation scoring. Sequences may be submitted for assembly with CAP3 (33) or submitted preassembled in ACE format. Alternatively, a single sequence may be submitted for Basic Local Alignment Search Tool (BLAST) comparison with a sequence database (34). Identified sequences are then processed for assembly with CAP3, and subsequent redundancy-based SNP discovery. SNPServer has an advantage in being the only real time Web-based tool that allows users to rapidly identify novel SNPs in sequences of interest. The recently developed autoSNPdb combines the SNP discovery pipeline of autoSNP with a relational database, hosting information on the polymorphisms, cultivars and gene annotations, to enable efficient mining and interrogation of the data. Users may search for SNPs within genes


with specific annotation or for SNPs between defined cultivars. AutoSNPdb can integrate both Sanger and pyrosequencing data, enabling efficient SNP discovery from next generation sequencing technologies.

2.2. SSR Discovery

Previously, the discovery of SSR loci was limited to the construction of genomic DNA libraries enriched for SSR sequences, followed by DNA sequencing (35). This process is both time-consuming and expensive due to the specific sequencing required. The availability of large quantities of sequence data now makes it more economical and efficient to use computational tools to identify SSR loci. Flanking DNA sequence may then be used to design suitable forward and reverse PCR primers to assay the SSR. Several computational tools are currently available for the identification of SSRs within sequence data as well as for the design of PCR amplification primers. These include SSRPrimer (36), which integrates two such tools, enabling the simultaneous discovery of SSRs within single or bulk sequence data, and the design of specific PCR primers for the amplification of these loci. The Web-based version of SSRPrimer permits the remote use of this package with any sequence of interest. SSR Taxonomy Tree demonstrates the application of SSRPrimer to the complete GenBank database, with the results organised as a taxonomic hierarchy for browsing or searching for SSR amplification primers in any species of interest (37). Because of the redundancy in EST sequence data, with data sets often being derived from several distinct cultivars, it is now possible to predict the polymorphism of SSRs in silico. Using an extended version of autoSNPdb, polymorphic SSRs are distinguished from monomorphic SSRs by the representation of varying motif lengths within an alignment of sequence reads. The identification of SSRs that are predicted to be polymorphic between defined varieties greatly reduces the cost associated with the application of these markers.
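As an illustration of how SSR loci can be located computationally, the following Python sketch scans a sequence for perfect tandem repeats of mono- to hexanucleotide motifs using regular expressions. It is a minimal example and does not reproduce the algorithm of SSRPrimer or any other named tool; the minimum repeat numbers in MIN_REPEATS are assumed values chosen for illustration, and imperfect or compound repeats are not detected.

```python
import re

# Assumed minimum number of repeat units for each motif length (illustrative only).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 4, 5: 4, 6: 4}

def find_ssrs(seq):
    """Return a sorted list of (start, motif, repeat_count) for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A motif of unit_len bases followed by at least (min_rep - 1) copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are a shorter unit repeated (e.g. 'AGAG' is really 'AG').
            if any(motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)

# Hypothetical test sequence containing an (AG)8 and a (CAT)5 repeat.
example = "GGCT" + "AG" * 8 + "TTCA" + "CAT" * 5 + "GG"
ssrs = find_ssrs(example)
```

The reported start positions are zero-based, so the sequence flanking each hit can be sliced out directly and passed to a primer design step.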

3. New Genotyping Technologies

3.1. New Genotyping Technologies for SNPs

Many new marker technologies involve improving the genotyping of SNPs, reflecting the increasing popularity of these markers. SNPs can be identified within a gene of interest, or within close proximity to a candidate gene. Although the SNP may not be directly responsible for the observed phenotype, it can be used for the positional cloning of the gene responsible (1) and as a diagnostic marker. Furthermore, SNPs are useful to define haplotypes in regions of interest. The success of the human HapMap project (38), where a very large


number of SNPs were assayed over a range of individuals from different groups, demonstrates the value that can be gained from SNP studies. Reducing costs could enable similar studies to be undertaken to gain a greater understanding of plants.

3.1.1. Invader® Assay

The Invader® assay is a relatively new technology designed specifically for genotyping SNPs (39, 40). In this technology, an oligonucleotide Invader probe is designed to anneal immediately next to the variable site, in the opposite direction to a secondary, allele-specific probe. The secondary probe contains a 5′-flap that is non-complementary to the target DNA and so is unable to hybridise to the target sequence. The 3′-end of the bound Invader probe overlaps the secondary probe by a single base at the site of the allelic variant or SNP. A three-dimensional complex is formed by hybridisation of the secondary allele-specific overlapping probe to the target DNA containing a SNP site. This complex is only produced if the secondary probe is complementary to the allele and the Invader probe is present. The annealing of the probe complementary to the SNP allele induces cleavage by a thermostable, structure-specific flap endonuclease (FEN). The cleaved 5′-flap fragment then triggers a secondary cleavage reaction between a quencher molecule, a fluorophore and the cleaved fragment, which results in a fluorescent emission. If the secondary probe is not complementary to the SNP allele and no invasive complex is created, the FEN does not perform cleavage and no fluorescence is observed (Fig. 1). There are several different approaches to detect the cleavage. Most commonly, cleavage is detected using a fluorescence resonance energy transfer (FRET™) cassette; however, it can also be detected by fluorescence polarisation probes or by mass spectrometry. The Invader® assay is a highly accurate method, has a low failure rate, and can detect very small (zeptomole) quantities of target DNA. However, it does require the PCR amplification of the target DNA and the design of a specific secondary probe for each of the SNP alleles. This increases the cost of the method, which makes it unsuitable for high-throughput genotyping.
While the assay has traditionally been used to interrogate one SNP in one sample per reaction, novel chip- or bead-based approaches are being tested to make this efficient and accurate assay adaptable to multiplexing and high-throughput SNP genotyping. The Biplex Invader® assay (41) was recently developed, which allows the detection of both alleles in the same reaction tube. There are two signal fluorophores attached to two different FRET™ cassettes (FRET 1 and 2) that are spectrally distinct and specific to either allele of the biallelic system. The ratios of the two fluorescent signals allow a genotype to be assigned. This method is very attractive for researchers who want to genotype a small number of SNPs over large populations. The utility of


Fig. 1. Overview of the Invader® assay.

this new technology in plants has been demonstrated by Gupta et al. (42) for the accurate determination of gene copy number in a molecular breeding program involving both transgenic and non-transgenic plants.

3.1.2. Illumina GoldenGate™ and Infinium™ Assays

The Illumina GoldenGate™ technology is a novel array technology based on microbeads assembled into 96 sample arrays, with redundant bead types for increased confidence in genotype calls (43). This technology is particularly suited for high-throughput genotyping (44). The arrays have up to 50,000 beads, each around 3 microns in diameter. The beads are distributed among 1,520 bead types, with each bead type representing a different oligonucleotide probe sequence. This provides around 30 copies of each bead type, with the result that a genotype call is based on the average of many replicates. This inherent redundancy increases robustness and genotyping accuracy. The assay performs allelic discrimination directly on genomic DNA, then generates a synthetic allele-specific PCR template before performing PCR on this artificial template. This is a reversal of conventional SNP genotyping assays, which usually use PCR to amplify a SNP of interest and carry out allelic discrimination on the PCR product. The Illumina Bead Station GoldenGate™ assay is most suitable for researchers performing large-scale


association studies, such as whole-genome linkage mapping and large-scale fine mapping. It can be carried out in 384, 768 and 1536 sample formats using custom SNP panels. The GoldenGate™ assay was developed specifically for multiplexing to high levels while retaining the flexibility to choose any SNPs of interest to assay. GoldenGate™ assay technology involves two allele-specific oligonucleotides (ASOs) and one locus-specific oligonucleotide (LSO) for each SNP (Fig. 2). The ASOs are designed to have a Tm of 60°C, within the range 57–62°C, and the LSO has a Tm of 57°C, within the range 54–60°C. Each ASO consists of a 3′ portion that hybridises to the DNA at the SNP locus, with the 3′ base complementary to one of the two SNP alleles, and a 5′ portion that incorporates a universal PCR primer sequence (P1 or P2, each associated with a different allele). The LSOs consist of three parts: at the 5′ end is a SNP locus-specific sequence; the middle contains an address sequence complementary to one of the capture sequences on the array; and there is a universal PCR priming site (P3′) at the 3′ end. The genomic DNA is attached to a solid support before the start of the assay, and the oligonucleotides targeted to specific SNPs of interest are then annealed to the DNA. The attachment step is performed to improve assay specificity by removing unbound and non-specifically hybridised oligonucleotides using stringency washes, while the correctly hybridised oligonucleotides remain on the solid phase. Following the annealing and washing steps, an allele-specific primer extension step is carried out, in which DNA polymerase extends the ASOs if their 3′ base is complementary to the SNP (45). This is followed by ligation of the extended ASOs to their corresponding LSOs, which creates the PCR templates.
This ligated product is amplified by PCR using universal primers that are complementary to a universal sequence in the 3′-end of the ligation probes and 5′-ends of the allele-specific primers, respectively. The ligation probe contains a SNP-specific tag-sequence, and the universal allele-specific primers carry an allele-specific fluorescent label in their 5′ end. The three universal PCR primers P1, P2 (each fluorescently labelled with a different dye) and P3 associate a fluorescent dye with each SNP allele. After PCR, the amplified products are captured on beads carrying complementary target sequences for the SNP-specific tag of the ligation probe. Each SNP is assigned a different address sequence, which is contained within the LSO. Each of these addresses is complementary to a unique capture sequence represented by one of the bead types in the array. Therefore, the products of the assays hybridise to different bead types in the array, allowing all genotypes to be read simultaneously. The ratio of the two primer-specific fluorescent signals identifies the genotype as either of the two homozygotes or heterozygote. This universal address system, consisting of artificial


Fig. 2. Overview of the Illumina GoldenGate™ assay.


sequences that are not SNP specific, allows any set of SNPs to be read on a common, standard array, providing flexibility and reducing array manufacturing costs. Custom assays are made on demand by building the address sequences into the SNP-specific assay oligonucleotides. In order to identify suitable SNPs for the GoldenGate™ assay, only 40 bp of sequence surrounding the SNP is required, and either strand can be chosen for the assay. One major advantage of the GoldenGate™ method is that it requires only three universal primers for PCR, regardless of the number of assays, which saves on costs, and primer sequence-related differences in amplification rates between SNPs are eliminated. This new technology has recently been applied to barley, with the development of a barley Illumina GoldenGate™ assay. This high-throughput SNP platform provides barley researchers with a unique integrated mapping and diversity analysis platform based on more than 3,000 gene-based markers.

Genome-wide genotyping of fixed sets of hundreds of thousands of SNPs is performed using the novel Infinium™ II assays. In this assay, a whole-genome amplification step is used to increase the amount of DNA up to 1,000-fold. The DNA is fragmented and captured on a bead array by hybridisation to immobilised SNP-specific primers, followed by extension with hapten-labelled nucleotides. The primers hybridise adjacent to the SNPs and are extended with a single nucleotide corresponding to the SNP allele. The incorporated hapten-modified nucleotides are detected by adding fluorescently labelled antibodies in several steps to amplify the signals.

3.1.3. Single Base Extension and MALDI–TOF Assays

A popular technology for genotyping SNPs is the minisequencing technique (8), also known as primer extension or single base extension (SBE). In this method, a detection primer is designed to target a sequence immediately upstream of the SNP. The 3′-terminus of the oligonucleotide is then extended, by only one base, by a DNA polymerase using labelled dideoxynucleotide triphosphates (ddNTPs). The terminating fluorescent dye corresponds to a specific ddNTP nucleotide base, making it possible to detect up to four allelic variants at a variable site and discriminate heterozygous from homozygous genotypes. Different detection platforms such as microarrays (45), capillary electrophoresis (46), pyrosequencing (47), flow cytometry (48), mass spectrometry (49) or fluorescence plate readers (50) can be employed with this minisequencing method, demonstrating its flexibility and adaptation to different analytical technologies. A novel marker technology, the Sequenom iPLEX™ Assay, uses the SBE coupled with a matrix-assisted laser desorption/ionisation time of flight (MALDI–TOF) mass spectrometer (Fig. 3). The iPLEX™ assay begins with PCR amplification of the target region containing the SNP, as with the SBE. However,


Fig. 3. Overview of the Sequenom iPLEX™ assay.

the PCR primers each have a specific 10-mer tag attached at the 5′ end. The PCR product is treated with Shrimp Alkaline Phosphatase to remove the unincorporated dNTPs, and the multiplex reaction is extended by one base using specific primers. The reaction is desalted to optimise mass spectrometric analysis, and the genotypes are analysed using the MassARRAY workstation. Up to 24 SNPs can be assayed together in one iPLEX™ reaction. This method has been used by Törjek et al. (51) to develop a set of 112 SNP markers in Arabidopsis thaliana, which suggests that it can be used as a medium to high-throughput genotyping system.


3.1.4. Oligonucleotide Ligation Assay

A further novel marker technology for genotyping SNPs is the oligonucleotide ligation assay (OLA) (52). This method is based on the properties of an enzymatic reaction in which two adjacent oligonucleotides may be covalently joined by a DNA ligase when annealed to a complementary DNA target. Both of the oligonucleotides must have perfect base pair complementarity at the ligation site, allowing the discrimination of two alleles at a SNP site. The OLA method has recently been commercialised using the Applied Biosystems SNPlex™ system, which uses OLA for allelic discrimination and ligation product amplification (53). Genotype information is encoded into a universal set of dye-labelled, mobility-modified fragments, called ZipChute™ Mobility Modifiers, for rapid detection by capillary electrophoresis. The same set of ZipChute™ Mobility Modifiers is used for every SNPlex™ pool, regardless of which SNPs are chosen. In the first step of the SNPlex™ assay, an OLA reaction is performed, where ASO and LSO probes hybridise to the target sequence (Fig. 4). These allele-specific and locus-specific probes ligate when they are hybridised to a perfectly matching sequence at the SNP site. At the same time, universal linkers are ligated to the distal termini of the ASO and LSO ligation probes. These linkers contain universal PCR primer-binding sequences and sequences complementary to ASO and LSO probes. A unique ZipCode sequence is attached at the 5′-end of the genomic equivalent sequence within each ASO, allowing the OLA step to encode the genotype information of every SNP into unique ligation products. No optimisation of the OLA is required as all probes are designed to function under the same hybridisation conditions. The unligated probes and linkers, along with any excess genomic DNA, are removed by enzymatic digestion using exonuclease I and lambda exonuclease, to ensure efficiency of the subsequent PCR reaction.
This is a simultaneous PCR amplification of the purified ligation products with a single pair of PCR primers, one of which is biotinylated. The use of the universal pair of PCR primers ensures that optimisation of PCR reaction conditions is not required. The biotinylated amplicons are then bound within wells of streptavidin-coated microtitre plates. This allows the non-biotinylated strands to be removed, leaving the single-stranded amplicons bound to the plate. The fluorescently labelled universal ZipChute™ probes then hybridise to the bound single-stranded amplicons. Each ZipChute™ probe contains a sequence complementary to the unique ZipCode sequence within each ASO and contains a mobility modifier, which assigns to each ZipChute™ probe a specific rate of mobility during capillary electrophoresis. The specifically bound ZipChute™ probes are analysed using an Applied Biosystems 3730/3730xl DNA Analyser. One SNP is typically characterised by two possible alleles; therefore, the two fluorescent peaks in an electropherogram represent the two alleles of a specific SNP.
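Several of the assays described in this section assign a genotype from the relative strength of two allele-specific signals. The following Python sketch shows one simple way such a call could be made from two signal intensities. It is an illustration only and does not represent the calling algorithm of any vendor software: the function name, the allele labels A and B, the minimum total signal and the homozygote cutoff are all assumed values, and production software typically clusters the signals from many samples instead of applying fixed thresholds.

```python
def call_genotype(signal_a, signal_b, min_total=100.0, hom_cutoff=0.8):
    """Call a biallelic genotype from two allele-specific signal intensities.

    min_total and hom_cutoff are assumed, illustrative thresholds:
    weak reactions give 'no call', and a sample is called homozygous when
    one allele contributes at least hom_cutoff of the total signal.
    """
    total = signal_a + signal_b
    if total < min_total:
        return "no call"
    frac_a = signal_a / total
    if frac_a >= hom_cutoff:
        return "AA"
    if frac_a <= 1 - hom_cutoff:
        return "BB"
    return "AB"

# Hypothetical signal pairs (allele A, allele B) for four samples.
samples = [(950.0, 40.0), (30.0, 800.0), (500.0, 450.0), (10.0, 20.0)]
calls = [call_genotype(a, b) for a, b in samples]
```

Using the fraction of total signal, and not the raw ratio, keeps the value bounded between 0 and 1 and avoids division by zero when one allele gives no signal.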

New Technologies for Ultra-High Throughput Genotyping in Plants

Fig. 4. Overview of the Applied Biosystems SNPlex™ assay.


Appleby, Edwards, and Batley

3.2. New SSR Technologies

Novel technologies for SSRs have been limited to approaches that increase the multiplex ratio of the SSRs in order to increase throughput and decrease costs. One such technology is the Multiplex-Ready™ Marker technology (MRT), developed at the University of Adelaide. This reduces marker deployment costs for fluorescent-based SSR analysis, and increases genotyping throughput through more efficient electrophoretic separation of the SSRs. MRT is a single-step, closed-tube assay that involves two PCR stages (Fig. 5). In the first stage, target loci are amplified using locus-specific primers tagged at the 5′ end with a defined sequence. This PCR product is used as the template in the second stage, in which short dye-labelled primers complementary to the defined sequence amplify the products for automated analysis. The use of the defined primer tag sequence improves automation by providing a consistent PCR yield for markers within a multiplex assay, as well as between reactions. The system is open to flexible dye labelling and has robust tolerance to variations in the concentration and quality of the target DNA. Furthermore, it is compatible with standard capillary electrophoresis instrumentation. The method has been applied for high-throughput analysis of markers used in cereal breeding and is currently being deployed in several Australian cereal research and breeding programs.
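The two PCR stages described above can be mimicked with simple string operations. The following is a minimal sketch only: the tag sequences, the template and the primer sequences are invented for illustration and are not the published Multiplex-Ready tags, and the functions model exact primer matches rather than real annealing behaviour.

```python
# Toy simulation of a two-stage PCR with 5'-tagged locus-specific primers.
# All sequences are invented for illustration.

_COMP = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMP)[::-1]


def first_stage_product(template, fwd_locus, rev_locus, fwd_tag, rev_tag):
    """Top strand of the tagged amplicon, or None if a primer finds no site."""
    start = template.find(fwd_locus)
    if start < 0:
        return None
    rev_site = revcomp(rev_locus)  # the reverse primer binds the other strand
    end = template.find(rev_site, start + len(fwd_locus))
    if end < 0:
        return None
    amplicon = template[start:end + len(rev_site)]
    return fwd_tag + amplicon + revcomp(rev_tag)


def second_stage_amplifies(product, fwd_tag, rev_tag):
    """Tag primers amplify only products that carry both tags at their ends."""
    return product.startswith(fwd_tag) and revcomp(product).startswith(rev_tag)


FWD_TAG, REV_TAG = "ACGACGTTGT", "CATTAAGTTC"  # hypothetical tag sequences
TEMPLATE = "GG" + "ATGCCGTA" + "CA" * 5 + "TTGACCGA" + "CC"  # (CA)5 SSR locus
FWD_LOCUS, REV_LOCUS = "ATGCCGTA", revcomp("TTGACCGA")

product = first_stage_product(TEMPLATE, FWD_LOCUS, REV_LOCUS, FWD_TAG, REV_TAG)
print(len(product), second_stage_amplifies(product, FWD_TAG, REV_TAG))
```

Because every first-stage product ends in the same pair of tags, the same dye-labelled tag primers can serve any locus in the second stage, which is the property the assay relies on for flexible dye labelling.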

3.3. Diversity Array Technology

Diversity array technology (DArT) (21) assays for the presence of a specific DNA fragment within a representative sample of total genomic DNA. The method does not require prior sequence knowledge, so it can be used for plants for which little or no sequence information is available (see Note 3). The method consists of several steps: complexity reduction of the DNA of interest; creation of a library, which is then arrayed onto a glass slide; hybridisation of fluorescently labelled DNA to the slides; and detection of the hybridisation signal. DArT reduces the complexity of a DNA sample to obtain a representation. This involves restriction enzyme digestion and adapter ligation, followed by amplification (27). The genomic representation contains two types of fragments: constant fragments, found in any representation prepared from a DNA sample of an individual belonging to a given species, and polymorphic fragments, found in some but not all of the samples. These polymorphic fragments are the informative DArT markers. Their presence or absence in a sample is assayed by hybridising the representation to a DArT array consisting of a library of that species. Library creation involves generating genomic representations from a pool of individuals covering the genetic diversity of the species being studied. These fragments are cloned into a vector and transformed into Escherichia coli.


Fig. 5. Overview of the Multiplex-Ready™ Marker assay.


Within the library, each colony contains one of the fragments from the genomic representation. A selection of clones from the library is arrayed into 384-well plates. The fragments within the library are then amplified and spotted onto glass slides using a microarrayer to form the genotyping DArT array. The genotyping arrays are hybridised with genomic representations of individual DNA samples prepared using the same complexity reduction method. These representations are labelled with one fluorescent label, while the vector fragment is labelled with a different fluorescent label to act as a reference. Each individual representation will hybridise only to matching fragments on the genotyping array, thereby displaying a unique hybridisation pattern. The hybridised slides are washed to remove unbound labelled DNA and then scanned to detect the fluorescent signal emitted from the hybridised fragments.

There have been many applications of DArT in plant genomics. A comprehensive collection of DArT markers that are polymorphic for wheat and barley germplasm has been assembled, with over 1,000 markers for barley and 2,000 for wheat. Services are also offered for other crops such as apple, cassava, tomato, sorghum, ryegrass, chickpea, sugarcane, lupin, banana and coconut.
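The presence/absence scoring of array features can be sketched as follows. This is a simplified illustration that assumes a fixed cut-off on the log2 ratio of the sample signal to the vector reference signal; production pipelines may use more elaborate clustering of the ratios. The clone names, sample names and signal values are invented.

```python
import math

# Simplified presence/absence scoring for array features.
# Assumption: a feature is "present" when log2(sample / reference) exceeds a
# fixed threshold; the threshold and all signal values below are illustrative.


def score_clone(target_signal, reference_signal, threshold=0.0):
    """Return 1 if the fragment is scored present, 0 if absent."""
    return 1 if math.log2(target_signal / reference_signal) > threshold else 0


def classify_clones(signals, threshold=0.0):
    """Classify each clone as constant, polymorphic or absent across samples.

    signals: {clone: {sample: (target_signal, reference_signal)}}
    Returns {clone: (class_label, {sample: score})}.
    """
    classes = {}
    for clone, per_sample in signals.items():
        scores = {sample: score_clone(target, ref, threshold)
                  for sample, (target, ref) in per_sample.items()}
        observed = set(scores.values())
        if observed == {0, 1}:
            kind = "polymorphic"  # informative marker
        elif observed == {1}:
            kind = "constant"
        else:
            kind = "absent"
        classes[clone] = (kind, scores)
    return classes


SIGNALS = {  # invented intensities: (sample channel, reference channel)
    "clone_001": {"line_A": (1800.0, 600.0), "line_B": (1500.0, 500.0)},
    "clone_002": {"line_A": (2400.0, 600.0), "line_B": (150.0, 600.0)},
}

for clone, (kind, scores) in classify_clones(SIGNALS).items():
    print(clone, kind, scores)
```

Only clones classified as polymorphic would be carried forward as markers, since constant fragments carry no information about differences between samples.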

4. Conclusions

Molecular markers have many applications in plant breeding, and the ability to detect the presence of a gene (or genes) controlling a particular desired trait has given rise to MAS. These new technologies make it possible to speed up the breeding process. For example, a desired trait may only be observed in the mature plant, but MAS allows researchers to screen for the trait at a much earlier growth stage. A further advantage of molecular markers is that they make it possible to select simultaneously for many different plant characteristics. They can also be used to identify individual plants with a defined resistance gene without exposing the plant to the pest or pathogen in question. In order to increase throughput and decrease costs, it is necessary to eliminate bottlenecks throughout the genotyping process, as well as to minimise sources of variability and human error to ensure data quality and reproducibility. These new technologies may be the way forward for the discovery and application of molecular markers and will enable the application of markers for a broader range of traits in a greater diversity of species than is currently possible.


Notes

1. SSRs are also referred to as microsatellites following the method of their initial identification. They are now more commonly called SSRs.
2. PolyPhred integrates phred base calling and quality information within phrap-generated sequence alignments (54). The alignments are viewed and marked for inspection using Consed (55). This method has now been extended to include Bayesian statistical analysis. PolyBayes (56) is a fully probabilistic SNP detection algorithm that calculates the probability that discrepancies at a given location of a multiple alignment represent true sequence variations as opposed to sequencing errors. This calculation takes into account the alignment depth, the base calls in each sequence, the base quality values, the base composition in the region and the expected a priori polymorphism rate.
3. Where a large amount of sequence data is available for a species, markers such as SNPs and SSRs will provide more information and should be used. In species for which there is limited sequence available, anonymous markers such as DArT may be more cost-effective.

References

1. Gupta, P.K., Roy, J.K., and Prasad, M. (2001) Single nucleotide polymorphisms: A new paradigm for molecular marker technology and DNA polymorphism detection with emphasis on their use in plants. Curr. Sci. 80, 524–535.
2. Brumfield, R.T., Beerli, P., Nickerson, D.A., and Edwards, S.V. (2003) The utility of single nucleotide polymorphisms in inferences of population history. Trends Ecol. Evol. 18, 249–256.
3. Collins, A., Lau, W., and De la Vega, F.M. (2004) Mapping genes for common diseases: The case for genetic (LD) maps. Hum. Hered. 58, 2–9.
4. Rafalski, A. (2002) Applications of single nucleotide polymorphisms in crop genetics. Curr. Opin. Plant Biol. 5, 94–100.
5. Edwards, D., Forster, J.W., Chagné, D., and Batley, J. (2007) What are SNPs?, in Association Mapping in Plants (Oraguzie, N.C., Rikkerink, E.H.A., Gardiner, S.E., and De Silva, H.N., eds.), Springer, NY, 41–52.
6. Batley, J. and Edwards, D. (2007) SNP applications in plants, in Association Mapping in Plants (Oraguzie, N.C., Rikkerink, E.H.A., Gardiner, S.E., and De Silva, H.N., eds.), Springer, NY, 95–102.
7. Edwards, D., Batley, J., Cogan, N.O.I., Forster, J.W., and Chagné, D. (2007) Single nucleotide polymorphism discovery, in Association Mapping in Plants (Oraguzie, N.C., Rikkerink, E.H.A., Gardiner, S.E., and De Silva, H.N., eds.), Springer, NY, 53–76.
8. Syvänen, A.-C. (2001) Genotyping single nucleotide polymorphisms. Nat. Rev. Genet. 2, 930–942.
9. Tóth, G., Gáspári, Z., and Jurka, J. (2000) Microsatellites in different eukaryotic genomes: Survey and analysis. Genome Res. 10, 967–981.
10. Katti, M.V., Ranjekar, P.K., and Gupta, V.S. (2001) Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol. Biol. Evol. 18, 1161–1167.
11. Schlötterer, C. (2000) Evolutionary dynamics of microsatellite DNA. Nucleic Acids Res. 20, 211–215.
12. Tautz, D. (1989) Hypervariability of simple sequences as a general source for polymorphic DNA markers. Nucleic Acids Res. 17, 6463–6471.


13. Powell, W., Machray, G.C., and Provan, J. (1996) Polymorphism revealed by simple sequence repeats. Trends Plant Sci. 1, 215–222.
14. Subramanian, S., Mishra, R.K., and Singh, L. (2003) Genome-wide analysis of microsatellite repeats in humans: Their abundance and density in specific genomic regions. Genome Biol. 4, R13.
15. Awadalla, P. and Ritland, K. (1997) Microsatellite variation and evolution in the Mimulus guttatus species complex with contrasting mating systems. Mol. Biol. Evol. 14, 1023–1034.
16. Moxon, E.R. and Wills, C. (1999) DNA microsatellites: Agents of evolution. Sci. Am. 280, 94–99.
17. Kashi, Y., King, D., and Soller, M. (1997) Simple sequence repeats as a source of quantitative genetic variation. Trends Genet. 13, 74–78.
18. Gupta, M., Chyi, Y.-S., Romero-Severson, J., and Owen, J.L. (1994) Amplification of DNA markers from evolutionarily diverse genomes using single primers of simple-sequence repeats. Theor. Appl. Genet. 89, 998–1006.
19. Mortimer, J., Batley, J., Love, C., Logan, E., and Edwards, D. (2005) Simple sequence repeat (SSR) and GC distribution in the Arabidopsis thaliana genome. J. Plant Biotechnol. 7, 17–25.
20. Li, Y.-C., Korol, A.B., Fahima, T., Beiles, A., and Nevo, E. (2002) Microsatellites: Genomic distribution, putative functions and mutational mechanisms: A review. Mol. Ecol. 11, 2453–2465.
21. Jaccoud, D., Peng, K., Feinstein, D., and Kilian, A. (2001) Diversity arrays: A solid state technology for sequence information independent genotyping. Nucleic Acids Res. 29, e25.
22. Xia, L., Peng, K., Yang, S., Wenzl, P., de Vicente, C., Fregene, M., and Kilian, A. (2005) DArT for high-throughput genotyping of cassava (Manihot esculenta) and its wild relatives. Theor. Appl. Genet. 110, 1092–1098.
23. Yang, S., Pang, W., Ash, G., Harper, J., Carling, J., Wenzl, P., Huttner, E., and Kilian, A. (2006) Low level of genetic diversity in cultivated pigeonpea compared to its wild relatives is revealed by diversity arrays technology (DArT). Theor. Appl. Genet. 113, 585–595.
24. Xie, Y., McNally, K., Li, C.Y., Leung, H., and Zhu, Y.Y. (2006) A high-throughput genomic tool: Diversity array technology complementary for rice genotyping. J. Integr. Plant Biol. 48, 1069–1076.
25. Akbari, M., Wenzl, P., Vanessa, C., Carling, J., Xia, L., Yang, S., Uszynski, G., Mohler, V., Lehmensiek, A., Kuchel, H., Hayden, M.J., Howes, N., Sharp, P., Rathmell, B., Vaughan, P., Huttner, E., and Kilian, A. (2006) Diversity arrays technology (DArT) for high-throughput profiling of the hexaploid wheat genome. Theor. Appl. Genet. 113, 1409–1420.
26. Wenzl, P., Li, H., Carling, J., Zhou, M., Raman, H., Paul, E., Hearnden, P., Maier, C., Xia, L., Caig, V., Ovesna, J., Cakir, M., Poulsen, D., Wang, J., Raman, R., Smith, K.P., Muehlbauer, G.J., Chalmers, K.J., Kleinhofs, A., Huttner, E., and Kilian, A. (2006) A high-density consensus map of barley linking DArT markers to SSR, RFLP and STS loci and phenotypic traits. BMC Genom. 7, 206.
27. Wenzl, P., Carling, J., Kudrna, D., Jaccoud, D., Huttner, E., Kleinhofs, A., and Kilian, A. (2004) Diversity arrays technology (DArT) for whole-genome profiling of barley. PNAS 101, 9915–9920.
28. Ewing, B. and Green, P. (1998a) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186–194.
29. Ewing, B., Hillier, L., Wendl, M.C., and Green, P. (1998b) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, 175–185.
30. Barker, G., Batley, J., O’Sullivan, H., Edwards, K.J., and Edwards, D. (2003) Redundancy based detection of sequence polymorphisms in expressed sequence tag data using autoSNP. Bioinformatics 19, 421–422.
31. Batley, J., Barker, G., O’Sullivan, H., Edwards, K.J., and Edwards, D. (2003) Mining for single nucleotide polymorphisms and insertions/deletions in maize expressed sequence tag data. Plant Physiol. 132, 84–91.
32. Savage, D., Batley, J., Erwin, T., Logan, E., Love, C.G., Lim, G.A.C., Mongin, E., Barker, G., Spangenberg, G.C., and Edwards, D. (2005) SNPServer: A real-time SNP discovery tool. Nucleic Acids Res. 33, W493–W495.
33. Huang, X. and Madan, A. (1999) CAP3: A DNA sequence assembly program. Genome Res. 9, 868–877.
34. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
35. Edwards, K.J., Barker, J.H.A., Daly, A., Jones, C., and Karp, A. (1996) Microsatellite libraries enriched for several microsatellite sequences in plants. Biotechniques 20, 758–760.
36. Robinson, A.J., Love, C.G., Batley, J., Barker, G., and Edwards, D. (2004) Simple sequence repeat marker loci discovery using SSRPrimer. Bioinformatics 20, 1475–1476.
37. Jewell, E., Robinson, A., Savage, D., Erwin, T., Love, C.G., Lim, G.A.C., Li, X., Batley, J., Spangenberg, G.C., and Edwards, D. (2006) SSR Primer and SSR Taxonomy Tree: Biome SSR discovery. Nucleic Acids Res. 34, W656–W659.
38. The International HapMap Consortium (2003) The International HapMap Project. Nature 426, 789–796.
39. Mein, C.A., Barratt, B.J., Dunn, M.G., Siegmund, T., Smith, A.N., Esposito, L., Nutland, S., Stevens, H.E., Wilson, A.J., Phillips, M.S., Jarvis, N., Law, S., De Arruda, M., and Todd, J.A. (2000) Evaluation of single nucleotide polymorphism typing with invader on PCR amplicons and its automation. Genome Res. 10, 330–343.
40. Olivier, M. (2005) The Invader® assay for SNP genotyping. Mutat. Res. 573, 103–110.
41. Olivier, M., Chuang, L.M., Chang, M.S., Chen, Y.T., Pei, D., Ranade, K., de Witte, A., Allen, J., Tran, N., Curb, D., Pratt, R., Neefs, H., de Arruda, M., Law, S., Neri, B., Wang, L., and Cox, D.R. (2002) High-throughput genotyping of single nucleotide polymorphisms using new biplex invader technology. Nucleic Acids Res. 30, e53.
42. Gupta, M., Niaunsuksiri, W., Schulenberg, G., Hartl, T., Novak, S., Bayan, J., Vanopduop, N., Bing, J., and Thompson, S. (2008) A non-PCR-based Invader® assay quantitatively detects single-copy genes in complex plant genomes. Mol. Breeding 21, 173–181.
43. Fan, J.-B., Oliphant, A., Shen, R., Kermani, B.G., Garcia, F., Gunderson, K.L., Hansen, M., Steemers, F., Butler, S.L., Deloukas, P., Galver, L., Hunt, S., McBride, C., Bibikova, M., Rubano, T., Chen, J., Wickham, E., Doucet, D., Chang, W., Campbell, D., Zhang, B., Kruglyak, S., Bentley, D., Haas, J., Rigault, P., Zhou, L., Stuelpnagel, J., and Chee, M.S. (2003) Highly parallel SNP genotyping. Cold Spring Harb. Symp. Quant. Biol. 68, 69–78.
44. Gunderson, K.L., Steemers, F.J., Lee, G., Mendoza, L.G., and Chee, M.S. (2005) A genome-wide scalable SNP genotyping assay using microarray technology. Nat. Genet. 37, 549–554.
45. Pastinen, T., Kurg, A., Metspalu, A., Peltonen, L., and Syvänen, A.-C. (1997) Minisequencing: A specific tool for DNA analysis and diagnostics on oligonucleotide arrays. Genome Res. 7, 606–614.
46. Batley, J., Mogg, R., Edwards, D., O’Sullivan, H., and Edwards, K.J. (2003) A high-throughput SNuPE assay for genotyping SNPs in the flanking regions of Zea mays sequence tagged simple sequence repeats. Mol. Breeding 11, 111–120.
47. Ekstroem, B., Alderborn, A., and Hammerling, U. (2000) Pyrosequencing for SNPs. Proceedings of SPIE, The International Society for Optical Engineering 3926, 134–139.
48. Chen, J., Iannone, M.A., Li, M.-S., Taylor, J.D., Rivers, P., Nelsen, A.J., Slentz-Kesler, K.A., Roses, A., and Weiner, M.P. (2000) A microsphere-based assay for multiplexed single nucleotide polymorphism analysis using single base chain extension. Genome Res. 10, 549–557.
49. Haff, L.A. and Smirnov, I.P. (1997) Single-nucleotide polymorphism identification assays using a thermostable DNA polymerase and delayed extraction MALDI-TOF mass spectrometry. Genome Res. 7, 378–388.
50. Hsu, T.M., Chen, X., Duan, S., Miller, R.D., and Kwok, P.-Y. (2001) Universal SNP genotyping assay with fluorescence polarization detection. BioTechniques 31, 560–570.
51. Törjek, O., Berger, D., Meyer, B.C., Müssig, C., Schmid, K.J., Sörensen, T.R., Weisshaar, B., Mitchell-Olds, T., and Altmann, T. (2003) Establishment of a high-efficiency SNP-based framework marker set for Arabidopsis. Plant J. 36, 122–140.
52. Landegren, U., Kaiser, R., Sanders, J., and Hood, L. (1988) A ligase-mediated gene detection technique. Science 241, 1077–1080.
53. Tobler, A.R., Short, S., Andersen, M.R., Paner, T.M., Briggs, J.C., Lambert, S.M., Wu, P.P., Wang, Y., Spoonde, A.Y., Koehler, R.T., Peyret, N., Chen, C., Broomer, A.J., Ridzon, D.A., Zhou, H., Hoo, B.S., Hayashibara, K.C., Leong, L.N., Ma, C.N., Rosenblum, B.B., Day, J.P., Ziegle, J.S., de la Vega, F.M., Rhodes, M.D., Hennessy, K.M., and Wenz, H.M. (2005) The SNPlex genotyping system: A flexible and scalable platform for SNP genotyping. J. Biomol. Tech. 16, 398–406.
54. Green, P. (1994) Phrap. Unpublished. www.phrap.org.
55. Gordon, D., Abajian, C., and Green, P. (1998) Consed: A graphical tool for sequence finishing. Genome Res. 8, 195–202.
56. Marth, G.T., Korf, I., Yandell, M.D., Yeh, R.T., Gu, Z.J., Zakeri, H., Stitziel, N.O., Hillier, L., Kwok, P.Y., and Gish, W.R. (1999) A general approach to single nucleotide polymorphism discovery. Nat. Genet. 23, 452–456.
57. Chagné, D., Batley, J., Edwards, D., and Forster, J.W. (2007) Single nucleotide polymorphisms genotyping in plants, in Association Mapping in Plants (Oraguzie, N.C., Rikkerink, E.H.A., Gardiner, S.E., and De Silva, H.N., eds.), Springer, NY, 77–94.

Chapter 3

Genetic Maps and the Use of Synteny

Chris Duran, David Edwards, and Jacqueline Batley

Summary

Genetic linkage maps represent the order of known molecular genetic markers along a given chromosome for a given species. This provides an insight into the organisation of a plant genome. In comparative genomics, synteny is the preserved order of genes on chromosomes of related species which results from descent from a common ancestor. Comparative mapping is a valuable technique to identify similarities and differences between species and enables the transfer of information from one map to another and assists in the reconstruction of ancestral genomes. This chapter demonstrates the application of online resources to identify candidate genes underlying a QTL, conduct genome comparisons, identify syntenic regions and view comparative genetic maps in grass and Brassica species.

Key words: Comparative mapping, CMap, Gramene, Single Nucleotide Polymorphism (SNP), Simple Sequence Repeat (SSR).

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513 © Humana Press, a part of Springer Science + Business Media, LLC 2009 DOI: 10.1007/978-1-59745-427-8_3

1. Introduction

1.1. Genetic Mapping

Insight into the organisation of a plant genome can be obtained by assembling a genetic linkage map using molecular markers. The use of molecular markers offers an opportunity to rapidly identify the genetic locations of large numbers of regions that govern important agronomic traits, and the resultant molecular genetic maps provide a means to link heritable traits with underlying genome sequence variation. Genetic mapping places markers on linkage groups based on their segregation in a population. Genetic maps can be constructed using molecular markers derived from coding or non-coding genome sequence. Markers such as amplified fragment length polymorphisms (AFLPs), simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs) allow rapid and precise analysis of germplasm, and trait mapping for marker-assisted breeding and selection (1–3).

Genetic maps are prepared by analysing segregating populations derived from crosses of genetically diverse parents, and estimating the recombination frequency among genetic loci. The individuals of the population can be derived by selfing or backcrossing, or produced as doubled haploid plants through microspore culture. The distance between the markers on a genetic map is related to the recombination frequency between the markers, with a greater frequency of recombination reflecting a greater genetic distance. Genetic distance is measured in centimorgans (cM), a unit proportional to the chance that a marker at one locus will be separated from a marker at another locus by a recombination event. Different chromosomal regions vary in their recombination frequency. Because of this, genetic maps cannot be used to measure physical distance between markers on the genome and only provide an approximation of physical distance, as well as a representation of marker order along the chromosome.

Genetic maps provide an insight into genome organisation, the evolution of species, synteny between related species and rearrangement across taxa (4). They can also be used for the identification of candidate genes for genetically mapped traits. Markers linked to heritable traits can be used for marker-assisted selection (MAS), potentially reducing the time for the breeding of improved varieties. Markers linked to traits may be used for map-based cloning of the underlying gene responsible for the trait (5).

A physical map is represented as an annotated chromosomal map using nucleotide bases to measure distance (6). These maps may be created from the assembly of DNA sequence from genome sequencing projects or through chromosome deletion or rearrangement analysis with physical markers.

1.2. Synteny

Synteny is the preserved order of genes on chromosomes of related species which results from descent from a common ancestor. A chromosomal region of one species is said to be syntenic with a chromosomal region in another species if the regions carry two or more homologous genes (7). During evolution, chromosome rearrangements result in disruptions of synteny. The analysis of synteny has several applications in genomics (8). Shared synteny is one of the most reliable criteria for establishing the orthology of genomic regions in different species. Additionally, exceptional conservation of synteny can reflect important functional relationships between genes.
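The working definition above, in which two regions are syntenic if they carry two or more homologous genes, can be expressed as a short sketch. It makes a simplifying assumption that whole chromosomes are treated as the regions being compared; the gene identifiers, chromosome names and homolog pairs are hypothetical.

```python
from collections import defaultdict

# Sketch of the "two or more shared homologous genes" criterion.
# Assumption: each gene is assigned to a whole chromosome, and chromosomes are
# the regions compared. All identifiers below are hypothetical.


def syntenic_region_pairs(location_a, location_b, homolog_pairs, min_shared=2):
    """Return region pairs that share at least `min_shared` homologous genes.

    location_a / location_b: {gene_id: region} for species A and species B.
    homolog_pairs: iterable of (gene_in_A, gene_in_B) homologs.
    """
    shared = defaultdict(list)
    for gene_a, gene_b in homolog_pairs:
        if gene_a in location_a and gene_b in location_b:
            regions = (location_a[gene_a], location_b[gene_b])
            shared[regions].append((gene_a, gene_b))
    return {regions: pairs for regions, pairs in shared.items()
            if len(pairs) >= min_shared}


LOC_A = {"a1": "A_chr1", "a2": "A_chr1", "a3": "A_chr2"}
LOC_B = {"b1": "B_chr3", "b2": "B_chr3", "b3": "B_chr3"}
HOMOLOGS = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]

print(syntenic_region_pairs(LOC_A, LOC_B, HOMOLOGS))
```

A single shared homolog, as for the hypothetical A_chr2 and B_chr3 pair here, does not meet the criterion. Finer-grained analyses would replace whole chromosomes with windows or blocks of neighbouring genes.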


Synteny between the genomes of different plant species was first characterised in grass species by Bevan and Murphy (9). While significant synteny has been identified across grass species (10–12), rearrangements and mutations over evolutionary time decrease the synteny between more distantly related species. With the development of advanced high-throughput genetic marker technologies and the increasing number of plant genome sequencing projects, a greater understanding of the relationship and evolution of plant genomes will become apparent. Analysis of synteny between species provides a greater understanding of genome structure and evolution, and can be used for the identification of markers and genes linked to important agronomic traits, where information from one species may be transferred to another related species (13). Detailed comparative analysis within the Brassicaceae has demonstrated the practical value of synteny between the sequenced genome of the model plant Arabidopsis and cultivated Brassica species (14). This comparison permits the colocation of related traits from different genetic maps and across different species. Comparisons between Brassica and Arabidopsis have identified significant regions of synteny and duplication. Lukens et al. (15) identified 34 syntenic regions between the Arabidopsis genome and a genetic map of Brassica oleracea, representing over 28% of the B. oleracea genetic map length. In a more recent study by Parkin et al. (16), syntenic blocks were identified covering almost 90% of the mapped length of the Brassica napus genome. Each conserved block contained on average 7.8 shared loci and had an average length of 14.8 cM in B. napus and 4.8 Mb in Arabidopsis.
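Block lengths expressed in centimorgans, such as those quoted above, are derived from recombination fractions estimated in a mapping population (Subheading 1.1). As a hedged illustration, the sketch below converts a recombination fraction into map distance using the Haldane and Kosambi mapping functions, two standard choices; the chapter does not state which function the cited studies used, and the progeny counts are invented.

```python
import math

# Converting a recombination fraction r into map distance in centimorgans.
# Haldane assumes no crossover interference; Kosambi allows for interference.
# Both are defined for 0 <= r < 0.5. The progeny counts used are invented.


def recombination_fraction(recombinants: int, total: int) -> float:
    """Proportion of progeny that are recombinant between two markers."""
    return recombinants / total


def haldane_cm(r: float) -> float:
    """Haldane map distance in cM."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must satisfy 0 <= r < 0.5")
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r: float) -> float:
    """Kosambi map distance in cM."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must satisfy 0 <= r < 0.5")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


r = recombination_fraction(20, 200)  # 20 recombinants among 200 progeny
print(f"r = {r:.2f}, Haldane = {haldane_cm(r):.2f} cM, "
      f"Kosambi = {kosambi_cm(r):.2f} cM")
```

For closely linked markers both functions give distances close to 100 × r, and they diverge as r approaches 0.5. None of these values translates directly into physical distance, because recombination frequency varies along the chromosome.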

2. Comparative Mapping

Comparative genetic mapping based on the alignment of chromosomes using common molecular markers helps researchers translate information from one map to another and allows the transfer of knowledge from one genome to another related genome (17, 18). Comparative mapping is of particular relevance to the breeding of the allotetraploid Brassica crops, where conservation between the three progenitor genomes permits transfer of knowledge to the more complex polyploids. RFLP and SSR markers are frequently applied for comparative genetic mapping since they are often transferable between related species. The linkage arrangement of markers can be compared between closely related species if the same molecular markers are used for genetic mapping. This has been demonstrated in Brassica, where it has been shown that the linear order of genes is conserved over a large evolutionary timescale between the amphidiploid AB and AC genomes and the diploid progenitor genomes. In a study by Axelsson et al. (19), two RFLP maps of Brassica juncea were developed and compared. One of the maps was generated using a synthetic B. juncea (a chromosome-doubled interspecific hybrid of Brassica rapa and Brassica nigra) crossed to a natural B. juncea. The second map was generated using two natural B. juncea cultivars. The comparison of these two maps showed that the genomic segments derived from the A and B genomes were perfectly conserved in the AB amphidiploid and that the two maps were collinear, showing that synteny can extend throughout the entire genome. The authors concluded that the genomes of B. juncea and its diploid progenitors have remained essentially unchanged since polyploidisation and speciation.

Comparative genetic mapping may be extended to more divergent species. Brassica species are in the same family as Arabidopsis thaliana, and these genera diverged ~15–21 million years ago (20). DNA sequences of homologous genes are similar between the two taxa. It is therefore possible to use RFLP probes from one species to map related loci in the other species. Comparative mapping in B. rapa, B. napus and Arabidopsis suggests possible single locations in A and C genome regions syntenic with resistance gene clusters on Arabidopsis chromosome 5 (21).

Comparative genetic mapping can be used to study the evolution of important agronomic genes between closely related species. This has been demonstrated in a study of Arabidopsis and the Brassica species B. nigra, B. oleracea, B. rapa and B. juncea, in which the genomic regions controlling flowering time revealed extensive duplication in the Brassica genome. Axelsson et al. (19) used QTL analysis to study the evolution of genes controlling flowering time in four genomes: A, B, AB and C. Comparative mapping showed that a chromosomal region from the top of chromosome 5 in Arabidopsis corresponded to six homoeologous copies in B. juncea. The segment in Arabidopsis contained three genes known to be important in flowering: CO (CONSTANS), FY and FLC (FLOWERING LOCUS C). CO encodes a putative transcription factor and is a regulator in the photoperiod promotion pathway (22), and FLC encodes a MADS box domain transcription factor and is a key regulator of the autonomous flowering pathway. QTLs were detected in three of these six replicated segments. Brassica CO gene homologs mapped close to the QTL peaks. FLC mapped further away for six of the seven QTLs, while FY was not tested. The flowering time QTLs were also mapped in B. nigra, B. oleracea and B. rapa, and the results suggested that the CO QTLs detected in the different species could be the result of duplicated copies of the same ancestral gene, probably the ancestor of CO.
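A comparison of two maps built with the same markers, of the kind described above for the two B. juncea maps, reduces to asking whether the shared markers occur in the same order. The following is a minimal sketch under that framing; the marker names and positions are invented and the function names are hypothetical.

```python
# Checking whether two linkage maps are collinear over their shared markers.
# Maps are {marker_name: position_in_cM}. A reversed order also counts as
# collinear, because the orientation of a linkage group is arbitrary.
# All marker names and positions are invented.


def shared_markers_in_order(map_a, map_b):
    """Markers present on both maps, sorted by their position on map_a."""
    return sorted(set(map_a) & set(map_b), key=lambda m: (map_a[m], m))


def is_collinear(map_a, map_b):
    """True if shared markers appear in the same (or fully reversed) order."""
    positions_b = [map_b[m] for m in shared_markers_in_order(map_a, map_b)]
    steps = list(zip(positions_b, positions_b[1:]))
    return all(x <= y for x, y in steps) or all(x >= y for x, y in steps)


MAP_1 = {"m1": 0.0, "m2": 12.5, "m3": 30.1, "m4": 47.8}
MAP_2 = {"m1": 2.0, "m2": 15.0, "m3": 28.4, "m5": 60.0}
MAP_REARRANGED = {"m1": 2.0, "m2": 40.0, "m3": 15.0}

print(shared_markers_in_order(MAP_1, MAP_2))
print(is_collinear(MAP_1, MAP_2), is_collinear(MAP_1, MAP_REARRANGED))
```

Map distances between the same markers usually differ between populations, so the check deliberately compares marker order only and ignores the absolute positions.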


3. Materials

3.1. Gramene

Gramene is an online comparative mapping database for rice and related grass species (23). Gramene contains information on cereal genomic and EST sequences, genetic maps, relationships between maps, details of rice mutants and molecular genetic markers. It incorporates a version of CMap, which can display and compare physical and genetic maps, markers and traits (see Subheading 3.2). CMap can draw comparisons between maps, providing insight into syntenic regions and enabling comparative genetic mapping. Gramene includes maps of rice, maize, barley, wheat and oat, which are anchored by a set of curated correspondences.

3.2. CMap

CMap is one of the most powerful tools for viewing and comparing genetic and physical maps and has been applied successfully for comparison of genetic maps within and between related grass species (24, 25). It was originally developed for the Gramene project (http://www.gramene.org/CMap/). This tool has been further applied for the comparison of genetic maps from different Brassica species (26). CMap can display genetic maps and identify syntenic regions by comparing maps where there is correspondence between markers.
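The idea of comparing maps through marker correspondences can be sketched in a few lines. This is not CMap code and does not use its database or API: it assumes, as a simplification, that a correspondence exists whenever two maps carry a marker with the same name, and it uses invented marker names and positions. The cropping function is a hypothetical stand-in for limiting a map to a start and stop position.

```python
# Listing marker correspondences between two maps, optionally after cropping
# one map to a region of interest. Maps are {marker_name: position}.
# Assumption: identical names define a correspondence. All data are invented.


def crop(map_features, start, stop):
    """Keep only the features positioned between start and stop (inclusive)."""
    return {name: pos for name, pos in map_features.items()
            if start <= pos <= stop}


def correspondences(map_left, map_right):
    """Return (marker, left_position, right_position), ordered along map_left."""
    shared = set(map_left) & set(map_right)
    rows = [(name, map_left[name], map_right[name]) for name in shared]
    return sorted(rows, key=lambda row: (row[1], row[0]))


QTL_MAP = {"mkA": 3.0, "mkB": 9.5, "mkC": 40.0}
REFERENCE_MAP = {"mkA": 1.2, "mkB": 11.8, "mkC": 55.0, "mkD": 5.0}

window = crop(REFERENCE_MAP, 0, 13)
for name, left, right in correspondences(QTL_MAP, window):
    print(f"{name}: {left} <-> {right}")
```

Each returned row corresponds to one line that a comparative map viewer would draw between the two maps; markers outside the cropped window, such as the hypothetical mkC here, drop out of the comparison.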

4. Methods

4.1. Identification of Candidate Genes Underlying a QTL for Bacterial Blight Disease Resistance Trait in Rice

This first example will show how we can use a resource such as Gramene and CMap to identify candidate genes underlying a QTL.

1. Go to the Gramene website (http://www.gramene.org). Along the top menu header, go to ‘Search’ and choose QTL from the drop-down list. From the options displayed, select ‘Simple Search’.
2. Click on ‘Biotic stress’ from the Browse by Trait Category section.
3. Identify ‘blast disease resistance’ in the list of trait names, and select ‘view’ on the rightmost column (see Note 1).
4. In the resulting search table, select QTL accession ID AQAF001 (see Note 2). This will display the Gramene QTL entry display (Fig. 1). Under the tab Map Positions (see Note 3), there is a listing for a QTL map ‘CNHZAU Zh97/Ming63 RI QTL 2002’. Click ‘View Comparative Map’ to see this map.
5. The resulting image (Fig. 2) shows the QTL map, with the selected QTL highlighted. You will note there are two markers linked to this QTL, C161 and R753.
6. Scroll down to the ‘Map Options’ section, and click on ‘Add Maps Right’. From the drop-down set, choose ‘Genetic: Rice – JRGP RFLP 2000 [2]’ (see Note 4). Then, from the submenu, choose linkage group 1 ‘1 [53,53]’ and click ‘Add Maps’.


Fig. 1. The Gramene QTL entry display.

7. This displays the markers from the reference map linking to the genetic map. CMap allows the user to limit the view, effectively homing in on regions that are most relevant to the researcher. Go to map options and limit the recently added map by entering ‘0’ in the Start row, and ‘13’ in the Stop row. Clicking redraw should give you a map similar to Fig. 3.
8. Figure 3 shows that both markers are linked to the annotated genes drp1 and fs2. Marker R753 is also marking a position in the genes d2 and a18. Clicking on these genes will take you to the feature display for the entry (see Note 5).
9. To see where these markers align on the physical genome sequence, click ‘Add Maps Right’ again, and this time add ‘Sequence: Rice – Gramene Annot Seq 2006 [2]’, selecting chromosome 1. Limit the view by setting start and stop at 100,000 and 2,000,000, respectively. Finally, expand ‘feature options’, set clone and Gene Prediction to ‘ignore’ and press ‘Redraw’.
10. The resulting image (Fig. 4) depicts how the markers map to the physical map of chromosome 1. Notice that it shows the QTL range for AQAF001 (AQAF001-BLRS) to the right, and suggests several more candidate genes for study (COIN, bh1h71, bh1h125, osa-MIR159b, P56-D5). It also shows other rice blast-related QTL markers (AQAF002, AQAF003), which may be of interest.

Fig. 2. CMap representation of the rice QTL map with the selected QTL highlighted.

Fig. 3. CMap representation of the markers linked to annotated genes relating to the selected QTL.

Fig. 4. CMap representation depicting the rice physical map positions for markers under the selected QTL.

4.2. Use of Homologous Markers Between Rice and Barley to Identify Traits that may be Associated with a Given Barley Molecular Marker

1. Go to the Gramene website (http://www.gramene.org). Along the top menu header, go to ‘Search’ and click on Markers from the drop-down list.


2. Select ‘Markers Search:’.
3. In the ‘Find’ box, enter ‘ABG391’, select RFLP from the ‘Type’ drop-down menu and click on the search button. This will display the Gramene Marker entry display (Fig. 5).
4. Open the Map Positions section (see Note 3) and there is a listing for a Hordeum vulgare Genetic map called Barley consensus 2003. Click ‘View Comparative Map’ to see this map. The resulting map (Fig. 6) shows linkage group 5H from the barley consensus genetic map, with the selected marker highlighted.
5. Go to the ‘Map Options’ section, and click on ‘Add Maps Right’. From the drop-down set, choose ‘Genetic: Rice – JRGP RFLP 2000’, linkage group 3 and click ‘Add Maps’.
6. This map has a large number of gene annotations, so limit the rice map by setting start to ‘130’ and stop to ‘160’. This will result in the map shown in Fig. 7. This map shows the homologous marker B240 on the rice genetic map. This map is annotated with a variety of gene annotations in rice.

Fig. 5. The Gramene marker entry display.


Fig. 6. CMap representation of linkage group 5H from a barley genetic map with the selected marker highlighted.


Fig. 7. CMap representation of the region between 130 and 160 cM of the rice genetic map, depicting correspondence to the barley genetic map.


7. Follow the links through the gene labelled Aox4. This gene is an alternative oxidase homologue, which is associated with salt and dehydration tolerance.

4.3. Identifying Regions of Synteny Between Two Species by Graphically Browsing Syntenic Sections of Chromosomes Using EnsEMBL SyntenyView

1. Go to the Gramene website (http://www.gramene.org). Along the top menu header, go to ‘Genomes’ and select ‘Oryza sativa ssp japonica’ from the drop-down list.
2. In the Rice Synteny Vs Maize FPC Map section, choose ‘Rice Chr 1 versus Maize’ and click ‘Go’.
3. The resulting image is generated using EnsEMBL’s SyntenyView software (see Note 6). It shows regions of synteny between rice chromosome 1 and different maize chromosomes (Fig. 8).

Fig. 8. Rice EnsEMBL SyntenyView demonstrating syntenic regions between rice and maize.


4. Clicking on a coloured ‘synteny block’ (see Fig. 8) will take you to the EnsEMBL viewer page for the selected section, which includes the syntenic regions at the top to allow switching between the two species.
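The idea behind the synteny blocks browsed in steps 3 and 4 can be illustrated with a small sketch. It is not how Gramene or EnsEMBL SyntenyView computes synteny; it only groups markers shared by two maps into runs that keep the same order on both. The function names, marker names and map positions below are hypothetical and invented for illustration.

```python
from typing import Dict, List, Tuple

# (marker name, position on map A, position on map B)
Pair = Tuple[str, float, float]


def shared_markers(map_a: Dict[str, float], map_b: Dict[str, float]) -> List[Pair]:
    """Return markers present on both maps, ordered by their position on map A."""
    common = set(map_a) & set(map_b)
    return sorted(((m, map_a[m], map_b[m]) for m in common),
                  key=lambda p: (p[1], p[0]))


def collinear_blocks(pairs: List[Pair], min_markers: int = 2) -> List[List[Pair]]:
    """Split ordered marker pairs into runs whose map-B positions keep one direction."""
    blocks: List[List[Pair]] = []
    current: List[Pair] = []
    direction = 0  # +1 rising on map B, -1 falling, 0 not yet known
    for pair in pairs:
        if current:
            step = pair[2] - current[-1][2]
            sign = (step > 0) - (step < 0)
            if sign != 0 and direction != 0 and sign != direction:
                # Order on map B reverses: close the current run and start a new one.
                blocks.append(current)
                current, direction = [], 0
            elif sign != 0:
                direction = sign
        current.append(pair)
    if current:
        blocks.append(current)
    return [b for b in blocks if len(b) >= min_markers]


# Hypothetical marker positions (cM), invented purely for illustration.
example_map_a = {"m1": 1.0, "m2": 5.0, "m3": 9.0, "m4": 14.0, "m5": 20.0, "a_only": 3.0}
example_map_b = {"m1": 40.0, "m2": 44.0, "m3": 50.0, "m4": 12.0, "m5": 8.0, "b_only": 1.0}

for block in collinear_blocks(shared_markers(example_map_a, example_map_b)):
    print([name for name, _, _ in block])
```

With the invented positions above, the sketch reports two runs, m1 to m3 and m4 to m5, the second lying in the opposite orientation on map B. Real synteny analyses work from sequence alignments or much denser marker sets and tolerate local rearrangements, which this toy does not.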

Notes

1. At this stage you can also choose to filter by species, by selecting your species of interest from the ‘Species’ box. The default is ‘all species’.
2. You may need to browse several pages to find the accession ID. Alternatively, click the ‘QTL Accession ID’ header of the result list. This will sort the results by accession ID.
3. You may need to expand this section to view it.
4. The QTL map shows the related markers to be RFLP markers.
5. Following the links through fs2 will tell you that the gene name is fine stripe-2, and that it is characterised by white and fine speckles in leaves caused by a chlorophyll deficiency.
6. The EnsEMBL browser allows biological information to be anchored as features of a genome sequence. The browser provides a comprehensive view of the complete annotated genome allowing ease of navigation between data sets.

References

1. Gupta, P.K., Roy, J.K., and Prasad, M. (2001) Single nucleotide polymorphisms: A new paradigm for molecular marker technology and DNA polymorphism detection with emphasis on their use in plants. Curr. Sci. 80, 524–535.
2. Rafalski, A. (2002) Applications of single nucleotide polymorphisms in crop genetics. Curr. Opin. Plant Biol. 5, 94–100.
3. Batley, J. and Edwards, D. (2007) SNP applications in plants, in Association Mapping in Plants (Oraguzie, N.C., Rikkerink, E.H.A., Gardiner, S.E., and De Silva, H.N., eds.), Springer, New York, NY, 95–102.
4. Choi, S.R., Teakle, G.R., Plaha, P., Kim, J.H., Allender, C.J., Beynon, E., Piao, Z.Y., Soengas, P., Han, T.H., King, G.J., Barker, G.C., Hand, P., Lydiate, D.J., Batley, J., Edwards, D., Koo, D.H., Bang, J.W., Park, B.-S., and Lim, Y.P. (2007) The reference genetic linkage map for the multinational Brassica rapa genome sequencing project. Theor. Appl. Genet. 115, 777–792.
5. Edwards, D., Salisbury, P.A., Burton, W.A., Hopkins, C.J., and Batley, J. (2007) Indian mustard, in Genome Mapping and Molecular Breeding in Plants. Vol II Oilseeds (Kole, C., ed.), Springer, Berlin, 179–210.
6. Cullis, C.A. (2007) Flax, in Genome Mapping and Molecular Breeding in Plants. Vol II Oilseeds (Kole, C., ed.), Springer, Berlin, 275–296.
7. Miller, R. (1997) Linkage mapping of plant and animal genomes, in Genome Mapping (Dear, P.H., ed.), IRL Press, Oxford, 27–48.
8. McCouch, S.R. (2001) Genomics and synteny. Plant Physiol. 125, 152–155.
9. Bevan, M. and Murphy, G. (1999) The small, the large and the wild – the value of comparison in plant genomics. Trends Genet. 15, 211–214.
10. Devos, K.M. (2005) Updating the Crop circle. Curr. Opin. Plant Biol. 8, 155–162.
11. Feuillet, C. and Keller, B. (2002) Comparative genomics in the grass family: Molecular characterization of grass genome structure and evolution. Ann. Bot. (Lond) 89, 3–10.
12. Nadeau, J.H. and Sankoff, D. (1998) Counting on comparative maps. Trends Genet. 14, 495–501.
13. Wicker, T., Stein, N., Albar, L., Feuillet, C., Schlagenhauf, E., and Keller, B. (2001) Analysis of a contiguous 211 kb sequence in diploid wheat (Triticum monococcum L.) reveals multiple mechanisms of genome evolution. Plant J. 26, 307–316.
14. Mayerhofer, R., Wilde, K., Mayerhofer, M., Lydiate, D., Bansal, V.K., Good, A.G., and Parkin, I.A. (2005) Complexities of chromosome landing in a highly duplicated genome: Toward map-based cloning of a gene controlling blackleg resistance in Brassica napus. Genetics 171, 1977–1988.
15. Lukens, L., Zou, F., Lydiate, D., Parkin, I., and Osborn, T. (2003) Comparison of a Brassica oleracea genetic map with the genome of Arabidopsis thaliana. Genetics 164, 359–372.
16. Parkin, I.A., Gulden, S.M., Sharpe, A.G., Lukens, L., Trick, M., Osborn, T.C., and Lydiate, D.J. (2005) Segmental structure of the Brassica napus genome based on comparative analysis with Arabidopsis thaliana. Genetics 171, 765–781.
17. Chao, S., Sharp, P.J., Worland, A.J., Warham, E.J., Koebner, R.M.D., and Gale, M.D. (1989) RFLP-based genetic maps of wheat homologous group-7 chromosomes. Theor. Appl. Genet. 78, 495–504.
18. Moore, G., Devos, K.M., Wang, Z., and Gale, M.D. (1995) Cereal genome evolution – Grasses, line up and form a circle. Curr. Biol. 5, 737–739.
19. Axelsson, T., Bowman, C.M., Sharpe, A.G., Lydiate, D.J., and Lagercrantz, U. (2000) Amphidiploid Brassica juncea contains conserved progenitor genomes. Genome 43, 679–688.
20. Koch, M., Haubold, B., and Mitchell-Olds, T. (2000) Evidence for homology of flowering time genes VFR2 from Brassica rapa and FLC from Arabidopsis thaliana. Theor. Appl. Genet. 102, 425–430.
21. Kole, C., Williams, P.H., Rimmer, S.R., and Osborn, T.C. (2002) Linkage mapping of genes controlling resistance to white rust (Albugo candida) in Brassica rapa (syn. campestris) and comparative mapping to Brassica napus and Arabidopsis thaliana. Genome 45, 22–27.
22. Osborn, T. and Lukens, L. (2003) The molecular genetic basis of flowering time variation in Brassica species, in Brassicas and Legumes, from Genome Structure to Breeding (Nagata, T. and Tabata, S., eds.), Springer, Berlin, 69–86.
23. Ware, D.H., Jaiswal, P., Ni, J., Yap, I.V., Pan, X., Clark, K.Y., Teytelman, L., Schmidt, S.C., Zhao, W., Chang, K., Cartinhour, S., Stein, L.D., and McCouch, S.R. (2002) Gramene, a tool for grass genomics. Plant Physiol. 130, 1606–1613.
24. Gonzales, M.D., Archuleta, E., Farmer, A., Gajendran, K., Grant, D., Shoemaker, R., Beavis, W.D., and Waugh, M.E. (2005) The legume information system (LIS): An integrated information resource for comparative legume biology. Nucleic Acids Res. 33, D660–D665.
25. Jaiswal, P., Ni, J., Yap, I., Ware, D., Spooner, W., Youens-Clark, K., Ren, L., Liang, C., Zhao, W., Ratnapu, K., Faga, B., Canaran, P., Fogleman, M., Hebbard, C., Avraham, S., Schmidt, S., Casstevens, T.M., Buckler, E.S., Stein, L., and McCouch, S. (2006) Gramene: A bird’s eye view of cereal genomes. Nucleic Acids Res. 34, D717–D723.
26. Lim, G.A.C., Jewell, E.G., Li, X., Erwin, T.A., Love, C., Batley, J., Spangenberg, G., and Edwards, D. (2007) A comparative map viewer integrating genetic maps for Brassica and Arabidopsis. BMC Plant Biol. 7, 40.

Chapter 4

A Simple TAE-Based Method to Generate Large Insert BAC Libraries from Plant Species

Bu-Jun Shi, J. Perry Gustafson, and Peter Langridge

Summary

Large insert libraries are valuable tools for the positional cloning of genes of interest, physical mapping of chromosomes, comparative genomics, and molecular breeding. There are five types of large DNA insert libraries: cosmid, yeast artificial chromosome (YAC), bacteriophage P1, bacterial artificial chromosome (BAC), and P1-derived artificial chromosome (PAC) libraries. Of these, BAC libraries are the most widely used due to their ease of manipulation, large insert size, and stability. This chapter reports on a simplified method for plant BAC library construction. This method involves isolation and partial digestion of intact nuclei, selection of appropriately sized DNA via pulsed-field gel (PFG) electrophoresis, elution of DNA from agarose gels, ligation of DNA into the BAC vector, electroporation of the ligation mix into Escherichia coli cells, and estimation of insert sizes. The whole process takes 1–3 months depending on the genome size and coverage required. We have used this method to produce BAC libraries from different plant species including sunolgrass (Phalaris coerulescens L.), barley (Hordeum vulgare L.), lupin (Lupinus angustifolius L.) and rye (Secale cereale L.).

Key words: BAC library, Phalaris, Barley, Rye, pIndigoBAC-5, Large insert.

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513 © Humana Press, a part of Springer Science + Business Media, LLC 2009 DOI: 10.1007/978-1-59745-427-8_4

1. Introduction

Large deoxyribonucleic acid (DNA) insert libraries are essential for positional cloning, physical mapping, genome sequencing, and comparative genomics. There are five types of high-capacity vectors used to construct large insert libraries. These are cosmids, yeast artificial chromosomes (YACs), bacteriophage P1, bacterial artificial chromosomes (BACs) and P1-derived artificial chromosomes (PACs). Cosmid libraries were first created in 1978 (1) and can contain DNA insert sizes of up to 50 kb. Cosmid clones are packaged into phage λ particles, which can be transfected into a bacterial host strain for propagation. Cosmid clones are not stable in vivo. YAC libraries were first generated in 1987 (2) and can have insert sizes up to 3,000 kb. YAC clones are propagated in yeast cells but they also tend to be unstable. Bacteriophage P1 libraries were first created in 1990 (3) and can have inserts as large as 100 kb. P1 clones are transduced into a bacterial host strain for replication and are stable in vivo. The first BAC libraries were produced in 1992 (4) and can have insert sizes of up to 350 kb. Despite their name, BACs are not true artificial chromosomes. They are derived from the Escherichia coli F-factor plasmid, which contains four essential genes (parA, parB, OriS and RepE) for strict copy number control and unidirectional DNA replication (5). Both features promote plasmid maintenance and stability. Thus, BAC clones are stable in vivo. BAC clones are transformed and propagated in bacterial cells, and use the lacZ gene for positive clone selection. PAC libraries were first developed in 1994 on the basis of P1 and BACs, and therefore PACs share many features with BACs (6). For example, PAC clones are also transformed and propagated in bacterial cells, and are stable in vivo. However, PACs have a lower efficiency in shotgun cloning than BACs. In addition, PACs contain smaller inserts than BACs. Moreover, PACs use the sacB gene for positive clone selection. Of the vectors listed above, BACs are the most advantageous and are especially easy to handle. BAC DNA can be easily purified and is straightforward to use as a template for direct end-sequencing (7). Therefore, BACs are currently the most widely used vector for the construction of large DNA insert libraries. Many large insert BAC libraries have been constructed from different plant species.
As the costs for positional cloning, physical mapping and genome sequencing decrease, the demand for BAC libraries will continue to increase. Recently, we constructed several plant BAC libraries, including one from rye (Secale cereale L.) (8). Rye is an important cereal in terms of its ability to produce a crop under various abiotic stresses such as drought, cold, and saline and acid soils (9). The construction of plant BAC libraries involves the following steps: preparation of vector (this step can be skipped if a ready-to-use vector is purchased), isolation of nuclei from plants, partial digestion of megabase DNA, size selection of high molecular weight (HMW) DNA, ligation of vector and HMW DNA, electroporation of ligation mix into E. coli cells, minipreparation of plasmid DNAs, colony picking into 384-well plates and finally the storage of BAC clones. The first six steps are crucial for successful library construction. The final steps are important for maintaining and storing a good quality BAC library. The whole process takes 1–3 months depending on the genome size and especially the desired genome coverage. Constructing a BAC library is therefore both more technically demanding than constructing a general library and time consuming. Over the last few years, many methods have been developed for the construction of BAC libraries (5, 10–14). However, these approaches share common problems such as small insert sizes, low transformation efficiency and high empty vector background. We have attempted to combine and modify these techniques in order to solve these common BAC library construction problems and to maximise the presence of very large BAC fragments in the library. First, we applied various separation conditions for maximising large-size selection of DNA fragments. These modifications remove small DNA fragments while reducing degradation of large DNA fragments. Second, we used Tris-acetate EDTA (TAE) buffer instead of Tris-borate EDTA (TBE) buffer, which made a dialysis step unnecessary, thereby avoiding any loss or degradation of large DNA fragments eluted from an agarose gel during dialysis. Third, we stored the ligation on ice, which helped maintain insert size and transformation efficiency. The above modifications were demonstrated to be effective. We have used this modified method to successfully construct several good quality BAC libraries. This chapter describes in detail the procedures of this modified method for the construction of plant BAC libraries containing a significant number of large BAC DNAs. For all procedures used in BAC library construction, the highest grade of chemicals available should be used. In addition, the use of deionised water will ensure that every laboratory using the techniques will be utilising the same quality of water. The equipment used was available in our laboratory, and is not necessarily the best equipment. Many other brands and models will work equally well.
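Because the time required depends on genome size and the desired coverage, it can help to estimate how many clones a library needs before starting. A minimal sketch follows, using the widely used Clarke and Carbon relationship N = ln(1 − P) / ln(1 − I/G), where P is the probability of recovering any given sequence, I the mean insert size and G the haploid genome size. This formula is not taken from this chapter, and the genome and insert sizes in the example are hypothetical placeholders rather than values for any species mentioned here.

```python
import math


def clones_required(genome_size_bp: float, mean_insert_bp: float,
                    probability: float = 0.99) -> int:
    """Clones needed so that any given sequence is present with the stated probability."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    if not 0.0 < mean_insert_bp < genome_size_bp:
        raise ValueError("insert size must be positive and smaller than the genome")
    fraction = mean_insert_bp / genome_size_bp
    return math.ceil(math.log(1.0 - probability) / math.log(1.0 - fraction))


def fold_coverage(n_clones: int, mean_insert_bp: float, genome_size_bp: float) -> float:
    """Genome equivalents represented by a library of n_clones."""
    return n_clones * mean_insert_bp / genome_size_bp


# Hypothetical example: a 5,000 Mb genome and 120 kb mean insert size.
GENOME_BP = 5_000e6
INSERT_BP = 120e3
n = clones_required(GENOME_BP, INSERT_BP, probability=0.99)
print(n, "clones,", round(fold_coverage(n, INSERT_BP, GENOME_BP), 2), "genome equivalents")
```

Dividing the clone number by 384 gives a rough count of the plates that will need to be picked and stored.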

2. Materials

2.1. BAC Vector Preparation

1. The BAC vector can be purchased as the ready-to-use pIndigoBAC-5 vector (Fig. 1) from Epicentre Biotechnologies (Epicentre, Madison, WI, USA). pIndigoBAC-5 is 7.5 kb in size and derived from pBeloBAC11 (Fig. 1). This vector consists of the repE, parA, parB, and parC elements from the F factor of E. coli, a gene for chloramphenicol resistance, a bacteriophage cosN site, a bacteriophage P1 loxP site and a multiple cloning site that lies within the lacZ gene for colour selection for positive clones. The multiple cloning site is flanked by NotI sites, which allow for the easy excision of the vector insert.

Fig. 1. Diagram of pBeloBAC11 and pIndigoBAC-5 bacterial artificial chromosome (BAC) vectors. pIndigoBAC-5 (top) is the first cloning-ready BAC vector to become commercially available. This vector is derived from pIndigoBAC (not shown) and pBeloBAC11 (bottom), the latter of which is the most widely used BAC vector. pIndigoBAC-5 has two unique cloning sites, BamHI and HindIII, flanked by NotI sites, which allow for the easy excision of the vector insert. The vector contains a mutation within the lacZ gene, which enhances blue colour. The vector also contains parA, parB, parC and RepE genes, which control copy number and direction of deoxyribonucleic acid (DNA) replication, and a chloramphenicol-resistance gene, ChlR, for antibiotic selection of transformants. The complete sequence of pIndigoBAC-5 is available at www.epicentre.com.

2. Electroporation-competent DH10B cells can be purchased from Invitrogen (Invitrogen, San Diego, CA, USA; sold as ElectroMAX DH10B). This strain [F–, endA1, recA1, galU, galK, deoR, nupG, rpsL, ΔlacX74, Φ80lacZΔM15, araD139, Δ(ara, leu)7697, mcrA, Δ(mrr-hsdRMSmcrBC), λ–] has mutations that block restriction of foreign DNA by endogenous restriction endonucleases, restriction of DNA containing methylated DNA, and recombination, and it takes up large DNA fragments.
3. Luria-Bertani (LB) medium (10 g/L bacto-tryptone, 5 g/L bacto-yeast extract, 10 g/L NaCl). Adjust to pH 7.5. Autoclave.
4. Restriction enzymes and 10× restriction buffers (New England Biolabs, Ipswich, MA, USA).
5. DNA ladder (HyperLadder I) (Bioline, Alexandria, NSW, Australia).
6. Heat-Killable (HK) phosphatase, Tris-acetate (TA) buffer and 100 mM CaCl2 (Epicentre).
7. T4 DNA ligase and 10× T4 DNA ligase buffer (New England Biolabs).
8. 10× TBE buffer [890 mM Tris-borate, 890 mM boric acid, 20 mM ethylenediaminetetraacetic acid (EDTA), pH 8.3]. Autoclave. Store at room temperature.
9. 50× TAE buffer (2 M Tris-acetate, 50 mM EDTA, pH 8.3). Autoclave. Store at room temperature.
10. 6× gel loading buffer: 0.25% (w/v) bromophenol blue and 40% (w/v) sucrose in Tris-EDTA (TE) (pH 8.0) buffer. Autoclave. Store at room temperature.
11. Chloramphenicol (Sigma-Aldrich, St. Louis, MO, USA).
12. Ethidium bromide (EtBr) (Sigma-Aldrich).
13. Glycerol (Sigma-Aldrich).
14. Agarose (Sigma-Aldrich).
15. Qiagen Plasmid Midi Kit (Qiagen, Valencia, CA, USA).
16. MinElute Gel Extraction Kit (Qiagen).
17. 1-L flasks, 250-ml Falcon tubes, 1.5-ml microcentrifuge tubes and sterile razor blades.
18. Electroporator (Model GenePulserXcell) (Bio-Rad, Hercules, CA, USA).
19. 37°C thermostat shaker (Model Amper Chart Multitron II) (INFORS AG, Bottmingen, Switzerland).
20. 37°C thermostat incubator (S.E.M., Adelaide, SA, Australia).
21. Gel apparatus (Model SUB-CELL GT) (Bio-Rad).


22. Ultraviolet (UV) transilluminator (Model TFX-200M and TFX-35M UV) (Gibco BRL, Melbourne, Vic., Australia).
23. Spectrophotometer (Model UV-160A) (SHIMADZU, Kyoto, Japan).
24. Microcentrifuge (Model 5415D) (Eppendorf, Hamburg, Germany).
25. GenePulser cuvettes (0.1 cm electrode gap) (Bio-Rad).

2.2. Plant Tissue Preparation

1. Seeds (disinfecting the seed surfaces with a 20% sodium hypochlorite solution for 10 min helps minimise any fungal growth).
2. Pots and soil (disinfect pots and soil to help prevent fungal growth).
3. Temperature-controlled glasshouse.
4. Sterile scissors and clean plastic bags.
5. −80°C freezers.

2.3. Nuclei Isolation from Plants and Megabase DNA Agarose Plug Preparation

1. 10× Homogenisation buffer (HB) stock: 0.1 M Trizma base, 0.8 M KCl, 0.1 M EDTA, 10 mM spermidine and 10 mM spermine. Adjust pH to 9.4–9.5 with NaOH. Store the stock at 4°C.
2. Wash buffer: 1× HB plus autoclaved 0.5 M sucrose and 0.5% Triton X-100. Store at 4°C. Add β-mercaptoethanol to 0.15% before use and place on ice for use.
3. Suspension buffer: 1× HB. Store at 4°C.
4. Lysis buffer: 0.5 M EDTA and 1% sodium lauryl sarcosine. Adjust pH to 9.0–9.3 with NaOH. Autoclave. Store at room temperature. Add proteinase K (Sigma-Aldrich) to 0.1–1 mg/ml before use.
5. Phenylmethylsulfonyl fluoride (PMSF) (Sigma-Aldrich). Store as 50 mM stock solution in isopropanol at 4°C, but use at a final concentration of 0.1 mM.
6. 0.5 M EDTA (pH 9.0–9.3). Autoclave. Store at room temperature.
7. 0.05 M EDTA (pH 8.0). Autoclave. Store at room temperature.
8. TE [10 mM Tris-HCl (pH 8.0) and 1 mM EDTA (pH 8.0)]. Autoclave. Store at room temperature.
9. Low melting temperature (LMT) agarose (SeaPlaque) (Cambrex Bio Science Rockland, Rockland, ME, USA).
10. Mortars, pestles, liquid nitrogen, 1-L beakers, ice and ice boxes, funnels, small paintbrush, 50-ml Falcon tubes and 250-ml Falcon tubes.
11. Miracloth (Calbiochem, La Jolla, CA, USA) and Kimwipes (Kimberly-Clark, Milsons Point, NSW, Australia).
12. Magnetic stirrer (IEC, Melbourne, Vic., Australia).
13. Centrifuge (Model Avanti J-E) (Beckman Coulter, Palo Alto, CA, USA).
14. Plug moulds (Bio-Rad).
15. Thermostat shaking waterbath (RATEK Instruments, Boronia, Vic., Australia).

2.4. Partial Digestion of Megabase DNA Agarose Plugs and Size Selection

1. Restriction enzymes and 10× restriction buffers.
2. 0.5 M EDTA (pH 8.0).
3. 1× TAE buffer.
4. Partial digestion buffer: 128 µl 10× restriction buffer, 16 µl bovine serum albumin (BSA) (10 mg/ml), 64 µl 40 mM spermidine, 1.6 µl 1 M dithiothreitol (DTT), and 1,070 µl H2O.
5. Agarose.
6. Ethidium bromide (EtBr).
7. λ ladder PFG marker (New England Biolabs).
8. Megabase DNA agarose plugs.
9. Ice, ice boxes, a ruler, sterile razor blades, sterile Petri dish plates, 1.5-ml microcentrifuge tubes and sterile small spatulas.
10. CHEF Mapper XA Pulsed-Field Gel (PFG) Electrophoresis System (Bio-Rad).
11. UV transilluminator (Model GeneFlash) (Syngene, Frederick, MD, USA).
12. 37°C waterbath (Contherm Scientific, Petone, New Zealand).
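The partial digestion buffer in item 4 is given only as volumes. As a quick check when scaling the recipe, the sketch below works out the total volume and the final concentration of each component in the mix as prepared, before any agarose plugs are added. The dictionary layout and function names are illustrative assumptions; only the volumes and stock concentrations come from item 4.

```python
# Component: (volume in µl, stock concentration, unit), as listed in item 4.
RECIPE = {
    "restriction buffer": (128.0, 10.0, "x"),
    "BSA": (16.0, 10.0, "mg/ml"),
    "spermidine": (64.0, 40.0, "mM"),
    "DTT": (1.6, 1000.0, "mM"),  # 1 M stock expressed in mM
    "water": (1070.0, 0.0, ""),
}


def total_volume_ul(recipe):
    """Sum of all component volumes in microlitres."""
    return sum(volume for volume, _, _ in recipe.values())


def final_concentrations(recipe):
    """Final concentration of each non-water component after mixing (C1 * V1 / V_total)."""
    total = total_volume_ul(recipe)
    return {
        name: (volume * stock / total, unit)
        for name, (volume, stock, unit) in recipe.items()
        if stock > 0
    }


print(f"total volume: {total_volume_ul(RECIPE):.1f} µl")
for name, (conc, unit) in final_concentrations(RECIPE).items():
    print(f"{name}: {conc:.3g} {unit}")
```

The volumes add up to about 1.28 ml, giving roughly 1× restriction buffer, 0.125 mg/ml BSA, 2 mM spermidine and 1.25 mM DTT.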

2.5. DNA Elution from Agarose Gels

1. λ DNA (Promega, Madison, WI, USA).
2. EtBr.
3. Agarose.
4. 1× TAE buffer.
5. 6× gel loading buffer.
6. Electro DNA eluter (Model 422) (Bio-Rad).
7. Gel apparatus (Mini-Sub-Cell) (Bio-Rad).
8. UV transilluminator.
9. Power PAC 300 (Bio-Rad).

2.6. Ligation of Vector and DNA

1. T4 DNA ligase and 10× T4 DNA ligase buffer.
2. 16°C waterbath (Julabo Labortechnik, Seelbach, Germany).


2.7. Transformation of Ligation into E. coli DH10B-Competent Cells

1. 100-mm diameter Petri dish LB agar plates with 12.5 µg/ml chloramphenicol, 80 µg/ml 5-bromo-4-chloro-3-indolyl-β-D-galactoside or 5-bromo-4-chloro-3-indolyl-β-D-galactopyranoside (X-gal) and 100 µg/ml isopropyl-β-D-thiogalactoside (IPTG).
2. ElectroMAX DH10B competent cells.
3. Super-optimal broth with catabolite repression (SOC) medium (20 g/L bacto-tryptone, 5 g/L bacto-yeast extract, 0.5 g/L NaCl, 2.5 mM KCl). Adjust pH to 7.0 with NaOH. Autoclave. Add filter-sterilised MgSO4 to 10 mM, MgCl2 to 10 mM and glucose to 20 mM before use.
4. 10-ml culture tubes and sterile plastic spreaders.
5. GenePulserXcell electroporator.
6. 37°C thermostat shaker.
7. 37°C thermostat incubator.
8. GenePulser cuvettes (0.1 cm electrode gap).

2.8. Estimation of Insert Size

1. LB with 12.5 µg/ml chloramphenicol.
2. P1, P2, and P3 buffers from Plasmid Miniprep Kit (Qiagen).
3. Isopropanol and ethanol.
4. NotI restriction enzyme and 10× restriction buffer.
5. 6× gel loading buffer.
6. Agarose.
7. 1× TAE buffer.
8. λ ladder PFG marker.
9. EtBr.
10. Sterile toothpicks, 50-ml Falcon tubes and 1.5-ml microcentrifuge tubes.
11. 37°C thermostat shaker.
12. Benchtop centrifuge (Model Rotanta 460R) (Hettich, Tuttlingen, Germany).
13. Eppendorf microcentrifuge.
14. 37°C thermostat incubator.
15. CHEF Mapper XA PFG Electrophoresis System.
16. UV transilluminator.

2.9. Bulk Ligation, Transformation, Colony Picking, Library Duplication, and Storage

1. T4 DNA ligase and 10× T4 DNA ligase buffer.
2. 22 × 22 cm square plates (Q-trays) containing LB agar with 12.5 µg/ml chloramphenicol.
3. ElectroMAX DH10B-competent cells.
4. SOC medium.
5. Freezing medium: 10 g/L bacto-tryptone, 5 g/L bacto-yeast extract, 10 g/L NaCl, 36 mM K2HPO4, 13.2 mM KH2PO4, 1.7 mM Na-citrate, 6.8 mM (NH4)2SO4 and 4.4% glycerol. Autoclave and then add filter-sterilised MgSO4 stock solution to 0.4 mM.
6. Ethanol (20% and 80%).
7. 10-ml culture tubes, sterile glass spreaders and a rubber roller.
8. 16°C waterbath.
9. GenePulserXcell electroporator.
10. GenePulser cuvettes (0.1 cm electrode gap).
11. 37°C thermostat shaker.
12. 37°C thermostat incubator.
13. 384-well plates and lids (Genetix, New Milton, Hampshire, UK).
14. QPix2 robot (Genetix).
15. Qiagen tape pads.
16. QFill2 (Genetix).
17. −80°C freezers.

2.10. Reproducing the BAC Library on Filters

1. Denaturing solution (mixture of 0.5 N NaOH and 1.5 M NaCl): dissolve 87.6 g NaCl and 20 g NaOH in autoclaved deionised H2O to a final volume of 1 L.
2. Neutralising solution (1.5 M NaCl and 0.5 M Tris-HCl): dissolve 87.6 g NaCl and 121.1 g Trizma base in 900 ml autoclaved deionised H2O. Adjust pH to 7.0 with concentrated HCl and then make up to 1 L.
3. High-performance positively charged nylon membrane (Performa) (Genetix).
4. UV cross linker (Model GS Gene linker) (Bio-Rad, Richmond, CA, USA).
5. 0.34 mm-thickness blotting paper (Whatman, Maidstone, Kent, UK).
6. Forceps.
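When weighing out the denaturing and neutralising solutions, it is useful to convert between grams per litre and molarity. The sketch below does this using standard molar masses (NaCl 58.44, NaOH 40.00, Tris base 121.14 g/mol); the function names are illustrative. Applied to the masses given above, it returns about 1.5 M NaCl and 0.5 M NaOH, while 121.1 g of Trizma base in 1 L corresponds to about 1 M rather than the 0.5 M in the solution name, so it is worth confirming which value is intended before preparing that solution.

```python
# Standard molar masses in g/mol.
MOLAR_MASS = {"NaCl": 58.44, "NaOH": 40.00, "Tris base": 121.14}


def molarity(mass_g: float, compound: str, volume_l: float = 1.0) -> float:
    """Molar concentration obtained by dissolving mass_g of compound in volume_l."""
    return mass_g / MOLAR_MASS[compound] / volume_l


def grams_needed(target_molar: float, compound: str, volume_l: float = 1.0) -> float:
    """Mass of compound required for target_molar in volume_l."""
    return target_molar * MOLAR_MASS[compound] * volume_l


# Masses as given for a final volume of 1 L.
for compound, mass in [("NaCl", 87.6), ("NaOH", 20.0), ("Tris base", 121.1)]:
    print(f"{mass} g {compound} per litre = {molarity(mass, compound):.2f} M")

print(f"0.5 M Tris base in 1 L would need {grams_needed(0.5, 'Tris base'):.1f} g")
```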

3. Methods

3.1. BAC Vector Preparation

1. We recommend purchasing electroporation-competent DH10B cells from Invitrogen, as their transformation efficiency was greater than 1.0 × 10^10 transformants/µg of pUC19 DNA.
2. Place a Bio-Rad GenePulser cuvette (0.1 cm electrode gap) on ice for 10 minutes (min). Preset the pulsing conditions on a Bio-Rad electroporator (GenePulserXcell) as follows: 1,800 V, 25 µF capacitance, 200 Ω resistance, and 1 mm cuvette. Then take 1 µl of plasmid BAC vector and place it into a 1.5-ml microcentrifuge tube containing 20 µl of electroporation-competent DH10B cells. Slowly pipette twice to mix the vector and cells. Then place the mixture into the cuvette, taking care not to generate any bubbles. Place the cuvette into the GenePulser chamber. Press and rapidly release the pulse button. Transfer the mixture from the cuvette into a 10-ml culture tube containing 1 ml SOC medium. Place the tube into a 37°C incubator and incubate while shaking at 150 rpm for 1 hour (h). Then spread 200 µl onto an LB agar plate containing 12.5 µg/ml chloramphenicol, 80 µg/ml X-gal and 100 µg/ml IPTG. Place the inoculated LB agar plate into a 37°C incubator without shaking overnight.
3. When blue and/or white colonies appear on the plate, pick a single well-isolated blue colony and place it into a 1-L flask containing 200 ml LB with 12.5 mg/L chloramphenicol. Place the flask into the 37°C incubator overnight with continuous shaking.
4. Prepare plasmid BAC vector using the Qiagen Plasmid Midi Kit according to the manufacturer’s instructions; 200 ml of culture yields about 200 µg.
5. Digest 5 µg plasmid BAC vector at 37°C for 3 h in 100 µl of 1× TA buffer (Epicentre) using HindIII, EcoRI, or BamHI, depending on which restriction enzyme is selected for BAC library construction.
6. Heat the digestion at 75°C for 15 min to inactivate the restriction enzyme, and then add 6 µl 100 mM CaCl2, 2 µl 10× TA buffer, 5 µl of HK phosphatase, and 7 µl sterile Milli Q (MQ) water. Incubate the reaction at 30°C for 1 h (see Note 1).
7. Heat the reaction mixture at 65°C for 30 min to inactivate the HK phosphatase (see Note 2).
8. Load the reaction and a DNA ladder (HyperLadder I) separately on a 1% agarose gel in 1× TAE buffer and run the gel overnight at 30 V (see Note 3).
9. Stain the gel with 0.5 µg/ml EtBr for 20 min, then cut out the gel slice containing the digested BAC vector under long-wave UV light (Fig. 2). Elute the digested BAC vector from the gel using Qiagen’s MinElute Gel Extraction Kit (Fig. 2).
10. Use a spectrophotometer to measure DNA concentration and adjust it to 25 ng/µl.

Fig. 2. Preparation of linearised and dephosphorylated bacterial artificial chromosome (BAC) vector deoxyribonucleic acid (DNA) used for BAC library construction. Plasmid BAC vector DNA was purified using the Qiagen Plasmid Midi Kit (column-based), digested with BamHI, and dephosphorylated with Heat-Killable (HK) phosphatase and/or calf intestinal phosphatase (CIP). The treated DNA, labelled V, was loaded on a 1% Tris-acetate EDTA (TAE) agarose gel and run at 30 V overnight (left photo). The leftmost lane, labelled M, shows a DNA ladder (HyperLadder I). After electrophoresis, the gel was stained with ethidium bromide. The DNA band from the slot-well was cut out using a scalpel and the DNA was eluted from the gel using the Qiagen MinElute Gel Extraction Kit. The eluted DNA was loaded on a gel of the same agarose percentage and run under the same conditions as above to confirm its identity and quality (right photo). The left lane shows the eluted BAC vector DNA, while the right lane shows the DNA ladder. The size (kb) of each fragment in the ladder is indicated (10.0, 8.0, 6.0, 4.0, 3.0, 2.5 and 2.0 kb).

11. Directly transform 1 µl vector into electroporation-competent DH10B cells to test for complete digestion. In the meantime, ligate the vector with T4 DNA ligase and transform it to test for complete dephosphorylation. Fewer than 100 colonies in each test indicates good-quality vector.

3.2. Seed Sowing, Plant Growth, and Harvest

1. Sow enough seeds (200–400, depending on the species) to generate 100–200 g of leaf tissue into sterile pots containing sterile soil. Water as often as necessary to keep the plants growing and free of water stress.
2. Harvest leaf tissue using sterile scissors 2–3 weeks after seed germination, when the plants have grown to 15–20 cm or have four to six fully expanded leaves (see Note 4). Place the detached tissue into a clean plastic bag and store at −80°C, or use fresh.

3.3. Nuclei Isolation from Plants and Megabase DNA Agarose Plug Preparation

1. Place total 100 g of either fresh or frozen tissue in a mortar. Pour liquid nitrogen into half of the mortar and then use a pestle to grind the tissue into very fine powder (see Note 5). 2. Place 1-L beaker on ice and transfer the ground tissue into the beaker, and add 600–700 ml of ice-cold Wash buffer (about 6–7 ml/g tissue).


3. Gently stir for 15–20 min (see Note 6).
4. Filter the homogenate through two layers of Miracloth, and carefully squeeze the pellet to recover as much of the nuclei-containing solution as possible (see Note 7).
5. Transfer the filtered solution into 250-ml Falcon tubes. Centrifuge the tubes at 1,800 g at 4°C for 20 min.
6. Carefully pour off the supernatant and gently re-suspend the pellets in the residual buffer using a small sterile paintbrush.
7. Add 200 ml of ice-cold wash buffer to each tube and gently mix with the nuclei suspension. Centrifuge the tubes at 1,800 g at 4°C for 15 min.
8. Repeat steps 6–7 twice or more. After the final centrifugation, carefully pour off the supernatant and then use a Kimwipes tissue to carefully remove any residual buffer. Add 2–5 ml of suspension buffer (enough to give a medium concentration of nuclei) and re-suspend the nuclei pellet using a pipette tip with its end cut off. The suspended nuclei can be viewed under a fluorescence microscope after staining with 4′-6-diamidino-2-phenylindole (DAPI) at a final concentration of 10 µg/ml for 5 min in the dark (Fig. 3).
9. Prepare 10 ml of 1% LMT agarose in suspension buffer and keep it in a 45°C waterbath until use.
10. In the meantime, warm the nuclei suspension in the 45°C waterbath for 5–10 min. Then gently mix equal volumes of the nuclei suspension and LMT agarose by slowly pipetting three to five times using the same cut-off pipette tip.
11. Transfer the mixture to plug moulds using the same cut-off pipette tip. Leave the moulds on ice for 10–20 min until the plugs are completely solidified.
12. Transfer the plugs into a 50-ml Falcon tube (about 50 plugs/tube) containing 5–10 volumes (vol.) of lysis buffer.
13. Incubate the tubes in a 50°C waterbath with gentle shaking for 24–36 h.
14. Replace the lysis buffer with 0.5 M EDTA (pH 9.0–9.3) and incubate the plugs for 1 h in the 50°C waterbath with gentle shaking.
15. Replace the 0.5 M EDTA (pH 9.0–9.3) with 0.05 M EDTA (pH 8.0) and incubate the plugs for 1 h on ice.
16. Wash the plugs again with 0.05 M EDTA (pH 8.0) and then store them in 0.05 M EDTA (pH 8.0) at 4°C (see Note 8).
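The volumes in this subheading scale with the amount of tissue, so a small calculation can help when working with more or less than 100 g. The following is a minimal sketch only: the function names are illustrative, and the 1,000 µg/ml DAPI stock and 20 µl sample volume in the example are assumptions, not values from the protocol.

```python
def wash_buffer_range_ml(tissue_g, ml_per_g=(6.0, 7.0)):
    """Return (min, max) volume of ice-cold wash buffer in ml (step 2: about 6-7 ml/g)."""
    low, high = ml_per_g
    return tissue_g * low, tissue_g * high


def final_agarose_percent(stock_percent, agarose_ml, nuclei_ml):
    """Agarose concentration (%) after mixing molten agarose with nuclei suspension (step 10)."""
    return stock_percent * agarose_ml / (agarose_ml + nuclei_ml)


def stain_stock_volume_ul(final_ug_per_ml, sample_ul, stock_ug_per_ml):
    """Volume of stain stock (in µl) to add to a sample to reach the final concentration.

    Solves C_stock * v = C_final * (sample + v) for v.
    """
    if stock_ug_per_ml <= final_ug_per_ml:
        raise ValueError("stock must be more concentrated than the final solution")
    return final_ug_per_ml * sample_ul / (stock_ug_per_ml - final_ug_per_ml)


if __name__ == "__main__":
    # 100 g of tissue, as in step 1
    print(wash_buffer_range_ml(100))
    # Equal volumes of 1% LMT agarose and nuclei suspension
    print(final_agarose_percent(1.0, 3.0, 3.0))
    # Hypothetical 1,000 µg/ml DAPI stock added to a 20 µl aliquot of nuclei
    print(round(stain_stock_volume_ul(10.0, 20.0, 1000.0), 3))
```

Mixing equal volumes of nuclei suspension and 1% LMT agarose halves the agarose concentration in the plugs, which is why the second function is included.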


Fig. 3. 4′-6-Diamidino-2-phenylindole (DAPI) staining of nuclei isolated from barley plants. Nuclei were isolated from barley plants using homogenisation buffer (HB). DAPI staining of nuclei was carried out as described in the text. A final concentration of 10 µg/ml of DAPI was applied. Nuclei (labelled N) were viewed using a DMLB fluorescence microscope (Leica, Chatsworth, CA, USA) with an appropriate light filter. DAPI produces light blue fluorescence with an excitation wavelength of 345 nm.

3.4. Partial Digestion of Megabase DNA Agarose Plugs and Size Selection

1. Take out 20–30 plugs stored in 0.05 M EDTA (pH 8.0). Rinse the plugs with 10–20 vol. of ice-cold TE buffer and then leave them in ice-cold TE buffer plus 0.1 mM PMSF on ice for 1 h.
2. Pour off the PMSF-containing TE buffer and add fresh TE buffer without PMSF. Leave on ice for 1 h.
3. Repeat step 2 once.
4. Take four plugs for a partial digestion test. Cut each plug into 16 pieces with a sterile razor blade in a sterile Petri dish on ice. Place eight pieces (i.e. half a plug) into a single ice-cold 1.5-ml microcentrifuge tube. Four plugs give a total of eight 1.5-ml microcentrifuge tubes. Add 100 µl of partial digestion buffer and incubate on ice for 1 h.
5. Replace the partial digestion buffer with fresh buffer twice, once per hour, while on ice.
6. Make serial restriction enzyme dilutions (HindIII, EcoRI, or BamHI, depending on which enzyme is selected for BAC library construction) in partial digestion buffer (e.g. 0.0, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, and 40 units/5 µl) (see Note 9).
7. Add 5 µl of one enzyme dilution to each of the eight microcentrifuge tubes. Mix by gently tapping the tubes and incubate on ice for 40 min to allow the enzyme to diffuse into the agarose matrix.
8. Incubate the tubes in a 37°C waterbath for 30 min.


9. Place on ice and add 16 µl of 0.5 M EDTA (pH 8.0) to each tube. Mix by tapping the tubes. Incubate the tubes on ice for about 10 min to terminate the digestions.
10. Make 150 ml of 1% agarose gel (in 1× TAE buffer) and keep it in a 50°C waterbath. Set up a 13 × 14 cm gel casting stand with one 15-well, 1.5-mm-thick comb (Bio-Rad). Pour the gel into the stand. Leave some of the gel in the 50°C waterbath.
11. Using a spatula, load the eight plug pieces from one tube into each well, in order of increasing enzyme units. Load the λ ladder PFG marker into a side well or a central well.
12. Seal the wells with the 50°C 1% agarose.
13. Run the gel at 5–60 seconds (s) linear ramp, 6 V/cm, 11°C in 1× TAE buffer for 18 h.
14. Stain the gel with 0.5 µg/ml EtBr and photograph the gel (Fig. 4).
15. Locate the DNA range between 100 kb and 300 kb and determine which enzyme dilution gives the highest amount of partially digested DNA within this range. Then use this dilution for a large-scale partial digestion.

[Gel image: lanes 1–8, HindIII at 0, 0.1, 0.2, 0.5, 1, 2, 5 and 40 units; lane 9, λ ladder PFG marker M (kb), 48.5 to 727.5.]

Fig. 4. Partial digestion test of high molecular weight deoxyribonucleic acid (HMW DNA) from barley plants with HindIII restriction enzyme. Lanes from left to right contain the same amount of the same HMW DNA digested with increasing concentrations of HindIII restriction enzyme. The rightmost lane, labelled M (lane 9), contains the λ ladder pulsed-field gel (PFG) marker. The size of each fragment in the marker is indicated. The enzyme units used are also indicated. The optimal partially digested size range, between 100 and 300 kb, is highlighted with an orange dotted rectangle. The region highlighted with a red dotted rectangle has the largest percentage of DNA fragments. The amount of enzyme that produces this largest percentage of DNA fragments is selected for the large-scale partial digestion.

16. Take ten plugs for a large-scale partial digestion following steps 3–13 with some slight modifications (see Note 10).
17. After finishing step 13, change the switch time to 3–5 s and continue running the gel for 6 h (see Note 11).
18. Cut out the unstained central part of the gel corresponding to the 100–300 kb DNA fragment size, located with a ruler (Fig. 5).
19. Make 150 ml of 1% agarose gel (in 1× TAE buffer) and keep it in a 50°C waterbath. Set up a 13 × 14 cm gel casting stand without a comb. Place the cut gel fraction at the top of the stand in the same orientation as in the original gel in step 10. Pour the pre-warmed gel into the casting stand to just cover the cut gel fraction. Leave some of the gel in the 50°C waterbath.
20. After the gel has set, use a sterile razor blade to make two slots on each side of the gel (one slot on each side aligned with the top of the cut gel fraction and the other aligned with the bottom of the cut gel fraction). Fill all slots with the λ ladder PFG marker. Seal the slots with the 50°C 1% agarose.


Fig. 5. First size selection of partially digested high molecular weight deoxyribonucleic acid (HMW DNA). The gel was cut vertically into three pieces after pulsed-field gel (PFG) electrophoresis. The two flanking pieces, which contain the λ ladder PFG marker (M) as well as a small part of the partially digested HMW DNA (D), were stained with ethidium bromide to confirm that the digestion matched the test. The remaining centre piece, which contains most of the partially digested HMW DNA, was stored at 4°C. The two stained flanking pieces were aligned with a ruler (in the centre) and photographed. The gel between the two red dotted lines (i.e. between 100 and 300 kb) was excised and subjected to PFG electrophoresis for a second round of size selection.


21. Run the gel at 3–5 s switch time, 6 V/cm, 11°C in 1× TAE buffer for 20 h.
22. Cut off the two sides of the gel containing the marker and a small part of the cut gel fraction and stain them with 0.5 µg/ml EtBr. Leave the rest of the gel at 4°C.
23. Take a photograph with a ruler in the centre (Fig. 6). Cut out the second size-selected fraction, located with the ruler. The cut gel fraction can be further divided into six sub-fractions and either used immediately or stored at −20°C in 70% ethanol (see Note 12).

3.5. DNA Elution from Agarose Gels

1. Use a razor blade to cut each of the sub-fractions into very small pieces in a sterile Petri dish on ice (do not mix up the sub-fractions). A fraction stored in 70% ethanol must first be rinsed with 1× TAE buffer and then left in 10–20 vol. of 1× TAE buffer at 4°C overnight before use (see Note 12).


Fig. 6. Second size selection. The λ ladder pulsed-field gel (PFG) marker was loaded in two positions on each side of the gel, which are represented by M in both red and brown. As in the first size selection, the gel was cut vertically into three pieces after PFG electrophoresis. The two flanking pieces, which contain the λ ladder PFG marker (M) as well as a small part of the partially digested high molecular weight deoxyribonucleic acid (HMW DNA) (D), were stained with ethidium bromide, while the remaining centre piece, which contains most of the partially digested HMW DNA, was stored at 4°C. The two stained flanking pieces were aligned with a ruler (in the centre) and photographed. The region between 100 kb (in red) and 150 kb (in brown), marked with two red lines in the centre piece, is excised according to the photograph and further divided into six sub-regions, marked with red dotted lines and labelled F1, F2, … and F6.


2. Set up an electro-eluter apparatus (Bio-Rad, Model 422) according to the manufacturer’s instructions and let it stand on ice. Place the small pieces of each sub-fraction into individual tubes. Pour ice-cold 1× TAE buffer into the tank.
3. Run at a constant current (10 mA/tube) for about 1 h at 4°C, then reverse the current under the same conditions for 1 min.
4. Carefully remove the supernatant above the collection cup. Carefully take the solution out of the collection cup with a sterile cut-off pipette tip (see Note 13).
5. Prepare a 1% agarose minigel containing 0.5 µg/ml EtBr. Load 1 µl of the eluted DNA along with different amounts of λ DNA (12.5, 25.0, 37.5, 50.0, 75.0 ng). Run at a constant 100 V for 30 min. Take an image under UV light (Fig. 7), and estimate the DNA concentration by comparison with the λ DNA standards.

3.6. Ligation of Vector and DNA

1. Add different amounts of the eluted DNA (12.5, 25.0, 37.5, 50.0, 62.5, 75.0, 87.5, and 100.0 ng) into 1.5-ml microcentrifuge tubes. Add 1 μl vector (25 ng/μl), 10 μl 10× ligation buffer, five units T4 DNA ligase and sterile deionised H2O into each tube to make a total volume of 100 μl (see Note 14). 2. Incubate the ligations in a 16°C waterbath for 16 h. Store the ligations on ice (see Note 15).
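Setting up the eight test ligations means converting each target amount of insert (in ng) into a pipetting volume and adjusting the water accordingly. The sketch below illustrates that arithmetic and the vector-to-insert molar ratio mentioned in Note 14. It rests on several assumptions that are not in the protocol: a ligase volume of 1 µl for the five units, an eluted DNA concentration of 12.5 ng/µl, a vector size of 7.5 kb, and a mean insert size of 150 kb.

```python
def ligation_recipe(insert_ng, insert_ng_per_ul, ligase_ul=1.0,
                    vector_ul=1.0, buffer_ul=10.0, total_ul=100.0):
    """Return the insert and water volumes (µl) for one 100-µl ligation.

    ligase_ul is an assumed volume; it depends on the stock concentration of the enzyme.
    """
    insert_ul = insert_ng / insert_ng_per_ul
    water_ul = total_ul - (insert_ul + vector_ul + buffer_ul + ligase_ul)
    if water_ul < 0:
        raise ValueError("insert DNA is too dilute to fit into the reaction volume")
    return {"insert_ul": insert_ul, "water_ul": water_ul}


def vector_to_insert_molar_ratio(vector_ng, vector_kb, insert_ng, insert_kb):
    """Molar ratio of vector to insert; moles are proportional to mass / length."""
    return (vector_ng / vector_kb) / (insert_ng / insert_kb)


if __name__ == "__main__":
    amounts_ng = [12.5, 25.0, 37.5, 50.0, 62.5, 75.0, 87.5, 100.0]  # from step 1
    for ng in amounts_ng:
        recipe = ligation_recipe(ng, insert_ng_per_ul=12.5)  # assumed concentration
        ratio = vector_to_insert_molar_ratio(25.0, 7.5, ng, 150.0)  # assumed sizes
        print(ng, recipe, round(ratio, 1))
```

Because the inserts are far longer than the vector, even 25 ng of vector is in molar excess over 100 ng of insert under these assumed sizes.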

3.7. Transformation of Ligation into E. coli DH10B Cells

1. Thaw ElectroMAX DH10B competent cells (Invitrogen) on ice and dispense 18 µl into pre-chilled 1.5-ml microcentrifuge tubes on ice. Pre-cool the electroporation cuvettes (1 mm electrode gap) on ice. Prepare SOC medium and dispense 1 ml each into individual sterile 10-ml culture tubes at room temperature. Label the culture tubes corresponding to the ligation tubes.

[Gel image: left lanes, eluted DNA from fractions F1–F6 (1 µl each); right lanes, λ DNA (12.5 ng/µl) at 1, 2, 3, 4 and 5 µl.]

Fig. 7. Determination of the concentration of deoxyribonucleic acid (DNA) eluted from an agarose gel after the second size selection. 1 μl of the eluted DNA from each sub-region (F1, F2, … and F6) shown in Fig. 6 was loaded in the left lanes of the gel. The λ DNA, at a concentration of 12.5 ng/μl, was loaded in different amounts as indicated in the right lanes. Since the λ DNA is 50 kb in size, the concentration of the eluted DNA, sized from 100 kb to 300 kb, can be easily compared and calculated.


2. Place 2 µl of each ligation into individual competent cell tubes. Gently pipette once or twice to mix.
3. Transfer the mixture of ligation and competent cells into pre-cooled electroporation cuvettes. Electroporate under the following conditions: 1,800 V, 25 µF capacitance, 200 Ω resistance, and 1 mm cuvette (see Note 16).
4. Transfer the electroporated cells into the 10-ml culture tubes containing 1 ml SOC medium and incubate in a 37°C incubator for 1 h with vigorous shaking (120–160 rpm).
5. Plate 50 and 200 µl of each culture on 100-mm diameter sterile Petri dish LB agar plates containing 12.5 µg/ml chloramphenicol, 80 µg/ml X-gal and 100 µg/ml IPTG. Incubate the plates in a 37°C incubator overnight without shaking.
6. Count both white and blue (if any) colonies and determine the number of white colonies per microlitre of ligation (see Note 17).

3.8. Estimation of Insert Size

1. Using sterile toothpicks, randomly pick 44 white colonies into individual 50-ml Falcon tubes, each containing 5 ml LB and 12.5 µg/ml chloramphenicol. Incubate the tubes in a 37°C incubator overnight with vigorous shaking (120–160 rpm).
2. Centrifuge the tubes at 4°C using a benchtop centrifuge (Rotanta 460R) at 2,310 g (or 3,000 rpm) for 8 min. Remove the supernatant using a water-driven aspirator. Add 200 µl of ice-cold P1 buffer and vortex at room temperature to re-suspend the cell pellets.
3. Transfer the suspended cells into 1.5-ml microcentrifuge tubes and add 400 µl of freshly made P2 buffer. Mix by gently inverting the tubes four to six times. Stand the tubes at room temperature for less than 5 min (see Note 18).
4. Add 300 µl of ice-cold P3 buffer. Mix the contents by gently inverting the tubes four to six times. Stand the tubes on ice for more than 7 min.
5. Centrifuge the tubes using an Eppendorf microcentrifuge at 4°C at 16,100 g (or 13,200 rpm) for 25 min (see Note 19).
6. Carefully transfer about 800–850 µl of each supernatant to a new microcentrifuge tube. Add 550 µl isopropanol and mix thoroughly.
7. Centrifuge the tubes using an Eppendorf microcentrifuge at 12,000 g (or 11,400 rpm) at room temperature for 5 min.
8. Remove the supernatant. Add 400 µl of 70% ethanol and centrifuge the tubes using an Eppendorf microcentrifuge at 12,000 g for 2 min to wash the DNA pellets.


9. Carefully remove the supernatant. Air-dry the pellets for 5–10 min (see Note 20). Add 30 µl of TE buffer (pH 8.0) and leave for 30 min until the pellets are dissolved. Add 10 µl of NotI digestion mixture (0.25 units of NotI, 4 µl of 10× digestion buffer, 3.5 µl of water, 0.5 µl of 10 mg/ml BSA, and 2 µl of 40 mM spermidine) to each tube.
10. Incubate the tubes in a 37°C incubator for 3 h. Add 6 µl of 6× DNA loading buffer to each tube.
11. Set up a 14 × 20 cm gel casting stand with a 45-well, 1.5-mm-thick comb. Prepare 200 ml of 1% agarose in 1× TAE buffer and pour it into the casting stand at about 50°C. Leave some of the gel in a 50°C waterbath.
12. Load the λ ladder PFG marker in a central well and seal the well with the 50°C 1% agarose. Load the miniprepped DNA samples into the remaining wells.
13. Run the gel at 1–30 s linear ramp, 6 V/cm, 11°C in 1× TAE buffer for 17 h.
14. Stain the gel with 0.5 µg/ml EtBr. Take a photograph of the gel (Fig. 8) and analyse the insert sizes.
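The centrifugation steps above quote each speed both as relative centrifugal force (g) and as rpm. The two are only equivalent for a given rotor radius, so the rpm values need recalculating for a different centrifuge. A minimal sketch of the standard conversion, RCF = 1.118 × 10⁻⁵ × r (cm) × rpm², is given below; the function names are illustrative.

```python
import math

RCF_CONSTANT = 1.118e-5  # standard factor for radius in cm and speed in rpm


def rcf_from_rpm(rpm, radius_cm):
    """Relative centrifugal force (x g) for a rotor radius in cm."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


def rpm_from_rcf(rcf, radius_cm):
    """Speed in rpm needed to reach a given RCF with a rotor of the given radius."""
    return math.sqrt(rcf / (RCF_CONSTANT * radius_cm))


def radius_from_pair(rcf, rpm):
    """Rotor radius (cm) implied by a quoted RCF/rpm pair."""
    return rcf / (RCF_CONSTANT * rpm ** 2)


if __name__ == "__main__":
    # Radii implied by the g/rpm pairs quoted in steps 2, 5 and 7
    for rcf, rpm in [(2310, 3000), (16100, 13200), (12000, 11400)]:
        print(rcf, rpm, round(radius_from_pair(rcf, rpm), 1))
```

The pairs quoted for the microcentrifuge imply a radius of about 8.3 cm, and the benchtop pair implies about 23 cm; use the radius of your own rotor in `rpm_from_rcf` to match the quoted g values.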

[Gel image: λ ladder PFG marker M (kb) at 194.0, 145.5, 97.0, 48.5 and 24.5; a further band is labelled at 7.5 kb.]

Fig. 8. Determination of bacterial artificial chromosome (BAC) clone insert sizes. Deoxyribonucleic acid (DNA) from randomly selected BAC clones was completely digested with NotI, and the fragments were separated by pulsed-field gel (PFG) electrophoresis under the following conditions: 1% agarose gel, 1–30 s linear ramp, 6 V/cm, 11°C in 1× Tris-acetate EDTA (TAE) buffer for 17 h. The λ ladder PFG marker (labelled M) was loaded in the centre of the gel. Five sizes (24.5, 48.5, 97.0, 145.5 and 194.0 kb) of the λ ladder PFG marker are aligned with yellow dotted lines for easy estimation of the BAC clone insert sizes. The average size of the BAC inserts in this figure is estimated to be ~145 kb.
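Once fragment sizes have been read off the gel, the insert size of each clone is the sum of its NotI fragments after excluding the vector band. The sketch below shows one way to tabulate this. It assumes a vector band of 7.5 kb (taken from the band label in Fig. 8) and a matching tolerance of 1 kb; the clone names and fragment sizes are invented for illustration and are not data from this chapter.

```python
from statistics import mean

VECTOR_KB = 7.5  # assumed size of the vector band released by NotI


def insert_size_kb(fragments_kb, vector_kb=VECTOR_KB, tol_kb=1.0):
    """Sum the NotI fragments of one clone after dropping one band that matches the vector."""
    remaining = list(fragments_kb)
    for size in remaining:
        if abs(size - vector_kb) <= tol_kb:
            remaining.remove(size)
            break  # remove only one vector band, then stop iterating
    return sum(remaining)


def summarise_inserts(clones):
    """Return per-clone insert sizes, the number of empty clones and the mean insert size."""
    sizes = {name: insert_size_kb(frags) for name, frags in clones.items()}
    with_insert = [s for s in sizes.values() if s > 0]
    return {
        "sizes_kb": sizes,
        "n_empty": len(sizes) - len(with_insert),
        "mean_insert_kb": mean(with_insert) if with_insert else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical band sizes (kb) estimated against the λ ladder PFG marker
    example = {
        "clone_01": [7.5, 150.0],
        "clone_02": [7.5, 90.0, 50.0],  # insert with an internal NotI site
        "clone_03": [7.5],              # empty vector
    }
    print(summarise_inserts(example))
```

Clones whose inserts contain internal NotI sites give more than one insert band, which is why the fragments are summed rather than read as a single band.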


3.9. Bulk Ligation, Transformation, Colony Picking, Library Duplication, and Storage

1. If the insert sizes meet your requirements, set up a large-scale ligation under the same conditions as used for the test ligation.
2. Transform the entire ligation into Invitrogen ElectroMAX DH10B competent cells using the same conditions as used in the test.
3. Use a robot (QPix2) to pick individual colonies into 384-well plates containing 70 μl of freezing medium, filled manually or by a robot (QFill2). After each round of picking, the picking pins must be sterilised with 20% ethanol for 3 s and 80% ethanol for 3 s, followed by hot-air drying for 10 s.
4. Seal the plates with Qiagen tape pads using a sterile rubber roller. Incubate the plates in a 37°C incubator overnight. The next morning, check for empty wells. Re-inoculate individual colonies into any empty wells using sterile toothpicks, and incubate for an additional 8 h.
5. Make one or more copies of the plates using the QPix2. Place Qiagen tape pads and then Genetix lids on each plate and store at –80°C.

3.10. Reproducing the BAC Library on Filters

1. Use the QPix2 to inoculate BAC clones from the 384-well plates onto Genetix 22 × 22 cm Performa membranes. Spot 18,432 BAC clones from forty-eight 384-well plates in duplicate onto one membrane. This number and arrangement will depend on the particular robot used.
2. Use forceps to place the membranes on 22 × 22 cm LB agar plates (Q-Trays) containing 12.5 μg/ml chloramphenicol. Incubate in a 37°C incubator for 16–24 h until the colonies are 1–2 mm in diameter.
3. Remove the membranes and place them (colony side up) on a piece of 0.34-mm-thick Whatman paper pre-wetted with 70 ml of the denaturing solution in a 22 × 22 cm plate and incubate for 15 min.
4. Transfer the membranes into the neutralising solution in a 22 × 22 cm plate and incubate for 10 min. Then transfer the membranes to a dry piece of Whatman paper and leave to air-dry for 1–2 h.
5. Place the membranes (DNA side up) in a UV crosslinker (GS Gene Linker, Bio-Rad) and expose at 120,000 μJ/cm² (see Note 21). Alternatively, bake the membranes at 80°C for 2 h. Either treatment fixes the DNA to the membranes irreversibly. The membranes are now ready for hybridisation with any probe of interest. Store the membranes at room temperature or 4°C.
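The numbers in step 1 follow from simple plate arithmetic, which also gives the number of plates and membranes needed for a library of a given size. The following is a minimal sketch; the library size of 100,000 clones in the example is an assumption chosen for illustration.

```python
import math

WELLS_PER_PLATE = 384
PLATES_PER_MEMBRANE = 48  # from step 1; depends on the robot used
SPOTS_PER_CLONE = 2       # each clone is spotted twice


def clones_per_membrane():
    """Number of distinct clones represented on one membrane."""
    return WELLS_PER_PLATE * PLATES_PER_MEMBRANE


def spots_per_membrane():
    """Total number of spots on one membrane, counting duplicates."""
    return clones_per_membrane() * SPOTS_PER_CLONE


def plates_needed(n_clones):
    """Number of 384-well plates needed to hold n_clones."""
    return math.ceil(n_clones / WELLS_PER_PLATE)


def membranes_needed(n_clones):
    """Number of membranes needed to array n_clones at 48 plates per membrane."""
    return math.ceil(plates_needed(n_clones) / PLATES_PER_MEMBRANE)


if __name__ == "__main__":
    print(clones_per_membrane(), spots_per_membrane())
    # Hypothetical library of 100,000 clones
    print(plates_needed(100_000), membranes_needed(100_000))
```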


3.11. Characterisation of BAC Libraries


Characterisation of BAC libraries can be performed by estimation of the average insert size and detection of the average number of clones hybridised with single copy probes.
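The average insert size and the number of clones together determine the genome coverage of the library, which is also roughly the number of clones expected to hybridise with a single-copy probe. A minimal sketch of this calculation is given below, together with the widely used Clarke and Carbon estimate of the number of clones needed to include any given sequence with probability P, N = ln(1 − P)/ln(1 − I/G). The genome size of 5,100 Mb and the library size of 100,000 clones are assumptions for illustration and are not values from this chapter.

```python
import math


def genome_coverage(n_clones, mean_insert_kb, genome_mb):
    """Genome equivalents in the library: clones x insert size / genome size."""
    return n_clones * mean_insert_kb / (genome_mb * 1000.0)


def clones_for_probability(p, mean_insert_kb, genome_mb):
    """Clones needed so that a given sequence is present with probability p.

    Uses N = ln(1 - p) / ln(1 - I/G), with insert size I and genome size G.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    fraction = mean_insert_kb / (genome_mb * 1000.0)
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - fraction))


if __name__ == "__main__":
    insert_kb = 145.0   # average insert size estimated in Fig. 8
    genome_mb = 5100.0  # assumed genome size
    print(round(genome_coverage(100_000, insert_kb, genome_mb), 2))
    print(clones_for_probability(0.99, insert_kb, genome_mb))
```

If the observed number of positive clones per single-copy probe falls well below the calculated coverage, that may point to an overestimated insert size or to under-represented regions.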

Notes

1. HK phosphatase can be combined with calf intestinal phosphatase (CIP) for dephosphorylation. The reaction time can be extended to 2 h if necessary.
2. HK phosphatase and CIP can be inactivated by heating at 85°C for 15 min, or at 75°C for 10 min in the presence of 5 mM EDTA (pH 8.0), or by phenol/chloroform extraction followed by ethanol precipitation.
3. Running at low voltage avoids overheating the gel and readily separates the cut and uncut forms of the plasmid vector.
4. Most plants can usually be harvested two or three times.
5. Grinding the tissue into a very fine powder will yield more nuclei. However, make sure that the tissue does not thaw during grinding; otherwise, the nuclei will degrade. If needed, slowly add more liquid nitrogen to the mortar during grinding, but be extremely careful not to spill any tissue out of the mortar.
6. Initially, manual stirring is necessary because both the powder and the solution are frozen and are hard to stir with a mechanical stirrer. Manual stirring shortens the time the nuclei spend in the solution, during which some degradation could occur.
7. If the filtered solution contains some tissue powder, filter it again through two layers of Miracloth; this reduces contamination from unwanted chloroplast DNA.
8. The DNA at this step can be stored at 4°C for up to 1 year without significant degradation.
9. The zero-unit and 40-unit digestions serve as negative and positive controls, respectively. If the extent of partial digestion differs markedly between two adjacent enzyme amounts, test a third amount between them to find the most suitable amount, i.e. the one that gives the best partial digestion pattern.
10. The modifications include skipping the test with different amounts of restriction enzyme and using a slot well to load the partially digested plugs. In addition, the λ ladder PFG marker is loaded into two side wells of the gel. Furthermore, only the two sides of the gel containing the marker are stained and photographed, with a ruler at one side.
11. Changing the switch time to 3–5 s and running for additional hours maximises the removal of small DNA fragments from the gel. It also compresses the region of the gel to be cut, which makes size separation effective in the subsequent second size selection.
12. The plugs can be stored indefinitely in 70% ethanol at −20°C. Ethanol-stored plugs can be used after soaking for 3 h in a large volume of sterilised distilled water at room temperature, with several changes of water and gentle shaking.
13. Eluted DNA should be used as soon as possible, preferably on the day it is eluted. Always use cut-off pipette tips to handle HMW genomic DNA; this avoids mechanical shearing.
14. During ligation, both the insert DNA and the vector can circularise and form tandem oligomers; it is therefore necessary to adjust the DNA concentration in the ligation reaction to optimise the number of correct ligation products. To achieve this, further evaluation of vector-to-insert molar ratios may be required.
15. The ligations should not be incubated for more than 16 h at 16°C; otherwise, small-insert products could be generated. The ligation reactions can be inactivated at 65°C for 15 min or by adding 2.5 μl of proteinase K (10 mg/ml) per 100 μl of ligation. Storing the ligations on ice minimises changes in insert size and transformation efficiency over a long period, whereas storage at 4°C does so for only 7 days.
16. Insert size can be increased by using a lower voltage without significantly losing transformation efficiency. However, any voltage lower than 1,400 V will reduce transformation efficiency. Transformation efficiency can be increased by desalting the ligations on membrane filters (0.05 µm VMWP, 13 mm in diameter) (Millipore, Bedford, MA, USA) floating on 30% polyethylene glycol (PEG) 8000 in a Petri dish for 1 h on ice.
17. The number of clones desired, the genome size, and the desired genome coverage are taken into consideration when deciding whether the experiment should go on or not.
18. Prolonged alkaline lysis may degrade plasmid DNA, may permanently denature the supercoiled plasmid DNA, or may render it unsuitable for downstream applications. The cells should lyse in less than 5 min.
19. The centrifugation time can be extended to 30–35 min if the pellets are not tightly formed.
20. Overdrying the pellets makes large DNA difficult to dissolve.
21. UV cross-linking is recommended for nylon membranes as this leads to covalent attachment of the DNA to the nylon and also allows the membranes to be re-probed several times.
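The electroporation settings in step 3 of Subheading 3.7 and the voltage limits in Note 16 are easier to transfer to other cuvettes when expressed as field strength. The sketch below converts voltage and gap width into kV/cm and gives the nominal RC time constant of the pulse circuit. It is an illustration only: the time constant actually delivered also depends on the conductivity of the sample, which this calculation ignores.

```python
def field_strength_kv_per_cm(volts, gap_mm):
    """Electric field strength across the cuvette gap in kV/cm."""
    return (volts / 1000.0) / (gap_mm / 10.0)


def nominal_time_constant_ms(resistance_ohm, capacitance_uf):
    """Nominal RC time constant in ms, ignoring the resistance of the sample."""
    return resistance_ohm * capacitance_uf * 1e-6 * 1e3


if __name__ == "__main__":
    # Settings from step 3 of Subheading 3.7: 1,800 V, 1 mm gap, 200 ohm, 25 µF
    print(field_strength_kv_per_cm(1800, 1.0))
    print(nominal_time_constant_ms(200, 25))
    # Lower voltage limit given in Note 16
    print(field_strength_kv_per_cm(1400, 1.0))
```

With a 1 mm cuvette, the recommended 1,800 V corresponds to 18 kV/cm and the 1,400 V lower limit to 14 kV/cm.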

Acknowledgements

We thank the Australian Centre for Plant Functional Genomics of the University of Adelaide (Australia) for support of this work. We are especially grateful to Chun-Ji Liu for providing an initial protocol for construction of a BAC library.

References

1. Collins, J. and Hohn, B. (1978) Cosmids: a type of plasmid gene-cloning vector that is packageable in vitro in bacteriophage lambda heads. Proc. Natl. Acad. Sci. USA 75, 4242–4246.
2. Burke, D.T., Carle, G.F., and Olson, M.V. (1987) Cloning of large segments of exogenous DNA into yeast by means of artificial chromosome vectors. Science 236, 806–812.
3. Sternberg, N. (1990) Bacteriophage P1 cloning system for the isolation, amplification, and recovery of DNA fragments as large as 100 kilobase pairs. Proc. Natl. Acad. Sci. USA 87, 103–107.
4. Shizuya, H., Birren, B., Kim, U.-J., et al. (1992) Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc. Natl. Acad. Sci. USA 89, 8794–8797.
5. Zhang, H.B., Woo, S.S., and Wing, R.A. (1996) BAC, YAC and Cosmid Library Construction, in Plant Gene Isolation (Foster, G. and Twell, D., eds.), Wiley, New York, pp. 75–99.
6. Ioannou, P.A., Amemiya, C.T., Garnes, J., Kroisel, P.M., Shizuya, H., Chen, C., Batzer, M.A., and De Jong, P.J. (1994) A new bacteriophage P1-derived vector for the propagation of large human DNA fragments. Nat. Genet. 6, 84–89.
7. Kelley, J.M., Field, C.E., Craven, M.B., Bocskai, D., Kim, U.-J., Rounsley, S.D., and Adams, M.D. (1999) High throughput direct end sequencing of BAC clones. Nucleic Acids Res. 27, 1539–1546.
8. Shi, B.J., Collins, N., Miftahudin, B., Langridge, P., and Gustafson, P. (2006) Construction of a rye cv. Blanco BAC library, and progress towards cloning the rye Alt3 aluminium tolerance gene. Vortr. Pflanzenzuchtg. 71, 205–209.
9. Ma, X.F., Wanous, M.K., Houchins, K., Rodriguez Milla, M.A., Goicoechea, P.G., Wang, Z., Xie, M., and Gustafson, J.P. (2001) Molecular linkage mapping in rye (Secale cereale L.). Theor. Appl. Genet. 102, 517–523.
10. Amemiya, C.T., Ota, T., and Litman, G.W. (1996) Nonmammalian Genomic Analysis: A Practical Guide (Lai, E. and Birren, B., eds.), Academic Press, San Diego, CA, pp. 223–256.
11. Birren, B., Green, E.D., Klapholz, S., Myers, R.M., and Roskams, J. (eds.) (1997) Analyzing DNA. CSH Laboratory Press, Cold Spring Harbor, New York.
12. Osoegawa, K., Woon, P.Y., Zhao, B., et al. (1998) An improved approach for construction of bacterial artificial chromosome libraries. Genomics 52, 1–8.
13. Choi, S. and Wing, R.A. (2000) Plant Molecular Biology Manual, 2nd ed. (Gelvin, S. and Schilperoort, R., eds.), Kluwer Academic Publishers, Norwell, MA, pp. 1–28.
14. Peterson, D.G., Tomkins, J.P., Frisch, D.A., Wing, R.A., and Paterson, A.H. (2000) Construction of plant bacterial artificial chromosome (BAC) libraries: an illustrated guide. J. Agric. Genomics 5 (Beavis, B. and May, G., eds).

Chapter 5

Transcript Profiling and Expression Level Mapping

Elena Potokina, Arnis Druka, and Michael J. Kearsey

Summary

Transcript abundance data from cRNA hybridizations to Affymetrix microarrays can potentially be used to identify genetic markers to facilitate high-throughput genotyping. We have shown that information from Affymetrix expression arrays can readily be used to accurately identify over 4,000 robust polymorphic transcript-derived markers (TDMs). We developed the method to identify TDM polymorphisms from experiments involving two tissues in two commercial varieties of barley and their doubled-haploid progeny. These TDMs represent ~18% of the total barley genes on the chip and can be used to predict the genotypes in an F1-derived, doubled-haploid population. According to our estimates, 35% of the TDMs reveal nucleotide polymorphism of the particular gene (single feature polymorphisms, SFPs) while 65% mark polymorphism resulting in extreme variation of gene expression (gene expression markers, GEMs). The latter are probably mainly cis-acting regulators, while a small proportion, ~5%, are loosely linked or unlinked trans-regulators.

Key words: Affymetrix, Expression analysis, Barley, SNP, Transcript-derived markers.

1. Introduction

High-density oligonucleotide arrays have provided a powerful tool for transcriptome profiling of plant crop species. In addition to their ability to estimate transcript abundance via cRNA, microarrays have been used to recognize cRNA sequence polymorphism, allowing simultaneous genotyping and gene expression measurement within the same experiment (1). Several recent studies have explored the possibility of using transcript abundance data from cRNA hybridizations to Affymetrix microarrays to reveal genetic polymorphisms that can be used as markers to genotype individuals in mapping populations (1, 2).

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513 © Humana Press, a part of Springer Science + Business Media, LLC 2009 DOI: 10.1007/978-1-59745-427-8_5


Each gene on an Affymetrix chip (referred to as a Contig) is typically represented by 11 different 25-bp oligonucleotide probes covering the coding region of that gene. Each of these probes is present as a perfect match (PM) and a mismatch (MM) oligonucleotide. The PM exactly matches the sequence of a particular standard genotype, while the MM differs from it by a single substitution at the 13th base. The expression level of a gene is a function of the hybridization intensity of all 11 probes. Genotyping uses each of the probes independently, providing an opportunity to interrogate 11 × 25 bp fragments per gene when comparing two genotypes, for example, the parental genotypes of a mapping population. If both parental samples have the same sequence at a particular probe, the probe produces a clear signal of equal intensity for both parents. Any nucleotide substitutions or deletions in one of the parental samples will affect the hybridization kinetics and can therefore be identified by a lower probe signal. This approach was successfully applied to yeast (1), Arabidopsis (3) and barley (4, 5) for detection of thousands of sequence polymorphisms termed single feature polymorphisms (SFPs) (6). Recently, West et al. (2) introduced gene expression markers (GEMs), which are based on gene expression differences, not on individual probe hybridization. GEMs are characterized by large differences in transcript levels between the parents of a segregating population, causing a distinctly bimodal distribution of expression phenotypes in a recombinant inbred line population. Making no attempt to separate SFPs from GEMs, we suggest a simple and efficient algorithm to distinguish a very large number of polymorphic transcript-derived markers (TDMs) in cRNA profiling data from replicated Affymetrix microarrays and illustrate this with data from two parental genotypes and their doubled haploid (DH) progenies.
Following previous studies (2), we did not recognize differentially expressed genes between parents as a starting point for GEM identification. We did not empirically choose a particular n-fold expression difference between the two parental genotypes to identify genes with non-overlapping distributions in expression value. The method does not separate hybridization affinity between probe and transcript sequences (i.e. probe effect) from transcript abundance (gene expression) as an initial step for SFP detection (1). We ignored the issue of whether our TDMs represent a nucleotide polymorphism by themselves (SFPs), or just mark one through extreme allele-specific expression differences (GEMs). The TDM method simply identifies probes which can be ‘binarised’ across the DH lines. The algorithm was explored using expression data from a segregating DH population of barley derived from a cross between the varieties Steptoe (St) and Morex (Mx). We identified 2,449 and 3,858 TDMs from leaf and germinating embryo, respectively. We compared the predicted TDM genotypes for the 30 DH lines with the SNP genotypes for 203 genes and found that 95% of TDMs accurately predict the SNP genotype of over 98% of the DH lines.
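The relationship between PM and MM probes described above can be illustrated with a few lines of code. The sketch below derives an MM probe from a 25-base PM probe by replacing the 13th base. Note the assumption: the text says only that the MM carries a single substitution at the 13th base, and the sketch assumes that the substituted base is the complement of the original, as is usual for Affymetrix arrays. The probe sequence is invented for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def mismatch_probe(pm, position=13):
    """Return the MM probe for a 25-base PM probe.

    The base at the 1-based `position` is replaced by its complement
    (an assumption about the array design, not stated in the text).
    """
    pm = pm.upper()
    if len(pm) != 25:
        raise ValueError("expected a 25-base probe")
    i = position - 1
    return pm[:i] + COMPLEMENT[pm[i]] + pm[i + 1:]


def n_differences(a, b):
    """Count the positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    pm = "ACGTACGTACGTACGTACGTACGTA"  # hypothetical 25-base PM probe
    mm = mismatch_probe(pm)
    print(pm)
    print(mm)
    print(n_differences(pm, mm))
```

A polymorphism between the parents acts in a similar way to the engineered mismatch: a probe that no longer matches the transcript perfectly hybridizes less efficiently and gives a weaker signal.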

Transcript Profiling and Expression Level Mapping


2. Materials

2.1. Mapping Population

1. We describe the approach with a sample of the data for one tissue of the barley St × Mx DH population (7). We used messenger ribonucleic acid (mRNA) from seedling leaves of 35 recombinant lines for expression profiling. These lines (the ‘mini-mapper’ set) were selected from a larger population of 150 DH lines on the basis of informative recombination events that allow markers to be positioned evenly across all chromosomes.
2. Affymetrix product #900515 GeneChip® Barley1 Genome Array.

3. Methods

3.1. Plant Material, RNA Isolation and GeneChip Hybridizations

1. To obtain seedling leaf tissue, ten sterilized seeds per line were sown in each of three replicate 13 cm² pots. One pot of every member of the ‘trial set’ was placed at random in each of three randomized blocks, and each block was placed in a separate Snijders growth cabinet set at 17°C for the 16 h light period and 12°C for the 8 h dark period, at a light intensity of 400 µEinstein/m²/s. After 12 days, leaves of seven to eight seedlings from each pot were collected, bulked and flash frozen in liquid nitrogen; tissues from all three replicate pots of each line were bulked for RNA isolation.
2. RNA was isolated using Trizol procedures (8), processed and hybridized to the Barley1 GeneChip (complete description and references at http://www.affymetrix.com/products/arrays/specific/barley.affx). The labelling, hybridization and GeneChip data acquisition were conducted at the Affymetrix facility at Iowa State University, USA. In total, 41 Affymetrix Barley1 GeneChip hybridizations were analyzed: three replicates each for St and Mx, and non-replicated hybridizations of 35 DH lines. Forty-one CEL files with the results of cRNA profiling were available for the analysis.

3.2. Access to Probe Level Data from CEL Files

The exploratory analytical method is based on the probe level data of the PM values that are background adjusted, normalized and log-transformed following the RMA approach (robust multiarray average) (9). The Bioconductor open-source software is a convenient tool for this purpose.
1. Create a directory and move all the relevant CEL files to that directory (e.g. c:/barley/).
2. Install R (http://www.r-project.org/). When you use the R program it issues a prompt (‘>’) when it expects input commands.
3. Type the following commands in order:


Potokina, Druka, and Kearsey

> source("http://bioconductor.org/getBioC.R") # install Bioconductor
# (current Bioconductor releases install packages with BiocManager::install("affy") instead)
> getBioC("affy") # call package for Affymetrix data
> library(affy) # load the affy package
> setwd("c:/barley") # indicate working directory
> RNA <- ReadAffy() # read all CEL files in working directory, alphabetical order
> RNA <- bg.correct.rma(RNA) # background correction
> RNA <- normalize.AffyBatch.quantiles(RNA) # normalization by the quantile method
> barley.probeNames <- probeNames(RNA) # look at 22801 probesets with 11 probes each
> table(table(barley.probeNames))
> probeset11 <- names(which(table(barley.probeNames) == 11))
> indexBprobes <- indexProbes(RNA, which = "pm", genenames = probeset11)
> pm.i.xy <- indices2xy(unlist(indexBprobes), abatch = RNA)
> barley.names11xy <- apply(cbind(rownames(pm.i.xy), pm.i.xy), 1, paste, collapse = "--")
> RNA.logpm <- log(pm(RNA, probeset11), 2) # extract perfect match values to a matrix in logarithmic scale
> write.table(RNA.logpm, file = "barley_pm_log2.txt", sep = " ") # write perfect match values into .txt file
> q() # leave R

As a result, we have a “.txt” file with background-adjusted, normalized PM values for 250,811 probes (22,801 contigs, 11 probes each). The PM values are on a logarithmic scale.

3.3. Principle of TDM Detection

The principle of the approach is that a reliable TDM should (1) show a detectable difference in signal intensity between the two parents and (2) allow the set of DH lines to be divided into two groups, each carrying one of the parental alleles. The polymorphism shown by SFP or GEM markers is in many cases evident without any calculation; the parental allele of each DH line can be predicted by eye (Fig. 1). The task is to screen the set of 250,811 probes and to identify those that divide the whole set of 41 hybridizations into two sharp clusters, as in the case of Contig5061_at probe 7 or any of the probes of Contig11524_at (Fig. 1). For an SFP, one (or very few) of the probes of a contig will fulfil this criterion, whereas for a GEM all or most of the probes of a contig will show the clear divergence.
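The distinction drawn above between SFP-like and GEM-like contigs can be illustrated by counting, per contig, how many probes passed the screen. The sketch below assumes a hypothetical logical vector `passed`, named in the `pm()` row-name style (probeset name followed by the probe index); only three probes per contig are shown to keep the example short, whereas the real data have 11.

```r
# Hypothetical screening result: TRUE = probe met the TDM criteria.
passed <- c(Contig5061_at6 = FALSE, Contig5061_at7 = TRUE, Contig5061_at8 = FALSE,
            Contig11524_at1 = TRUE, Contig11524_at2 = TRUE, Contig11524_at3 = TRUE)

# Strip the trailing probe index to recover the contig (probeset) name.
contig <- sub("[0-9]+$", "", names(passed))

# Number of passing probes and total probes per contig.
n.passed <- sapply(split(passed, contig), sum)
n.probes <- sapply(split(passed, contig), length)

summary.tab <- data.frame(n.passed, n.probes, fraction = n.passed / n.probes)
print(summary.tab)
```

A contig with a single passing probe behaves like an SFP, and one in which all or most probes pass behaves like a GEM; where to draw the line between the two is left open here, as it is in the text.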

3.4. Detection of TDMs

1. For each probe, the corresponding 41 PM values were divided into two clusters by k-means clustering, which groups the values so as to minimize the variation within the clusters.

Fig. 1. Transcript-derived markers (TDM) as a combination of both single feature polymorphisms (SFPs) (e.g. Contig5061_at probe 7) and genetic expression markers (GEMs) (Contig11524_at all probes). Bold and underlined is Morex (Mx) allele visually recognizable among doubled haploid (DH) progeny lines. [Data table not reproduced: PM signal values for probes 1–11 of each contig in the parental replicates (M1–M3, S1–S3) and in the DH lines (DH116, DH12, DH13, DH130, DH135, …).]

2. The probes with two non-overlapping clusters are identified.
3. Assuming that the values of cluster1 and cluster2 are normally distributed, we calculate the mean and standard deviation of cluster1 and of cluster2 (Table 1).
4. Determine how many members of cluster1 could plausibly fall within the distribution of cluster2, and vice versa. To do that we use the formula Z1 = |x1 − m2|/s2, where x1 is the PM value of a member of cluster1, and m2 and s2 are the mean and standard deviation of cluster2; the corresponding calculation is performed for the members of cluster2 (Z2 = |x2 − m1|/s1) (Table 2). In this way we obtain the standardized normal score, usually denoted Z and often called a Z-score, for each member of both clusters. The score follows the standard normal distribution N(0,1), so the corresponding statistical table gives the cumulative probability that a particular value x1 of cluster1 belongs to the distribution N(m2, s2) of cluster2. We used Z1 ≥ 2.576 (P ≤ 0.01) to indicate 99% probability that the value does not belong to the other cluster; otherwise it is treated as a missing datum.
5. This is repeated for all members of both clusters, and the total number of missing data is calculated (Table 2). More missing data per probe means less divergence between the clusters and a weaker chance that the particular probe is a real marker. To form the preliminary set of candidate markers we accepted only those probes that had no more than one missing individual out of 41, i.e. probes that could be sharply divided into two practically non-overlapping clusters.
6. For the preliminary set of markers, check whether the parents are consistently different in all three replicates, as in the case of Contig5061_at probe 7 (Table 2). Here, the three replicates of St belong to cluster1 and the three replicates of Mx to cluster2. Consequently, the 17 DH lines of cluster1 may be assigned the St allele at that locus and the 17 DH lines of cluster2 the Mx allele; one DH line (DH74) could not be genotyped (see Note 1).
7. The final two approaches to verifying the TDMs involve mapping them and constructing haplotypes (graphical genotypes) for all chromosomes of the DH lines. If a linkage map exists for the particular cross, the simplest and most efficient way is to incorporate the TDMs into the map with the MapManager QTX software, using the “Distribute” option. If the experiment was designed for an unknown cross, a small set of SNP markers is recommended to act as anchors to identify and orient each chromosome. The SNP anchor markers and TDMs are combined in one set and assigned to linkage groups using a minimal LOD of 3.0. A mapping procedure is strongly recommended as a final proof of the identified set of TDM markers (see Note 2).
8. The haplotypes of the DH lines are checked against the resulting linkage map, and all TDMs producing more than 2% single-marker double recombinants are excluded.
9. The remaining occasional double recombinants are readily detected and replaced by ‘missing’ genotypes.

Table 1 K-mean clustering results for each probe of Contig5061_at taken as an example

[Table body not recoverable from the source layout. Rows: probes Contig5061_at1 to Contig5061_at11. Columns: M1, M2, M3, S1, S2, S3, DH116, DH12, DH13, DH…, Cluster1, Cluster2. Each cell gave the log-scale PM value with the assigned cluster in brackets, e.g. 8.39 (2); the Cluster1 and Cluster2 columns gave the fitted distributions, e.g. N(5.0, 0.63).]

The intensity signals of the probes are on a logarithmic scale. The cluster to which each value belongs is indicated in brackets beneath. The two rightmost columns give the mean and standard deviation of each cluster. Shaded and bold are the Morex (Mx) and Steptoe (St) alleles visually recognizable among doubled haploid (DH) progeny lines for probe 7.

Table 2 Z-scores (absolute values) for cluster members of two probes of Contig5061_at: Contig5061_at7 meets the criterion for a transcript-derived marker (TDM); Contig5061_at2 does not. Z-scores significant at P ≤ 0.01 (bold in the original) are marked with an asterisk.

Contig5061_at2: cluster1 N1(9.1, 0.16), cluster2 N2(9.5, 0.16)

Cluster1 member  PM    Z        Cluster2 member  PM    Z
Mx1              8.91  3.69*    Mx3              9.51  2.56
Mx2              9.26  1.50     St3              9.70  3.75*
St1              9.24  1.62     DH13             9.43  2.06
St2              9.24  1.62     DH130            9.51  2.56
DH116            8.73  4.81*    DH135            9.58  3.00*
DH12             8.90  3.75*    DH136            9.93  5.19*
DH141            9.20  1.88     DH140            9.34  1.50
DH152            9.13  2.31     DH146            9.83  4.56*
DH155            9.09  2.56     DH167            9.35  1.56
DH173            9.06  2.75*    DH169            9.55  2.81*
DH184            9.27  1.44     DH177            9.33  1.44
DH24             9.06  2.75*    DH177            9.38  1.75
DH27             8.77  4.56*    DH200            9.66  3.50*
DH29             9.29  1.31     DH22             9.53  2.69*
DH41             9.22  1.75     DH44             9.39  1.81
DH43             9.19  1.94     DH7              9.53  2.69*
DH46             9.28  1.38     DH73             9.35  1.56
DH61             9.09  2.56     DH79             9.45  2.19
DH64             9.04  2.88*    DH88             9.49  2.44
DH74             9.05  2.81*    DH89             9.42  2.00
DH85             9.00  3.13*

Total significant: 17. Total non-significant (missing): 24.

Contig5061_at7: cluster1 N1(5.0, 0.63), cluster2 N2(8.3, 0.79)

Cluster1 member  PM    Z        Cluster2 member  PM    Z
St1              4.69  4.57*    Mx1              8.89  6.17*
St2              4.81  4.42*    Mx2              7.07  3.29*
St3              4.66  4.61*    Mx3              8.68  5.84*
DH12             4.71  4.54*    DH116            8.67  5.83*
DH141            5.08  4.08*    DH13             8.55  5.63*
DH160            6.00  2.91*    DH130            8.16  5.02*
DH167            4.39  4.95*    DH135            8.86  6.13*
DH169            4.46  4.86*    DH136            7.84  4.51*
DH173            4.92  4.28*    DH140            8.99  6.33*
DH177            4.00  5.44*    DH146            7.25  3.57*
DH184            4.72  4.53*    DH152            6.83  2.90*
DH200            5.64  3.37*    DH155            8.80  6.03*
DH41             4.43  4.90*    DH22             6.75  2.78*
DH43             5.72  3.27*    DH24             8.72  5.90*
DH44             5.37  3.71*    DH27             9.59  7.29*
DH46             5.29  3.81*    DH4              8.77  5.98*
DH61             5.88  3.06*    DH63             8.60  5.71*
DH7              5.16  3.97*    DH73             7.60  4.13*
DH74             6.40  2.41     DH79             8.32  5.27*
DH88             4.17  5.23*    DH85             8.74  5.94*
DH89             4.76  4.48*

Total significant: 40. Total non-significant (missing): 1.

With a Z-score of less than 2.58, the corresponding member value is considered an unclassified (missing) datum.
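Steps 1 and 3 to 6 of this subheading can be sketched as a single R function applied to one probe. The values below are the 41 log-scale PM values of Contig5061_at probe 7 as listed in Table 2; the function name `screen.probe`, its arguments, and the choice to seed k-means at the two extreme values (which makes the result reproducible) are assumptions of this illustration rather than the authors' implementation.

```r
# Log-scale PM values of Contig5061_at probe 7, copied from Table 2.
x <- c(St1 = 4.69, St2 = 4.81, St3 = 4.66, Mx1 = 8.89, Mx2 = 7.07, Mx3 = 8.68,
       DH12 = 4.71, DH141 = 5.08, DH160 = 6.00, DH167 = 4.39, DH169 = 4.46,
       DH173 = 4.92, DH177 = 4.00, DH184 = 4.72, DH200 = 5.64, DH41 = 4.43,
       DH43 = 5.72, DH44 = 5.37, DH46 = 5.29, DH61 = 5.88, DH7 = 5.16,
       DH74 = 6.40, DH88 = 4.17, DH89 = 4.76, DH116 = 8.67, DH13 = 8.55,
       DH130 = 8.16, DH135 = 8.86, DH136 = 7.84, DH140 = 8.99, DH146 = 7.25,
       DH152 = 6.83, DH155 = 8.80, DH22 = 6.75, DH24 = 8.72, DH27 = 9.59,
       DH4 = 8.77, DH63 = 8.60, DH73 = 7.60, DH79 = 8.32, DH85 = 8.74)

screen.probe <- function(x, z.cut = 2.576, max.missing = 1) {
  # Step 1: k-means with k = 2, seeded at the minimum and maximum (assumption).
  cl <- kmeans(x, centers = matrix(range(x), ncol = 1))$cluster
  # Step 3: mean and standard deviation of each cluster.
  m <- c(mean(x[cl == 1]), mean(x[cl == 2]))
  s <- c(sd(x[cl == 1]), sd(x[cl == 2]))
  # Step 4: Z-score of every value against the other cluster.
  other <- 3 - cl                      # maps cluster 1 to 2 and 2 to 1
  z <- abs(x - m[other]) / s[other]
  # Step 5: values that cannot be excluded from the other cluster are missing.
  missing <- is.na(z) | z < z.cut
  # Step 6: all St replicates in one cluster, all Mx replicates in the other.
  st <- grepl("^St", names(x))
  mx <- grepl("^Mx", names(x))
  parents.ok <- length(unique(cl[st])) == 1 &&
    length(unique(cl[mx])) == 1 && cl[st][1] != cl[mx][1]
  allele <- ifelse(cl == cl[st][1], "St", "Mx")
  allele[missing] <- NA
  list(is.tdm = sum(missing) <= max.missing && parents.ok,
       n.missing = sum(missing), z = z,
       allele = allele[!(st | mx)])     # genotype calls for the DH lines only
}

res <- screen.probe(x)
print(res$is.tdm)
print(names(which(is.na(res$allele))))  # DH lines left without a genotype
print(table(res$allele))
```

Because the cluster means and standard deviations are computed here at full precision, individual Z-scores differ slightly from the rounded values printed in Table 2, but the classification is the same: one DH line (DH74) is left ungenotyped.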

Notes

1. A significant factor with TDMs is the nature of the polymorphisms they detect. According to our estimates, based on the known sequences of certain probes in St and Mx, 35% of the TDMs identify probes with a nucleotide polymorphism in the particular gene (SFPs), while 65% do not and simply mark extreme variation in gene expression (GEMs). This raises the important question of whether or not the 65% that are GEMs actually reflect the location of the genes, and so can be used to localize the corresponding genes. In other words, do the marked loci represent cis- or trans-regulating factors? For the St/Mx population we were able to compare the predicted TDM genotypes for 30 DH lines with the SNP genotypes for 203 genes, and found that 95% of genes match exactly or are wrong for just 1 or 2 out of 30 lines, while 5% fail for more than 10% of the lines (10). When we tried to map the nine poorly fitting TDMs that do not match the SNP genotypes, we found that seven map easily elsewhere in the genome. Significantly, two of them map to the precise position occupied by the SNP identified in a different mapping population, Oregon/Wolfe, and hence could indicate duplicate genes. Another TDM coincided perfectly with the corresponding trans-eQTL (LOD = 16). We conclude, therefore, that ~5% of TDMs could be due to duplicate genes or chance sequence alignments with RNA from elsewhere, or may be the product of polymorphic trans-acting regulators. We would expect this from our approach: GEMs were identified only if they showed a distinctly bimodal distribution in the DH-line gene expression data. Such GEMs (contigs) would show the highest LOD scores in an eQTL analysis. It was recently reported that cis-eQTLs generally have much greater LOD scores than trans-acting eQTLs (11). At a genome-wide significance of P < 0.05, 60–65% of the eQTLs were regulated in trans in two tissues of rats, whereas at a higher significance level (P < 10⁻⁴), 85–100% of eQTLs were regulated in cis (12). Based on these reports and our results, we assume that with the criteria established, trans-factors have a lower chance of being selected than cis-factors causing allele-specific expression.
2. To construct a TDM-based genetic linkage map for the St/Mx cross we used JoinMap version 3.0 (13). The SFP markers were assigned to linkage groups using anchor markers with a minimal LOD of 3.0. Next, the mapping procedure consisted of adding loci one by one, starting from the most informative pair of loci. For each added locus, the best position was searched by comparing the goodness-of-fit of the resulting map for each tested position. The quality of the resulting maps was estimated by the probability of loci averaged over individuals [locus averages −log10(P)]. This probability may indicate a number of possible genotyping errors recognized by double recombinants. Loci with the lowest probability are iteratively removed after each round of mapping. The map was considered acceptable when the probability [expressed as −log10(P)] did not exceed 0.20 which, with 30 DH lines, meant that individual loci had no more than one line with a double crossover involving that gene.
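The double-recombinant check used to clean the haplotypes can be sketched as follows. A single-marker double recombinant is taken here to be a call that differs from both flanking markers while the flanking markers agree with each other. The toy genotype matrix, the function name `flag.double.recombinants` and the reading of the 2% threshold as a fraction of DH lines per marker are assumptions of this illustration.

```r
# Toy haplotypes (hypothetical): rows = markers in map order on one
# chromosome, columns = DH lines; "A" and "B" are the parental alleles.
geno <- rbind(m1 = c("A", "A", "B", "B"),
              m2 = c("A", "B", "B", "B"),
              m3 = c("A", "A", "B", "A"),
              m4 = c("A", "A", "B", "A"))
colnames(geno) <- paste0("DH", 1:4)

flag.double.recombinants <- function(geno) {
  n <- nrow(geno)
  dr <- matrix(FALSE, n, ncol(geno), dimnames = dimnames(geno))
  if (n < 3) return(dr)                # no interior markers to test
  inner <- 2:(n - 1)
  up   <- geno[inner - 1, , drop = FALSE]
  mid  <- geno[inner,     , drop = FALSE]
  down <- geno[inner + 1, , drop = FALSE]
  hit <- mid != up & mid != down & up == down
  hit[is.na(hit)] <- FALSE             # missing calls are never flagged
  dr[inner, ] <- hit
  dr
}

dr <- flag.double.recombinants(geno)
rate <- rowMeans(dr)                   # fraction of lines flagged per marker

# Replace flagged calls by missing values, then drop markers above the
# tolerated rate (threshold interpretation is an assumption, see above).
max.rate <- 0.02
cleaned <- geno
cleaned[dr] <- NA
cleaned <- cleaned[rate <= max.rate, , drop = FALSE]
print(rate)
print(rownames(cleaned))
```

In this toy example only marker m2 in line DH2 is flagged; with four lines its rate exceeds the threshold, so the marker is removed rather than recoded.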

Acknowledgement

This research was supported by a research grant from the Biotechnology and Biological Sciences Research Council (BBSRC) of the United Kingdom.

References

1. Ronald, J., Akey, J.M., Whittle, J., Smith, E.N., Yvert, G., and Kruglyak, L. (2005) Simultaneous genotyping, gene-expression measurement, and detection of allele-specific expression with oligonucleotide arrays. Genome Res. 15, 284–291.
2. West, M.A.L., Leeuwen, H., Kozik, A., Kliebenstein, D.K., Doerge, R.W., Clair, D.A., and Michelmore, R.W. (2006) High-density haplotyping with microarray-based expression and single feature polymorphism markers in Arabidopsis. Genome Res. 16, 787–795.
3. DeCook, R., Lall, S., Nettleton, D., and Howell, S.H. (2006) Genetic regulation of gene expression during shoot development in Arabidopsis. Genetics 172, 1155–1164.
4. Rostoks, N., Borevitz, J.O., Hedley, P.E., Russell, J., Mudie, S., Morris, J., Cardle, L., Marshall, D.F., and Waugh, R. (2005) Single-feature polymorphism discovery in the barley transcriptome. Genome Biol. 6, R54.
5. Cui, X., Xu, J., Asghar, R., Condamine, P., Svensson, J.T., Wanamaker, S., Stein, N., Roose, M., and Close, T.J. (2005) Detecting single-feature polymorphisms using oligonucleotide arrays and robustified projection pursuit. Bioinformatics 21, 3852–3858.
6. Brem, R.B., Yvert, G., Clinton, R., and Kruglyak, L. (2002) Genetic dissection of transcriptional regulation in budding yeast. Science 296, 752–755.
7. Kleinhofs, A., Kilian, A., Saghai-Maroof, M.A., Biyashev, R.M., Hayes, P., Chen, F.Q., Lapitan, N., Fenwick, A., Blake, T.K., Kanazin, V., et al. (1993) A molecular, isozyme and morphological map of the barley genome. Theor. Appl. Genet. 86, 705–712.
8. Caldo, R.A., Nettleton, D., and Wise, R.P. (2004) Interaction-dependent gene expression in Mla-specified response to barley powdery mildew. Plant Cell 16, 2514–2528.
9. Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., and Speed, T.P. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2), 249–264.
10. Luo, Z.W., Potokina, E., Druka, A., Wise, R., Waugh, R., and Kearsey, M.J. (in press) Robust, high density genotyping from gene-expression data in species with un-sequenced genomes. Genetics.
11. Yamashita, S., Wakazono, K., Nomoto, T., Tsujino, Y., Kuramoto, T., et al. (2005) Expression quantitative trait loci analysis of 13 genes in the rat prostate. Genetics 171, 1231–1238.
12. Hubner, N., Wallace, C.A., Zimdahl, H., Petretto, E., Schulz, H., et al. (2005) Integrated transcriptional profiling and linkage analysis for identification of genes underlying disease. Nat. Genet. 37, 243–253.
13. Van Ooijen, J.W. and Voorrips, R.E. (2001) JoinMap® 3.0. Software for the calculation of genetic linkage maps. Plant Research International, Wageningen, The Netherlands.

Chapter 6

Methods for Functional Proteomic Analyses

Christof Rampitsch and Natalia V. Bykova

Summary

The term ‘Proteomics’ was introduced in 1997 to describe a growing interest in the study of the proteome – the expressed protein set of an organism. As this new discipline evolved, it quickly became obvious that proteomics would be a very complex and ambitious undertaking, perhaps even more so than genomics, which had engendered it. New techniques for both the separation and analysis/identification of proteins were emerging or being refined, and these facilitated the development of this new field. Many proteomics experiments are now routine in some laboratories. In this chapter we describe a typical proteomics experiment, using examples from our laboratory: the separation of complex mixtures of proteins by 2-dimensional electrophoresis and subsequent identification of a protein spot by mass spectrometry with two commonly used instruments: MALDI-QqTOF and ESI-ion trap.

Key words: Plant proteomics, Two-dimensional electrophoresis, Mass spectrometry.

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513 © Humana Press, a part of Springer Science + Business Media, LLC 2009 DOI: 10.1007/978-1-59745-427-8_6

1. Introduction

The simplest and oldest method for producing a two-dimensional (2-D) array of separated proteins representing a proteome is 2-D gel electrophoresis (2-DE) (1, 2). Most commonly this technique combines isoelectric focusing (IEF) and denaturing polyacrylamide gel electrophoresis (SDS-PAGE) to resolve protein mixtures by isoelectric point (pI) in the first dimension and by molecular mass (Mr) in the second. To be separated by 2-DE, proteins must first be purified to some degree and brought into a solution that is compatible with IEF. This is not complicated, especially if some losses are acceptable. A widely used procedure is acetone/TCA precipitation (3), which eliminates many cellular contaminants and leaves an acetone powder rich in proteins. Many variations and alternative approaches have been published, especially for plants; a popular one is extraction with phenol followed by methanol precipitation (4, 5). To solubilize precipitated proteins, a typical IEF solution contains water, urea, a non-ionic detergent such as 3-[(3-cholamidopropyl) dimethylammonio]-1-propanesulfonate (CHAPS), dithiothreitol (DTT) and ampholytes. This works well for many samples, but for plants particular attention must also be paid to eliminating, or at least mitigating, the presence of phytochemicals such as tannins and polyphenols, which interfere with protein stability or integrity. These are dealt with on a case-by-case basis with IEF-compatible additives (5), and thus even ‘difficult’ tissues such as wood and pine needles can yield well-resolved 2-D gels (6). Widely used additives are thiourea, protease inhibitors, other non-ionic detergents, polyvinyl polypyrrolidone and antioxidants. Nonetheless, the recovery of certain proteins, especially membrane-bound hydrophobic proteins, is inevitably compromised because IEF on immobilized gradient strips tolerates neither salts (>50 mM NaCl) nor ionic detergents (>0.1% (w/v) SDS).

IEF is now nearly always performed on immobilized gradient strips, which are available commercially in many pH ranges or can be prepared in-house (7). Their advantage is run-to-run reproducibility; however, the older dynamic gradient gels run in polyacrylamide tubes should not be dismissed, since they tolerate elevated salt and detergent without loss of resolution and are reproducible if care is taken (8). The second dimension is most commonly SDS-PAGE as originally described by Laemmli (9). The result of 2-DE is a physical array of as many as 5,000 resolved proteins (2) (see Fig. 1). Many proteomics experiments rely exclusively on 2-DE for protein separation.

Fig. 1. Coomassie blue-stained gel showing soluble proteome of wheat callus separated by IEF (pH 4–7) and SDS-PAGE (12%). Proteins indicated by arrows were phosphorylated in vivo (10). Figure reprinted from Rampitsch et al. (10). Copyright 2006, with permission from Elsevier.

Mass spectrometry (MS) is now the technology of choice for the identification of gel-separated proteins using rapidly growing sequence databases (11). Two principal methods are commonly used with MS for protein identification. Both rely on cleavage of the isolated sample with a digestion agent such as trypsin and on introduction of the sample into the mass spectrometer as peptide ions in the gas phase. The mass spectrometer determines the mass-to-charge ratio (m/z) of peptide or protein ions that are generated by electrospray ionization (ESI) or matrix-assisted laser desorption/ionization (MALDI) sources. ESI is a ‘soft’ ionization method that generates ions from peptide and protein solutions through the vapourization of liquid in an electric field, whereas MALDI is a ‘soft’ ionization technique that produces ions through pulsed ultraviolet laser irradiation of crystalline deposits of peptides and proteins. Proteins with a full-length sequence present in a database can be identified with high certainty and high throughput using the accurate masses obtained by MALDI peptide mass fingerprinting (PMF) after a single MS analysis (Figs. 2A and 3B, C). Simple protein mixtures can also be deciphered by MALDI PMF (12).
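The idea behind peptide mass fingerprinting can be illustrated with a short R sketch: digest a sequence in silico with trypsin, compute the singly protonated monoisotopic mass of each peptide, and match observed peaks within a mass tolerance. The residue masses are standard monoisotopic values; the function names, the 20 ppm tolerance and the test sequence (two peptides from Table 1 joined end to end) are assumptions of this example, and residue modifications such as cysteine alkylation are deliberately ignored.

```r
# Monoisotopic residue masses (Da); water and proton for [M+H]+.
residue <- c(G = 57.02146, A = 71.03711, S = 87.03203, P = 97.05276,
             V = 99.06841, T = 101.04768, C = 103.00919, L = 113.08406,
             I = 113.08406, N = 114.04293, D = 115.02694, Q = 128.05858,
             K = 128.09496, E = 129.04259, M = 131.04049, H = 137.05891,
             F = 147.06841, R = 156.10111, Y = 163.06333, W = 186.07931)
water <- 18.01056
proton <- 1.00728

# Trypsin rule used here: cleave after K or R unless the next residue is P.
tryptic.digest <- function(protein) {
  cut <- gsub("([KR])(?!P)", "\\1 ", protein, perl = TRUE)
  strsplit(trimws(cut), " ", fixed = TRUE)[[1]]
}

# [M+H]+ of an unmodified peptide.
peptide.mh <- function(peptide) {
  aa <- strsplit(peptide, "")[[1]]
  sum(residue[aa]) + water + proton
}

# For each observed peak, name the closest theoretical peptide within tol.ppm.
match.peaks <- function(observed, theoretical, tol.ppm = 20) {
  sapply(observed, function(mz) {
    ppm <- abs(theoretical - mz) / theoretical * 1e6
    if (min(ppm) <= tol.ppm) names(theoretical)[which.min(ppm)] else NA_character_
  })
}

protein <- "STNEALLVLEAYRGVFTFVCR"     # hypothetical test sequence
peptides <- tryptic.digest(protein)
theoretical <- sapply(peptides, peptide.mh)
print(round(theoretical, 3))

# Two made-up peaks: one close to a theoretical mass, one unrelated.
observed <- c(theoretical[["GVFTFVCR"]] + 0.005, 1500.0)
hits <- match.peaks(observed, theoretical)
print(hits)
```

Search engines score the whole set of matched peaks against every database entry; this sketch shows only the mass arithmetic and the matching step.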

Fig. 2. Identification of protein in spot 2 (protein identification as in Table 1). (A) Single MS MALDI-QqTOF spectrum of spot 2 protein digest. (B) An MS/MS spectrum of the peptide precursor m/z 985.503, with an amino acid sequence deduced de novo. The peptide sequence derived from y-ion (C-terminal) series is shown; in addition the most prominent b-ions (with b0 as a result of water loss) and the diagnostic phenylalanine immonium ion m/z 120.083 are indicated. Three peptides were matched to the protein sequence. (C) The three best scoring alignments of the queried peptide sequences and corresponding homologous peptides from the database entry produced by the conventional Basic Local Alignment Search Tool (BLAST) search engine. Figure reprinted from Rampitsch et al. (10). Copyright 2006, with permission from Elsevier.

Fig. 3. Identification of a site of phosphorylation for the protein in spot 9 by MALDI-QqTOF MS and MS/MS analysis. (A) Two-dimensional resolution of highly phosphorylated (spot 9) and moderately phosphorylated (spot 5) protein forms by IEF 2D PAGE (protein identification as in Table 1). (B) and (C) MALDI-QqTOF MS peptide mass mapping analysis of the protein spots. Ion peaks at m/z 1698.714 and 1742.738 corresponding to phosphopeptides are indicated with an asterisk in panel B and are not present in the spectrum of the moderately phosphorylated form, panel C. The mass difference of 80 Da between the peaks at m/z 1698.714 (panel B) and m/z 1618.749 (panel C) is indicative of phosphorylation. K indicates contaminating keratin peptide peaks often seen with lower sample protein amount. (D) and (E) Tandem MS sequencing analysis obtained using collision-induced dissociation (CID) of candidate phosphopeptides at m/z 1698.714 and 1742.738, respectively. The peaks denoted b* correspond to the phosphorylated fragment b-ion series, which exhibited further fragmentation by β-elimination of phosphoric acid (–98 Da) from the phosphoserine residue and contain a dehydroalanyl residue instead. The internal yb fragment ion at m/z 599.25 corresponding to GSATNW* and containing phosphoserine is shown. Im[W] indicates the tryptophan immonium ion. Figure reprinted from Rampitsch et al. (10). Copyright 2006, with permission from Elsevier.

If no conclusive identification is achieved using this approach, the protein digest should be analysed by tandem MS (MS/MS) peptide fragmentation (Figs. 2B and 3D, E), using either MALDI or nano-ESI. Tandem MS analysis produces data that allow highly specific database searches, so that proteins only partially present in a database, or relevant clones in an EST database, can be identified (Fig. 2C). Furthermore, proteins not present in a database that are strongly homologous to a known protein can be identified (Table 1). It is important to point out that there is no need to determine the complete sequence of a peptide in order to search a database: a short sequence stretch of three to four amino acid residues provides enough search specificity when combined with the mass of the intact peptide and the masses of the corresponding fragment ions in a peptide sequence tag. However, the interpretation of mass spectra for protein identification requires proteomics-specific bioinformatics software. Mascot and SEQUEST are the leading search engines, but others are available; they use the same principle, comparing the experimentally observed fragment ions with those that would be expected for every known peptide sequence that could be generated from the known proteome of the organism under investigation (13).

Despite the success of ongoing genomic sequencing projects, the demand for de novo peptide sequencing has not been eliminated (14, 15). Long and accurate peptide sequences are required for protein identification by homology search and for the cloning of new genes. The presence of a continuous series of mass spectrometric fragment ions containing either the C terminus (y-type ions) or the N terminus (b-type ions) has been used successfully to determine de novo sequences from fragment ion spectra of peptides from a tryptic digest (Figs. 2B and 3D, E). The peptide sequence can be deduced by calculating the precise mass difference between adjacent y- or b-ions; this requires instruments that acquire tandem mass spectra with very high mass resolution without compromising sensitivity, such as QqTOF, TOF-TOF, FTICR or Orbitrap. These features also make it possible and practical to apply selective isotopic labelling of the peptide C-terminal carboxyl group in order to distinguish y-ions from other fragment ions in the tandem mass spectra (16, 17).
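Reading a sequence from the mass differences between adjacent fragment ions can be sketched in R as follows. The example computes the singly charged b- and y-ion series of an unmodified peptide and then recovers residues from the y ladder. The function names and the 0.01 Da tolerance are assumptions of this illustration; real spectra contain gaps, noise and modified residues that this sketch does not handle, and leucine and isoleucine cannot be told apart by mass.

```r
# Monoisotopic residue masses (Da); water and proton for singly charged ions.
residue <- c(G = 57.02146, A = 71.03711, S = 87.03203, P = 97.05276,
             V = 99.06841, T = 101.04768, C = 103.00919, L = 113.08406,
             I = 113.08406, N = 114.04293, D = 115.02694, Q = 128.05858,
             K = 128.09496, E = 129.04259, M = 131.04049, H = 137.05891,
             F = 147.06841, R = 156.10111, Y = 163.06333, W = 186.07931)
water <- 18.01056
proton <- 1.00728

# b ions hold the N terminus, y ions the C terminus (b1..b(n-1), y1..y(n-1)).
fragment.ions <- function(peptide) {
  m <- unname(residue[strsplit(peptide, "")[[1]]])
  n <- length(m)
  list(b = cumsum(m)[-n] + proton,
       y = cumsum(rev(m))[-n] + water + proton)
}

# Recover residues from a complete y ladder by matching mass differences.
read.y.ladder <- function(y, tol = 0.01) {
  steps <- c(y[1] - water - proton, diff(y))
  lookup <- residue[names(residue) != "I"]   # I and L are isobaric; report L
  aa <- sapply(steps, function(d) {
    i <- which.min(abs(lookup - d))
    if (abs(lookup[i] - d) <= tol) names(lookup)[i] else "?"
  })
  paste(rev(aa), collapse = "")              # y1 is the C-terminal residue
}

ions <- fragment.ions("GVFTFVCR")            # peptide taken from Table 1
print(round(ions$y, 3))
print(read.y.ladder(ions$y))
```

The y series from y1 to y(n−1) determines every residue except the N-terminal one, which follows from the precursor mass; this is why a complete ladder together with an accurate intact mass is sufficient for de novo sequencing.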


Table 1. Identification of wheat callus phosphoproteins using ‘Mascot’ and de novo sequencing in combination with BLAST. Table reproduced from Rampitsch et al. (10)

| No. (a) | Putative identity | ID | Taxonomy | MS/MS-MASCOT and conventional BLAST (b) |
|---|---|---|---|---|
| 5 | OSJNBb0017I01.8 | XP_474367.1 GI:50929679 | O. sativa | STNEALLVLEAYR (Mascot); DFHAAHPADAFSTSFGGGAALACVAAQPR (79% identity) |
| | EST AZO2 | CD864746.1 GI:32548562 | T. aestivum | STNEALLVLEAYR (Mascot); SYAPFPPGCMFHSEGGLK (Mascot) |
| | EST AZO3 | CD874031.1 GI:32557847 | T. aestivum | STNEALLVLEAYR (Mascot); DFHAAHPADAFSTSFGGGAALACVAAQPR (Mascot) |
| 2 | 60S acidic ribosomal protein P3 | NP_194319.1 GI:15236029 | A. thaliana | QHQGELESAADGPYDLKR (de novo); GVFTFVCR (de novo); VSPNSALFQVVLGQS AGLPGGGAGNGAAA (part, de novo) |
| 9 | OSJNBb0017I01.8 | XP_474367.1 GI:50929679 | O. sativa | STNEALLVLEAYR (Mascot); DFHAAHPADAFSTSFGGGAALACVAAQPR (79% identity) |
| | EST wlm96 | CA684816.1 GI:25272614 | T. aestivum | SYAPFPPGCMFHSEGGLK (Mascot) |
| | EST AZO2 | CD864746.1 GI:32548562 | T. aestivum | VG[pS]ATNWAAAWDDAAI (Mascot); STNEALLVLEAYR (Mascot); SYAPFPPGCMFHSEGGLK (Mascot); VG[pS]ATNWAATWDEAAI (Mascot) |
| | EST AZO3 | CD874031.1 GI:32557847 | T. aestivum | STNEALLVLEAYR (Mascot); DFHAAHPADAFSTSFGGGAALACVAAQPR (Mascot) |

(a) Spot numbers correspond to 2-D gels in Fig. 1.
(b) Proteins were identified by MS/MS analysis and ‘Mascot’ search of MS/MS spectra followed by de novo interpretation of unmatched spectra with BLAST. All identifications met statistical confidence criteria according to ‘Mascot’ and BLAST scoring schemes.
BLAST, basic local alignment search tool.
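The precursor masses quoted for the two phosphopeptides in Table 1 can be checked against their sequences. The following Python sketch is illustrative only and is not part of the published workflow. It assumes standard monoisotopic residue masses, singly protonated ions and one phosphate group (HPO3, +79.966 Da) per peptide; the function name `mh_plus` is hypothetical.

```python
# Standard monoisotopic residue masses (Da) for the residues in the examples.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "T": 101.04768, "I": 113.08406, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "E": 129.04259, "W": 186.07931,
}
WATER = 18.010565    # H2O added to the residue sum of a free peptide
PROTON = 1.007276    # charge carrier for a singly protonated ion
PHOSPHO = 79.96633   # HPO3 added per phosphorylated residue


def mh_plus(sequence, n_phospho=0):
    """Return the monoisotopic [M+H]+ m/z of a peptide (charge 1+ assumed)."""
    residue_sum = sum(RESIDUE_MASS[aa] for aa in sequence)
    return residue_sum + WATER + PROTON + n_phospho * PHOSPHO


if __name__ == "__main__":
    for seq in ("VGSATNWAAAWDDAAI", "VGSATNWAATWDEAAI"):
        print(seq, round(mh_plus(seq), 3), round(mh_plus(seq, n_phospho=1), 3))
```

Under these assumptions the first sequence gives about m/z 1618.75 unmodified and 1698.72 with one phosphate, and the second about 1742.74 with one phosphate, in line with the values quoted in the legend to Fig. 3.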

The high accuracy, sensitivity and dynamic range of modern tandem mass spectrometers enable peptides to be sequenced and their post-translational modifications (PTMs) to be identified. Mutually exclusive PTMs and heterogeneous modifications at distinct amino-acid residues lead to further complexity at the protein level (18). More than 200 different types of PTM have been characterized and new ones are regularly reported (19). As PTMs alter the molecular mass of proteins and are usually present at substoichiometric levels, their mapping, identification and characterization often present formidable analytical challenges (10, 20, 21). Many post-translationally modified peptides generate distinctive modification-specific signals in MS/MS experiments, including loss of the PTM from the intact peptide (neutral loss) or other ion signals characteristic of the PTM moiety. An example of tandem MS-based identification of two highly similar phosphopeptides, with assignment of the phosphorylation sites, in a protein spot from a 2-D gel is shown in Fig. 3D, E. MS is also often coupled to microfluidic, nanolitre-flow high-performance liquid chromatographic (LC) systems that separate proteins and peptides prior to MS analysis. MS/MS for the amino acid sequencing of individual peptides relies on the automated, mass-specific selection and collision-induced dissociation of peptide ions inside a mass spectrometer (Fig. 4).

Below we describe a typical proteomics experiment. Proteins are extracted from wheat tissue and separated by 2-DE; spots of interest are excised, digested and analyzed either by MALDI-QqTOF MS/MS or by LC-MS. Two search programs are used to identify the excised proteins. An outline is shown in Fig. 5.
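The modification-specific neutral loss described in this section can be screened for in a peak list. The sketch below is a hedged example rather than a published algorithm. It assumes a centroided list of (m/z, intensity) pairs from one MS/MS spectrum, a known precursor charge, and loss of phosphoric acid (H3PO4, 97.9769 Da); the function name, the tolerance and the intensity threshold are hypothetical choices.

```python
H3PO4 = 97.9769  # monoisotopic mass of phosphoric acid (Da)


def find_neutral_loss(peaks, precursor_mz, charge=1, loss=H3PO4,
                      tol=0.02, min_rel_intensity=0.05):
    """Return peaks consistent with loss of a neutral from the precursor.

    peaks: iterable of (m/z, intensity) pairs from one MS/MS spectrum.
    A neutral loss lowers the precursor m/z by loss / charge.
    """
    peaks = list(peaks)
    if not peaks:
        return []
    base_intensity = max(intensity for _, intensity in peaks)
    expected_mz = precursor_mz - loss / charge
    return [
        (mz, intensity)
        for mz, intensity in peaks
        if abs(mz - expected_mz) <= tol
        and intensity >= min_rel_intensity * base_intensity
    ]


if __name__ == "__main__":
    # Invented peak list for a singly charged precursor at m/z 1698.714.
    spectrum = [(599.25, 120.0), (1600.74, 800.0), (1698.71, 300.0)]
    print(find_neutral_loss(spectrum, precursor_mz=1698.714))
```

A hit of this kind only flags a candidate phosphopeptide; the site assignment still depends on the sequence-specific fragment ions.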

Fig. 4. Automated data-dependent LC-MS/MS separation and identification of a protein spot from the total wheat seed proteome 2-D gel map. A prepared tryptic digest was introduced into the Finnigan LTQ (Thermo Electron, San Jose, CA) mass spectrometer using an online C18 reverse-phase nano-column via a nano-flow HPLC (UltiMate 3000, Dionex) for peptide separation. (A) The base peak chromatogram displays the intensities of the most intense ions in all survey MS scans performed during the analysis. (B) An example of the survey MS scan acquired at 24.54 min retention time (RT) during peptide elution with a 40 min gradient of 2–80% acetonitrile. (C) and (D) Tandem MS fragmentation obtained in information-dependent acquisition mode using collision-induced dissociation (CID) of the corresponding precursor ions at m/z 681.13 and 980.14, respectively, which are also shown in (B). The sequence-specific fragment ions allowed unambiguous identification of the peptide sequences using the Mascot search engine (v. 2.0.01, Matrix Science, UK).


Fig. 5. A flowchart showing the four principal steps used in the proteomics experiments described in this chapter.

2. Materials

2.1. Protein Extraction and Separation

Use the highest grade of chemicals available for all procedures (except where mentioned). Water should have a resistivity of at least 18 MΩ·cm, and all solutions should be freshly prepared. The procedure also requires equipment for first- and second-dimension electrophoresis, and general laboratory equipment. In our lab, we use a MultiphorII unit (GE Healthcare) for the first dimension and an Ettan Dalt 6 unit (GE Healthcare) for the second.

1. Acetone containing 10% (w/v) TCA, 0.07% (w/v) dithiothreitol (DTT).
2. Acetone containing 0.07% (w/v) DTT.
3. IEF solution: 7 M urea, 3 M thiourea, 2% (w/v) CHAPS, 20 mM DTT, 0.5% (v/v) ampholyte (BioRad: BioLyte 3–10, at 40% stock).
4. Strip equilibration solution 1: 50 mM Tris-HCl pH 8.8, 6 M urea, 30% (v/v) glycerol, 2% (w/v) SDS, 1% (w/v) DTT.
5. Strip equilibration solution 2: as solution 1, but replace DTT with 2.5% (w/v) iodoacetamide.
6. Solutions for Laemmli SDS-PAGE, see (9).
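For convenience, the amounts needed for a given volume of IEF solution can be worked out from the listed concentrations. The Python sketch below is an illustration only. The molecular weights (urea 60.06, thiourea 76.12, DTT 154.25 g/mol) are standard values that are not stated in the protocol, the ampholyte is assumed to be measured as 0.5% (v/v) of the supplied stock, and the function name is hypothetical.

```python
# Molar components: name -> (mol/L from the recipe, g/mol assumed standard value)
MOLAR = {
    "urea": (7.0, 60.06),
    "thiourea": (3.0, 76.12),
    "DTT": (0.020, 154.25),
}
CHAPS_PERCENT_W_V = 2.0       # g per 100 ml
AMPHOLYTE_PERCENT_V_V = 0.5   # ml of supplied stock per 100 ml (assumed reading)


def ief_solution_amounts(volume_ml):
    """Return the amount of each component for volume_ml of IEF solution."""
    amounts = {}
    for name, (molarity, mol_weight) in MOLAR.items():
        # grams = mol/L * g/mol * L
        amounts[f"{name} (g)"] = molarity * mol_weight * volume_ml / 1000.0
    amounts["CHAPS (g)"] = CHAPS_PERCENT_W_V * volume_ml / 100.0
    amounts["ampholyte (ml)"] = AMPHOLYTE_PERCENT_V_V * volume_ml / 100.0
    return amounts


if __name__ == "__main__":
    for component, amount in ief_solution_amounts(10).items():
        print(f"{component}: {amount:.4f}")
```

Because urea at 7 M occupies a large part of the final volume, the solids are normally dissolved in less water than the target volume and the solution is then made up to volume.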


2.2. In-Gel Digestion

For general contamination precautions, see Note 1.

1. 100 mM ammonium bicarbonate (NH4HCO3, HPLC grade) in Milli-Q water.
2. Acetonitrile (HPLC grade).
3. 10 mM DTT in 100 mM NH4HCO3.
4. 55 mM iodoacetamide in 100 mM NH4HCO3.
5. 50% (v/v) acetonitrile in 50 mM NH4HCO3.
6. 0.5 M CaCl2 in water.
7. Digestion buffer: 10 ml containing 100 mM NH4HCO3, 10% acetonitrile and 2.5 mM CaCl2. (Make fresh before digestion. Use the 0.5 M CaCl2 stock solution and add it last to the mixed buffer to avoid precipitation.)
8. Stock solution of modified sequencing grade trypsin (Promega, Fisher Scientific, Pittsburgh, USA) at 0.1 μg/μl in 1 mM HCl (see Note 2).
9. 5% (v/v) formic acid in water.
10. 1% (v/v) formic acid, 5% (v/v) acetonitrile in water.
11. 1% (v/v) formic acid, 60% (v/v) acetonitrile in water.
12. 1% (v/v) formic acid in 99% (v/v) acetonitrile (make fresh prior to use).
13. Benchtop Eppendorf Centrifuge 5415R (Brinkmann Instruments, Mississauga, Canada).
14. Incubator, heating blocks or water bath capable of maintaining 56 and 37°C.

2.3. Purification and Concentration Prior to MS Analysis

1. 5% (v/v) formic acid in water.
2. 5% (v/v) formic acid, 50% (v/v) acetonitrile in water.
3. Reversed-phase packing SelfPack POROS 20 R2 (Applied Biosystems, Foster City, CA) suspended in 50% (v/v) methanol in a ratio of about 30 μl resin to 1 ml methanol solution.
4. Eppendorf GELoader™ Tips, 1–10 μl or 1–20 μl (each has a flexible 15 mm capillary with a defined diameter of less than 0.3 mm).
5. Matrix solution for MALDI analysis: 15 mg 2,5-dihydroxybenzoic acid (DHB) in 100 μl of 50% (v/v) acetonitrile in 5% aqueous formic acid.
6. Precoated borosilicate nano-ES spray capillaries (Proxeon Biosystems, Odense, Denmark).
7. Nano-ES purification needle holders (Proxeon Biosystems, Odense, Denmark).
8. Mini Centrifuge Galaxy Mini C1213 (VWR International, Mississauga, Canada).


3. Methods

3.1. Protein Extraction

1. Harvest fresh plant tissue directly into liquid nitrogen and grind to a fine powder with a mortar and pestle.
2. Weigh 0.6 g of ground tissue into a 15 ml glass centrifuge tube.
3. While vortexing, add 8 ml of acetone, 10% (w/v) TCA, 0.07% (w/v) DTT at –20°C.
4. Incubate at –20°C for a minimum of 1.5 h (or overnight).
5. Centrifuge at 12,000 g for 20 min at –5°C.
6. To the pellet, add 8 ml of acetone, 0.07% (w/v) DTT at –20°C while vortexing, and centrifuge as before.
7. Repeat the wash for a total of six to eight times to remove all traces of TCA. The final wash may be left overnight at –20°C.
8. Centrifuge at 12,000 g. If the pellet is still green from chlorophyll, one or two additional washes may reduce this.
9. Dry the precipitate with a very gentle stream of nitrogen gas delivered through a Pasteur pipette.
10. Store the dried powder at –80°C until required.
11. To the dried acetone powder add ~200 μl of IEF solution.
12. Use a glass rod to mix the powder and buffer, adding more sample buffer as required (typically 4 ml).
13. Sonicate the sample five times for 5 s in a water bath set to 22°C. Heating the sample above 28°C may result in protein carbamylation, and chilling below ~20°C will result in urea precipitation.
14. Centrifuge for 30 min at 30,000 g.
15. Repeat the centrifugation if any particulate material is still present. If a solid pellet does not result after centrifugation, the sample should be filtered through a siliconized glass wool pad and re-centrifuged.
16. Remove a portion for Bradford (or other) protein analysis. Starting with acetone powder from 0.9 g fresh weight of wheat leaf tissue will require ~4 ml of sample buffer. This will give (on average) 5 μg/3 μl of sample using the ‘micro’ Bradford assay (BioRad).
17. Once the protein content is determined, add a small amount of bromophenol blue powder and mix.
18. Aliquot the sample into 200 μg per 500 μl and store at –80°C.
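The centrifugation steps above are given as relative centrifugal force (× g), whereas many centrifuges are set in rpm. The conversion depends on the rotor radius, which the protocol does not specify. The 10 cm radius used in the sketch below is an assumed example and the function names are hypothetical. The code uses the standard relation RCF = 1.118 × 10^-5 × r (cm) × rpm^2.

```python
import math

RCF_CONSTANT = 1.118e-5  # converts radius (cm) * rpm^2 to multiples of g


def rpm_for_rcf(rcf, radius_cm):
    """Rotor speed (rpm) needed to reach a given RCF at a given radius."""
    return math.sqrt(rcf / (RCF_CONSTANT * radius_cm))


def rcf_for_rpm(rpm, radius_cm):
    """RCF (x g) produced by a given rotor speed at a given radius."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


if __name__ == "__main__":
    radius_cm = 10.0  # assumed rotor radius; check the rotor documentation
    for rcf in (12000, 30000):
        print(f"{rcf} g at {radius_cm} cm: {rpm_for_rcf(rcf, radius_cm):.0f} rpm")
```

The radius should be taken from the rotor manufacturer's data, since the same rpm setting gives a different force in a different rotor.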


3.2. Two-Dimensional Gel Electrophoresis

This procedure is written for use with a MultiphorII IEF unit (GE Healthcare) and an Ettan Dalt 6 electrophoresis unit (GE Healthcare), following the manufacturer’s instructions, with 24 cm IEF strips loaded by passive in-gel rehydration. A 24 cm strip requires 450 μl of IEF buffer containing ~600 μg protein (this can be optimized).

1. Samples should be centrifuged, preferably at 90,000 g, prior to use.
2. Pipette the sample into a clean rehydration tray (BioRad or GE Healthcare reswelling tray).
3. Peel off the protective plastic layer from the IEF strip.
4. Lay the strip (gel side down) onto the sample. Take care to remove any bubbles and ensure that the entire sample is in good contact with the strip.
5. Overlay the strip with mineral oil (Dry Strip Cover Fluid: GE Healthcare).
6. Place at 20°C for 12–18 h.
7. Remove the strip from the oil and rinse with 200 μl of water.
8. Blot excess water with five layers of damp Whatman filter paper.
9. Place the strip (gel side up) into the IEF apparatus (MultiphorII: GE Healthcare).
10. Place damp electrode papers over the ends of the gel. These papers should be wetted and then blotted with Whatman paper to remove excess water.
11. Place the electrodes over the paper strips.
12. Overlay with cover fluid and start the IEF using the ramped programme suggested below. The total is ~35 kVh, but this may be increased, especially if horizontal streaking is present in the final 2-D gel. The current should not exceed 50 μA/strip at any point during the run. High current is indicative of salt contamination.

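| Step | Voltage | Current | Power | Time |
|---|---|---|---|---|
| Step 1 | 0–250 V | 1 mA | 2 W | 1 h 45 min |
| Step 2 | 250 V | 1 mA | 2 W | 1 h 30 min |
| Step 3 | 250–1,200 V | 1 mA | 2 W | 3 h 00 min |
| Step 4 | 1.2 kV | 1 mA | 2 W | 1 h 30 min |
| Step 5 | 1.2–3 kV | 1 mA | 2 W | 7 h 00 min |
| Step 6 | 3 kV | 1 mA | 2 W | 5 h 00 min |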
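The ~35 kVh total quoted for the ramped IEF programme above can be reproduced by summing the volt-hours of the six steps. The sketch below is an approximation under two stated assumptions: voltage changes linearly during the ramp steps, and the set voltage is actually reached (that is, the run is never limited by the current or power settings). The data structure and function name are hypothetical.

```python
# (start voltage in V, end voltage in V, duration in h) for each programme step
IEF_PROGRAMME = [
    (0, 250, 1.75),      # Step 1: ramp
    (250, 250, 1.5),     # Step 2: hold
    (250, 1200, 3.0),    # Step 3: ramp
    (1200, 1200, 1.5),   # Step 4: hold
    (1200, 3000, 7.0),   # Step 5: ramp
    (3000, 3000, 5.0),   # Step 6: hold
]


def volt_hours(programme):
    """Sum volt-hours, taking the mean voltage of each step (linear ramp)."""
    return sum((start + end) / 2.0 * hours for start, end, hours in programme)


if __name__ == "__main__":
    total_vh = volt_hours(IEF_PROGRAMME)
    print(f"Total: {total_vh / 1000:.1f} kVh")
```

With these assumptions the programme sums to a little over 34 kVh, consistent with the approximate total given in the text. Extending the final hold step is the simplest way to raise the total.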


13. After the run, strips can be stored frozen at –80°C, or equilibrated as follows.
14. Equilibrate each strip in two changes of 5 ml equilibration solution 1 for 8 min each.
15. Rinse the strip briefly in water and transfer to strip equilibration solution 2. Use two changes of 5 ml for 8 min each.
16. Prepare second-dimension SDS-polyacrylamide gels exactly as described by the manufacturer (GE Healthcare). For the gel in Fig. 1, 13% polyacrylamide gels were made. The percentage is chosen based on the desired resolution of proteins, with a 13% gel yielding good resolution of proteins from ~15 to 100 kDa.
17. Embed the strips in 0.5% (w/v) low-melt agarose prepared in PAGE running buffer, heated to 60°C and pipetted onto the surface of the polyacrylamide gel. Position the strip so that the acidic end is on the left side of the cassette.
18. Use a spatula to push the strip onto the gel surface and allow the agarose to set.
19. Once the agarose has set, insert the plates into the electrophoresis apparatus and start electrophoresis. Run the second dimension at 2.5 W per gel for the first 30 min, and then at 100 W, regardless of the number of gels. The total run time will be 4.75 h for six gels and 3.5 h for three gels. For overnight runs the total power should be set at 1–2 W per gel. It is advisable to stir the upper buffer periodically (hourly) during the run. Interrupt the power during this operation.
20. After the run, dismantle the gel apparatus and fix the gels in 12.5% (w/v) TCA (laboratory grade) for a minimum of 20 min.
21. Stain by slowly adding 27 ml of 1% (w/v) Coomassie brilliant blue R250 in 95% ethanol per 400 ml TCA. This stain may be reused until sensitivity is diminished.
22. Destain in distilled water overnight.

3.3. In-Gel Digestion of Protein Spots (See Note 3)

1. Cut the spots from the gel into cubes (1 mm³ in size) and transfer the gel pieces to 1.5 ml tubes. These can be kept at –20°C until further treatment.

3.3.1. Washing, Reduction and Alkylation of In-Gel Protein Spots

2. Clean the scalpel first in 50% methanol and then in Milli-Q water after every protein spot. Cut spots on a sterile part of Parafilm in a Petri dish and use new Parafilm for every new spot.
3. Wash the gel pieces with 200 μl water, vortex for 10 min, centrifuge at 3,000 g for 2 min and discard the supernatant.
4. Wash the gel pieces with 200 μl 100 mM NH4HCO3, vortex for 10 min.


5. Add 200 μl acetonitrile, vortex for 10 min, centrifuge at 3,000 g for 2 min.
6. Remove all liquid and dry in a vacuum centrifuge for 5 min (not longer).
7. Add 100 μl of 100 mM NH4HCO3, 10 mM DTT (there should be enough reducing buffer to cover the gel pieces completely; if not, increase the volume accordingly), incubate for 45 min at 56°C, cool to room temperature for 5–10 min, and centrifuge at 3,000 g for 2 min.
8. Replace the solution with 55 mM iodoacetamide (10 mg/ml) in 100 mM NH4HCO3; incubate at room temperature in the dark for 30 min with occasional vortexing (see Note 4).
9. Centrifuge at 3,000 g for 2 min and remove all liquid; wash the gel pieces with 200 μl 100 mM NH4HCO3, vortex for 10 min.
10. Add 200 μl acetonitrile, vortex for 10 min, centrifuge at 3,000 g for 2 min and remove all liquid.
11. Repeat the washing 2 × 5 min with 200 μl of 50% acetonitrile, 50 mM NH4HCO3.
12. Centrifuge the gel pieces down, remove all liquid, and dry in a vacuum centrifuge for 15 min.

3.3.2. Digestion with Trypsin

1. Make 10 ml of fresh digestion buffer.
2. Dissolve one vial of trypsin (Promega, modified, sequencing grade) in 200 μl of 1 mM HCl standard resuspension buffer (supplied by the manufacturer) to prepare a stock solution with a trypsin concentration of 0.1 μg/μl. Keep on ice until starting the reaction.
3. Calculate the total volume of trypsin digestion buffer needed to cover all gel pieces (typically 10–80 μl, depending on the gel spot size). Add trypsin stock solution to the digestion buffer to a final concentration of 12 ng/μl. Keep the trypsin stock solution and digestion buffer on ice at all times.
4. Add trypsin digestion buffer to the dry gel spots and rehydrate on ice for 30–40 min. After 15 min, check whether the buffer has been absorbed by the gel pieces. If so, add more buffer without enzyme, just enough to cover the gel pieces and keep them wet during digestion.
5. Close the lids of the tubes well to prevent evaporation and incubate at 37°C overnight.
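The volume of trypsin stock needed in step 3 follows from the dilution relation C1 × V1 = C2 × V2. The short Python sketch below is a worked example only; the function name is hypothetical and the 80 μl final volume is just the upper end of the range quoted above. Units must match between the stock and the final solution.

```python
def stock_volume(final_conc, final_volume, stock_conc):
    """Volume of stock needed so that final_volume has final_conc.

    Uses C1 * V1 = C2 * V2; the result is in the same unit as final_volume.
    """
    if final_conc > stock_conc:
        raise ValueError("final concentration cannot exceed stock concentration")
    return final_conc * final_volume / stock_conc


if __name__ == "__main__":
    # Trypsin: 0.1 ug/ul stock = 100 ng/ul; target 12 ng/ul in 80 ul.
    trypsin_ul = stock_volume(12, 80, 100)
    print(f"{trypsin_ul:.1f} ul trypsin stock + {80 - trypsin_ul:.1f} ul digestion buffer")

    # CaCl2 in the digestion buffer: 2.5 mM in 10 ml from a 0.5 M (500 mM) stock.
    print(f"{stock_volume(2.5, 10, 500) * 1000:.0f} ul CaCl2 stock per 10 ml buffer")
```

For 80 μl of digestion mix this gives 9.6 μl of trypsin stock, and for 10 ml of digestion buffer 50 μl of the 0.5 M CaCl2 stock.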

3.3.3. Extraction of Peptides from Gel Spots

1. Remove samples from 37°C, bring to room temperature, and centrifuge the gel pieces and liquid condensate down.
2. Add 50 μl of 5% formic acid, vortex strongly for 5 min, centrifuge at 3,000 g for 2 min and collect the supernatant into a fresh siliconized tube (avoid transferring the gel pieces).
3. Add 100 μl of 1% formic acid in 5% acetonitrile, vortex strongly for 15 min, centrifuge at 3,000 g for 2 min and collect the supernatant into the same fresh tube.
4. Add 100 μl of 1% formic acid in 60% acetonitrile, vortex strongly for 15 min, centrifuge at 3,000 g for 2 min and collect the supernatant into the same tube.
5. Add 50 μl of 1% formic acid in 99% acetonitrile, vortex strongly for 15 min, centrifuge at 3,000 g for 2 min and collect the supernatant into the same tube.
6. Dry down in a vacuum centrifuge. Dried extracts can be stored at –20°C until further analysis.

3.4. Purification of Peptides Prior to MALDI Analysis

1. Make a nano-column from an Eppendorf GELoader™ Tip: squeeze the tip to close it (or narrow it) and, using another pipette, fill the tip with 20 μl of 50% acetonitrile in 5% aqueous formic acid.
2. Pipette in about 5 μl of POROS 20 R2 reversed-phase packing material. Add 20 μl of 50% acetonitrile in 5% aqueous formic acid and press the liquid through the tip using a 1 ml syringe adapted to the tip with suitable tubing. Watch the column grow in the tip and stop when it has reached a length of ~5–10 mm (see Note 5).
3. Wash this nano-column again with 20 μl of 50% acetonitrile in 5% aqueous formic acid.
4. Equilibrate the column with 2 × 20 μl of 5% formic acid; finally leave a few μl of liquid in the column.
5. Redissolve the dried digest in 20 μl of 5% formic acid, vortex for 10 min, centrifuge at 16,000 g for 20 min, and carefully retain the supernatant (see Notes 6 and 7).
6. Add the sample to the column. For small and/or weak Coomassie spots and for silver-stained spots, use the whole of the extracted sample.
7. After adding the sample, load it by pressing all the liquid through with the syringe.
8. Wash the bound peptides with 3 × 20 μl of 5% formic acid, and after the last washing step pass all the liquid through.
9. Add 2 μl of DHB matrix solution in 50% acetonitrile in 5% aqueous formic acid, connect the 1 ml plastic syringe and shake the solution down so that the elution buffer moves at once to the top of the column. Press the liquid through with the syringe and elute the sample in small drops onto the MALDI target.
10. Deposit external calibration solution (a mixture of synthetic peptides dissolved in matrix solution) on the MALDI target and proceed with mass spectra acquisition on an instrument configured with a MALDI ion source (Q-TOFs, ion traps or other instruments, depending on availability).

3.5. Purification of Peptides Prior to Direct Nano-ESI MS/MS Analysis

1. Perform steps 1–8 as described in Subheading 3.4.
2. Align the nano-ES spray capillary in the nano-ES purification needle holders for spinning in the benchtop minicentrifuge.
3. Cut the purification Eppendorf GELoader™ Tip (with the nano-column containing bound and washed peptides) 2 mm above the thin conical part and insert it into the nano-ES spray capillary fixed in the centrifuge.
4. To elute the peptides, add 1–3 μl of 50% (v/v) acetonitrile in 5% aqueous formic acid solution into the cut purification tip with the nano-column. Centrifuge briefly at 3,000 g.
5. Take the purification tip out and check whether the eluate is in the nano-ES spray capillary.
6. If the sample has eluted, discard the purification tip, mount the nano-ES spray capillary with the sample into the nano-ESI ion source and acquire mass spectra (see Note 8).

4. Notes

1. Fresh 50–100 ml stocks of water, NH4HCO3 buffer, formic acid and acetonitrile should be used for the preparation of each new series of samples. Dust from the laboratory environment rapidly accumulates in solutions and reagents, resulting in massive contamination of samples with human and sheep keratins and/or polymeric detergents. This makes sequencing exceedingly difficult, and sometimes impossible with very small amounts of sample. Gloves (without talcum powder) should be worn at all times during sample preparation. Perform all operations in a laminar flow hood to preserve a dust-free environment. The solutions for extraction should be made fresh in tubes that are stable to acetonitrile and formic acid (siliconized Eppendorf tubes or suitable Falcon tubes).
2. Sequencing Grade Modified Trypsin (Promega) is supplied as 100 μg in total, in five vials of 20 μg lyophilized powder each; store at –20°C (for a maximum of 12 months). Specific activity ≥5 U/μg protein. Dissolve one vial in 200 μl of 1 mM HCl (resuspension buffer included) to prepare a 0.1 μg/μl stock solution. Excess trypsin stock solution can be stored frozen at –20°C in 20 μl aliquots for 1–2 months. Thaw each aliquot only once, just before preparation of the digestion buffer.
3. The described in-gel digestion protocol is applicable without modification to spots or bands excised from 1- or 2-D PAGE gels stained with Coomassie brilliant blue R-250 or G-250. For silver-stained gels, an MS-compatible silver staining protocol is recommended (22). The major concern when silver staining is followed by microanalytical protein characterization is that the reagents used to improve staining sensitivity and contrast must not modify proteins covalently. Thus, treatment of gels with crosslinking reagents (such as glutaraldehyde) or strong oxidizers (such as chromates or permanganates) should be avoided. In addition, silver-stained gel spots require a destaining step using potassium ferricyanide in sodium thiosulphate prior to washing, reduction and alkylation. Otherwise, the protocol is applicable without changes.
4. It is important to perform the iodoacetamide treatment for no longer than 30 min to prevent overalkylation of the samples (23).
5. The method allows femtomole-level MS/MS sequencing of peptides from unseparated peptide mixtures (24). The desalted and concentrated sample is eluted from the column in a small volume and can be used with both MALDI and nano-ESI MS (25).
6. Centrifugation is necessary to spin down small gel pieces or other particles, which can contaminate the extract and result in column blockage and sample loss.
7. Storage of reconstituted material will lead to performance loss (e.g. oxidation of Met residues in peptides); however, if necessary, store at –80°C. Avoid multiple freeze-thaw cycles or exposure to frequent temperature changes.
8. Alternatively, extracted peptide mixtures can be analysed directly (without prior purification and concentration) using an on-line system with a nano- (or micro-) HPLC directly interfaced to a tandem mass spectrometer (nano-LC/MS analysis). This increases resolution and sensitivity by purifying the peptides, separating them in time and sequencing them in a single run.


References

1. O’Farrell, P.H. (1975) High resolution two-dimensional electrophoresis of proteins. J. Biol. Chem. 250, 4007–4021.
2. Görg, A., Weiss, W., and Dunn, M.J. (2004) Current two-dimensional electrophoresis technology for proteomics. Proteomics 4, 3665–3685.
3. Damerval, C. (1986) Technical improvements in two-dimensional electrophoresis increases the level of genetic variation detected in wheat seedling proteins. Electrophoresis 7, 52–54.
4. Hurkman, W.J. and Tanaka, C.K. (1986) Solubilization of plant membrane proteins for analysis by two-dimensional electrophoresis. Plant Physiol. 81, 802–806.
5. Saravanan, R.S. and Rose, J.K.C. (2004) A critical evaluation of sample extraction techniques for enhanced proteomic analysis of recalcitrant plant tissues. Proteomics 4, 2522–2532.
6. Vâlcu, C.-M. and Schlink, K. (2006) Efficient extraction of proteins from woody plant samples for two-dimensional electrophoresis. Proteomics 6, 4166–4175.
7. Westermeier, R. (2001) Electrophoresis in Practice, 3rd Ed. Wiley VCH, Weinheim.
8. Fernando, D.D. (2005) Characterization of pollen tube development in Pinus strobus (Eastern white pine) through proteomic analysis of differentially expressed proteins. Proteomics 5, 4917–4926.
9. Laemmli, U.K. (1970) Cleavage of structural proteins during the assembly of the head of bacteriophage T4. Nature 227, 680–687.
10. Rampitsch, C., Bykova, N.V., Mauthe, W., Yakandawala, N., and Jordan, M. (2006) Phosphoproteomic profiling of wheat callus labelled in vivo. Plant Sci. 171, 488–496.
11. Aebersold, R. and Mann, M. (2003) Mass spectrometry-based proteomics. Nature 422, 198–207.
12. Jensen, O.N., Podtelejnikov, A.V., and Mann, M. (1997) Identification of the components of simple protein mixtures by high-accuracy peptide mass mapping and database searching. Anal. Chem. 69, 4741–4750.
13. Rappsilber, J. and Mann, M. (2002) What does it mean to identify a protein in proteomics? Trends Biochem. Sci. 27, 74–78.
14. Standing, K.G. (2003) Peptide and protein de novo sequencing by mass spectrometry. Curr. Opin. Struct. Biol. 13, 595–601.
15. Rampitsch, C., Bykova, N.V., McCallum, B., Beimcik, E., and Ens, W. (2006) Analysis of the wheat and Puccinia triticina (leaf rust) proteomes during a compatible host-pathogen interaction. Proteomics 6, 1897–1907.
16. Shevchenko, A., Chernushevich, I., Ens, W., Standing, K.G., Thomson, B., Wilm, M., et al. (1997) Rapid ‘de novo’ peptide sequencing by a combination of nanoelectrospray, isotopic labelling and a quadrupole/time-of-flight mass spectrometer. Rapid Commun. Mass Spectrom. 11, 1015–1024.
17. Shevchenko, A., Chernushevich, I., Wilm, M., and Mann, M. (2000) De novo peptide sequencing by nanoelectrospray tandem mass spectrometry using triple quadrupole and quadrupole/time-of-flight instruments. Methods Mol. Biol. 146, 1–16.
18. Bykova, N.V., Rampitsch, C., Krokhin, O., Standing, K.G., and Ens, W. (2006) Determination and characterization of site-specific N-glycosylation using MALDI-Qq-TOF tandem mass spectrometry: case study with a plant protease. Anal. Chem. 78, 1093–1103.
19. Jensen, O.N. (2006) Interpreting the protein language using proteomics. Nat. Rev. Mol. Cell Biol. 7, 391–403.
20. Bykova, N.V., Egsgaard, H., and Møller, I.M. (2003) Identification of 14 new phosphoproteins involved in important plant mitochondrial processes. FEBS Lett. 540, 141–146.
21. Bykova, N.V., Stensballe, A., Egsgaard, H., Jensen, O.N., and Møller, I.M. (2003) Phosphorylation of formate dehydrogenase in potato tuber mitochondria. J. Biol. Chem. 278, 26021–26030.
22. Shevchenko, A., Wilm, M., Vorm, O., and Mann, M. (1996) Mass spectrometric sequencing of proteins from silver stained polyacrylamide gels. Anal. Chem. 68, 850–858.
23. Lapko, V.N., Smith, D.L., and Smith, J.B. (2000) Identification of an artefact in the mass spectrometry of proteins derivatized with iodoacetamide. J. Mass Spectrom. 35, 572–575.
24. Wilm, M., Shevchenko, A., Houthaeve, T., Breit, S., Schweigerer, L., Fotsis, T., and Mann, M. (1996) Femtomole sequencing of proteins from polyacrylamide gels by nanoelectrospray mass spectrometry. Nature 379, 466–469.
25. Stensballe, A., Andersen, S., and Jensen, O.N. (2001) Characterization of phosphoproteins from electrophoretic gels by nano-scale Fe(III) affinity chromatography with off-line mass spectrometry analysis. Proteomics 1, 207–222.

Chapter 7

Stable Transformation of Plants

Huw D. Jones and Caroline A. Sparks

Summary

This chapter provides an overview of the main steps in the process to produce stably transformed plants. Most transformation methods use tissue culture to recover adult plants from regenerable explants and can be divided into three stages: (1) choice and preparation of explant tissue, (2) deoxyribonucleic acid (DNA) delivery, (3) callus induction/regeneration and selection. Each of these stages is introduced from a general perspective and a detailed protocol for our exemplar species, wheat, is given. We focus here on DNA delivery by particle bombardment as Agrobacterium-mediated transformation methods for wheat are reported elsewhere (29).

Key words: Transformation, Particle bombardment, Explant, Tissue culture, Selection, DNA delivery, Transgene, Wheat.

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513 © Humana Press, a part of Springer Science + Business Media, LLC 2009 DOI: 10.1007/978-1-59745-427-8_7

1. Introduction

Genetic transformation underpins a range of specific research methods for identifying genes and studying their function in planta. It also allows the direct manipulation of specific traits via introduction of novel genes into locally adapted germplasm. A range of research strategies that incorporate transformation as a component are in common use. In model plant species, complementation of mutants and populations tagged with T-DNAs or heterologous transposons are proving uniquely useful for identifying and validating the function of genes and promoters [see recent reviews (1, 2)]. The availability of strongly constitutive, tissue-specific or inducible promoter sequences and small interfering ribonucleic acid (siRNA) technology is facilitating highly targeted over-expression and precise down-regulation of candidate genes. In addition, fluoro- or colorimetric reporter genes, matrix attachment regions, epitope tags or targeting sequences are increasingly incorporated into transgene cassettes to study gene expression, organelle morphology or protein trafficking.

Plant genetic transformation involves two distinct stages: the delivery of DNA into the nucleus of a competent cell and the recovery of fertile plants from that transformed cell. In a few, mainly model, species, methods have been developed to target transformation to zygotic, gametic or pre-gametic cells, such that transgenic plants can be identified at seed germination. Such ‘in planta’ or ‘germ line’ methods are well developed for Arabidopsis (3, 4) and have also been demonstrated in Medicago truncatula (5) and Brassica campestris (pakchoi) (6). There are also two recent reports of rice and wheat transformation using an in planta method (7, 8). However, the vast majority of plant transformation, especially in crop species, is still done via regeneration of adult plants through a callus phase in tissue culture, utilising the remarkable plasticity and totipotency of somatic plant cells (9).

This chapter provides an overview of the main steps in the plant transformation process. It is divided into the following sections: choice of explant, DNA delivery, callus induction/regeneration and selection. Each section also provides a detailed protocol for that step using wheat as an exemplar species.

2. Materials

2.1. Donor Plants for Explant Tissues

The condition of wheat donor plants (Triticum aestivum L.) is critical to successful transformation. In order to provide healthy plants of consistent quality, plants are grown as follows (see Notes 1, 2):

1. Soil: 75% fine-grade peat, 12% screened sterilised loam, 10% 6 mm screened lime-free grit, 3% medium vermiculite, 2 kg Osmocote Plus/m³ (slow-release fertiliser, 15N/11P/13K plus micronutrients), 0.5 kg PG mix/m³ (14N/16P/18K granular fertiliser plus micronutrients) (Petersfield Products, Leicestershire, UK).
2. Five plants per 21-cm diameter plastic pot [Nursery Trades (Lea Valley) Ltd., Hertfordshire, UK]. Plants are stripped to leave five tillers per plant once plants are 6–8 weeks old.
3. Vernalisation of winter wheat varieties is carried out at 4–5°C for 8 weeks from sowing.
4. Growth room conditions: 18–20°C day and 14–15°C night temperatures under a 16 h photoperiod provided by banks of 400 W hydrargyrum quartz iodide (HQI) lamps (Osram Ltd., Berkshire, UK) to give an intensity of ~700 μmol/m²/s photosynthetically active radiation (PAR).
5. Watering: Initially all plants are top watered in order to monitor water requirements and thereby provide sufficient water without waterlogging. An automated flooding system is used once the root system reaches the base of the pot.
6. Pests and disease: These are kept to a minimum by restricting access to growth rooms and following good housekeeping practices. Any diseased plants are discarded immediately. To avoid mildew, the fungicide Fortress (DOW Agrosciences Ltd., Hertfordshire, UK) is applied as a preventative. Amblyseius caliginosus [Nursery Trades (Lea Valley) Ltd.] is used as a biological control agent to manage thrips.
7. Sterilising agents: 70% (v/v) aqueous ethanol, 10% (v/v) aqueous Domestos (Lever Fabergé Ltd., Surrey, UK) and sterile water (see Note 3).

2.2. Stock Solutions and Callus Induction Medium

Solutions 1–9 below are the recipes for stock solutions of basal culture media components from which the final callus induction media (solutions 10 and 11) are prepared (see Notes 3, 4).

2.2.1. Stock Solutions of Basal Culture Media Components

1. MS Macrosalts (×10): 16.5 g/L NH4NO3 (Fisher Scientific UK, Leicestershire, UK), 19.0 g/L KNO3 (Fisher Scientific UK), 1.7 g/L KH2PO4 (Fisher Scientific UK), 3.7 g/L MgSO4·7H2O (Fisher Scientific UK), 4.4 g/L CaCl2·2H2O (Fisher Scientific UK) (see Note 5). Autoclave at 121°C for 20 min and store at 4°C (see Note 6).
2. L7 Macrosalts (×10): 2.5 g/L NH4NO3, 15.0 g/L KNO3, 2.0 g/L KH2PO4, 3.5 g/L MgSO4·7H2O, 4.5 g/L CaCl2·2H2O (see Note 5). Autoclave at 121°C for 20 min and store at 4°C (see Note 6).
3. L7 Microsalts (×1,000): 15.0 g/L MnSO4 (Fisher Scientific UK) (see Note 7), 5.0 g/L H3BO3 (Fisher Scientific UK), 7.5 g/L ZnSO4·7H2O (Fisher Scientific UK), 0.75 g/L KI (Fisher Scientific UK), 0.25 g/L Na2MoO4·2H2O (VWR International Ltd., Leicestershire, UK), 0.025 g/L CuSO4·5H2O (Fisher Scientific UK), 0.025 g/L CoCl2·6H2O (Sigma-Aldrich, Dorset, UK). Prepare 100 ml at a time. Filter sterilise (see Note 8) and store at 4°C (see Note 6).
4. 3AA Amino acids (×25): 18.75 g/L L-Glutamine (Sigma-Aldrich), 3.75 g/L L-Proline (Sigma-Aldrich), 2.5 g/L L-Asparagine (Sigma-Aldrich). Store solution at −20°C in 40 ml aliquots (see Note 6).
5. MS Vitamins (-Glycine) (×1,000): 0.1 g/L Thiamine HCl (Sigma-Aldrich), 0.5 g/L Pyridoxine HCl (Sigma-Aldrich), 0.5 g/L Nicotinic acid (Sigma-Aldrich). Prepare 100 ml at a time. Filter sterilise (see Note 8) and store at 4°C (see Note 6).


6. L7 Vitamins/Inositol (×200): 40.0 g/L myo-Inositol (Sigma-Aldrich), 2.0 g/L Thiamine HCl, 0.2 g/L Pyridoxine HCl, 0.2 g/L Nicotinic acid, 0.2 g/L Ca-Pantothenate (Sigma-Aldrich), 0.2 g/L Ascorbic acid (Sigma-Aldrich). Store at −20°C in 10 ml aliquots (see Note 6).
7. 2,4-Dichlorophenoxyacetic acid (2,4-D) (Sigma-Aldrich): 1 mg/ml in ethanol/water (dissolve powder in ethanol then add water to volume). Mix well. Filter sterilise (see Note 8) and store at −20°C in 1 ml aliquots (see Note 6).
8. Silver nitrate (AgNO3) solution (Sigma-Aldrich): 20 mg/ml in water. Mix well. Filter sterilise (see Note 8) and aliquot into 1 ml volumes. Store at −20°C in the dark (see Notes 6, 9).
9. Agargel (×2) (Sigma-Aldrich): Prepare in 400 ml volumes at 10 g/L and sterilise by autoclaving at 121°C for 20 min. Store at room temperature and melt in a microwave before use (see Note 10).

2.2.2. Callus Induction Media

10. MSS 3AA/2 9%S (×2): 200 ml/L MS macrosalts, 2 ml/L L7 microsalts, 20 ml/L ferrous sulphate chelate solution (×100) (Sigma-Aldrich), 2 ml/L MS vitamins (-Glycine), 200 mg/L myo-Inositol (Sigma-Aldrich), 40 ml/L 3AA amino acids (see Note 11), 180 g/L (9% final concentration) sucrose (Fisher Scientific UK) (see Note 12). Adjust pH to 5.7 with 5 M NaOH or KOH. Osmolarity should be within the range 800–1,100 mOsM. Filter sterilise (see Note 8) and store at 4°C (see Note 6).
11. MS9%0.5DAg: Mix an equal volume of MSS 3AA/2 9%S (×2) with sterilised, melted agargel (×2). Add 0.5 mg/L 2,4-D (see Note 13) and 10 mg/L AgNO3 and pour into 9-cm diameter Petri dishes (Bibby Sterilin Ltd., Staffordshire, UK) (~28 ml per dish). Store at 4°C in the dark (see Notes 9, 10, 14).
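The stock volumes listed for the ×2 medium follow from the concentration factor of each stock. The Python sketch below is an illustration only and is not part of the published protocol: the function name and the dictionary are hypothetical, and the target factor of 1 for the amino acids rests on the assumption that "3AA/2" denotes half-strength amino acids in the final (×1) medium.

```python
# Sketch: ml of concentrated stock needed per litre of medium.
# Stock factors are taken from Subheading 2.2.1; names are illustrative.

def stock_ml_per_litre(stock_factor, target_factor):
    """Return ml of stock per litre of medium to reach target_factor."""
    return 1000.0 * target_factor / stock_factor

# name: (stock concentration factor, target factor in the x2 medium)
components = {
    "MS macrosalts": (10, 2),
    "L7 microsalts": (1000, 2),
    "Ferrous sulphate chelate": (100, 2),
    "MS vitamins (-Glycine)": (1000, 2),
    "3AA amino acids": (25, 1),  # assumption: half strength after 1:1 mixing
}

for name, (stock, target) in components.items():
    print(f"{name}: {stock_ml_per_litre(stock, target):g} ml/L")
```

Run as written, the loop should reproduce the 200, 2, 20, 2 and 40 ml/L volumes given in item 10; the same function can be reused if a stock is prepared at a different concentration factor.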

2.3. Particle Bombardment

1. Gold particles: 0.6 μm (sub-micron) gold particles (BIO-RAD Laboratories, Hertfordshire, UK) (see Note 15).
2. Macro-carriers, stopping screens, 650 psi rupture discs (all BIO-RAD Laboratories) (see Note 16).
3. 2.5 M Calcium chloride (Fisher Scientific UK): Dissolve 3.67 g CaCl2·2H2O in 10 ml water. Mix well/vortex. Filter sterilise (see Note 8) and store at −20°C in 50 μl aliquots (see Note 6).
4. 0.1 M Spermidine free-base (Sigma-Aldrich): Prepare 1 M stock from powder in sterile water and maintain at −80°C in 20 μl aliquots. Prepare the 0.1 M working solution by making a 1:10 dilution of 1 M stock in sterile water under sterile conditions. Mix well, aliquot in 10 μl volumes and store immediately at −20°C (see Note 17).


5. Plasmid DNA: 1 mg/ml in sterile Tris-EDTA (ethylenediaminetetraacetic acid) (TE) buffer or sterile water, prepared using Qiagen Maxi-prep kit (Qiagen Ltd., West Sussex, UK). Store in 20 μl aliquots at −20°C (see Note 18).

2.4. Regeneration and Selection Media

Solutions 1–4 below are the stock solutions required as additions to the regeneration media (solutions 5 and 6) and selection media (solutions 7 and 8) (see Notes 3, 4). For stock solutions of basal culture media components, see Subheading 2.2.1.

1. Zeatin-mixed isomers (Sigma-Aldrich): 10 mg/ml in HCl/water (dissolve powder in a small volume of 1 M HCl and make up to volume with water). Mix well/vortex. Filter sterilise (see Note 8) and store at −20°C in 1 ml aliquots (see Note 6).
2. Copper sulphate (CuSO4) solution (Sigma-Aldrich): 2.5 g CuSO4·5H2O in 100 ml water (0.1 M). Mix well/vortex. Filter sterilise (see Note 8) and store at 4°C in 1 ml aliquots (see Note 6).
3. Glufosinate ammonium (Greyhound Chromatography and Allied Chemicals, Cheshire, UK) (synthetic PPT – see Note 19): 10 mg/ml in water. Mix well/vortex. Filter sterilise (see Note 8) and store at −20°C in 1 ml aliquots (see Note 6).
4. Geneticin disulphate (G418) (Melford Laboratories Ltd., Suffolk, UK) (see Note 20): 50 mg/ml in water. Mix well/vortex. Filter sterilise (see Note 8) and store at −20°C in 1 ml aliquots (see Note 6).
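The molar concentrations quoted for the copper sulphate stock (0.1 M) and for the calcium chloride stock in Subheading 2.3 (2.5 M) can be checked from the weighed masses. The sketch below is a hedged illustration: the function name is hypothetical and the molar masses are standard textbook values for the hydrated salts, not figures taken from this chapter.

```python
# Sketch: molarity from weighed mass, molar mass and final volume.

def molarity(mass_g, molar_mass_g_per_mol, volume_ml):
    """Return mol/L for mass_g of solute made up to volume_ml."""
    return mass_g / molar_mass_g_per_mol / (volume_ml / 1000.0)

CACL2_2H2O = 147.01  # g/mol, calcium chloride dihydrate (standard value)
CUSO4_5H2O = 249.69  # g/mol, copper sulphate pentahydrate (standard value)

# 3.67 g CaCl2.2H2O in 10 ml (Subheading 2.3, item 3)
print(round(molarity(3.67, CACL2_2H2O, 10), 2))
# 2.5 g CuSO4.5H2O in 100 ml (item 2 above)
print(round(molarity(2.5, CUSO4_5H2O, 100), 2))
```

Both values should round to the stated 2.5 M and 0.1 M; if a different hydrate is used, only the molar mass needs to change.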

2.4.1. Regeneration Media

5. R (×2): 200 ml/L L7 macrosalts, 2 ml/L L7 microsalts, 20 ml/L ferrous sulphate chelate solution (×100), 10 ml/L L7 vitamins/Inositol, 60 g/L maltose (Melford Laboratories Ltd.). Adjust pH to 5.7 with 5 M NaOH or KOH. Osmolarity should be within the range 269–298 mOsM. Filter sterilise (see Note 8) and store at 4°C (see Note 6).
6. RZDCu: Mix an equal volume of R (×2) with sterilised, melted agargel (×2). Add 5 mg/L zeatin, 0.1 mg/L 2,4-D and 100 μM CuSO4 (see Note 21) and pour into 9 cm Petri dishes (~28 ml/dish). Store at 4°C (see Notes 10, 14).

2.4.2. Selection Media

7. RZPPT4 or RZG50: Mix an equal volume of R (×2) with sterilised, melted agargel (×2) and add 5 mg/L zeatin and 4 mg/L glufosinate ammonium (PPT4) or 50 mg/L G418 (G50) (see Note 22). Pour into 9 cm Petri dishes (~28 ml/dish). Store at 4°C (see Notes 10, 14).
8. RPPT4 or RG50: Mix an equal volume of R (×2) with sterilised, melted agargel (×2) and add 4 mg/L glufosinate ammonium (PPT4) or 50 mg/L G418 (G50) (see Note 22). Pour into 9 cm Petri dishes (~28 ml/dish) or GA-7 Magenta vessels (Sigma-Aldrich) (~60 ml/vessel). Store at 4°C (see Notes 10, 14).
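The volumes of hormone and selection stocks to add per litre of final medium follow from C1·V1 = C2·V2. The sketch below is illustrative only; the function name and the table are hypothetical, while the stock and final concentrations are those given in Subheadings 2.2.1, 2.4, 2.4.1 and 2.4.2, converted to common units.

```python
# Sketch: volume of stock to add to a given volume of final medium.

def stock_volume_ml(final_conc, stock_conc, final_volume_ml=1000.0):
    """C1*V1 = C2*V2; both concentrations must share the same units."""
    return final_conc * final_volume_ml / stock_conc

# name: (final concentration, stock concentration), same units within a row
additions = {
    "zeatin (mg/L)": (5, 10_000),               # stock 10 mg/ml
    "2,4-D in RZDCu (mg/L)": (0.1, 1_000),      # stock 1 mg/ml
    "CuSO4 (uM)": (100, 100_000),               # stock 0.1 M
    "glufosinate ammonium (mg/L)": (4, 10_000), # stock 10 mg/ml
    "G418 (mg/L)": (50, 50_000),                # stock 50 mg/ml
}

for name, (final, stock) in additions.items():
    print(f"{name}: {stock_volume_ml(final, stock):g} ml per litre")
```

Because the ×2 base and the agargel are mixed 1:1, the additions are calculated on the final (×1) volume, not on the volume of R (×2) alone.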


3. Method

Choice of explant is highly species dependent and is also influenced by the DNA-delivery method. Dicotyledonous plants offer a broad range of suitable explants including leaf laminae and petioles, shoot meristems, cotyledonary nodes or immature cotyledons and embryogenic suspension cultures. Brassica and Solanaceae species are often transformed using hypocotyl segments or cotyledonary petioles. The most effective regeneration route for tobacco is via shoot organogenesis from leaf explants. The range of regenerable explants for monocotyledonous plants is more limited. Regeneration protocols for many cereals have been developed using immature embryos but alternatives include the immature inflorescence, leaf bases, shoot meristem cultures, protoplasts or mature seeds (9).

The following procedure describes the isolation and pre-culture of immature zygotic embryos from wheat in preparation for transformation. The method has been optimised for transformation of immature scutella of wheat (T. aestivum L.) (see Notes 23, 24).

3.1. Collection and Sterilisation of Wheat Caryopses

1. Collect spikes from growth room-grown plants at ~10–12 weeks after sowing: embryos at the correct stage are usually found ~12–16 days post-anthesis (see Note 25).
2. Remove the panicles to release the caryopses (see Note 26).
3. Surface sterilise the caryopses by soaking in 70% (v/v) aqueous ethanol for 5 min then 15–20 min in 10% (v/v) Domestos with occasional gentle shaking.
4. Rinse copiously with at least three changes of sterile water. Maintain the sterilised caryopses in moist conditions but do not keep immersed in water.

3.2. Isolation and Pre-Culture of Immature Scutella

1. Isolate the immature embryos microscopically in a sterile environment (see Fig. 1A) and remove the embryo axis to prevent precocious germination. Embryos are generally most responsive when ~0.5–1.5 mm long but there is genotypic variation (see Note 27).
2. Place 25–30 scutella per 9 cm Petri dish containing callus induction medium (MS9%0.5DAg), orientating them with the cut embryo axis in contact with the medium, such that the uncut scutellum side is bombarded (see Fig. 1B). The scutella should be arranged within the central target area of the plate (see Note 28).
3. Seal the plates with Nescofilm® (Fisher Scientific UK) and pre-culture prepared donor material for 1–2 days in the dark at 22°C (see Note 29).


Fig. 1. (A) Caryopsis dissected to reveal immature embryo. (B) Immature scutella isolated and plated for bombardment. (C) Embryogenic callus. (D) Plantlet regenerating on selection medium. (E) Transverse section of wheat seed expressing GUS (bottom), control seed (top). (F) Regenerated transformed plants in GM containment glasshouse. Scale bar = 1 mm approximately.

3.3. Transfer of DNA by Particle Bombardment

Physical or biological methods can be used to deliver DNA into a host cell, and the development of these methods has gone hand-in-hand with the choice and preparation of explants. The latter utilise interactions with Agrobacterium species or other bacterial or viral vectors; the former include procedures such as electroporation, polyethylene glycol (PEG) or calcium treatment, silicon carbide whiskers, microinjection, lasers or particle bombardment. The first transgenic cereals were made using electroporation of protoplasts (10–12), but the difficulty of maintaining embryogenic suspension cultures to produce protoplasts, and of regenerating plants from protoplasts, led to the adoption of other direct DNA-delivery methods, such as particle bombardment, that were adapted for intact cells or organised tissues. Particle bombardment was particularly successful for the routine transformation of cereals (13) and, along with Agrobacterium tumefaciens for both cereals and dicotyledonous species, now predominates. Below we give a detailed protocol for the preparation of DNA and its delivery into wheat embryos using the BIO-RAD PDS-1000/He particle gun (see Note 30).

3.3.1. Preparation of Gold Particles

1. Weigh 20 mg BIO-RAD sub-micron gold particles (0.6 μm) in a 1.5 ml Eppendorf and add 1 ml 100% ethanol. Sonicate for 2 min, pulse spin for 3 s in a microfuge and remove the supernatant. Repeat this ethanol wash twice more.
2. Add 1 ml sterile water and sonicate for 2 min. Pulse spin for 3 s in a microfuge and remove the supernatant. Repeat this step.
3. Re-suspend fully by vortexing in 1 ml sterile water. Aliquot 50 μl amounts into sterile 1.5 ml Eppendorf tubes, vortexing between taking each aliquot to ensure an equal distribution of particles. Store at −20°C.

3.3.2. Coating of Gold Particles with DNA for Bombardment

The following procedure should be carried out on ice, in a sterile environment.

1. Thaw a 50-μl aliquot of prepared gold at room temperature then sonicate for 1–2 min (see Note 31). To ensure total re-suspension, the tubes can be vortexed following sonication, particularly if the aliquots are to be sub-divided for smaller preparations (see Note 32).
2. Add 5 μl DNA (1 mg/ml in TE or water) (see Note 33) or water (see Note 34) and vortex briefly to ensure good contact of DNA with the particles (see Note 35).
3. Mix 50 μl 2.5 M CaCl2 and 20 μl 0.1 M spermidine in the lid of the Eppendorf then briefly vortex into the gold plus DNA solution (see Note 36).
4. Centrifuge at 13,000 rpm for 3–5 s in a microfuge to pellet the DNA-coated particles. Discard the supernatant.
5. Add 150 μl 100% ethanol to wash the particles, re-suspending them as fully as possible (see Notes 37, 38).
6. Centrifuge at 13,000 rpm for 3–5 s in a microfuge to pellet the particles and discard the supernatant.
7. Re-suspend fully in 85 μl 100% ethanol and maintain on ice (see Note 39).
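The quantities in this coating procedure, together with Notes 33, 35 and 39, reduce to a few simple calculations. The following Python sketch is a hedged illustration and not part of the published method: all function names are hypothetical, the plasmid sizes in the usage lines are invented for the example, and the per-shot figures are nominal values that assume no loss of gold, DNA or ethanol.

```python
# Sketch: DNA volume, equimolar co-bombardment split and nominal per-shot load.

def dna_volume_ul(conc_ug_per_ul, mass_ug=5.0):
    """Volume of plasmid prep needed to deliver mass_ug (Note 33)."""
    return mass_ug / conc_ug_per_ul

def equimolar_masses_ug(sizes_bp, total_ug=5.0):
    """Split total_ug so every plasmid contributes equal molecule numbers.

    For double-stranded DNA the mass per molecule is proportional to
    length, so each plasmid receives mass in proportion to its size.
    """
    total_bp = sum(sizes_bp)
    return [total_ug * size / total_bp for size in sizes_bp]

def per_shot(gold_mg=1.0, dna_ug=5.0, final_ul=85.0, shot_ul=5.0):
    """Nominal shots, ug gold per shot and ug DNA per shot (Note 39)."""
    shots = final_ul / shot_ul
    return shots, 1000.0 * gold_mg / shots, dna_ug / shots

print(dna_volume_ul(1.0))                 # 1 mg/ml prep, i.e. 1 ug/ul
print(equimolar_masses_ug([6000, 4000]))  # hypothetical 6 kb and 4 kb plasmids
print(per_shot())                         # a 50 ul aliquot holds 1 mg gold
```

The default of 1 mg gold per preparation follows from 20 mg gold re-suspended in 1 ml and dispensed as 50 μl aliquots. As Note 39 points out, evaporation usually reduces the practical yield to 10–12 shots, so the nominal per-shot values are lower bounds on the actual load.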

3.3.3. Particle Bombardment Using the PDS-1000/He Particle Gun [BIO-RAD]

The delivery system involves the use of high pressure to accelerate particles to high velocity. Appropriate safety precautions should be taken and safety spectacles should be worn when operating the gun. In any bombardment experiment, controls should be included to monitor regeneration and selection efficiencies (see Note 40).


1. The PDS-1000/He particle gun [BIO-RAD (see Fig. 2)] is used to deliver DNA-coated gold particles according to the manufacturer’s instructions. The following settings are maintained as standard for this procedure (see Note 41): target distance 5.5 cm (distance between stopping screen and target plate), stopping plate aperture 0.8 cm (distance between macro-carrier and stopping screen), gap 2.5 cm (distance between rupture disc and macro-carrier), vacuum 91.4–94.8 kPa, vacuum flow rate 5.0 and vent flow rate 4.5.
2. Sterilise the gun’s chamber and component parts by spraying with 90% (v/v) ethanol which should be allowed to evaporate completely (~5 min).
3. Sterilise rupture discs, stopping screens, macro-carriers and macro-carrier holders by dipping in 100% ethanol, and allow the alcohol to evaporate completely on a mesh rack in a flow hood (see Note 42). Place the dried macro-carrier holders into sterile 6 cm Petri dishes and mount one macro-carrier into each holder.
4. Briefly vortex the coated gold particles, take a 5 μl sample and drop centrally onto a macro-carrier membrane. Allow to dry naturally, not in the air-flow (see Note 43).
5. Load a rupture disc (see Note 16) into the rupture disc-retaining cap (see Fig. 2) and screw into place on the gas acceleration tube, tightening firmly using the mini torque wrench (see Note 44).
6. Place a stopping screen into the fixed nest. Invert the macro-carrier holder containing macro-carrier and gold particles/DNA and place over the stopping screen in the nest and maintain its position using the retaining ring. Mount the fixed nest assembly onto the second shelf from the top to give a gap of 2.5 cm (see Fig. 2).
7. Place a sample on the target stage on a shelf to give the desired distance; fourth shelf from the top gives a target distance of 5.5 cm.
8. Draw a vacuum of 91.4–94.8 kPa and fire the gun (see Note 45).
9. After firing, release the vacuum, remove the sample and disassemble the component parts, discarding the ruptured disc and macro-carrier (see Note 46).
10. Place the macro-carrier holder and stopping screen in 100% ethanol to re-sterilise if they are to be re-used for further shots, otherwise place in 1:10 dilution Savlon (Novartis Consumer Health, West Sussex, UK) to soak. Sonicate for 10 min prior to re-use (see Note 47).
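The gun settings above are quoted in mixed units (psi for rupture discs, kPa for the vacuum), whereas gauges are often graduated in other units. The sketch below, offered as an illustration only, converts between them and applies the rule in Note 44 for the helium cylinder setting. The function names are hypothetical, the conversion factors are standard values, and the list of disc ratings is copied from Note 16.

```python
# Sketch: pressure bookkeeping for a bombardment run.

PSI_TO_KPA = 6.894757   # 1 psi expressed in kPa
INHG_TO_KPA = 3.386389  # 1 inch of mercury expressed in kPa

# Rupture disc ratings listed in Note 16 (psi)
AVAILABLE_DISCS_PSI = (450, 650, 900, 1100, 1350, 1550, 1800, 2000, 2200)

def helium_regulator_psi(rupture_psi, margin_psi=200):
    """Cylinder setting: ~200 psi above the rupture pressure (Note 44)."""
    if rupture_psi not in AVAILABLE_DISCS_PSI:
        raise ValueError(f"no {rupture_psi} psi rupture disc is listed")
    return rupture_psi + margin_psi

def psi_to_kpa(psi):
    return psi * PSI_TO_KPA

def kpa_to_inhg(kpa):
    return kpa / INHG_TO_KPA

print(helium_regulator_psi(650))   # regulator setting in psi
print(round(psi_to_kpa(650)))      # rupture pressure in kPa
print([round(kpa_to_inhg(v), 1) for v in (91.4, 94.8)])  # vacuum in inHg
```

On these figures the standard 91.4–94.8 kPa vacuum corresponds to roughly 27–28 inches of mercury, which may be the more convenient form if the chamber gauge is graduated in inHg.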


Fig. 2. The PDS-1000/He-particle gun [BIO-RAD] (left) and diagram of component parts described in Subheading 3.3.3 (right).

3.4. Callus Induction, Regeneration, and Selection

The recovery of adult, fertile plants via a tissue-culture phase is integral to most transformation procedures. The main way to regenerate plants from transformed somatic tissues involves somatic embryogenesis or organogenesis. Somatic embryogenesis is a non-sexual propagation process in which somatic cells differentiate into embryo-like structures which can be induced to “germinate” into shoots and roots. Organogenesis is the formation of shoot or root meristems on the surface of intact or wounded tissues such as hypocotyls, cotyledons or leaf bases. Both provide routes to regeneration of plants from transformed explant cells and, depending on the species and explant, may require a callus phase.

Media for inducing callus and regeneration have three main constituents: a salts/vitamins mix; sugars, usually sucrose or maltose; and plant growth regulators of the auxin or cytokinin types. These media often also include other additions such as nitrate, coconut milk, specific amino acids, sugar alcohols or metal ions. The precise composition of media for callus induction is different from that for organ regeneration and both are highly dependent on, and must be optimised for, the precise explant type in question.

Transgene delivery and integration are random, inefficient processes and a selection system is usually required to weaken or kill the untransformed plant cells, thus allowing the relatively few transformed cells to preferentially proliferate. Selection systems have two components, a chemical additive to the growth medium and a gene whose product confers the ability to preferentially survive. Three types are in common usage; two are well established and based on an antibiotic or a herbicide, the other is a more recent development and based on a nutrient selection marker. Common selection genes are nptII, hpt and bar which confer resistance to the antibiotics kanamycin, hygromycin and to the herbicide phosphinothricin (PPT) (glufosinate ammonium), respectively. Driven partly by perceived risks of unintentional horizontal or vertical gene transfer of antibiotic and herbicide resistance genes, a range of environmentally benign selection systems have recently been developed. The most widely used of these is the phosphomannose isomerase (PMI) system which utilises the manA gene to convert the otherwise unavailable carbon source mannose-6-phosphate to fructose-6-phosphate for respiration (14). Below we describe the specific media, tissue-culture and selection conditions for the regeneration of transgenic wheat plants using herbicide or antibiotic selection.

3.4.1. Callus Induction and Regeneration

1. Following bombardment, divide each replicate between two or three plates of callus induction medium (MS9%0.5DAg) in 9 cm Petri dishes, spreading the scutella evenly across the medium, that is, approximately ten scutella per plate (see Note 48).
2. Seal the plates with Nescofilm and incubate at 22°C in the dark for induction of embryogenic callus (see Notes 49, 50 and Fig. 1C).
3. After 3–5 weeks on callus induction medium, transfer any callus bearing somatic embryos to regeneration medium (RZDCu) in 9 cm Petri dishes. Whole calli should be transferred without division, placing approximately ten calli per plate. Incubate at 22°C in the light for 3–4 weeks (see Note 51).

3.4.2. Selection

1. After 3–4 weeks on regeneration medium (RZDCu), transfer calli to RZ plus selection in 9 cm Petri dishes with high lids (see Notes 52, 53). The transforming plasmid determines the selection medium used: RZPPT4 for bar or RZG50 for nptII (see Notes 19, 20). Seal the plates with Nescofilm and incubate at 22°C in the light (see Note 49).
2. After a further 3–4 weeks, transfer surviving calli to regeneration medium plus selection but without hormones (RPPT4 or RG50) in 9 cm Petri dishes with high lids (see Notes 52, 54 and Fig. 1D). Seal the plates with Nescofilm and incubate at 22°C in the light (see Note 49).
3. Once regenerating shoots are clearly defined and can be separated easily from the callus, transfer these to regeneration medium plus selection but without hormones (RPPT4 or RG50) in GA-7 Magenta vessels, placing no more than four to six plantlets per Magenta. Incubate at 22°C in the light (see Note 49).

3.5. Potting Putative Transgenic Plants to Soil

1. Once the leaves reach the top of the Magenta vessel (~10–15 cm) and a reasonable root system has been established, plantlets can be transferred to soil. Typically, this takes at least 3 months from bombardment. Carefully remove plantlets from the agargel-solidified medium (rinsing the roots with water if necessary to remove excess agargel) and pot into soil in 8 cm square plastic pots [Nursery Trades (Lea Valley) Ltd.]. Place plantlets within a propagator to provide a high humidity for 1–2 weeks to acclimatise them from tissue culture and grow in a GM containment glasshouse (see Notes 55, 56).
2. Once suitably established (three to four leaves), a leaf sample can be taken for extraction of genomic DNA and PCR to establish whether the plant is transformed. Once confirmed PCR positive, plants are re-potted to 13-cm diameter pots [Nursery Trades (Lea Valley) Ltd.] and grown under the same glasshouse conditions (see Note 56). Plants should reach maturity in 3–4 months (see Fig. 1F).
3. Transgenic plants can be analysed in a number of ways: reporter gene expression can be assessed using, for example, the histochemical GUS test (15) for uidA (see Fig. 1E), ultraviolet (UV) visualisation of green fluorescent protein (GFP), herbicide leaf paint assay (16) and/or the ammonium test (17) for bar. Gene integrations can be studied using Southern analysis and fluorescent in situ hybridisation (FISH).
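The culture stages in Subheading 3.4 each have a time window, so the earliest and latest transfer dates for a given bombardment can be tabulated in advance. The sketch below is a planning aid under stated assumptions, not part of the protocol: the function name and the example date are hypothetical, and only the three stages with explicit durations in the text are included (the later hormone-free selection and Magenta stages are judged by plantlet development).

```python
# Sketch: earliest and latest transfer dates after a bombardment.
from datetime import date, timedelta

# (stage that is ending, minimum weeks, maximum weeks), from Subheading 3.4
STAGES = [
    ("callus induction on MS9%0.5DAg (dark)", 3, 5),
    ("regeneration on RZDCu (light)", 3, 4),
    ("first selection on RZPPT4 or RZG50", 3, 4),
]

def transfer_windows(bombardment, stages=STAGES):
    """Return (stage, earliest end, latest end) for each stage in turn."""
    earliest = latest = bombardment
    windows = []
    for name, min_weeks, max_weeks in stages:
        earliest = earliest + timedelta(weeks=min_weeks)
        latest = latest + timedelta(weeks=max_weeks)
        windows.append((name, earliest, latest))
    return windows

# Hypothetical bombardment date, used only to illustrate the output
for name, first, last in transfer_windows(date(2024, 1, 8)):
    print(f"end of {name}: {first} to {last}")
```

The windows accumulate to 9–13 weeks before transfer to hormone-free selection medium, which is consistent with the statement above that potting to soil typically takes at least 3 months from bombardment.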

Notes

1. The conditions described are suitable for growth of T. aestivum plants but for T. turgidum ssp. durum, different growing conditions are necessary.
2. Although glasshouse-grown plants can be used, these tend to be more variable due to seasonal variation.
3. Reverse-osmosis, polished water with a resistivity of 18.2 MΩ·cm should be used for all solutions.
4. For alternative varieties or wheat species, modifications to the media detailed here may be required. For example, the choice of basal salts (MS or L7), the concentration of sugars (sucrose or maltose), the level of hormones, etc. need to be empirically determined.


5. Before mixing with other components, dissolve CaCl2·2H2O in water.
6. Sterile stock solutions can be stored at 4°C for 1–2 months. Some settling of salts may occur during storage, so the medium should be shaken well prior to use. Stock solutions stored at −20°C should remain effective for at least a year, provided that no freeze/thawing has occurred.
7. MnSO4 is available in various hydrated states, so the exact mass required will vary. For MnSO4·H2O, add 17.05 g/L; for MnSO4·4H2O, add 23.22 g/L; or for MnSO4·7H2O, add 27.95 g/L.
8. Filter sterilisation is carried out using a filter size of 0.2 μm. For large volumes use MediaKap® (NBS Biologicals Ltd., Cambridgeshire, UK), for smaller volumes use a Nalgene syringe filter (Fisher Scientific UK).
9. AgNO3 is used to promote embryogenesis; silver thiosulphate (a mix of silver nitrate and sodium thiosulphate) at 10 mg/L can be used as an alternative. Both are photosensitive so the stock solutions and any media plates containing them should be kept in the dark.
10. To avoid difficulties when re-melting, the agargel solution should be shaken well both before and after autoclaving to allow uniform solidification.
11. Instead of using the 3AA stock solution, 1.5 g/L L-Glutamine, 0.3 g/L L-Proline and 0.2 g/L L-Asparagine can be added individually.
12. The ability of cells to withstand bombardment may be increased due to partial plasmolysis caused by 9% sucrose in the pre-culture medium. However, this is variety and species dependent and 3% sucrose is often suitable, for example, for T. turgidum ssp. durum scutella. The osmolarity for 3% sucrose medium should be within the range of 355–398 mOsM.
13. Picloram (Sigma-Aldrich) can be used as an alternative auxin at 2–6 mg/L (18, 19).
14. Tissue culture media should be prepared as freshly as possible and should not be stored in Petri dishes and Magenta vessels for more than 2–3 weeks. However, they should be prepared a few days in advance of use to allow any contamination to be detected. To minimise condensation in the plates, allow the agargel (×2) to cool once melted, and pour the final medium at ~50°C.
15. Successful transformation has also been achieved using Heraeus gold particles of 0.4–1.2-μm diameter (W. C. Heraeus GmbH and Co., KG, Hanau, Germany); however, the smaller, more uniform size of the submicron BIO-RAD particles gives more consistent results for wheat. The latter particles are preferable for small wheat cells but for other species, larger particles may be suitable.
16. Rupture pressures of 650 psi have been found to be optimal for the wheat varieties reported here; 450, 900 or 1,100 psi pressures will result in successful transformation but with lower efficiency. If attempting transformation of any new variety or species a range should be tested; rupture discs are available as 450, 650, 900, 1,100, 1,350, 1,550, 1,800, 2,000 and 2,200 psi.
17. Spermidine should be maintained below −20°C, preferably at −80°C, because it deaminates with time and solutions are hygroscopic and oxidisable. Any unused aliquots, once thawed, should be discarded.
18. Plasmids tend to be pUC based and contain one or more gene cassettes. A selectable marker gene must be included to allow selection of transformed tissues; the bar and nptII genes are common examples, usually under the control of a constitutive promoter (e.g. Maize Ubiquitin 1 or CaMV35S) and with a suitable terminator (e.g. nos). The bar gene confers resistance to the herbicides Basta™ (glufosinate ammonium/PPT) and Bialaphos and the nptII gene confers resistance to the antibiotics geneticin disulphate (G418), kanamycin, neomycin, paromomycin, etc. (see Note 20). In order to monitor both transient and stable transformation, a reporter gene (e.g. uidA, luc or GFP) can be used (Fig. 1E). Such marker genes can be located in the same plasmid or on separate plasmids co-precipitated onto the gold particles.
19. Glufosinate ammonium is synthetically produced PPT bound to ammonium, and is the active component in herbicides such as Basta™. Bialaphos (phosphinothricyl-alanyl-alanine, sodium) (Melford Laboratories Ltd.) is a successful alternative selection agent used at 3–5 mg/L.
20. Kanamycin, paromomycin and neomycin are alternative aminoglycoside antibiotics that can be used for selection with the nptII gene. Although they may be successful for selection of some plant species, they are not recommended for wheat as natural resistance is exhibited by untransformed tissues.
21. Copper sulphate is a stress-inducing agent (similar to silver nitrate) used to promote shooting. 100 μM is the preferred copper sulphate concentration, but if too much shooting occurs, 50 μM can be used.
22. The selection agent should be used at a concentration which is known to fully inhibit the growth of non-transformed explants; however, the concentration should be gauged according to the development of the cultures at each transfer stage. Generally use it within the range of 2–6 mg/L glufosinate ammonium (PPT) and 25–50 mg/L G418.
23. A number of commercial wheat varieties have been transformed using the methods detailed in this chapter but with a range of efficiencies; Cadenza, Canon and Florida have given the highest efficiencies (up to 13%) (20–22). T. turgidum ssp. durum (e.g. cvs. Ofanto and Venusia) can also be transformed by this method (23, 24), but for these and alternative wheat varieties, modifications may be required (18, 25).
24. Immature inflorescences are an alternative explant for transformation as they can have high regeneration potential and for certain varieties these may be more responsive than immature scutella. For references describing modifications necessary when using immature inflorescences see (20, 26) for T. aestivum varieties, (23, 24) for T. turgidum ssp. durum, (25, 27, 28) for tritordeum (a fertile cereal amphidiploid obtained from crosses between Hordeum chilense and durum wheat cultivars, and containing the genome HchHchAABB) and (18) for barley.
25. In order to determine the size of the embryos, a few caryopses can be opened at the time of collection. Although it is not encouraged, if the caryopses will not be used the same day it is possible to store the spikes intact at 4°C, with stems in water.
26. Due to asynchronous development, avoid using the inner caryopses of the spikelet as these generally contain smaller embryos.
27. 0.5–1.5 mm length is the most responsive size range for the varieties reported here. Smaller and larger embryos may respond but with much lower efficiencies. Size is not quite as important for transient experiments.
28. Typically the gun shot fires most gold particles within a ~2-cm diameter central circular area of a Petri dish. Arranging scutella within this area maximises particle delivery [as shown by transient expression studies (20)].
29. The pre-culture phase allows the tissues to recover from the isolation procedure before being subjected to bombardment and may also pre-plasmolyse the cells (see Note 12). However, it also allows any contamination to be detected prior to bombardment. Should it be difficult to sterilise donor material resulting in contaminated explants, plant preservative mixture (PPM™) (Plant Cell Technology, Inc., Washington, DC, USA) can be included in tissue culture media at 1 ml/L. This is a non-toxic broad-spectrum preservative and biocide which does not interfere with callus proliferation or regeneration.
30. A. tumefaciens-mediated transformation of wheat is a viable alternative DNA delivery system. Transformation protocols have been reported elsewhere (16, 29, 30).
31. The sonication has worked effectively if the gold particles have re-suspended in the liquid rather than being present as a pellet in the base of the tube. There is evidence that over-sonication can cause aggregation, however, so the particles should not be sonicated longer than 1–2 min.
32. The gold preparation can be sub-divided and volumes scaled down accordingly if fewer shots are required or a variety of DNAs are to be compared.
33. If plasmids are not at a concentration of 1 mg/ml, re-calculate the volume to give 5 μg DNA and add to the gold. However, the addition of large volumes of DNA should be avoided. If the DNA is very dilute, re-precipitate the DNA and re-suspend at a higher concentration.
34. In order to monitor regeneration and selection efficiencies of a bombardment experiment, control plates are required (see Note 40). Some particles should therefore be prepared without DNA, replacing the DNA solution with sterile water.
35. The standard amount of DNA is 5 μg/50 μl gold suspension. If using more than one plasmid for co-bombardment, the amounts of DNA added should be calculated such that equimolar quantities are used, with a total of 5 μg DNA for the two plasmids (greater than 5 μg may cause clumping of particles). Alternatively, different ratios can be used to skew towards the gene of interest, i.e. 1.5 plasmid of interest:1 selectable marker construct; plants surviving selection will then have an increased probability of containing both the selectable marker and the gene of interest.
36. CaCl2 and spermidine act to bind, stabilise and precipitate the DNA. Precipitation onto the gold particles is very rapid so the CaCl2 and spermidine are mixed first to ensure that the coating is as even as possible.
37. The particles should be re-suspended as well as possible by scraping the side of the tube with the pipette tip to remove clumps, and drawing up and expelling the solution repeatedly. The gold must be fully re-suspended at this stage as remaining clumps cannot be removed during later re-suspension steps. Vortexing will not aid re-suspension.
38. Ideally the coated particles should be used as soon as possible; however, they can be kept on ice at this point (but for no longer than an hour), completing the rest of the protocol just prior to use.


39. Avoid aspirating too much at this stage as the ethanol will evaporate and increase the final concentration of particles. Some natural evaporation means there is generally enough for only 10–12 shots from the 85 μl final volume, even though there should be sufficient for 16–17 shots (5 μl/shot). In order to reduce further evaporation of the ethanol before the re-suspended particles are required, the Eppendorf lids can be sealed with Nescofilm. However, it is advisable to use coated gold particles as soon as possible.
40. Various control plates should be included within each experiment: unbombarded – to monitor the development/regeneration of donor tissue; bombarded with gold (no DNA) and unselected – to monitor tissue culture response following bombardment; and bombarded with gold (no DNA) and selected – to monitor the effects of the selection on regeneration.
41. Although these settings were found to be optimal for the wheat varieties routinely used, they may need to be altered for different varieties or species.
42. The rupture discs are composed of laminate layers; therefore, they should not be sterilised for more than 10 min or the layers may become separated.
43. Once the coated particles have been dispensed onto the macro-carriers, the ethanol should be allowed to evaporate slowly. The flow hood may cause vibration which could cause particle agglomeration so in order to create an even spread of dried particles on the macro-carrier, place macro-carriers within their sterile Petri dishes outside of the flow hood on a non-vibrating surface. Macro-carriers should be used when recently dried, so only a few should be loaded with gold at any one time. Macro-carriers can be examined microscopically prior to bombardment to determine the uniformity and spread of particles, discarding any that have agglomerated clumps of gold which will reduce transformation efficiency.
44. The helium pressure on the cylinder should be set to ~200 psi more than the intended rupture pressure.
45. The helium pressure accumulates until the rupture disc breaks, propelling the macro-carrier onto the stopping plate, thus releasing and dispersing the gold particles. The actual pressure at which the rupture disc bursts should be monitored to ensure a successful shot, otherwise transformation efficiencies may be affected.
46. Following a shot, the macro-carrier can be observed microscopically to visualise the mesh pattern left by the stopping screen. This will demonstrate how much gold has been released or retained.


47. The macro-carriers and stopping screens are sonicated to destroy any adhering DNA and prevent carry-over to future bombardments.
48. The scutella are spread more evenly in order to reduce the culture density and prevent competition for nutrients.
49. Incubation is carried out in a controlled environment room with a 12 h photo-period provided by cool white fluorescent tubes giving light levels of ~250 μmol/m²/s PAR. Trays are covered with foil to create darkness for the callus induction phase.
50. Transient assays, for example, histochemical GUS assay, can be carried out after 1–3 days depending on the strength of the promoter.
51. The induction period for somatic embryogenesis is usually 3–5 weeks; however, the explants should be observed regularly to check for contamination. Judgement and experience are required to monitor development in order to determine the best time for transfer to regeneration medium; transfer is carried out when the embryogenic callus has mature somatic embryos, some of which may just be forming small shoots.
52. ‘High lids’ are created by using the upturned base of another Petri dish as the lid. This provides greater height for growth of shoots.
53. Selection is generally applied at the second and subsequent transfers, until all control plantlets have been killed (see Note 40). However, selection can be introduced earlier, at callus induction or at the first round of regeneration. This may serve to reduce the numbers of calli and/or plantlets surviving but may also result in loss of transformants if they are not strong enough to survive selection early on.
54. If the regenerating calli to be transferred are large, the number of calli per 9 cm Petri dish should be reduced to prevent overcrowding. The callus can be divided if necessary, but each of the callus pieces should be monitored in order to trace plants with possible clonal origin.
55. Tissue-cultured plantlets have little or no waxy cuticle so are particularly prone to desiccation after transfer to soil.
56. Glasshouse conditions are 18–20°C day and 14–16°C night temperatures with a 16 h photo-period provided by natural light supplemented with banks of Son-T 400 W sodium lamps (Osram, Ltd.) giving 400–1,000 μmol/m²/s PAR.
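The arithmetic behind Notes 39 and 44 can be checked quickly before a bombardment session. The following is a minimal sketch only: the function names are hypothetical, and the 650 psi rupture pressure in the usage lines is an example value rather than a setting prescribed by these notes.

```python
def max_shots(final_volume_ul: float, volume_per_shot_ul: float) -> int:
    """Theoretical number of shots from a coated-particle suspension (Note 39).

    Evaporation of ethanol reduces the practical number, so treat this as an
    upper bound.
    """
    if final_volume_ul < 0 or volume_per_shot_ul <= 0:
        raise ValueError("volumes must be positive")
    return int(final_volume_ul // volume_per_shot_ul)


def cylinder_pressure_psi(rupture_psi: float, margin_psi: float = 200.0) -> float:
    """Helium cylinder setting: rupture pressure plus a ~200 psi margin (Note 44)."""
    return rupture_psi + margin_psi


if __name__ == "__main__":
    # 85 ul final volume at 5 ul per shot, as quoted in Note 39.
    print(max_shots(85, 5))
    # Example rupture disc rating (assumed value for illustration).
    print(cylinder_pressure_psi(650))
```

In practice, Note 39 reports only 10–12 usable shots from this volume, so the theoretical figure is best used for planning how many tubes of coated gold to prepare.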


Acknowledgements

Rothamsted receives grant-aided support from the Biotechnological and Biological Sciences Research Council, UK. We acknowledge other members of the Rothamsted Cereal Transformation Group, past and present, for their significant contribution to the protocols described here.

References 1. An, G.H., Jeong, D.H., Jung, K.H., and Lee, S. (2005) Reverse genetic approaches for functional genomics of rice. Plant Molecular Biology 59, 111–123. 2. Radhamony, R.N., Prasad, A.M., and Srinivasan, R. (2005) T-DNA insertional mutagenesis in Arabidopsis: a tool for functional genomics. Electronic Journal of Biotechnology 8, 2–106. 3. Bechtold, N., Ellis, J., and Pelletier, G. (1993) In Planta Agrobacterium-mediated gene-transfer by infiltration of adult Arabidopsis-thaliana plants. Comptes Rendus De L Academie Des Sciences Serie Iii-Sciences De La Vie-Life Sciences 316, 1194–1199. 4. Clough, S.J. and Bent, A.F. (1998) Floral dip: a simplified method for Agrobacterium-mediated transformation of Arabidopsis thaliana. Plant Journal 16, 735–743. 5. Trieu, A.T., Burleigh, S.H., Kardailsky, I.V., Maldonado-Mendoza, I.E., Versaw, W.K., Blaylock, L.A., Shin, H.S., Chiou, T.J., Katagi, H., Dewbre, G.R., Weigel, D., and Harrison, M.J. (2000) Transformation of Medicago truncatula via infiltration of seedlings or flowering plants with Agrobacterium. Plant Journal 22, 531–541. 6. Liu, F., Cao, M.Q., Yao, L., Li, Y., Robaglia, C., and Tourneur, C. (1998) In Planta transformation of pakchoi (Brassica campestris L. ssp. chinensis) by infiltration of adult plants with Agrobacterium. Acta Horticulturae 467, 187–192. 7. Supartana, P., Shimizu, T., Shioiri, H., Nogawa, M., Nozue, M., and Kojima, M. (2005) Development of simple and efficient in Planta transformation method for rice (Oryza sativa L.) using Agrobacterium tumefaciens. Journal of Bioscience and Bioengineering 100, 391–397.

8. Supartana, P., Shimizu, T., Nogawa, M., Shioiri, H., Nakajima, T., Haramoto, N., Nozue, M., and Kojima, M. (2006) Development of simple and efficient in Planta transformation method for wheat (Triticum aestivum L.) using Agrobacterium tumefaciens. Journal of Bioscience and Bioengineering 102, 162–170. 9. Jones, H.D. (2005) Wheat transformation: current technology and applications to grain development and composition. Journal of Cereal Science 41, 137–147. 10. Shimamoto, K., Terada, R., Izawa, T., and Fujimoto, H. (1989) Fertile transgenic rice plants regenerated from transformed protoplasts. Nature 338, 274–276. 11. Rhodes, C.A., Pierce, D.A., Mettler, I.J., Mascarenhas, D., and Detmer, J.J. (1988) Genetically transformed maize plants from protoplasts. Science 240, 204–207. 12. Zhang, H.M., Yang, H., Rech, E.L., Golds, T.J., Davis, A.S., Mulligan, B.J., Cocking, E.C., and Davey, M.R. (1988) Transgenic rice plants produced by electroporation-mediated plasmid uptake into protoplasts. Plant Cell Reports 7, 379–384. 13. Christou, P. (1992) Genetic-transformation of crop plants using microprojectile bombardment. Plant Journal 2, 275–281. 14. Joersbo, M. (2001) . Physiologia Plantarum 111, 269–272. 15. Jefferson, R.A., Kavanagh, T.A., and Bevan, M.W. (1987) GUS fusion: β-glucuronidase as a sensitive and versatile gene fusion marker in plants. EMBO Journal 6, 3901–3907. 16. Wu, H., Sparks, C., Amoah, B., and Jones, H.D. (2003) Factors influencing successful Agrobacterium-mediated genetic transformation of wheat. Plant Cell Reports 21, 659–668.


17. Rasco-Gaunt, S., Riley, A., Lazzeri, P., and Barcelo, P. (1999) A facile method for screening for phosphinothricin (PPT)-resistant transgenic wheats. Molecular Breeding 5, 255–262. 18. Barro, F., Martin, A., Lazzeri, P.A., and Barcelo, P. (1999) Medium optimization for efficient somatic embryogenesis and plant regeneration from immature inflorescences and immature scutella of elite cultivars of wheat, barley and tritordeum. Euphytica 108, 161–167. 19. Barro, F., Cannell, M.E., Lazzeri, P.A., and Barcelo, P. (1998) The influence of auxins on transformation of wheat and tritordeum and analysis of transgene integration patterns in transformants. Theoretical and Applied Genetics 97, 684–695. 20. Rasco-Gaunt, S., Riley, A., Barcelo, P., and Lazzeri, P.A. (1999) Analysis of particle bombardment parameters to optimise DNA delivery into wheat tissues. Plant Cell Reports 19, 118–127. 21. Pastori, G.M., Wilkinson, M.D., Steele, S.H., Sparks, C.A., Jones, H.D., and Parry, M.A.J. (2001) Age-dependent transformation frequency in elite wheat varieties. Journal of Experimental Botany 52, 857–863. 22. Rasco-Gaunt, S., Riley, A., Cannell, M., Barcelo, P., and Lazzeri, P.A. (2001) Procedures allowing the transformation of a range of European elite wheat (Triticum aestivum L.) varieties via particle bombardment. Journal of Experimental Botany 52, 865–874. 23. He, G.Y. and Lazzeri, P.A. (2001) Improvement of somatic embryogenesis and plant regeneration from durum wheat (Triticum turgidum var. durum Desf.) scutellum and inflorescence cultures. Euphytica 119, 369–376.

24. Lamacchia, C., Shewry, P.R., Di Fonzo, N., Forsyth, J.L., Harris, N., Lazzeri, P.A., Napier, J.A., Halford, N.G., and Barcelo, P. (2001) Endosperm-specific activity of a storage protein gene promoter in transgenic wheat seed. Journal of Experimental Botany 52, 243–250. 25. Barcelo, P. and Lazzeri, P. (1995) Transformation of cereals by microprojectile bombardment of immature inflorescence and scutellum tissues, in Methods in Molecular Biology: Plant Gene Transfer and Expression Protocols (Jones, H., ed.), Humana Press, Totowa, NJ, pp. 113–123. 26. Rasco-Gaunt, S. and Barcelo, P. (1999) Immature inflorescence culture of cereals: a highly responsive system for regeneration and transformation, in Methods in Molecular Biology: Plant Cell Culture Protocols (Hall, R., ed.), Humana Press, Inc., Totowa, NJ, pp. 71–81. 27. Barcelo, P., Hagel, C., Becker, D., Martin, A., and Lorz, H. (1994) Transgenic cereal (Tritordeum) plants obtained at high-efficiency by microprojectile bombardment of inflorescence tissue. Plant Journal 5, 583–592. 28. Barcelo, P., Vazquez, A., and Martin, A. (1989) Somatic embryogenesis and plant regeneration from tritordeum. Plant Breeding 103, 235–240. 29. Jones, H.D., Doherty, A., and Wu, H. (2005) Review of methodologies and a protocol for the Agrobacterium-mediated transformation of wheat. Plant Methods 1, 5. 30. Amoah, B.K., Wu, H., Sparks, C., and Jones, H.D. (2001) Factors influencing Agrobacterium-mediated transient expression of uidA in wheat inflorescence tissue. Journal of Experimental Botany 52, 1135–1142.

Chapter 8

Transient Transformation of Plants

Huw D. Jones, Angela Doherty, and Caroline A. Sparks

Summary

Transient expression in plants is a valuable tool for many aspects of functional genomics and promoter testing. It can be used both to over-express and to silence candidate genes. It is also scaleable and provides a viable alternative to microbial fermentation and animal cell culture for the production of recombinant proteins. It does not depend on chromosomal integration of heterologous DNA so is a relatively facile procedure and can lead to high levels of transgene expression. Recombinant DNA can be introduced into plant cells via physical methods, via Agrobacterium or via viral vectors.

Key words: Transformation, Viral-induced gene silencing, Transgene, Gene delivery.

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513 © Humana Press, a part of Springer Science + Business Media, LLC 2009 DOI: 10.1007/978-1-59745-427-8_8

1. Introduction

Transient gene expression provides a rapid and facile alternative to the generation of stably transformed plants. When DNA is delivered into a plant cell, only a tiny proportion (if any) will become integrated into the host chromosomes and, although the precise long-term fate of the remaining DNA molecules is unclear, they can remain transcriptionally competent for several days. This transient expression does not depend on chromosomal integration of the heterologous DNA, so analysis of gene expression is not confused by position effects. Expression from extra-chromosomal transgenes can be detected as early as 3 h after DNA-delivery, reaches a maximum between 18 and 48 h (1) and can persist for 10 days (2). Some of the earliest demonstrations of transient heterologous gene expression utilised isolated plant protoplasts as hosts for replicating viruses (3) and Ti plasmids expressing octopines (4). Since then, a wide range of plant cell cultures and intact tissues or organs have also been targeted using various vector mechanisms.

Transient expression in plants is a valuable tool for aspects of functional genomics and promoter testing. It also provides a viable alternative to microbial fermentation and animal cell culture for the production of recombinant proteins. Plants can readily complete the necessary post-translational modifications, such as glycosylation, and, for pharmaceutical proteins, are safer, as they are not known to propagate mammalian viruses or pathogens.

We have classified the methods of inducing transient expression of recombinant DNA into three categories, defined by the method of DNA-delivery and by whether the DNA replicates within the host plant cell. In two of these categories, which utilise physical (direct) and Agrobacterium DNA-delivery methods respectively, there are no mechanisms for the transferred DNA to replicate within the plant cell. The third category exploits viral vectors to carry and express heterologous genes. Such vectors can replicate and spread systemically within the plant host and can often lead to very high levels of protein accumulation. Below we expand on the pros and cons of these different mechanisms for expressing transgenes transiently and outline some of the applications. Although not a method for protein expression, we also devote a section to viral-induced gene silencing (VIGS), which acts through the generation of transient, double-stranded RNA. The protocol section that follows this introduction describes methods for the transient expression of plasmids, delivered via both biolistics and Agrobacterium, into a hitherto recalcitrant tissue, developing wheat endosperm.

1.1. Direct Delivery of Non-Replicating Plasmids

Commonly used transient expression assays utilise direct DNA-delivery methods to introduce recombinant, bacterial plasmids containing reporter genes into plant cell cultures or protoplasts [reviewed by (5)]. However, intact tissues and organs can also be targeted. A wide range of direct (physical) methods of delivering double-stranded, naked DNA have been used successfully, including particle bombardment (6, 7), electroporation (8–10), polyethylene glycol (PEG) (2, 11) and microinjection (12). Electroporation, microinjection and chemical methods including PEG have proved particularly useful for protoplasts, whereas particle bombardment and Agrobacterium (see below) have been widely utilised for differentiated tissues, whole organs or plants. In addition to the range of bench-top particle bombardment devices available (in particular the commonly used PDS-1000/He), BIO-RAD (Hercules, CA, USA) also produce a portable, hand-held Helios™ gene gun that can be used to bombard DNA into intact living plants in the glasshouse or field. Transient expression of GUS and luciferase was used to optimise the Helios gun for gene delivery to Arabidopsis, tobacco and silver birch (13). A significant advantage of direct delivery methods over Agrobacterium- or viral-based methods is that no specialised DNA vectors are required.

1.2. Agrobacterium tumefaciens-Mediated Delivery of T-DNAs

Transient expression of reporter genes carried on the T-DNA of a Ti or binary plasmid has been widely used to demonstrate, measure and optimise DNA transfer into plant cells as a prerequisite to the development of stable, plant transformation procedures. However, the transient expression of T-DNA-encoded genes is also recognised as a useful research tool in its own right with advantages over direct methods. To this end, a range of transient expression vectors have been designed for functional genomics, quantification of promoter activity and RNA silencing in plants (14). Agrobacterium has been used to transform a wide range of cell preparations including protoplasts, cell cultures, callus, organs and whole plants (15). It has a particular advantage over direct methods because it can access hard-to-reach cells. For example, Agrobacterium, combined with vacuum infiltration, has been used to access intercellular spaces of Phaseolus leaves enabling T-DNA transfer to all cell layers (16). Vacuum infiltration during agro-infiltration resulted in significantly more transient GUS production in lettuce leaves compared to agro-infiltration with stirring (17). Another advantage is that multiple gene cassettes can be introduced to an individual plant cell simultaneously in a single T-DNA. Direct DNA transfer methods would normally require multiple plasmids to be co-transferred with the risk that some plant cells would not receive all plasmids. As with plasmids designed for direct DNA transfer, the T-DNA introduced into the plant cell cannot replicate autonomously so expression from extra-chromosomal T-DNAs persists for several days at the most.

1.3. Transfection of Viral Replicons

Viral-based transient expression systems have significant advantages and are emerging as an attractive alternative to non-viral systems, particularly for the production of recombinant proteins in plants. The main advantage is that the DNA sequence inserted into plants as part of a virus vector will be replicated and systemically transported throughout the plant, resulting in very high levels of transgene product (reviewed by 18, 19). Viral vectors have been generated from several different viruses but most emphasis is placed on plus-sense RNA, rod-shaped viruses such as tobacco mosaic virus (TMV) or potato virus X (20). To avoid the instability sometimes seen when native viral genes are deleted, foreign genes have often been added to complete viral genomes as additional reading frames. However, this can lead to problems with viral packaging limitations, with viral-based systems traditionally restricted to proteins smaller than 60–70 kDa (21). Other undesirable features of the full virus strategy are host-specificity and the presence of functional, infectious virus particles. An emerging, ‘deconstructed virus’ strategy (reviewed by 19, 22) attempts to design expression systems by eliminating the unnecessary viral functions or supplying them in trans by first introducing them into a host plant by genetic engineering. In an alternative approach, the incorporation of silent nucleotide substitutions and multiple introns into a TMV vector, combined with delivery via agro-infection, resulted in gene amplification in all leaves simultaneously (23). The authors called this process ‘magnifection’, which can be used to transiently express foreign protein at up to 80% of total soluble protein (24).

1.4. Viral-Induced Gene Silencing

In a variation of the use of viruses for transient over-expression, viral vectors designed to generate short double-stranded RNA molecules are increasingly used to silence plant genes. Viral-induced gene silencing harnesses an innate anti-viral plant defence mechanism to silence targeted endogenous plant RNAs homologous to the sequence engineered into the virus (25, 26). Commonly, fragments of 300–800 nucleotides homologous to targeted plant genes are incorporated in viral vectors (27), but sequences as short as 23–60 nucleotides can also be effective (28, 29). Several viral genomes have been modified to produce VIGS vectors [reviewed by (30)], with the most widely used ones based on the tobacco rattle virus (TRV), partly because of its ability to infect the meristems of its host (31). VIGS has been used to silence genes in a wide range of plant species including Solanaceae and cereals [reviewed by (32, 33)].

1.5. Applications of Transient Expression

The relative ease of transient expression-reporter gene assays was recognised in the late 1980s as a useful tool to study the regulation of gene expression (34–36) and has since facilitated a wide range of functional genomics and promoter studies. Protoplasts in particular have been used to analyse promoter elements. For example, a deletion series of the figwort mosaic virus promoter was tested in tobacco and maize protoplasts via electroporation (37). Protoplasts from a maize endosperm suspension culture were used to test promoters (and deletions thereof) of several seed storage protein genes and compare their activity to constitutive ones (38). Downstream promoter elements were identified as contributing to activity of the rice tungro bacilliform virus promoter in rice protoplasts (39). Tobacco and maize mesophyll protoplasts were used by Martinez et al. (40, 41) to demonstrate the inducibility of ecdysone receptor chimeras which had potential application as an inducible gene expression system compatible with agricultural use. Using regulatory plus reporter gene cassettes, addition of inducer was found to increase transgene activity up to 420-fold. The inducibility and specificity of particular promoter sequences have also been tested using transient expression in organised tissues. For example, the relevance of sugar-responsive elements of the iso1 promoter and the nuclear localisation of the interacting transcription factor was confirmed in barley endosperm using transient expression of green fluorescent protein (GFP) fusions (42). The contribution of a novel cis-acting element in the endosperm specificity of an oat globulin promoter was analysed using transient GFP expression in wheat endosperm and other tissues (43). Transient β-glucuronidase (GUS) and luciferase (LUC) assays in rice demonstrated a 90-fold enhancement in activity of a rice polyubiquitin promoter rubi3 in the presence of a specific 5′ UTR exon and intron (44).

In addition to the many gene- or promoter-function studies, there are exciting developments in applying transient expression to the production of recombinant proteins [reviewed by (19, 45)]. Although most of the plant-derived pharmaceutical proteins for the treatment of human diseases that are close to commercialisation were expressed in stably transformed plants, some were the product of transient expression of viral vectors (46). A scaleable transient expression system using Agrobacterium to transform lettuce leaves with a non-viral T-DNA succeeded in producing 20–80 mg of functional recombinant antibody per kilogram fresh weight of leaf tissue in less than 1 week (21). In a different approach, Giritch et al. (47) used synchronous co-infection of two non-competing viral vectors, each expressing a separate antibody chain. Unlike vectors derived from the same virus, the non-competing vectors co-expressed the light and heavy chains in the same cells throughout the plant, resulting in yields of 0.5 g/kg fresh leaf of assembled monoclonal antibodies (47). Viral vectors for protein production are generally designed for maximal amplification and constitutive expression; however, a novel, chemically-inducible viral amplicon based on the cucumber mosaic virus has recently been described (48).
This system was used to demonstrate tightly regulated, high-level, transient production of recombinant human blood protein in tobacco leaves. Over the next decade there are likely to be significant advances in the commercial production of high-value proteins from plant transient expression.

2. Materials

2.1. Growth of Donor Plants

Donor plants grown for stable transformation experiments can provide a source of material for transient expression. In order to provide healthy plants with consistent quality, plants are grown as follows (see Note 1):
1. Soil: 75% fine-grade peat, 12% screened sterilised loam, 10% 6 mm screened lime-free grit, 3% medium vermiculite, 2 kg osmocote plus/m³ (slow-release fertiliser, 15N/11P/13K plus micronutrients), 0.5 kg PG mix/m³ (14N/16P/18K granular fertiliser plus micronutrients) (Petersfield Products, Leicestershire, UK).
2. Five plants per 21 cm diameter plastic pot [Nursery Trades (Lea Valley) Ltd., Hertfordshire, UK]. Plants are stripped to leave five tillers per plant once plants are 6–8 weeks old.
3. Vernalisation of winter wheat varieties is carried out at 4–5°C for 8 weeks from sowing.
4. Growth room conditions: 18–20°C day and 14–15°C night temperatures under a 16 h photoperiod provided by banks of hydrargyrum quartz iodide (HQI) lamps 400 W (Osram, Ltd., Berkshire, UK) to give an intensity of ~700 μmol/m²/s photosynthetically active radiation (PAR).
5. Watering: Initially all plants are top watered in order to monitor water requirements and thereby provide sufficient water without water-logging. An automated flooding system is used once the root system reaches the base of the pot.
6. Pests and disease: These are kept to a minimum by restricting access to growth rooms and following good housekeeping practices. Any diseased plants are discarded immediately. To avoid mildew, the fungicide Fortress (DOW Agrosciences, Ltd., Hertfordshire, UK) is applied as a preventative. Amblyseius caliginosus [Nursery Trades (Lea Valley) Ltd.] is used as a biological control agent to manage thrips.
7. Sterilising agents: 70% (v/v) aqueous ethanol, 10% (v/v) aqueous Domestos (Lever Fabergé, Ltd., Surrey, UK), sterile water (see Note 2).

2.2. Stock Solutions and Culture Media

Solutions 1–7 below are the recipes for stock solutions of basal culture media components, supplements and agargel/phytagel, from which the final culture media are prepared (see Notes 2, 3).

2.2.1. Stock Solutions of Basal Culture Media Components

1. MS Macrosalts (×10): 16.5 g/L NH4NO3 (Fisher Scientific UK, Leicestershire, UK), 19.0 g/L KNO3 (Fisher Scientific UK), 1.7 g/L KH2PO4 (Fisher Scientific UK), 3.7 g/L MgSO4·7H2O (Fisher Scientific UK), 4.4 g/L CaCl2·2H2O (Fisher Scientific UK) (see Note 4). Autoclave at 121°C for 20 min and store at 4°C (see Note 5).
2. L7 Microsalts (×1,000): 15.0 g/L MnSO4 (Fisher Scientific UK) (see Note 6), 5.0 g/L H3BO3 (Fisher Scientific UK), 7.5 g/L ZnSO4·7H2O (Fisher Scientific UK), 0.75 g/L KI (Fisher Scientific UK), 0.25 g/L Na2MoO4·2H2O (VWR International, Ltd., Leicestershire, UK), 0.025 g/L CuSO4·5H2O (Fisher Scientific UK), 0.025 g/L CoCl2·6H2O (Sigma-Aldrich). Prepare 100 ml at a time. Filter sterilise (see Note 7) and store at 4°C (see Note 5).
3. 3AA Amino acids (×25): 18.75 g/L L-Glutamine (Sigma-Aldrich), 3.75 g/L L-Proline (Sigma-Aldrich), 2.5 g/L L-Asparagine (Sigma-Aldrich). Store solution at −20°C in 40 ml aliquots (see Note 5).
4. MS Vitamins (-Glycine) (×1,000): 0.1 g/L Thiamine HCl (Sigma-Aldrich), 0.5 g/L Pyridoxine HCl (Sigma-Aldrich), 0.5 g/L Nicotinic acid (Sigma-Aldrich). Prepare 100 ml at a time. Filter sterilise (see Note 7) and store at 4°C (see Note 5).
5. Acetosyringone (3′,5′-dimethoxy-4′-hydroxyacetophenone) (Aldrich D12,440-6; MW 196.20). Dissolve in 70% ethanol to give a 10 mg/ml (approximately 50 mM) stock solution. Filter sterilise, aliquot and store at −20°C (see Notes 5, 7).
6. Agargel (×2) (Sigma-Aldrich): Prepare in 400 ml volumes at 10 g/L and sterilise by autoclaving at 121°C for 20 min. Store at room temperature and melt in microwave before use (see Note 8).
7. Phytagel (×2) (Sigma-Aldrich): Prepare in 400 ml volumes at 4 g/L and sterilise by autoclaving at 121°C for 20 min. Store at room temperature and melt in microwave before use (see Notes 8, 9).

2.2.2. Culture Media for Biolistics

1. MSS 3AA/2 9%S (×2): 200 ml/L MS macrosalts, 2 ml/L L7 microsalts, 20 ml/L ferrous sulphate chelate solution (×100) (Sigma-Aldrich), 2 ml/L MS vitamins (-Glycine), 200 mg/L myo-Inositol (Sigma-Aldrich), 40 ml/L 3AA amino acids (see Note 10), 180 g/L (9% final concentration) sucrose (Fisher Scientific UK) (see Note 11). Adjust pH to 5.7 with 5 M NaOH or KOH. Osmolarity should be within the range of 800–1,100 mOsM. Filter sterilise (see Note 7) and store at 4°C (see Note 5).
2. MS9%: Mix an equal volume of MSS 3AA/2 9%S (×2) with sterilised, melted agargel (×2) and pour into 9 cm diameter Petri dishes (Bibby Sterilin, Ltd., Staffordshire, UK) (~28 ml per dish). Store at 4°C (see Notes 11, 12).
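Because the medium is made at double strength and then mixed 1:1 with ×2 agargel, it can be useful to confirm the strength of each stock component in the poured plates. The sketch below is illustrative: the helper names are hypothetical, and the stock concentration factors are those given in Subheadings 2.2.1 and 2.2.2.

```python
def fold_in_medium(stock_fold: float, volume_ml_per_l: float) -> float:
    """Concentration factor reached when `volume_ml_per_l` of a stock that is
    `stock_fold` times concentrated is made up to 1 L of medium."""
    return stock_fold * volume_ml_per_l / 1000.0


# (stock concentration factor, ml of stock per litre of x2 medium)
MSS_X2_RECIPE = {
    "MS macrosalts": (10, 200),
    "L7 microsalts": (1000, 2),
    "ferrous sulphate chelate": (100, 20),
    "MS vitamins (-Glycine)": (1000, 2),
    "3AA amino acids": (25, 40),
}

SUCROSE_G_PER_L_X2 = 180.0  # sucrose in the x2 medium


def final_strengths(recipe: dict, dilution: float = 2.0) -> dict:
    """Strength of each component after mixing the x2 medium 1:1 with gel."""
    return {
        name: fold_in_medium(fold, volume) / dilution
        for name, (fold, volume) in recipe.items()
    }


if __name__ == "__main__":
    for name, strength in final_strengths(MSS_X2_RECIPE).items():
        print(f"{name}: x{strength:g} in the poured medium")
    # g/L halved by the 1:1 mix, then divided by 10 to give % (w/v).
    print(f"sucrose: {SUCROSE_G_PER_L_X2 / 2.0 / 10.0:g}% (w/v)")
```

Under these figures the salts, iron and vitamins come out at ×1 and the amino acids at ×0.5 in the final medium, which appears consistent with the "3AA/2" in the medium name, although that reading is an interpretation rather than something the recipe states.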

2.2.3. Culture Media for Agrobacterium

1. Inoculation/co-cultivation medium (×2): 200 ml/L MS macrosalts, 2 ml/L L7 microsalts, 20 ml/L ferrous sulphate chelate solution (×100) (Sigma-Aldrich), 2 ml/L MS vitamins (-Glycine), 200 mg/L myo-Inositol (Sigma-Aldrich), 1 g/L Glutamine (Sigma-Aldrich), 200 mg/L Casein hydrolysate (Sigma-Aldrich), 3.9 g/L 2-(N-Morpholino)ethanesulfonic acid (MES) (Sigma-Aldrich), 20 g/L Glucose (Sigma-Aldrich), 80 g/L maltose (Melford Laboratories, Ltd.). Adjust pH to 5.8 with 5 M NaOH or KOH. Osmolarity should be within the range of 600–700 mOsM. Filter sterilise (see Note 7) and store at 4°C (see Note 5).
2. Inoculation/co-cultivation medium: Mix an equal volume of inoculation/co-cultivation medium (×2) with sterilised, melted phytagel (×2). Add Acetosyringone stock to give a final concentration of 400 µM (see Note 13). Pour into 5.5 cm diameter Petri dishes (Fisher Scientific UK) (~13 ml per dish). Store at 4°C (see Note 12).

2.3. Materials for Biolistics

1. Gold particles: 0.6 μm (sub-micron) gold particles (BIO-RAD Laboratories, Hertfordshire, UK) (see Note 14). (For preparation, see Subheading 3.3.1.)
2. Macro-carriers, stopping screens, 650 psi rupture discs (all BIO-RAD Laboratories) (see Note 15).
3. 2.5 M Calcium chloride (Fisher Scientific UK): Dissolve 3.67 g CaCl2·2H2O in 10 ml water. Mix well/vortex. Filter sterilise (see Note 7) and store at −20°C in 50 μl aliquots (see Note 5).
4. 0.1 M Spermidine free-base (Sigma-Aldrich): Prepare 1 M stock from powder in sterile water and maintain at −80°C in 20 μl aliquots. Prepare the 0.1 M working solution by making a 1:10 dilution of 1 M stock in sterile water under sterile conditions. Mix well, aliquot in 10 μl volumes and store immediately at −20°C (see Note 16).
5. Plasmid DNA: 1 mg/ml in sterile Tris-Ethylenediaminetetraacetic acid (EDTA) (TE) buffer or sterile water, prepared using Qiagen Maxi-prep kit (Qiagen, Ltd., West Sussex, UK). Store in 20 μl aliquots at −20°C (see Note 17).
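The mass quoted for the 2.5 M calcium chloride stock follows from the molar mass of the dihydrate. The short sketch below shows the calculation; the function names are hypothetical, and the molar mass of CaCl2·2H2O (147.01 g/mol) is a standard value that is not given in the text.

```python
CACL2_DIHYDRATE_G_PER_MOL = 147.01  # CaCl2·2H2O, standard value (assumption)


def grams_for_solution(molarity: float, volume_ml: float, g_per_mol: float) -> float:
    """Mass of solute needed for `volume_ml` of solution at `molarity` (mol/L)."""
    return molarity * (volume_ml / 1000.0) * g_per_mol


def aliquot_count(total_volume_ml: float, aliquot_ul: float) -> int:
    """Whole aliquots obtainable, ignoring losses during filter sterilisation."""
    return int((total_volume_ml * 1000.0) // aliquot_ul)


if __name__ == "__main__":
    mass = grams_for_solution(2.5, 10, CACL2_DIHYDRATE_G_PER_MOL)
    print(f"CaCl2·2H2O for 10 ml of 2.5 M: {mass:.3f} g")
    print(f"50 ul aliquots from 10 ml: {aliquot_count(10, 50)}")
```

The same function can be reused for other molar stocks by substituting the appropriate molar mass, taking care to use the value for the hydrate actually weighed out.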

2.4. Materials for Agrobacterium

1. Biotin (Sigma-Aldrich): Dissolve 100 mg of Biotin in a few drops of 1 M NaOH. Once completely dissolved, add 100 ml water, then take 1 ml into 99 ml water to give a final concentration of 1 mg/100 ml. Filter sterilise and store at −20°C (see Notes 5, 7).
2. Silwet L-77 (Lehle seeds, USA): Dissolve in water to give 1% v/v. Filter sterilise and store at 4°C in 0.5 ml aliquots (see Notes 5, 7).
3. Carbenicillin (Sigma-Aldrich): Dissolve 500 mg in 5 ml water. Filter sterilise and store at −20°C in 1 ml aliquots (see Notes 5, 7).
4. Kanamycin (Sigma-Aldrich): Dissolve 500 mg in 10 ml water. Filter sterilise and store at −20°C in 1 ml aliquots (see Notes 5, 7).
5. Timentin (Melford Laboratories Ltd.): Dissolve 1.6 g Timentin [Ticarcillin/Clavulanic (15:1)] in 10 ml water. Filter sterilise and store at −20°C in 1 ml aliquots (see Notes 5, 7, 18).
6. MG/L [reference (49)]: 5 g/L Mannitol (Sigma-Aldrich), 1 g/L L-Glutamic acid (Sigma-Aldrich), 250 mg/L KH2PO4 (Fisher Scientific UK), 100 mg/L NaCl (Fisher Scientific UK), 100 mg/L MgSO4·7H2O (Fisher Scientific UK) (see Note 19), 5 g/L Tryptone (OXOID), 2.5 g/L Yeast extract (Merck), pH 7.0. Autoclave at 121°C for 20 min then add 1 µg/L Biotin (Sigma-Aldrich), 200 mg/L Carbenicillin and 100 mg/L Kanamycin (see Note 20).
7. LB Medium (Luria-Bertani medium): 10 g/L Tryptone (OXOID), 5 g/L Yeast Extract (Merck), 10 g/L NaCl (Fisher Scientific UK). Adjust to pH 7.0 with 5 M NaOH. Autoclave at 121°C for 20 min. For plates of solidified LB, add 15 g/L Bacto™ agar (Difco) prior to autoclaving. Before use, add 200 mg/L Carbenicillin and 100 mg/L Kanamycin (see Note 20).
8. 10 mM Magnesium sulphate (Fisher Scientific UK): Dissolve 246 mg MgSO4·7H2O in 100 ml water. Filter sterilise and store at −20°C in 1 ml aliquots (see Notes 5, 7).
9. 80% (v/v) Glycerol (Sigma-Aldrich): Add 80 ml glycerol to 20 ml water. Mix thoroughly. Autoclave at 121°C for 20 min.

2.5. Analysis of Transient Expression

2.5.1. X-Gluc Solution

1. Dissolve 25 mg X-Gluc [5-Bromo, 4-chloro, 3-indolyl β-D glucuronide (Melford Laboratories, Ltd.)] in 0.5 ml methyl cellosolve [ethylene glycol monomethyl ether (Sigma-Aldrich), see Note 21].
2. Mix with 10 ml 0.5 M sodium phosphate buffer (pH 7.0).
3. Add 0.5 ml 50 mM potassium ferrocyanide and 0.5 ml 50 mM potassium ferricyanide (see Note 22).
4. Bring to 50 ml volume with distilled water.
5. Filter sterilise, aliquot and store at −20°C (see Notes 5, 7, 19).

2.5.2. Triton X-100

1. Prepare 1% Triton X-100 (Sigma-Aldrich) in sterile distilled water. 2. Aliquot and store at 4°C.

2.5.3. X-Gluc Plus Triton

1. Just prior to use, mix X-Gluc solution (see Subheading 2.5.1) with 1% Triton X-100 (see Subheading 2.5.2) at a ratio of 10:1 (see Note 23).

3. Methods

3.1. Collection and Sterilisation of Wheat Caryopses

1. Collect spikes from growth room-grown plants at ~10–12 weeks after sowing; endosperm at the correct stage is usually found ~7–10 days post-anthesis (see Notes 24, 25).


Jones, Doherty, and Sparks

Fig. 1. (A) Immature caryopsis at ~7–10 dpa with correct stage endosperm. (B) Removal of immature endosperm from longitudinal section of caryopsis. (C) Isolated endosperm half. (D) Fifteen immature endosperm halves plated for bombardment. (E) Immature endosperm halves after Agrobacterium inoculation. (F) β-Glucuronidase (GUS) expression following bombardment (top) or Agrobacterium co-cultivation (bottom). Scale bar = 1 mm.

2. Remove the panicles to release the caryopses (see Note 26 and Fig. 1A).
3. Surface sterilise the caryopses by soaking in 70% (v/v) aqueous ethanol for 1 min then 10–15 min in 10% (v/v) Domestos with occasional gentle shaking.
4. Rinse copiously with at least three changes of sterile water. Maintain the sterilised caryopses in moist conditions but do not keep immersed in water.

3.2. Isolation of Immature Endosperm and Culturing

1. Working in a sterile environment, cut each caryopsis in half longitudinally along the crease. Release the endosperm half from the seed coat by gently scooping out with forceps, taking care not to damage the endosperm surface (see Note 27 and Fig. 1B and C).
2. Place 15 endosperm halves per Petri dish containing culture medium [MS9% for biolistics or inoculation/co-cultivation medium for Agrobacterium (see Note 3)], orientating them with the cut endosperm surface in contact with the medium, so that the uncut endosperm side closest to the seed coat is bombarded. The endosperm halves should be arranged within the central target area of the plate for bombardment (see Note 28 and Fig. 1D).
3. For endosperm there is no requirement to pre-culture the tissue and it can be transformed straight away (see Note 29).

3.3. Protocol for DNA Delivery via Biolistics

3.3.1. Preparation of Gold Particles

1. Weigh 20 mg BIO-RAD sub-micron gold particles (0.6 μm) in a 1.5 ml Eppendorf and add 1 ml 100% ethanol. Sonicate for 2 min, pulse spin for 3 s in a microfuge and remove the supernatant. Repeat this ethanol wash twice more.


2. Add 1 ml sterile water and sonicate for 2 min. Pulse spin for 3 s in a microfuge and remove the supernatant. Repeat this step.
3. Resuspend fully by vortexing in 1 ml sterile water. Aliquot 50 μl amounts into sterile 1.5 ml Eppendorf tubes, vortexing between taking each aliquot to ensure an equal distribution of particles. Store at −20°C.

3.3.2. Coating of Gold Particles with DNA for Bombardment

The following procedure should be carried out on ice, in a sterile environment.
1. Thaw a 50 μl aliquot of prepared gold (see Subheading 3.3.1) at room temperature then sonicate for 1–2 min (see Note 30). To ensure total re-suspension, the tubes can be vortexed following sonication, particularly if the aliquots are to be subdivided for smaller preparations (see Note 31).
2. Add 5 μl DNA (1 mg/ml in TE or water, see Note 32) or water (see Note 33) and vortex briefly to ensure good contact of DNA with the particles (see Note 34).
3. Mix 50 μl 2.5 M CaCl2 and 20 μl 0.1 M spermidine in the lid of the Eppendorf then briefly vortex into the gold + DNA solution (see Note 35).
4. Centrifuge at 13,000 rpm for 3–5 s in a microfuge to pellet the DNA-coated particles. Discard the supernatant.
5. Add 150 μl 100% ethanol to wash the particles, resuspending them as fully as possible (see Notes 36, 37).
6. Centrifuge at 13,000 rpm for 3–5 s in a microfuge to pellet the particles and discard the supernatant.
7. Resuspend fully in 85 μl 100% ethanol and maintain on ice (see Notes 38, 39).
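The quantities in this coating procedure lend themselves to a quick check of how much gold and DNA each shot carries, and of how to adjust DNA volumes (see Notes 32, 34, 38, 39). The sketch below is an illustration only: the function names and the example plasmid sizes are hypothetical, and the DNA-per-shot figure is an upper bound that assumes all of the DNA binds to the gold.

```python
# Gold and DNA per shot, and DNA volumes for coating (Subheading 3.3.2).
# Function names and the example plasmid sizes are hypothetical.


def per_shot(final_volume_ul=85.0, shot_volume_ul=5.0, gold_mg=1.0, dna_ug=5.0):
    """Return (theoretical shots, ug gold per shot, ng DNA per shot).

    gold_mg=1.0 follows from 20 mg gold in 1 ml divided into 50 ul aliquots.
    The DNA value assumes complete binding, so it is an upper bound.
    """
    shots = int(final_volume_ul // shot_volume_ul)
    fraction = shot_volume_ul / final_volume_ul
    return shots, gold_mg * 1000.0 * fraction, dna_ug * 1000.0 * fraction


def dna_volume_ul(conc_ug_per_ul, dna_ug=5.0):
    """Volume of plasmid needed to supply dna_ug at a given concentration."""
    return dna_ug / conc_ug_per_ul


def equimolar_masses_ug(plasmid_sizes_bp, total_ug=5.0):
    """Split total_ug so that each plasmid supplies the same number of molecules.

    Mass per molecule scales with plasmid length, so the masses are
    proportional to plasmid size.
    """
    total_bp = sum(plasmid_sizes_bp)
    return [total_ug * size / total_bp for size in plasmid_sizes_bp]


standard = per_shot()                             # 85 ul final volume
double_strength = per_shot(final_volume_ul=45.0)  # Note 39
co_bombardment = equimolar_masses_ug([5000, 10000])

print("standard:", standard)
print("double strength:", double_strength)
print("0.5 mg/ml plasmid, ul needed:", dna_volume_ul(0.5))
print("equimolar masses (ug):", co_bombardment)
```

With the standard 85 μl resuspension this gives 17 theoretical shots of about 59 μg gold each; at double strength (45 μl) it gives 9 shots of about 111 μg each. Evaporation reduces the usable number of shots, as described in Note 38.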

3.3.3. Particle Bombardment Using the PDS-1000/He Particle Gun [BIO-RAD]

The delivery system involves the use of high pressure to accelerate particles to high velocity. Appropriate safety precautions should be taken and safety spectacles should be worn when operating the gun. In any bombardment experiment, controls should be included to monitor transformation efficiency (see Note 40).
1. The PDS-1000/He particle gun [BIO-RAD (see Fig. 2)] is used to deliver DNA-coated gold particles (see Subheading 3.3.2) according to the manufacturer’s instructions. The following settings are maintained as standard for this procedure (see Note 41): target distance 5.5 cm (distance between stopping screen and target plate); stopping plate aperture 0.8 cm (distance between macro-carrier and stopping screen); gap 2.5 cm (distance between rupture disc and macro-carrier); vacuum 91.4–94.8 kPa; vacuum flow rate 5.0; vent flow rate 4.5.


2. Sterilise the gun’s chamber and component parts by spraying with 90% (v/v) ethanol, which should be allowed to evaporate completely (~5 min).
3. Sterilise rupture discs, stopping screens, macro-carriers and macro-carrier holders by dipping in 100% ethanol and allowing the alcohol to evaporate completely on a mesh rack in a flow hood (see Note 42). Place the dried macro-carrier holders into sterile 6 cm Petri dishes and mount one macro-carrier into each holder.
4. Briefly vortex the coated gold particles (see Subheading 3.3.2), take a 5 μl sample and drop centrally onto a macro-carrier membrane. Allow to dry naturally, not in the air-flow (see Note 43).
5. Load a rupture disc (see Note 15) into the rupture disc retaining cap (see Fig. 2) and screw into place on the gas acceleration tube, tightening firmly using the mini torque wrench (see Note 44).
6. Place a stopping screen into the fixed nest. Invert the macro-carrier holder containing macro-carrier plus gold particles/DNA, place it over the stopping screen in the nest and maintain its position using the retaining ring. Mount the fixed nest assembly onto the second shelf from the top to give a gap of 2.5 cm (see Fig. 2).

Fig. 2. The PDS-1000/He particle gun [BIO-RAD] (left) and diagram of component parts described in Subheading 3.3.3 (right).


7. Place a sample on the target stage on a shelf to give the desired distance; fourth shelf from the top gives a target distance of 5.5 cm.
8. Draw a vacuum of 91.4–94.8 kPa and fire the gun (see Note 45).
9. After firing, release the vacuum, remove the sample and disassemble the component parts, discarding the ruptured disc and macro-carrier (see Note 46).
10. Place the macro-carrier holder and stopping screen in 100% ethanol to re-sterilise if they are to be re-used for further shots, otherwise place in 1:10 dilution Savlon (Novartis Consumer Health, West Sussex, UK) to soak. Sonicate for 10 min prior to re-use (see Note 47).
11. Following bombardment, seal the plates with Nescofilm® and incubate at 22°C in the dark for 1–3 days (see Note 48) prior to analysis (see Subheading 3.5).

3.4. Protocol for DNA Delivery via Agrobacterium

3.4.1. Preparation of Standard Inoculum and Glycerol Stocks

1. Streak plates of LB + antibiotics from a glycerol stock of AGL1 + vectors (see Note 20).
2. Incubate at 27–29°C for 2–3 days, until single colonies form.
3. Pick a single colony on a sterile cocktail stick and put into 10 ml MG/L medium with appropriate antibiotics (see Notes 20, 49).
4. Incubate at 27–29°C, shaking at 250 rpm for ~40 h until an optical density at 600 nm (OD600) of 1 or higher is reached.
5. To prepare a standard inoculum, spin the cultures for 4 min at 4,500 g (5,000 rpm). Remove the supernatant and resuspend the pellet in 1 ml 10 mM magnesium sulphate. Add 3 ml of 80% glycerol and mix thoroughly. It is preferable to have both of these solutions ice cold.
6. Aliquot in 400 µl volumes in sterile cryovials and store at −80°C as a glycerol stock (see Note 50). The standard inoculum can also be used immediately to initiate a full strength culture (see Subheading 3.4.2).
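The composition of the standard inoculum in step 5, and the relationship between rotor speed and centrifugal force, can be worked out as follows. This is a sketch under stated assumptions: the pellet volume is ignored, the function names are illustrative, and the 16 cm rotor radius is a hypothetical value that should be replaced with the figure for the rotor actually used.

```python
# Glycerol stock composition (Subheading 3.4.1, step 5) and rpm-to-g conversion.
# The rotor radius below is hypothetical; take the real value from the rotor manual.


def mix_concentration(conc, volume_ml, total_volume_ml):
    """Concentration of one component after mixing (pellet volume ignored)."""
    return conc * volume_ml / total_volume_ml


def rcf_from_rpm(rpm, rotor_radius_cm):
    """Relative centrifugal force (x g): RCF = 1.118e-5 * radius_cm * rpm**2."""
    return 1.118e-5 * rotor_radius_cm * rpm ** 2


mgso4_ml, glycerol_ml = 1.0, 3.0
total_ml = mgso4_ml + glycerol_ml

glycerol_percent = mix_concentration(80.0, glycerol_ml, total_ml)  # % (v/v)
mgso4_mm = mix_concentration(10.0, mgso4_ml, total_ml)             # mM
aliquots = int(total_ml * 1000 // 400)                             # 400 ul each
rcf = rcf_from_rpm(5000, rotor_radius_cm=16.0)

print(f"glycerol: {glycerol_percent:.0f}% (v/v), MgSO4: {mgso4_mm:.1f} mM")
print(f"400 ul aliquots per culture: {aliquots}")
print(f"5,000 rpm at 16 cm radius: about {rcf:.0f} x g")
```

Each 10 ml culture therefore yields about ten 400 μl aliquots in 60% glycerol. Because force depends on rotor radius, it is safer to set the centrifuge by g than by rpm when a different rotor is used.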

3.4.2. Preparation of Agrobacterium Cells for Inoculation

1. Initiate Agrobacterium liquid cultures by adding ~200 µl of a standard glycerol inoculum (see Subheading 3.4.1) to 10 ml MG/L plus antibiotics (see Note 20). Prepare as many 10 ml cultures as plates to be treated.
2. Incubate at 27–29°C, shaking (250 rpm) for 12–24 h (to reach an OD600 > 1).
3. Pellet the Agrobacterium culture at 4,500 g for 10 min and resuspend in 6 ml single-strength inoculation medium supplemented with 400 µM acetosyringone (see Subheading 2.2.3).
4. Replace the cultures on the shaker until required.
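As a rough guide to the amount of acetosyringone that step 3 implies, the sketch below converts the 400 µM target in 6 ml into micromoles, milligrams and a stock volume. The molecular weight (196.20 g/mol) and the 100 mM stock concentration are assumptions not stated in this section; use the stock described in Subheading 2.2.3 where it differs.

```python
# Acetosyringone for the resuspension step (Subheading 3.4.2, step 3).
# The molecular weight and the 100 mM stock are assumptions, not protocol values.

ACETOSYRINGONE_MW = 196.20  # g/mol (C10H12O4)


def amount_umol(final_um, volume_ml):
    """Micromoles needed: (umol/L) * (ml / 1000)."""
    return final_um * volume_ml / 1000.0


def mass_mg(final_um, volume_ml, mw=ACETOSYRINGONE_MW):
    """Mass needed in mg: umol * g/mol gives ug, divided by 1000."""
    return amount_umol(final_um, volume_ml) * mw / 1000.0


def stock_volume_ul(final_um, volume_ml, stock_mm):
    """Stock volume in ul from C1 * V1 = C2 * V2 (stock given in mM)."""
    return final_um * volume_ml / stock_mm


plates = 4  # hypothetical number of plates, one 6 ml resuspension each
per_plate_umol = amount_umol(400.0, 6.0)
per_plate_mg = mass_mg(400.0, 6.0)
per_plate_stock_ul = stock_volume_ul(400.0, 6.0, stock_mm=100.0)

print(f"per plate: {per_plate_umol:.1f} umol = {per_plate_mg:.3f} mg")
print(f"per plate from 100 mM stock: {per_plate_stock_ul:.0f} ul")
print(f"for {plates} plates: {plates * per_plate_stock_ul:.0f} ul stock")
```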


3.4.3. Inoculation of Endosperm with Agrobacterium

1. Take the resuspended Agrobacterium suspension from the shaker (see Subheading 3.4.2), add 1% Silwet L-77 to give a final concentration of 0.015% and pour the total 6 ml volume onto a plate containing 15 isolated endosperm halves. Incubate for 1–3 h at room temperature.
2. Remove as much of the Agrobacterium as possible with a pipette, then transfer the endosperm onto fresh inoculation medium in 5.5 cm Petri dishes (see Fig. 1E). Seal the plates with Nescofilm and co-cultivate in the dark at 22–23°C for 2–3 days.
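The volume of 1% Silwet stock needed for a 0.015% final concentration in 6 ml is not stated explicitly, so a short calculation may help. The sketch below (names are illustrative) gives both the usual approximation, which ignores the small increase in volume, and the exact value.

```python
# Volume of 1% Silwet L-77 stock to add to a 6 ml suspension (Subheading 3.4.3).
# Function name and arguments are illustrative.


def stock_to_add_ul(base_volume_ml, stock_percent, final_percent, exact=True):
    """Return the stock volume (ul) to add to base_volume_ml.

    exact=True accounts for the volume added:
        V = C_final * V_base / (C_stock - C_final)
    exact=False uses the common approximation V = C_final * V_base / C_stock.
    """
    if exact:
        volume_ml = final_percent * base_volume_ml / (stock_percent - final_percent)
    else:
        volume_ml = final_percent * base_volume_ml / stock_percent
    return volume_ml * 1000.0


approx_ul = stock_to_add_ul(6.0, 1.0, 0.015, exact=False)
exact_ul = stock_to_add_ul(6.0, 1.0, 0.015, exact=True)

print(f"approximate: {approx_ul:.0f} ul, exact: {exact_ul:.1f} ul")
```

The two values differ by less than 2 μl (about 90 μl versus 91.4 μl), so adding 90 μl per 6 ml should be adequate in practice.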

3.4.4. Removal of Agrobacterium

1. After 2–3 days, place treated endosperm halves in a solution of 160 mg/L Timentin (see Subheading 2.4) and leave overnight. Dab dry on clean filter paper before proceeding to the analysis (see Note 18 and Subheading 3.5).

3.5. Analysis of Endosperm for Transient Expression

Histochemical GUS staining of explants [based on (50) (see Note 51)]

1. After the appropriate length of time in culture, move endosperm halves into X-Gluc + Triton using ~0.5 ml per well of a 24-well plate.
2. Seal the plate with UniSeal™ film (Whatman Inc., USA) to prevent evaporation (see Note 52) and incubate at 37°C overnight.
3. Examine tissues for blue foci (see Note 53 and Fig. 1F).

Notes

1. Glasshouse grown plants or plants in culture could be used as target tissue if appropriate.
2. Reverse-osmosis, polished water with a resistivity of 18.2 MΩ·cm should be used for all solutions.
3. If explants are to be cultured for transient expression only, a simpler medium than that used for stable transformation can be used, that is, there is no requirement for hormones; commercially available MS medium ± sucrose may be appropriate. Although not ideal, agargel/phytagel alone could be used as a support medium; however, efficiencies may be reduced, particularly for Agrobacterium-mediated transformation.
4. Before mixing with other components, dissolve CaCl2·2H2O in water.


5. Sterile stock solutions can be stored at 4°C for 1–2 months. Some settling of salts may occur during storage, so the medium should be shaken well prior to use. Stock solutions stored at −20°C should remain effective for at least a year, provided that no freezing/thawing has occurred.
6. MnSO4 is available in various hydrated states so the exact mass required will vary. For MnSO4·H2O, add 17.05 g/L; for MnSO4·4H2O, add 23.22 g/L or for MnSO4·7H2O, add 27.95 g/L.
7. Filter sterilisation is carried out using a filter size of 0.2 μm. For large volumes use MediaKap® (NBS Biologicals, Ltd., Cambridgeshire, UK), for smaller volumes use a Nalgene syringe filter (Fisher Scientific UK).
8. To avoid difficulties when re-melting, the agargel or phytagel solution should be shaken well both before and after autoclaving to allow uniform solidification. When re-melting, be very careful to avoid super-heating when mixing.
9. Although phytagel can be re-melted, it is not as amenable as agargel and sets very quickly; it is preferable, therefore, to use phytagel directly after autoclaving when it has cooled slightly. Phytagel is used routinely in our laboratory for Agrobacterium work to prevent the explants from floating during inoculation as it provides a softer medium than agargel; however, it is probably not essential to use it for endosperm transient experiments.
10. Instead of using the 3AA stock solution, 1.5 g/L L-Glutamine, 0.3 g/L L-Proline, and 0.2 g/L L-Asparagine can be added individually.
11. Partial plasmolysis of cells may increase their ability to withstand bombardment, hence 9% sucrose is used in the medium for stable transformation of immature scutella. This may not be as essential for transient expression, in which case it may be possible to use MS ± sucrose or just agargel as a simple support medium.
12. Tissue culture media should be prepared as freshly as possible and not be stored in Petri dishes for more than 2–3 weeks. However, they should be prepared a few days in advance of use to allow any contamination to be detected. To minimise condensation in the plates, allow the agargel/phytagel (×2) to cool once melted, and pour the final medium at ~50°C.
13. The presence of 200–400 µM acetosyringone in the Agrobacterium culture or inoculation/co-cultivation medium has been shown to increase T-DNA delivery.
14. Successful transformation has also been achieved using Heraeus gold particles of 0.4–1.2 μm diameter (W. C. Heraeus GmbH and Co., KG, Hanau, Germany); however, the smaller, more uniform size of the sub-micron BIO-RAD particles gives more consistent results for wheat. The latter particles are preferable for small wheat cells but for other species, larger particles may be suitable.
15. Rupture pressures of 650 psi or 450 psi can be used for transformation of young endosperm, with the former giving slightly better results. Lower efficiencies may result from other rupture pressures. Different explants may require alternative rupture pressures depending on tissue type and how fragile the cells are. If attempting transformation of any new explant or species, a range should be tested; rupture discs are available as 450, 650, 900, 1,100, 1,350, 1,550, 1,800, 2,000, and 2,200 psi.
16. Spermidine should be maintained below −20°C, preferably at −80°C, because it deaminates with time and solutions are hygroscopic and oxidisable. Any unused aliquots, once thawed, should be discarded.
17. Plasmids for biolistics transformation tend to be pUC-based and contain one or more gene cassettes. In order to monitor transient transformation, a reporter gene is necessary in the plasmid, for example, uidA (GUS), luc (luciferase) or GFP (Green Fluorescent Protein) (see Note 51).
18. For stable transformation, explants are transferred to medium containing 160 mg/L Timentin to control Agrobacterium growth following co-cultivation. Washing the explants in Timentin overnight prior to assay produced clear blue foci. Rinsing the treated endosperm halves with sterile water instead of Timentin treatment resulted in reasonable GUS expression but the foci were rather less discrete.
19. Magnesium sulphate may have various hydrated states which will alter the weight requirement; therefore, calculate the appropriate amount if differing from 7H2O.
20. The antibiotics used depend on the selectable markers in the Agrobacterium strain and the binary vectors used. For the AGL1 strain used in this protocol, carbenicillin (200 mg/L) is used and pAL154/156 combinations are selected with kanamycin (100 mg/L) which is the selectable marker on pAL156.
21. Methyl cellosolve is hazardous; wear gloves and use with caution.
22. Potassium ferrocyanide and potassium ferricyanide are poisonous; wear gloves and use with caution. The potassium ferrocyanide and ferricyanide can be omitted from the X-Gluc solution if weak GUS expression is anticipated.


23. Triton X-100 is commonly used as a surfactant to aid penetration of the X-Gluc substrate, for example, to penetrate the leaf cuticle. For endosperm there is less requirement for this and Triton could be omitted.
24. Immature scutella can be used as alternative explants for transient assays using protocols as for stable transformation but sacrificing scutella for assay 2–3 days after transformation. Other tissues, for example leaf, can also be used but conditions may need to be modified, depending on the fragility of the tissue.
25. In order to determine the state of the endosperm, a few caryopses can be opened at the time of collection. Although it is not encouraged, if the caryopses will not be used the same day it is possible to store the spikes intact at 4°C, with stems in water.
26. Because of asynchronous development, avoid using the inner caryopses of the spikelet as these generally contain younger endosperm which may be too milky.
27. In experiments testing a range of seed ages, younger endosperm was found to be most responsive. Endosperm at the correct stage is from seeds which have a white pericarp which can be indented relatively easily by a fingernail, where the endosperm is opaque but discrete and will slip easily from the seed coat. The embryo will be typically <0.5 mm long. In older seeds the doughy, starchy endosperm is difficult to separate from the seed coat and the endosperm cells are too big and easily damaged and/or not metabolically active enough to respond well.
28. Typically the gun shot fires most gold particles within a ~2 cm diameter central circular area of a Petri dish. Arranging explants within this area maximises particle delivery [as shown by transient expression studies (51)].
29. The young endosperm can be transformed immediately upon isolation and there is no requirement for pre-culture. If there are concerns about possible contamination, Plant Preservative Mixture (PPM™) (Plant Cell Technology, Inc., Washington, DC, USA) could be used in the culture medium at 1 ml/L. This is a non-toxic broad-spectrum preservative and biocide which does not interfere with cell metabolism.
30. The sonication has worked effectively if the gold particles have resuspended in the liquid rather than being present as a pellet in the base of the tube. There is evidence that oversonication can cause aggregation, however, so the particles should not be sonicated longer than 1–2 min.
31. The gold preparation can be sub-divided and volumes scaled down accordingly if fewer shots are required or a variety of DNAs are to be compared.


32. If plasmids are not at a concentration of 1 mg/ml, re-calculate the volume to give 5 μg DNA and add to the gold. However, the addition of large volumes should be avoided. If the DNA is very dilute, re-precipitate the DNA and resuspend at a higher concentration.
33. In order to monitor DNA delivery in a bombardment experiment, control plates are required. Some particles should, therefore, be prepared without DNA, replacing the DNA solution with sterile water in order to bombard with gold only as a negative control. A plasmid known to work in the explant being tested should also be used to act as a positive control, for example, pAHC25.
34. The standard amount of DNA is 5 μg/50 μl gold suspension. If using more than one plasmid, that is, for co-bombardment, the amounts of DNA added should be calculated such that equimolar quantities are used, with a total of 5 μg DNA for the two plasmids (greater than 5 μg may cause clumping of particles).
35. CaCl2 and spermidine act to bind, stabilise, and precipitate the DNA. Precipitation onto the gold particles is very rapid so the CaCl2 and spermidine are mixed first to ensure that the coating is as even as possible.
36. The particles should be resuspended as well as possible by scraping the side of the tube with the pipette tip to remove clumps, and drawing up and expelling the solution repeatedly. The gold must be fully resuspended at this stage as remaining clumps cannot be removed during later resuspension steps. Vortexing will not aid resuspension.
37. Ideally the coated particles should be used as soon as possible; however, they can be kept on ice at this point (but for no longer than an hour), completing the rest of the protocol just prior to use.
38. Avoid aspirating too much at this stage as the ethanol will evaporate and increase the final concentration of particles. Some natural evaporation means there is generally enough for only 10–12 shots from the 85 μl final volume, even though there should be sufficient for 16–17 shots (5 μl/shot). In order to reduce further evaporation of the ethanol before the resuspended particles are required, the Eppendorf lids can be sealed with Nescofilm. However, it is advisable to use coated gold particles as soon as possible.
39. For transient transformation where cell damage is not quite so critical, the gold can be used at double strength to increase the efficiency of bombardment, that is, resuspend gold in a final volume of 45 μl.


40. Various control plates should be included within each experiment: bombarded only with gold (no DNA) as a negative control and bombarded with a construct which can act as a positive control, for example, pAHC25.
41. Although these settings were successful for transient transformation of wheat endosperm, they may need to be altered for different explants or species.
42. The rupture discs are composed of laminate layers; therefore, they should not be sterilised for more than 10 min or the layers may become separated.
43. Once the coated particles have been dispensed onto the macro-carriers, the ethanol should be allowed to evaporate slowly. The flow hood may cause vibration which could cause particle agglomeration, so in order to create an even spread of dried particles on the macro-carrier, place macro-carriers within their sterile Petri dishes outside of the flow hood on a non-vibrating surface. Macro-carriers should be used when recently dried, so only a few should be loaded with gold at any one time. Macro-carriers can be examined microscopically prior to bombardment to determine the uniformity and spread of particles, discarding any that have agglomerated clumps of gold which will reduce transformation efficiency.
44. The helium pressure on the cylinder should be set to ~200 psi more than the intended rupture pressure.
45. The helium pressure accumulates until the rupture disc breaks, propelling the macro-carrier onto the stopping plate, thus releasing and dispersing the gold particles. The actual pressure at which the rupture disc bursts should be monitored to ensure a successful shot, otherwise transformation efficiencies may be affected.
46. Following a shot, the macro-carrier can be observed microscopically to visualise the mesh pattern left by the stopping screen. This will demonstrate how much gold has been released/retained.
47. The macro-carrier holders and stopping screens are sonicated to destroy any adhering DNA and prevent carry-over to future bombardments.
48. Transient assays, for example, histochemical GUS assay, can be carried out after 1–3 days depending on the strength of the promoter/type of explant. GFP expression can be monitored over a longer time period as it does not involve a destructive assay.
49. It is best to use a 50 ml plastic centrifuge tube as these provide good aeration and can be used for the subsequent centrifugation stage.


50. It is advisable to place the aliquots directly into liquid nitrogen for immediate freezing prior to being placed at −80°C. A number of standard inoculum cultures can be prepared and stored at −80°C for future use; this avoids the requirement to grow up single colony cultures for every experiment.
51. uidA (GUS) is a useful reporter gene to monitor transient expression in transformed tissues. The disadvantage is that a destructive assay is required to visualise gene expression. Alternative reporter genes include GFP and luciferase which allow longer monitoring of viable tissues in culture.
52. UniSeal is an adhesive, clear polyester seal film which can be used to prevent evaporation of X-Gluc solution. As an alternative the plate can be wrapped in clingfilm.
53. Blue foci should be readily visible on the endosperm surface or scutellum if using immature embryos. If leaves are used as a target explant the chlorophyll may need to be removed in order to visualise the stained cells. This can be done by soaking in successive rounds of 70% ethanol until all colour has been removed.
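Two of the calculations mentioned in the Notes can be sketched briefly: the mass of magnesium sulphate required when a hydrate other than the heptahydrate is used (Note 19), and the volume of antibiotic or biotin stock needed to reach the stated working concentrations (Note 20). The code below is an illustration only. The atomic masses are standard rounded values, the function names are hypothetical, and the stock concentrations are those implied by the stock recipes (carbenicillin 500 mg in 5 ml, kanamycin 500 mg in 10 ml, biotin 1 mg/100 ml).

```python
# Hydrate correction for MgSO4 (Note 19) and stock volumes for media (Note 20).
# Function names are illustrative; atomic masses are standard rounded values.

ATOMIC_MASS = {"Mg": 24.305, "S": 32.06, "O": 15.999, "H": 1.008}
WATER_MW = 2 * ATOMIC_MASS["H"] + ATOMIC_MASS["O"]
MGSO4_MW = ATOMIC_MASS["Mg"] + ATOMIC_MASS["S"] + 4 * ATOMIC_MASS["O"]


def mgso4_mass_mg(conc_mm, volume_ml, waters):
    """Mass (mg) of MgSO4.nH2O for a solution of conc_mm in volume_ml."""
    mw = MGSO4_MW + waters * WATER_MW      # g/mol, equal to ug/umol
    micromoles = conc_mm * volume_ml       # (mmol/L) * ml = umol
    return micromoles * mw / 1000.0        # ug converted to mg


def stock_ml_per_litre(final_mg_per_l, stock_mg_per_ml):
    """Volume of stock (ml) to add per litre of medium."""
    return final_mg_per_l / stock_mg_per_ml


heptahydrate_mg = mgso4_mass_mg(10.0, 100.0, waters=7)
anhydrous_mg = mgso4_mass_mg(10.0, 100.0, waters=0)

carbenicillin_ml = stock_ml_per_litre(200.0, 100.0)  # 100 mg/ml stock
kanamycin_ml = stock_ml_per_litre(100.0, 50.0)       # 50 mg/ml stock
biotin_ml = stock_ml_per_litre(0.001, 0.01)          # 1 ug/L from 1 mg/100 ml

print(f"10 mM MgSO4, 100 ml: {heptahydrate_mg:.1f} mg heptahydrate, "
      f"{anhydrous_mg:.1f} mg anhydrous")
print(f"per litre: {carbenicillin_ml:.1f} ml carbenicillin, "
      f"{kanamycin_ml:.1f} ml kanamycin, {biotin_ml:.2f} ml biotin")
print(f"per 10 ml culture: {carbenicillin_ml * 10:.0f} ul carbenicillin, "
      f"{kanamycin_ml * 10:.0f} ul kanamycin")
```

The heptahydrate value agrees with the 246 mg per 100 ml given for the 10 mM stock, which provides a check on the calculation before applying it to another hydrate.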

Acknowledgements

Rothamsted receives grant-aided support from the Biotechnological and Biological Sciences Research Council, UK. We acknowledge other members of the Rothamsted Cereal Transformation Group, past and present, for their significant contribution to the protocols described here.

References

1. Abel, S. and Theologis, A. (1994) Transient transformation of Arabidopsis leaf protoplasts – a versatile experimental system to study gene-expression. Plant Journal 5, 421–427.
2. Werr, W. and Lorz, H. (1986) Transient gene-expression in a Gramineae cell-line – a rapid procedure for studying plant promoters. Molecular and General Genetics 202, 471–475.
3. Aoki, S. and Takebe, I. (1969) Infection of tobacco mesophyll protoplasts by tobacco mosaic virus ribonucleic acid. Virology 39, 439–448.
4. Davey, M.R., Cocking, E.C., Freeman, J., Pearce, N., and Tudor, I. (1980) Transformation of Petunia protoplasts by isolated Agrobacterium plasmids. Plant Science Letters 18, 307–313.

5. Sheen, J. (2001) Signal transduction in maize and Arabidopsis mesophyll protoplasts. Plant Physiology 127, 1466–1475.
6. Fromm, M.E., Morrish, F., Armstrong, C., Williams, R., Thomas, J., and Klein, T.M. (1990) Inheritance and expression of chimeric genes in the progeny of transgenic maize plants. Bio-Technology 8, 833–839.
7. Klein, T.M., Fromm, M., Weissinger, A., Tomes, D., Schaaf, S., Sletten, M., and Sanford, J.C. (1988) Transfer of foreign genes into intact maize cells with high-velocity microprojectiles. Proceedings of the National Academy of Sciences of the United States of America 85, 4305–4309.
8. Dekeyser, R.A., Claes, B., Derycke, R.M.U., Habets, M.E., Vanmontagu, M.C., and Caplan, A.B. (1990) Transient gene-expression in intact and organized rice tissues. Plant Cell 2, 591–602.
9. Lindsey, K. and Jones, M.G.K. (1987) Transient gene-expression in electroporated protoplasts and intact-cells of sugar-beet. Plant Molecular Biology 10, 43–52.
10. Nishiguchi, M., Langridge, W.H.R., Szalay, A.A., and Zaitlin, M. (1986) Electroporation-mediated infection of tobacco leaf protoplasts with tobacco mosaic-virus RNA and cucumber mosaic-virus RNA. Plant Cell Reports 5, 57–60.
11. Krens, F.A., Molendijk, L., Wullems, G.J., and Schilperoort, R.A. (1982) In vitro transformation of plant-protoplasts with Ti-plasmid DNA. Nature 296, 72–74.
12. Hillmer, S., Gilroy, S., and Jones, R.L. (1993) Visualizing enzyme-secretion from individual barley (Hordeum vulgare) aleurone protoplasts. Plant Physiology 102, 279–286.
13. Helenius, E., Boije, M., Niklander-Teeri, V., Palva, E.T., and Teeri, T.H. (2000) Gene delivery into intact plants using the Helios™ Gene Gun. Plant Molecular Biology Reporter 18, 287–288.
14. Hellens, R., Allan, A., Friel, E., Bolitho, K., Grafton, K., Templeton, M., Karunairetnam, S., Gleave, A., and Laing, W. (2005) Transient expression vectors for functional genomics, quantification of promoter activity and RNA silencing in plants. Plant Methods 1, 446–451.
15. Gheysen, G., Angenon, G., and Van Montagu, M. (1998) Agrobacterium-mediated plant transformation: a scientifically intriguing story with significant applications, in Transgenic Plant Research (Lindsey, K., ed.), Harwood Academic Press, the Netherlands, pp. 1–33.
16. Kapila, J., DeRycke, R., VanMontagu, M., and Angenon, G. (1997) An Agrobacterium-mediated transient gene expression system for intact leaves. Plant Science 122, 101–108.
17. Joh, L.D., Wroblewski, T., Ewing, N.N., and VanderGheynst, J.S. (2005) High-level transient expression of recombinant protein in lettuce. Biotechnology and Bioengineering 91, 861–871.
18. Scholthof, H.B., Scholthof, K.B.G., and Jackson, A.O. (1996) Plant virus gene vectors for transient expression of foreign proteins in plants. Annual Review of Phytopathology 34, 299–323.
19. Streatfield, S.J. (2007) Approaches to achieve high-level heterologous protein production in plants. Plant Biotechnology Journal 5, 2–15.


20. Gelvin, S.B. (2005) Viral-mediated plant transformation gets a boost. Nature Biotechnology 23, 684–685.
21. Negrouk, V., Eisner, G., Lee, H.I., Han, K.P., Taylor, D., and Wong, H.C. (2005) Highly efficient transient expression of functional recombinant antibodies in lettuce. Plant Science 169, 433–438.
22. Gleba, Y., Marillonnet, S., and Klimyuk, V. (2004) Engineering viral expression vectors for plants: the ‘full virus’ and the ‘deconstructed virus’ strategies. Current Opinion in Plant Biology 7, 182–188.
23. Marillonnet, S., Thoeringer, C., Kandzia, R., Klimyuk, V., and Gleba, Y. (2005) Systemic Agrobacterium tumefaciens-mediated transfection of viral replicons for efficient transient expression in plants. Nature Biotechnology 23, 718–723.
24. Gleba, Y., Klimyuk, V., and Marillonnet, S. (2005) Magnifection – a new platform for expressing recombinant vaccines in plants. Vaccine 23, 2042–2048.
25. Baulcombe, D.C. (1999) Gene silencing: RNA makes RNA makes no protein. Current Biology 9, R599–R601.
26. Lindbo, J.A., Silvarosales, L., Proebsting, W.M., and Dougherty, W.G. (1993) Induction of a highly specific antiviral state in transgenic plants – implications for regulation of gene-expression and virus-resistance. Plant Cell 5, 1749–1759.
27. Watson, J.M., Fusaro, A.F., Wang, M.B., and Waterhouse, P.M. (2005) RNA silencing platforms in plants. FEBS Letters 579, 5982–5987.
28. Jin, H.L., Axtell, M.J., Dahlbeck, D., Ekwenna, O., Zhang, S.Q., Staskawicz, B., and Baker, B. (2002) NPK1, an MEKK1-like mitogen-activated protein kinase kinase kinase, regulates innate immunity and development in plants. Developmental Cell 3, 291–297.
29. Thomas, C.L., Jones, L., Baulcombe, D.C., and Maule, A.J. (2001) Size constraints for targeting post-transcriptional gene silencing and for RNA-directed methylation in Nicotiana benthamiana using a potato virus X vector. Plant Journal 25, 417–425.
30. Burch-Smith, T.M., Anderson, J.C., Martin, G.B., and Dinesh-Kumar, S.P. (2004) Applications and advantages of virus-induced gene silencing for gene function studies in plants. Plant Journal 39, 734–746.
31. Ratcliff, F., Martin-Hernandez, A.M., and Baulcombe, D.C. (2001) Tobacco rattle virus as a vector for analysis of gene function by silencing. Plant Journal 25, 237–245.


32. Burch-Smith, T.M., Schiff, M., Liu, Y.L., and Dinesh-Kumar, S.P. (2006) Efficient virus-induced gene silencing in Arabidopsis. Plant Physiology 142, 21–27.
33. Robertson, D. (2004) VIGS vectors for gene silencing: many targets, many tools. Annual Review of Plant Biology 55, 495–519.
34. Fromm, M., Callis, J., Taylor, L.P., and Walbot, V. (1987) Electroporation of DNA and RNA into plant-protoplasts. Methods in Enzymology 153, 351–366.
35. Hauptmann, R.M., Ozias-Akins, P., Vasil, V., Tabaeizadeh, Z., Rogers, S.G., Horsch, R.B., Vasil, I.K., and Fraley, R.T. (1987) Transient expression of electroporated DNA in monocotyledonous and dicotyledonous species. Plant Cell Reports 6, 265–270.
36. Wang, Y.C., Klein, T.M., Fromm, M., Cao, J., Sanford, J.C., and Wu, R. (1988) Transient expression of foreign genes in Rice, Wheat and Soybean cells following particle bombardment. Plant Molecular Biology 11, 433–439.
37. Bhattacharyya, S., Dey, N., and Maiti, I.B. (2002) Analysis of cis-sequence of subgenomic transcript promoter from the Figwort mosaic virus and comparison of promoter activity with the cauliflower mosaic virus promoters in monocot and dicot cells. Virus Research 90, 47–62.
38. Blechl, A.E., Lorens, G.F., Greene, F.C., Mackey, B.E., and Anderson, O.D. (1994) A transient assay for promoter activity of wheat seed storage protein genes and other genes expressed in developing endosperm. Plant Science 102, 69–80.
39. He, X.Y., Futterer, J., and Hohn, T. (2002) Contribution of downstream promoter elements to transcriptional regulation of the rice tungro bacilliform virus promoter. Nucleic Acids Research 30, 497–506.
40. Martinez, A., Sparks, C., Hart, C.A., Thompson, J., and Jepson, I. (1999) Ecdysone agonist inducible transcription in transgenic tobacco plants. Plant Journal 19, 97–106.
41. Martinez, A., Sparks, C., Drayton, P., Thompson, J., Greenland, A., and Jepson, I. (1999) Creation of ecdysone receptor chimeras in plants for controlled regulation of gene expression. Molecular and General Genetics 261, 546–552.
42. Sun, C.X., Palmqvist, S., Olsson, H., Boren, M., Ahlandsberg, S., and Jansson, C. (2003) A novel WRKY transcription factor, SUSIBA2, participates in sugar signaling in barley by binding to the sugar-responsive elements of the iso1 promoter. Plant Cell 15, 2076–2092.
43. Vickers, C.E., Xue, G.P., and Gresshoff, P.M. (2006) A novel cis-acting element, ESP, contributes to high-level endosperm-specific expression in an oat globulin promoter. Plant Molecular Biology 62, 195–214.
44. Sivamani, E. and Qu, R. (2006) Expression enhancement of a rice polyubiquitin gene promoter. Plant Molecular Biology 60, 225–239.
45. Yoshida, K. and Shinmyo, A. (2000) Transgene expression systems in plant, a natural bioreactor. Journal of Bioscience and Bioengineering 90, 353–362.
46. Ma, J.K.C., Drake, P.M.W., Chargelegue, D., Obregon, P., and Prada, A. (2005) Antibody processing and engineering in plants, and new strategies for vaccine production. Vaccine 23, 1814–1818.
47. Giritch, A., Marillonnet, S., Engler, C., van Eldik, G., Botterman, J., Klimyuk, V., and Gleba, Y. (2006) Rapid high-yield expression of full-size IgG antibodies in plants coinfected with noncompeting viral vectors. Proceedings of the National Academy of Sciences of the United States of America 103, 14701–14706.
48. Sudarshana, M.R., Plesha, M.A., Uratsu, S.L., Falk, B.W., Dandekar, A.M., Huang, T.K., and McDonald, K.A. (2006) A chemically inducible cucumber mosaic virus amplicon system for expression of heterologous proteins in plant tissues. Plant Biotechnology Journal 4, 551–559.
49. Garfinkel, D.J. and Nester, E.W. (1980) Agrobacterium tumefaciens mutants affected in crown gall tumorigenesis and octopine catabolism. Journal of Bacteriology 144, 732–743.
50. Jefferson, R.A., Kavanagh, T.A., and Bevan, M.W. (1987) GUS fusion: β-Glucuronidase as a sensitive and versatile gene fusion marker in plants. EMBO J. 6, 3901–3907.
51. Rasco-Gaunt, S., Riley, A., Barcelo, P., and Lazzeri, P.A. (1999) Analysis of particle bombardment parameters to optimise DNA delivery into wheat tissues. Plant Cell Reports 19, 118–127.

Chapter 9

Bridging the Gene-to-Function Knowledge Gap Through Functional Genomics

Stephen J. Robinson and Isobel A. P. Parkin

Summary

The explosion of genomics data has led to a significant knowledge gap, with thousands of identified genes having no known function. This chapter describes the available forward and reverse genetics strategies, which can assist researchers in assigning functions to novel genes. Details of the available resources for a number of model and crop species are provided. In addition, protocols are presented for utilising T-DNA tagged populations to identify genes underlying novel phenotypes and to assist with functional characterisation of target genes.

Key words: Mutagenised population, Forward genetics, Reverse genetics, Phenotypic screening, Genetic variation.

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513 © Humana Press, a part of Springer Science + Business Media, LLC 2009 DOI: 10.1007/978-1-59745-427-8_9

1. Introduction

Annotation of the complete genome sequences of the dicot Arabidopsis and the monocot rice revealed a complement of 30,000–50,000 genes within the respective genomes (1–3). However, the function of the majority of these genes awaits discovery. The number of sequenced plant genomes is expected to increase shortly with the addition of maize (4), medicago (5) and lotus (6), and consortiums have been established to sequence tomato (7), brassica and soybean (8, 9). The plethora of genomic sequence data and the availability of large sets of expressed sequence tags for a range of species (10, 11) have significantly widened the gene-to-function knowledge gap, with the identification of large numbers of genes of unknown function. The application of bioinformatics has arguably aided function prediction by identifying commonalities between genes at the sequence level, not only within gene families but across significant species boundaries. In addition, the growing wealth of transcriptome and proteome data has allowed researchers an insight into the expression patterns of thousands of unknown genes, which can be informative with regard to function assignment. However, even for plants with simple genomes, such as Arabidopsis, only 10% of the genes have been assigned a function based on their activity in vivo (1).

Functional genomics resources, which largely take the form of populations of mutagenised plant lines, have been developed to bridge the widening chasm between the identification and the characterisation of gene function. The isolation and analysis of mutant phenotypes is a powerful tool facilitating the dissection of genetic pathways, since the mutants provide a direct link to the biochemical function of a gene in vivo. Traditionally, this practice has involved the identification of curiosities among individuals in natural populations, but the frequency of such individuals rarely saturates even the simplest genetic pathways. To develop populations enriched for genetic lesions, different approaches have been used, including chemical mutagenesis (12), insertional mutagenesis by T-DNA (13, 14) or transposon integration (15, 16) and the use of RNA interference (RNAi) (17). The various methods employed to substantially increase the mutation rate have different advantages and drawbacks, which are summarised in Table 1 (18).

Table 1 A comparison of the different features associated with alternate mutagenesis strategies [redrawn from Feng and Mundy (18)]

                          Induced mutation method
                          Insertions              Point    Deletions   RNAi
Characteristic            T-DNA     Transposon    EMS      γ-ray       silencing
DNA tag                   Yes       Yes           No       No          No
Reversion                 No        Yes           No       No          No
Background mutation       Low       Low           High     High        Low
Penetrance                High      High          High     High        Variable
Requires transformation   Yes       Yes           No       No          Yes
Production time           Slow      Slow          Fast     Fast        Slow
Development cost          High      High          Low      Low         High

1.1. Choice of Mutation Strategy

1.1.1. Insertional Mutagenesis

Insertional mutagenesis is based on the inactivation of a gene via insertion of a known DNA fragment (transposons and T-DNA elements). The integration site is thus marked and it is possible to isolate adjacent DNA sequences via this molecular marker. Endogenous transposons in Antirrhinum majus and Zea mays have been used for gene tagging and isolation of genes (19). Wild-type alleles mutated by transposon tagging can be recovered through induced excision, providing additional confirmation of the function of the mutated gene (20). However, imprecise excision events can leave behind ‘footprints’ which limit this advantage and in addition can generate mutations unlinked to the original inserted element. The isolation of genes by tagging with the T-DNA from Agrobacterium tumefaciens has also been successfully applied to clone genes from Arabidopsis (21). More sophisticated strategies have been designed which employ insertion elements engineered to facilitate enhancer, promoter and gene trapping along with the isolation of gain of function mutations through activation tagging (22–24). The development of saturated populations using insertion mutagenesis is dependent upon establishing an efficient transformation system for the species of interest and is realistically limited to species with simple compact genomes.

1.1.2. Chemical and Physical Mutagenesis

Chemical and physical mutagenesis are not limited by the size of the genome or the transformation efficiency of the species, which makes them excellent strategies for the more complex and intractable crop genomes. Physical mutagenesis is induced through exposure to high energy sources (fast neutrons or γ-rays) resulting in significant chromosomal aberrations, including deletions (25). These types of mutations are particularly useful for the analysis of tandemly duplicated genes. The most widely applied chemical mutagen is ethyl methane sulphonate (EMS), which has been used to develop mutagenised populations for a range of species (26–28). EMS is an alkylating agent that has been shown to generate almost exclusively G/C to A/T point mutations randomly distributed across the genome (28). Such mutations may not only result in loss of gene function but can also lead to an allelic series for the gene of interest (29). The efficient saturation of the genome can be achieved with a relatively small chemically mutagenised population although at the cost of multiple mutations per line.
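Because EMS produces almost exclusively G/C to A/T transitions, the positions in a coding sequence where a single such change would create a premature stop codon can be listed in advance. The following Python sketch illustrates the idea under simple assumptions: the sequence is an in-frame coding sequence read on the coding strand, and the example sequence and the function name `ems_stop_sites` are invented for illustration.

```python
# Sketch: list EMS-type transitions (G->A, C->T on the coding strand) that
# would turn a sense codon into a stop codon. The example sequence is made up.

STOP_CODONS = {"TAA", "TAG", "TGA"}
EMS_CHANGES = {"G": "A", "C": "T"}  # G/C to A/T transitions


def ems_stop_sites(cds):
    """Return (position, ref, alt, codon, mutant_codon) tuples.

    Positions are 1-based within the coding sequence. Only complete codons
    are examined, and existing stop codons are skipped.
    """
    cds = cds.upper()
    hits = []
    for start in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[start:start + 3]
        if codon in STOP_CODONS:
            continue
        for offset, base in enumerate(codon):
            alt = EMS_CHANGES.get(base)
            if alt is None:
                continue
            mutant = codon[:offset] + alt + codon[offset + 1:]
            if mutant in STOP_CODONS:
                hits.append((start + offset + 1, base, alt, codon, mutant))
    return hits


example_cds = "ATGTGGCAACGATAA"  # Met-Trp-Gln-Arg-stop, invented example
stop_sites = ems_stop_sites(example_cds)
for site in stop_sites:
    print(site)
```

Codons such as TGG, CAA, CAG and CGA are the usual sources of EMS-induced nonsense alleles, while other G/C positions give missense or silent changes, which is one reason an allelic series can be recovered for a single gene.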

1.2. Forward Genetic Screening Strategies

Forward genetics allows the discovery of the gene responsible for a particular phenotype without a priori knowledge of the gene (Fig. 1). In practice, mutagenised populations are screened to identify novel phenotypes. The identification of such lines requires the high-throughput assessment of large populations, necessitating the design of a stringent screen which minimises the occurrence of false positive results (30). The use of insertion mutagenised populations facilitates the isolation of the underlying gene by exploiting the known sequence of the inserted element, whereas the laborious technique of map-based gene cloning is required to isolate the mutated gene from chemical/physical mutagenised populations. There are numerous publicly available resources to allow forward genetic strategies in the model plant Arabidopsis, but virtually none for crop species. Although EMS populations are being developed in many crop species, these are largely being utilised for reverse genetic approaches, due to the difficulties associated with map-based gene cloning in complex genomes.

Fig. 1. Schematic diagram detailing the different steps involved in forward and reverse genetic strategies. [Flow chart; panels: Forward Genetics Strategies, Reverse Genetics Strategies.]

1.3. Reverse Genetic Screening Strategies

Reverse genetics allows the assignment of function to a selected gene (Fig. 1). This can be achieved by analysing the phenotype of lines carrying defective alleles of the target gene. This strategy has been widely adopted due to the availability of carefully archived populations of insertional mutants. Since the sequence of the inserted element is known, it is possible to rapidly screen large populations using PCR screening of multi-dimensional pools of lines. More recently, however, the development of databases containing the sequenced insertion sites for such populations has revolutionised the process of identifying mutant alleles by allowing researchers to detect and order plant lines carrying a mutation of interest 'in silico'. These resources are complemented by archived EMS populations that allow the application of TILLING (Targeted Induced Local Lesions in Genomes). TILLING is a powerful high-throughput technique that employs a mismatch-specific endonuclease to detect induced DNA polymorphisms (31). As the transgenic approaches are limited to model species, TILLING offers a practicable alternative for reverse genetics in crop genomes. Additionally, the RNAi phenomenon can be exploited to develop silenced alleles to assist in function assignment (32). In Arabidopsis, an endeavour to provide resources to systematically silence ~25,000 genes has been initiated (17).

A significant fraction of eukaryotic genes are members of multigene families. This genetic redundancy can mask the effect of single null alleles, preventing the manifestation of an observable phenotype. The problem can be circumvented by pyramiding multiple loss-of-function mutations in a single line. Alternatively, silencing of closely related genes can be achieved through RNAi in species amenable to transformation (33).

There have been many excellent reviews detailing the methods to carry out both insertional and chemical mutagenesis (12, 28, 34, 35). Today, the majority of researchers are unlikely to have the means to develop saturated populations but are in a position to exploit such tools. This chapter details the available resources for a number of model and crop species and provides protocols for applying these technologies to assist in assigning biological function to selected genes and in identifying the genes underlying novel phenotypic variation. The examples provided are for the Arabidopsis model, which has benefited from the early establishment of expansive genomics and genetics resources. Many of the approaches are directly applicable to rice, a species where extensive resources are being developed.
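The PCR screening of multi-dimensional pools mentioned above can be illustrated with a short Python sketch. It assumes a hypothetical three-dimensional layout in which every line belongs to one plate pool, one row pool and one column pool; the pool names, the grid dimensions and the function name `candidate_lines` are inventions for this example and do not describe any particular stock-centre population.

```python
from itertools import product


def candidate_lines(positive_pools, n_plates, n_rows, n_cols):
    """Return (plate, row, col) addresses consistent with the positive pools.

    A line is a candidate only if its plate pool, its row pool and its
    column pool all gave a PCR product.
    """
    return [
        (p, r, c)
        for p, r, c in product(range(n_plates), range(n_rows), range(n_cols))
        if f"plate{p}" in positive_pools
        and f"row{r}" in positive_pools
        and f"col{c}" in positive_pools
    ]


# Hypothetical population: 10 plates of 8 x 12 lines (960 lines in total),
# screened with 10 + 8 + 12 = 30 pooled PCR reactions.
single = candidate_lines({"plate3", "row5", "col7"}, 10, 8, 12)
print(single)

# Two positive pools in one dimension leave more than one candidate, so the
# individual lines would need to be retested.
ambiguous = candidate_lines({"plate3", "plate6", "row5", "col7"}, 10, 8, 12)
print(ambiguous)
```

The design trades a small number of reactions against ambiguity: when more than one line in the population carries an insertion detected by the primer pair, the pools identify a short list of candidates, not a single line.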

2. Materials

2.1. Publicly Available Functional Genomics Resources

1. http://www.lotusjaponicus.org/: TILLING in Lotus japonicus.
2. http://orygenesdb.cirad.fr/: OryGenes DB interactive tool for rice functional genomics, includes links to seven tagging databases.
3. http://www.maizegdb.org/rescuemu-phenotype.php: RescueMu maize mutant phenotype database.
4. http://genome.purdue.edu/maizetilling/: Maize TILLING project.
5. http://signal.salk.edu/: Access to insertion lines and cDNA collections for Arabidopsis and rice.
6. http://www.arabidopsis.org/: Comprehensive database providing access to numerous Arabidopsis genetics and genomics resources, including the Arabidopsis Biological Resource Center (ABRC).
7. http://germinate.scri.sari.ac.uk/barley/mutants/: TILLING resource for barley.
8. http://www.gabi-till.de/: In future will allow TILLING in Arabidopsis, barley and sugarbeet.
9. http://urgv.evry.inra.fr/UTILLdb: In future will offer TILLING in pea, tomato and rapeseed (oilseed rape).
10. http://www.agrikola.org/index.php?o=/agrikola/main: RNAi in Arabidopsis.
11. http://www.iris.irri.org/: Rice deletion mutants.

2.2. Bioinformatics Resources

1. http://www.arabidopsis.org/: The Arabidopsis Information Resource.
2. http://www.brassica.info/: Brassica/Arabidopsis Genomics Initiative.
3. http://www.ncbi.nlm.nih.gov/: National Center for Biotechnology Information.
4. http://www.maizegdb.org/: Maize Genetics and Genomics Database.
5. http://www.mainlab.clemson.edu/gdr/: Genome database for Rosaceae.
6. http://www.gramene.org/: A resource for comparative grass genomics.

2.3. Reagents

2.3.1. Small-Scale Plant Genomic DNA Extraction

1. 4.4 M ammonium acetate (NH4OAc) (pH 5.2).
2. Urea extraction buffer (UEB): 7 M urea, 300 mM NaCl, 50 mM Tris–HCl (pH 8.0), 20 mM EDTA (pH 8.0), 1% N-lauroyl sarcosine. Do not autoclave urea.
3. Buffer saturated phenol (pH 8)/chloroform/isoamyl alcohol (25:24:1).
4. 40 mg/ml RNase A.
5. Propan-2-ol.
6. 70% ethanol.
7. TE buffer: 10 mM Tris–HCl (pH 7.6), 1 mM EDTA.

2.3.2. Large-Scale Plant Genomic DNA Extraction

1. Kirby mix: 500 mM 4-aminosalicylic acid, 50 mM Tris–HCl (pH 8), 1% sodium dodecyl sulphate (SDS), 6% buffer saturated phenol (pH 8). Ensure 4-aminosalicylic acid is in solution prior to addition of SDS. Solution should be stored at room temperature in the dark.
2. 3 M sodium acetate (NaOAc) (pH 6.0).
4. Buffer saturated phenol (pH 8)/chloroform/isoamyl alcohol (25:24:1).
5. TE buffer: 10 mM Tris–HCl (pH 7.6), 1 mM EDTA.
6. 40 mg/ml RNase A.
7. Propan-2-ol.
8. 70% ethanol.

2.3.3. Southern Blotting

1. 10× TAE: 400 mM Tris–HCl (pH 7.5), 180 mM glacial acetic acid, 1 mM EDTA.
2. 0.8% agarose gel in 1× TAE.
3. Loading buffer: 75 mM EDTA, 20% Ficoll (400), 0.2% bromophenol blue.
4. 250 mM hydrochloric acid (HCl).
5. Transfer solution: 400 mM sodium hydroxide (NaOH).
6. Hybond N+ (GE Healthcare).
7. 3MM Whatman paper.
8. 2× SSC: 0.3 M NaCl, 30 mM sodium citrate.

2.3.4. Southern Hybridisation

1. 10 mg/ml Herring Testes DNA (Sigma D-6898).
2. 32P dCTP oligo-labelling buffer: 250 mM Tris–HCl (pH 6.9), 100 mM MgSO4, 100 mM dithiothreitol, 100 mM dATP, 100 mM dGTP, 100 mM dTTP.
3. 50× Denhardt's: 1% Ficoll (400), 1% polyvinylpyrrolidone (360), 1% BSA (Fraction V).
4. 20× SETS: 3 M NaCl, 20 mM EDTA (pH 8), 600 mM Tris–HCl (pH 8) and 10 mM tetra-sodium pyrophosphate.
5. 20× SSC: 3 M NaCl, 0.3 M sodium citrate.
6. Pre-hybridisation solution: 0.01 g/ml dextran sulphate (Sigma D-6001), 4× SETS, 10× Denhardt's, 0.1% SDS.
7. Hybridisation solution: 0.1 g/ml dextran sulphate (Sigma D-6001), 4× SETS, 10× Denhardt's, 0.1% SDS.
8. Sepharose CL-6B spin column: Plug the pierced hole in the bottom of a 0.5-ml Eppendorf tube with 0.4 mm glass beads. Add 0.5 ml of Sepharose CL-6B (GE Healthcare) in 1× TE. Centrifuge at 2,000 g to pack column.
9. Wash 1: 0.2× SSC, 0.1% SDS.
10. Wash 2: 2× SSC, 0.1% SDS.
11. Primer d(N)6 (50 μg/ml).
12. 10% SDS.
13. 80 mM EDTA.
14. DNA polymerase I Klenow fragment (2 U/µl, Roche).
15. 32P dCTP (3,000 Ci/mmol).

2.3.5. Ligation-Mediated Genome Walking

1. AP1 – GTAATACGACTCACTATAGGGC
2. AP2 – ACTATAGGGCACGCGTGGT
3. GW adaptor sequences: 5′-GTAATACGACTCACTATAGGGCACGCGTGGTCGACGGCCCGGGCTGGT-3′ and 5′-ACCAGCCC-NH2-3′. Prepare the adaptors by combining equimolar concentrations of the two oligonucleotides in an Eppendorf tube and incubating sequentially at 95°C for 10 min, 65°C for 10 min and 37°C for 10 min, followed by room temperature for 20 min.
4. Glycogen (10 μg/μl).
5. Blunt end cutting enzymes (e.g. DraI, EcoRV, PvuII or StuI) and appropriate 10× restriction enzyme buffers.
6. T4 DNA ligase and 5× T4 DNA ligase buffer (Invitrogen).
7. 7.5 M ammonium acetate (NH4OAc).
8. 95% ethanol and 70% ethanol.
9. Amplitaq Gold polymerase 5 U/μl and 10× PCR buffer (Applied Biosystems).
10. 25 mM MgCl2.
11. 10 mM dNTPs.
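Before ordering oligonucleotides, the adaptor and primer sequences listed for ligation-mediated genome walking can be checked for internal consistency. The short Python sketch below is offered only as an illustrative check, with variable names of our own choosing; it confirms that the short adaptor strand is complementary to the 3′ end of the long strand and locates the AP1 and AP2 primers within the long strand.

```python
# Sketch: sanity-check the genome walking adaptor and primer sequences.
# Sequences are copied from Subheading 2.3.5; variable names are arbitrary.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


LONG_ADAPTOR = "GTAATACGACTCACTATAGGGCACGCGTGGTCGACGGCCCGGGCTGGT"
SHORT_ADAPTOR = "ACCAGCCC"  # carries the 3' amino group in the protocol
AP1 = "GTAATACGACTCACTATAGGGC"
AP2 = "ACTATAGGGCACGCGTGGT"

# The short strand should anneal to the 3' end of the long strand,
# giving a blunt-ended duplex for ligation to blunt-cut genomic DNA.
short_anneals_to_end = LONG_ADAPTOR.endswith(reverse_complement(SHORT_ADAPTOR))

# Both primers should match the long strand; AP2 should start inside AP1's
# footprint or downstream of its start, so that it acts as a nested primer.
ap1_start = LONG_ADAPTOR.find(AP1)
ap2_start = LONG_ADAPTOR.find(AP2)

print("short strand pairs with 3' end:", short_anneals_to_end)
print("AP1 starts at index", ap1_start, "and AP2 at index", ap2_start)
```

A check of this kind catches transcription errors in long oligonucleotides, which are easy to introduce when sequences are copied by hand.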

2.3.6. RNA Extraction

1. Baked pestle and mortar (see Note 1).
2. TLES buffer: 100 mM Tris–HCl pH 8, 100 mM LiCl, 10 mM EDTA, 1% SDS.
3. Buffer saturated phenol (pH 4.5).
4. Chloroform/isoamyl alcohol (24:1).
5. 4 M lithium chloride (LiCl).
6. 0.1% diethylpyrocarbonate (DEPC) treated ddH2O (autoclave).
7. 70% ethanol.
8. Liquid nitrogen.

2.3.7. RT-PCR

1. Gene specific primers (GSPs) or oligo(dT) primer (10 µM).
2. 0.1% DEPC treated ddH2O (autoclave).
3. 5× first strand cDNA synthesis buffer (Invitrogen): 250 mM Tris–HCl (pH 8.3), 375 mM KCl, 15 mM MgCl2.
4. 10× PCR buffer (Applied Biosystems).
5. 25 mM MgCl2.
6. 10 mM dNTPs.
7. 100 mM dithiothreitol (DTT).
8. Superscript II reverse transcriptase (200 U/μl) (Invitrogen).
9. DNase free RNase (1 U/μl).
10. Amplitaq Gold DNA polymerase (5 U/μl) (Applied Biosystems).

2.3.8. PCR Identification of Homozygous Mutant Alleles

1. Reagents 9–11 as in Subheading 2.3.5.

3. Methods

3.1. Forward Genetic Screens

Presently, there are limited resources to allow forward genetic screens in crop species, restricting such analyses to the model Arabidopsis, for which there are publicly available chemical, physical and insertion mutagenised populations (see Subheading 2.1). Identification of a chemically/physically induced mutation requires map-based gene cloning to identify the underlying allele, which is beyond the scope of the present chapter. We will limit the following description to the identification of a T-DNA insertional mutant.

Forward genetics starts with the identification of a novel variant for a phenotype of interest (e.g. Fig. 2, see Note 2). The segregation ratio of self progeny from the mutant should be assayed to establish the dominance effects of the phenotype and ensure stable inheritance of the phenotype through meiosis. Tissue is sampled from individual self progeny, which can then be used to genotype the lines, demonstrate co-segregation of the phenotype with the insertion element, identify the genomic position of the insertion element and determine the copy number of the inserted element(s) (see Note 3). The number of insertion events in a mutant line is assayed using Southern hybridisation (see Note 4). Numerous methods have been adopted to facilitate the identification of DNA flanking the T-DNA insertion, with varying success, for example, thermal asymmetric interlaced PCR (36), ligation-mediated genome walking (see below) and plasmid rescue where applicable (24) (see Note 5). The sequence obtained through these methods is compared to available genome sequence using sequence alignment tools (e.g. BLASTN) to determine the location of the insert.

For the majority of T-DNA insertion lines, the phenotype results from loss of function, where the element has integrated into a specific gene. However, in populations where the T-DNA carries enhancer elements, integration into intergenic sequence can lead to activation of adjacent genes (Fig. 2). The activation/disruption of a target gene is confirmed by assaying expression in the mutant relative to wild type using RT-PCR (see Notes 6 and 7) (Fig. 2). The DNA sequence flanking the insertion site can also be used to design primers to discriminate the genotype of the progeny as described below (see Subheading 3.2.1). Theoretically, the phenotype may not be the result of mutating the identified gene, since alternative DNA lesions linked to the inserted elements may be responsible. For activation mutants, confirmation of gene function can be achieved by reproducing the phenotype using an over-expression construct (see Note 8). In the case of loss-of-function alleles, gene function can be confirmed through complementation of the mutant or the identification of independent alleles, which can be achieved using reverse genetic strategies (see below).

Two methods are provided for DNA extraction: the first extracts small amounts of DNA suitable for PCR analysis, and the second is optimised for large-scale DNA extractions and is suitable for species with high levels of insoluble carbohydrates and phenolics.

3.1.1. Small-Scale Plant Genomic DNA Extraction [Modified from (37)]

1. Harvest approximately three inflorescence buds into an Eppendorf tube and freeze in liquid nitrogen. Grind tissue into a fine powder.
2. Add 500 μl of UEB and continue grinding tissue.
3. Once all samples are suspended in UEB, add 400 μl of phenol/chloroform/isoamyl alcohol, vortex and centrifuge at 10,000 g in a microfuge for 10 min.
4. Aspirate the aqueous layer and place into a fresh tube (~450 μl).
5. Precipitate DNA with 450 μl of propan-2-ol, centrifuge for 3 min at 10,000 g.
6. Remove propan-2-ol and suspend the pellet in 500 μl TE containing 40 μg/ml RNase A.
7. Add 100 μl 4.4 M NH4OAc (pH 5.2) followed by 700 μl of propan-2-ol.
8. Vortex and then centrifuge for 3 min at 10,000 g to pellet the DNA.
9. Remove supernatant and wash the DNA in 70% ethanol.
10. Allow the pellet to air dry and resuspend the pellet in 50–100 μl of 0.1× TE.

Fig. 2. (A) Schematic diagram showing the generation of both loss-of-function and gain-of-function alleles through the use of activation tagged T-DNA mutagenesis. (B) Example of an activation tagged Arabidopsis line identified through a simple morphological screen. The wild-type and mutant plants were grown for 2 weeks at 20°C and 150 μmol m⁻² s⁻¹; the diminutive mutant exhibits a clear stunted phenotype. (C) The identified T-DNA integration site on Arabidopsis chromosome 1 for the mutant line. The location of the T-DNA was intergenic between the genes At1g50940 and At1g50950, suggesting the line possessed an activation mutation. (D) RT-PCR analysis confirmed the activation of At1g50960. Note the activated gene was not directly adjacent to the insertion.
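Once a flanking sequence has been aligned to the genome, deciding whether the insertion lies within a gene or between two genes, as in panel C of Fig. 2, is a simple coordinate lookup. The Python sketch below shows one way to do this. The gene names and coordinates are invented placeholders, not real annotation, and the function assumes a sorted list of non-overlapping gene models on a single chromosome.

```python
from bisect import bisect_right

# Hypothetical gene models (name, start, end), sorted by start coordinate.
GENES = [
    ("geneA", 1000, 2500),
    ("geneB", 4000, 5200),
    ("geneC", 7000, 9000),
]


def classify_insertion(pos, genes):
    """Classify an insertion coordinate as genic or intergenic.

    Returns ("genic", gene) or ("intergenic", left_gene, right_gene),
    where a missing neighbour is reported as None.
    """
    starts = [start for _, start, _ in genes]
    i = bisect_right(starts, pos) - 1  # last gene starting at or before pos
    if i >= 0 and genes[i][1] <= pos <= genes[i][2]:
        return ("genic", genes[i][0])
    left = genes[i][0] if i >= 0 else None
    right = genes[i + 1][0] if i + 1 < len(genes) else None
    return ("intergenic", left, right)


print(classify_insertion(4500, GENES))  # inside geneB
print(classify_insertion(3000, GENES))  # between geneA and geneB
```

For an activation tagged line, an intergenic result identifies the neighbouring genes as the first candidates for expression analysis, although, as the figure legend notes, the activated gene need not be directly adjacent to the insertion.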

3.1.2. Large-Scale Plant Genomic DNA Extraction [Modified from (38)]

Leaf tissue is excised from the plant and immediately frozen in liquid nitrogen. The tissue is then freeze dried for a minimum of 24 h, after which it can be stored with silica gel at –20°C.

1. Grind 500 mg of freeze dried tissue to a fine powder using a milling machine (Spex CertiPrep mixer/mill 8000).
2. Transfer the powder to a 50 ml polypropylene centrifuge tube. Add 15 ml of Kirby mix and vortex for 5–10 s to mix powder. Gently mix on a shaking platform for 15 min to disperse clumps.
3. Add 10 ml of phenol/chloroform/isoamyl alcohol and vortex for 5 s. Gently mix on a platform for 10 min. Centrifuge at 2,000 g for 10 min.
4. Transfer the upper phase (15 ml) to a fresh 50 ml tube using a wide-bore disposable plastic pipette.
5. Add 0.1 vol. of 3 M NaOAc (pH 6.0) and mix gently, add 0.6 vol. of propan-2-ol. Mix gently and precipitate DNA in darkness for at least 1 h.
6. Centrifuge at 2,000 g for 10 min. Decant supernatant and allow the pellet to air dry. Resuspend the pellet in 2 ml TE containing 40 μg/ml RNase A by placing in a shaking incubator at 37°C.
7. Transfer the resuspended pellet to a 15 ml polypropylene tube and add 2 ml phenol/chloroform/isoamyl alcohol, vortex and centrifuge as in step 3.
8. Transfer the aqueous phase to a fresh tube and precipitate the DNA as in step 5.
9. Centrifuge at 2,000 g for 10 min, decant supernatant and leave DNA to air dry.
10. Resuspend the DNA pellet in 250 μl of TE. This method should yield between 0.5 and 1 mg of DNA per 0.5 g of freeze dried tissue.
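With DNA available from individual self progeny, the segregation analysis described at the start of Subheading 3.1 can be checked with a chi-square goodness-of-fit test. The Python sketch below is a minimal illustration and not part of the published protocol. The progeny counts are invented, and the 3:1 expectation assumes a single dominant insertion segregating in the self progeny of a hemizygous plant.

```python
def chi_square(observed, expected_ratio):
    """Chi-square statistic for observed counts against an expected ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * part / ratio_sum for part in expected_ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Critical value of the chi-square distribution for 1 degree of freedom
# at P = 0.05.
CRITICAL_5PCT_DF1 = 3.841

# Invented example: 78 progeny with the mutant phenotype, 22 wild type.
observed_counts = [78, 22]
statistic = chi_square(observed_counts, [3, 1])
fits_3_to_1 = statistic < CRITICAL_5PCT_DF1

print(f"chi-square = {statistic:.2f}; consistent with 3:1: {fits_3_to_1}")
```

A statistic above the critical value would suggest that the phenotype is not inherited as a single dominant locus, for example because more than one insertion is segregating or because the mutation affects gamete or embryo viability.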

3.1.3. Southern Blotting

1. Digest genomic DNA with appropriate restriction enzyme (see Note 4), add loading buffer and resolve restriction fragments through gel electrophoresis: 1× TAE, 0.8% agarose gel at 1.5 V/cm for ~20 h, or until bromophenol blue dye front has migrated 12 cm. Cut gel to desired size for capillary blotting.
2. Depurinate the DNA by immersing the gel in 250 mM HCl for 10 min.
3. Replace the HCl with transfer solution and incubate for further 10 min.
4. Build blotting apparatus by placing a glass plate on top of a plastic tray containing transfer solution. Saturate two sheets of Whatman 3MM paper with transfer solution and place onto the glass plate to create a wick. Remove any bubbles from the wick.
5. Place the gel onto the wick, positioning it centrally and taking care to remove air bubbles beneath it. To limit indirect wicking and evaporation of the transfer solution, surround the gel with cling-film.
6. Saturate Hybond N+ nylon membrane with transfer solution and place on the gel. Place three pieces of saturated 3MM paper on top of the membrane, ensuring no air bubbles are present. Apply a stack of dry paper towels on top of the 3MM paper, place a weight on top and leave to blot overnight.
7. Membranes can be stored in 2× SSC at room temperature prior to use.

3.1.4. Southern Hybridisation

1. Add freshly boiled Herring Testes DNA (10 mg/ml stock) to the pre-hybridisation solution to a final concentration of 50 µg/ml and mix well; keep at 65°C.
2. Place membranes into hybridisation tube, add 30 ml of pre-hybridisation solution and place at 65°C for ~4 h.
3. In a 1.5 ml Eppendorf tube (with pierced lid) add 50–100 ng of template DNA and 2 µl primer d(N)6. Denature in a boiling water bath for 5 min. Immediately cool on ice, and centrifuge briefly to bring down condensation. Add ddH2O to bring the volume to 14.5 μl.
4. Add 2 µl 32P dCTP oligo-labelling buffer, 0.5 µl DNA polymerase I Klenow fragment and 3 µl 32P dCTP to the DNA solution. Incubate at room temperature for ~2 h, then stop the reaction by adding 30 μl of 80 mM EDTA.
5. Remove unincorporated nucleotides and primers by chromatography using Sepharose CL-6B in 1× TE.
6. Denature the labelled DNA by placing in a boiling water bath for 2 min and immediately place on ice.
7. Replace the pre-hybridisation solution with 5 ml of hybridisation solution and add the labelled DNA. Incubate overnight at 65°C.
8. Wash the filters three times with Wash 1 followed by two repetitions with Wash 2, for 15 min each at 65°C. Assay hybridisation by autoradiography.
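The volumes implied by the hybridisation steps above follow from the dilution relationship C1 × V1 = C2 × V2. The Python sketch below works through two of them as an illustration: the amount of carrier DNA stock needed for 30 ml of pre-hybridisation solution, and the total volume of the labelling reaction. The function and variable names are our own.

```python
def stock_volume(final_conc, stock_conc, final_volume):
    """Volume of stock required, from C1 * V1 = C2 * V2.

    Concentrations must share one unit; the result has the unit of
    final_volume.
    """
    return final_conc * final_volume / stock_conc


# Carrier DNA: 50 ug/ml final in 30 ml, from a 10 mg/ml (10,000 ug/ml) stock.
carrier_ml = stock_volume(final_conc=50, stock_conc=10_000, final_volume=30)
carrier_ul = carrier_ml * 1000  # convert ml to ul

# Oligo-labelling reaction, volumes in ul as given in steps 3 and 4.
labelling_ul = {
    "template + primer + ddH2O": 14.5,
    "oligo-labelling buffer": 2.0,
    "Klenow fragment": 0.5,
    "32P dCTP": 3.0,
}
total_labelling_ul = sum(labelling_ul.values())

print(f"carrier DNA stock to add: {carrier_ul:.0f} ul")
print(f"labelling reaction volume: {total_labelling_ul:.1f} ul")
```

The same function can be reused for the buffer stocks in Subheading 2.3.4, for example the volume of 20× SETS needed to give 4× SETS in a chosen final volume.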

3.1.5. Identification of T-DNA Insertion Sites by Ligation-Mediated Genome Walking

1. Digest 1 μg of genomic DNA: 10 μl genomic DNA (0.1 μg/μl) with 2 μl restriction enzyme (10 U/μl), 5 μl 10× restriction enzyme buffer and 33 μl ddH2O for 16 h at 37°C (see Note 9).
2. Bring volume to 200 μl using ddH2O, add an equal volume of phenol/chloroform/isoamyl alcohol, vortex and centrifuge at 10,000 g for 5 min.
3. Transfer the aqueous phase to a fresh tube and add 65 μl of 7.5 M NH4OAc, 2 μl of glycogen and 1 ml of 95% ethanol. Mix well and centrifuge at 10,000 g for 5 min.
4. Wash pellet in 70% ethanol and resuspend in 14 μl of ddH2O.
5. Add to the 12 μl digest, 5 μl 5× DNA ligase buffer, 2 μl of adaptor molecule (500 mM) and 1 μl T4 DNA ligase (1 U/μl). Incubate at 16°C for 16 h.
6. Repeat steps 2 through 4, but resuspend in 20 μl of ddH2O.
7. For each digest combine the following reagents in a 0.2 ml PCR tube: 5 μl 10× PCR buffer, 1 μl dNTP (10 mM each), 1 μl GSP1, 1 μl AP1 (10 μM), 40 μl ddH2O and 1 μl AmpliTaq Gold Polymerase.
8. Add 1 μl of each DNA digest to the PCR reaction.
9. Perform DNA amplification in a thermal cycler using a two-stage programme: denature at 94°C for 12 min, followed by 7 cycles of 94°C for 30 s, 58°C for 30 s, 72°C for 3 min, then 32 cycles of 94°C for 30 s, 55°C for 30 s, 72°C for 3 min, then 72°C for 10 min and hold at 4°C.
10. Dilute the primary reaction by 1/200 and add 1 μl of the diluted products to a 0.2 ml PCR tube containing 5 μl 10× PCR buffer, 1 μl dNTP, 1 μl GSP2, 1 μl AP2 (10 μM), 40 μl ddH2O and 1 μl AmpliTaq Gold Polymerase.
11. Perform DNA amplification using the same programme as in step 9, except reduce the number of cycles to 5 for the first stage and to 20 for the second stage.
12. Separate the amplification products using agarose gel electrophoresis to resolve the anticipated differences in size.
13. The DNA fragments can then be cloned and sequenced using standard techniques (39).

3.1.6. RNA Extraction [Modified from (40)]

1. Harvest 500 mg of material, freeze in liquid nitrogen and grind the material to a fine powder using a pestle and mortar chilled in liquid nitrogen. Transfer the powder into a chilled 50 ml tube.
2. Add 5 ml of TLES buffer followed by 5 ml phenol and vortex for 15 s.
3. Add 5 ml of chloroform/isoamyl alcohol and vortex for 15 s.
4. Centrifuge at 2,000 g for 5 min, remove the aqueous layer and distribute into 1.5 ml Eppendorf tubes.
5. Add an equal volume of 4 M LiCl, mix well and place at 4°C for at least 3 h.
6. Centrifuge at 10,000 g at 4°C for 20 min, remove supernatant and resuspend pellet in 0.5 ml DEPC-treated water.
7. Add 0.5 ml of 4 M LiCl, mix and place at 4°C overnight.
8. Centrifuge at 10,000 g at 4°C for 20 min, remove supernatant and wash pellet with 70% ethanol. Allow the pellet to air dry and resuspend in 200 μl of DEPC-treated ddH2O.

3.1.7. RT-PCR

1. Place 1–3 μg of RNA into a tube and add 2 μl 3′ GSP (10 μM) or oligo(dT) and adjust the volume to 15.5 μl using DEPC-treated ddH2O.
2. Incubate the mixture at 70°C for 10 min and then place on ice for 1 min.
3. Add 2.5 μl 5× first strand cDNA synthesis buffer, 2.5 μl 25 mM MgCl2, 1 μl 10 mM dNTPs and 2.5 μl 100 mM DTT.
4. Mix the solution gently and incubate at 42°C for 1 min.
5. Add 1 μl Superscript II reverse transcriptase (5 U/μl), mix, then incubate for 50 min at 42°C.
6. Optional step: Add 1 μl DNase free RNase and incubate at 37°C for 30 min.
7. Dilute 10 μl of the single stranded cDNA with 20 μl ddH2O and use 1.5 μl as template for PCR. Add to the template 2.5 μl 10× PCR buffer, 1.5 μl 25 mM MgCl2, 0.5 μl dNTPs, 1 μl each of the GSPs (10 μM), 16.75 μl of ddH2O and 0.25 μl Taq polymerase.
8. Perform DNA amplification using a standard PCR cycling programme with temperatures optimised for the specific GSPs.
9. Resolve the amplification products using agarose gel electrophoresis.
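When several cDNA samples are compared, the PCR components in step 7 are normally combined as a master mix. The Python sketch below scales the per-reaction volumes from step 7 and checks that they add up to the expected reaction volume. It assumes two gene specific primers per reaction (forward and reverse), and the 10% pipetting excess is an arbitrary choice for illustration, not a value taken from the protocol.

```python
# Per-reaction volumes (ul) from step 7, excluding the cDNA template.
PCR_MIX_UL = {
    "10x PCR buffer": 2.5,
    "25 mM MgCl2": 1.5,
    "dNTPs": 0.5,
    "forward GSP (10 uM)": 1.0,
    "reverse GSP (10 uM)": 1.0,
    "ddH2O": 16.75,
    "Taq polymerase": 0.25,
}
TEMPLATE_UL = 1.5


def master_mix(per_reaction, n_reactions, excess=0.1):
    """Scale per-reaction volumes for n reactions plus a fractional excess."""
    factor = n_reactions * (1 + excess)
    return {name: round(vol * factor, 2) for name, vol in per_reaction.items()}


reaction_volume = sum(PCR_MIX_UL.values()) + TEMPLATE_UL
mix_for_ten = master_mix(PCR_MIX_UL, n_reactions=10)

print(f"volume per reaction: {reaction_volume} ul")
for name, vol in mix_for_ten.items():
    print(f"{name}: {vol} ul")
```

The volume of mix dispensed per tube is the reaction volume minus the template, so that wild-type and mutant samples differ only in the cDNA added.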

3.2. Reverse Genetic Screening

The following protocol describes the confirmation of a loss-of-function allele obtained from one of the available sequence-tagged T-DNA site databases. TILLING alleles are generally obtained through a dedicated service provider, so information describing the analysis of such mutant lines is provided in the Notes. Reverse genetics begins with the identification of a target gene. To determine the utility of this approach, the level of duplication and the genome organisation of putative homologues of the target gene should first be assessed. This can be achieved through sequence alignment; such basic bioinformatics analyses can be performed using software provided at a number of websites, as listed in Subheadings 2.1 and 2.2. These sites can also be used to collect all relevant information and resources associated with the target gene and close homologues (Fig. 3). Lines containing loss-of-function alleles for the relevant genes can be purchased from the appropriate stock centre (see Notes 10–12). The seed obtained will ordinarily be from a segregating population, but individuals possessing homozygous mutant alleles can be identified using PCR. Protocols described in Subheading 3.1 can be used to confirm the loss-of-function mutant phenotype through RT-PCR and to assess the number of integrated elements by Southern blotting and hybridisation. Where lines with multiple insertions are found, it may be necessary to generate a segregating population and recover from the progeny an individual possessing only the desired mutation. The identified genetically uniform line can be assessed for the expression of a differential phenotype relative to wild type (see Note 13). The complementary action of close homologues may result in no observable phenotype for individual mutant lines. In such cases a loss-of-function phenotype may be generated by combining mutations in all of the homologues using a pyramiding strategy.

Fig. 3. Numerous websites are available that allow ‘in silico’ screening for lines with mutations in your gene of interest (see Subheading 2.1). (A) Output from SIGnAL T-DNA express identifying Arabidopsis mutagenised lines and cDNA clones associated with a target gene, At1g14920 (http://signal.salk.edu/). Searches can be made based on gene identifier, gene function or line identifier. A representation of the pseudo-chromosome with the location of annotated genes is presented and lines possessing mutant alleles in the target gene are highlighted. (B) The equivalent web portal for the model monocot, rice. The rice gene Os03g49990, which possesses significant homology to At1g14920, was used to complete the search.

3.2.1. PCR Identification of Homozygous Mutant Alleles

1. Extract DNA from individual lines using the protocols described in Subheadings 3.1.1 and 3.1.2.
2. Design GSPs for the target gene. For Arabidopsis, the SIGnAL website (see Subheading 2.1.5) provides a tool (ISect) for designing primers which flank insertion sites of sequence-tagged lines. Briefly, GSPs flanking the inserted element are designed to amplify a product of ~900–1000 bp. These primers are compatible with those annealing to the ends of the inserted elements and will amplify a set of diagnostic DNA products.
3. Diagnostic products are amplified by PCR in two independent reactions. In the first, only the two GSPs are included. In the second, a total of three primers are included, one insert specific (e.g. LBb1, specific to the left border of the Arabidopsis SALK T-DNA insertion lines) and two GSPs. Place 1 μl genomic DNA as template (~25 ng) into a 0.2 ml PCR tube.
4. Reaction 1: Add 5 μl 10× PCR buffer, 1 μl dNTP, 1 μl GSP1 (10 μM), 1 μl GSP2 (10 μM), 37 μl ddH2O, 3 μl 25 mM MgCl2 and 1 μl AmpliTaq Gold Polymerase.
5. Reaction 2: Add 5 μl 10× PCR buffer, 1 μl dNTP, 1 μl GSP1 (10 μM), 1 μl GSP2 (10 μM), 1 μl LBb1 (10 μM), 36 μl ddH2O, 3 μl 25 mM MgCl2 and 1 μl AmpliTaq Gold Polymerase.
6. Perform DNA amplification using the following conditions in a PCR thermocycler: 94°C for 12 min, 32 cycles of 94°C for 30 s, 50°C for 30 s and 72°C for 1 min, followed by 72°C for 10 min and a 4°C hold.
7. Resolve fragments on a 1% agarose gel, stain with ethidium bromide (0.5 μg/ml in the gel) and visualise under ultraviolet light.

Robinson and Parkin

8. The combination of amplification products observed in the reactions indicates the genotype for the individual investigated. The first set of reactions will only amplify wild-type alleles whereas the second set of reactions will amplify from the wild-type and mutant allele. Therefore, no wild-type product from reaction 1 and a single product in reaction 2 would indicate a line homozygous for the loss-of-function alleles.
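The interpretation in step 8 amounts to a small decision table, which can be sketched as follows. The function name, the boolean inputs and the rule that disagreement between the two reactions is scored as "inconclusive" are assumptions made for illustration; in practice, band sizes should be confirmed against the sizes expected from the primer design.

```python
def call_genotype(rxn1_gene_band, rxn2_gene_band, rxn2_insert_band):
    """Infer the genotype of one plant from its two diagnostic PCRs.

    rxn1_gene_band   -- GSP1 + GSP2 product present in reaction 1
                        (only the wild-type allele amplifies)
    rxn2_gene_band   -- the same GSP1 + GSP2 product present in reaction 2
    rxn2_insert_band -- insert-primer + GSP product present in reaction 2
                        (only the insertion allele amplifies)
    """
    # Assumption: the wild-type product should appear in both reactions or
    # in neither; a disagreement is treated as a failed or ambiguous assay.
    if rxn1_gene_band != rxn2_gene_band:
        return "inconclusive"
    if rxn1_gene_band and rxn2_insert_band:
        return "heterozygous"
    if rxn1_gene_band:
        return "homozygous wild type"
    if rxn2_insert_band:
        return "homozygous mutant"
    return "inconclusive"  # no product in either reaction


if __name__ == "__main__":
    # No wild-type product in reaction 1, a single insert product in reaction 2.
    print(call_genotype(False, False, True))
```

Scoring an assay with no product at all as inconclusive, rather than as mutant, guards against mistaking a failed DNA preparation for a homozygous insertion line.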

Notes

1. It is essential to avoid contamination with RNase. The use of RNase-free solutions and disposable plasticware is recommended. Do not handle reagents and tubes with ungloved hands. Solutions can be treated with 0.1% DEPC and glassware can be baked at 180°C for 8 h to inhibit RNase activity.
2. It is important to establish reproducible conditions for plant growth under control and experimental environments prior to initiating the screen.
3. A Mendelian segregation ratio of 1:2:1 is expected in the self progeny. Where the mutant phenotype is recessive, a 3:1 (wild type:mutant) ratio is expected; conversely, where the mutant allele is dominant, a 1:3 (wild type:mutant) ratio is expected. Segregation ratios should be based on the observed mutant phenotype rather than on either the presence or the expression of the selectable marker, since where there are multiple inserts in the mutant line the additional insertions will compromise the detection of co-segregation and can lead to silencing of the selectable marker.
4. The choice of enzyme is dependent upon the vector used and, to simplify interpretation, the enzyme should not cut within the region of the T-DNA which is homologous to your probe. An enzyme that does not cut within the T-DNA will result in a single hybridisation signal per insertion. Extra information may be obtained using an enzyme which cuts once in the T-DNA, since this will allow simple tandem integration events to be resolved. For multiple inserts, the analysis of the self progeny should discriminate between linked and unlinked elements. Where the elements are linked, the self progeny should present an identical Southern hybridisation pattern. However, segregation should be observed for unlinked elements.
5. Although the methods developed for identifying the DNA flanking insertion sites are sometimes successful in isolating multiple or complex insertion sites, best results are obtained using lines with single insertions.
6. The temporal and developmental expression pattern of the wild-type gene should be determined through qPCR or Northern blot to identify the optimal tissue/stage for assaying the effect of the mutant allele.
7. In activation tagged lines, the identification of the gene with enhanced expression can be complicated by the fact that the integration of the T-DNA may activate a gene that can be up to 8 kb away from the insertion site (24).
8. To confirm the function of a gene identified by an activation tagged T-DNA allele, it may be possible to reproduce the phenotype by over-expressing the gene in a wild-type background. However, due to promoter strength and ectopic expression, the use of a constitutive promoter can result in an unpredictable phenotype. This problem may be alleviated by developing a second activation allele (24). Alternatively, a line containing a null allele for the target gene that exhibits an opposing phenotype would provide corroboratory evidence of function.
9. The restriction enzymes used must yield a blunt end in this protocol. Amplification of a product is possible if there is a restriction site within ~3 kb of the inserted DNA fragment. The enzymes EcoRV, DraIII, NaeI, PvuII and StuI can be used to identify a suitable site.
10. When selecting loss-of-function alleles, preference should be given to those where the flanking DNA aligns to the genomic sequence with the highest level of significance (determined by E-value). The insertion site should preferably be located in an annotated exon and ideally at the 5′ end of the gene.
11. To identify TILLING alleles, it is necessary to supply the service provider with the sequence of the gene of interest, indicating the intron/exon structure, to allow the design of appropriate primers which maximise the likelihood of identifying deleterious point mutations.
12. The allelic sequence and genotype of available TILLING mutants will be provided. Each line should be backcrossed to wild type for three generations to remove background mutations and to confirm the association of any observed phenotype with the target gene.
13. In Arabidopsis and other plants a wide range of different ecotypes (cultivars) have been utilised to generate the available genomics resources. For physiological assays it is imperative that the correct wild-type ecotype is compared to the identified mutants.
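The segregation ratios described in Note 3 are normally checked with a chi-square goodness-of-fit test. The sketch below is one way to do this with the Python standard library only. The function names and the example counts are hypothetical, and the critical values are the standard 5% thresholds for one and two degrees of freedom.

```python
# 5% critical values of the chi-square distribution, keyed by degrees of freedom.
CHI2_CRITICAL_5PCT = {1: 3.841, 2: 5.991}


def chi_square(observed, expected_ratio):
    """Chi-square statistic for observed class counts against an expected ratio."""
    if len(observed) != len(expected_ratio):
        raise ValueError("observed and expected_ratio must have the same length")
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    statistic = 0.0
    for count, part in zip(observed, expected_ratio):
        expected = total * part / ratio_sum
        statistic += (count - expected) ** 2 / expected
    return statistic


def fits_ratio(observed, expected_ratio):
    """True if the counts are consistent with the ratio at the 5% level."""
    degrees_of_freedom = len(observed) - 1
    return chi_square(observed, expected_ratio) < CHI2_CRITICAL_5PCT[degrees_of_freedom]


if __name__ == "__main__":
    # Hypothetical counts: 290 wild-type and 110 mutant plants among 400 progeny.
    counts = (290, 110)
    print(round(chi_square(counts, (3, 1)), 3), fits_ratio(counts, (3, 1)))
```

A population that departs significantly from 3:1 (or 1:3 for a dominant allele) may carry more than one insertion, in which case the Southern analysis described in Note 4 becomes particularly informative.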

References 1. Arabidopsis Genome Initiative. (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815. 2. Yu, J., Hu, S., Wang, J., Wong, G.K., Li, S., Liu, B., Deng, Y., Dai, L., Zhou, Y., Zhang, X., et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296, 79–92. 3. Goff, S.A., Ricke, D., Lan, T.H., Presting, G., Wang, R., Dunn, M., Glazebrook, J., Sessions, A., Oeller, P., Varma, H., et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296, 92–100. 4. http://www.maizegenome.org/index.html 5. http://www.genome.ou.edu/medicago.html 6. http://www.kazusa.or.jp/lotus/ 7. http://www.sgn.cornell.edu/help/about/ tomato_sequencing.pl 8. http://www.brassica.info/b_rapa_sequencing_project/bac_sequencing.htm 9. Jackson S.A., Rokhsar, D., Stacey, G., Shoemaker, R.C., Schmutz, J., and Grimwood, J. (2006) Toward a reference sequence of the soybean genome: a multiagency effort. Crop Sci. 46, 55–61. 10. http://www.ncbi.nlm.nih.gov/dbEST/ 11. Paterson, A.H. (2006) Leafing through the genomes of our major crop plants: strategies for capturing unique information. Nat. Rev. Genet. 7, 174–184. 12. Rédei, G.P. and Koncz, C. (1992) Classical Mutagenesis, in Methods in Arabidopsis Research (Koncz, C., Chua, N.-H., and Schell, J. eds.), World Scientific Publishing Company, Singapore, pp. 16–82. 13. Alonso, J.M., Stepanova, A.N., Leisse, T.J., Kim, C.J., Chen, H., Shinn, P., Stevenson, D.K., Zimmerman, J., Barajas, P., Cheuk, R., et al. (2003) Genome-wide insertional mutagenesis of Arabidopsis thaliana. Science 301, 653–657. 14. Sallaud, C., Gay, C., Larmande, P., Bes, M., Piffanelli, P., Piegu, B., Droc, G., Regad, F., Bourgeois, E., Meynard, D., et al. (2004) High throughput T-DNA insertion mutagenesis in rice: a first step towards in silico reverse genetics. Plant J. 39, 450–464. 15. 
Tissier, A.F., Marillonnet, S., Klimyuk, V., Patel, K., Torres, M.A., Murphy, G., and Jones, J.D. (1999) Multiple independent defective suppressor-mutator transposon insertions in Arabidopsis: a tool for functional genomics. Plant Cell 11, 1841–1852.

16. Zhao, T., Palotta, M., Langridge, P., Prasad, M., Graner, A., Schulze-Lefert, P., and Koprek, T. (2006) Mapped Ds/T-DNA launch pads for functional genomics in barley. Plant J. 47, 811–826. 17. Hilson, P., Allemeersch, J., Altmann, T., Aubourg, S., Avon, A., Beynon, J., Bhalerao, R.P., Bitton, F., Caboche, M., Cannoot, B., et al. (2004) Versatile gene-specific sequence tags for Arabidopsis functional genomics: transcript profiling and reverse genetics applications. Genome Res. 14, 2176–2189. 18. Feng, C.-P. and Mundy, J. (2006) Gene discovery and functional analyses in the model plant Arabidopsis. J. Integr. Plant Biol. 48, 5–14. 19. Martienssen, R.A. (1998) Functional genomics: probing plant gene function and expression with transposons. Proc. Natl. Acad. Sci. USA 95, 2021–2026. 20. Petersen, M., Brodersen, P., Naested, H., Andreasson, E., Lindhart, U., Johansen, B., Nielsen, H.B., Lacy, M., Austin, M.J., Parker, J.E., et al. (2000) Arabidopsis map kinase 4 negatively regulates systemic acquired resistance. Cell 103, 1111–1120. 21. Meinke, D.W., Meinke, L.K., Showalter, T.C., Schissel, A.M., Mueller, L.A., and Tzafrir, I. (2003) A sequence-based map of Arabidopsis genes with mutant phenotypes. Plant Physiol. 131, 409–418. 22. Springer, P.S. (2000) Gene traps: tools for plant development and genomics. Plant Cell 12, 1007–1020. 23. Rojas-Pierce, M. and Springer, P.S. (2003) Gene and enhancer traps for gene discovery. Methods Mol. Biol. 236, 221–240. 24. Weigel, D., Ahn, J.H., Blazquez, M.A., Borevitz, J.O., Christensen, S.K., Fankhauser, C., Ferrandiz, C., Kardailsky, I., Malancharuvil, E.J., Neff, M.M., et al. (2000) Activation tagging in Arabidopsis. Plant Physiol. 122, 1003–1013. 25. Li, X. and Zhang, Y. (2002) Reverse genetics by fast neutron mutagenesis in higher plants. Funct. Integr. Genomics 2, 254–258. 26. Perry, J.A., Wang, T.L., Welham, T.J., Gardner, S., Pike, J.M., Yoshida, S., and Parniske, M. 
(2003) A TILLING reverse genetics tool and a web-accessible collection of mutants of the legume Lotus japonicus. Plant Physiol. 131, 866–871. 27. Caldwell, D.G., McCallum, N., Shaw, P., Muehlbauer, G.J., Marshall, D.F., and Waugh, R. (2004) A structured mutant population for forward and reverse genetics in Barley (Hordeum vulgare L.). Plant J. 40, 143–150. 28. Kim, Y., Schumaker, K.S., and Zhu, J.K. (2006) EMS mutagenesis of Arabidopsis. Methods Mol. Biol. 323, 101–103. 29. Greene, E.A., Codomo, C.A., Taylor, N.E., Henikoff, J.G., Till, B.J., Reynolds, S.H., Enns, L.C., Burtner, C., Johnson, J.E., Odden, A.R., Comai, L., and Henikoff, S. (2003) Spectrum of chemically induced mutations from a large-scale reverse-genetic screen in Arabidopsis. Genetics 164, 731–740. 30. Page, D.R. and Grossniklaus, U. (2002) The art and design of genetic screens: Arabidopsis thaliana. Nat. Rev. Genet. 3, 124–136. 31. McCallum, C.M., Comai, L., Greene, E.A., and Henikoff, S. (2000) Targeting induced local lesions IN genomes (TILLING) for plant functional genomics. Plant Physiol. 123, 439–442. 32. Watson, J.M., Fusaro, A.F., Wang, M., and Waterhouse, P.M. (2005) RNA silencing platforms in plants. FEBS Lett. 579, 5982–5987. 33. Mansoor, S., Amin, I., Hussain, M., Zafar, Y., and Briddon, R.W. (2006) Engineering novel traits in plants through RNA interference. Trends Plant Sci. 11, 559–565. 34. Krysan, P.J., Young, J.C., and Sussman, M.R. (1999) T-DNA as an insertional mutagen in Arabidopsis. Plant Cell 11, 2283–2290. 35. Galbiati, M., Moreno, M.A., Nadzan, G., Zourelidou, M., and Dellaporta, S.L. (2000) Large-scale T-DNA mutagenesis in Arabidopsis for functional genomic analysis. Funct. Integr. Genomics 1, 25–34. 36. Sessions, A., Burke, E., Presting, G., Aux, G., McElver, J., Patton, D., Dietrich, B., Ho, P., Bacwaden, J., Ko, C., et al. (2002) A high-throughput Arabidopsis reverse genetics system. Plant Cell 14, 2985–2994. 37. Shure, M., Wessler, S., and Fedoroff, N. (1983) Molecular identification and isolation of the waxy locus in maize. Cell 35, 225–233. 38. Sharpe, A.G., Parkin, I.A.P., Keith, D.J., and Lydiate, D.J. (1995) Frequent nonreciprocal translocations in the amphidiploid genome of oilseed rape (Brassica napus). Genome 38, 1112–1121. 40. Verwoerd, T.C., Dekker, B.M., and Hoekema, A. (1989) A small-scale procedure for the rapid isolation of plant RNAs. Nucleic Acids Res. 17, 2362.

Chapter 10

Heterologous and Cell-Free Protein Expression Systems

Naser Farrokhi, Maria Hrmova, Rachel A. Burton, and Geoffrey B. Fincher

Summary

In recognition of the fact that a relatively small percentage of ‘named’ genes in databases have any experimental proof for their annotation, attention is shifting towards the more accurate assignment of functions to individual genes in a genome. The central objective will be to reduce our reliance on nucleotide or amino acid sequence similarities as a means to define the functions of genes and to annotate genome sequences. There are many unsolved technical difficulties associated with the purification of specific proteins from extracts of biological material, especially where the protein is present in low abundance, has multiple isoforms or is found in multiple post-translationally modified forms. The relative ease with which cDNAs can be cloned has led to the development of methods through which cDNAs from essentially any source can be expressed in a limited range of suitable host organisms, so that sufficient levels of the encoded proteins can be generated for functional analysis. Recently, these heterologous expression systems have been supplemented by more robust prokaryotic and eukaryotic cell-free protein synthesis systems. In this chapter, common host systems for heterologous expression are reviewed and the current status of cell-free expression systems is presented. New approaches to overcoming the special problems encountered during the expression of membrane-associated proteins are also addressed. Methodological considerations, including the characteristics of codon usage in the expressed DNA, peptide tags that facilitate subsequent purification of the expressed proteins and the role of post-translational modifications, are examined.

Key words: Annotation, Functional analysis, Host systems, In vitro expression, Membrane proteins.

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513 © Humana Press, a part of Springer Science + Business Media, LLC 2009 DOI: 10.1007/978-1-59745-427-8_10

1. Introduction

In the last decade, high-throughput sequencing of several plant genomes has allowed the identification of many genes, but in most cases the assignment of a gene’s function has not been
supported by the rigorous functional analysis of the encoded protein. Instead, identifications have been largely based on similarities between the nucleotide sequences of the genes of interest and related gene sequences that have been deposited in various electronic databases. Thus, a very small percentage of ‘named’ genes in databases have any experimental proof for their annotation. The homology modelling programs used in their functional annotation are not always robust, and the use of short overlapping sequences means that many identities are simply wrong (1). For example, many enzymes annotated as β-D-glucosidases in the large family GH1 group of glycoside hydrolases show a marked preference for β-D-mannosides and should therefore be listed as β-D-mannosidases (2). As our attention shifts from gene sequencing towards the accurate assignment of functions to all individual genes in a genome, we will increasingly reduce our reliance on sequence similarities and seek more reliable methods for the analysis of protein function. Advanced bioinformatics approaches are currently under development; these include three-dimensional (3D) structural alignments to identify distant relationships occurring during evolution, domain analyses, hidden Markov modelling to improve alignments, and automated predictions of catalytic and binding sites and of regions responsible for changes in specificity and thus function. These ‘structural phylogenomics’ techniques will clearly lead to significant improvements in gene identification (1) but will be greatly assisted by the direct analysis of protein function.
Given the long-standing technical difficulties associated with the purification of specific proteins from extracts of biological material, especially where the protein is present in low abundance, has multiple isoforms or post-translationally modified species, and the relative ease with which cDNAs can be cloned, it is hardly surprising that many methods have been devised to express cDNAs from essentially any source in a limited range of suitable host organisms so that sufficient levels of the encoded proteins can be generated for functional analysis. More recently, these so-called heterologous expression systems have been supplemented by relatively efficient prokaryotic and eukaryotic cell-free protein synthesis systems, again to provide sufficient protein to directly measure protein activity and function (3, 4). Both heterologous expression and cell-free biosynthesis of proteins encoded by specific fragments of DNA will be considered in this chapter. The heterologous expression of recombinant DNA enables rapid production of selected proteins and thus plays an important role in modern functional genomics (5, 6). However, the technique is not always straightforward and difficulties cannot necessarily be predicted. The codon usage of the DNA to be expressed must be compatible with the availability of the corresponding

tRNA population in the host cell. Furthermore, the particular system can be successful only if the expressed protein does not have seriously detrimental effects on the physiology of the host (7), and if the host system is able to provide the conditions required for the production of a correctly folded and functionally active protein. Perhaps, the key requirements are to find a system and to develop conditions that allow correct folding of the expressed proteins, which in some cases may demand the formation of disulphide bonds or specific post-translational modifications (8). It should be noted that a high level of transcription of the DNA fragment in the heterologous host cell does not always lead to the production of high-quality, full-length active protein (9). Several biological systems have been developed as hosts for heterologous expression, including bacterial, yeast, fungal, viral, plant, mammalian and insect cell systems, alongside the use of transgenic plants and animals (5) and cell-free protein synthesis systems (3, 4). Each of these systems has advantages and pitfalls that will affect protein yield, proper folding of the expressed protein, correct post-translational modification, cost, speed and ease of use (10). In general, biological and biochemical properties of the protein of interest will dictate the type of expression system that can be used successfully (11), although at this stage no clear ‘rules’ or principles have emerged that enable us to predict with any confidence whether a particular combination of DNA and expression system will lead to the production of active protein. In some cases, different isoforms or orthologues of a single enzyme will be preferentially expressed in different expression systems (12). 
Thus, the successful expression of a DNA fragment requires a degree of trial and error with respect to the selection and construction of expression vectors, the selection of the expression host or cell-free system to be used and the optimisation of experimental conditions that generate maximum yields of the expressed protein. Most successes to date have been obtained with soluble proteins where correct folding can be readily assessed; for example, through the measurement of enzymic activity. Soluble enzymes are likely to remain the most amenable proteins for heterologous expression in the immediate future. If the recombinant enzyme is properly folded, has undergone the appropriate post-translational modifications and is active, it can be analysed for kinetic and other properties. If the expression system is efficient or can be readily scaled up, it might be possible to synthesise sufficient pure protein for the generation of 3D crystals for structural analysis by X-ray protein crystallography. In the future, defining the functions of individual components of protein complexes and the nature of protein–protein interactions in the cell will attract increasing attention. Similarly, membrane-bound proteins, which

represent ~30% of all proteins in a typical cell, are likely to be particularly important as we dissect signal transduction, molecular transport and other complex biological processes. In the case of membrane-bound enzymes, considerable evidence exists to suggest that these proteins more precisely exist as protein–lipid complexes, and thus are dependent on associated lipids for correct folding, structural integrity and function (13). Thus, their functional analysis through heterologous expression will critically depend on the ability to produce correctly folded proteins in lipid phases or liposomes (13), such as ergosterols, phosphatidyl choline and phosphatidyl ethanolamine (14, 15). Finally, it is worth noting that heterologous expression of functional proteins from DNA fragments has had and will continue to have important biotechnological applications in agriculture and medicine. The generation of many valuable pharmacological compounds, including antigens, antibodies, anti-allergens and vaccines, will rely on the availability of robust heterologous expression systems. In the following subheadings, the most common host systems for heterologous expression are reviewed. In addition, the current status of cell-free expression systems will be presented, together with recent developments in the expression of membrane-associated proteins. Methodological considerations, including the characteristics of codon usage in the expressed DNA, peptide tags that facilitate subsequent purification of the expressed proteins and the role of post-translational modifications, will also be examined.

2. Materials and Methods

2.1. General Methodological Considerations

2.1.1. Codon Usage

The compatibility of codon usage between the expressed DNA and the host’s protein synthesising system is an important factor that can have profound impacts on expression levels of eukaryotic proteins in prokaryotic systems (8). Every host species has a specific subset of tRNA molecules, which are essential carriers of aminoacyl residues, and for which the anti-codon sequence must match the codon being translated (8, 16). The members of each tRNA subset can vary between organisms. Thus, Escherichia coli may lack tRNAs that recognise the codons AUA (isoleucine) and AGA or AGG (arginine). If these codons are present in a heterologous cDNA sequence, the protein machinery will cease operating during the translation of the corresponding recombinant protein. Such a pause in translation may lead to replacement of the amino acid with an incorrect amino acid (16). One possible solution to the problem of codon usage compatibility in bacterial systems is to expand the tRNA pool of the host via over-expression of genes encoding the rare tRNAs from an introduced plasmid (8). For example, the RIG plasmid carries genes that encode tRNAs for Arg, Ile and Gly, which are deficient in E. coli (17) (see Fig. 1).

Fig. 1. The RIG plasmid map with unique restriction sites. RIG is derived from pACYC184 and carries argU, ileX and glyT genes, which express tRNAs specific for AGG/AGA (arginine), ATA (isoleucine) and GGA (glycine) codons, respectively (17). Chlor, chloramphenicol resistance.

2.1.2. Purification Tags
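As a brief illustration of the codon-compatibility check described in Subheading 2.1.1, a coding sequence can be scanned for the codons whose tRNAs are supplied by the RIG plasmid (AGA/AGG, ATA and GGA) before an E. coli expression experiment is planned. This is a minimal sketch: the function name and the example sequence are hypothetical, and the codon list is limited to those named in the text and in the legend to Fig. 1.

```python
# Codons named in Subheading 2.1.1 and the legend to Fig. 1 as poorly served in E. coli.
RARE_CODONS = {"AGA": "Arg", "AGG": "Arg", "ATA": "Ile", "GGA": "Gly"}


def find_rare_codons(cds):
    """Return (codon position, codon, amino acid) for each rare codon in frame.

    cds is a coding sequence starting at the first base of the start codon;
    RNA input (U instead of T) is accepted. Positions are 1-based codon numbers.
    """
    seq = cds.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of three")
    hits = []
    for start in range(0, len(seq), 3):
        codon = seq[start:start + 3]
        if codon in RARE_CODONS:
            hits.append((start // 3 + 1, codon, RARE_CODONS[codon]))
    return hits


if __name__ == "__main__":
    # Hypothetical five-codon open reading frame used only as an example.
    for position, codon, residue in find_rare_codons("ATGAGAATAGGATAA"):
        print(f"codon {position}: {codon} ({residue})")
```

A sequence with many hits, particularly clustered ones, would be a candidate either for co-expression of the rare tRNAs or for expression in a different host.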

For the functional or structural analysis of the expressed protein, it will normally be desirable to produce the protein in purified form, or at least in partially purified form. Thus, it is usually necessary to purify the protein of interest from a pool of other proteins, which are expressed naturally in the host system. Simple, single-step purification protocols are preferred and various procedures based on affinity purification have been developed. The section of the expressed protein that enables the affinity purification is referred to as an affinity tag and can be predetermined through the insertion of an appropriate coding sequence in the DNA to be expressed. This results in the addition of a short peptide fragment to the expressed protein, usually at its NH2- or COOH-terminus. In Table 1, several commonly used affinity tags are listed, together with the advantages and disadvantages of each. Although many different proteins, domains or peptides are used as fusion
tags, it is difficult to choose the best purification system for an individual protein and the optimal choice is almost always based on empirical observations. A suitable affinity tag should allow a rapid one-step affinity purification; it must exert minimal effects on protein folding, tertiary structure and biological activity; it must not be shielded by neighbouring domains; and it must allow for its straightforward and specific removal after purification of the expressed protein (25, 26). It must also be noted that the solubility of the expressed protein can be influenced by the location and nature of its purification tag, and this in turn will depend on the 3D structure of the protein. Thus, the choice of an affinity tag and its position at the NH2- or COOH-terminus (or both) of the protein could be important for the subsequent folding and activity of the protein. Following expression, the protein can be purified from the large number of endogenous proteins that are associated with the host cells, usually via the affinity of the tag towards immobilised compounds such as metal resins. Furthermore, some tags such as maltose-binding protein (MBP) and glutathione-S-transferase (GST) are highly soluble, which may promote and enhance protein folding of the target protein and increase the solubility of the expressed protein (27, 28). Although affinity tags have facilitated the purification of some proteins from heterologous expression systems, they sometimes interfere with folding, function and crystallisation of the target protein (10, 26). This might make it necessary to remove the tag following affinity purification.

Table 1
Purification affinity tags for recombinant proteins

Tag | Size (kDa) | Purification matrix | Advantage | Disadvantage | Reference
His6 | 1 | IMAC | Small size; detectable via immunoassay; tight binding capability; functional under denaturing conditions; uncharged at physiological pH | Poor performance for eukaryotic and HMW proteins | (18)
GB1 | 6 | IMAC followed by IgG affinity chromatography | Small size; stable fold | Weak performance in proteins with HMW | (19)
TrxA (not an affinity tag) | 11 | Osmotic shock or freeze/thaw treatment | Increases the solubility of eukaryotic proteins | Not an affinity tag; requires a second tag for purification | (20)
GST | 26 | Immobilised glutathione | High level of expression; single-step purification; increase in solubility; detectable via either an enzyme assay or immunoassay | - | (21)
MBP | 42 | Cross-linked amylose, followed by IEC | Detectable via immunoassay; increase in solubility; high level of expression | - | (22)
NusA | 54 | IMAC followed by IEC | Increase in solubility | Not an affinity tag; requires a second tag for purification | (23)
CBD-intein | 55 | CBD | Self-cleavable tag | - | (24)

The table provides information regarding the purification tags that are often used to purify recombinant proteins and that sometimes increase the solubility of the heterologously expressed proteins (10, 25). The molecular weight of each tag and its advantages and disadvantages are described in the table. His6, poly-histidine tag; GB1, B1 immunoglobulin-binding domain of streptococcal protein G; TrxA, thioredoxin; GST, glutathione-S-transferase; MBP, maltose-binding protein; CBD-intein, chitin-binding domain associated with self-splicing inteins; IMAC, immobilised metal affinity chromatography; HMW, high molecular weight; IEC, ion exchange chromatography.
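Because tag choice is largely empirical, it can help to keep the basic properties from Table 1 in a small lookup structure when planning a panel of constructs. The sketch below records only the tag sizes, the purification matrices and whether the table describes the tag as requiring a second tag. The data structure, function names and the 50 kDa example protein are illustrative assumptions.

```python
# name: (approximate size in kDa, purification matrix, usable as an affinity tag)
# Values are taken from Table 1; the layout itself is only an illustration.
TAGS = {
    "His6": (1, "IMAC", True),
    "GB1": (6, "IMAC followed by IgG affinity chromatography", True),
    "TrxA": (11, "osmotic shock or freeze/thaw treatment", False),
    "GST": (26, "immobilised glutathione", True),
    "MBP": (42, "cross-linked amylose, followed by IEC", True),
    "NusA": (54, "IMAC followed by IEC", False),
    "CBD-intein": (55, "CBD", True),
}


def fusion_size_kda(target_kda, tag):
    """Approximate size of the tagged fusion protein in kDa."""
    return target_kda + TAGS[tag][0]


def needs_second_tag(tag):
    """True for solubility partners that Table 1 lists as not being affinity tags."""
    return not TAGS[tag][2]


if __name__ == "__main__":
    target = 50  # hypothetical 50 kDa protein of interest
    for name in TAGS:
        note = " (add a second tag for purification)" if needs_second_tag(name) else ""
        print(f"{name}: fusion of about {fusion_size_kda(target, name)} kDa{note}")
```

The expected fusion size is useful when checking expression on a gel, since a large partner such as MBP or NusA can roughly double the apparent size of a small target protein.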
Again, the DNA to be expressed can be engineered to include sequences that encode peptide cleavage sites such as those for thrombin, enterokinase, factor Xa protease, TEV (tobacco etch virus) protease, caspase and human rhinovirus 3C protease, which can be inserted between the affinity tag and the target protein as a linker peptide, to allow removal of the tag following expression. Another approach for removal of the tag is to use self-cleavable tags such as intein, which is more specific and does not require additional chromatographic steps (24). Although affinity tagging is often successful, the presence of the tag does not guarantee purification of the protein of interest. For instance, if a His6 tag is buried within or shielded by the expressed protein, subsequent protein purification via nickel-nitrilotriacetic acid (Ni-NTA) affinity chromatography can be compromised (25). Whether to position the purification tag at the COOH- or the NH2-terminus is another important consideration for the expression of soluble proteins in active forms. Molecular modelling of the target protein can be a very useful tool for prediction of a protein fold and can guide the choice of the best position for purification tags. Pagny et al. (29) demonstrated that in a (1→2)-β-xylosyltransferase from Arabidopsis, anything that interferes with the COOH-terminus of the enzyme, including the His6-tag, prevented the expression of active enzyme. Similar results were obtained by Braun and LaBaer (10).

2.2. Host Systems for Heterologous Expression

2.2.1. Escherichia coli

The expression of heterologous genes in bacterial host systems, in particular in E. coli, is the quickest, simplest and cheapest available technique for synthesising proteins. Heterologous cDNAs are cloned into low copy number plasmids, from which the synthesis of a protein chain can occur approximately once every 35 s, due to the virtually simultaneous transcription and translation of genes in E. coli (30). The correct folding and post-translational modifications of proteins remain great challenges for the synthesis of active proteins in this host system (31). Despite the ease of use and limited requirements of this system, the production of eukaryotic functional proteins in E. coli has often been difficult, since many are either misfolded, unfolded or partially folded (32). Generally, proteins containing fewer than 100 amino acid residues more readily fold into their native conformation, as a result of their fast folding kinetics. In contrast, large protein molecules often aggregate to produce inactive protein masses, known as inclusion bodies, or are targeted for degradation (31). A number of techniques for manipulating the cellular folding apparatus have been developed to avoid the formation of inclusion bodies (Table 2). In the first set of methodologies, amino acid substitutions and the use of accessory proteins such as chaperones may increase the chance of correct folding. In some cases, alterations in growth or expression conditions, such as lowering the incubation temperature or changing the viscosity of the media, might lead to slower folding kinetics and provide enough time for each amino acid to find its correct configuration within the polypeptide. This may result in the formation of properly folded proteins. However, the systematic manipulation of experimental conditions can be an extremely time-consuming exercise and it may be quicker in the end to try another expression system, should the E. coli system be unsuccessful after a few attempts.
Moreover, while the techniques described above can be used to enhance folding during heterologous expression in bacteria, they do not necessarily prevent misfolding or partial folding that is caused by inappropriate or insufficient post-translational modification. For example, intra- or intermolecular disulfide bond formation, glycosylation or other modifications might be essential for protein activity in plants or animals (38). If this is the case, it might never be possible to express an active protein in bacterial systems, because bacteria have a limited capacity for the post-translational modification of proteins.

2.2.2. Yeast

It is claimed that yeast expression systems can be easy to use, although protein expression might be slower than in E. coli (39). A key attraction for using these systems is that yeasts, as eukaryotes,


Table 2. Common techniques that can be used in heterologous expression systems to obtain properly folded proteins

Techniques | Methodology | Examples | Problems
Manipulating the cellular folding apparatus | Amino acid substitution to suppress aggregation | Enhancing electrostatic interactions, hydrophobic interactions and chain configuration | Which amino acid to substitute? What is the effect of mutation on the function and stability of the protein?
Manipulating the cellular folding apparatus | Co-overexpression of chaperones | Heat shock protein (Hsp70) family: dnaK in bacteria | Finding the correct substrate chaperone is a major task
Manipulating the cellular folding apparatus | Co-overexpression of chaperones | Chaperonins: GroEL in bacteria | Sometimes has a detrimental effect through cell septation and filamentation, depending on the level of expression of the chaperones (33)
Manipulating the cellular folding apparatus | Co-expression of foldases | PDI, PPIase | Requires optimisation
Manipulating the expression conditions | Choice of expression vector | Strong/weak and inducible/constitutive promoter | -
Manipulating the expression conditions | Reduction in protein synthesis rate | Lower induction temperature; lower concentration of inducer; additives | -
Manipulating the expression conditions | Tagging the protein with other highly soluble proteins | See Table 1 | -

These techniques can be divided into two sub-categories (10, 32, 34, 35). Chaperones and foldases are the two classes of accessory proteins that can prevent misfolding of expressed recombinant proteins by assisting the folding and maturation of proteins. Chaperones improve the tertiary structure of proteins through coordinated binding and releasing of partially folded polypeptides, but they do not increase the rate of protein folding (36, 37). Protein disulfide-isomerase (PDI), an ER-localised enzyme, restores the proper conformation of the expressed proteins by establishing disulfide bonds and also accelerates protein folding in cysteine-rich polypeptides by repeatedly breaking and forming new disulfide bonds (36, 37). Peptidyl prolyl isomerases (PPIases) are another class of enzymes, which speed up attainment of the final folded conformation of proteins via the cis–trans isomerisation of X-proline bonds (37). A minority of peptide bonds containing a proline residue acquire a cis-conformation. Thus, PPIases facilitate the formation of the final configuration (37).


have the basic cellular machinery required for many of the post-translational modifications that are needed for the expression of active proteins of eukaryotic origin. Two species of yeast, namely Saccharomyces cerevisiae and Pichia pastoris, have been extensively used for heterologous expression of prokaryotic and eukaryotic genes. Again, it is not yet easy to identify 'rules' in relation to the species or conditions required for the successful expression of active proteins in these hosts, although expression of enzymes such as glycosyl transferases has been shown to be more successful in P. pastoris (40). The S. cerevisiae system often results in expressed proteins that are hyperglycosylated with mannosyl residues (50–100) and this is likely to impair the folding and/or biological activity of the recombinant protein (40). In contrast, P. pastoris incorporates fewer mannosyl residues (8–14) on potential N-glycosylation sites of the expressed protein (40, 41). Besides glycosylation, yeasts perform other eukaryotic post-translational modifications, such as the removal of NH2-terminal methionine residues, the addition of acetyl groups to the NH2-terminus, the formation of COOH-terminal methylation complexes, myristoylation and farnesylation (41). These modifications, especially the last three, are important in intracellular membrane targeting of expressed proteins, which suggests that yeast would be a relatively good system for the expression of membrane proteins (39, 41). However, yeast does not always share the same protein trafficking and post-translational modification steps as higher eukaryotes, which may limit its usefulness as a heterologous expression system for some plant and mammalian proteins (42). Gustafsson et al. (8) showed that codon usage in yeast is close to that in plants and this increases the attraction of this system for the expression of plant proteins that are not overly sensitive to hyperglycosylation and which are not membrane associated.
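Whether a plant protein is likely to be sensitive to yeast hyperglycosylation depends in part on how many potential N-glycosylation sites it carries. A minimal sketch for counting them is given below. It assumes the widely used consensus sequon N-X-S/T, where X is any residue except proline. The function name and the test sequence are hypothetical, and a match marks only a potential site, not one that is necessarily glycosylated in a given host.

```python
import re

# Assumption: the standard N-glycosylation sequon N-X-S/T with X != P.
# A lookahead is used so that overlapping sequons (e.g. "NNTS") are all reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def potential_n_glycosylation_sites(protein):
    """Return the 1-based positions of asparagine residues that sit in an
    N-X-S/T sequon in a one-letter amino acid sequence."""
    return [match.start() + 1 for match in SEQUON.finditer(protein.upper())]


if __name__ == "__main__":
    example = "MNGSANPTQNKTW"   # hypothetical sequence
    print(potential_n_glycosylation_sites(example))
```

In the example sequence the asparagine followed by proline is skipped, which is the usual reason for excluding proline at the X position of the sequon.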
The other limiting factor for the expression of membrane-associated proteins in yeast is that yeast lacks the plant-specific sterols (sito-, stigma- and campesterols) required for correct folding and activity of the translated protein. Although yeast contains ergosterol, it often does not produce fully functional membrane-bound proteins (13, 43). Apart from the intracellular steps involved in expressing and processing proteins, it has been shown that expression in yeast requires optimisation of a number of external factors. The position of the affinity tag, temperature, cell density, medium formulation, variation between clones, the length of the expression period and the type of promoter (constitutive/inducible) need to be tested (38). In P. pastoris, a comparison of eight expressed glycosyl transferases derived from different organisms, all of which mediated N- and O-linked oligosaccharide biosynthesis, showed that reduction in the expression temperature from 30°C to 16°C


decreased yeast growth but increased the overall yield of active protein (38). Before the expression of these enzymes in P. pastoris, Bencurova et al. (38) constructed a number of plasmid vectors and tested the efficiency of each one. The enzyme is normally membrane-associated through a single transmembrane helix (TMH). The full-length cDNA containing the TMH encoding region and a partial-length cDNA lacking the TMH region, under the control of either a constitutive or inducible promoter, were evaluated. They also studied the effects of affinity tag location in the expectation that the best position may depend on the location of the catalytic domain in relation to the tag (38). Overall, the use of P. pastoris as a heterologous expression system has met with mixed success. In many cases, it can be demonstrated that the transgene is present in the cells, but expression of an active protein does not occur. Recently, it has been suggested that the chances of successfully selecting high protein expressors can be greatly increased by screening 20–100 transgenic P. pastoris colonies for the presence of the protein of interest (T. Teeri, personal communication). Thus, only a small proportion of transformed lines might actually produce protein from the transgene, for reasons that are not yet clear. If only 5–10% of transgenic lines produce the protein, these can be easily missed if it is assumed that all lines carrying the transgene will express active protein. Furthermore, once a productive line has been identified, it can be used indefinitely for expression of the protein of interest.

2.2.3. Insect Cells

Insect cells represent another host system for recombinant protein production. They are capable of performing the co- and post-translational protein processing seen in other eukaryotes, which may be important in producing properly folded proteins (11, 44). To introduce the recombinant plasmid containing the gene of interest into the insect cells, two methods of transfection have been developed. One is plasmid uptake via standard methods such as calcium phosphate co-precipitation. The second method is the introduction of the plasmid into an insect virus so that the recombinant plasmid is subsequently transferred via infection of insect cells; this method involves more experimental steps (44). The insect cells that are commonly used for foreign gene expression are derived from Spodoptera frugiperda (Sf9), usually used with a baculovirus vector, or Drosophila Schneider cells (S2). The insect cells may be transiently transformed in preliminary experiments and, if expression of active protein is successful, stable transformation of the insect cells can subsequently be performed. Stably transformed cells can result in the production of milligram quantities of recombinant protein (44). The difference between the two transformation techniques is that for stable transformation


another plasmid carrying a selectable marker is used to transfect the insect cells, in addition to the plasmid bearing the gene of interest. Pagny et al. (29) used a baculovirus vector system to express Arabidopsis (1→2)-β-xylosyltransferase in Sf9 cells, not only to demonstrate its glycosylation function but also to identify important amino acid residues or domains for GT activity. In another study, Liepman et al. (45) used Drosophila S2 cells to define the enzymic functions of Csl genes from Arabidopsis and rice. The results showed that members of the CslA gene family encode β-D-mannan synthases; the expressed enzyme showed biochemical activity in the presence of GDP-mannose.

2.2.4. Mammalian Cells

In comparison with other expression systems, working with mammalian cells appears to require sophisticated facilities and expensive reagents (10). Many factors, at both transcription and translation levels, control the efficient heterologous expression of DNAs in mammalian cells, including RNA processing, gene copy number, mRNA stability, position effects at the site of chromosomal integration and genetic properties of the cell type (46). Mammalian cells are able to perform correct protein folding, together with complex and authentic post-translational modifications (47). Mammalian cells, as with yeast expression systems, can hyperglycosylate the expressed protein (48) and application of tunicamycin, a sugar analogue of UDP-GlcNAc, can be used to reduce the extent of hyperglycosylation (49). Two common strategies, using either viral- or plasmid-based vectors, are generally used to introduce foreign DNA into mammalian cells (46, 50). Viral expression vectors can transfer the gene of interest to the mammalian cells through infecting the cells, while the other types of vectors can be delivered to the cells via both chemical and physical methods, as reviewed by Colosimo et al. (50). Moreover, the vectors can be used to carry and express the gene of interest either transiently or stably. Transient gene expression facilitates the rapid production, within 2–3 days, of small quantities of proteins for evaluating the system. Baby hamster kidney (BHK) cells, African green monkey kidney CV1 (also known as COS) cells and human embryonic kidney (HEK)-293 cells are amongst the cell lines that are used for transient gene expression (46). In contrast, stable gene expression is used for expression of genes of interest in Chinese hamster ovary (CHO), Madin-Darby canine kidney (MDCK) or myeloma cells. In addition to the viral vectors, which integrate foreign DNA into the genome, episomal elements have also been used for gene transfer (46, 50).
The episomal vectors, such as the Epstein-Barr virus (EBV), do not integrate DNA into the genome, so expression of the gene of interest is not subject to positional effects (50).


Mammalian cells have been used for the expression of a number of plant proteins. One of the earliest examples was the identification of cDNAs encoding plasma membrane water channels through the expression of an Arabidopsis root cDNA library in COS cells (51), while Perrin et al. (52) successfully expressed a full-length Arabidopsis fucosyltransferase (AtFT1) gene in COS cell lines.

2.2.5. Expression in Higher Plants

Plants represent a simple, versatile and efficient alternative for the production of recombinant proteins. Furthermore, plant cells are more likely to provide all the conditions required for a plant protein to fold into its active form and to undergo appropriate post-translational modifications. In planta expression of foreign proteins is not only useful for plant proteins but can also be used for other eukaryotic and prokaryotic proteins (53–57). For example, recombinant human glycoproteins produced in plants are more similar to their corresponding native protein when N-glycosylation is considered, as compared with the same proteins expressed in systems such as yeast, bacteria and filamentous fungi (58). Thus, plants can be used as efficient heterologous expression system hosts, wherein expressed proteins are likely to be correctly modified, for example, in terms of glycosylation, and can potentially be targeted to a number of different cellular compartments when appropriate vectors are used. Historically, a number of methods have been tried for protein production in plants, including stable transformation, through incorporation of DNA into nuclear or plastid genomes. In addition, transient expression can be achieved, either through Agrobacterium infiltration of plant tissue or through infection with plant viral vectors. The route via stable nuclear transformation can have a long lead-up time before the production of the first batch of target protein, but expression can be maintained indefinitely once the transgenic lines have been selected. Strong promoters, such as 35S CaMV, can direct expression in the leaf and other tissues. The selection of seed-specific promoters can lead to accumulation of protein in the seed or grain. However, yields are usually below 100 μg protein per gram of fresh tissue.
Stable transformation of plastids is also possible and the expression of the protein can be limited to this organelle where high levels of soluble protein expression may be achieved. However, protein deposition into inclusion bodies can also occur here and the protein may not be properly processed and thus folded. Despite these drawbacks a few protein products have been successfully produced in planta (59, 60). Transient expression systems in plants rely on the delivery of the DNA through a bacterium, typically Agrobacterium, a virus or by particle bombardment. These methods are much quicker than stable transformation procedures, but downstream protein


yields are not always high. Of the three methods, that based on viral delivery appears to have the most potential for producing high yields of protein (61). A major limitation of this method is the length of the gene that may be inserted before the genomic size limit of the virus is exceeded. This has recently been overcome by the adoption of a 'deconstructed' strategy (62), whereby viral vectors have been engineered to remove genes required for certain wild-type processes. Processes that have been disabled can be supplied either by the host plant or by incorporating non-viral steps in the procedure. For example, infection of the host plant is traditionally carried out by generating viral RNA in vitro, and transferring the RNA mechanically into the host tissue. These steps can be bypassed by using Agrobacterium, which infects host cells, to transfer a virally based vector into the host plant. The need for efficient cell-to-cell spread can be eliminated in this way because amplicons are delivered to many cells simultaneously by the Agrobacterium infiltration procedure. Vectors such as those developed by the Large Scale Biology Corporation (USA), based on TMV, have enabled this technology to be used on an industrial scale. Further elegant modifications have been made, including 'magnifection' (63, 64), which is a combinatorial approach through which pro-vectors based on TMV have been engineered and optimised to allow post-delivery recombination inside the plant cell (55). Such vectors can be used in various combinations in planta (usually Nicotiana benthamiana) to optimise protein expression without the need to construct each variant separately in the laboratory. Different combinations of pro-vectors can be used to target the protein to various sub-cellular compartments, where the desired post-translational modifications occur, and to allow fusions to a range of affinity tags to be tested.
Combined with the versatility of such a system is the significant increase in protein yield that makes the system even more attractive for industrial scale-up. Marillonnet et al. (64) reported a yield of up to 5 g of expressed protein per kilogram fresh weight biomass; this represented up to 80% of the total protein usually found in uninfected leaf tissue (8–10 g/kg). Gils et al. (65) reported the production of active human growth hormone in the apoplast of N. benthamiana leaves at a level of 1 mg per gram fresh weight biomass (up to 10% of total soluble protein), while recombinant Yersinia pestis antigens have also been produced at similar levels (66). The potential of magnifection to produce biologically active and useful heterologous proteins, such as those used as vaccines, appears to be extensive. Until recently, hetero-oligomeric proteins, for example, IgG antibodies, which require both light- and heavy-chain molecules to combine, had not been expressed in plants. The expression of correctly folded and functional single-chain IgA and IgG antibodies has been reported (67), but the


yields were very low at 1–40 μg per gram fresh biomass. However, attempts to express two polypeptides simultaneously from two different non-competing viral vectors, based on tobacco mosaic virus and potato virus X, in the same cell, have been successful (68). The yield of functionally active antibodies was higher than previously seen with single-chain molecules, and once such systems are optimised, they offer an opportunity to express protein combinations or complexes.

2.2.6. Moss (Physcomitrella patens): A Lower Plant Expression System

Mosses are small plants belonging to the Bryophyta and are fast growing in popularity as a model functional genomics system because of a number of features that make them easy to handle, grow and transform (69). The mosses lack flowers, true vascular tissues and seeds, and remain in a predominantly haploid state (Fig. 2) during most of their life cycle (71). The moss Physcomitrella patens is the only known plant that has evolved a highly efficient homologous recombination capability (72, 73), and can be readily transformed. These characteristics facilitate the integration of foreign DNA into its genome through a single step, and allow the rapid and efficient functional analysis of genes and their products (71, 74). Mosses lack some of the disadvantages found when higher plants are used as heterologous expression systems, such as genetic instability and the production of recombinant proteins in low yield (74). Mosses also catalyse post-translational modifications that are similar to those in higher plants, including disulfide bond formation and glycosylation (75).

Fig. 2. Simplified life cycle of moss (Physcomitrella patens) [modified after (70)]. Haploid gametophyte and diploid sporophyte are the two stages of the moss life cycle. Following mitosis, gametes are produced from the gametophyte and fuse to produce zygotes that develop into the sporophyte. The sporophyte undergoes a meiosis step to produce spores that germinate to produce a long-lived filamentous structure referred to as 'protonema'. Within the protonemal stage of the gametophyte, filaments produce branches and extend their length.


2.3. Cell-Free Expression Systems

2.3.1. Overview

Cell-free protein synthesis systems that involve the transcription and translation of DNA fragments in vitro represent alternatives to in vivo protein expression in whole-cell hosts. Cell-free systems have been developed in the past as key experimental systems to unravel molecular mechanisms of protein biosynthesis (76, 77). In this technique, ribosomes, translation factors and post-translational components are isolated from whole cells and used for in vitro synthesis of polypeptides, using RNA templates (translation systems) or DNA templates (coupled transcription/translation systems) (78–80). In recent years, the simplicity, cost-efficiency and substantial improvements in cell-free systems have led to their widespread use for the large-scale production of proteins from heterologous DNA (81, 82). An important advantage of cell-free translational systems over whole-cell host systems is their capacity to obviate the formation of protein misfolds, which are a heterogeneous mixture of misfolded, incompletely folded or aberrantly folded proteins that associate via intramolecular and intermolecular interactions. Furthermore, the synthesis of proteins that undergo intracellular proteolysis is reduced and the cell-free systems often produce correctly folded proteins. The techniques are also valuable for the expression of proteins that might be toxic to living cells in heterologous hosts such as E. coli, and which frequently form non-toxic insoluble inclusion bodies in these hosts. The key breakthroughs in the development of efficient cell-free systems have been those that ensure continuous feeding of amino acids and nucleotide triphosphates into the reaction mixtures, and the use of efficient energy regeneration systems (4, 80, 83).
While solubility and functionality are the key factors during screening of recombinant clones for expression of soluble proteins, for expression of membrane proteins solubility in the presence of selected lipids, detergents and liposomes, and functionality of a fully membrane-inserted protein, are the most important requirements (80).

2.3.2. Three Main Platforms for Cell-Free Protein Production

Currently, there are three well-defined platforms that are available for preparative in vitro protein production. These include translation systems based on extracts from E. coli (84), wheat germ (85) and mammalian cell-derived systems, such as rabbit reticulocytes (80) and tumor HeLa cell extracts (86). The E. coli and wheat germ systems are the best characterised prokaryotic and eukaryotic systems. Very recently, cell-free systems from insect cells (87), a hybridoma-based (88) translation system and yeast cells have been developed, although proteins tend to be produced in relatively low yields. The hybridoma-based translation system is also capable of fully N-glycosylating proteins and, in combination with a human stress-induced protein, which recruits a phosphatase, leads to a decrease in protein over-phosphorylation and a net increase in protein synthesis efficiency (86).


The systems for preparative in vitro cell-free protein production usually consist of four main steps: (1) DNA template preparation for in vitro transcription; (2) small-scale screening for protein translation efficiency; (3) scale-up of protein production and (4) purification of the expressed target protein in sufficient amounts for functional studies, or for crystallography and NMR data collection (82). The first step in the cell-free protein pipeline is DNA template preparation. Here, PCR-generated linear DNA templates based on the structure of pEU plasmids have been successfully used in a wheat germ embryo system (7). The linear DNA templates also serve as excellent templates for rapid batch-mode screening of overall protein solubility, domain boundaries of membrane proteins, and for creating mutant, fragmented and chimeric proteins that facilitate protein folding (79, 89). Similarly, kits for the generation of DNA templates by PCR can be used and are available from several commercial suppliers, although they often produce rather low amounts of synthesised proteins. The second step involves the small-scale screening of protein translation and is usually performed in high-throughput multi-well formats (7, 78). Successful candidate clones and conditions are separated from non-expressors and selected for large-scale protein production (third step), which involves the use of selenomethionine (90) or single or double 13C- and 15N-labeling (78, 82, 91), for phasing of crystal structures or high-resolution NMR spectroscopy, respectively, followed by protein purification. In addition, crude translation reaction mixtures can be used immediately for testing of biological functions of synthesised polypeptides, which speeds up analysis of screening trials.
Finally, cell-free systems allow one to engineer and guide conditions for polypeptide synthesis in the broadest scope, because cell-free systems represent 'open systems' in the sense that synthesis of proteins can be manipulated, stabilised or affected in any possible way. For example, cell-free systems support protein folding by co-expression with chaperones (92), allow disulphide bond formation (93) or affect protein glycosylation, phosphorylation and other post-translational modifications (88, 94). In one case, when human erythropoietin was expressed using an S30 cell extract from E. coli Origami B(DE3) strain and supplemented with chaperone GroE and disulfide isomerase, the biological activity of synthesised erythropoietin was increased more than sevenfold compared to the control reaction (92). This implies that post-translational folding, disulphide bond re-shuffling and co-translational folding are key factors for acquiring a functional protein. The successful implementation of cell-free translation systems obviates problems with cell expression optimisation, cell harvesting, lysis and protein isolation. Two serious disadvantages of cell-free translation systems are the high cost and restricted


availability of active wheat germ and E. coli extracts, and it is generally impracticable to prepare these highly complex cell-free extracts in the laboratory. Finally, cell-free expression systems are amenable to robotics and automation through high-throughput screening of around 1,000 small-scale screening reactions (25–50 μl) per week (82).

2.3.3. Cell-Free Expression of Membrane Proteins

As mentioned above, the solubility of membrane proteins in lipoid environments and functionality of fully membrane-inserted proteins are the key requirements for expressing membrane proteins in cell-free systems (80, 95). This means that a variety of preformed micelles containing detergents and lipids or (proteo) liposomes (that provide membrane curvature to support protein insertion), or mixtures thereof, can be added to the reaction mixture so that the membrane proteins can be synthesised directly into a defined hydrophobic environment. This allows their proper interaction with lipoids and detergents, and promotes functional folding during or shortly after translation (95, 96). Further, it is important that sufficient micelles are provided for optimal production of membrane proteins, so that the micelles are present at molar concentrations approximately equal to the molar concentrations of the membrane proteins to be synthesised, to prevent aggregate formation and non-homogenous (proteo)micelle formation (96, 97). In addition to a tailored protein/lipoid component ratio, it is critical to subject membrane protein screening trials to the same optimisation strategies that apply to soluble proteins, as specified above. It is well known that detergents interact with membrane proteins non-specifically and therefore a large variety of detergents is usually tested to select conditions that are beneficial to membrane protein folding but are chemically neutral in reaction mixtures. Alternatively and/or concurrently, protein precipitates can be formed during cell-free membrane protein production. These misfolded membrane protein species can be solubilised and refolded with mild, usually non-ionic or amphipathic detergents and polymers (96, 98). The re-folding relies on the fast transfer of proteins from a denaturant solution to lipoid environments. 
Some membrane proteins are sensitive to in vitro transfer from one lipoid environment to another, so if it were possible to induce insertion, correct folding and biological function of the emerging, translated protein directly into a liposome, this would remove the biggest hurdle in obtaining active enzymes in an artificial membrane. This could be achieved in reticulocyte lysate systems by fusing liposomes with the reticulocytes. In addition, E. coli lysate kits can be used to express proteins in close proximity to a liposome membrane, into which the membrane proteins can be incorporated (99).
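The requirement that micelles be present at roughly the same molar concentration as the membrane protein can be turned into a rough detergent estimate. The sketch below is a back-of-envelope calculation, not a validated protocol. It assumes the standard approximation that the micelle concentration equals the detergent concentration above the critical micelle concentration (CMC) divided by the aggregation number. The function names are hypothetical and the numerical values in the usage example are illustrative placeholders, not measured constants for any particular detergent.

```python
def micelle_concentration(detergent_mM, cmc_mM, aggregation_number):
    """Approximate micelle concentration (mM):
    [micelle] = ([detergent] - CMC) / aggregation number,
    and zero when the detergent is at or below its CMC."""
    if aggregation_number <= 0:
        raise ValueError("aggregation_number must be positive")
    return max(0.0, detergent_mM - cmc_mM) / aggregation_number


def detergent_for_micelle_target(protein_mM, cmc_mM, aggregation_number,
                                 micelles_per_protein=1.0):
    """Total detergent (mM) needed so that micelles are present at
    `micelles_per_protein` times the expected membrane protein concentration."""
    if protein_mM < 0 or cmc_mM < 0 or aggregation_number <= 0:
        raise ValueError("concentrations must be non-negative and "
                         "aggregation_number positive")
    return cmc_mM + aggregation_number * micelles_per_protein * protein_mM


if __name__ == "__main__":
    # Illustrative placeholder values only.
    needed = detergent_for_micelle_target(protein_mM=0.05, cmc_mM=0.2,
                                          aggregation_number=100)
    print(f"detergent needed: {needed:.2f} mM")
    print(f"micelles formed: {micelle_concentration(needed, 0.2, 100):.3f} mM")
```

Because the expected protein yield is itself uncertain before screening, an estimate of this kind is best treated as a starting point for the detergent series that is tested empirically.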

3. Concluding Remarks and Future Directions

In summary, the aim of heterologous and cell-free expression is to produce biochemically active proteins with technological ease and at low cost. Unfortunately, individual proteins require specific conditions for correct folding and, in most cases, the only way to achieve this is by trial and error, using different host systems. Further, a key drawback of heterologous systems is that successfully expressed proteins may display no activity. In vitro assays of undefined enzymes require the inclusion of a range of substrate molecules, some of which are not readily available (100), and in cases where they are available the number of possible combinations makes the examination of substrate specificity a large and expensive task. Although some proteins have been shown to be autonomously functional, biochemical and transcript data have indicated that others require partners for efficient function. It might be possible to overcome this requirement for a complex of enzymes by using in planta expression systems (55, 62, 101, 102) and assaying either the crude extract or an enriched protein isolated through the use of an affinity tag. Appropriate design of gene constructs may also be possible, whereby proteins that have previously been shown to require partners for efficient function are co-expressed at similar levels from a poly-cistronic mRNA driven by a single strong promoter. Internal Ribosome Entry Site (IRES) (103) sequences interspersed between the genes are used to mediate such expression, and there is a growing number of these that may be used in vector design (http://ifr31w3.toulouse.inserm.fr/IRESdatabase). Another more general problem associated with some proteins is that prediction of function based on sequence relatedness with previously characterised proteins is often not possible.
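The poly-cistronic design described above, with one promoter followed by several coding sequences separated by IRES elements, can be written down as an ordered list of parts before any cloning is planned. The sketch below shows only that ordering. The function name and the part labels are hypothetical placeholders, and no real sequences or vector features are implied.

```python
def build_polycistronic(promoter, orfs, ires="IRES"):
    """Return the ordered parts of a poly-cistronic cassette:
    a single promoter, then the coding sequences with an IRES
    element placed between each adjacent pair."""
    if not orfs:
        raise ValueError("at least one coding sequence is required")
    parts = [promoter]
    for index, orf in enumerate(orfs):
        if index > 0:
            parts.append(ires)   # an IRES precedes every ORF except the first
        parts.append(orf)
    return parts


if __name__ == "__main__":
    cassette = build_polycistronic("strong_promoter",
                                   ["enzyme_A", "partner_B", "partner_C"])
    print(" - ".join(cassette))
```

For n coding sequences the cassette contains n - 1 IRES elements, so only the first gene is translated by conventional cap-dependent initiation.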

Acknowledgements The financial support of the Australian Research Council, the Grains Research and Development Corporation and the South Australian State Government is gratefully acknowledged. We thank Bianca Kuchel for her invaluable technical assistance with aspects of the work described here.


Farrokhi et al.

References

1. Brown, D. and Sjölander, K. (2006) Functional classification using phylogenomic inference. PLoS Comput. Biol. 2, 479–483.
2. Hrmova, M., Burton, R.A., Biely, P., Lahnstein, J., and Fincher, G.B. (2006) Hydrolysis of (1,4)-β-D-mannans in barley (Hordeum vulgare L.) is mediated by the concerted action of (1,4)-β-D-mannan endohydrolase and β-D-mannosidase. Biochem. J. 399, 77–90.
3. Sawasaki, T., Ogasawara, T., Morishita, R., and Endo, Y. (2002) A cell-free protein synthesis system for high-throughput proteomics. Proc. Natl. Acad. Sci. USA 99, 14652–14657.
4. Endo, Y., Otsuzuki, S., Ito, K., and Miura, K. (1992) Production of an enzymatic active protein using a continuous flow cell-free translation system. J. Biotechnol. 25, 221–230.
5. Kost, T.A. (1997) Expression systems: gene expression systems in the genomics era. Curr. Opin. Biotechnol. 8, 539–541.
6. Dubessay, P., Pages, M., Delbac, F., Bastien, P., Vivares, C., and Blaineau, C. (2004) Can heterologous gene expression shed (a torch) light on protein function? Trends Biotechnol. 22, 557–559.
7. Endo, Y. and Sawasaki, T. (2004) High-throughput, genome-scale protein production method based on the wheat germ cell-free expression system. J. Struct. Funct. Genomics 5, 45–57.
8. Gustafsson, C., Govindarajan, S., and Minshull, J. (2004) Codon bias and heterologous protein expression. Trends Biotechnol. 22, 346–353.
9. Olins, P.O. (1996) Quantity versus authenticity of heterologously produced proteins: an inevitable compromise? Curr. Opin. Biotechnol. 7, 487–488.
10. Braun, P. and LaBaer, J. (2003) High throughput protein production for functional proteomics. Trends Biotechnol. 21, 383–388.
11. Geisse, S., Gram, H., Kleuser, B., and Kocher, H.P. (1996) Eukaryotic expression systems: a comparison. Protein Expr. Purif. 8, 271–282.
12. Persans, M.W., Nieman, K., and Salt, D.E. (2001) Functional activity and role of cation-efflux family members in Ni hyperaccumulation in Thlaspi goesingense. Proc. Natl. Acad. Sci. USA 98, 9995–10000.
13. Opekarova, M. and Tanner, W. (2003) Specific lipid requirements of membrane proteins – a putative bottleneck in heterologous expression. Biochim. Biophys. Acta 1610, 11–22.
14. Opekarova, M., Robl, I., Grassl, R., and Tanner, W. (1999) Expression of eukaryotic plasma membrane transporter HUP1 from Chlorella kessleri in Escherichia coli. FEMS Microbiol. Lett. 174, 65–72.
15. Robl, I., Grassl, R., Tanner, W., and Opekarova, M. (2000) Properties of a reconstituted eukaryotic hexose/proton symporter solubilized by structurally related non-ionic detergents: specific requirement of phosphatidylcholine for permease stability. Biochim. Biophys. Acta 1463, 407–418.
16. Kurland, C. and Gallant, J. (1996) Errors of heterologous protein expression. Curr. Opin. Biotechnol. 7, 489–493.
17. Baca, A.M. and Hol, W.G. (2000) Overcoming codon bias: a method for high-level overexpression of Plasmodium and other AT-rich parasite genes in Escherichia coli. Int. J. Parasitol. 30, 113–118.
18. Hochuli, E., Dobeli, H., and Schacher, A. (1987) New metal chelate adsorbent selective for proteins and peptides containing neighbouring histidine residues. J. Chromatogr. 411, 177–184.
19. Huth, J.R., Bewley, C.A., Jackson, B.M., Hinnebusch, A.G., Clore, G.M., and Gronenborn, A.M. (1997) Design of an expression system for detecting folded protein domains and mapping macromolecular interactions by NMR. Protein Sci. 6, 2359–2364.
20. LaVallie, E.R., DiBlasio, E.A., Kovacic, S., Grant, K.L., Schendel, P.F., and McCoy, J.M. (1993) A thioredoxin gene fusion expression system that circumvents inclusion body formation in the E. coli cytoplasm. Biotechnology 11, 187–193.
21. Smith, D.B. and Johnson, K.S. (1988) Single-step purification of polypeptides expressed in Escherichia coli as fusions with glutathione S-transferase. Gene 67, 31–40.
22. di Guan, C., Li, P., Riggs, P.D., and Inouye, H. (1988) Vectors that facilitate the expression and purification of foreign peptides in Escherichia coli by fusion to maltose-binding protein. Gene 67, 21–30.
23. Davis, G.D., Elisee, C., Newham, D.M., and Harrison, R.G. (1999) New fusion protein systems designed to give soluble expression in Escherichia coli. Biotechnol. Bioeng. 65, 382–388.
24. Chong, S., Mersha, F.B., Comb, D.G., Scott, M.E., Landry, D., Vence, L.M., Perler, F.B., Benner, J., Kucera, R.B., Hirvonen, C.A., Pelletier, J.J., Paulus, H., and Xu, M.Q. (1997) Single-column purification of free recombinant proteins using a self-cleavable affinity tag derived from a protein splicing element. Gene 192, 271–281.

25. Terpe, K. (2003) Overview of tag protein fusions: from molecular and biochemical fundamentals to commercial systems. Appl. Microbiol. Biotechnol. 60, 523–533.
26. Waugh, D.S. (2005) Making the most of affinity tags. Trends Biotechnol. 23, 316–320.
27. Kapust, R.B. and Waugh, D.S. (1999) Escherichia coli maltose-binding protein is uncommonly effective at promoting the solubility of polypeptides to which it is fused. Protein Sci. 8, 1668–1674.
28. Smith, D.B. (2000) Generating fusions to glutathione S-transferase for protein studies. Methods Enzymol. 326, 254–270.
29. Pagny, S., Bouissonnie, F., Sarkar, M., Follet-Gueye, M., Driouich, A., Schachter, H., Faye, L., and Gomord, V. (2003) Structural requirements for Arabidopsis beta1,2-xylosyltransferase activity and targeting to the Golgi. Plant J. 33, 189–203.
30. Lorimer, G.H. (1996) A quantitative assessment of the role of the chaperonin proteins in protein folding in vivo. FASEB J. 10, 5–9.
31. Baneyx, F. and Mujacic, M. (2004) Recombinant protein folding and misfolding in Escherichia coli. Nat. Biotechnol. 22, 1399–1408.
32. Weickert, M.J., Doherty, D.H., Best, E.A., and Olins, P.O. (1996) Optimization of heterologous protein production in Escherichia coli. Curr. Opin. Biotechnol. 7, 494–499.
33. Blum, P., Ory, J., Bauernfeind, J., and Krska, J. (1992) Physiological consequences of DnaK and DnaJ overproduction in Escherichia coli. J. Bacteriol. 174, 7436–7444.
34. Wetzel, R. (1994) Mutations and off-pathway aggregation of proteins. Trends Biotechnol. 12, 193–198.
35. Georgiou, G. and Valax, P. (1996) Expression of correctly folded proteins in Escherichia coli. Curr. Opin. Biotechnol. 7, 190–197.
36. Raikhel, N. and Chrispeels, M. (2000) Protein sorting and vesicle traffic, in Biochemistry & Molecular Biology of Plants (Buchanan, B.B., Gruissem, W., and Jones, R., eds.), John Wiley & Sons, New York, pp. 160–201.
37. Spremulli, L. (2000) Protein synthesis, assembly, and degradation, in Biochemistry & Molecular Biology of Plants (Buchanan, B.B., Gruissem, W., and Jones, R., eds.), John Wiley & Sons, New York, pp. 412–454.
38. Bencurova, M., Rendic, D., Fabini, G., Kopecky, E., Altmann, F., and Wilson, I. (2003) Expression of eukaryotic glycosyltransferases in the yeast Pichia pastoris. Biochimie 85, 413–422.
39. Grisshammer, R. and Tate, C.G. (1995) Overexpression of integral membrane proteins for structural studies. Q. Rev. Biophys. 28, 315–422.
40. Malissard, M., Zeng, S., and Berger, E.G. (1999) The yeast expression system for recombinant glycosyltransferases. Glycoconj. J. 16, 125–139.
41. Eckart, M.R. and Bussineau, C.M. (1996) Quality and authenticity of heterologous proteins synthesized in yeast. Curr. Opin. Biotechnol. 7, 525–530.
42. Sadhukhan, R. and Sen, I. (1996) Different glycosylation requirements for the synthesis of enzymatically active angiotensin-converting enzyme in mammalian cells and yeast. J. Biol. Chem. 271, 6429–6434.
43. Wagner, S., Bader, M.L., Drew, D., and de Gier, J.W. (2006) Rationalizing membrane protein overexpression. Trends Biotechnol. 24, 364–371.
44. McCarroll, L. and King, L.A. (1997) Stable insect cell cultures for recombinant protein production. Curr. Opin. Biotechnol. 8, 590–594.
45. Liepman, A.H., Wilkerson, C.G., and Keegstra, K. (2005) Expression of cellulose synthase-like (Csl) genes in insect cells reveals that CslA family members encode mannan synthases. Proc. Natl. Acad. Sci. USA 102, 2221–2226.
46. Makrides, S.C. (1999) Components of vectors for gene transfer and expression in mammalian cells. Protein Expr. Purif. 17, 183–202.
47. Marino, M. (1989) Expression systems for heterologous protein production. BioPharm. 2, 18–33.
48. Daly, R. and Hearn, M.T. (2005) Expression of heterologous proteins in Pichia pastoris: a useful experimental tool in protein engineering and production. J. Mol. Recognit. 18, 119–138.
49. Elbein, A.D. (1984) Inhibitors of the biosynthesis and processing of N-linked oligosaccharides. CRC Crit. Rev. Biochem. 16, 21–49.
50. Colosimo, A., Goncz, K.K., Holmes, A.R., Kunzelmann, K., Novelli, G., Malone, R.W., Bennett, M.J., and Gruenert, D.C. (2000) Transfer and expression of foreign genes in mammalian cells. Biotechniques 29, 314–318.
51. Kammerloher, W., Fischer, U., Piechottka, G.P., and Schaffner, A.R. (1994) Water channels in the plant plasma membrane cloned by immunoselection from a mammalian expression system. Plant J. 6, 187–199.
52. Perrin, R., DeRocher, A., Bar-Peled, M., Zeng, W., Norambuena, L., Orellana, A., Raikhel, N., and Keegstra, K. (1999) Xyloglucan fucosyltransferase, an enzyme involved in plant cell wall biosynthesis. Science 284, 1976–1979.

53. Hellwig, S., Drossard, J., Twyman, R., and Fischer, R. (2004) Plant cell cultures for the production of recombinant proteins. Nat. Biotechnol. 22, 1415–1422.
54. Li, Y., Geng, Y., Song, H., Zheng, G., Huan, L., and Qiu, B. (2004) Expression of a human lactoferrin N-lobe in Nicotiana benthmiana with potato virus X-based agroinfection. Biotechnol. Lett. 26, 953–957.
55. Marillonnet, S., Giritch, A., Gils, M., Kandzia, R., Klimyuk, V., and Gleba, Y. (2004) In planta engineering of viral RNA replicons: efficient assembly by recombination of DNA modules delivered by Agrobacterium. Proc. Natl. Acad. Sci. USA 101, 6852–6857.
56. Selth, L.A., Randles, J.W., and Rezaian, M.A. (2004) Host responses to transient expression of individual genes encoded by tomato leaf curl virus. Mol. Plant Microbe Interact. 17, 27–33.
57. Wagner, B., Hufnagl, K., Radauer, C., Wagner, S., Baier, K., Scheiner, O., Wiedermann, U., and Breiteneder, H. (2004) Expression of the B subunit of the heat-labile enterotoxin of Escherichia coli in tobacco mosaic virus-infected Nicotiana benthamiana plants and its characterization as mucosal immunogen and adjuvant. J. Immunol. Methods 287, 203–215.
58. Gomord, V. and Faye, L. (2004) Posttranslational modification of therapeutic proteins in plants. Curr. Opin. Plant Biol. 7, 171–181.
59. Fischer, R., Stoger, E., Schillberg, S., Christou, P., and Twyman, R.M. (2004) Plant based production of biopharmaceuticals. Curr. Opin. Plant Biol. 7, 152–158.
60. Maliga, P. and Graham, I. (2004) Molecular farming and metabolic engineering promise a new generation of high-tech crops. Curr. Opin. Plant Biol. 7, 149–151.
61. Porta, C. and Lomonossoff, G.P. (2002) Viruses as vectors for the expression of foreign sequences in plants. Biotech. Genet. Eng. Rev. 19, 245–291.
62. Gleba, Y., Marillonnet, S., and Klimyuk, V. (2004) Engineering viral expression vectors for plants: the “full virus” and the “deconstructed virus” strategies. Curr. Opin. Plant Biol. 7, 182–188.
63. Gleba, Y., Klimyuk, V., and Marillonnet, S. (2005) Magnifection – a new platform for expressing recombinant vaccines in plants. Vaccine 23, 2042–2048.
64. Marillonnet, S., Thoeringer, C., Kandzia, R., Klimyuk, V., and Gleba, Y. (2005) Systemic Agrobacterium tumefaciens-mediated transfection of viral replicons for efficient transient expression in plants. Nature Biotech. 23, 718–723.
65. Gils, M., Kandzia, R., Marillonnet, S., Klimyuk, V., and Gleba, Y. (2005) High-yield production of authentic human growth hormone using a plant virus-based expression system. Plant Biotech. J. 3, 613–620.
66. Santi, L., Giritch, A., Roy, C., Marillonnet, S., Klimyuk, V., Gleba, Y., Webb, R., Arntzen, C.J., and Mason, H.S. (2006) Protection conferred by recombinant Yersinia pestis antigens produced by a rapid and highly scalable plant expression system. Proc. Natl. Acad. Sci. USA 103, 861–866.
67. Awram, P., Gardner, R.C., Forster, R.L., and Bellamy, A.R. (2002) The potential of plant viral vectors and transgenic plants for subunit vaccine production. Adv. Virus Res. 58, 81–124.
68. Giritch, A., Marillonnet, S., Engler, C., van Eldik, G., Botterman, J., Klimyuk, V., and Gleba, Y. (2006) Rapid high-yield expression of full-size IgG antibodies in plants coinfected with noncompeting viral vectors. Proc. Natl. Acad. Sci. USA 103, 14701–14706.
69. Frank, W., Ratnadewi, D., and Reski, R. (2005) Physcomitrella patens is highly tolerant against drought, salt and osmotic stress. Planta 220, 384–394.
70. Cove, D., Knight, C., and Lamparter, T. (1997) Mosses as model systems. Trends Plant Sci. 2, 99–105.
71. Reski, R. and Cove, D.J. (2004) Physcomitrella patens. Curr. Biol. 14, R261–R262.
72. Landy, A. (1989) Dynamic, structural, and regulatory aspects of lambda site-specific recombination. Annu. Rev. Biochem. 58, 913–949.
73. Schaefer, D.G. and Zryd, J.P. (1997) Efficient gene targeting in the moss Physcomitrella patens. Plant J. 11, 1195–1206.
74. Decker, E.L. and Reski, R. (2004) The moss bioreactor. Curr. Opin. Plant Biol. 7(2), 166–70.
75. Koprivova, A., Altmann, F., Gorr, G., Kopriva, S., Reski, R., and Decker, E. (2003) N-glycosylation in the moss Physcomitrella patens is organised similarly to that in higher plants. Plant Biol. 5, 582–591.
76. Zubay, G. (1973) In vitro synthesis of proteins in microbial systems. Annu. Rev. Genet. 7, 267–287.
77. Nirenberg, M.W. and Matthaei, J.H. (1961) The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polynucleotides. Proc. Natl. Acad. Sci. USA 47, 1588–1602.

78. Spirin, A.S. (2004) High-throughput cell-free systems for synthesis of functionally active proteins. Trends Biotech. 22, 538–545.
79. Miles, L.A. (2005) Robust and cost effective cell-free expression of biopharmaceuticals: Escherichia coli and wheat embryo, in Modern Biopharmaceuticals, Design, Development and Optimization, Volume 2 (Knäblein, J., ed.), Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, pp. 1063–1081.
80. Dixon, N.E. (2006) Cell-free protein synthesis. FEBS J. 273, 4131–4132.
81. Yokoyama, S. (2003) Protein expression systems for structural genomics and proteomics. Curr. Opin. Chem. Biol. 7, 39–43.
82. Vinarov, D.A., Lytle, B.L., Peterson, F.C., Tyler, E.M., Volkman, B.F., and Markley, J.L. (2004) Cell-free protein production and labeling protocol for NMR-based structural proteomics. Nat. Methods 1, 149–53.
83. Spirin, A.S., Baranov, V.I., Ryabova, L.A., Ovodov, S.Y., and Alakhov, Y.B. (1988) A continuous cell-free translation system capable of producing polypeptides in high yield. Science 242, 1162–1164.
84. Kigawa, T., Yabuki, T., Yoshida, Y., Tsutsui, M., Ito, Y., Shibata, T., and Yokoyama, S. (1999) Cell-free production and stable-isotope labeling of milligram quantities of proteins. FEBS Lett. 442, 15–19.
85. Madin, K., Sawasaki, T., Ogasawara, T., and Endo, Y. (2000) A highly efficient and robust cell-free protein synthesis system prepared from wheat embryos: plants apparently contain a suicide system directed at ribosomes. Proc. Natl. Acad. Sci. USA 97, 559–564.
86. Mikami, S., Masutani, M.N., Yokoyama, S., and Imataka, H. (2006a) An efficient mammalian cell-free translation system supplemented with translation factors. Protein Expr. Purif. 46, 348–357.
87. Ezure, T., Suzuki, T., Higashide, S., Shintani, E., Endo, K., Kobayashi, S., Shikata, M., Ito, M., Tanimizu, K., and Nishimura, O. (2006) Cell-free protein synthesis system prepared from insect cells by freeze-thawing. Biotechnol. Prog. 22, 1570–1577.
88. Mikami, S., Kobayashi, T., Yokoyama, S., and Imataka, H. (2006b) A hybridoma-based in vitro translation system that efficiently synthesizes glycoproteins. J. Biotechnol. 127, 65–78.
89. Palmer, E., Liu, H., Khan, F., Taussig, M.J., and He, M. (2006) Enhanced cell-free protein expression by fusion with immunoglobulin Cκ domain. Protein Sci. 15, 2842–2846.
90. Arai, R., Kukimoto-Niino, M., Uda-Tochio, H., Morita, S., Uchikubo-Kamo, T., Akasaka, R., Etou, Y., Hayashizaki, Y., Kigawa, T., Terada, T., Shirouzu, M., and Yokoyama, S. (2005) Crystal structure of an enhancer of rudimentary homolog (ERH) at 2.1 Å resolution. Protein Sci. 14, 1888–1893.
91. Staunton, D., Schlinkert, R., Zanetti, G., Colebrook, S.A., and Campbell, I.D. (2006) Cell-free expression and selective isotope labelling in protein NMR. Magn. Reson. Chem. 44, S2–S9.
92. Kang, S.H., Kim, D.M., Kim, H.J., Jun, S.Y., Lee, K.Y., and Kim, H.J. (2005) Cell-free production of aggregation-prone proteins in soluble and active forms. Biotechnol. Prog. 21, 1412–1419.
93. Kawasaki, T., Gouda, M.D., Sawasaki, T., Takai, K., and Endo, Y. (2003) Efficient synthesis of a disulfide-containing protein through a batch cell-free system from wheat germ. Eur. J. Biochem. 270, 4680–4786.
94. Jiang, X., Ookubo, Y., Fujii, I., Nakano, H., and Yamane, T. (2002) Expression of Fab fragment of catalytic antibody 6D9 in an Escherichia coli in vitro coupled transcription/translation system. FEBS Lett. 514, 290–294.
95. Klammt, C., Schwarz, D., Löhr, F., Schneider, B., Dötsch, V., and Bernhard, F. (2006) Cell-free expression as an emerging technique for the large scale production of integral membrane protein. FEBS J. 273, 4141–4153.
96. Klammt, C., Schwarz, D., Eifler, N., Engel, A., Piehler, J., Haase, W., Hahn, S., Dötsch, V., and Bernhard, F. (2007) Cell-free production of G protein-coupled receptors for functional and structural studies. J. Struct. Biol. (in press); online: doi:10.1016/j.jsb.2007.01.006
97. Klammt, C., Schwarz, D., Fendler, K., Haase, W., Dotsch, V., and Bernhard, F. (2005) Evaluation of detergents for the soluble expression of alpha-helical and beta-barrel-type integral membrane proteins by a preparative scale individual cell-free expression system. FEBS J. 272, 6024–6038.
98. Pocanschi, C.L., Dahmane, T., Gohon, Y., Apell, H.-J., Kleinschmidt, J.H., and Popot, J.-L. (2006) Amphipathic polymers: tools to fold integral membrane proteins to their active form. Biochemistry 45, 13954–13961.
99. Lyford, L.K. and Rosenberg, R.L. (1999) Cell-free expression and functional reconstitution of homo-oligomeric alpha7 nicotinic acetylcholine receptors into planar lipid bilayers. J. Biol. Chem. 274, 25675–25681.
100. Perrin, R. (2001) Cellulose: how many cellulose synthases to make a plant? Curr. Biol. 11, R213–R216.

101. Voinnet, O., Rivas, S., Mestre, P., and Baulcombe, D. (2003) An enhanced transient expression system in plants based on suppression of gene silencing by the p19 protein of tomato bushy stunt virus. Plant J. 33, 949–956.
102. Komarnytsky, S., Gaume, A., Garvey, A., Borisjuk, N., and Raskin, I. (2004) A quick and efficient system for antibiotic-free expression of heterologous genes in tobacco roots. Plant Cell Rep. 22, 765–773.
103. Bonnal, S., Boutonnet, C., Prado-Lourenco, L., and Vagner, S. (2003) IRESdb: the Internal Ribosome Entry Site database. Nucleic Acids Res. 31, 427–428.

Chapter 11

Functional Genomics and Structural Biology in the Definition of Gene Function

Maria Hrmova and Geoffrey B. Fincher

Summary

By mid-2007, the three-dimensional (3D) structures of some 45,000 proteins had been solved, over a period in which the linear structures of millions of genes were defined. Technical challenges associated with X-ray crystallography are being overcome and high-throughput methods both for crystallization of proteins and for solving their 3D structures are under development. The question arises as to how structural biology can be integrated with, and add value to, functional genomics programs. Structural biology will assist in the definition of gene function through the identification of the likely function of the protein products of genes. The 3D information allows protein sequences predicted from DNA sequences to be classified into broad groups, according to the overall ‘fold’, or 3D shape, of the protein. Structural information can be used to predict the preferred substrate of a protein, and thereby greatly enhance the accurate annotation of the corresponding gene. Furthermore, it will enable the effects of amino acid substitutions in enzymes to be better understood with respect to enzyme function and could thereby provide insights into natural variation in genes. If the molecular basis of transcription factor–DNA interactions were defined through precise 3D knowledge of the protein–DNA binding site, it would be possible to predict the effects of base substitutions within the motif on the specificity and/or kinetics of binding. In this chapter, we present specific examples of how structural biology can provide valuable information for functional genomics programs.

Key words: Crystallography, Database annotation, Gene function, Protein structure, Protein–protein interactions.

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513 © Humana Press, a part of Springer Science + Business Media, LLC 2009 DOI: 10.1007/978-1-59745-427-8_11

1. Introduction

Structural biology in the broadest sense refers to the definition of molecular structures at the three-dimensional (3D) level and to the application of that 3D structural information to better understand a biological process or system. The structures of all biological molecules are included under this umbrella definition, but in practice structural biology is focused primarily on defining the structures of proteins, and in most instances the 3D structures are solved through X-ray crystallography. The very short wavelengths of the X-ray beam enable the structures of proteins to be defined at a resolution of 2 Å and better, which is sufficient to distinguish individual atoms in the proteins and to determine the 3D spatial dispositions of individual atoms in relation to all the other atoms in the protein. Only at this resolution is it possible to fully understand how proteins actually function, at a molecular level, in their biological environment. The definition of 3D structures of proteins has been and remains technically challenging, as outlined below. The technical difficulties are reflected in the relatively small number of protein structures that have been defined by X-ray crystallography, in comparison, for example, with the structures of genes. In Fig. 1, it can be seen that by the middle of 2007 the structures of some 45,000 proteins had been solved, over a period in which the linear structures of millions of genes were defined. It is probably true to say that the technical problems associated with X-ray crystallography have also limited the application of the technique to relatively simple biological systems, such as defining enzyme–substrate or protein–ligand interactions. Nevertheless, there are increasing numbers of cases where the technique has been successfully applied to much more complex biological processes and this trend can be expected to continue as the technical difficulties are progressively overcome. The question now arises as to how structural biology can be applied to functional genomics methods and programs, and what value can be added to these programs through the inclusion of 3D structural data on the protein products of genes.
Firstly, and in the immediate future, structural biology can be expected to assist in the definition of gene function through the identification of the likely function of gene products, and hence to add value to the annotation of gene databases and genome sequences. One of the central objectives of the Arabidopsis 2010 project was to define the functions of all the genes in the Arabidopsis genome by 2010. This has proved to be an ambitious target, but the addition of structural data should certainly accelerate gene identification in the immediate future. The 3D information will be useful at two levels. At the first level, it allows protein sequences predicted from DNA sequences to be classified into broad groups, according to the overall ‘fold’, or 3D shape, of the protein. It has been estimated that proteins fold into ~1,000 different base conformations, or folds (1), and currently there are ~860–1,050 unique folds reported by the Structural Classification of Proteins (SCOP) and Class, Architecture, Topology and Homologous Superfamily (CATH) classifications in the Protein Data Bank (http://www.rcsb.org). In the CAZY databases, the proteins that constitute the many carbohydrate-modifying enzymes have been classified into families, each of which is confidently predicted to contain proteins with a particular fold (2, http://www.cazy.org/).

Fig. 1. Growth of released three-dimensional structures and protein folds that are classified in the Protein Data Bank (http://www.rcsb.org). Plot axes: year (1977–2007), number of structures (0–50,000) and number of folds by SCOP or CATH classification (0–2,000).

At the second level, detailed 3D structural information can be used to predict and define the precise function of a protein. For example, a large family of enzymes might include members with related yet distinct substrate specificities. Structural information can be used to predict the preferred substrate of a protein with a defined amino acid sequence, and thereby greatly enhance the accurate annotation of the corresponding gene. It is worth noting at this stage that as more 3D structures become available, they provide learning tools and information for the improvement of structure prediction programs. In the future this will also benefit the accuracy of gene annotations.

Precise 3D structural information will allow the definition of the function of gene products at the molecular level. For example, detailed mechanisms of action of enzymes can be defined, including the precise interactions that occur between substrates and enzymes during substrate binding and the precise sequence of events that occurs during catalysis. This information can be used to define or predict the exact basis for substrate specificity and for catalytic efficiency. Further, it could enable a better understanding of the effects of amino acid substitutions in enzymes on enzyme function and might also provide insights into the effects of natural variation in genes. Similarly, the molecular basis of transcription factor–DNA interactions will be defined through precise 3D knowledge of the DNA-binding site of the transcription factor itself and the chemically and three-dimensionally complementary structure of the DNA motif that is bound, and it will be possible to predict the effects of base substitutions within the motif on the specificity and/or kinetics of binding.

At a slightly higher level, 3D structural information will enable the molecular details of protein–protein interactions to be defined. Important cellular processes such as signal transduction, the assembly of multi-protein transcription complexes or the activation of membrane-bound transporters might all involve interactions between two or more proteins. The 3D structure of the auxin receptor in plants has been solved and involves interactions between at least two proteins (3). A thorough understanding of these types of protein–protein interactions at the molecular level will be an essential step towards the broader definition of cellular function and towards potential manipulations of the cellular processes.

Although the rate of generation of protein structural information has fallen behind the rate of gene discovery (Fig. 1), high-throughput functional genomics programs are providing an incentive for structural biologists to develop their own high-throughput techniques. Accordingly, small-scale, high-throughput heterologous expression systems are under development for the rapid generation of purified proteins for crystallization. Small-scale, high-throughput crystallization matrices are now available and high-energy beamlines, particularly through the increasing numbers of synchrotrons that are coming on stream around the world, will enable the rapid collection of diffraction patterns from small and micro-crystals.
Finally, powerful computing programs and computing hardware are available for the rapid solution of 3D structures from the diffraction patterns so generated. Examples of these advances and their applications to functional genomics programs are given in the subheadings below.

2. Methods for the Definition of Protein Structure

Two key procedures have been used to define the 3D structures of proteins to the atomic level that is required to understand function, namely, nuclear magnetic resonance (NMR) spectroscopy and X-ray crystallography. Other techniques such as two-dimensional (2D) and single-particle electron cryomicroscopy, electron tomography (4), site-directed spin labelling (5) and ab initio molecular modelling are emerging as useful additional methods.


The NMR-based procedures have been particularly successful in the definition of small protein and peptide structures, and provide a significant advantage over X-ray crystallography insofar as they allow the determination of peptide structures in aqueous solutions and show the thermal mobility, or molecular vibration, of peptides as it occurs in a ‘native’ state in the aqueous solution. The limitations of NMR-based procedures are that they require large amounts of protein and can generally only be used to solve structures of proteins with 150 or fewer amino acid residues. More powerful NMR machines will undoubtedly address some of these shortcomings in the future. About 85% of macromolecular structures of proteins, viruses and nucleic acids and their complexes have been solved by X-ray crystallography (6), and it is likely that in the future this technique will remain the method of choice as the most robust experimental approach for determination of 3D structures of biological macromolecules. We will therefore focus on that method here. Obtaining satisfactory results from X-ray crystallography depends very heavily on the availability of scrupulously pure protein, whether that is purified from biological tissue or generated through heterologous expression systems. In the past, up to low milligram quantities of purified proteins were usually required and this was certainly a limiting factor for the structural determination of lowabundance proteins, such as transcription factors. However, the development of cloning technologies and more robust heterologous and cell-free expression systems (7) means that enough protein can usually be obtained, although even expressed proteins might require additional purification. Furthermore, crystallographers are developing more sensitive methods for gathering diffraction data and require increasingly smaller amounts of purified protein. 
Once sufficient protein of interest has been generated, at acceptably high purity, the real task of generating good quality crystals begins. Crystallization of proteins is arguably the most difficult and unpredictable step in the procedure and has traditionally represented a major hurdle for the successful adoption of structural biology. In principle, the protein is dissolved at high concentration, at a level just below its precipitation or isoelectric point, in a solution that is designed to maintain 3D conformation over extended periods. Crystallization is allowed to proceed in a number of ways, of which hanging and sitting drop methods have been popular in the past. A small 2–10 μl drop of the protein solution is placed on a glass slide or in a microwell, and suspended over a well containing a concentrated aqueous solution of a precipitant, say, ammonium sulfate, and allowed to equilibrate over a period of days to months. During that time, water vapour diffuses from the protein drop to the precipitating solution, which causes the concentration of the protein to gradually increase, until it exceeds the solubility limit. The protein will slowly come out of solution and, with luck, will form an ordered crystal lattice as it does so. Crystals of different sizes will form in this process, but a regular crystal larger than about 0.2 mm has usually been large enough in the past (Fig. 2A), and new methods are being developed to solve the structures of proteins in crystals as small as 10 μm. Kits that provide an arrayed matrix of solutions of varying composition and pH are now commercially available and greatly facilitate the crystallization process. Crystals of large soluble proteins can be obtained through these techniques, although membrane proteins still present major challenges, as detailed later in this chapter. The protein crystal is subsequently placed in an X-ray beam, and the resulting diffraction pattern (Fig. 2B) provides the information on the relative dispositions of atoms in 3D space that allows the structure of the protein to be solved. Heavy metal or other compounds that react with specific amino acid residues may be diffused into the crystal to provide a ‘reference point’ within the protein, and the 3D structure is expressed as a series of coordinates, bond distances and angles between individual atoms in the molecule. The structure so generated from a protein crystal is generally considered to be a static structure, although some indication of protein movement can be obtained through thermal B factor values for various regions of the protein chain.
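The link between thermal B factor values and atomic motion can be made concrete with the standard isotropic relation B = 8π²⟨u²⟩, where ⟨u²⟩ is the mean-square displacement of an atom from its average position. The short sketch below is illustrative only: the function name and the example B values are assumptions and are not taken from the structures discussed in this chapter.

```python
import math


def rms_displacement(b_factor):
    """Convert an isotropic B factor (Å^2) to an r.m.s. displacement (Å).

    Uses the standard relation B = 8 * pi^2 * <u^2>.
    """
    if b_factor < 0:
        raise ValueError("B factor must be non-negative")
    return math.sqrt(b_factor / (8 * math.pi ** 2))


# Hypothetical B factors for a well-ordered core residue and a mobile loop.
for label, b in [("core", 15.0), ("loop", 60.0)]:
    print(f"{label}: B = {b:.1f} Å^2 -> r.m.s. displacement = {rms_displacement(b):.2f} Å")
```

Because the displacement scales with the square root of B, a fourfold increase in B corresponds to a doubling of the r.m.s. displacement. In practice B factors also absorb static disorder in the crystal, so they give an indication of mobility rather than a direct measurement of it.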

Fig. 2. (A) Single crystal of a barley β-D-glucan glucohydrolase enzyme in the hanging drop and (B) diffraction image of the β-D-glucan glucohydrolase crystal. The inset in the bottom-left corner shows diffraction intensities to 2.20 Å (8). Reproduced with permission of the International Union of Crystallography.

As mentioned above, molecular modelling programs are becoming more precise in their prediction of 3D structures from amino acid sequences of proteins, and are likely to continue to improve as they ‘learn’ from new crystal structures that are deposited in the databases. Programs such as Modeller implement comparative modelling by satisfaction of empirical spatial restraints and statistical analysis of the relationships between pairs of homologous structures, and produce a protein fold (9, 10). The stereochemical quality of the predicted structure can be assessed with programs such as PROCHECK (11) and Prosa 2003 (12), and programs such as ‘O’ (13) and various publicly available servers can be used to determine root-mean-square deviation values in the Cα backbone positions between the modelled structure and the template coordinates that were used for the modelling process.
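The root-mean-square deviation in Cα positions is a simple quantity to compute once the two structures have been superposed and their residues paired. The following is a minimal sketch under those assumptions; it does not perform the superposition itself, and the function name and coordinates are hypothetical rather than taken from any of the programs cited above.

```python
import math


def ca_rmsd(model_ca, template_ca):
    """Root-mean-square deviation (Å) between paired Cα coordinates.

    Both arguments are equal-length sequences of (x, y, z) tuples that are
    assumed to be already superposed and listed in matching residue order.
    """
    if len(model_ca) != len(template_ca) or not model_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(model_ca, template_ca)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(model_ca))


# Hypothetical coordinates for three residues (not taken from any PDB entry).
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
template = [(0.0, 0.0, 0.5), (3.8, 0.5, 0.0), (7.6, 0.0, 0.0)]
print(f"Cα RMSD = {ca_rmsd(model, template):.3f} Å")
```

The value depends on which residues are included in the comparison, which is why RMSD values are normally reported together with the number of residues over which they were calculated.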

3. Genome Projects and the Annotation of Protein Function

Genome sequencing programs are providing extensive lists of structural genes from the selected organisms. For example, the Arabidopsis, rice and poplar genome sequences enable protein primary sequences to be deduced for all genes but, in the immediate future, attention is likely to shift from genome sequencing, which is becoming increasingly cheap and fast, towards the accurate assignment of functions to all individual genes in a genome (14). The Arabidopsis 2010 project had this as a central objective and, as mentioned above, the availability of protein structural data can greatly facilitate the accurate prediction and annotation of protein function. Currently, the functions of about 50% of genes in these genomes can be predicted from sequence similarities with proteins from other organisms. However, the evidence for function is often circumstantial or inferred from a third source and in most cases direct biochemical evidence for probable function is not available. The number of genes for which there is direct experimental evidence for function is therefore likely to be considerably less than 5% of all genes in these genome sequences.
The homology modelling programs used in the annotation of protein function are not always robust, and if short overlapping sequences are used the identities might be completely incorrect (15). Within this predictive framework, a gene or protein sequence is generally aligned against other genes and proteins in the databases. When a homologous sequence is detected that has a statistically significant similarity to the queried sequence, the function of the unknown gene is inferred, based on the ‘known’ function of the homologous sequence. These predictions are generated to gain a first-order approximation of the molecular function of the investigated protein (15, 16). However, standard methods of protein function prediction produce systematic errors in these comparisons. Gene duplication is perhaps the single greatest contributor to errors in function predictions by homology, although domain shuffling that precludes a global sequence alignment and speciation of the function of orthologous gene products are also potential sources of error.
In the process of functional annotation of genes, it will be necessary to reduce our dependence on sequence similarities and seek more reliable methods for the analysis of protein function. Sophisticated bioinformatics approaches, which include 3D structural alignments to identify distant relationships between proteins, analyses of protein domains, hidden Markov models (17) to improve sequence alignments, automated predictions of catalytic and binding sites and analyses of regions responsible for changes in specificity and thus function, are currently under development. These methods are defined as ‘structural phylogenomics’, and are likely to result in significant improvements in gene identification (15), but must again be considered as a tool to assist in the assignment of protein function that cannot replace the direct analysis of protein function. As indicated in the subheading above, structural approaches to defining protein functions are in general time-consuming and technically difficult and the structures of very few plant proteins are available (Fig. 1). However, the incorporation of structural information into protein function prediction can greatly enhance the accuracy of genome annotations and in some cases could reveal possible functions that are not revealed by amino acid sequence alone (15, 16).
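The homology-based annotation transfer described above, together with two of its weaknesses (short overlapping alignments and reliance on annotations that are themselves predictions), can be sketched in a few lines. This is a hedged illustration, not the procedure of any particular annotation pipeline: the `Hit` record, its field names, the thresholds and the example hits are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hit:
    """One database match for the query sequence (all fields hypothetical)."""
    annotation: str
    evalue: float            # statistical significance of the alignment
    query_coverage: float    # fraction of the query covered, 0.0 to 1.0
    experimental: bool       # True if the hit's function was measured directly


def infer_function(hits: List[Hit],
                   max_evalue: float = 1e-5,
                   min_coverage: float = 0.7) -> Optional[Tuple[str, str]]:
    """Return (annotation, evidence label) from the best acceptable hit, or None."""
    acceptable = [h for h in hits
                  if h.evalue <= max_evalue and h.query_coverage >= min_coverage]
    if not acceptable:
        return None
    best = min(acceptable, key=lambda h: h.evalue)
    evidence = ("inferred from experimentally characterized homologue"
                if best.experimental
                else "inferred from a predicted annotation")
    return best.annotation, evidence


hits = [
    Hit("protein kinase", 1e-40, 0.15, True),        # short overlap: rejected
    Hit("beta-D-glucosidase", 1e-30, 0.92, False),
    Hit("hypothetical protein", 1e-3, 0.95, False),  # not significant: rejected
]
print(infer_function(hits))
```

Recording the evidence label alongside the annotation keeps a predicted function distinguishable from one supported by direct biochemical data, which is the distinction the text draws between the roughly 50% of genes with predicted functions and the much smaller fraction with experimental evidence.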
Phylogenetic or phylogenomic analyses that lead to inference of protein molecular function consist of a series of subtasks, such as identification of homologous sequence clusters, multiple sequence alignments, phylogenetic tree constructions, overlaying of annotations on the tree topology, discriminating between orthologous and paralogous sequences and finally inferring the function of a protein based on the orthologues identified by this process and the annotation data retrieved. To this end, it might be assumed that orthologues have a greater functional similarity than paralogues, although the accuracy of functional inferences also depends on evolutionary distance and functional attributes of the analyzed sequences (15). This means that some protein attributes, such as 3D folds, could persist across very large evolutionary distances, while other properties such as substrate specificity can easily be modified by introducing a small cluster of amino acid substitutions in strategic positions on the protein fold. After the phylogenomic analyses are completed, phylogenetic tree topologies allow the analysis of branch points, which are indicative of speciation or duplication. The trees can be overlaid with biochemical and structural data and merged with the information contained in protein categorizing systems such as SCOP (18) and CATH (19), and with Web-based meta-servers such as the biennial Critical Assessment of protein Structure Prediction, CASP (20), or LiveBench (21).
One of the most important parameters during phylogenetic tree evaluation and the identification of sub-trees has always been the assignment of confidence levels at different nodes of the phylogenetic tree; this task can be done through bootstrap analysis or computing of p values (22). Looking to the future of phylogenomic analysis, the greatest improvement in this field will undoubtedly take place when we gain access to validated biological data, thus gaining accuracy in phylogenomic inference (15); the latter allows in silico prediction of protein functions with high accuracy. In this context, the protein structure prediction field is one of the most advanced fields in computational biology, and benchmarks such as the Structure Prediction Meta Server (23), the MetaPP server (24) and SeqAlert (Bioinformatics and Biological Computing, Weizmann Institute of Science, Israel) represent a challenging set of computational tools. The Structure-Function Linkage Database (25), which links 3D protein structures with detailed information on chemical reactions, is another important asset. In recognition of the importance of 3D structures in defining protein function and the mechanisms of action of gene products, one of the objectives of the National Institutes of Health Protein Structure Initiative in the USA is to define 3D structures of all the proteins in human or other genomes (26). Perhaps the next challenge on the journey of protein structure–function assignments will be to describe how individual proteins of genomes interact with each other and with other biological molecules, at the molecular level, in space and in real time.
These descriptions can be performed computationally using predictive approaches, and could not only expand and complete our understanding of biology from atoms to cells, but also bridge the gap between genome sequencing, functional genomics, proteomics, structural genomics and systems biology (14, 27).
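The bootstrap analysis mentioned above for assigning confidence levels to tree nodes rests on resampling alignment columns with replacement and asking how often a grouping is recovered. The sketch below is a deliberately simplified stand-in: instead of rebuilding a full tree for every replicate, it only asks how often a chosen sequence remains the nearest neighbour of the query by uncorrected p-distance. The function names, the toy alignment and the number of replicates are assumptions made for illustration.

```python
import random


def p_distance(seq_a, seq_b, columns):
    """Fraction of the sampled alignment columns at which two sequences differ."""
    return sum(seq_a[i] != seq_b[i] for i in columns) / len(columns)


def bootstrap_support(alignment, query, partner, replicates=200, seed=0):
    """Fraction of bootstrap replicates in which `partner` is nearest to `query`.

    `alignment` maps sequence names to aligned strings of equal length.
    """
    lengths = {len(s) for s in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must have equal length")
    n_cols = lengths.pop()
    rng = random.Random(seed)
    others = [name for name in alignment if name != query]
    agree = 0
    for _ in range(replicates):
        # Resample alignment columns with replacement.
        columns = [rng.randrange(n_cols) for _ in range(n_cols)]
        nearest = min(
            others,
            key=lambda name: p_distance(alignment[query], alignment[name], columns),
        )
        if nearest == partner:
            agree += 1
    return agree / replicates


# Hypothetical toy alignment (not real sequences).
toy = {
    "query": "MKTAYIAKQRGLSV",
    "seqA":  "MKTAYLAKQRGLTV",
    "seqB":  "MRSAYLGKHRGITV",
}
print(f"support for query+seqA: {bootstrap_support(toy, 'query', 'seqA'):.2f}")
```

A grouping that is recovered in most replicates is supported by many alignment columns, whereas one that depends on a handful of positions will be lost in a large fraction of replicates.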

4. Detailed Knowledge of Protein Function and Mechanisms of Action

Structural biology has the potential not only to provide a detailed understanding of the specificities and mechanisms of protein function, but also to provide broader insights into cellular function. Here, we describe how 3D structural information provided a sound molecular explanation for the broad specificity of a barley β-D-glucan glucohydrolase and how it was used to define in detail the molecular mechanisms for substrate binding and catalysis. The barley β-D-glucan glucohydrolase is a member of the very large GH3 family of enzymes, in which there are more than 1,400 members, most of which are annotated as β-D-glucosidases, β-D-xylosidases and α-L-arabinofuranosidases (28). The barley β-D-glucan glucohydrolase is related to enzymes annotated as β-D-glucosidases but is more correctly called a nonspecific β-D-glucan exohydrolase (29). The barley β-D-glucan glucohydrolase, isoenzyme ExoI, exhibits broad substrate specificity, in that it can hydrolyze a broad range of substrates with (1,2)-, (1,3)-, (1,4)- and (1,6)-β-D-glucosidic linkages.
The enzyme was purified from extracts of young barley seedlings (29) and crystallized (8), and the 3D structure was determined (30, 31). It adopts a globular, two-domain modular structure (30, 31), as shown in Fig. 3. The first 357 amino acid residues represent the first domain and fold into an (α/β)8 TIM-barrel conformation. This NH2-terminal domain is joined by a 16 amino acid helix-like linker to the second domain, which consists of residues 374–559 and which forms a six-stranded β-sheet flanked on either side by three α-helices; this second domain therefore constitutes an (α/β)6 sandwich. An antiparallel loop of 42 amino acid residues is located at the COOH-terminus of the enzyme. A pocket about 13 Å in depth at the interface of the two domains has been identified as the active site of the enzyme (Fig. 3). The dimensions of the pocket indicate that it could accommodate two glucosyl residues (30, 31), and this is compatible with subsite mapping data (32).
The enzyme binds the various substrates at their non-reducing ends (32), and the restricted depth of the dead-end pocket, coupled with the relative disposition of the catalytic amino acid residues, ensures that only the last glycosidic linkage at the non-reducing end of the substrate can be hydrolyzed. Thus, the exo-action pattern of the enzyme can be explained in structural terms. The shape of the active site in the barley β-D-glucan glucohydrolase can be contrasted with those of endohydrolases, which usually have open cleft- or tunnel-like topologies that allow the enzyme to bind at multiple positions across the polysaccharide substrate and to hydrolyze internal glycosidic linkages (33).

Fig. 3. (A) Stereo representation of native β-D-glucan glucohydrolase with bound glucose (in CPK colours), where domain 1, linker, domain 2 and the COOH-terminal antiparallel loop are shown in yellow, green, blue and magenta, respectively. (B) Molecular surface drawing of native β-D-glucan glucohydrolase (colours as specified in panel A) with two occupied Asn221- and Asn498-linked glycosylation sites (CPK colours) (30).

The structural analysis of the crystallized barley β-D-glucan glucohydrolase also revealed that a glucose molecule is bound in the active site pocket of the enzyme. The glucose is presumed to be the product of the enzyme-catalyzed reaction that has not been released after hydrolysis is completed (30, 31). It is remarkable that this glucose molecule remains bound to the –1 subsite of the active site pocket, even after extended protein purification procedures that included ion-exchange, hydrophobic and size-exclusion chromatography and chromatofocussing and after crystallization for several weeks. The molar occupancy of glucose in the enzyme was always approximately one. The crystallized enzyme with the bound glucose product represents a rare opportunity to study the structural events that occur when the bound product is eventually displaced as the incoming substrate molecule approaches the active site (30, 31). This possibility is explored further below. Another unexpected opportunity afforded by the binding of the glucose product to the barley β-D-glucan glucohydrolase was that details of amino acid residue interactions with the glucose bound at subsite –1 could be defined in exact atomic terms (30, 31). When the crystals were subsequently soaked in an array of specifically designed low-molecular mass substrate analogues and inhibitors, new crystal structures of the barley β-D-glucan glucohydrolase were solved, both from native enzyme and from the enzyme in complex with inhibitors and substrate analogues, and provided data to explain substrate specificity, the mechanism of catalysis and the role of domain movements during the catalytic cycle, at the atomic level.
The propensity of the enzyme to hydrolyze a broad range of substrates with (1,2)-, (1,3)-, (1,4)- and (1,6)-β-D-glucosidic linkages could be rationalized from crystal structures of the enzyme in complex with non-hydrolysable S-glycoside substrate analogues, and from molecular modelling. Thus, the two non-hydrolysable S-glycoside substrate analogues 4I,4III,4V-S-trithiocellohexaose (31) (PDB accession code 1IEX) and 4′-nitrophenyl S-(β-D-glucopyranosyl)-(1,3)-(3-thio-β-D-glucopyranosyl)-(1,3)-β-D-glucopyranoside (1J8V) (32) were synthesized and soaked into enzyme crystals. The 3D structures of the thio-analogue-enzyme complexes demonstrated that both ligands displaced the glucose molecule that is normally bound in the active site pocket of the enzyme, and that the two non-reducing end residues of the inhibitors are positioned in the pocket at the –1 and +1 subsites. While the glucosyl residue at subsite –1 is tightly constrained through extensive hydrogen bonding with multiple amino acid residues, the glucosyl residue at subsite +1 is sandwiched between two large tryptophan residues (Fig. 4). The hydrophobic interactions between the glucosyl residue at subsite +1 and the tryptophan residues are not as precise as the multiple hydrogen bonds at subsite –1, and hence the glucosyl residue at subsite +1 is not so stringently constrained.
As a result, differences in the spatial positions of non-reducing and penultimate glucosyl residues in, say, (1,3)- and (1,4)-β-D-linked glucoside substrates can be accommodated through the flexibility of binding at the +1 subsite. Moreover, the active site pocket is only deep enough to bind two glucosyl residues, so that the remainder of polymeric substrates must project away from the active site pocket, without making contact with the enzyme surface (32). One would anticipate that the overall conformation of β-D-polyglucosyl substrates, which would ultimately be defined by linkage types between the β-D-glucosyl residues, would not greatly affect the ability of the enzyme to bind the two non-reducing residues. This would explain the broad specificity of these enzymes with respect to their ability to hydrolyze substrates with (1,2)-, (1,3)-, (1,4)- and (1,6)-β-D-glucosidic linkages (32).

Fig. 4. Stereo representation of the active site of barley β-D-glucan glucohydrolase with bound S-cellobioside and S-laminaribioside moieties. The sugar moieties are presented as sticks and atoms are coloured grey (carbons; cyan for S-laminaribioside), orange (sulfur) and red (oxygens). Transparent yellow and green colours represent the molecular surfaces of domains 1 and 2, respectively. The structures were superposed over the Cα atoms of 14 active site amino acid residues, with the root-mean-square deviation value of 0.158 Å in the Cα positions. The entrance to the active site is located towards the lower right-hand corner (32).

The possible binding conformations of the two remaining positional isomers of β-D-diglucosides, namely, those of the (1,2)-linked disaccharide sophorose and the (1,6)-linked disaccharide gentiobiose, were investigated by molecular modelling, based on the crystal structures of sophorose (34) and gentiobiose (35) and the 3D structure of the barley β-D-glucan glucohydrolase (32). It was envisaged that these oligosaccharides are held in place by almost exactly the same amino acid residues as in the S-laminaribioside- and S-cellobioside-enzyme complexes. From the superpositions of the S-laminaribioside- and the S-cellobioside-enzyme complexes and from the sophorose- and gentiobiose-enzyme models, it was possible to derive a structural rationale for the broad substrate specificity of the enzyme. For all ligand-enzyme structures, the glucopyranosyl residues at the –1 subsite are bound in almost identical positions. In contrast, the glucopyranosyl residues of the S-laminaribioside and gentiobiose and S-cellobioside and sophorose moieties occupy subsite +1, which is located between the large hydrophobic Trp286 and Trp434 residues. In the case of the S-laminaribioside and gentiobiose moieties, the apolar face of the glucopyranosyl residue at subsite +1 is geometrically complementary with the pyrrole ring of Trp286, while the polar face of the glucopyranosyl residue positions itself over the phenyl ring of Trp434. In the S-cellobioside and sophorose moieties, the polar and the apolar faces of the glucopyranosyl residue at subsite +1 are in contact with the pyrrole ring of Trp286 and the phenyl ring of Trp434, respectively. That is, the four positional sugar isomers can adopt two different orientations with respect to the phenyl/pyrrole rings of Trp286 or Trp434. Thus, it follows that if a substrate binds to the broad specificity β-D-glucan glucohydrolase, substrate binding will be largely independent of polysaccharide conformation and the glycosidic linkage positions between adjacent non-reducing-end β-D-glucosyl residues, hence explaining why the β-D-glucan glucohydrolases have broad substrate specificities (Fig. 4).
In summary, the crystallographic investigations of the barley β-D-glucan glucohydrolase in complex with inhibitors and substrate analogues allowed us to explain structure–function relationships and the basis of the enzyme’s broad substrate specificity. In addition, these analyses of ligand–enzyme interactions involved in binding of inhibitors and substrate analogues could suggest potentially novel organo-synthetic avenues for future improvements in inhibitor design. These considerations could be important, for example, for the design of highly efficient tailor-made inhibitors of the plant β-retaining glycoside hydrolases, which could control vital biological events in life cycles of economically important plants or in certain situations be used as herbicides (36, 37).

5. Defining the 3D Structures of Membrane Proteins Remains Difficult

As cell biologists attempt to explain complex processes in precise molecular and atomic terms, 3D structural information will become increasingly important. However, many of the key processes in plant cells occur in membranes, including the reception and transduction of hormone and other signals, ion transport and the biosynthesis of cell wall polysaccharides. Applying structural biology technologies to membrane proteins is problematic, but the problems are not insurmountable. It will be necessary to develop robust new methods or modify existing methods for protein expression and for the crystallization of membrane proteins. The prerequisite for 3D structure determination is the ability to produce pure, properly folded and active proteins in milligram quantities for crystallization. Both in vivo and in vitro methodologies are available for membrane protein production (7). In vitro methods that only contain purified cell lysate replication machinery have the advantage of producing very pure protein very rapidly. Cryo-electron microscopy (CEM) molecular imaging and 2D crystallization represent the only viable approach to structure determination of membrane proteins that are refractory to X-ray diffraction techniques. Recent successes in obtaining 3D structures from crystals of polytopic proteins from the inner membranes of bacteria and mitochondria are encouraging, as are advances in the technologies for obtaining crystals of membrane proteins (38, 39).

For the production of 2D crystals for CEM studies, it is necessary to transfer the expressed membrane protein from the cell membrane, where it is associated with endogenous membrane proteins and other contaminants, to an artificial membrane where the environment is more amenable to crystallization. From extensive experience in the purification of viral membrane proteins from influenza, parainfluenza and measles viruses for X-ray diffraction and CEM structural studies, robust techniques for transferring proteins expressed in the membranes of a range of different cell types into artificial liposomes have been developed (40, 41). Proteins embedded in these liposomes are amenable to 2D structural analysis. Recombinant cells expressing the membrane protein are grown in culture, harvested by centrifugation and lysed, and the membranes are pelleted. Membrane pellets are diluted in buffer with fresh protease inhibitors, and detergents such as sodium cholate, n-octyl β-D-glucopyranoside or Triton X-100 are added to solubilize the membrane-bound proteins. If the expressed protein (or an associated protein) contains a poly-His tag, it can be easily purified on nickel-nitrilotriacetic acid columns. Further purification can be achieved using ion-exchange chromatography or other procedures. Subsequently, mixed detergent–lipid micelles can be prepared by the addition of lipids of known composition, and the detergent can be removed to below its critical micellar concentration by dialysis (42). This essentially transfers the expressed protein into liposomes. Alternatively, the film hydration method can be used (43), where lipids of carefully defined composition are dried from a chloroform solution and buffer containing the protein is added. The suspension is subjected to numerous freeze–thaw cycles to generate the liposomes.
The 2D crystallization of full-length membrane-bound proteins can be achieved either on a lipid layer at an air–water interface (44) or through lipidic bilayer cubic phase (in cubo) technology (45, 46). The former has been used for a number of proteins and is based on the specific interaction between the protein and lipid ligand inserted in a planar lipid film at an air–water interface (47). Significantly, the technique has now been applied successfully to purified, tagged, bacterial polytopic membrane proteins, including the two Escherichia coli inner membrane proteins melibiose permease (48) and lactose permease (49), each of which has 12 transmembrane helices. The latter, in cubo method of Landau and Rosenbusch (50), has been successful for membrane proteins such as halorhodopsin (51), bacteriorhodopsin (52) and a cyanobacterial photosystem II protein (53). The method involves dispersing the detergent-solubilized protein in lipid, typically a monoacylglycerol. In this process, the cubic phase self-assembles. A precipitant is added to trigger crystal nucleation and growth. A commercial screen solution series is used for the crystallization trials (45).

Fig. 5. Integral membrane aquaporin protein in open (magenta) and closed (cyan) conformations. The position of a flexible hydrophobic loop that occludes the water pore from accessing the cytosol is visible in the upper part of the figure. Water molecules crossing the pore are shown in dark blue. The structures were superposed over the Cα atoms of 220 amino acid residues, with the root-mean-square deviation value of 0.45 Å in the Cα positions (54).

The accumulation of membrane protein in artificial membranes promotes the assembly of membrane proteins into a 2D lattice, without contamination by other membrane-associated proteins. This is required to establish suitable conditions for the 2D and 3D crystallization studies that support CEM and X-ray diffraction structural studies, respectively. The incorporation of a membrane protein into an artificial membrane enables the assembly of 2D rafts of molecules suitable for electron diffraction analysis. The structural determinations can be carried out using CEM for single-molecule protein imaging (resolution 10–7 Å) and electron diffraction for 2D crystal structures (resolution 5–4 Å), while X-ray diffraction can be used for 3D crystals (resolution 4–2 Å). The structure of an integral membrane aquaporin from spinach with several water molecules crossing the pore is shown in open and closed conformations (54) in Fig. 5. The most striking difference between the two structures is the position of a flexible hydrophobic loop that occludes the water pore from accessing the cytosol.

6. Definition of Protein–Protein Interactions

Structural biology is expected to have an increasingly important role in defining protein–protein interactions of the type that are involved in signal transduction, transport processes and gene activation. Protein–protein interactions in enzymes with multiple subunits, such as ATPases, in light harvesting protein complexes and in proteins that modulate diverse nuclear functions, have already been defined at the 3D level, and have provided important insights into the mechanisms of action of the protein complexes (55–57). In a recent landmark paper, the 3D structure of the first plant hormone receptor was defined (3). Although functional genomics and other technologies had identified the protein participants in the process, the structural work revealed an unexpected mechanism of auxin perception. Auxins are a relatively heterogeneous group of phytohormones that regulate diverse aspects of plant growth and development by promoting the degradation of transcriptional repressors known as Aux/IAA proteins, through the action of a ubiquitin protein ligase. The auxin receptor was shown to be an F-box subunit, designated TIR1, of the ubiquitin protein ligase. Auxins bind directly to the TIR1 protein and this promotes binding of TIR1 to the Aux/IAA transcriptional repressor protein. The auxins appear to act as a molecular ‘glue’ between hydrophobic regions of the two proteins, and this results in targeted ubiquitination and proteolytic degradation of the transcriptional repressor, followed by the expression of de-repressed genes (3). The crystal structures of the protein complexes also accounted for the variety of auxin-like molecules that can cause gene activation in this way (3). It is likely that the definition of other protein–protein interactions in the future through structural biology will reveal the mechanisms of many important cellular processes.

7. Rational Redesign of Protein Properties or Function

The availability of 3D structural information presents opportunities for the rational redesign of proteins for enhanced performance in a particular application. In the example given below, the heat stability of a barley enzyme was increased in attempts to prolong its activity at temperatures used in the malting and brewing processes (58). The 3D structures of barley (1,3;1,4)-β-glucanase isoenzyme EII (EC 3.2.1.73) and (1,3)-β-glucanase isoenzyme GII (EC 3.2.1.39) had been solved to high resolution by X-ray crystallography (59) and showed that the Cα chains of the two enzymes were superimposable with a root-mean-square deviation value of 0.65 Å over 278 amino acid residues. This was taken as evidence to suggest that the enzymes arose through divergent evolution of a common ancestral enzyme (59, 60), despite the fact that the (1,3)-β-glucanases hydrolyze polysaccharides found in fungal cell walls, while the (1,3;1,4)-β-glucanases function in plant cell wall metabolism. The (1,3)-β-glucanases have evolved to be significantly more stable than the (1,3;1,4)-β-glucanases, probably as a consequence of the hostile environments imposed on the plant by invading microorganisms (58).
Redesigning the barley (1,3;1,4)-β-glucanase to increase its heat stability was motivated by the potential of a thermostable form to retain its activity during kilning and malt extraction procedures. If barley expressing a thermostable (1,3;1,4)-β-glucanase could be engineered, many of the filtration problems and other difficulties associated with incomplete hydrolysis of the cell wall polysaccharide (1,3;1,4)-β-glucan in the malting and brewing processes (61) might be overcome. To increase the thermostability of barley (1,3;1,4)-β-glucanase isoenzyme EII, amino acid substitutions were introduced into the wild-type form of the enzyme by site-directed mutagenesis of the corresponding cDNA, and the relative thermostabilities of the resulting mutant enzymes were measured. The amino acid substitutions chosen were based on structural comparisons with the more stable barley (1,3)-β-glucanases, and with other higher plant (1,3;1,4)-β-glucanases. Three of the resulting mutant enzymes showed increased thermostability compared with the wild-type (1,3;1,4)-β-glucanase. The largest increase in stability was observed when the histidine at position 300 was changed to a proline (mutant His300Pro), a mutation that was likely to decrease the entropy of the unfolded state of the enzyme. Similar approaches, based on 3D structural information, have been taken to increase the catalytic efficiency of a barley α-amylase through site-directed manipulation of amino acid residues at the enzyme’s active site (62), and these types of directed changes to any protein can be contemplated if its 3D structure is known and the molecular basis of its substrate specificity and catalytic action has been defined.
Site-directed mutagenesis of targeted amino acid residues can also be used to change or broaden the substrate specificity, or increase the efficiency of enzymes or proteins such as ion transporters.

8. High-Throughput Techniques to Link Structural Biology to Functional Genomics

The high-throughput discovery of genes in plant functional genomics programs around the world has placed pressure on protein chemists and crystallographers to come up with matching high-throughput procedures for the expression and purification of proteins in sufficient quantities for crystallization. Furthermore, the development of rapid, small-scale methods for identifying appropriate crystallization conditions will be essential, as will a capability for the collection of good quality diffraction data from very small crystals. The

successful development of these high-throughput technologies would result in a rapid increase in the number of 3D structures in the databases. Access to these structures would, in turn, allow biologists who would not necessarily identify themselves as structural biologists, or even as protein chemists, to enter the world of the three-dimensionality of biological molecules, and would assist them in linking the spatial atomic distributions of macromolecules to the biological function of the gene product. Substantial developments have had to occur not only in the field of 3D structure determination but also in the closely related fields of protein purification, crystallization and molecular modelling, all of which are necessary for the structural determination of biological macromolecules. Key technical capabilities that have enabled unprecedented progress in protein crystallography in the past few years include the automation of crystallization and diffraction generation, robotics and powerful high-throughput data processing systems (63). The emerging automated technologies for structure determination and structural genomics will now permit larger and more complex biological systems to be characterized at the 3D structural level. They will further allow the real-time dynamics of biological interactions in complex systems to be examined in four dimensions. In this subheading, we will summarize recent progress in high-throughput protein production, macromolecular crystallography and molecular modelling. Despite this progress, the high-throughput production of proteins for 3D structure determination remains a major bottleneck in modern structural biology. Unfortunately, individual proteins usually require specific conditions for correct folding and, in most cases, the only way to achieve this is by trial and error, using different host systems (7).
Nevertheless, high-throughput techniques are under development (64), using cell-free and other selected heterologous expression systems including yeast (65), insect cells and E. coli (66). High-throughput approaches to operations such as crystallization, diffraction or data collection, phasing, model building, refinement and coordinate data deposition are also under development. Crystallization has traditionally been the most laborious step in protein crystallography, because it has relied on trial-and-error experimentation. However, recent advances in dispensing robotics technology have resulted in the design and marketing of automated crystallization platforms (67). These technologies have been widely used in high-throughput crystallization trials (67, 68), and interactive online crystallization databanks for screening strategies also provide a valuable resource for the technology (69). Examples of automated crystallization and observation systems include Cartesian Technologies Honeybee (PixSys4200) crystallization robotics and QIAGEN BioRobot Rapid Plate dispensing robots. However, a fundamental requirement

for future generation robotic systems is that they must become fully automated, and some of these fully automated devices have already started emerging, such as the molecular crystallization/observation robotic system HTS-80 (67). The next step in structure determination is the collection of diffraction data. We are now witnessing the implementation of automated data collection systems on several third-generation synchrotron sources. For example, these automated systems have been installed on seven insertion device beamlines at the European Synchrotron Radiation Facility (ESRF), as a part of the SPINE (Structural Proteomics In Europe) development of an automated structure-determination pipeline (70), and they are also installed on the Australian Synchrotron that is under construction in Melbourne. These systems allow remote interactions with beamline-controlled systems and automatic sample mounting, alignment, data collection and data processing (70). The central problem in macromolecular structure determination by X-ray crystallography is the solution of the phase problem. When the phase problem has been solved, the remaining steps of model building, refinement and coordinate data deposition can usually be accomplished in a reasonable time frame. Although the data collected from a single crystal consist of structure factor amplitudes, a critical component, namely the phase associated with each of the amplitudes, is lost and cannot be recorded directly. The two principal experimental methods for determining the phases are single or multiple isomorphous replacement and single- or multiple-wavelength anomalous diffraction. During the past 5 years, substantial steps have been taken to optimize and automate macromolecular structure determination, including phase information calculations. To name a few, software packages such as SOLVE/RESOLVE (71), PHENIX (72) and ARP/wARP (73) are complete, efficient and expeditious.
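The loss of phase information described above can be illustrated numerically. The following is a toy sketch in Python (NumPy), assuming a one-dimensional 'electron density' sampled on a grid. The array `density` and all of its parameters are invented for illustration and do not correspond to any real crystal. The sketch shows that Fourier amplitudes combined with the correct phases reproduce the density, whereas the amplitudes alone do not.

```python
import numpy as np

# Hypothetical one-dimensional "unit cell" containing two atoms of different weight.
n_points = 64
x = np.arange(n_points)
density = (
    np.exp(-0.5 * ((x - 20) / 1.5) ** 2)
    + 0.6 * np.exp(-0.5 * ((x - 45) / 1.5) ** 2)
)

# The Fourier transform of the density plays the role of the structure factors.
structure_factors = np.fft.fft(density)
amplitudes = np.abs(structure_factors)   # recoverable from measured intensities
phases = np.angle(structure_factors)     # not recorded in the experiment

# Amplitudes plus the correct phases give back the original density.
map_with_phases = np.fft.ifft(amplitudes * np.exp(1j * phases)).real

# Amplitudes with all phases set to zero give a different, uninterpretable map.
map_zero_phases = np.fft.ifft(amplitudes).real

error_with = float(np.max(np.abs(map_with_phases - density)))
error_zero = float(np.max(np.abs(map_zero_phases - density)))
print(f"max error with correct phases: {error_with:.2e}")
print(f"max error with zero phases:    {error_zero:.2e}")
```

Isomorphous replacement and anomalous diffraction experiments are, in effect, ways of estimating the `phases` array that cannot be measured directly.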
These programs have simplified the tasks of finding heavy atom sites, calculating accurate phases, identifying non-crystallographic symmetry, implementing density modification and automating the model building process. Most of these programs are integrated within the CCP4 graphical user interface (74). Web servers and interfaces are also available for some of these tasks. For example, the interfaces TEXTAL (75) and FINDMOL (76) allow the interpretation of electron density maps to be automated.
Finally, high-throughput approaches are under development for molecular modelling. Structural information is increasingly used to discriminate between proteins or to classify them into related groups. For example, the SCOP (18) and CATH (19) protein categorizing systems are two resources for the structural classification of proteins. Genomics programs in organisms as diverse as bacteria, yeast, plants and humans are generating nucleotide sequence data at an unprecedented rate, and genes encoding previously unknown proteins will certainly be discovered in these genome sequencing programs. As part of the process of assigning a potential function to a particular gene product, alignments of primary sequences of unknown enzymes with DNA and protein databases allow the identification of many genes, and the use of 2D hydrophobic cluster analysis procedures will enhance this process, especially where sequence similarities are low (77). Furthermore, proteins can now be identified rapidly in high-throughput investigations of genomes by automated, comparative 3D structural modelling programs (10, 78–80). Typically, the modelling databases MODBASE (79) and FAMSBASE (80) contain annotated comparative or homology protein structural models for all available protein sequences that can be matched to at least one known protein structure. The MODBASE database (79) contains more than 3 million models for domains of more than 1 million unique protein sequences listed in the UniProt database, and the FAMSBASE database (80) currently contains ~370,000 models derived from 276 species and 111 phage genomes. These databases are regularly updated and provide a wealth of structural information for biologists. The models are typically constructed through comparative modelling Web servers, where completely automated software pipelines use many different template structures and sequence–structure alignments. 
The MODBASE database further predicts ligand-binding pockets and protein–protein interaction sites (79). The availability of structural information for a particular class of proteins can be seen as a currency, with which the success of structural genomics can be measured (78).
For example, the availability of the crystal structure of the barley β-D-glucan glucohydrolase enzyme in the GH3 family of glycoside hydrolases (30) proved useful for comparative protein modelling, through which reliable models of other members of the GH3 family (currently containing ~1,600 members) could be constructed and potential differences in substrate specificity could be identified (28). However, comparative modelling should be applied and interpreted with caution when sequence identities between templates and targets fall within the ‘twilight zone’ of around 20% or less. For example, the structural data for the barley β-D-glucan glucohydrolase are not likely to be useful for rationalizing differences in substrate specificity between the family GH3 β-D-glucan glucohydrolases (32) and β-D-xylosidases (81). Members of the plant β-D-xylosidase-like group of enzymes in family GH3 also have α-L-arabinofuranosidase activity (81) and this could further complicate substrate specificity assignments. Hence, the structural information for barley β-D-glucan glucohydrolase might be useful for closely related enzymes but cannot be satisfactorily applied to more distantly related enzymes, such as the plant β-D-xylosidases. This limitation clearly applies more broadly to hydrolytic and other enzymes that are grouped into the same family, and an important requirement for glycosyl hydrolase classification in general is to isolate representative enzymes from individual subfamilies and to characterize their substrate preferences. Similarly, new crystal structures of plant and other enzymes within subfamilies of the GH3 group will greatly assist in the assignment of functions more broadly across the family. Nevertheless, careful analysis of sequences encoding substrate-binding regions might allow the discrimination of family GH3 members with β-D-glucosidase, (1,3;1,4)- and (1,3)-β-D-glucan exohydrolase, β-D-xylosidase and α-L-arabinofuranosidase activities. 
These considerations are of particular relevance for the annotation of cereal and other plant genomes, where structural comparisons might ultimately assist in defining functions of unknown genes and/or in defining the effects of nucleotide polymorphisms and allelic variation.
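The caution about the ‘twilight zone’ can be made concrete with a simple identity calculation. The following is a minimal sketch in Python, assuming that the template and target sequences have already been aligned, with gaps written as `-`. The function names, the cut-off constant and the toy sequences are illustrative assumptions. Identity is computed here over aligned positions where neither sequence has a gap, which is only one of several conventions in use.

```python
TWILIGHT_CUTOFF = 20.0  # percent, following the "around 20% or less" guideline in the text


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical residues over aligned, gap-free positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == "-" or res_b == "-":
            continue  # skip positions where either sequence has a gap
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        raise ValueError("no aligned residue pairs to compare")
    return 100.0 * identical / compared


def in_twilight_zone(aligned_a: str, aligned_b: str,
                     cutoff: float = TWILIGHT_CUTOFF) -> bool:
    """True if the pair falls at or below the chosen identity cut-off."""
    return percent_identity(aligned_a, aligned_b) <= cutoff


# Hypothetical aligned fragments, not real GH3 sequences.
template = "MKT-AYIAKQR"
target = "MKSGAY-AKHR"
identity = percent_identity(template, target)
print(f"{identity:.1f}% identity, twilight zone: {in_twilight_zone(template, target)}")
```

A pair flagged by `in_twilight_zone` would be a candidate for the more careful treatment described above, for example restricting the comparison to the substrate-binding regions rather than relying on a whole-sequence model.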

9. Emerging Methods in Structural Biology and Structural Genomics

In the next two decades, we are likely to witness a fusion of structural biology with functional genomics and cellular biology, as we work towards the definition of complex biological processes and the operation of cells as a whole. It can be confidently predicted that these developments will bring with them a need to use structural biology, biochemistry and computational modelling to define biological processes at an atomic level and on a time coordinate (82–84). It is also likely that, to place living systems within their natural set of physicochemical and environmental conditions, one will be compelled to take into account their so-called ‘dissipative’ or non-equilibrium states (84). In order to capture these non-equilibrium states, future generations of structural biologists will have to explore developments of so-called ‘investigative’, groundbreaking, structural approaches (85), which include a range of biophysical structural techniques that have already started emerging. We will briefly review some of the recent developments in these biophysical techniques for structural and functional analyses of proteins. Four-dimensional (4D) or Laue crystallography is an emerging technology that is being applied to the dissociation of the reaction product from the barley β-D-glucan glucohydrolases. Historically, the first structural approach to monitoring changes in protein structures, in real time, during a biological process has been Laue time-resolved crystallography or 4D crystallography (86–88). The technique has mainly been used on proteins that contain natural chromophores, such as haemoglobin (89),

myoglobin (90, 91), photoactive yellow protein (92, 93) or photosynthetic reaction centres of bacteria (94). In some instances, photolytic structural evolution has been observed over time intervals of around 100 ps, and it has been shown that picosecond dynamics are relevant to protein function (91). Let us now illustrate by example how time-resolved crystallography could be useful for the description of substrate-product trafficking in the barley β-D-glucan glucohydrolase. In this instance, and contrary to the previous examples where a chromophore is naturally contained in the protein molecule, the chromophore (or ‘photo-caged’ substrate analogue that must be monitored during the substrate-product trafficking) has to be tailor-synthesized by organo-synthetic chemistry. This particular enzyme was found to be suitable for Laue crystallography because previous structural studies showed that the barley β-D-glucan glucohydrolase retains the glucose molecule that is released from the non-reducing end of the substrate in its active site, until the next, incoming substrate molecule approaches (30, 31). The glucose product of the reaction remains in the –1 substrate-binding subsite of the enzyme through purification and crystallization, for periods of several months. Thus, if the dissociation of the glucose from the enzyme’s active site could be synchronized throughout the crystal, time-resolved Laue crystallography could be used to monitor, in real time, the product diffusing away from the substrate-binding site as the competing substrate analogue approaches the active site, and to define the chemical and conformational changes that occur as this happens (86, 95). To apply time-resolved Laue crystallography successfully, one first needs to determine the catalytic first-order rate constants of inactivation for inhibitors that are designed to displace the glucose from the active site.
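The role of a first-order rate constant in planning such an experiment can be sketched numerically. The example below, in Python, assumes simple first-order displacement of the bound glucose, so that the fraction of active sites still occupied at time t is exp(−kt). The rate constant `k_per_s` and the time points are hypothetical values chosen only for illustration and are not measurements for the barley enzyme.

```python
import math


def fraction_occupied(k_per_s: float, t_s: float) -> float:
    """Fraction of sites still holding glucose after t_s seconds (first-order decay)."""
    return math.exp(-k_per_s * t_s)


def half_life(k_per_s: float) -> float:
    """Time in seconds for half of the sites to release glucose."""
    return math.log(2.0) / k_per_s


k_per_s = 0.05  # hypothetical first-order rate constant, in s^-1

print(f"half-life: {half_life(k_per_s):.1f} s")
for t_s in (0, 10, 20, 40, 80):
    print(f"t = {t_s:3d} s  occupied fraction = {fraction_occupied(k_per_s, t_s):.3f}")
```

Under this assumption, a smaller rate constant lengthens the half-life in inverse proportion, which is why slowing the reaction widens the window in which successive diffraction data sets can be collected.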
If dissociation rates could be slowed, either through the design of non-optimal substrate analogues or through lowering the reaction temperature to 5–10 °C, then the time-resolved crystallography could be conducted in a time frame of, say, 10 ns, which would allow the collection of multiple diffraction data sets. Even so, the release of the glucose molecule has to be coordinated in all the enzyme molecules in the crystal, so that sufficiently sharp diffraction patterns are generated. To achieve this synchronous release of the glucose, photo-caged non-hydrolysable substrate analogues, such as the disaccharide 1-O-methyl-thio-gentiobioside, would be soaked into the crystals, and flash photolysis (96, 97) would be used to split the ester, or other, linkage that connects the caged group to the inhibitor. This ‘de-caging reaction’ results in the instantaneous and simultaneous release of free inhibitor throughout the crystal (96, 97). After the inhibitor is released synchronously in all unit cells of the crystal, Laue polychromatic diffraction data sets can be collected, using a synchrotron beam, under experimental conditions and at suitable

time intervals, depending on the rate of the reaction. However, the read-out time of the detectors, the flux of the synchrotron source and the stability of the crystals will potentially limit the ability to collect data in convenient time intervals. For example, typical β-D-glucan glucohydrolase crystals diffract to 2.20–2.60 Å in 20 s data collection time intervals at the synchrotron source, and the crystals have been shown to be stable under X-rays at room temperature (30). In the final step, the diffraction data are collected and integrated (98, 99), and the 3D structures are refined at every time interval (30–32, 36, 100). Other new technologies for structural biology include single-molecule diffraction imaging and microcrystalline powder diffraction, which depend on third- or fourth-generation synchrotron sources. The techniques rely on capturing images of samples in femtosecond time regimes, following photo-absorption of electrons that are ejected from samples by the probes used for imaging. In these instances, only one or a few differently oriented molecules are required to capture the structures of macromolecules or various intermediary states of macromolecules. It is remarkable that the reconstructed images of the samples, obtained directly from the coherent patterns by phase retrieval through oversampling, show no measurable damage, and are reconstructed at the diffraction-limited resolution (101). Microcrystalline powder diffraction techniques (102) can be used to investigate partially ordered biological materials, such as actin-myosin, where synchrotron small-angle X-ray diffraction has been used to capture instantaneous structures of the contractile apparatus of living Drosophila (103).

10. Concluding Remarks

We predict that structural biology will assume an increasingly important place in the analysis of the products of genes discovered in functional genomics programs and, in particular, in the description of the molecular basis for specificity and mechanisms of action of these gene products. Not only will structural biology be applied to the identification of gene function in functional genomics programs, but the technology is also likely to provide new information on important protein–protein interactions in cells, including those involved in transcriptional activation of genes, membrane transport, phytohormone perception and signal transduction. Important progress has already been made in a number of the latter processes, and the precise definition of the roles of protein–protein interactions in cellular function will continue to be a research priority in the immediate future. The

technical difficulties associated with X-ray crystallography will also remain a target of research attention. Robotic and small-scale handling of protein samples during the development of optimal crystallization conditions will greatly enhance our chances of obtaining crystals that diffract well, and high-energy micro-beam lines will enable the structural analysis of proteins that form small crystals. Associated efforts by protein chemists and functional genomics groups to develop robust and more rapid heterologous expression systems will be an important parallel activity for structural biologists. Exciting 4D techniques will undoubtedly be refined to allow enzyme action, protein–protein interactions and other molecular processes to be viewed in real time.

Acknowledgements

This work has been supported by grants from the Australian Research Council, the Grains Research and Development Corporation and the South Australian state government.

References

1. Chothia, C. (1992) One thousand families for the molecular biologist. Nature 357, 543–544.
2. Coutinho, P.M. and Henrissat, B. (1999) Carbohydrate-active enzymes: an integrated database approach, in Recent Advances in Carbohydrate Bioengineering (Gilbert, H.Y., Davies, G., Henrissat, B., and Svensson, B., eds.), The Royal Society of Chemistry, Cambridge, pp. 3–12.
3. Tan, X., Calderon-Villalobos, L.I.A., Michael, S., Zheng, C., Robinson, C.V., Estelle, M., and Zheng, N. (2007) Mechanism of auxin perception by the TIR1 ubiquitin ligase. Nature 446, 640–645.
4. Sali, A., Glaeser, R., Earnest, T., and Baumeister, W. (2003) From words to literature in structural proteomics. Nature 422, 216–225.
5. Hubbell, W.L., Gross, A., Langen, R., and Lietzow, M.A. (1998) Recent advances in site-directed spin labelling of proteins. Curr. Opin. Struct. Biol. 8, 649–656.
6. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. (2000) The protein data bank. Nucleic Acids Res. 28, 235–242.
7. Farrokhi, N., Hrmova, M., Burton, R.A., and Fincher, G.B. (2007) Heterologous and cell free expression systems, in Methods in Molecular Biology: Plant Genomics (Gustafson P., ed.), (in press).
8. Hrmova, M., Varghese, J.N., Høj, P.B., and Fincher, G.B. (1998) Crystallization and preliminary X-ray analysis of β-glucan exohydrolase isoenzyme ExoI from barley (Hordeum vulgare). Acta Cryst. D54, 687–689.
9. Sali, A. and Blundell, T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779–815.
10. Sanchez, R. and Sali, A. (1998) Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. USA 95, 13597–13602.
11. Laskowski, R.A., MacArthur, M.W., Moss, D.S., and Thornton, J.M. (1993) PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Cryst. 26, 283–291.
12. Sippl, M.J. (1993) Recognition of errors in three-dimensional structures of proteins. Proteins 17, 355–362.
13. Jones, T.A., Zou, J.Y., Cowan, S.W., and Kjeldgaard, M. (1991) Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Cryst. A47, 110–119.

14. Sali, A. and Chiu, W. (2005) Macromolecular assemblies highlighted. Struct. Fold. Des. 13, 339–341.
15. Brown, D. and Sjölander, K. (2006) Functional classification using phylogenomic inference. PLoS Comput. Biol. 2, 479–483.
16. Sjölander, K. (2004) Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics 20, 170–179.
17. Eddy, S.R. (2004) What is a hidden Markov model? Nat. Biotechnol. 22, 1315–1316.
18. Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
19. Pearl, F.M., Bennett, C.F., Bray, J.E., Harrison, A.P., Martin, N., Shepherd, A., Sillitoe, I., Thornton, J., and Orengo, C.A. (2003) The CATH database: an extended protein family resource for structural and functional genomics. Nucleic Acids Res. 31, 452–455.
20. Marti-Renom, M.A., Madhusudhan, M.S., Fisher, A., Rost, B., and Sali, A. (2002) Reliability of assessment of protein structure prediction methods. Struct. Fold. Des. 10, 430–435.
21. Rychlewski, L., Fischer, D., and Elofsson, A. (2003) LiveBench-6: large-scale automated evaluation of protein structure prediction servers. Proteins 53 (Suppl 6), 542–547.
22. Shimodaira, H. and Hasegawa, M. (2001) CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics 17, 1246–1247.
23. Ginalski, K., Elofsson, A., Fischer, D., and Rychlewski, L. (2003) 3D-Jury: a simple approach to improve protein structure predictions. Bioinformatics 19, 1015–1018.
24. Rost, B., Yachdav, G., and Liu, J. (2004) The predictProtein server. Nucleic Acids Res. 32, W321–W326.
25. Brown, S.D., Gerlt, J.A., Seffernick, J.L., and Babbitt, P.C. (2006) A gold standard set of mechanistically diverse enzyme superfamilies. Genome Biol. 7, R8.
26. Levitt, M. (2007) Growth of novel protein structural data. Proc. Natl. Acad. Sci. USA 104, 3183–3188.
27. Armstrong, J.D., Pocklington, A.J., Cumiskey, M.A., and Grant, S.G.N. (2006) Reconstructing protein complexes: from proteomics to system biology. Proteomics 6, 4724–4731.
28. Harvey, A.J., Hrmova, M., DeGori, R., Varghese, J.N., and Fincher, G.B. (2000) Comparative modeling of the three-dimensional structures of family 3 glycoside hydrolases. Proteins Struct. Funct. Genet. 41, 257–269.

29. Hrmova, M., Harvey, A.J., Wang, J., Shirley, N.J., Jones, G.P., Høj, P.B., and Fincher, G.B. (1996) Barley β-D-glucan exohydrolases with β-D-glucosidase activity. Purification and determination of primary structure from a cDNA clone. J. Biol. Chem. 271, 5277–5286.
30. Varghese, J.N., Hrmova, M., and Fincher, G.B. (1999) Three-dimensional structure of a barley β-D-glucan exohydrolase, a family 3 glycosyl hydrolase. Struct. Fold Des. 7, 179–190.
31. Hrmova, M., Varghese, J.N., De Gori, R., Smith, B.J., Driguez, H., and Fincher, G.B. (2001) Catalytic mechanisms and reaction intermediates along the hydrolytic pathway of plant β-D-glucan glucohydrolase. Struct. Fold Des. 9, 1015–1016.
32. Hrmova, M., De Gori, R., Smith, B.J., Fairweather, J.K., Driguez, H., Varghese, J.N., and Fincher, G.B. (2002) Structural basis for broad substrate specificity in higher plant β-D-glucan glucohydrolases. Plant Cell 14, 1033–1052.
33. Davies, G. and Henrissat, B. (1995) Structures and mechanisms of glycosyl hydrolases. Struct. Fold Des. 7, 853–859.
34. Ikegami, M., Sato, T., Suzuki, K., Noguchi, K., Okuyama, K., Kitamura, S., Takeo, K., and Ohno, S. (1995) Molecular and crystal structures of 2,3,4,6,1′,3′,4′,6′-octa-O-acetyl-β-sophorose, methyl 2,3,4,6,3′,4′,6′-hepta-O-acetyl-β-sophoroside, and methyl 2,3,4,6,3′,4′-hexa-O-acetyl-6′-deoxy-β-sophoroside. Carbohydr. Res. 271, 137–150.
35. Rohrer, D.C., Sarko, A., Bluhm, T.L., and Lee, Y.N. (1980) The structure of gentiobiose. Acta Crystallogr. B36, 650–654.
36. Hrmova, M., Streltsov, V.A., Smith, B.J., Vasella, A., Varghese, J.N., and Fincher, G.B. (2005) Structural rationale for low nanomolar binding of transition state mimics to a family GH3 β-D-glucan glucohydrolase from barley. Biochemistry (USA) 44, 16529–16539.
37. Hrmova, M. and Fincher, G.B. (2007) Dissecting the catalytic mechanism of a plant β-D-glucan glucohydrolase through structural biology using inhibitors and substrate analogues. Carbohydr. Res. 342, 1613–1623.
38. Grisshammer, R. (2006) Understanding recombinant expression of membrane proteins. Curr. Opin. Biotechnol. 17, 337–340.
39. Link, A.J. and Georgiou, G. (2007) Advances and challenges in membrane protein expression. AIChE J. 53, 752–756.
40. Sizer, P.J., Miller, A., and Watts, A. (1987) Functional reconstitution of the integral membrane proteins of influenza virus into phospholipid liposomes. Biochemistry 26, 5106–5113.

41. Käsermann, F. and Kempf, C. (2006) Virus membrane proteins and proteinaceous pores. Future Virol. 1, 823–831.
42. Lichtenberg, D. and Barenholtz, Y. (1988) Liposomes: preparation, characterization and preservation. Methods Biochem. Anal. 33, 337–462.
43. Colletier, J.P., Chaize, B., Winterhalter, M., and Fournier, D. (2002) Protein encapsulation in liposomes: efficiency depends on interactions between protein and phospholipid bilayer. BMC Biotechnol. 2, 9–16.
44. Levy, D., Chami, M., and Rigaud, J.L. (2001) Two-dimensional crystallization of membrane proteins: the lipid layer strategy. FEBS Lett. 504, 187–193.
45. Cherezov, V., Fersi, H., and Caffrey, M. (2001) Membrane protein crystallization in meso: lipid type-tailoring of the cubic phase. Biophys. J. 81, 225–242.
46. Byrne, B. and Iwata, S. (2002) Membrane protein complexes. Curr. Opin. Struct. Biol. 2, 239–243.
47. Kornberg, R.D. and Darst, S.A. (1991) Two dimensional crystals of proteins on lipid layers. Curr. Opin. Struct. Biol. 1, 632–646.
48. Hacksell, I., Rigaud, J.-L., Purhonen, P., Pourcher, T., Hebert, H., and Leblanc, G. (2002) Projection structure at 8 Å resolution of the melibiose permease, a Na-cotransporter from E. coli. EMBO J. 21, 3569–3574.
49. Zhuang, J.P., Prive, G.G., Werner, G.E., Ringler, P., Kaback, H.R., and Engel, A. (1999) Two-dimensional crystallization of Escherichia coli lactose permease. J. Struct. Biol. 125, 63–75.
50. Landau, E.M. and Rosenbusch, J.P. (1996) Lipidic cubic phases: a novel concept for the crystallization of membrane proteins. Proc. Natl. Acad. Sci. USA 93, 14532–14535.
51. Kolbe, M., Besir, H., Essen, L.O., and Oesterhelt, D. (2000) Structure of the light-driven chloride pump halorhodopsin at 1.8 Å resolution. Science 288, 1390–1396.
52. Luecke, H., Schobert, B., Richter, H., Cartailler, J., and Lanyi, J. (1999) Structure of bacteriorhodopsin at 1.55 Å resolution. J. Mol. Biol. 291, 899–911.
53. da Fonseca, P., Morris, E.P., Hankamer, B., and Barber, J. (2002) Electron crystallographic study of photosystem II of the cyanobacterium Synechococcus elongatus. Biochemistry 41, 5163–5167.
54. Tornroth-Horsefield, S., Wang, Y., Hedfalk, K., Johanson, U., Karlsson, M., Tajkhorshid, E., Neutze, R., and Kjellbom, P. (2006) Structural mechanism of plant aquaporin gating. Nature 439, 688–694.

55. Kühlbrandt, W. (1984) Three-dimensional structure of the light-harvesting chlorophyll a/b-protein complex. Nature 307, 478–480.
56. Walden, H., Podgoski, M.S., and Schulman, B.A. (2003) Insights into the ubiquitin transfer cascade from the structure of the activating enzyme for NEDD8. Nature 422, 330–334.
57. Kol, S., Turrell, B.R., de Keyzer, J., van der Laan, M., Nouwen, N., and Driessen, A.J. (2006) YidC-mediated membrane insertion of assembly mutants of subunit c of the F1F0 ATPase. J. Biol. Chem. 281, 29762–29768.
58. Stewart, R.J., Varghese, J.N., Garrett, T.P.J., Høj, P.B., and Fincher, G.B. (2001) Mutant barley (1,3;1,4)-β-glucan endohydrolases with enhanced thermostability. Protein Eng. 14, 245–253.
59. Varghese, J.N., Garrett, T.P.J., Colman, P.M., Chen, L., Høj, P.B., and Fincher, G.B. (1994) Three-dimensional structures of 2 plant beta-glucan endohydrolases with distinct substrate specificities. Proc. Natl. Acad. Sci. USA 91, 2785–2789.
60. Høj, P.B. and Fincher, G.B. (1995) Molecular evolution of plant β-glucan endohydrolases. Plant J. 7, 367–379.
61. Bamforth, C.W. (1994) β-Glucan and β-glucanases in malting and brewing: practical aspects. Brew. Dig. 69, 12–16.
62. Mori, H., Bak-Jensen, K.S., and Svensson, B. (2002) Barley alpha-amylase Met53 situated at the high-affinity subsite –2 belongs to a substrate binding motif in the beta→alpha loop 2 of the catalytic (beta/alpha)8-barrel and is critical for activity and substrate specificity. Eur. J. Biochem. 269, 5377–5390.
63. Topaloglou, T. (2006) Informatics solutions for high-throughput proteomics. Drug Discov. Today 11, 509–516.
64. Forstner, M., Leder, L., and Mayr, L.M. (2007) Optimization of protein expression systems for modern drug discovery. Expert Rev. Proteomics 4, 67–78.
65. Wentz, A.E. and Shusta, E.V. (2007) A novel high throughput screen reveals yeast genes that increase heterologous protein secretion. Appl. Environ. Microbiol. 73, 1189–1198.
66. Peti, W. and Page, R. (2007) Strategies to maximize heterologous protein expression in Escherichia coli with minimal cost. Protein Expr. Purif. 51, 1–10.
67. Miyatake, H., Kim, S.-H., Motegi, I., Matsuzaki, H., Kitahara, H., Higuchi, A., and Miki, K. (2005) Development of a fully automated molecular crystallization/observation robotic system, HTS-80. Acta Crystallogr. D61, 658–663.

68. D’Arcy, A., Villard, F., and Marsh, M. (2007) An automated microseed matrix-screening method for protein crystallization. Acta Crystallogr. D63, 550–554.
69. Charles, M., Veesler, S., and Bonneté, F. (2006) MPCD: a new interactive on-line crystallization data bank for screening strategies. Acta Crystallogr. D62, 1311–1318.
70. Beteva, A., Cipriani, F., Cusack, S., Delageniere, S., Gabadinho, J., Gordon, E.J., Guijarro, M., Hall, D.R., Larsen, S., Launer, L., Lavault, C.B., Leonard, G.A., Mairs, T., McCarthy, A., McCarthy, J., Meyer, J., Mitchell, E., Monaco, S., Nurizzo, D., Pernot, P., Pieritz, R., Ravelli, R.G., Rey, V., Shepard, W., Spruce, D., Stuart, D.I., Svensson, O., Theveneau, P., Thibault, X., Turkenburg, J., Walsh, M., and McSweeney, S.M. (2006) High-throughput sample handling and data collection at synchrotrons: embedding the ESRF into the high-throughput gene-to-structure pipeline. Acta Crystallogr. D62, 1162–1169.
71. Terwilliger, T. (2004) SOLVE and RESOLVE: automated structure solution, density modification, and model building. J. Synchrotron Radiat. 11, 49–52.
72. Adams, P.D., Gopal, K., Grosse-Kunstleve, R.W., Hung, L.-W., Ioerger, T.R., McCoy, A.J., Moriarty, N.W., Pai, R.K., Read, R.J., Romo, T.D., Sacchettini, J.C., Sauter, N.K., Storoni, L.C., and Terwilliger, T. (2006) Recent developments in the PHENIX software for automated crystallographic structure determination. J. Synchrotron Radiat. 11, 53–55.
73. Lamzin, V.S. and Perrakis, A. (2002) Current state of automated crystallographic data analysis. Nat. Struct. Biol. 7, 978–981.
74. Collaborative Computational Project Number 4 (1994) The CCP4 suite: programs for protein crystallography. Acta Crystallogr. D50, 760–763.
75. Gopal, K., McKee, E.W., Romo, T., Pai, R., Smith, J., Sacchettini, J.C., and Ioerger, T.R. (2006) Crystallographic model-building on the web. Bioinformatics 23, 375–377.
76. McKee, E.W., Kanbi, L.D., Childs, K.L., Grosse-Kunstleve, R.W., Adams, P.D., Sacchettini, J.C., and Ioerger, T.R. (2005) FINDMOL: automated identification of macromolecules in electron-density maps. Acta Crystallogr. D61, 1514–1520.
77. Callebaut, I., Labesse, G., Durand, P., Poupon, A., Canard, L., Chomilier, J., Henrissat, B., and Mornon, J.P. (1997) Deciphering protein sequence information through hydrophobic cluster analysis (HCA): current status and perspectives [Review]. Cell Mol. Life Sci. 53, 621–645.

78. Mirkovic, N., Li, Z., Parnassa, A., and Murray, D. (2006) Strategies for high-throughput comparative modelling: applications to leverage analysis in structural genomics and protein family organization. Proteins 66, 766–777. 79. Pieper, U., Eswar, N., Braberg, H., Madhusudhan, M.S., Davis, F.P., Stuart, A.C., Mirkovic, N., Rossi, A., Marti-Renom, M.A., Fiser, A., Webb, B., Greenblatt, D., Huang, C.C., Ferrin, T.E., and Sali, A. (2006) MODBASE: a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res. 34, D291–D295. 80. Yura, K., Yamaguchi, A., and Go, M. (2006) Coverage of whole proteome by structural genomics observed through protein homology modeling database. J. Struct. Func. Genomics 7, 65–76. 81. Lee, R.C., Hrmova, M., Burton, R.A., Lahnstein, J., and Fincher, G.B. (2003) An α-larabinofuranosidase and a β-D-xylosidase from barley: purification, characterization and primary structures J. Biol. Chem. 278, 5377– 5387. 82. Harrison, S.S. (2004) Whither structural biology? Nat. Struct. Mol. Biol. 11, 12–15. 83. Kornberg, A. (2004) Biochemistry matters. Nat. Struct. Mol. Biol. 11, 493. 84. Abad-Zapatero, C. (2007) Notes on protein crystallography: quo vadis structural biology? Acta Crystallogr. D54, 687–689. 85. Dauter, Z. (2006) Current state and prospects of macromolecular crystallography. Acta Crystallogr. D62, 1–11. 86. Moffat, K. (1997) Laue diffraction. Methods Enzymol. 277, 433–447. 87. Hajdu, J., Neutze, R., Sjögren, T., Edman, K., Szöke, A., Wilmouth, R.C., and Wilmot, C.M. (2000) Analyzing protein function on four dimensions. Nat. Struct. Biol. 7, 1006–1012. 88. Schlichting, I. and Chu, K. (2000) Trapping intermediates in the crystal: ligand binding to myoglobin. Curr. Opin. Struct. Biol. 10, 744–752. 89. Srajer, V., Teng, T.Y., Ursby, T., Pradervand, C., Ren, Z., Adachi, S., Schildkamp, W., Bourgeois, D., Wulff, M., and Moffat, K. 
(1996) Photolysis of the carbon-monoxide complex of myoglobin-nanosecond time-resolved crystallography. Science 274, 1726–1729. 90. Bourgeois, D., Vallone, B., Schotte, F., Arcovito, A., Miele, A.E., Csiara, G., Wulf, M., Anfinrud, P., and Brunori, M. (2003) Complex landscape of protein structural dynamics unveiled by nanosecond Laue crystallography. Proc. Natl. Acad. Sci. USA 100, 8704–8709.

Functional Genomics and Structural Biology 91. Schotte, F., Soman, J., Olson, J.S., Wulff, M., and Anfinrud, O.A. (2004) Picosecond time-resolved crystallography: probing protein function in real time. J. Struct. Biol. 147, 235–246. 92. Schmidt, M., Pahl, R., Srajer, V., Anderson, S., Ren, Z., Ihee, H., Rajagopal, S., and Moffat, K. (2004) Protein kinetics: structures of intermediates and reaction mechanism from time-resolved x-ray data. Proc. Natl. Acad. Sci. USA 101, 4799–4804. 93. Ihee, H., Rajagopal, S., Srajer, V., Pahl, R., Anderson, S., Schmidt, M., Schotte, F., Anfinrud, P.A., Wulff, M., and Moffat, K. (2005) Visualizing reaction pathways in photoactive yellow protein from nanoseconds to seconds. Proc. Natl. Acad. Sci. USA 102, 7145–7150. 94. Baxter, R.H.G., Ponomarenko, N., Pahl, R., Moffat, K., and Norris, J.R. (2004) Time-resolved crystallographic studies of light-induced structural changes in the photosynthetic reaction centre. Proc. Natl. Acad. Sci. USA 101, 5982–5987. 95. Stoddard, B.L. (2001) Accumulation and trapping of catalytic intermediates for crystallographic structure determination. Methods 24, 126–138. 96. Schlichting, I. and Goody, R.S. (1997) Triggering methods in crystallographic enzyme kinetics. Methods Enzymol. 277, 467–490. 97. Scheidig, A.J., Burmester, C., and Goody, R.S. (1998) Use of caged nucleotides to characterize unstable intercmediates by X-ray crystallography. Methods Enzymol. 291, 251– 264.

227

98. Ren, Z. and Moffat, K. (1995) Quantitative analysis of synchrotron Laue diffraction patterns in macromolecular crystallography. J. Appl. Crtystallogr. 28, 461–481. 99. Yan, X., Ren, Z., and Moffat, K. (1998) Structure refinement against synchrotron Laue data: strategies for data collection and reduction. Acta. Crystallogr. D54, 367–377. 100. Hrmova, M., De Gori, R., Smith, B J., Vasella, A., Varghese, J.N., and Fincher, G.B. (2004) Threedimensional structure of the barley β-d-glucan glucohydrolase in complex with a transitionstate mimic. J. Biol. Chem. 279, 4970–4980. 101. Chapman, H.N., Barty, A., Bogan, M.J., Boutet, S., Frank, M., Hau-Riege, S.P., Marchesini, S., Woods, B.W., Bajt, S., Benner, W.H., London, R.A., Plonjes, E., Kuhlmann, M., Treusch, R., Dusterer, S., Tschentscher, T., Schneider, J.R., Spiller, E., Moller, T., Bostedt, C., Hoener, M., Shapiro, D.A., Hodgson, K.O., van der Spoel, D., Burmeister, F., Bergh, M., Caleman, C., Huldt, G., Seibert, M.M., Maia, F.R.N.C., Lee, R.W., Szoke, A., Timneanu, N., and Hajdu, J. (2006) Femtosecond diffractive imaging with a soft-X-ray free-electron laser. Nat. Phys. 2, 839–843. 102. Von Dreele, R.B. (2005) Binding of N-acetylglucosamine oligosaccharides to hen egg-white lysozyme: a powder diffraction study. Acta Crystallogr. D61, 22–32. 103. Dickinson, M., Farman, G., Frye, M., Bekyarova, T., Gore, D., Maughan, D., and Irving, T. (2005) Molecular dynamics of cyclically contracting insect flight muscle in vivo. Nature 433, 330–333.

Chapter 12

In situ Analysis of Gene Expression in Plants

Sinéad Drea, Paul Derbyshire, Rachil Koumproglou, Liam Dolan, John H. Doonan, and Peter Shaw

Summary

In the post-genomic era, it is necessary to adapt methods for gene expression and functional analyses to more high-throughput levels of processing. mRNA in situ hybridization (ISH) remains a powerful tool for obtaining information regarding a gene’s temporal and spatial expression pattern and can therefore be used as a starting point to define the function of a gene or a whole set of genes. We have deconstructed ‘traditional’ ISH techniques described for a range of organisms and developed protocols for ISH that adapt and integrate a degree of automation to standardized and shortened protocols. We have adapted this technique as a high-throughput means of gene expression analysis on wax-embedded plant tissues and also on whole-mount tissues. We have used wax-embedded wheat grains and Arabidopsis floral meristems and whole-mount Arabidopsis roots as test systems and show that it is capable of highly parallel processing.

Key words: High-throughput, Spatial patterns of gene expression, In situ hybridization.

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513 © Humana Press, a part of Springer Science + Business Media, LLC 2009 DOI: 10.1007/978-1-59745-427-8_12

1. Introduction

In situ hybridization (ISH) is one of the methods of choice for determining the spatial expression pattern of a given gene. High resolution protocols provide cellular and even subcellular resolution. One of the most significant advantages inherent in the technique is that it is applicable to any species whether or not these species are amenable to other methods of functional analyses such as stable transformation. For this reason it has proved to be invaluable in the evo-devo (evolution of development) field, for instance, where, in the absence of direct functional data in diverse species, it can provide detailed gene expression patterns across


the evolutionary spectrum that can be informative for studies of comparative development (1). ISH effectively complements Northern blotting, RT-PCR (reverse transcriptase-polymerase chain reaction) and microarrays, where the extraction of the RNA invariably results in the loss of spatial information. Microarrays allow many genes to be studied in parallel and are currently the most powerful tool to study gene expression. However, the microarray outputs need to be verified by independent methods, such as ISH (2, 3). To match the level of output, ISH must be made more efficient and less time-consuming. A number of variations on the traditional in situ protocols have been reported, including whole-mount ISH (4), in situ PCR (5, 6) and the use of vibratome-sectioned tissues (7), but the main shortcoming of ISH is undoubtedly the low-throughput nature of the technique. Efforts to make the ISH technique into a highly parallel, systematic process have been successful in flies and primitive chordates (8–10). Attempts have been made to address this issue in plants using the whole-mount ISH (WISH) and in situ PCR techniques (11, 12). However, though the potential is noted, the actual throughput is undetermined. The high-throughput protocols used in animal embryos involve whole-mount methods that are more feasible for these systems (8–10). The challenge in plants is the sheer size of the tissues required for analysis: this not only compromises the penetration of probe and hybridization but also makes microscopic examination more difficult and therefore more time-consuming. We have nevertheless used WISH effectively as a means of analysing gene expression in the small and more easily penetrable Arabidopsis root. The other option for cellular localization of transcripts is promoter fusions to reporter genes and subsequent transformation.
This approach has recognized shortcomings, as elements controlling a gene’s expression are known to be located not only in the traditional promoter region upstream of the coding region, but also intergenically and at unconventional distances from the gene (13). The resources required for mass transformation, and the fact that not all plant species are amenable to it, limit the application of this approach to well-studied model species. Two of the most significant developments in tissue and cell type-specific gene expression involve fluorescence-activated cell-sorting (FACS) and laser capture micro-dissection (LCM). These techniques overcome the limitations of non-specific manual tissue manipulation for RNA extraction and make it possible to isolate cell-specific material for use in genome-wide transcriptional profiling. The former method has been applied very elegantly to obtain a useful reference for gene expression patterns in Arabidopsis root cell types (14). However, the approach is dependent on the availability of transformable lines with cell-specific GFP expression, on the protoplasting of plant


material and on the existence of microarray facilities for the species being analyzed. LCM allows the isolation of RNA samples from individual cell types (15) but requires specialized microscopy facilities, and the RNA isolated needs to be amplified prior to its application on microarrays. Sequence- and statistics-based methods such as SAGE (serial analysis of gene expression) and MPSS (massively parallel signature sequencing), and direct statistical profiling of ESTs (expressed sequence tags), are certainly very high-throughput in terms of scale and constitute useful reference databases; MPSS has proven a particularly useful tool for the analysis of small RNAs (16–18). SAGE and MPSS rely on the matching of a short sequence to cognate genes in order to be identified and are therefore most useful for species with well-characterized genomes.

In this description of ISH on plant material, we will draw on three ISH projects completed or underway in our group: gene expression analysis in early wheat grain development (19), gene expression patterns in Arabidopsis flower meristems (20, R. Koumproglou, unpublished) and gene expression patterns in Arabidopsis roots. These projects involved the optimization of the protocol for different tissue types; different sources of probe templates; both wax-sectioned (flower meristems and wheat grains) and whole-mount (roots) approaches; and, for the wax-sectioned material, different automated slide processors. We begin each section with probe-making as this part of the protocol was virtually identical in each project.

2. Materials

2.1. Probe Making

1. A 10× stock of NTP-mix for in vitro transcription consisted of 1 μl each of ATP, CTP and GTP, 0.65 μl UTP (100 mM stocks from Roche), 3.5 μl of Dig-UTP (10 mM stock from Roche) and 2.85 μl sterile water for a 10 μl stock.
2. RNA polymerases (Roche or Promega), used as recommended.
3. RNase inhibitor (Roche or Promega), used as recommended.
4. QIAquick PCR purification kit (QIAGEN), used according to the manufacturer’s instructions.
5. Montage Clean-up Kit (Millipore), used for PCR purification in 96-well format.
6. 200 mM Carbonate buffer, pH 10.2 (80 mM NaHCO3, 120 mM Na2CO3) for hydrolysis of probes.
7. Nitrocellulose (Amersham).
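The composition of the NTP-mix in item 1 can be checked with a little arithmetic. The following is a minimal sketch, using only the volumes and stock concentrations listed above; the function and variable names are illustrative and not part of the protocol.

```python
# (volume in µl, stock concentration in mM), as listed for the 10 µl NTP-mix.
NTP_MIX = {
    "ATP": (1.0, 100.0),
    "CTP": (1.0, 100.0),
    "GTP": (1.0, 100.0),
    "UTP": (0.65, 100.0),
    "Dig-UTP": (3.5, 10.0),
}
WATER_UL = 2.85  # sterile water added to bring the mix to volume


def mix_concentrations(mix, water_ul):
    """Return (total volume in µl, {component: concentration in mM})."""
    total_ul = sum(volume for volume, _ in mix.values()) + water_ul
    conc = {name: volume * stock / total_ul for name, (volume, stock) in mix.items()}
    return total_ul, conc


total_ul, conc = mix_concentrations(NTP_MIX, WATER_UL)
uridine_total = conc["UTP"] + conc["Dig-UTP"]   # labelled + unlabelled UTP
dig_fraction = conc["Dig-UTP"] / uridine_total  # share of the UTP pool that is labelled

print(f"total volume: {total_ul:.2f} µl")
for name, value in conc.items():
    print(f"{name}: {value:.2f} mM")
print(f"labelled fraction of UTP pool: {dig_fraction:.0%}")
```

With these numbers each nucleotide (counting UTP and Dig-UTP together) comes to about 10 mM in the 10× stock, and roughly 35% of the UTP pool is Dig-labelled.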


2.2. Plant Tissue Preparation

2.2.1. Whole-Mount Seedling Tissue

1. Vortex bleach (Procter & Gamble Ltd).
2. Parafilm® laboratory film (Pechiney Plastic Packaging, Menasha, USA).
3. Standard growth medium contained 1× Murashige and Skoog (MS) basal salts (micro and macro elements) (Duchefa), 1% (w/v) sucrose and 0.5% (w/v) Phytagel™ dissolved in deionized water with pH adjusted to 5.7 with KOH, followed by autoclaving for 20 min. Medium was cooled to 50–60°C and ~20 ml poured into 9 cm Petri plates (Bibby Sterilin Ltd) and allowed to solidify.
4. Paraformaldehyde (Sigma), 4% (w/v) solution in PBS (1.3 M NaCl, 70 mM Na2HPO4, 30 mM NaH2PO4, pH 7; made up as 10× stock and diluted in sterile water before use), prepared fresh for each use.
5. Alternative fixative, FAA: 3.7% formaldehyde (a 37% stock solution is available from Sigma), 50% ethanol and 5% acetic acid.

2.3. Wax-Embedded Tissue

6. Tissue-Tek® Vacuum Infiltration Processor (Sakura; distributed by Bayer, UK) for processing material before embedding.
7. Sectioning of wax-embedded material was done using a Leica Microtome (RM2125RT). Silicone isolators used for precise positioning of sections on slides were obtained from Grace Biolabs. For embedding and sectioning of root material after ISH we used Technovit 7100® (Kulzer GmbH, Germany) resin and an Ultracut-E microtome (Reichert-Jung, Austria) with a glass knife.

2.4. Pre-treatment, Hybridization, Washing and Staining of Slides/Tissues

1. Automated ISH on wax-sectioned tissue was performed using the VP2000 (Vysis) and the InSituPro (Intavis) slide processors.
2. Buffers used in pre-treatment of slides prior to hybridization (see Note 1): PBS (diluted from a 10× stock solution containing 1.3 M NaCl, 70 mM Na2HPO4, 30 mM NaH2PO4, pH 7).
3. Proteinase K (Roche) was made up as a 25 mg/ml stock in sterile water and used at 2–3 or 10 µg/ml in Tris buffer (100 mM Tris–HCl, 50 mM EDTA, pH 7.5).
4. Acetic anhydride (Sigma) was used at 0.5% in 0.1 M triethanolamine (Sigma).
5. Glycine was used at 0.2% in PBS.
6. Hybridization solution (HS): salts (300 mM NaCl, 10 mM Tris–HCl pH 6.8, 10 mM NaPO4, 5 mM EDTA), 50% deionized formamide, 5% dextran sulphate, 0.5 mg/ml tRNA, 1× Denhardts and 0.1 mg/ml salmon testis DNA, stored at −20°C until hybridization.


7. Hybridization chambers were obtained from Grace Biolabs.
8. Solutions used for washing prior to staining: 2× SSC and 1× SSC (20× SSC stock: 3 M NaCl, 0.3 M Na citrate) made up in 50% formamide.
9. TBS (10 mM Tris–HCl, 250 mM NaCl, pH 7; made up as 10× stock and diluted in water before use).
10. AP-buffer (100 mM Tris–HCl, 100 mM NaCl, pH 9.5; 50 mM MgCl2).
11. NBT (0.1 mg/ml) and BCIP (0.075 mg/ml) from Promega.
12. Anti-digoxigenin-alkaline phosphatase (anti-Dig-AP) antibody and blocking reagent (Roche).
13. Ethanol (diluted in water if required).
14. Triton or Tween surfactants (Sigma).
15. Calcofluor (Fluorescent Brightener 28 from Sigma) used at 0.1% in water.
16. Entellan (Merck).

2.5. Microscopy

1. A Nikon E800 microscope with a digital camera, used under brightfield conditions for wheat sections and with a UV filter for the calcofluor-counterstained Arabidopsis sections.
2. A Nikon Coolpix 950 digital camera attached to a Leica WILD M10 binocular microscope was used to capture low magnification images of roots after ISH. White light from above and white paper underneath the plates improved the signal contrast.
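Several reagents in Subheading 2.4 are specified as a stock plus a working concentration (for example Proteinase K, 25 mg/ml stock used at 2–3 or 10 µg/ml), and the antibody is used at a 1/3,000 dilution. The sketch below applies the usual C1·V1 = C2·V2 relation to such cases. The final volumes (50 ml and 30 ml) are hypothetical values chosen for illustration only, as are the function names.

```python
def stock_volume_ul(stock_conc, working_conc, final_volume_ml):
    """Volume of stock (µl) needed so that C1 * V1 = C2 * V2.

    stock_conc and working_conc must be in the same units (e.g. µg/ml).
    """
    if not 0 < working_conc <= stock_conc:
        raise ValueError("working concentration must be positive and not exceed the stock")
    return working_conc / stock_conc * final_volume_ml * 1000.0


def fold_dilution_volume_ul(fold, final_volume_ml):
    """Volume of reagent (µl) for a 1/fold dilution in the given final volume."""
    if fold < 1:
        raise ValueError("fold dilution must be at least 1")
    return final_volume_ml * 1000.0 / fold


# Proteinase K: 25 mg/ml stock (= 25,000 µg/ml) used at 10 µg/ml.
# The 50 ml bath volume is an assumed example, not a value from the protocol.
proteinase_k_ul = stock_volume_ul(25_000, 10, final_volume_ml=50)

# Anti-Dig-AP at 1/3,000; the 30 ml volume is likewise an assumed example.
antibody_ul = fold_dilution_volume_ul(3000, final_volume_ml=30)

print(f"Proteinase K stock: {proteinase_k_ul:.1f} µl")
print(f"Anti-Dig-AP: {antibody_ul:.1f} µl")
```

Keeping both concentrations in the same units before calling the helper avoids the most common source of error, the mg/ml versus µg/ml mismatch.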

3. Methods

3.1. Probe Making

1. Primers are designed in order to append a T7 RNAP site to the 3′ end of the gene sequence to be labelled (see Note 2). We use a standard PCR cycle, for example, 94°C 3 min, then 30 cycles of 94°C 45 s, 63°C 45 s and 72°C 1.5 min, final extension of 72°C for 6 min. For 96-well plates PCR-product purification was done using the Montage Clean-up Kit (Millipore). Individual PCR templates can be cleaned using the available commercial kits, for example, from Qiagen.
2. In vitro transcription was performed with ~500 ng of PCR template in 10 μl reactions for 2 h at 37°C in the presence of Dig-UTP nucleotides (see Note 3).
3. Hydrolysis was carried out immediately in 100 mM carbonate buffer pH 10.2 at 60°C for a standard 30 min (see Note 4),


and products precipitated in 2.5 M ammonium acetate and 3 volumes absolute ethanol for 1 h at 4°C.
4. Plates were centrifuged at ~2,300 × g for 30 min (or tubes for 10 min at ~7,400 × g in a microfuge at 4°C) and pellets resuspended in 30 μl TE (10 mM Tris–HCl, 1 mM EDTA) buffer.
5. Dilutions (100 times) were made in water and 1 μl of each spotted on nitrocellulose for dot-blot: 30 min in blocking solution (Sigma), 30 min in anti-Dig-AP, 5 min wash in TBS, 5 min in AP-buffer and developed as described in Subheading 3.3.2 until signal was sufficient (see Note 5). All probes were then diluted 100 times in HS and stored at −20°C until hybridization. Probes diluted in HS were denatured for 2 min at 85°C before application to slides or seedlings (see Note 6).

3.2. Plant Tissue Preparation for In situ Hybridization

3.2.1. Whole-Mount Tissues

1. Seeds of Arabidopsis thaliana (L.) Heynh., ecotype Columbia-0 (Col-0), were sterilized in 5% (v/v) bleach for 5 min and washed three times in sterile distilled water (sdH2O).
2. Seeds were dropped individually onto the surface of the growing medium in horizontal lines at a density of 5–10 seeds per centimetre.
3. Plates were then sealed with Parafilm and placed in darkness at 4°C for 48 h to stimulate and synchronize germination.
4. Following cold treatment, plates were transferred to a growth room maintained at 25°C and incubated in a near vertical position, under fluorescent lamps emitting ~70 μmol/m2/s in a continuous white light regime.

3.2.2. Tissues for Wax-Sectioning

1. For Arabidopsis flower meristems we used ecotype Columbia grown under long day conditions in the greenhouse because this produced larger meristems (and therefore more sections containing the central meristematic zone) than other conditions that were tested. Wheat plants (variety Savannah) were grown under controlled environment conditions (16°C, 16 h light) and ears tagged daily at anthesis.
2. Wheat grains harvested at 3, 6 and 9 days after anthesis (DAA) were trimmed and Arabidopsis floral meristems were removed just after bolting. All tissues were fixed in paraformaldehyde or FAA (6 h at 35°C in the Tissue-Tek Vacuum Infiltration Processor, VIP).
3. The Tissue-Tek VIP cycle further included the following steps: 70% ethanol 1 h 35°C, 80% ethanol 1.5 h 35°C, 90% ethanol 2 h 35°C, 100% ethanol 1 h 35°C, 100% ethanol 1.5 h 35°C (repeat 2 h), xylene 0.5 h 35°C (repeat 1 h and again 1.5 h), wax 1 h 60°C (repeat same then for 2 h twice). All steps are


performed under vacuum and the plant tissue is contained in plastic cassettes (also Tissue-Tek).
4. Cages containing the samples were then transferred to the Tissue-Tek Embedding Console and embedded in the desired orientation; we used longitudinal sections for flower meristems and transverse sections for wheat grains.
5. Sections of 14 µm were found to be most suitable for wheat grains but a standard 8 µm was used for Arabidopsis tissues. Sections were allowed to dry onto slides overnight at 42°C (see Note 7). For wheat grains and for use with the VP2000 slide processor, the arrangement and number of tissue sections on the slide was made uniform using adherent, but removable, silicone isolators. This allowed the parallel screening of multiple probes on the same slide containing up to eight sections, each section in an isolated well. For use with the Intavis Processor, Arabidopsis meristems were sectioned right through and all sections from one meristem positioned on one slide (~30 sections).

3.3. Pre-treatment, Hybridization, Washing and Staining of Slides/Tissues

3.3.1. For Whole-Mount Tissues

This method, together with an example of the results obtained, is summarized in the schematic in Fig. 1.
1. Four-day-old seedlings were fixed in 4% paraformaldehyde while still on MS/agar plates by applying a weak vacuum to ensure penetration of the fixative.
2. The seedlings were then transferred with tweezers in clusters of 40–50 into Tissue-Tek mesh biopsy cassettes (Sakura). A brief vacuum infiltration was applied with each change of the following solutions, to submerge the cassettes:
3. Dehydration and rehydration for 1 h each in 30, 65, 100, 65 and 30% (v/v) ethanol; PBS 30 min; acetic anhydride/TEA 30 min; PBS ×2, 15 min each.
4. Seedlings (10–12) were tweezer-transferred from cassettes into 1.5 ml microfuge tubes containing 100 µl probe-HS and incubated at 50°C for 16 h.
5. Following the hybridization reaction, seedlings were tweezer-transferred into a 48-well mesh-bottom plate (1 reaction per well), covered with a lid and the plate placed in a plastic box [100 mm (w) × 200 mm (l) × 50 mm (h)] containing 100 ml of appropriate washing solution. Material was subjected to three washes in 2× SSC/50% (v/v) formamide and one wash in 1× SSC/50% (v/v) formamide, 30 min each at 50°C; then 1× SSC 5 min and PBS 10 min at room temperature (r.t.).
6. Material was prepared for antibody labelling by washing in TBS for 10 min; TBS + 0.5% (w/v) blocking reagent 1 h; and TBS/1% (w/v) BSA/0.3% (v/v) Triton X-100 1 h.


Fig. 1. Flow diagram describing whole-mount in situ hybridization on Arabidopsis roots (Subheading 3.3.1). Seedlings are fixed in plates and transferred into mesh biopsy cassettes followed by pre-treatment washes in a beaker. Groups of seedlings are then transferred into microfuge tubes and incubated in probe/hybridization solution overnight. Groups of seedlings are placed into individual wells of a mesh-bottom 48-well plate and subjected to post-hybridization washes, then collected into separate wells of a 6-well plate and stained. Low magnification images are collected to show general spatial expression patterns, and selected roots embedded in resin and sectioned in the zones of expression, giving cell-specific resolution. All images are then collected into a database. Images of results using a probe for Histone4 are shown. Scale bars: root whole-mount = 300 µm, section = 25 µm.

7. Anti-Dig-AP was diluted (1/3,000) in the TBS/BSA/Triton buffer and used for seedling incubation at r.t. (1 h) then at 4°C for 16 h.
8. Seedlings were given three 20 min washes at r.t. in the same buffer (without antibody), followed by one wash in TBS for 20 min, and one wash in AP-buffer for 10 min.
9. Seedlings were tweezer-transferred into six-well plates and colour detection with NBT/BCIP was carried out in complete darkness for 2–4 h and then stopped in water. Expression profiles were broadly separated into four categories: absent (−), weak (+), moderate (++) and strong (+++).

3.3.2. Using the VP2000 and Multiple Probes per Slide

The use of corresponding isolators and chambers for section organization and hybridization is shown in Fig. 2. This arrangement was used to maximize efficiency and economy when working with probes in 96-well format and has been described (20).
1. Silicone isolators were removed from the slides when dry and the slides loaded in the slide rack for the VP2000 processor. The rack capacity is 50 slides.
2. The slides are put through the following program: xylene (see Note 8) 20 min × 2, (with agitation for final minute of

In Situ Analysis of Gene Expression in Plants

237

Fig. 2. Silicone isolators and hybridization chambers as used to arrange and hybridize various probes to wax-sectioned wheat grains when used in conjunction with a 96-well probe preparation format, as described in Subheading 3.3.2 and in Drea et al. (20). (A) Hybridization chambers applied to slides and probes added from 96-well plate. (B) Alternative format using larger hybridization chambers.

second treatment); 100% ethanol 10 min (with agitation for final minute), then through a 95%, 85%, 50%, 30% ethanol series for 2 min each (see Note 9); PBS 3–4 min ×2; Proteinase K 30 min at 37°C; glycine 2 min; PBS 3–4 min; acetic anhydride 10 min (with agitation); PBS 3–4 min, then back through the ethanol series. Slides were completely dry at this stage and ready for hybridization.
3. Hybridization chambers were applied securely to the slides (after pre-treatment) and probes (diluted in HS) were applied


to one well (two sections) for the three stages individually. Coverslips were placed on the chambers to prevent evaporation and hybridization was performed overnight in a 50°C incubator.
4. Chambers are removed and slides arranged in the VP2000 for the washing program: 15 min in 2× SSC/50% formamide (see Note 10) at 40°C, 40 min in same at 50°C, 20 min in 1× SSC/50% formamide at 50°C (all steps with constant agitation), 5 min in 1× SSC at room temperature, 5 min in TBS at room temperature (see Note 11).
5. Slides are then transferred into trays/boxes [eight slides fit in a box 100 mm (w) × 200 mm (l) × 50 mm (h)] for staining: 1% blocking solution in TBS 1 h, TBS containing 1/3,000 dilution of anti-Dig-AP and 0.05% Tween-20 for 1 h, 4× 10 min washes in TBS, 5 min in AP-buffer.
6. Develop in AP-buffer containing NBT and BCIP (see Note 12). Slides were then washed several times in water to stop the reaction, followed by sequential washes in 70% and 100% ethanol to remove excess stain (the duration of the ethanol washes depends on the level of colour development and should be monitored by eye). Slides were then allowed to dry and permanently mounted in Entellan.

3.3.3. Using the Intavis InSituPro

Using the Intavis Processor allows automation of the protocol from hydration after de-waxing to the signal-detection stage:
1. Slides were de-waxed in xylene manually before loading in the processor (the capacity is 60 slides) for the following program: 5 min in 100% ethanol ×4; 2 min each ×2 in 95%, 85%, 50%, 30% ethanol; 5 min ×2 in PBS; 15 min in Proteinase K (10 µg/ml); 5 min ×2 in glycine; 5 min ×2 in PBS; 20 min in 4% paraformaldehyde (see Note 13); 5 min ×2 PBS; 14 h hybridization at 50°C; 10 min ×10 in 2× SSC/50% formamide at 50°C; 5 min in 2× SSC/50% formamide at 37°C; 10 min ×2 in 1× SSC; 5 min in PBS; 5 min ×2 in TBS; 30 min ×2 in blocking buffer; 1 h in anti-Dig-AP antibody (1/3,000 in 1× TBS, 1% BSA, 0.3% Triton); 10 min ×10 in TBS.
2. Slides were transferred to boxes and processed as in Subheading 3.3.2 (steps 5 and 6).
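When scheduling a run it can help to total the incubation times in the program of step 1. The sketch below simply encodes each step as (label, minutes, repeats) and sums them; it is an illustration, not a feature of the processor, and it ignores the time the instrument needs for liquid handling and temperature changes, so a real run will take longer.

```python
# (label, minutes per incubation, number of repeats), transcribed from step 1.
PROGRAM = [
    ("100% ethanol", 5, 4),
    ("95% ethanol", 2, 2),
    ("85% ethanol", 2, 2),
    ("50% ethanol", 2, 2),
    ("30% ethanol", 2, 2),
    ("PBS", 5, 2),
    ("Proteinase K", 15, 1),
    ("glycine", 5, 2),
    ("PBS", 5, 2),
    ("4% paraformaldehyde", 20, 1),
    ("PBS", 5, 2),
    ("hybridization, 50 C", 14 * 60, 1),
    ("2x SSC/50% formamide, 50 C", 10, 10),
    ("2x SSC/50% formamide, 37 C", 5, 1),
    ("1x SSC", 10, 2),
    ("PBS", 5, 1),
    ("TBS", 5, 2),
    ("blocking buffer", 30, 2),
    ("anti-Dig-AP", 60, 1),
    ("TBS", 10, 10),
]


def total_minutes(program):
    """Sum of incubation times only; instrument handling time is not included."""
    return sum(minutes * repeats for _, minutes, repeats in program)


total = total_minutes(PROGRAM)
hours, minutes = divmod(total, 60)
print(f"incubation time: {total} min ({hours} h {minutes} min)")
```

Most of the total is the 14 h hybridization, so in practice the run is naturally split around an overnight step.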

3.4. Microscopy

3.4.1. For Whole-Mount ISH

Low magnification images of root in situs were captured with a dissecting microscope and attached digital camera. To get detailed image data in cross section, samples were embedded in plastic resin and sectioned with an ultramicrotome (see Note 14). Images of these sections were captured with a digital camera attached to a light microscope using DIC optics.


Fig. 3. Results of ISH on wax-sectioned Arabidopsis flowers using the Intavis InSituPro (Subheading 3.4.2). (A) HistoneH4 expression in floral meristem. (B) STM expression in floral meristem.

3.4.2. For Wax-Sectioned ISH

Images of the sections were captured on a digital camera attached to a light microscope with fluorescence for slides counterstained with calcofluor. An example of the results obtained is shown in Fig. 3.

3.5. Data Processing

In our experience, the rate-limiting steps in conducting in situs on a high-throughput scale are the image capture and data-processing stages. Arranging sections in a reproducible order on the slides makes manual image capture more routine as the same conditions and settings can be used for all samples. In other fields, there are more advanced attempts to automate and computerize the image capture and interpretation of expression patterns (21–23). For wheat grain work we recorded all details of probes (sequences, slide positions, folders where corresponding images were stored, etc.) in Excel spreadsheets and used these to construct a database of the results (19). Gene expression studies in other systems have also produced web-accessible and searchable databases (8–10).
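As an illustration of the kind of per-probe record described above, the sketch below stores probe details in a small structured record and writes them out as CSV, which a spreadsheet or database can import. This is not the authors' database: the field names, the example values and the use of the −/+/++/+++ categories from Subheading 3.3.1 as a score field are all assumptions made for the example.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields

# Expression categories as used for whole-mount roots (absent to strong).
SCORES = ("-", "+", "++", "+++")


@dataclass(frozen=True)
class ProbeRecord:
    """One probe on one slide position; all field names are hypothetical."""
    probe_id: str
    plate_well: str      # well in the 96-well probe plate, e.g. "A1"
    slide_id: str
    slide_position: int  # position of the section/well on the slide
    image_folder: str    # folder where the corresponding images are stored
    score: str

    def __post_init__(self):
        if self.score not in SCORES:
            raise ValueError(f"score must be one of {SCORES}, got {self.score!r}")


def records_to_csv(records):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    names = [f.name for f in fields(ProbeRecord)]
    writer = csv.DictWriter(buffer, fieldnames=names, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


example = ProbeRecord("probe_001", "A1", "slide_01", 1, "images/slide_01", "++")
csv_text = records_to_csv([example])
print(csv_text)
```

Validating the score when the record is created keeps free-text variants out of the table, which makes later searching and sorting by expression level more reliable.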

4. Notes

1. We did not find it necessary to use DEPC (diethylpyrocarbonate; Sigma)-treated water for pre-treatment of slides or probe making. We used fresh sterile water for dilutions and


autoclaved all buffers and solutions (for procedures prior to, and during, hybridization when RNase precautions are most important). The buffers and solutions were set aside exclusively for in situ work so as to minimize the chance of contamination. Likewise, we do not pretreat all the boxes/dishes by baking for each experiment but set aside a set of apparatus for RNA work exclusively.
2. T3 RNAP can also be used for transcription by appending the appropriate recognition site, but we do not recommend SP6 RNAP, especially for large numbers of probes, as we have found it to be less efficient and less reliable. In many protocols, sense probes (transcribed from the 5′ end) are used as negative controls, but when conducting many in situs simultaneously we usually include just one or a few sense probes as negative controls. Gene sequences may be amplified from genomic DNA (exon regions) or cDNA directly, but each gene will require individually designed primers. We usually begin with many genes inserted in a common vector, for example, a cDNA library (19) or the SALK cDNA collection for Arabidopsis. This allows one to design a set of common primers targeted to the surrounding vector sequence and amplify templates for all genes simultaneously. With regard to probe specificity: UTR sequences are often used as templates so as to minimize the chance of cross-hybridization between similar genes. cDNA library clones are often made using polyadenylated tails as anchors, so when used as templates they will already contain 3′ UTRs and should therefore be specific for the gene in question (19).
3. We usually check 0.5 µl of the transcription reaction on a 1% agarose gel (run at 50–70 V) to determine if transcription proceeded efficiently; the single-strand RNA should run as a smaller band below the double-stranded DNA template.
4. There is a formula to determine the amount of time required to hydrolyze the RNA to the desired size: t = (Li − Lf)/(K × Li × Lf), where t = time in minutes, K = rate constant (0.11 kb⁻¹ min⁻¹), Li = initial length (kb) and Lf = final length (kb). When making probes in 96-well format we used a standard 30 min for all probes. Some labs have found that it is not essential to reduce the size of the probe by hydrolysis and have obtained stronger signals with longer probes.
5. We use the dot-blot as a means of qualitatively determining the success of Dig-labelling rather than as a means of quantification.
6. Most protocols require separate denaturation of the probe in 50% deionized formamide prior to dilution in HS but we have found it unnecessary.


7. Many labs recommend processing wax sections immediately after they are adhered to the slides. Once sections are dried onto slides, we have routinely stored the slides covered in a box at r.t. or at 4°C for days or weeks prior to dewaxing, pre-treatment, and hybridization.
8. There are alternatives to xylene for dewaxing slides (often less toxic and easier to dispose of safely), such as Histoclear (Raymond Lamb Inc., UK) and CitriSolv (Fisher Scientific).
9. It is not always necessary to use a very elaborate ethanol dehydration sequence; a broader series can be designed as long as it does not affect the quality of the tissue.
10. It is also possible to use a simple 0.2× SSC solution for washing slides, and the use of formamide can be avoided if desired.
11. We found that RNase treatment during washing steps did not significantly affect the results of the in situs.
12. We recommend monitoring the development of slides (when using colour-based detection systems) by eye or using a dissecting microscope, and always using a well-characterized positive control (like HistoneH4) as an indication of how efficiently the experiment proceeded. As in all high-throughput techniques, there may be false negatives under standardized conditions. For instance, when working on individual or small numbers of probes we have found that using a larger template (>500 bp) may produce a stronger signal (see Notes 2 and 4).
13. Proteinase K treatment can sometimes result in tissue damage, and it is sometimes necessary to include a re-fixation step in the pre-treatment protocol.
14. Because of the delicate nature of the Arabidopsis root, we have found that embedding in a plastic resin preserves the histology of the layers optimally.
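The hydrolysis-time formula in Note 4 can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original protocol: the function name `hydrolysis_time_min` and the 0.15 kb target length used in the example are assumptions chosen for illustration; only the formula and the value K = 0.11 come from the note.

```python
def hydrolysis_time_min(initial_kb, final_kb, rate_constant=0.11):
    """Hydrolysis time in minutes: t = (Li - Lf) / (K * Li * Lf).

    initial_kb    -- Li, initial probe length in kb
    final_kb      -- Lf, desired final fragment length in kb
    rate_constant -- K, 0.11 as given in Note 4
    """
    if initial_kb <= 0 or final_kb <= 0:
        raise ValueError("lengths must be positive")
    if final_kb >= initial_kb:
        # Probe is already at or below the target size: no hydrolysis needed.
        return 0.0
    return (initial_kb - final_kb) / (rate_constant * initial_kb * final_kb)


if __name__ == "__main__":
    target_kb = 0.15  # hypothetical target fragment length, for illustration only
    for probe_kb in (0.5, 1.0, 2.0):
        minutes = hydrolysis_time_min(probe_kb, target_kb)
        print(f"{probe_kb:.1f} kb probe -> {minutes:.0f} min")
```

Because the time depends on the starting length, a fixed 30 min for every probe (as used in the 96-well format) is a compromise: longer templates would end up somewhat shorter than shorter ones would.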

References
1. Kramer, E.M. and Irish, V.F. (1999) Evolution of genetic mechanisms controlling petal development. Nature 399, 144–148.
2. Chuaqui, R.F., Bonner, R.F., Best, C.J., Gillespie, J.W., Flaig, M.J., Hewitt, S.M., Phillips, J.L., Krizman, D.B., Tangrea, M.A., Ahram, M., Linehan, W.M., Knezevic, V., and Emmert-Buck, M.R. (2004) Post-analysis follow-up and validation of microarray experiments. Nat. Genetics 32, 509–514.
3. Wellmer, F., Riechmann, J.L., Alves-Ferreira, M., and Meyerowitz, E.M. (2004) Genome-wide analysis of spatial gene expression in Arabidopsis flowers. Plant Cell 16, 1314–1326.
4. de Almeida Engler, J., Van Montagu, M., and Engler, G. (1998) Whole-mount in situ hybridization in plants. Methods Mol. Biol. 82, 373–384.
5. Johansen, B. (1997) In Situ PCR on Plant Material with Sub-cellular Resolution. Ann. Bot. 80, 697–700.
6. Pesquet, E., Barbier, O., Ranocha, P., Jauneau, A., and Goffner, D. (2004) Multiple gene detection by in situ RT-PCR in isolated plant cells and tissues. Plant J. 39, 947–959.
7. Borlido, J., Pereira, S., Ferreira, R., Coelho, N., Duarte, P., and Pissarra, J. (2002) Simple and Fast In Situ Hybridization. Plant Mol. Biol. Rep. 20, 219–229.
8. Tomancak, P., Beaton, A., Weiszmann, R., Kwan, E., Shu, S., Lewis, S.E., Richards, S., Ashburner, M., Hartenstein, V., Celniker, S.E., and Rubin, G.M. (2002) Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biol. 3, RESEARCH0088-8.
9. Satou, Y., Takatori, N., Fujiwara, S., Nishikata, T., Saiga, H., Kusakabe, T., Shin-i, T., Kohara, Y., and Satoh, N. (2002) Ciona intestinalis cDNA projects: expressed sequence tag analyses and gene expression profiles during embryogenesis. Gene 287, 83–96.
10. Quiring, R., Wittbrodt, B., Henrich, T., Ramialison, M., Burgtorf, C., Lehrach, H., and Wittbrodt, J. (2004) Large-scale expression screening by automated whole-mount in situ hybridization. Mech. Dev. 121, 971–976.
11. Koltai, H. and McKenzie Bird, D. (2000) High throughput cellular localization of specific plant mRNAs by liquid-phase in situ reverse transcription-polymerase chain reaction of tissue sections. Plant Physiol. 123, 1203–1212.
12. Friml, J., Benkova, E., Mayer, U., Palme, K., and Muster, G. (2003) Automated whole mount localisation techniques for plant seedlings. Plant J. 34, 115–124.
13. Taylor, C. (1997) Promoter fusion analysis: an insufficient measure of gene expression. Plant Cell 9, 273–275.
14. Birnbaum, K., Shasha, D.E., Wang, J.Y., Jung, J.W., Lambert, G.M., Galbraith, D.W., and Benfey, P.N. (2003) A gene expression map of the Arabidopsis root. Science 302, 1956–1960.
15. Kerk, N.M., Ceserani, T., Tausta, S.L., Sussex, I.M., and Nelson, T.M. (2003) Laser capture microdissection of cells from plant tissues. Plant Physiol. 132, 27–35.
16. Ogihara, Y., Mochida, K., Nemoto, Y., Murai, K., Yamazaki, Y., Shin, I.T., and Kohara, Y. (2003) Correlated clustering and virtual display of gene expression patterns in the wheat life cycle by large-scale statistical analyses of expressed sequence tags. Plant J. 33, 1001–1011.
17. Gowda, M., Jantasuriyarat, C., Dean, R.A., and Wang, G.L. (2004) Robust-LongSAGE (RLSAGE): a substantially improved LongSAGE method for gene discovery and transcriptome analysis. Plant Physiol. 134, 890–897.
18. Nakano, M., Nobuta, K., Vemaraju, K., Tej, S.S., Skogen, J.W., and Meyers, B.C. (2006) Plant MPSS databases: signature-based transcriptional resources for analyses of mRNA and small RNA. Nucleic Acids Res. 34, D731–D735.
19. Drea, S., Leader, D.J., Arnold, B.C., Shaw, P., Dolan, L., and Doonan, J.H. (2005) Systematic spatial analysis of gene expression during wheat caryopsis development. Plant Cell 17, 2172–2185.
20. Drea, S., Corsar, J., Crawford, B., Shaw, P., Dolan, L., and Doonan, J.H. (2005) A streamlined method for systematic, high resolution in situ analysis of mRNA distribution in plants. Plant Methods 1, 8.
21. Camp, R.L., Chung, G.G., and Rimm, D.L. (2002) Automated subcellular localization and quantification of protein expression in tissue microarrays. Nat. Med. 8, 1323–1327.
22. Brey, E.M., Lalani, Z., Johnston, C., Wong, M., McIntire, L.V., Duke, P.J., and Patrick, C.W., Jr. (2003) Automated selection of DAB-labeled tissue for immunohistochemical quantification. J. Histochem. Cytochem. 51, 575–584.
23. Fernandez, D.C., Bhargava, R., Hewitt, S.M., and Levin, I.W. (2005) Infrared spectroscopic imaging for histopathologic recognition. Nat. Biotechnol. 23, 469–474.

Chapter 13

Plant and Crop Databases

David E. Matthews, Gerard R. Lazo, and Olin D. Anderson

Summary

Databases have become an integral part of all aspects of biological research, including basic and applied plant biology. The importance of databases continues to increase as the volume of data from direct and indirect genomics approaches expands. What is not always obvious to users of databases is the range of available database resources, their access points, or some basic elements of database querying. This chapter briefly summarizes the history of data access via the Internet and reviews some basic terms and considerations in dealing with plant and crop databases. The reader is directed to some of the major publicly available, Internet-accessible databases, with summaries of the major focus of each, and several examples are given to illustrate how to access plant genomics data. Finally, an outline is given of some of the issues facing the future of plant and crop databases.

Key words: Databases, Genomics, Bioinformatics, Plant, Crop, Internet.

1. Introduction

When we refer to plant and crop databases, we mean those databases that are generally available to any user over the Internet. In today’s research environment, ready accessibility is a sine qua non for a database to be considered relevant to the general thrust of plant research. Local or restricted-access databases play little, if any, role in the broad advancement of plant sciences. The central role of the Internet in plant sciences traces back to the foundation of the Internet and its initial purpose of transferring large data sets between specific laboratories and agencies for targeted projects supported by the US Department of Defense, through links that became the Advanced Research Projects Agency Network (ARPANET), the technical

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513 © Humana Press, a part of Springer Science + Business Media, LLC 2009 DOI: 10.1007/978-1-59745-427-8_13


core of the future Internet. Soon, the concept was adopted by the National Science Foundation (NSF) to connect supercomputing facilities through the NSFnet systems. The merger of these and other nets and the internationalization of the connections evolved into the Internet proper. The realization that use had expanded beyond the initial purposes led to the open-access vision of the Internet as we know it today and to the networking infrastructure and telecommunications now available. In the early days of general Internet access, data sharing was limited to text and file transfer using such services as gopher, Wide Area Information Servers (WAIS), and File Transfer Protocol (FTP). The development of hypertext interfaces and graphical browsers heralded the World Wide Web (WWW) and an explosion of both users and applications, such as specialty online databases and other information resources. In a relatively short time, 10–15 years, Internet-accessible information has become an integral part of the scientific enterprise, such that, for many fields including the plant sciences, it now seems impossible to conceive of significant future progress being made without the Internet and the databases and other resources it makes available. This is particularly true as the information flow from genomics and other high-throughput technologies accelerates its impact on all aspects of plant sciences. Although the Internet is one of the foundations of modern science, the actual pillars that directly support modern biological science are three interrelated resources/tools: databases, bioinformatics, and computational biology. These three overlap, and the formal definitions vary with each proponent. For an extended listing of various definitions, we refer the reader to http://www.geocities.com/bioinformaticsweb/definition.html.

For present purposes, general descriptions of these three areas will suffice. Databases are facilities and tools that allow researchers to use computers to handle large and/or complex data sets, search and analyze those data sets, and assist in reaching conclusions and forming hypotheses. Bioinformatics is the research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral, or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. Computational biology is the development and application of data-analytical and theoretical methods, mathematical modeling, and computational simulation techniques to the study of biological, behavioral, and social systems. Together these three deal with the storage of data, the access and analysis of data, and the development of theoretical frameworks for new algorithms. The focus of the present chapter is plant databases that specialize in data related to plant genetics, molecular biology, genomics, and data relevant to disciplines utilizing these tools: what


resources are available, what are some general common principles in querying databases, some examples of typical queries, and brief speculations on where plant databases are moving.

2. Plant Databases

Why are databases needed at all? Some simple examples quickly answer the question. The current size of the major expressed sequence tag (EST) collections for plants at GenBank is shown in Fig. 1A. Four plants (Arabidopsis, rice, maize, and wheat) each have over a million ESTs, with 16 other plants having 100,000 to almost 600,000. It is not only the current volume that is significant but also the rate of change. The growth of wheat ESTs is shown in Fig. 1B. There were only nine wheat ESTs in 1999, but the number had risen to over 1 million by mid-2007, illustrating the rapid growth of such data. A similar trend has been experienced with other plant species. Computational capability, databases, sophisticated search capabilities, and connections to other data types are mandatory to utilize such resources. Whole plant genome sequences, whose number will only grow over time, further enrich, and complicate, the possibilities. Currently, only the genomes of Arabidopsis, rice, and poplar have been fully sequenced, but other genomes are being sequenced or are under consideration for sequencing, a process that will only accelerate with technological improvements in sequencing methodology.

2.1. A Survey of Databases and Resources

We will discuss the features of databases, the most common methods of querying, examples of the types of queries possible from databases, and some comments on database use. Following that will be a survey of the major publicly available plant database resources, including a summary of the individual focus and mission, the organisms covered, the main classes of data, and some pointers on using these sources.

2.2. Features of Databases

By their nature, databases must provide something that users want – otherwise they would have no utility and be either unused or supplanted by more useful resources. How to deliver what users want is typically an interplay between database staff and users. Whether by design or by serendipity, database resources evolve to meet users’ needs or the database disappears. Thus, the users have a central role in database design and should be recognized for their contribution. The concept that a database system can be designed and evolve based solely on visions of originators and/or maintainers is quixotic – who can possibly know all possible needs and foresee all new directions? The degree of this interaction will


Fig. 1. Plant expressed sequence tags (ESTs). (A) Number of ESTs for the top 20 plant species in dbEST at the National Center for Biotechnology Information (NCBI) on July 20, 2007. The species in order (highest to lowest) are Arabidopsis thaliana (A), Oryza sativa (O, rice), Zea mays (Z, maize), Triticum aestivum (T, wheat), Brassica napus (B, oilseed rape), Hordeum vulgare (H, barley), Glycine max (G, soybean), Pinus taeda (P, loblolly pine), Vitis vinifera (V, grape), Solanum lycopersicum (S, tomato), Malus × domestica (M, apple tree), Saccharum officinarum (S, sugarcane), Medicago truncatula (M, barrel medic), Solanum tuberosum (S, potato), Sorghum bicolor (S, sorghum), Gossypium hirsutum (G, cotton), Physcomitrella patens (P, moss), Lotus japonicus (L, trefoil), Picea sitchensis (P, Sitka spruce), and Picea glauca (P, white spruce). (B) Increase in wheat ESTs.

vary depending on the nature of the underlying data types and the specific users. A major resource such as GenBank specializes in sequences and associated information, and as such may have relatively little intense interaction with users. In contrast, there are more specialized resources, such as GrainGenes, which


focuses on small grains crops and their improvement, where the gathering and curation of data is a daily interactive flow between the database staff and data users and generators. Individual databases also vary in the spectrum of data and services they deliver. Many database sites actually provide varying suites of services and specialized portals accessing a central core of data. Increasingly, databases link among themselves and/or share data, both to avoid redundancy and to enrich the data they present; for example, GenBank is the central DNA sequence repository and links that sequence data to related data and presentations, such as some of the main genetic and physical maps found in GrainGenes. As a complement, GrainGenes may contain the complete set of small grains genetic maps linked to curated information on markers and germplasm populations, with the most relevant DNA sequence information pulled from GenBank, and with links and references back to GenBank for all other DNA sequences. Added to this synergism are projects such as Gramene, which emphasizes comparative plant genomics and both links to and incorporates data from GenBank and GrainGenes.

3. Current Databases

Let us briefly survey the most relevant Internet-accessible database assets currently available to plant researchers: what each covers, how to access them, and how they interrelate. Readers are encouraged to become familiar with all of those listed, since their focuses and ranges of data are often complementary. Additional formal database sites may also be useful, as may many specific project Web pages. The latter tend to be ephemeral, but the reader should bookmark those active sites that may contain specific useful data. Most databases are not built just for queries, but usually have tools that extend the utility of the information that can be gleaned from the site. Even if a specific database does not seem directly relevant to a researcher's interests, users are encouraged to at least browse the other sites for the depth of data and the potential for comparative interpretations. Not listed are numerous smaller and/or more specialized databases, many established for specific projects. Many of the sites not listed below tend to be temporary or not regularly maintained, but readers are encouraged to be alert for such resources that may be of interest to their own research.

3.1. National Center for Biotechnology Information

The National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov) is a primary source for DNA and protein sequences of all organisms. Sequence similarity searches, such as


with Basic Local Alignment Search Tool (BLAST) versions, can be done using a known sequence to find the closest matches in all other organisms. Unigenes suggest how known sequences cluster, and selected available maps connect directly to markers with known sequences. A useful BLAST feature at NCBI is the ability to restrict a search by species; that is, a query can be directed only to specific taxonomic groups.

3.2. European Molecular Biology Laboratory: European Bioinformatics Institute

European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) (http://www.ebi.ac.uk) is Europe’s primary collection of molecular biology and genomics data and also contains a copy of the known DNA and protein sequences. This site contains links to an impressive array of tools and data collections that, although focused on non-plant organisms, are well worth becoming familiar with; one example is the Ensembl project for automating the annotation of large genomes.

3.3. DNA Databank of Japan

DNA Databank of Japan (DDBJ) (http://www.ddbj.nig.ac.jp) is the official DNA data bank in Japan. It collects DNA sequences mainly from Japanese researchers but also accepts data from, and issues accession numbers to, researchers in any other country. DDBJ, NCBI, and EMBL exchange the collected data on a daily basis so that the three data banks share virtually the same data at any given time. In addition, DDBJ provides many tools for data retrieval and analysis developed at DDBJ and elsewhere.

3.4. The Institute for Genomic Research

The Institute for Genomic Research (TIGR) (http://www.tigr.org) was recently merged into the J. Craig Venter Institute, an organization focused on aspects of genomic research throughout all classes of living organisms. Efforts in plants include developing bioinformatics resources for annotating plant genomes. A centralized Website contains all available sequence data for specific well-studied plant species and provides tools for analyzing those sequences. For example, the rice genome is used to link to syntenic genetic markers of other plants. Other resources include searchable and downloadable assemblies of plant ESTs, linkage of Affymetrix DNA arrays to mapped plant sequences, and numerous statistics on available plant DNA sequences.

3.5. The Arabidopsis Information Resource

The Arabidopsis Information Resource (TAIR) (http://www.arabidopsis.org) maintains a database of genetic and molecular biology data for the model higher plant Arabidopsis thaliana. As such, TAIR provides data including the complete genome sequence along with gene structure, gene product information, metabolism, gene expression, DNA and seed stocks, genome maps, genetic and physical markers, publications, and information about the Arabidopsis research community.


3.6. Gramene

Gramene (http://www.gramene.org) focuses on comparative genome analysis in the grasses, with emphasis on the application of models, such as rice and Brachypodium (a planned future addition), to the grasses. Gramene provides cross-species homology relationships of genomic and EST sequences, protein structure and function analysis, genetic and physical mapping, interpretation of biochemical pathways, gene and quantitative trait loci (QTL) localization, and descriptions of phenotypic characters and mutations. Gramene also provides viewing of selected non-rice maps, mappings of sequences to the rice genome, and comparative links.

3.7. PlantGDB

PlantGDB (http://www.plantgdb.org) has as its main objectives to develop plant species-specific EST and genome survey sequencing (GSS) databases, provide Web-accessible tools and inter-species query capabilities, and provide genome browsing and annotation capabilities. PlantGDB attempts to aid in the organization and interpretation of genomic sequence data through the development and implementation of integrated databases and analytical tools, including the estimation and characterization of the plant gene space, the extent and conservation of alternative splicing in plants, and the development of algorithms and statistical methods for splice site recognition and gene structure prediction.

3.8. GrainGenes

GrainGenes (http://wheat.pw.usda.gov) focuses on what are termed the “small grains” crops (wheat, barley, rye, triticale, and oats) and extends to members of their two grass tribes. Data on non-crop Triticeae and Aveneae are included because of the value of the germplasm within these tribes and because so much of the breeding effort for the small grains includes tapping into the gene pools of wild relatives. GrainGenes is the source of the broadest range of data on these grasses, and includes the most comprehensive source of maps (linkage and physical) and associated molecular markers, plus DNA sequences and links to related data such as germplasm and traits. Associated databases implemented at the request of the community include TREP (Triticeae Repeat Database; http://wheat.pw.usda.gov/TREP) and wEST (Wheat Expressed Sequence Tag database, with detailed information on bin-mapped ESTs in the wheat genome; http://wheat.pw.usda.gov/wEST). In addition, GrainGenes provides small grains community services such as posting of meetings and position openings, and support for Web sites and databases for specific small grains projects.

3.9. SHared Information of GENetic Resources

SHared Information of GENetic Resources (SHIGEN) (http://www.shigen.nig.ac.jp) provides access mainly to data generated from genomics projects in Japan. Some specialized tools/data include BLAST to specific ESTs generated from the research


model organisms, software tools for handling BLAST output files, a facility to request germplasm and ESTs held at sites within Japan, and extractable DNA sequences from specific cultivars. This site houses databases and information for a range of organisms, with significant resources on wheat, rice, barley, and legumes. For example, the Komugi database (http://www.shigen.nig.ac.jp/wheat/komugi) focuses on wheat genomics, and the Wheat Information Service (WIS; http://www.shigen.nig.ac.jp/ewis) is a non-peer-reviewed Internet journal for rapid dissemination of wheat community news, technical tips, protocols, mutant and germplasm collection descriptions, and other topics of potential interest to the wheat research community.

3.10. International Crop Information System

Developed by the Consultative Group on International Agricultural Research (CGIAR), the International Crop Information System (ICIS) (http://www.icis.cgiar.org:8080) is a database system for the management of global information on genetic resources and crop improvement for any specific crop, with information on individual germplasm as the centerpiece of the databases. For example, GWIS is the wheat implementation of ICIS and includes germplasm pedigrees, field evaluations, structural and functional genomic data (including links to external plant databases), and environmental (geographic information system, GIS) data. Implementations can be downloaded from the ICIS/CGIAR Websites.

3.11. Germplasm Resources Information Network

Germplasm Resources Information Network (GRIN) (http://www.ars-grin.gov) is the US Department of Agriculture-Agricultural Research Service (USDA-ARS) program that provides germplasm information about plants, animals, microbes, and invertebrates important for US food and agricultural production. Searches can be carried out for specific germplasm accessions (http://www.ars-grin.gov/npgs/acc/acc_queries.html), and GRIN provides access to ordering seeds from the ARS germplasm collections such as the Small Grains Germplasm Repository in Aberdeen, Idaho (http://www.ars-grin.gov/npgs/order.html).

3.12. Plant Expression Database

Plant Expression Database (PLEXdb) (http://www.plexdb.org) is a resource for gene expression for plants and plant pathogens, and while focusing on gene expression data also attempts to integrate new and rapidly expanding gene expression profile data sets with traditional structural genomics and phenotypic data. The tools at PLEXdb allow investigators to use common data features across plants for a comparative approach to functional genomics through use of large-scale expression profiling data sets. In addition, wheat- and barley-specific Affymetrix DNA microarray data sets can be queried.


3.13. Maize Genome Database

Maize Genome Database (MaizeGDB) (http://www.maizegdb. org) is the USDA-ARS database devoted to maize and its immediate relatives, and is the most comprehensive data source for maize. It serves as the maize community database for genetic, genomic, sequence, gene product, functional characterization, literature reference, and person/organization contact information.

3.14. SoyBase

SoyBase (http://soybase.ncgr.org) is the USDA-ARS database for genetic, phenotypic, and other information about soybean. The SoyBase home page (http://soybase.agron.iastate.edu) provides an entry point to the SoyBase database, which is hosted at the National Center for Genome Resources (NCGR; http://www.ncgr.org), plus other soybean information and community links. The data within the database are accessible via a Class Browser, a Text Search, and an Ace Query. The SoyBase home page provides the latest news about the database, links to other soybean and legume sites, unpublished data that have not yet been incorporated into SoyBase, and many other items of interest to soybean researchers.

3.15. National Center for Genome Resources

National Center for Genome Resources (NCGR) (http://www.ncgr.org) is a non-profit research institution dedicated to applying the interactions of bioscience, computing, and mathematics to general issues in the biological sciences. NCGR carries out research and development of software and computational tools to improve treatment of diseases and nutrition; hosts the soybean database; hosts the Legume Information System (LIS; http://www.ncgr.org/ourwork/#lis), which integrates genetic and molecular data from multiple legume species and enables genomic, transcript, and map cross-species comparisons; and carries out a number of research projects on specific topics and/or species.

3.16. SOL Genomics Network

SOL Genomics Network (SGN) (http://www.sgn.cornell.edu) is oriented to genomic, genetic, and taxonomic information for species in the Euasterid clade, particularly Solanaceae (e.g. tomato, potato, eggplant, pepper, and petunia) and Rubiaceae (coffee) families. Genomic information is presented in a comparative format and tied to the fully-sequenced Arabidopsis genome. SGN is a part of the International Solanaceae Initiative (SOL), which has the long-term goal of creating a network of resources and information to address key questions in plant adaptation and diversification. In addition to a wide range of genomics data, SGN makes available bioinformatics tools for general use, including BLAST searches, the SolCyc biochemical pathways database, a CAPS (Cleaved Amplified Polymorphic Sequences) experiment designer, an intron detection tool, an advanced Alignment Analyzer, and a browser for phylogenetic trees.


4. Searches and Queries

How a database is used depends on the type of information wanted from it; some information is straightforward to retrieve, whereas other information is more complex in nature. The format in which the data are provided can also vary from database to database; some focus on ease of use, whereas others try to provide more information-rich files, which a power user may sometimes prefer for doing their own analyses. The breadth of the data provided and the way they are provided are mainly decisions for the developers of the host site. The forms of data provided might range from a simple one-page visualization to a downloadable file that can be used for other purposes. In many cases, the host site tries to enrich the data and provide them in a user-friendly fashion. However, because of the volume of data provided nowadays, mainly an outcome of high-throughput technologies, curating the data becomes more difficult. One thing to look for in a database Website is a link to contact information for the host personnel, so that you can suggest better ways to accommodate the types of uses that are most valuable to the public. A popular suggestion is for databases to communicate better with other databases; this has promoted standardization of the way information is provided through databases and has strengthened the way information is sought when using the Web. It allows many databases to capitalize on their own areas of expertise and refer to other resources that specialize in other kinds of data.

4.1. Types

The basic function of a database is to house information, and there are different ways of extracting the information and applying it for its intended use. Databases aim to provide a range of search modes, from very simple to very powerful but not so simple. The following are different ways in which information can be retrieved from a database.

4.1.1. Full Text Search

One of the simplest queries is the “full text search”; this is similar to the familiar Google application and is completely straightforward – simply enter in your keywords and any indexed hits will be retrieved. When the keyword is specific, full text searching is the quickest way to go. One drawback is if your search term is not specific enough you will get a lot of unrelated material.
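The behaviour described above can be sketched with a few toy records. This is only an illustration under stated assumptions: the records, the field names (`class`, `name`, `remark`), and the function `full_text_search` are hypothetical and are not taken from any real database, and a production system would use a prebuilt index rather than scanning every record.

```python
# Toy records standing in for database entries (illustrative only).
RECORDS = [
    {"class": "Gene", "name": "Sr24", "remark": "Stem rust resistance gene"},
    {"class": "Gene", "name": "Lr19", "remark": "Leaf rust resistance gene"},
    {"class": "Probe", "name": "PSR1203", "remark": "Probe for loci near Sr24"},
    {"class": "Locus", "name": "Xpsr1203-3A", "remark": "Locus detected by PSR1203"},
]


def full_text_search(records, keyword):
    """Return every record containing the keyword in any field (case-insensitive)."""
    needle = keyword.lower()
    return [
        rec for rec in records
        if any(needle in str(value).lower() for value in rec.values())
    ]


if __name__ == "__main__":
    # A specific keyword gives a short, relevant hit list.
    print([rec["name"] for rec in full_text_search(RECORDS, "Sr24")])
    # A vaguer keyword also pulls in records the user may not care about.
    print([rec["name"] for rec in full_text_search(RECORDS, "rust")])
```

In this sketch, searching for "rust" when only stem rust is of interest also returns the leaf rust record, which is the drawback of a non-specific search term.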

4.1.2. Class-Qualified Search

If the keyword is less specific, the query needs to focus on an appropriate database for the term, and it should narrow in on the relevant data category, requiring a “class-qualified search.” Most


databases house their data based on data classes. As an example, the GrainGenes database has about 30 data classes. If the search term were “Sr24”, the most obvious search would be in the class “Gene.” Sr24 is the name of a plant gene for resistance to a specific variant of the pathogen causing the disease stem rust. Other possibilities to explore would be “Locus” and “Allele.” (In GrainGenes the default class-qualified search is “All”, i.e., all classes, which is sometimes convenient but not for the most degenerate cases like the gene name.) These types of searches require some familiarization with the available data classes. In most cases, users can look for links to background information on the database structure; sometimes this is intuitive, but perhaps in most cases not. Class-qualified searches retrieve only records whose name contains the search word, not those that contain the word “Sr24” anywhere in the text of the record. This improves the selectivity of the search. The trade-off is that a search for “Sr24” does not directly return all related information, as some maps contain the Sr24 locus but are linked under another name, PSR1203, and can be reached through probes and markers located near this gene. For example, the record for Probe PSR1203 links to locus records Xpsr1203, Xpsr1203-3A, …, including Sr24, which have links to corresponding maps.

4.1.3. Field-Qualified (“Boolean”) Searches

The next level of specificity and power is for cases where there is not a keyword at all, but another kind of criterion for which a list of all appropriate hits is desired. Following are three examples of such queries, in general terms: (1) find microsatellite markers on the short arm of wheat chromosome 2B, (2) find ESTs from drought-stressed leaf tissue, and (3) find genes for resistance to stem rust, leaf rust, or stripe rust. Queries like the above are the reason structured databases exist. Unfortunately, this is also where their interfaces become inconvenient. A list of 30 data classes is manageable for a quick search, but the list of 30 or so fields in each class requires effort and time, especially since the field names and their range of values are rarely self-explanatory and never used consistently across all databases. Examples of “fields” in the above queries would include “probe,” “sequence,” and “gene.” Some documentation of the field names is usually available but in practice the quickest approach is to examine a few records and infer the usage. The two approaches to field-qualified searches are query builders and query languages.
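The difference between the three search styles described so far can be made concrete with a small sketch. The records, the data-class and field names, and the helper functions below are all invented for illustration; they do not reflect the actual GrainGenes schema or any real database interface.

```python
# Toy records; every value here is illustrative, not real database content.
RECORDS = [
    {"data_class": "Gene", "name": "Sr24", "remark": "stem rust resistance"},
    {"data_class": "Locus", "name": "Xpsr1203-3A", "remark": "locus linked to Sr24"},
    {"data_class": "Probe", "name": "PSR1203", "remark": "detects loci near Sr24"},
    {"data_class": "Gene", "name": "Lr-toy", "remark": "leaf rust resistance"},
    {"data_class": "Locus", "name": "Xtoy1", "remark": "microsatellite marker"},
]


def full_text_search(records, keyword):
    """Return records in which the keyword occurs in any field."""
    kw = keyword.lower()
    return [r for r in records if any(kw in str(v).lower() for v in r.values())]


def class_qualified_search(records, data_class, keyword):
    """Return records of one data class whose name contains the keyword."""
    kw = keyword.lower()
    return [
        r for r in records
        if r["data_class"] == data_class and kw in r["name"].lower()
    ]


def field_qualified_search(records, **criteria):
    """Boolean search: AND across fields, OR across a list of values."""
    def matches(record):
        for field, wanted in criteria.items():
            options = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
            value = str(record.get(field, "")).lower()
            if not any(opt.lower() in value for opt in options):
                return False
        return True

    return [r for r in records if matches(r)]


if __name__ == "__main__":
    print([r["name"] for r in full_text_search(RECORDS, "Sr24")])
    print([r["name"] for r in class_qualified_search(RECORDS, "Gene", "Sr24")])
    rust_genes = field_qualified_search(
        RECORDS,
        data_class="Gene",
        remark=["stem rust", "leaf rust", "stripe rust"],
    )
    print([r["name"] for r in rust_genes])
```

In this toy data the full text search returns the gene, the locus, and the probe, whereas the class-qualified search returns only the gene record, which mirrors the selectivity trade-off discussed for Sr24.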

4.1.4. Query Builders

This is usually a form-based interface in which a pre-configured query simply needs the user to select field names and values, which are then combined in a Boolean fashion to produce a result. Query builders are friendlier for most users. A good example is BioMart (http://www.biomart.org), a generic relational database interface with flexibility to manage a rich set of possible field structures. See GrameneMart for another example (http://www.gramene.org/Multi/martview). Some query builders, including this one, provide the ability to specify not only which fields should be searched but also which fields should be returned to the user, in a table format that is, of course, downloadable. This feature is much more useful than just a list of the names of the records found.

4.1.5. Query Languages

This is the raw query form under which a database operates. Query languages are less commonly found in plant databases, presumably because they are useful only to “power users” who spend enough time with a particular database to learn its language. However, for such users a sufficiently simple query language, for example, NCBI’s Entrez, is always faster and more convenient. In addition, query languages are frequently more powerful because they can access everything in the database instead of only what has been predefined by the curators for inclusion in the query-builder interface. The native query language for most databases is SQL (Structured Query Language). Query languages provided to users are usually non-native for two good reasons: they can be made more intuitive syntactically and, more importantly, SQL requires detailed knowledge of the underlying structure of tables and relationships in the particular database. Nonetheless, SQL provides access to absolutely everything in a relational database, whereas even the best emulated query languages can become equally unfriendly and complex in trying to grant total data access. The GrainGenes plant database provides a direct WWW SQL interface (http://wheat.pw.usda.gov/cgi-bin/graingenes/sql.cgi) that has been quite successful in allowing exchange of data between curators of other databases; it is one of the few databases that allow such direct access over the Internet. Developing such an interface for public use requires close scrutiny of the class tables and their interrelationships, and assurance that system security can be maintained. Essentially, all your system settings can be determined through this interface, and this security risk is probably why more databases do not allow such access. SQL is definitely in the power-users category, primarily for professional data curators and database programmers; for example, curators from Gramene and NCBI use the GrainGenes SQL interface to extract data for their own databases.
However, sophisticated users can design as complex a query as the data, the database structure, and the user’s imagination can support. Similarly, database staff can use SQL to provide answers to users’ requests; rather than receiving a static table of output, the user can get the SQL query itself to run via the Web interface, can re-run it at any later time to get the current data, and can modify the query if desired. For example, the GrainGenes query shown below retrieves all maps containing loci identified by the probe BCD372 and also retrieves the related map positions.

    select probe.name, locus.name, map.name from probe\
    join locusprobe on probe.id = locusprobe.probeid\
    join locus on locusprobe.locusid = locus.id\
    join maplocus on locus.id = maplocus.locusid\
    join map on maplocus.mapid = map.id\
    where probe.name = "BCD372"

The query above can be easily modified to search for probe CDO64 instead of BCD372 by simply replacing the probe name. A painless way to deliver some of the power of SQL is for the database curator to write the query, with parameters that can be user supplied from a friendly Web form. Such queries can be relatively simple and general, but they can also be arbitrarily complex to address specific questions that are frequently asked. Users often have common queries they will make repeatedly or queries common to many users. In such cases it can be efficient for database staff to pre-design queries once the user community has identified likely query candidates. Such quick queries can both ease users’ utilization of the database and serve to introduce them to querying where it might otherwise seem onerous to learn querying languages or protocols. An example is in Fig. 2, which shows a section of the GrainGenes Quick Query Web page. Three simple quick queries are shown: finding all loci within a specified distance in centiMorgans (cM) of a reference locus (cdo431 in this example) on any map, finding known genes within a specified distance of a reference locus (cdo64), and listing all loci between two reference loci (cdo64 and adh).

Fig. 2. Quick Queries. At the top web page for the GrainGenes database (http://wheat.pw.usda.gov), click on “Quick Queries” to view a list of pre-written Structured Query Language (SQL) queries. Shown is a section with three queries related to maps and loci. Users can fill in the boxes to specify search restrictions, for example, probe names and map distances.

4.1.6. Batch Queries

An additional important feature, rarely found in either query builders or query languages, is the ability to submit a long list of query terms all at once instead of one at a time; for example, see http://wheat.pw.usda.gov/cgi-bin/graingenes/batchsql.cgi. A similar feature is found at NCBI with batch Entrez where specific sets of sequences can be retrieved all in one session, and direct query results can be used to fill a batch request (http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi). The power of this interface is that once the list is obtained and processed, the data can be downloaded to the user’s desktop in a wide variety of formats, for example, FASTA, GenBank, and GI Accessions. This functionality does not require any particular user expertise, and many users have asked for it. Its usefulness speaks for itself. The lack of it in so many existing, otherwise powerful, databases is puzzling.
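The ideas of a curator-written query with a user-supplied parameter, and of submitting a whole list of terms at once, can be sketched together in Python with the standard sqlite3 module. This is a toy under stated assumptions: the table and column names follow the GrainGenes probe query given earlier in this section, but the in-memory database, its contents (the locus and map names are invented), and the helper functions are hypothetical and say nothing about how the GrainGenes Web interfaces are actually implemented.

```python
import sqlite3

# Minimal schema mirroring the tables named in the probe query (assumed layout).
SCHEMA = """
CREATE TABLE probe (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE locus (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE map (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE locusprobe (locusid INTEGER, probeid INTEGER);
CREATE TABLE maplocus (mapid INTEGER, locusid INTEGER);
"""

# The probe name is a placeholder (?) so the user never edits the SQL itself.
QUERY = """
SELECT probe.name, locus.name, map.name
FROM probe
JOIN locusprobe ON probe.id = locusprobe.probeid
JOIN locus ON locusprobe.locusid = locus.id
JOIN maplocus ON locus.id = maplocus.locusid
JOIN map ON maplocus.mapid = map.id
WHERE probe.name = ?
ORDER BY locus.name, map.name
"""


def build_toy_database():
    """Create an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO probe VALUES (?, ?)", [(1, "BCD372"), (2, "CDO64")])
    conn.executemany(
        "INSERT INTO locus VALUES (?, ?)", [(1, "Xbcd372-toy"), (2, "Xcdo64-toy")]
    )
    conn.executemany("INSERT INTO map VALUES (?, ?)", [(1, "ToyMap-A"), (2, "ToyMap-B")])
    conn.executemany("INSERT INTO locusprobe VALUES (?, ?)", [(1, 1), (2, 2)])
    conn.executemany("INSERT INTO maplocus VALUES (?, ?)", [(1, 1), (2, 2)])
    return conn


def batch_query(conn, probe_names):
    """Run the same parameterized query once per probe name in the list."""
    return {name: conn.execute(QUERY, (name,)).fetchall() for name in probe_names}


if __name__ == "__main__":
    connection = build_toy_database()
    for probe, rows in batch_query(connection, ["BCD372", "CDO64", "UNKNOWN"]).items():
        print(probe, rows)
```

Passing the probe name as a bound parameter rather than pasting it into the SQL text is one common way to limit the security exposure mentioned above, although it does not by itself make a public SQL interface safe.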

4.2. Comparison of Queries

Given the different types of queries and the different types of information available from databases, we hope to illustrate how queries can return different forms of information. Each of the available public databases is different in appearance, organization, and modes of data access. However, most operate on underlying relational database systems with common features. In a previous example, a search was made for the wheat stem rust-associated gene Sr24. The same search run against various databases gives completely different results, each of which nevertheless ties information together for the user. In the GrainGenes database, a database in which the information concerns a pathogen of one of its species, several information points were available about the disease, gene, probes, locus, and maps. In the Gramene database, the information was more restricted to map location, and in some cases a probe linked to the gene name was used instead, but a link between the probe and the gene was not evident. A search for the term Sr24 at NCBI returned a listing of several nucleotide sequences and literature links containing the keyword, some related and some not; a single related nucleotide record was obtained for the term PSR1304. A survey of the TIGR research site turned up several interesting files and records of wheat genes and probe maps along with other information, but neither the term for the gene nor that for the probe was found. Conversely, when other information at the Gramene, TIGR, and NCBI research sites was analyzed, important information relating back to the wheat plant was also uncovered, demonstrating the power of comparative mapping and revealing how closely related similar species might be. Overall, this may be important in setting a foundation for research in new areas where information from one database may help resolve information in another database. It is important, however, for the user to understand the centric nature, that is, the particular focus, of the database they are using in order to benefit the most from the data provided.

4.3. Examples

All of the listed databases are rich resources for plant genomic data, and it is beyond the present format to list all that is available. As only a taste of the breadth of data and systems available, a few varied examples are given.

4.3.1. Grass Comparative Analysis with Rice Model

A snapshot of the rice genome is displayed at the Gramene database as shown in Fig. 3 for a portion of rice chromosome 1. Only a fraction of the total display is given in Fig. 3, but includes results from gene modeling of this section of the rice genome using results from a TIGR analysis (a good example of the cross-linking/interactions of databases), and indications where ESTs and markers from other plants match best to the rice genome. Each of the EST and marker rows is collapsed and can be expanded by clicking to reveal the complete set of matches.

Fig. 3. Comparative analysis at Gramene of multiple grass expressed sequence tags (ESTs) and markers to the annotated rice genome.

4.3.2. Finding Markers Associated with Traits

Starting from a locus and going to relevant probes and other information is shown from GrainGenes in Fig. 4. By searching with the locus Sr24 (a stem rust-resistance gene in wheat), the user finds information (Fig. 4a) such as the wheat map containing Sr24, the primary probe, and other information (not all shown). Linking through “Show Nearby Loci” gives a table of nearby loci on that map (Fig. 4b; not all shown).

Fig. 4. Using GrainGenes to find map positions and linked markers to a wheat stem rust-resistance gene (Sr24). (A) The search result for the Sr24 locus (partial view). (B) Clicking on “Show Nearby Loci” in “A” gives a table of adjacent markers and loci (partial list).

4.3.3. Unigene Clusters Mapped onto the Poplar Genome

NCBI displays maps of numerous genomes along with links to genome sequence data as it becomes available. Shown in Fig. 5 is a graphical view of the assembled poplar genome sequence linked to map positions, NCBI contigs, predicted gene models, and alignments of Populus transcripts and NCBI Unigene clusters to the poplar genome.

Fig. 5. Linking map data and sequence annotation to the assembled poplar (Populus trichocarpa) genome sequence at the National Center for Biotechnology Information (NCBI).

4.3.4. Downloadable EST Assemblies

At TIGR, ESTs from 254 plants are assembled independently for each plant with at least 1,000 ESTs. A portion of the total list is shown (Fig. 6) for those plants with the most ESTs, and includes information on when the assembly was done (note the latest assembly for each plant may not contain all currently available ESTs), assembly characteristics, and total ESTs. Each assembly can be downloaded for further analysis.

Fig. 6. Information and downloadable expressed sequence tag (EST) assemblies for all plants with at least 1,000 ESTs from The Institute for Genomic Research (TIGR).

5. The Future of Plant and Crop Databases

Integration of data types and sources will continue to be a struggle in the future. In addition to the technical problems with integration, there is a need for vision at all community levels as to the role of databases in the plant sciences. Quality of data and the role of curation are interrelated. Much of what is published contains various levels of errors. What is the significance of different error types and what can be done to address the problem? Curation of data is a possible mitigating factor, but curation is resource intensive and not particularly valued by funding agencies. In some cases the sheer volume of data can perform a de facto cross-checking role, flagging errors through cross-links. In many other cases errors can propagate through data inter-connections and be difficult to root out. Database users should be aware that while the vast majority of data accurately represents what the data originators reported, no database entries are infallible; that is, use the databases, but be conscious of apparent discrepancies and check with data curators or originators as necessary.


Making the databases and related bioinformatics tools easily accessible is a continual problem. The reality is that many potential users will not use available resources for a number of reasons, including resources that are too difficult to learn and to extract data from, lack of basic training in the use of bioinformatics, and simple inertia at learning new tools. Appreciation of the role of databases and bioinformatics is often lacking or colored by competition for attention. Training of scientists for the current and future bioinformatics landscape is essentially an ad hoc exercise except for a few forward-thinking programs. Part of the solution is time, since younger researchers are more attuned to the importance of bioinformatics than many established researchers. But more formal training in all aspects of bioinformatics, including database essentials and use, will be a strong addition to the training of any future biological scientist.

An inherent problem of databases is the need to be flexible enough to allow both predicted and unexpected queries. By its very nature, science is exploratory, and the exact queries needed can never be totally predicted. The best that can be strived for is continual improvement in interfaces and querying options. It is of little use to have mountains of data that are not accessible. It is also important to have an element of browsability to the data. While specific queries may be the most visible uses of databases, there should be the possibility of higher-order perusal of data to allow the potential for recognizing unexpected associations.

In looking to the future, it can be instructive to look briefly backward. Twenty-five years ago computers were limited to major facilities such as university computer centers.
There was no Internet, and analysis of even a single gene sequence was commonly carried out by either primitive software, if available, or programs written by the researcher for whatever early computer was available, and there were no relevant databases. Today, a desktop computer typically has 1 GB or more memory, 100 GB disk storage, and speeds at 2 GHz and faster, with other computer resources such as parallel arrays becoming more commonplace. Databases accessible through the Internet place such resources as all public DNA sequences at the researcher’s fingertips. As both the volume of data and the power of computers increase, what is not keeping pace is the software to fully utilize the potentials and the expertise of users in accessing those potentials.

The amount of sequence data has dramatically increased over the last few years and will only accelerate. As new sequencing technologies come online and the costs continue their downward trend, there will always be “more” worthy sequencing projects. Already we see multiple sequencing projects within the same genus, with both the japonica and indica subspecies of Oryza sativa sequenced and additional Arabidopsis genome projects following that of A. thaliana. If it were feasible today, researchers would want the complete genome sequence of every line of every organism under study, and thus there is an effectively unlimited thirst for sequence information. Before the day arrives that makes that dream a reality, there will be whole genomes of additional plants, the already mentioned sequencing of additional versions of plant genomes, and intense re-sequencing of specific regions over tens, hundreds, and thousands of genomes. Already custom microarrays can be made to re-sequence hundreds of thousands of contiguous or dispersed DNA sequences. An additional major coming DNA sequence source will be associated with high-throughput genotyping. Re-sequencing to discover single nucleotide polymorphisms (SNPs) allows rapid genotyping through various array technologies. How many SNPs are necessary for breeding programs is still to be determined. Currently, the planning is based more toward a minimal number necessary for a given program, but as costs decline and higher resolutions are within range of breeding programs, the density of desired SNPs may approach the entire genome level.

There will also be more integration of data as knowledge, databases, and analysis tools interlink. Functional genomics data on mRNA transcription and expression will tie to proteomic analyses and metabolomics of entire plants. The complexity of possible higher orders of interactions can only be speculated upon, but the reasonable assumption is that it will dwarf our current limited views.

A consequence of more complex and voluminous data is the need for better visualizations. At some point, the human eye and brain cannot assimilate everything needed. Two likely developments will be better graphic tools to consolidate and summarize, and integration of data in a flexible enough manner to customize for each researcher. There will be the adoption of more simultaneous data presentations. This can already be done by using larger computer monitors or multiple monitors to have ready views of multiple programs simultaneously. This can be expanded into visions of whole-wall monitors and immersion in three-dimensional formats (if you saw the movie “Minority Report”). What will the future bring? It will surely involve ever more powerful computers, more computational capability, more sophisticated displays and tools, and greater expertise in the capabilities and exploitation of databases.

Chapter 14

Plant Genome Annotation Methods

Shu Ouyang, Françoise Thibaud-Nissen, Kevin L. Childs, Wei Zhu, and C. Robin Buell

Summary

Annotation of plant genomic sequences can be separated into structural and functional annotation. Structural annotation is the foundation of all genomics as without accurate gene models understanding gene function or evolution of genes across taxa can be impeded. Structural annotation is dependent on sensitive, specific computational programs and deep experimental evidence to identify gene features within genomic DNA. Functional annotation is highly dependent on sequence similarity to other known genes or proteins as the majority of initial “first-pass” functional annotation on a genomic scale is transitive. Coupling structural and functional annotation across genomes in a comparative manner promotes more accurate annotation as well as an understanding of gene and genome evolution. With the increasing availability of plant genome sequence data, the value of comparative annotation will increase. As with any new field, methodologies are evolving for genome annotation and will improve in the future.

Key words: Gene prediction, Genome sequence, Gene structure, Gene function.

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513 © Humana Press, a part of Springer Science + Business Media, LLC 2009 DOI: 10.1007/978-1-59745-427-8_14

1. Introduction

With the advent of rapid and inexpensive sequence technologies, it is now possible to work on a wide range of plant species and to have access to large sequence data sets. While the bulk of all sequence data is still in the form of transcripts derived from expressed sequence tags (ESTs), which are single-pass sequences from cDNA clones, genome sequences for a number of species have been generated or are in progress. Thus, a wide range of clades within the Plant Kingdom have entered the “genomics era” of biological research. Coupled with access to genomic or transcriptomic data sets is the ability to interpret the sequence in a biological context. The process of annotation or “adding notes” to sequence is a combination of computational and biological analyses in which the final output is of use to a biologist. Genome annotation can be roughly divided into structural and functional annotation. Structural annotation refers to finding gene structures within the underlying DNA sequence. This typically includes identification of the genes, their exons, introns, transcriptional start and stop sites and their translational start and stop sites. Structural annotation also involves identifying alternative gene models encoded within a gene, also known as a locus or transcriptional unit (TU). Promoters and other regulatory motifs can also be annotated. Functional annotation is defining the function of a sequence. The primary functional annotation for a gene is what it does in a cell, that is, what is the function of the predicted protein or nucleic acid. However, there are a number of other functional annotations that can be made, all of which assist in understanding the function of genes within an organism. As with sequencing technologies, annotation methods have developed in the last decade from rather primitive gene finding attempts to more sophisticated and richer analyses that utilize well-trained algorithms, curated data sets, robust ontologies and well-designed integrative methods to define a gene and its function. Annotation methods can be performed at large bioinformatic centers using pipelines designed for processing whole genomes and in which manual curation steps are invoked at discrete steps as warranted by biological interest and fiscal resources. Annotation can also be performed at a single gene or multiple gene level in which processes are computed through Web interfaces in which the biologist curates each stage of the process. Both automated and manual curations serve essential functions in genome annotation and are highly complementary.
While large-scale sequencing technologies are understood at a basic level by most biologists, there are nuances about the quality of sequence generated in genomic projects that are either not well understood or not even acknowledged by most users of these data sets. Understanding the quality of the underlying sequence is essential to avoid misinterpretations of the resulting annotation. Clearly, although large-scale genome sequencing projects have generated a wealth of resources for the greater biological community, it is “caveat emptor” and the user should be informed about the sequence and its quality before commencing to annotate the sequence. In this chapter, we describe basic sources of genomic sequence along with methods and resources of use to general plant biologists in the structural and functional annotation of small gene or genome sequences.


2. Materials

For genome annotation, there are two required materials: a genome sequence and a computer. For demonstration purposes, we have chosen to annotate a rice (Oryza sativa) genomic sequence that can be found on the finished bacterial artificial chromosome (BAC) clone OSJNBa0094F01. The sequence of this BAC can be obtained from the Plant Division of GenBank (1) using the Entrez nucleotide retrieval system (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi) and querying using its accession number (AC093713). We will work primarily with three loci (genes) on this BAC [LOC_Os03g58260 (Indole-3-glycerol phosphate lyase, chloroplast precursor, putative), LOC_Os03g58270 (retrotransposon protein, putative, unclassified), LOC_Os03g58280 (hypothetical protein)] which can be obtained from the MSU Osa1 Rice Genome Annotation Resource at http://rice.plantbiology.msu.edu/LocusNameSearch.shtml. Computational programs needed throughout this chapter are provided primarily in the form of publicly accessible Web interfaces and are noted in the respective subheadings. They are also summarized in Table 1. Alternatively, these programs can be downloaded onto a local machine with the requisite system and specifications.
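As an alternative to the Entrez Web form, the BAC record could in principle be retrieved from a script. The sketch below assumes NCBI's E-utilities efetch service and its db, id, rettype, and retmode parameters, which are not described in this chapter; the helper function names are hypothetical, and the network call is left out of the demonstration so that the example runs offline.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the NCBI E-utilities efetch service (not covered in the chapter).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="fasta"):
    """Build an efetch URL for a nucleotide record identified by accession."""
    params = {"db": "nuccore", "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_sequence(accession, rettype="fasta"):
    """Download the record as text (requires network access)."""
    with urlopen(efetch_url(accession, rettype)) as response:
        return response.read().decode("utf-8")


def parse_fasta(text):
    """Return (header, sequence) for the first record in FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # stop at the start of a second record
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    return header, "".join(chunks)


if __name__ == "__main__":
    # Only the URL is printed here; call fetch_sequence("AC093713") to download.
    print(efetch_url("AC093713"))
```

Once downloaded, the text returned by the hypothetical `fetch_sequence` could be passed to `parse_fasta` to obtain the sequence string for the analyses that follow.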

3. Methods

The methods have been divided into Sequence Quality, Structural Annotation, Functional Annotation and Visualization Tools. These four subheadings provide an overview on how to assess sequence quality, how to structurally annotate a genomic sequence, how to assign a function to a gene model and how to visualize your annotation in a graphical display. The reader is referred to Fig. 1 in each of the subheadings below for graphical representation of the respective data.

3.1. Sequence Quality

Before starting any annotation effort, it is essential to understand the quality of the underlying sequence, because problems with the sequence, whether low sequence accuracy or mis-assembly, will affect the quality of the resulting annotation. If you have performed the sequencing yourself, then you will have a reasonable idea of the sequence quality as you would know the sequence coverage (number of times each base has a supporting read) and the quality score for each base. A minimum level of coverage for a sequence is 2X sequence coverage in which each base has at least two independent sequence reads.
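Both quantities mentioned above, per-base coverage and per-base quality, are easy to compute once read placements and quality scores are in hand. The following sketch assumes reads are given as 0-based, half-open (start, end) intervals on the assembled sequence and that quality scores are phred-scaled, so that a score Q corresponds to an error probability of 10^(-Q/10); the function names and the example read positions are invented for illustration.

```python
def phred_to_error_probability(q):
    """Convert a phred-scaled quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def per_base_coverage(sequence_length, reads):
    """Count how many reads cover each base.

    reads: iterable of (start, end) intervals, 0-based and half-open.
    """
    coverage = [0] * sequence_length
    for start, end in reads:
        for pos in range(max(start, 0), min(end, sequence_length)):
            coverage[pos] += 1
    return coverage


def low_coverage_positions(coverage, minimum=2):
    """Return positions covered by fewer than `minimum` reads."""
    return [pos for pos, depth in enumerate(coverage) if depth < minimum]


if __name__ == "__main__":
    # Three invented reads placed on a 10-base sequence.
    depth = per_base_coverage(10, [(0, 6), (4, 10), (2, 8)])
    print(depth)
    print(low_coverage_positions(depth, minimum=2))
    for q in (20, 25, 30):
        print(q, phred_to_error_probability(q))
```

Positions reported by `low_coverage_positions` fall below the 2X minimum and would be candidates for additional sequencing.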


Table 1
Web-based resources for genome annotation

Resource                                URL

Sequence databases
Entrez Retrieval System                 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
Division of GenBank                     http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#GenBankDivisionB
Pfam Database                           http://www.sanger.ac.uk/Software/Pfam/search.shtml
InterProScan Database                   http://www.ebi.ac.uk/InterProScan/

Ab initio gene finders
FGENESH program                         http://sun1.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind
Genemark Hmm program                    http://exon.gatech.edu/GeneMark/eukhmm.cgi
Summary of ab initio gene finders       http://www.nslij-genetics.org/gene/

Repetitive sequence tools
RepeatMasker                            http://www.repeatmasker.org
RepeatMasker Web server                 http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker

Plant transcript databases
TIGR Plant Transcript Assemblies        http://plantta.tigr.org
PlantGDB Unique Transcripts             http://www.plantgdb.org/prj/ESTCluster/index.php
HarvEST                                 http://harvest.ucr.edu/
Sputnik                                 http://sputnik.btk.fi/ests
NCBI Unigene                            http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene

Gene Ontology
Gene Ontology Project                   http://www.geneontology.org
Goanna                                  http://agbase.msstate.edu/GOAnna.html
GoFigure                                http://udgenome.ags.udel.edu/gofigure
Map2slim                                http://www.godatabase.org/dev/pod/scripts/map2slim.html

MSU Osa1 Rice Resources
MSU Osa1 Rice Genome Locus Search       http://rice.plantbiology.msu.edu/LocusNameSearch.shtml
MSU Osa1 Rice Expression Search Page    http://rice.plantbiology.msu.edu/locus_expression_evidence.shtml
MSU Osa1 Rice FST mapping page          http://rice.plantbiology.msu.edu/BACmapping/FST_map.shtml
MSU Osa1 Rice Genome Browser            http://rice.plantbiology.msu.edu/cgi-bin/gbrowse/rice/

Genome Visualization Tools
Artemis Viewer                          http://www.sanger.ac.uk/Software/Artemis/v8/
Apollo Editor                           http://www.gmod.org/?q=node/4
Generic Genome Browser                  http://www.gmod.org/?q=node/71
GFF3 Format                             http://www.sequenceontology.org/gff3.shtml

In each read, a minimum quality score for each base should be set, typically a phred score greater than or equal to 25 or 30. Any base at which the phred base calls conflict should be examined based on the electropherograms, and discrepancies should be resolved through re-sequencing. As a wealth of sequence data is available in public databases, it is most likely that you will obtain your genomic sequence from a database. Thus, it is essential that you understand the “inherent” quality (or lack of quality) of sequence in a public repository. The main source for these sequences is GenBank which has several divisions for sequence based on the taxonomic origin of the sequence and/or the quality or type of sequence [http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#GenBankDivisionB; (1)]. Finished, high-quality plant genome sequence can be found in the Plant (PLN) division; this would include individual sequences generated by researchers and a subset of sequences generated in large-scale genome sequencing projects. It is assumed in this division that the sequences have been properly reviewed for quality and any portion of the sequence that fails to meet the basic quality levels has been noted within the accession record. Other large sequence data sets include single-pass sequences such as ESTs, BAC end sequences, whole genome shotgun sequences, gene enrichment sequences, and unfinished, draft BAC sequences. These can be found in the high-throughput genomic sequence (HTG), genome survey sequences (GSS), whole genome shotgun (WGS), EST and Trace Archives of GenBank. As these sequences represent either single-pass sequences or unfinished draft sequences, their quality, both at the sequence and the assembly level, should be interrogated. In the RefSeq collection, representative sequences such as pseudomolecules of whole genomes are available. These represent a unified set of sequences for an organism, typically derived from large-scale sequencing projects. With respect to impact on annotation, low-quality sequence such as ESTs, which are single-pass sequences, or BAC end sequences should be interpreted with caution, as what may appear to be a frame shift may in reality be a sequencing error. Thus, any critical sequence should be verified through re-sequencing. Sequences in the dbHTG are primarily unfinished and can contain not only sequencing errors but also mis-assemblies. Thus, any annotation of your gene of interest should be confirmed by manual inspection and/or experimental work. For sequences within the PLN division, large-scale genome sequences undergo a quality control process before submission and thus, except where noted, should be of high quality. However, if there are issues such as defining an open reading frame or potential chimeric genes, one should examine the assembly quality of the BAC.

Fig. 1. Graphical representation of annotation for the three test loci described in this chapter. The loci, the tracks, the methods utilized to generate these tracks are described throughout the chapter. The figure can be regenerated on the MSU Osa1 Rice Genome Browser (http://rice.plantbiology.msu.edu/cgi-bin/gbrowse/rice/) by pasting in the Landmark or Region box Chr3:33132358..33146997 and selecting the tracks shown in this figure.

3.2. Structural Annotation

3.2.1. Ab Initio Gene Finding

Gene prediction can be generally divided into two major groups: ab initio gene prediction (template method) and similarity-based gene prediction (lookup method or pattern recognition method; see below) (2). Ab initio gene prediction uses statistical and computational methods to build signal and content sensors that identify functional elements relevant to gene structures such as core promoters (e.g., TATA-box), splice sites, exons, introns, and translation initiation and termination sites. A majority of ab initio gene finders are composed of several different specific sensors that are integrated by either dynamic programming or Hidden Markov Models (HMM). All ab initio gene finders have limitations. First, even though specificity and sensitivity of some gene finders can be greater than 90% at the exon level (3), this extrapolates to fewer than 60% of five-exon genes being predicted completely accurately at the gene level (0.9^5 ≈ 0.59, if the exons are assumed to be predicted independently). Second, most gene finders cannot handle complicated gene structures and non-conventional biological signals such as (1) alternative splicing, (2) nested genes, (3) overlapping genes, (4) long introns, (5) non-canonical introns, (6) frameshift errors, (7) merged start codons (i.e., an authentic start codon which is split by an intron in the genomic sequence), and (8) introns in untranslated regions. The advantages of ab initio gene prediction programs are that they are very fast and require little computational effort; they are therefore widely used in automated genome annotation. Clearly, ab initio gene prediction plays an important role in identifying gene location and protein coding potential within a genome, thereby providing a rapid, preliminary analysis of the genome annotation.
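The extrapolation from exon-level to gene-level accuracy is simple arithmetic if one assumes, as a simplification that the chapter does not state explicitly, that each exon is predicted independently and with the same accuracy. A minimal Python sketch under that assumption (the function name is hypothetical):

```python
def gene_level_accuracy(exon_accuracy: float, n_exons: int) -> float:
    """Probability that every exon of a gene is predicted correctly.

    Assumption (ours, for illustration): exons are predicted independently
    and each has the same probability of being correct.
    """
    if not 0.0 <= exon_accuracy <= 1.0:
        raise ValueError("exon_accuracy must be between 0 and 1")
    if n_exons < 1:
        raise ValueError("n_exons must be at least 1")
    return exon_accuracy ** n_exons


if __name__ == "__main__":
    # 90% exon-level accuracy for genes with increasing exon counts
    for n in (1, 3, 5, 10):
        print(n, round(gene_level_accuracy(0.90, n), 3))
```

Under this model, accuracy at the gene level falls quickly as the number of exons grows, which is one reason why ab initio predictions for multi-exon plant genes benefit from supporting evidence.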


Ouyang et al.

FGENESH (4) and GeneMark.hmm (5) are two ab initio gene prediction programs that have strong performance in plants (3). These two programs are Web accessible at http://sun1.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind and http://exon.gatech.edu/GeneMark/eukhmm.cgi. In our case study of the rice BAC sequence AC093713, the FGENESH and GeneMark.hmm Web tools predicted 20 genes and 30 genes, respectively. These gene prediction results can be imported into visualization tools for comparison with other analysis results (see Subheading 3.4). Gene prediction-related publications have been reviewed by Mathe et al. (6), and a well-organized bibliography is maintained by Wentian Li (http://www.nslij-genetics.org/gene/).

3.2.2. Repeat Masking

In general, plant genomes are highly repetitive. Repetitive and low-complexity sequences are troublesome in the sequence assembly process. Sometimes, highly repetitive sequences are masked during the assembly of large genomic sequences and processed separately. When working on genomes with high repeat content, you can run ab initio gene prediction programs on repeat-masked genomic sequences to avoid interference by repeats. Alternatively, you can run the gene prediction programs on the unmasked sequences and then compare the output with the predictions from the repeat-masked sequences. In this way, repeat-overlapping genes can be readily identified without compromising the accuracy of the prediction, as can happen when only the masked sequences are used. Low-complexity and sometimes repetitive sequences should also be masked before a sequence similarity search to eliminate statistically significant but biologically uninteresting matches. RepeatMasker (http://www.repeatmasker.org) is very efficient in masking both low-complexity and interspersed repeats using species-specific repeat libraries. RepeatMasker comes with several eukaryotic repeat databases from Repbase (7), although custom libraries are allowed. A plant repeat database is available at MSU (8). In addition to masking repeats, RepeatMasker can also be used to identify known repeats in genomic sequences, generating a tabulated output file. However, RepeatMasker’s search program, CrossMatch, is computationally intensive. In MaskerAid (9), CrossMatch is replaced with WU-BLAST; MaskerAid performs comparably to RepeatMasker yet is ~30-fold faster. Both CrossMatch and WU-BLAST are available at the RepeatMasker Web server (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker).
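One way to flag repeat-overlapping gene predictions is to compare the coordinates of predicted genes with the coordinates of annotated repeats. The Python sketch below is only an illustration: the interval representation, the function names and the 50% cut-off are assumptions made here, not part of RepeatMasker or of the authors' pipeline, and the coordinates in the usage example are invented.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def overlap_fraction(gene: Interval, repeats: List[Interval]) -> float:
    """Fraction of the gene span covered by repeats (overlapping repeats counted once)."""
    g_start, g_end = gene
    covered = 0
    last_end = g_start - 1  # right-most gene position already counted
    for r_start, r_end in sorted(repeats):
        start = max(r_start, g_start, last_end + 1)
        end = min(r_end, g_end)
        if start <= end:
            covered += end - start + 1
            last_end = end
    return covered / (g_end - g_start + 1)


def flag_repeat_genes(genes: Dict[str, Interval],
                      repeats: List[Interval],
                      min_fraction: float = 0.5) -> List[str]:
    """Names of predicted genes whose span is mostly repeat (hypothetical cut-off)."""
    return [name for name, span in genes.items()
            if overlap_fraction(span, repeats) >= min_fraction]


if __name__ == "__main__":
    predicted = {"gene_1": (100, 199), "gene_2": (300, 399)}  # invented coordinates
    repeat_hits = [(150, 260), (120, 160)]                     # invented coordinates
    print(flag_repeat_genes(predicted, repeat_hits))
```

Genes flagged in this way are candidates for annotation as transposable element-related rather than as conventional protein-coding genes.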

3.2.3. Gene Model Support

EST/Full-Length cDNA

The largest and perhaps most important source of evidence for structural annotation of gene models is experimentally derived transcripts. These are primarily in the form of ESTs and full-length cDNAs (FLcDNAs). The numbers of ESTs and FLcDNAs vary significantly among species. For rice and maize, there are over 1 million ESTs and FLcDNAs, while for a species such as cassava (Manihot esculenta), there are fewer than 20,000. One issue with these large collections of ESTs and FLcDNAs is that they are highly redundant and, as they are single-pass sequences, their accuracy is low. This can be resolved by reducing these sequence sets to a set of assemblies that represents all of the transcripts and in which sequencing errors are minimized by generation of consensus sequences. There are several groups that actively generate assemblies of ESTs and FLcDNAs: the TIGR Plant Transcript Assemblies Project [http://plantta.tigr.org; (10)], PlantGDB-assembled Unique Transcripts (http://www.plantgdb.org/prj/ESTCluster/index.php), HarvEST (http://harvest.ucr.edu/), the openSputnik EST project (http://sputnik.btk.fi/ests) and the National Center for Biotechnology Information (NCBI) UniGene project (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene). While each of these projects provides a similar set of assemblies, they differ in the stringency with which transcripts are co-assembled, the frequency with which new builds are made available, the inclusion of “virtual” transcripts (i.e., gene predictions from genome sequencing/annotation projects), and whether transcripts are assembled at the genus, species or sub-species level. Each of these versions of transcript assemblies (TAs) can be advantageous depending on your goals; for the purposes of structural gene annotation, however, only TAs of bona fide transcripts and transcripts derived from the species or a lower taxonomic level should be used. For the purposes of this chapter, we will use the Rice Transcript Assembly from the TIGR Plant Transcript Assemblies Project, in which 1,205,038 O. sativa ESTs, FLcDNAs, and mRNAs have been assembled into 49,870 transcript assemblies (TAs, contigs) and 197,646 singleton ESTs, representing 247,516 unique sequences.
To identify experimental support for the genes on our test BAC, the entire sequence was searched against the rice TAs using the BLAST search program on the TIGR Plant TA Web site (http://tigrblast.tigr.org/euk-blast/plantta_blast.cgi). BLAST is not the optimal alignment tool for this purpose, but because both the rice TA set and our query sequence are large, it provides an initial method to identify all potential cognate transcripts. Other gapped alignment programs can be used; however, BLAST was selected based on its speed and accuracy. On our test BAC, the BLAST results clearly show that there are a number of expressed genes on the BAC, including two of our test loci, LOC_Os03g58260 and LOC_Os03g58270. However, LOC_Os03g58280 lacks EST and FLcDNA support, and this is consistent with its annotation as a hypothetical protein.
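When the search is run locally rather than through the Web form, the hits can be screened programmatically for candidate cognate transcripts. The sketch below assumes the standard NCBI 12-column tabular BLAST layout (query, subject, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score); the output of the TIGR Web server may be laid out differently. The identity and length cut-offs, the function name and the transcript identifiers are hypothetical.

```python
import csv
import io
from typing import Dict, Tuple


def transcript_support(tabular_text: str,
                       min_identity: float = 95.0,
                       min_length: int = 100) -> Dict[str, Tuple[int, int]]:
    """Return {transcript_id: (first, last)} query positions covered by passing hits.

    Assumes 12-column NCBI tabular BLAST output; cut-offs are illustrative only.
    """
    spans: Dict[str, Tuple[int, int]] = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        subject = row[1]
        identity, length = float(row[2]), int(row[3])
        q_start, q_end = sorted((int(row[6]), int(row[7])))
        if identity < min_identity or length < min_length:
            continue
        lo, hi = spans.get(subject, (q_start, q_end))
        spans[subject] = (min(lo, q_start), max(hi, q_end))
    return spans


if __name__ == "__main__":
    # Invented hits: TA_A has two strong HSPs, TA_B is too divergent, TA_C too short.
    example = (
        "AC093713\tTA_A\t99.2\t350\t3\t0\t1000\t1349\t1\t350\t0.0\t640\n"
        "AC093713\tTA_A\t98.5\t200\t3\t0\t1500\t1699\t351\t550\t1e-100\t360\n"
        "AC093713\tTA_B\t82.0\t400\t72\t0\t5000\t5399\t1\t400\t1e-60\t250\n"
        "AC093713\tTA_C\t99.0\t60\t1\t0\t9000\t9059\t1\t60\t1e-20\t110\n"
    )
    print(transcript_support(example))
```

Transcripts that pass such a screen are still only candidates; the exon structure should be confirmed with a splice-aware alignment before a gene model is edited.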


Protein

Alignment of known proteins to the sequence of interest may help to delineate protein coding regions. For genes in which the number of available ESTs or transcripts is small, it is particularly important to leverage information from protein alignments. As sequence conservation is higher at the protein than at the nucleotide level, protein searches can be performed against diverged species. However, although this method may indicate where a gene is located, it is unlikely to help resolve the internal gene structure, as intron–exon boundaries may vary across species. With our test BAC example, the entire sequence was searched against the predicted proteomes of rice, maize and Arabidopsis with the NCBI BLAST alignment tool using the BLASTX option (11), which translates the query sequence in all six possible frames and searches the protein database. In the case of large sequences such as BACs, it is recommended to perform the alignments with different species successively. Each alignment should be considered based on its quality (E-value, similarity, coverage) and the divergence between the query and the database species. Furthermore, stacked alignments from diverging species should carry more weight than a single alignment, as they indicate gene conservation that is likely to extend to the queried species. A BLAST search of our sample BAC returned multiple hits from Arabidopsis for over 15 locations on the BAC. Apart from repetitive elements aligning to two regions located at 90 and 140 kb, only one region (around 15 kb) has protein alignments from all three species, all annotated as cytosine-5 DNA methyltransferase. Among our three test loci, only the retrotransposon LOC_Os03g58270 has matches with the rice, Arabidopsis, or maize protein databases, and LOC_Os03g58260 matches the Arabidopsis locus At4g02610. Note that prior masking of the query sequence facilitates the visualization of the results by eliminating the numerous hits to retrotransposable elements.
Additional protein databases to search include comprehensive protein databases such as UniProt or the non-redundant amino acid database at NCBI, which are described in detail below. However, it can be more informative to search targeted protein databases (as we have done in this case example) before or in addition to searching comprehensive protein databases.
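The idea that stacked alignments from several species deserve more weight than a single alignment can be illustrated by counting, for fixed windows along the query, how many distinct proteomes contribute at least one hit. This is a minimal sketch: the window size, the species threshold, the function names and the hit coordinates are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Hit = Tuple[str, int, int]  # (species, query start, query end), 1-based


def species_support(hits: Iterable[Hit], window: int = 5000) -> Dict[int, Set[str]]:
    """Map each window start to the set of species with a hit touching that window."""
    support: Dict[int, Set[str]] = defaultdict(set)
    for species, start, end in hits:
        lo, hi = sorted((start, end))
        for w in range((lo - 1) // window, (hi - 1) // window + 1):
            support[w * window + 1].add(species)
    return dict(support)


def conserved_windows(hits: Iterable[Hit],
                      min_species: int = 2,
                      window: int = 5000) -> List[int]:
    """Window starts supported by at least `min_species` different proteomes."""
    return sorted(w for w, sp in species_support(hits, window).items()
                  if len(sp) >= min_species)


if __name__ == "__main__":
    # Invented coordinates for illustration only
    blastx_hits = [
        ("rice", 14000, 16000),
        ("maize", 14500, 15500),
        ("Arabidopsis", 15200, 15900),
        ("Arabidopsis", 90000, 91000),
    ]
    print(conserved_windows(blastx_hits, min_species=3))
```

Windows supported by all three proteomes are the strongest candidates for conserved genes, provided that the query was repeat-masked so that retrotransposon hits do not dominate the counts.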

3.3. Functional Annotation

3.3.1. Gene Function

Once the structure of the gene has been established and its protein sequence deduced, a putative function may be assigned to the protein. Protein alignments and searches for conserved domains are two common ways to attribute a name to a gene. Protein alignments against a protein database are performed with BLASTP. E-value, coverage and identity cut-offs are largely dependent on methodical testing, personal experience and the quality and representation of related sequences in the database. For the annotation of our test BAC, we have used in combination an E-value cut-off of 1e-10, an identity threshold of 30% and a minimum coverage of 50% of the length of the query. The number of protein hits and the quality of the annotation of each hit depend mostly on the database. For example, UniProtKB/Swiss-Prot (http://au.expasy.org/sprot/) (12) is a database of manually examined records, many of which are linked to publications, while UniProtKB/TrEMBL is a larger database containing all of the protein sequences translated from EMBL/GenBank/DDBJ nucleotide sequence databases in addition to protein sequences in UniProtKB/Swiss-Prot. Therefore, while UniProtKB/Swiss-Prot provides high-quality hits, the UniProtKB/TrEMBL database provides a higher likelihood of finding a similar protein. In practice, there are currently several large databases combining non-redundant sets of sequences of different origins. The NCBI nr database is a non-redundant set of over 4 million sequences including GenBank coding sequence translations, UniProtKB/Swiss-Prot, PIR, PDB and PRF sequences. The UniProt consortium of UniProtKB/Swiss-Prot, TrEMBL and PIR (http://www.pir.uniprot.org/) has built several non-redundant databases, including UniRef90 and UniRef100, which contain sequences that are less than 90% or 100% identical, respectively, to other sequences in the database (13). Protein sequences of the three exemplar genes on our test BAC were subjected to BLASTP searches against the NCBI nr database. Results for the first locus, LOC_Os03g58260, illustrate a commonly encountered situation in which several hits with very significant E-values are found and require evaluation. Examination of the record of the first hit, NP_001051559, which corresponds to rice LOC_Os03g0797000, indicates that NP_001051559 was annotated in the context of a large genome annotation project and was likely named based on sequence similarity to another protein. By contrast, the second hit, AAG42689, is supported by experimental data (see publication title).
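The three cut-offs quoted for the test BAC (E-value, identity and query coverage) can be combined into a single filter. The sketch below is a simplification in that it computes coverage from a single alignment segment rather than from all segments of a hit; the function and parameter names are hypothetical.

```python
def passes_cutoffs(evalue: float,
                   percent_identity: float,
                   q_start: int,
                   q_end: int,
                   query_length: int,
                   max_evalue: float = 1e-10,
                   min_identity: float = 30.0,
                   min_coverage: float = 0.5) -> bool:
    """True if a BLASTP hit satisfies all three cut-offs used for the test BAC.

    Coverage is the aligned fraction of the query, estimated here from one
    alignment segment only (an assumption made to keep the example short).
    """
    coverage = (abs(q_end - q_start) + 1) / query_length
    return (evalue <= max_evalue
            and percent_identity >= min_identity
            and coverage >= min_coverage)


if __name__ == "__main__":
    # Invented hits against a 400-residue query
    print(passes_cutoffs(1e-50, 45.0, 10, 300, 400))  # strong, long alignment
    print(passes_cutoffs(1e-20, 60.0, 1, 100, 400))   # covers only 25% of the query
```

A hit that passes these cut-offs still needs to be evaluated for the quality of its own annotation before its name is transferred.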
For LOC_Os03g58260, the annotator should therefore give preference to the second hit and avoid transitive annotation, that is, the propagation of a name that was itself assigned only by sequence similarity. LOC_Os03g58270 can be annotated based on the large number of hits annotated as retrotransposon-related proteins. However, it is recommended to annotate transposable element-related genes by searching all the gene models against a repeat database, if available. The only significant hits to LOC_Os03g58280 are to itself, so the name “hypothetical protein” is assigned to this gene. In the case where no hit above a given threshold or no well-characterized hit is identified in the database, you can search for conserved domains in the gene models. The Pfam collection (14) is searchable at http://www.sanger.ac.uk/Software/Pfam/search.shtml and returns HMM-predicted domains, each with an E-value and a score. Alternatively, you can query the InterPro database of protein families, domains and functional sites, including Pfam, Prosite and ProDom, with InterProScan (15, 16) (http://www.ebi.ac.uk/InterProScan/). In the case of LOC_Os03g58280, no high-confidence domain is found by either the InterProScan or Pfam search.

If a query protein is identical to a previously characterized protein, the original protein name is preserved. Proteins with similarity to known database matches are named after the database entries as “XXX, putative”. Proteins with matches to hypothetical proteins are called “conserved hypothetical proteins”. Proteins with no matches are called “hypothetical proteins”. To encapsulate the higher level of confidence in genes with evidence of expression, “expressed” can be appended to the names of gene models with cognate ESTs, FLcDNA or protein support. Following these guidelines, the three example loci are annotated as “tryptophan synthase alpha, putative, expressed”, “retrotransposon protein, putative” and “hypothetical protein”.

3.3.2. Gene Ontologies

Gene Ontology (GO, http://www.geneontology.org) is a dynamic database of controlled vocabularies describing three features of gene products: biological process, cellular component (location) and molecular function. The GO project was designed to provide uniform cross-species queries for the biological information distributed in databases that represent diverse taxa. GO terms provide consistent annotation in a computer-readable and usable form, thereby making them amenable to high-throughput data analyses including interpretation of “omics” data and validation of automated annotation tools (17). The ontologies are organized as a directed acyclic graph in which a child term can have more than one parent term and must be true to every parent’s attributes. Ideally, GO terms should be assigned manually on the basis of experimental evidence. However, GO associations can also be assigned using sequence and structural similarity, phylogeny and paralogous family information in the event that no experimental data are available (18). An evidence code, which indicates the reliability of the GO assignment, must be recorded to summarize how the assignment was made. A gene product can have multiple GO terms, as it could possess multiple molecular functions, be involved in more than one biological process, and be located in multiple cellular locations. The manual GO assignment process is time-consuming, and consequently different methods to computationally assign GO terms on a large scale have been developed. These methods are similar in principle: they map the gene products to proteins with existing GO terms, and the GO annotation is transferred to the query protein. GO terms can be transitively annotated from Swiss-Prot entries (spkw2go), Enzyme Commission numbers (ec2go) or InterPro domain matches (interpro2go) (19, 20). One can also compile a set of proteins whose ontologies have been relatively reliably assigned in other databases, map the gene products to the proteins in the compiled data set by a sequence similarity search, such as BLAST, and eventually transfer the GO terms to the genes. Online GO annotation tools, such as GOanna (http://agbase.msstate.edu/GOAnna.html) and GoFigure (http://udgenome.ags.udel.edu/gofigure), are available for occasional large-scale GO annotation; with our three test loci, these Web sites assigned GO terms to genes using sequence similarity and performed reasonably well. Assignment of GO Slim terms, instead of GO terms, to the gene products is another option. GO Slims are selected subsets of ontologies which are at higher nodes of the GO “tree” and are more generalized. There are several pre-made GO Slim sets, including a generic GO Slim and a Plant GO Slim set, available from the GO consortium (http://www.geneontology.org). An advantage of using GO Slim terms rather than GO terms is that the association can be more accurate than an assignment of granular GO terms. This is particularly true when electronically assigning GO terms on a large scale, where manual review of the evidence is not feasible. To assign GO Slim terms, granular GO terms can be assigned first, and then the associations can be converted to GO Slim terms using tools such as map2slim (http://www.godatabase.org/dev/pod/scripts/map2slim.html). Alternatively, query gene products can be mapped to gene products with GO Slim terms and the GO Slim terms transferred subsequently.

3.3.3. Comparative Alignments

All of the above annotations (structural and functional) assume that complete and robust data sets are available for your species. However, this is rarely the case, and it can be highly informative to examine the sequence and function of homologues of your gene of interest, as the structure and function of orthologous genes are often conserved during evolution. The bulk of plant genome sequence data is in the form of ESTs, although large-scale genomic sequence data sets are available for a growing number of plant species. For the three test loci, pre-computed alignments with a number of plant sequence data sets (TIGR Plant Transcript Assemblies, predicted Arabidopsis proteome, gene-enriched maize and sorghum genome assemblies) are available on the MSU Osa1 Rice Genome Browser (http://rice.plantbiology.msu.edu/cgi-bin/gbrowse/rice/). For LOC_Os03g58260, support for the gene model is readily apparent in other Poaceae TAs, while the other two loci lack homology with a Poaceae TA, similar to their lack of rice EST and FLcDNA support. Potential homologues are present for LOC_Os03g58260 and LOC_Os03g58270 in Arabidopsis, sorghum and maize but not for LOC_Os03g58280, consistent with its annotation as a hypothetical protein.


3.3.4. Other Functional Annotation

There are many other types of functional annotation which may not be highly specific but provide additional information to the biologist. These other functional annotation data types can be derived from a number of different data sources, which when seen in their totality can be informative as to gene function.

Expression Data

The temporal and spatial expression patterns of a gene can be highly informative as to function. For example, if a gene is upregulated in roots following salt stress, it may be inferred that the gene has a function in salt tolerance, providing a hypothesis that is testable through knockout or knockdown assays. There are multiple expression data types. Simple tests of expression in a temporal or spatial manner can be obtained using reverse transcription polymerase chain reaction (RT-PCR) or quantitative RT-PCR. For a few genes (<100), this is perhaps the most direct and definitive method to determine expression patterns. Microarrays provide a platform to obtain expression data for large sets of genes, and a number of expression platforms are currently available for model and crop species (21). Analysis tools are readily available for both two-colour and single-channel arrays and enable identification of genes that are differentially expressed as well as genes that are co-regulated, thereby allowing for identification of networks of genes. New sequencing technologies and reduced sequencing costs have enabled the generation of large sequence-tagged sets of transcripts and permitted “electronic northerns” to assess expression patterns. These approaches include large EST sets from different cDNA libraries (22, 23), Massively Parallel Signature Sequencing [MPSS; (24, 25)] and 454 sequencing (26). Regardless of the origin, expression data provide evidence that a gene is expressed, which adds value to the functional annotation and can provide data on a gene’s potential function based on correlative evidence. For our three test genes, we can see in the MSU Osa1 Expression Evidence Search Page (http://rice.plantbiology.msu.edu/locus_expression_evidence.shtml) that there are MPSS and Serial Analysis of Gene Expression data for LOC_Os03g58260 but not for LOC_Os03g58270 or LOC_Os03g58280. However, probes on a number of different array platforms have been designed for all three loci.
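An “electronic northern” compares how often tags or ESTs for a gene are observed in different libraries. Because libraries differ in size, raw counts are usually normalized before comparison. The sketch below scales counts to tags per million; the library contents and gene names are invented, and this normalization is offered as a common convention rather than as the method used by any of the resources cited above.

```python
from typing import Dict


def tags_per_million(tag_counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw tag (or EST) counts in one library to tags per million."""
    total = sum(tag_counts.values())
    if total == 0:
        raise ValueError("library contains no tags")
    return {gene: count * 1_000_000 / total for gene, count in tag_counts.items()}


if __name__ == "__main__":
    # Invented counts for two hypothetical libraries of different sizes
    root_library = {"gene_A": 30, "gene_B": 970}
    leaf_library = {"gene_A": 5, "gene_B": 4995}
    root_tpm = tags_per_million(root_library)
    leaf_tpm = tags_per_million(leaf_library)
    for gene in sorted(root_library):
        print(gene, root_tpm[gene], leaf_tpm.get(gene, 0.0))
```

Differences between normalized counts are correlative evidence only, and small counts should be interpreted with caution.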

Flanking Sequence Tags

While similarity to known proteins and expression patterns are insightful, these are based on either inference from sequence similarity or correlative data and are not definitive experimental evidence of gene function. Two approaches can be used to empirically determine gene function: gene knockout/knockdown or over-expression. Gene knockout or knockdown can be obtained through silencing or targeted disruption of the gene using insertion or mutation methods. For a number of plant species such as rice, Arabidopsis and maize, there are large-scale insertion tagging projects in progress (27–31). These projects disrupt genes through random insertion of a mobile genetic element and then sequence the insertion site, resulting in a “flanking sequence tag” (FST) for the insertion site. By searching a database of FSTs with the query genomic sequence, one can determine whether a mutant is already available for a gene of interest. For our test loci, a search of available FSTs at the MSU Osa1 Rice Genome Annotation Resource (http://rice.plantbiology.msu.edu/BACmapping/FST_map.shtml) revealed a putative FST for LOC_Os03g58260 and LOC_Os03g58270.

3.4. Visualization Tools

There are several tools available that allow a biologist to visualize the alignment and gene prediction evidence that is used for annotation. Apollo is a Java-based genome editor that is part of the Generic Model Organism Database (GMOD) open source software project (http://www.gmod.org/?q=node/4). Apollo is a very powerful tool that is used by several large genome sequencing projects. Unfortunately, Apollo works best when genome and evidence data are maintained in a database or formatted in special Extensible Markup Language (XML) files. This makes Apollo inaccessible to a casual user. A popular and rather easy-to-use tool for viewing genome feature evidence and annotation is the Generic Genome Browser [(32); http://www.gmod.org/?q=node/71], another product of the GMOD project. The Generic Genome Browser can access genome data from a database (as used in the MSU Osa1 Rice Genome Project and shown in Fig. 1), but it can also work from flat files. This makes it reasonable for viewing small genomes or genome segments without the hassle of maintaining a database. Unfortunately, a Web server is required, and not all biologists have access to such a resource. Artemis is a Java-based genome viewer and editor that is available for download (33). This program can run on any computer that has Java installed. It reads genome and evidence data from a variety of flat-file formats including European Molecular Biology Laboratory (EMBL), GenBank, General Feature Format (GFF), FASTA and tab-formatted BLAST result files. The use of Artemis for visualizing alignment data with respect to the sequence of AC093713 is described below.

1. A repeat-masked version of AC093713 should be prepared using RepeatMasker as described in Subheading 3.2.2. If the repeats in AC093713 are not masked, alignments to this sequence will contain many uninformative matches to repeats, which will make it difficult to identify matches to interesting genes.
2. The repeat-masked sequence of AC093713 should be aligned to the O. sativa protein sequences in the NCBI nr database using BLASTX. In a Web browser, go to http://www.ncbi.nlm.nih.gov/blast/index.shtml. Select the link to the BLASTX page.
3. Paste the repeat-masked sequence of AC093713 into the “Search” box.
4. In the Options section, limit the search by choosing O. sativa from the pull-down menu.
5. In the Format section, change the number of alignments to 250. Also, change the “Alignment view” to “Hit Table”.
6. Click the BLAST button, and a new page will appear. After a few moments, click the Format button.
7. When the results page appears, it will have one HSP listed per row. This page should be saved as a plain text file.
8. Open the results file in a spreadsheet program. Each row of this file contains tab-delimited results, and each element of the results should appear in a separate cell of the spreadsheet. Delete the preliminary lines in the file up to the first line of results; Artemis cannot read the file properly if these comment lines are present. There are also two columns in this BLASTX results file that are not recognized by Artemis: columns 4 and 5 must be deleted. These columns are not included in tab-formatted BLASTN results. Save the file in text-only format.
9. Repeat steps 2–8, but align the repeat-masked sequence of AC093713 against the peptide sequences of Arabidopsis thaliana.
10. Use the unmasked sequence of AC093713 as input into FGENESH. In a Web browser, go to http://linux1.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind. Paste the sequence of AC093713 into the box. Choose “Monocot plants (Corn, Rice, Wheat, Barley)” as the Organism, and click the Search button.
11. The results from FGENESH are not directly usable by Artemis, but they can be converted into GFF3 format, which Artemis can read. It is straightforward to write a Perl script to do this conversion, and creating a GFF3 file by hand for a small number of predicted gene models is also simple. A description of the GFF3 format can be found at http://www.sequenceontology.org/gff3.shtml. A GFF3 file with the results from the FGENESH analysis of AC093713 has been prepared and can be downloaded from ftp://ftp.plantbiology.msu.edu/pub/data/plant_genome_annotation_methods/AC093713.fgenesh.gff. Use a Web browser to download this file.
12. Artemis should be obtained from the Sanger Institute Web site: http://www.sanger.ac.uk/Software/Artemis/v8/. Follow the installation instructions on the Web site.
13. Open Artemis on your computer. Load the FASTA version of the sequence for AC093713 (File → Open).
14. Alignment and evidence data are referred to as “Entries” in Artemis. Load the FGENESH data. Choose “File → Read An Entry…” from the menu bar. Select the FGENESH GFF3 file. Load the O. sativa and Arabidopsis thaliana BLASTX results in the same way.
15. In the menu bar, go to the “Display” option. Make sure that each of the display options is checked. The view in the Artemis window should be similar to Fig. 2.

Fig. 2. View of annotation in the Artemis viewer. The view in Artemis of the region of AC093713 that contains LOC_Os03g58260. BLASTX alignment results with Oryza sativa and Arabidopsis thaliana protein sequences and FGENESH predictions are shown. Note that the sequence of AC093713 is reversed relative to its orientation in the assembled pseudomolecule 3 in the MSU Osa1 Rice Genome Browser (Fig. 1).


16. It is now possible to create a new entry track that will contain gene annotations based on the best interpretation of the evidence from gene, protein, repeat sequence and gene prediction alignments. Note that the alignments to the rice proteins are most likely self-hits or hits to annotated rice paralogues and should be interpreted with caution. The custom annotation can then be saved in GenBank format that can be used for submissions to GenBank or for later viewing in Artemis.
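The manual clean-up of the BLASTX hit table (step 8) and the hand-writing of a GFF3 file (step 11) can also be scripted. The Python sketch below is one possible approach rather than the authors' procedure: it assumes that the preliminary lines of the hit table start with `#`, it takes gene model coordinates that have already been read from the FGENESH output (parsing of the FGENESH report itself is not shown), and all function names and example coordinates are hypothetical.

```python
from typing import Iterable, List, Tuple


def clean_hit_table(lines: Iterable[str], drop_columns=(4, 5)) -> List[str]:
    """Drop blank/comment lines and the given 1-based columns (step 8)."""
    drop = {c - 1 for c in drop_columns}
    cleaned = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # assumed: preliminary lines begin with '#'
        fields = line.split("\t")
        cleaned.append("\t".join(f for i, f in enumerate(fields) if i not in drop))
    return cleaned


def gene_model_to_gff3(seqid: str, gene_id: str, strand: str,
                       cds_exons: List[Tuple[int, int]],
                       source: str = "FGENESH") -> List[str]:
    """Build GFF3 gene, mRNA and CDS lines for one predicted gene (step 11).

    cds_exons holds 1-based inclusive (start, end) coordinates of coding exons.
    """
    exons = sorted(cds_exons)
    start, end = exons[0][0], exons[-1][1]
    mrna_id = f"{gene_id}.mRNA"
    lines = [
        "\t".join([seqid, source, "gene", str(start), str(end),
                   ".", strand, ".", f"ID={gene_id}"]),
        "\t".join([seqid, source, "mRNA", str(start), str(end),
                   ".", strand, ".", f"ID={mrna_id};Parent={gene_id}"]),
    ]
    # Phase is computed in transcription order: first coding exon has phase 0.
    ordered = exons if strand == "+" else exons[::-1]
    phase = 0
    cds = []
    for s, e in ordered:
        cds.append((s, "\t".join([seqid, source, "CDS", str(s), str(e), ".",
                                  strand, str(phase),
                                  f"ID={gene_id}.cds;Parent={mrna_id}"])))
        phase = (3 - ((e - s + 1 - phase) % 3)) % 3
    lines.extend(text for _, text in sorted(cds))  # write in coordinate order
    return lines


if __name__ == "__main__":
    print("##gff-version 3")
    # Invented coordinates, not taken from the real FGENESH result
    for row in gene_model_to_gff3("AC093713", "fgenesh_gene_1", "+",
                                  [(101, 110), (201, 211)]):
        print(row)
```

The cleaned hit table and the GFF3 lines can then be saved as plain text files and loaded into Artemis as entries in the same way as described in step 14.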

Notes

1. While large-scale EST and FLcDNA data sets facilitate genome annotation, obtaining the cognate transcript sequence for a given gene is perhaps the most critical step in accurate genome annotation. Thus, if rapid amplification of cDNA ends can be performed for your gene of interest, it should be done.
2. As compilation of all sequences within the Plant Kingdom is difficult due to the continual release of new data, it is wise to search a number of repositories for possible homologues and then refine the search later using targeted alignment programs. Databases to search include the NCBI dbEST, dbGSS, dbHTG and PLN divisions.
3. The results of BLASTP searches performed for functional assignment should be evaluated carefully, as many records in large protein databases are mere translations of predicted open reading frames, sometimes referred to as “Conceptual translation” in the comment field of a GenBank record. Annotators should keep in mind that little or no manual curation is performed in large genome annotation projects.

Acknowledgements

We acknowledge the efforts of the TIGR Bioinformatics department, which has made a number of robust annotation tools readily available to our group. Genome annotation is funded by grants to C.R.B. from the National Science Foundation (DBI-0218166 and DBI-0321538).

Note added in proof. Subsequent to the submission of this chapter, the rice genome project at TIGR moved to Michigan State University (new URL http://rice.plantbiology.msu.edu/).


References

1. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., and Wheeler, D.L. (2005) GenBank. Nucleic Acids Res. 33, D34–D38.
2. Fickett, J.W. (1996) The gene identification problem: an overview for developers. Comput. Chem. 20, 103–118.
3. Yao, H., Guo, L., Fu, Y., Borsuk, L.A., Wen, T.J., Skibbe, D.S., Cui, X., Scheffler, B.E., Cao, J., Emrich, S.J., et al. (2005) Evaluation of five ab initio gene prediction programs for the discovery of maize genes. Plant Mol. Biol. 57, 445–460.
4. Salamov, A.A. and Solovyev, V.V. (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516–522.
5. Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y.O., and Borodovsky, M. (2005) Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res. 33, 6494–6506.
6. Mathe, C., Sagot, M.F., Schiex, T., and Rouze, P. (2002) Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res. 30, 4103–4117.
7. Jurka, J., Kapitonov, V.V., Pavlicek, A., Klonowski, P., Kohany, O., and Walichiewicz, J. (2005) Repbase update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 110, 462–467.
8. Ouyang, S. and Buell, C.R. (2004) The TIGR plant repeat databases: a collective resource for the identification of repetitive sequences in plants. Nucleic Acids Res. 32, D360–D363.
9. Bedell, J.A., Korf, I., and Gish, W. (2000) MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics 16, 1040–1041.
10. Childs, K., Hamilton, J., Zhu, W., Ly, E., Cheung, F., Wu, H., Rabinowicz, P.D., Town, C.D., Buell, C.R., and Chan, A.P. (2007) The TIGR plant transcript assemblies database. Nucleic Acids Res. 35(Database issue), D846–D851.
11. Gish, W. and States, D.J. (1993) Identification of protein coding regions by database similarity search. Nat. Genet. 3, 266–272.
12. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O’Donovan, C., Phan, I., et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31, 365–370.
13. Bairoch, A., Apweiler, R., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., et al. (2005) The universal protein resource (UniProt). Nucleic Acids Res. 33, D154–D159.
14. Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., et al. (2004) The Pfam protein families database. Nucleic Acids Res. 32, D138–D141.
15. Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bradley, P., Bork, P., Bucher, P., Cerutti, L., et al. (2005) InterPro, progress and status in 2005. Nucleic Acids Res. 33, D201–D205.
16. Quevillon, E., Silventoinen, V., Pillai, S., Harte, N., Mulder, N., Apweiler, R., and Lopez, R. (2005) InterProScan: protein domains identifier. Nucleic Acids Res. 33, W116–W120.
17. Lee, V., Camon, E., Dimmer, E., Barrell, D., and Apweiler, R. (2005) Who tangos with GOA?-use of Gene Ontology Annotation (GOA) for biological interpretation of ‘-omics’ data and for validation of automatic annotation tools. In Silico Biol. 5, 5–8.
18. Haas, B.J., Wortman, J.R., Ronning, C.M., Hannick, L.I., Smith, R.K., Jr., Maiti, R., Chan, A.P., Yu, C., Farzad, M., Wu, D., et al. (2005) Complete reannotation of the Arabidopsis genome: methods, tools, protocols and the final release. BMC Biol. 3, 7.
19. Berardini, T.Z., Mundodi, S., Reiser, L., Huala, E., Garcia-Hernandez, M., Zhang, P., Mueller, L.A., Yoon, J., Doyle, A., Lander, G., et al. (2004) Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiol. 135, 745–755.
20. Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte, N., Lopez, R., and Apweiler, R. (2004) The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 32, D262–D266.
21. Rensink, W.A. and Buell, C.R. (2005) Microarray expression profiling resources for plant genomics. Trends Plant Sci. 10, 603–609.
Ronning, C.M., Stegalkina, S.S., Ascenzi, R.A., Bougri, O., Hart, A.L., Utterbach, T.R., Vanaken, S.E., Riedmuller, S.B., White, J.A., Cho, J., et al. (2003)Comparative analyses of potato expressed sequence tag libraries.Plant Physiol. 131, 419–429. Journet, E.P., van Tuinen, D., Gouzy, J., Crespeau, H., Carreau, V., Farmer, M.J., Niebel, A., Schiex, T., Jaillon, O., Chatagnier, O., et al. (2002)Exploring root symbiotic programs

282

24.

25.

26.

27.

28.

Ouyang et al. in the model legume Medicago truncatula using EST analysis.Nucleic Acids Res. 30, 5579–5592. Nakano, M., Nobuta, K., Vemaraju, K., Tej, S.S., Skogen, J.W., and Meyers, B.C. (2006) Plant MPSS databases: signature-based transcriptional resources for analyses of mRNA and small RNA. Nucleic Acids Res. 34, D731–D735. Meyers, B.C., Vu, T.H., Tej, S.S., Ghazal, H., Matvienko, M., Agrawal, V., Ning, J., and Haudenschild, C.D. (2004) Analysis of the transcriptional complexity of Arabidopsis thaliana by massively parallel signature sequencing. Nat. Biotechnol. 22, 1006–1011. Cheung, F., Haas, B.J., Goldberg, S.M., May, G.D., Xiao, Y., and Town, C.D. (2006) Sequencing Medicago truncatula expressed sequenced tags using 454 Life Sciences technology. BMC Genomics 7, 272. Alonso, J.M., Stepanova, A.N., Leisse, T.J., Kim, C.J., Chen, H., Shinn, P., Stevenson, D.K., Zimmerman, J., Barajas, P., Cheuk, R., et al. (2003) Genome-wide insertional mutagenesis of Arabidopsis thaliana. Science 301, 653–657. Jeong, D.H., An, S., Park, S., Kang, H.G., Park, G.G., Kim, S.R., Sim, J., Kim, Y.O., Kim, M.K., Kim, S.R., et al. (2006) Generation of a flanking sequence-tag database for

29.

30.

31.

32.

33.

activation-tagging lines in japonica rice. Plant J. 45, 123–132. Greco, R., Ouwerkerk, P.B., Taal, A.J., Favalli, C., Beguiristain, T., Puigdomenech, P., Colombo, L., Hoge, J.H., and Pereira, A. (2001) Early and multiple Ac transpositions in rice suitable for efficient insertional mutagenesis. Plant Mol. Biol. 46, 215–227. Kumar, C.S., Wing, R.A., and Sundaresan, V. (2005) Efficient insertional mutagenesis in rice using the maize En/Spm elements. Plant J. 44, 879–892. Kim, C.M., Piao, H.L., Park, S.J., Chon, N.S., Je, B.I., Sun, B., Park, S.H., Park, J.Y., Lee, E.J., Kim, M.J., et al. (2004) Rapid, largescale generation of Ds transposant lines and analysis of the Ds insertion sites in rice. Plant J. 39, 252–263. Stein, L.D., Mungall, C., Shu, S., Caudy, M., Mangone, M., Day, A., Nickerson, E., Stajich, J.E., Harris, T.W., Arva, A., et al. (2002) The generic genome browser: a building block for a model organism system database. Genome Res. 12, 1599–1610. Rutherford, K., Parkhill, J., Crook, J., Horsnell, T., Rice, P., Rajandream, M.A., and Barrell, B. (2000) Artemis: sequence visualization and annotation. Bioinformatics 16, 944–945.

Chapter 15

Molecular Plant Breeding: Methodology and Achievements

Rajeev K. Varshney, Dave A. Hoisington, Spurthi N. Nayak, and Andreas Graner

Summary

The progress made in DNA marker technology has been remarkable and exciting in recent years. DNA markers have proved valuable tools in various analyses in plant breeding, for example, early generation selection, enrichment of complex F1s, choice of donor parent in backcrossing, recovery of recurrent parent genotype in backcrossing, linkage block analysis and selection. Other main areas of applications of molecular markers in plant breeding include germplasm characterization/fingerprinting, determining seed purity, systematic sampling of germplasm, and phylogenetic analysis. Molecular markers, thus, have proved powerful tools in replacing the bioassays and there are now many examples available to show the efficacy of such markers. We have illustrated some basic concepts and methodology of applying molecular markers for enhancing the selection efficiency in plant breeding. Some successful examples of product developments of molecular breeding have also been presented.

Key words: Molecular markers, Marker-assisted selection, Molecular breeding, Polymorphism, Linkage mapping, Association mapping, Marker–trait association.

1. Introduction

The identification of variation and its effective incorporation into germplasm are important components of any crop improvement programme. Such variation can be obtained either by crossing two different parental genotypes or by selecting existing variation from the vast range of germplasm available in the plant kingdom. Ancient farmers were the first ‘plant breeders’, selecting the best plants for their needs. Archaeological evidence indicates that farmers employed selection pressure to meet their demands

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513 © Humana Press, a part of Springer Science + Business Media, LLC 2009 DOI: 10.1007/978-1-59745-427-8_15



as early as 12,000 years ago. As knowledge has grown, plant breeding has evolved into a major discipline in plant biology. There were many landmarks in plant breeding after the rediscovery of Mendelian genetics. Crossing two morphologically different parental genotypes allowed plant breeders to study recombination and crossing-over events. Morphological markers, for example flower colour, flower shape, seed size, seed colour, and plant height, played a major role in following the genetics of traits. However, morphological markers are not always inherited as simple Mendelian genes, which has reduced their usefulness in plant breeding programmes. There is enormous diversity (polymorphism) at the DNA level in higher plants, such that no two organisms are likely to be identical in their DNA sequence, including among natural populations of plants (1). Molecular techniques have provided strategies to develop marker systems that detect such DNA variation, which can be used to assist traditional plant breeding (2, 3). Once linkage between a marker locus and the gene for an agronomic trait of interest has been established, DNA-based tests can be used to enable more precise selection in plant breeding (4, 5). This powerful revolution has already demonstrated its impact on the understanding of, and ability to manipulate, oligogenic and quantitative traits. The development and availability of abundant, naturally occurring molecular genetic markers during the last two decades has generated renewed interest in locating and measuring the effects of genes (polygenes or QTLs, quantitative trait loci) controlling quantitative traits (6).
Molecular markers are now well established as powerful tools in plant breeding and genetics for the indirect selection of difficult traits at the seedling stage, thus speeding up conventional plant breeding and facilitating the improvement of traits that cannot be improved easily by conventional methods. To this end, a large number of genes and QTLs controlling agronomic traits and conferring tolerance to both abiotic and biotic stresses have been identified and tagged using molecular markers in several crop species, especially cereals (7–9). In fact, the products of marker-assisted selection (MAS) have already been released as varieties in some cereal species (10). Some notable examples of the successful deployment of MAS are listed in Table 1. In addition, several programmes and initiatives, such as the molecular breeding programmes in wheat and barley in Australia (12) and ‘MASWheat’ (http://maswheat.ucdavis.edu/index.htm), are under way to conduct MAS in breeding. Molecular breeding strategies are also in use in several crops by private companies, for example, Monsanto, Pioneer Hi-Bred, and Syngenta. In the present genomics or post-genomics era, when sequence data have already become available through genome


Table 1
Some successful examples of molecular breeding in cereals

Achievement: Acceleration in varietal development

1. Release of the US barley variety Tango, which contains two QTL for adult resistance to stripe rust. Reference: (11)
2. Advancement of a ‘Sloop type’ variety with cereal cyst nematode (CCN) resistance for commercial release. Reference: Cooperative Research Centre for Molecular Plant Breeding (CRCMPB) (12)
3. Release of the ‘Flagship’ variety in Australia in 2004 after following a whole genome breeding approach. Reference: (12)
4. Release of two Indonesian rice cultivars, ‘Angke’ and ‘Conde’, in which marker-assisted selection (MAS) was used to introduce xa5 into a background containing xa4. Reference: (13)
5. Development of quality protein maize (QPM) through marker-aided transfer of the opaque2 gene in backcross programmes. Reference: (14)
6. Release of an Indian pearl millet hybrid cultivar, ‘HHB 67-improved’, in 2005, which has resistance to downy mildew. Reference: C.T. Hash, ICRISAT (personal communication)
7. Development of an improved version of the Pusa Basmati 1 (PB1) variety of rice by introgressing the genomic segments harbouring the bacterial blight resistance genes xa13 and Xa21, which were transferred to PB1 from a non-Basmati donor through MAS. Reference: T. Mohapatra, NRCPB, IARI, India (personal communication)

Achievement: Introgression of trait (gene pyramiding)

1. Introgression of the Yd2 gene conferring resistance to barley yellow dwarf virus (BYDV) into a BYDV-susceptible barley variety through two cycles of marker-assisted backcrossing. Reference: (15)
2. Pyramiding of different resistance genes for barley yellow mosaic virus (rym4, rym5, rym9, and rym11) in barley. Reference: (16)
3. Use of yield-related QTLs for MAS in maize in the private sector. Reference: (17)
4. Pyramiding of disease-resistance genes in rice, particularly against blight, blast, and both simultaneously. Reference: (18–21)
5. Pyramiding of insect and blight resistance in rice. Reference: (19)
6. Pyramiding of blight resistance with Basmati quality characters in rice. Reference: (22)
7. Pyramiding of stay-green QTLs in elite but drought-sensitive sorghum lines. Reference: C.T. Hash, ICRISAT (personal communication)

or EST sequencing projects for some plant species, and similar efforts are under way for many others, it has become possible to develop molecular markers, including novel markers such as single-nucleotide polymorphisms (SNPs) and single-feature polymorphisms (SFPs), directly from genes (23–25). The development of such functional markers may speed up in the coming years, as these markers are promising for marker-assisted breeding and are a useful resource for assessing functional diversity in germplasm collections (26, 27).

2. Materials

2.1. Molecular Markers

Molecular markers are specific locations on a chromosome which serve as landmarks for genome analysis (see Note 1). While selecting the molecular markers for marker–trait association studies, the following points need to be considered.

1. What are the relative costs of the marker assays versus other selection techniques such as phenotypic selection or various bioassay systems?
2. Is the trait dominant or recessive?
3. What type and size of mapping population and which methodology will be used for marker–trait association studies?

2.2. Mapping Populations

The use of adequate genetic material for marker–trait association studies is another critical factor (see Note 2). While doubled haploid (DH) and recombinant inbred line (RIL) populations are the most appropriate genetic material for linkage map-based analysis, the F2 and backcross (BC) populations can be used


for bulked segregant analysis (BSA) (28). Before a particular mapping population is used for trait mapping, knowledge of the genetics of the trait, as listed below, is also important.

1. What is the nature of the trait? Is it simply inherited or multi-genic?
2. What is the heritability of the trait?
3. How much do the parental lines used to develop the population differ for the target trait?

An alternative methodology to the above-mentioned trait mapping strategies is association or linkage disequilibrium (LD) mapping, based on the association between phenotype and allele frequencies (see Note 3). To employ association mapping effectively, one needs to decide on the best population structure.

1. The structure of the population will be related to the trait and purpose.
2. Population structure will differ for (i) self- versus outcrossing species, (ii) long- versus short-generation species, and (iii) perennial versus annual crop species.

After selecting a suitable mapping population, the population size is another criterion to be considered for trait mapping.

1. For mapping single genes, 50 F2s/BCs should be adequate (29).
2. For QTL analyses, a minimum of 200 individuals/lines of a population (RILs or DHs) is required (30, 31).
3. For employing LD mapping, a population of at least 300 genotypes is required (32).

2.3. Statistical Analysis and Tools

For conducting marker–trait association by using BSA and linkage maps, three methods are widely used: single marker analysis (SMA), simple interval mapping (SIM), and composite interval mapping (CIM) (33, 34) (see Note 4). In the case of LD or association mapping, estimation of the LD within a species, or even within individual genomes, and an understanding of the structure of the population are important prerequisites. Subsequently, when conducting the marker–trait association studies, the structure of the population is taken into account to avoid false positives (see Note 5).

1. For performing SMA, the QGene (35) or MapManager QTX (36) computer programmes can be used.
2. MapMaker/QTL (37, 38) and QGene can be used for SIM.
3. QTLCartographer (39), MapManager QTX (36), and PLABQTL (40) are the most appropriate computer programmes for conducting CIM.
4. In the case of association mapping studies, the most commonly used computer programmes for measuring LD and population structure and for evaluating trait associations are STRUCTURE (41, 42) and TASSEL (43).
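The LD estimate that association mapping depends on is often summarized as r² between pairs of biallelic loci. The following is a minimal Python sketch of that calculation, assuming phased two-locus haplotypes coded as two-character strings (for example "AB"). The function name and input format are illustrative assumptions and do not reproduce the interface of STRUCTURE or TASSEL.

```python
from collections import Counter


def ld_r2(haplotypes):
    """Return r^2 between two biallelic loci.

    `haplotypes` is a list of 2-character strings, one per gamete,
    where character 0 is the allele at locus 1 and character 1 the
    allele at locus 2 (hypothetical coding, e.g. "AB", "ab").
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    # Take the alleles of the first haplotype as the reference alleles.
    a, b = haplotypes[0][0], haplotypes[0][1]
    p_a = sum(c for h, c in counts.items() if h[0] == a) / n
    p_b = sum(c for h, c in counts.items() if h[1] == b) / n
    p_ab = counts[a + b] / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium, D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


# Two extreme illustrative cases: complete LD and linkage equilibrium.
complete_ld = ["AB"] * 5 + ["ab"] * 5
equilibrium = ["AB", "Ab", "aB", "ab"] * 3
print(ld_r2(complete_ld), ld_r2(equilibrium))
```

Plotting such pairwise r² values against the distance between loci gives the LD decay curve used to judge the marker density needed for whole genome scanning.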


3. Methods

3.1. Marker–Trait Association

3.1.1. Selection of Molecular Markers

1. Select and optimize good-quality, informative molecular markers that are, if possible, amenable to high-throughput analysis.
2. Choose the appropriate genotyping platform depending on the size of the population to be studied and the number of available molecular markers, so that the experimental cost per marker per individual is minimized.

3.1.2. Polymorphism Survey for BSA and Linkage Map-Based Trait Mapping

1. Screen the parental genotypes with the molecular markers including ‘anchor’ markers (see Note 6).
2. Identify the markers that detect polymorphism between parental genotypes (Fig. 1).

3.1.3. Measuring LD Decay and Population Structure for Association Mapping

1. Select a Discovery Panel comprising diverse genotypes from the population to be used for association mapping, and isolate the DNA from single plants to avoid heterogeneity.
2. Collect nucleotide sequence data from loci sampled with genome-wide coverage across the Discovery Panel.
3. Measure the range of diversity to be sampled for the association population (e.g., decay of LD, expressed as r², with physical distance), the marker density required for sufficient coverage of the target genomic regions (or the genome) for association, and the level of population structure that exists within the species. Evaluate the genome-wide influence of demography, determine the genomic regions targeted by natural selection and domestication, and determine the number and density of the neutral markers required to evaluate background associations.

3.1.4. Genotyping

1. While conducting BSA, genotype the bulks (the two extremes of phenotype, 10–20 individuals from each extreme) with the polymorphic markers, identify the markers putatively associated with the trait, and subsequently genotype the complete population with the candidate markers (Fig. 2).
2. For the linkage map-based trait mapping strategy, genotype the mapping population (F2, BC, RIL, DH) with the polymorphic markers. The number of markers to be screened on the population depends on the genome size of the species. However, an average of 100–200 markers spaced less than 15 cM apart is recommended for linkage map-based QTL analysis. If ‘anchor’ or ‘core’ markers are available for the species, use some of these markers, representing the arms of each linkage group, to provide links with other linkage maps and trait information (Fig. 3).
3. For association mapping, sequence the target region(s) that are trait dependent across the association mapping population


Fig. 1. Detection of DNA polymorphism. This figure shows detection of DNA polymorphisms between two homozygous parental genotypes P1 and P2 by using PCR-based microsatellite or SSR markers. Four hypothetical SSR markers (e.g., A, B, C and D in the figure) have been used for amplification of corresponding loci in two genotypes of interest. Separation of PCR products on agarose gel reveals polymorphism (size difference) between P1 and P2 for three SSR markers (A, C and D) while the marker B is monomorphic between two genotypes.

(candidate gene sequencing approach) or genotype the population with a suitable number of molecular markers covering the entire genome, depending on the LD decay (whole genome scanning approach). Genotype the population with


[Fig. 2 image. Panel A: frequency distribution of disease score, with Bulk 1 and Bulk 2 taken from the two extremes (resistant and susceptible). Panel B: frequency distribution of root biomass, with Bulk 1 and Bulk 2 taken from the two extremes (lower and higher root biomass). Lower panels: banding patterns of Markers A, C, and D for P1, P2, F1, and lines 1–16 of an F2 population, and for P1, P2, and lines 1–16 of a RIL/DH population.]

Fig. 2. Bulked segregant analysis (BSA) for simple/monogenic and quantitative traits. BSA can be used for both oligogenic traits, for example disease resistance (shown in A), and quantitative traits, for example root biomass (shown in B). In both cases, two bulks are made from individuals with extreme phenotypes. The pooled DNA from these bulks, together with the parental genotypes, is screened with the molecular markers for the detection of polymorphism. The putative markers showing a polymorphism between the parental genotypes and the characteristic amplification profiles (banding patterns) in the corresponding bulks are then selected for screening on the DNA of individual lines of the bulks, and subsequently on the complete set of lines of the mapping population. A typical segregation banding pattern for three hypothetical polymorphic markers A, C, and D is shown for an F2 population (1:2:1) and a recombinant inbred line/doubled haploid (RIL/DH) population (1:1). In these hypothetical examples, markers A and C in the F2 population (the majority of individuals of Bulk 1 and Bulk 2 carry the alleles of the respective parents, P1 and P2) and markers C and D in the RIL/DH population appear to be putatively associated with the trait. Therefore, these markers need to be screened on the complete mapping population or on selected lines. Subsequently, the genotyping data obtained on the population, together with the phenotyping data, may be analyzed using an appropriate statistical test (e.g., χ²-test) for marker–trait association.
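The χ²-test mentioned in the caption can be illustrated with a contingency table of marker genotype against phenotype class. The sketch below is a minimal example with invented counts for a RIL/DH population. The function name, the table layout, and the 5% critical value for one degree of freedom (3.841) are assumptions made for illustration, not part of the protocol.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows = marker allele (P1, P2),
# columns = phenotype class (resistant, susceptible).
linked_marker = [[40, 10],
                 [10, 40]]
unlinked_marker = [[25, 25],
                   [25, 25]]

CRITICAL_5PCT_1DF = 3.841  # chi-square critical value, alpha = 0.05, 1 d.f.

for name, table in (("linked", linked_marker), ("unlinked", unlinked_marker)):
    stat, dof = chi_square_independence(table)
    print(name, stat, dof, stat > CRITICAL_5PCT_1DF)
```

A statistic above the critical value suggests that marker genotype and phenotype class are not independent, which is what is expected for a marker linked to the trait.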


Fig. 3. Linkage mapping-based quantitative trait loci (QTL) analysis. For quantitative traits, the most commonly used approach to QTL analysis is based on linkage mapping using a recombinant inbred line (RIL) or doubled haploid (DH) population. Here, molecular markers that are polymorphic between the parental genotypes (e.g., A, C, and D in the figure) have been screened on the complete mapping population. The genotyping data so obtained are used to calculate the recombination frequency between the markers and to construct the genetic map. In the figure, a hypothetical molecular linkage map based on three polymorphic markers (A, C, and D) is shown. The linkage mapping data, together with the phenotyping data on the progeny lines of the population, are then used for QTL analysis using appropriate statistical analyses (e.g., regression analysis, interval mapping, and composite interval mapping). A hypothetical QTL peak for the trait on the linkage map prepared in this way is shown in the figure.


some neutral markers also in order to test the levels of background stochastic association (Fig. 4).

Experimental Details

For analyzing polymorphism and genotyping the germplasm, several genotyping platforms, such as agarose gel electrophoresis, polyacrylamide gel electrophoresis, and capillary electrophoresis, are available. Each of these platforms has advantages and disadvantages relative to the others. Nevertheless, there is a need for high-throughput, robust, and cost-effective genotyping platforms for

Fig. 4. Linkage disequilibrium (LD)-based association mapping for marker–trait association. In the first instance, the germplasm to be used is screened with neutral DNA markers to estimate the population structure and the LD decay in the genome and germplasm. On the basis of these analyses, one of two approaches is used: candidate gene sequencing (A) or whole genome scanning (B). In the candidate gene sequencing approach, the putative candidate genes contributing to the phenotypic variation for the trait are selected, the appropriate genic region(s) are sequenced across the germplasm, and the sequence data are resolved into haplotypes. In the whole genome scanning approach, the germplasm is screened with molecular markers representing the whole genome, at a density based on the LD decay study. Subsequently, the haplotype or genotyping data obtained are analyzed together with the phenotyping data, using appropriate statistical tools, to correct for population structure and to test for marker–trait association.


molecular breeding. Currently, capillary electrophoresis is the platform used for marker analysis at the majority of institutions, including ICRISAT. Therefore, this section provides technical details on analyzing markers in germplasm by capillary electrophoresis, as done at ICRISAT on the ABI3130 and ABI3700.

1. Set up the PCR as per the optimized conditions to obtain amplicons using fluorescent dye-labelled primers. The forward primers are labelled with one of the fluorescent dyes: Fam (blue), Vic (green), Ned (yellow), or Pet (red). For instance, to set up the PCR in a 10-μl volume, take 1 μl of DNA (2.5 ng/μl) in a PCR tube or a 96-well or 384-well plate and add 1 μl of dNTP (2 mM), 1 μl of 10× Qiagen buffer, 0.25 U of Taq polymerase (e.g., Qiagen), 0.5 μl of labelled forward primer, and 0.5 μl of reverse primer, and make the volume up to 10 μl with sterile distilled water. Sometimes, multiplex PCR using more than one primer pair labelled with different fluorescent dyes can be set up.
2. Put the tube or plate in a thermal cycler and run the PCR using a touchdown profile. For example, for a primer pair with an annealing temperature of 55°C, set up an initial denaturation at 94°C for 3 min, followed by 5 cycles of denaturation at 94°C for 20 s, annealing at 60°C for 30 s (decreasing by 1°C in every subsequent cycle), and extension at 72°C for 30 s; then 30 cycles of denaturation at 94°C for 20 s, annealing at 55°C for 30 s, and extension at 72°C for 30 s; and a final extension step at 72°C for 20 min.
3. Check the amplification by running 2 μl of PCR product on a 1.2% agarose gel.
4. If multiplex PCR was not set up in step 1, the PCR products obtained from individual primer pairs labelled with different fluorescent dyes can sometimes be pooled. In such cases, the pooling premix includes 1 μl of each PCR product, 7 μl of Hi-Di formamide for denaturing the double-stranded DNA, and 0.2 μl of GeneScan Liz 500 internal lane size standard (orange) provided by Applied Biosystems (USA). Alternatively, 0.15 μl of GeneScan Rox 500 or Rox HD 400 size standard (red) can be used wherever the Pet label is not used in the pooled PCR products.
5. Denature the samples at 94°C for 5 min and cool immediately on ice.
6. Load the samples on the instrument and run capillary electrophoresis.
7. Import the raw allele size data into the GeneScan programme for assigning the allele sizes based on the internal size standard. For example, if Liz 500 is used as the internal lane size standard, PCR products in the range of 35–500 bp can be sized. The fragment sizes of the Liz 500 standard are 35, 50, 75, 100, 139, 150, 160, 200, 250, 300, 340, 350, 400, 450, 490, and 500 bp. Each of these DNA fragments is labelled with the Liz fluorophore. Define the analysis parameters based on the amplitude of the PCR product. During the analysis, based on the internal size standard, all the peaks of amplified PCR products can be assigned their proper sizes.
8. Import the GeneScan output file into the Genotyper programme and set the preferences according to the dyes used. Define the category according to the primer name; select the highest peak and the range of amplicon intensity to consider for the analysis. The genotyping files appear according to the primer name for all the genotypes screened for polymorphism.
9. Peaks can be labelled by size (bp), peak height, and peak area, depending on the user's requirements. Inspect some or all of the peaks manually to be more confident of the allele sizing.
10. Upload the genotyping results into a table with pre-defined columns. Export this table and save it on a computer for use in molecular breeding.

3.1.5. Phenotyping

1. Phenotype the population for the trait of interest. Phenotyping should be replicated temporally and spatially to increase the accuracy and precision of the phenotypic measurements.
2. If possible, measure the trait of interest quantitatively rather than categorically.
3. Evaluate the trait heritability to define the expectation for the genetic component of the phenotypic variance.
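One common way to evaluate heritability from replicated phenotypic data is a one-way analysis of variance across lines. The sketch below is a minimal illustration, assuming a balanced trial (every line has the same number of replicates) in a single environment and reporting broad-sense heritability on an entry-mean basis. The function name and the example values are hypothetical; multi-environment trials would need a fuller model.

```python
def broad_sense_heritability(data):
    """Entry-mean broad-sense heritability from balanced replicated data.

    `data` is a list with one inner list of replicate values per line.
    Genetic variance is estimated as (MS_lines - MS_error) / r, and
    H^2 = var_G / (var_G + var_E / r), where r is the replicate number.
    """
    g = len(data)          # number of lines
    r = len(data[0])       # replicates per line (assumed equal)
    if any(len(row) != r for row in data):
        raise ValueError("this sketch assumes a balanced design")
    grand_mean = sum(sum(row) for row in data) / (g * r)
    line_means = [sum(row) / r for row in data]
    ms_lines = r * sum((m - grand_mean) ** 2 for m in line_means) / (g - 1)
    ms_error = sum(
        (x - m) ** 2 for row, m in zip(data, line_means) for x in row
    ) / (g * (r - 1))
    var_g = max((ms_lines - ms_error) / r, 0.0)  # truncate negative estimates
    return var_g / (var_g + ms_error / r)


# Hypothetical trait values for three lines with two replicates each.
trial = [[10.0, 12.0], [20.0, 22.0], [30.0, 32.0]]
print(broad_sense_heritability(trial))
```

A value close to 1 indicates that most of the variation among line means is genetic, whereas a value near 0 suggests that marker–trait associations for that trait will be hard to detect.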

3.1.6. Statistical Association

Genotyping data obtained on the population are analyzed together with the replicated phenotyping data using appropriate statistical analysis and tools, as mentioned in Subheading 2.3.

Bulked Segregant Analysis

1. Use single-point analysis method involving t-test, analysis of variance (ANOVA), and linear regression.
2. Calculate the phenotypic variation arising from the QTL linked to the marker (coefficient of determination, R²).
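The single-point analysis and the coefficient of determination in the steps above can be sketched as a simple linear regression of phenotype on marker genotype. This is a minimal illustration, assuming a RIL/DH population in which each line is coded 0 (P1 allele) or 1 (P2 allele). The function name and the data are hypothetical, and programmes such as QGene or MapManager QTX report considerably more than this.

```python
def single_marker_regression(genotypes, phenotypes):
    """Regress phenotype on a 0/1 marker score.

    Returns (slope, r_squared): the slope is the estimated effect of
    substituting the P2 allele for the P1 allele, and r_squared is the
    fraction of phenotypic variation explained by the marker.
    """
    n = len(genotypes)
    mean_x = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, phenotypes))
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    if sxx == 0 or syy == 0:
        raise ValueError("marker and phenotype must both vary")
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, r_squared


# Hypothetical data: eight lines scored for one marker and one trait.
marker = [0, 0, 0, 0, 1, 1, 1, 1]
trait = [4.0, 5.0, 6.0, 5.0, 9.0, 10.0, 11.0, 10.0]
slope, r2 = single_marker_regression(marker, trait)
print(slope, round(r2, 3))
```

With a two-class marker, this regression is equivalent to a t-test comparing the two genotype classes, so the slope equals the difference between the class means.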

Linkage Mapping

1. Use the genotyping data to construct a framework genetic linkage map with a suitable computer programme mentioned in Subheading 2.3.
2. Use the anchor markers and their mapping positions to designate the linkage groups (Fig. 3).
3. Conduct the SMA, SIM, or CIM for identification of QTLs.
4. Calculate the phenotypic variation (R²) contributed by the QTL and measure the QTL × QTL, QTL × E, and QTL × QTL × E interaction.
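Map construction in step 1 starts from pairwise recombination fractions, which are then converted to map distances. The sketch below is a minimal illustration, assuming a DH (or backcross) population in which the fraction of recombinant lines is observed directly; RIL data would need an additional correction for multiple rounds of meiosis. The marker strings, the missing-data symbol, and the function names are hypothetical, and the Kosambi and Haldane functions are the standard mapping functions rather than anything specific to the programmes listed in Subheading 2.3.

```python
import math


def recombination_fraction(marker1, marker2, missing="-"):
    """Fraction of lines carrying different parental alleles at two markers.

    Each marker is a string with one character per line ('A' = P1 allele,
    'B' = P2 allele); lines with missing scores at either marker are skipped.
    """
    pairs = [(a, b) for a, b in zip(marker1, marker2)
             if a != missing and b != missing]
    if not pairs:
        raise ValueError("no lines scored for both markers")
    recombinants = sum(1 for a, b in pairs if a != b)
    return recombinants / len(pairs)


def kosambi_cm(r):
    """Kosambi map distance in cM for a recombination fraction 0 <= r < 0.5."""
    return 25.0 * math.log((1 + 2 * r) / (1 - 2 * r))


def haldane_cm(r):
    """Haldane map distance in cM for a recombination fraction 0 <= r < 0.5."""
    return -50.0 * math.log(1 - 2 * r)


# Hypothetical scores for ten DH lines at two markers.
marker_a = "AAAAABBBBB"
marker_c = "AAAABBBBBA"
r = recombination_fraction(marker_a, marker_c)
print(r, round(kosambi_cm(r), 1), round(haldane_cm(r), 1))
```

The two mapping functions differ in how they treat crossover interference, so the choice between them should match the one used by the mapping programme.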

Association Mapping

1. Build statistical model(s) for the expectation of phenotypic correlation with environmental and genetic variability (VP = VG + VE).
2. Evaluate the level of co-variance between the phenotypes and combine the highly correlated traits in the same model.
3. Evaluate co-variance between the neutral marker genotypes and candidate gene genotypes, in case candidate gene-based approach is being used.
4. Determine the Type I error thresholds according to the number of tests performed and the level of flexibility in the study.
5. Determine power and false positive rate expectations for the study.
6. Run statistical association tests using appropriate statistical tool/software (e.g., TASSEL).
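Step 4 above, setting the Type I error threshold according to the number of tests, can be illustrated with a Bonferroni correction. The protocol does not prescribe a particular correction, so the choice of Bonferroni here is an assumption made for simplicity; it is conservative, and less stringent procedures are also in use. The marker names and p-values are invented.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise
    Type I error rate at or below `alpha` across `n_tests` tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def significant_markers(p_values, alpha=0.05):
    """Return the sorted names of markers whose p-value is below the
    Bonferroni-corrected threshold. `p_values` maps marker name to p-value."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return sorted(name for name, p in p_values.items() if p < threshold)


# Hypothetical p-values from five marker-trait association tests.
p_values = {
    "marker_1": 0.0004,
    "marker_2": 0.02,
    "marker_3": 0.3,
    "marker_4": 0.011,
    "marker_5": 0.009,
}
print(bonferroni_threshold(0.05, len(p_values)))
print(significant_markers(p_values))
```

In this example, markers that would pass an uncorrected 5% threshold (marker_2 and marker_4) are no longer declared significant once the five tests are taken into account.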

3.2. Marker Validation

Generally, the markers/alleles associated with the trait should be validated by testing their effectiveness in determining the target phenotype in independent populations and different genetic backgrounds, which is referred to as ‘marker validation’ (44, 45; see Note 7).

1. Confirm the QTL mapping results by using independent populations constructed from the same parental genotypes, or from genotypes closely related to those used in the primary QTL mapping study. Larger population sizes may be used.
2. If suitable genetic material for the trait is available, for example near isogenic lines (NILs), genotype the NILs with the candidate markers and compare the mean trait values of particular NILs with that of the recurrent parent to confirm the effects of the QTLs.
3. Test for the presence of the marker associated with the QTL in a range of cultivars and other genotypes. There is no guarantee that DNA markers identified in one population will be useful in a different population, especially when the populations originate from distantly related germplasm.
4. Markers that reveal polymorphism in different populations derived from a wide range of parental genotypes will be most useful in breeding programmes (45).
5. If the candidate gene sequencing approach has been used in LD/association mapping, then, in addition to the above, the association of the allele with the trait may be verified either through re-evaluation in an independent population sample or through allelic-silencing or knockout studies.

3.3. Marker-Assisted Selection

Once markers that are tightly linked to genes or QTLs of interest have been identified, plant breeders may, prior to field evaluation of large numbers of plants, use specific DNA marker alleles as


a diagnostic tool to identify plants carrying the genes or QTLs. The procedure is called ‘marker-assisted selection’ or ‘markeraided selection’ (commonly referred as MAS) or ‘marker-assisted breeding’. 1. Select the markers that are tightly linked with the trait and yield the robust and clear-cut banding pattern. In principle, all such markers may be used in MAS; however, there have been reports of up to five QTLs being introgressed. 2. For early generation selection in typical breeding programme for simple (monogenic) traits, for example, disease resistance, a susceptible parent is crossed with a resistant parent and the F1 plant is self-pollinated to produce an F2 population. Use the robust marker, developed for the major gene/QTL controlling trait of interest (e.g., disease resistance) to screen the F2 population and select only those plants possessing the desirable genotypes (having the alleles conferring resistance) out of the large number (e.g., 2,000) of F2 plants (Fig. 5). It is estimated that up to 75% of plants may be eliminated after one cycle of MAS. 3. In case of marker-assisted backcrossing programme, use the molecular markers associated with the trait for foreground selection, while neutral markers covering the whole genome can be used for background selection (Fig. 6). For example, genotype the BC1F1s (by crossing the donor genotype for a trait with the F1s, obtained by crossing the donor and the elite-recipient genotypes) with the molecular markers associated with the trait as well as other neutral markers. The lines possessing the desirable genotype based on the associated markers as well as having higher proportion of genome (fingerprints) of the recipient genotype should be selected and advanced to generate the BC2F2s. 
Similar foreground and background selection with molecular markers can be conducted in subsequent generations until the backcross progeny carry the desirable chromosomal segment from the donor genotype in the genetic background of the recipient genotype.
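The foreground and background selection steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' software: the genotype coding (A for the recipient allele, B for the donor allele), the `Line` record, the function names, and the five BC1F1 fingerprints are all hypothetical. The last function gives the expected average share of recipient genome after n backcrosses without marker selection, 1 − (1/2)^(n+1), which corresponds to the 75% and roughly 87% values quoted for BC1F1 and BC2F1 in Fig. 6.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """One backcross line; 'A' = recipient allele, 'B' = donor allele (hypothetical coding)."""
    name: str
    foreground: str        # genotype at the trait-linked marker: "AA", "AB" or "BB"
    background: List[str]  # genotypes at neutral, genome-wide markers


def recipient_fraction(calls: List[str]) -> float:
    """Share of alleles at the neutral markers that come from the recipient parent."""
    alleles = "".join(calls)
    return alleles.count("A") / len(alleles)


def select_lines(lines: List[Line], keep: int = 2) -> List[Line]:
    """Foreground selection first, then rank the carriers by recipient background."""
    # Foreground: keep lines heterozygous for the donor allele at the trait marker.
    carriers = [ln for ln in lines if ln.foreground == "AB"]
    # Background: prefer lines whose fingerprint is closest to the recipient parent.
    carriers.sort(key=lambda ln: recipient_fraction(ln.background), reverse=True)
    return carriers[:keep]


def expected_recipient_fraction(n_backcrosses: int) -> float:
    """Average recipient genome share after n backcrosses without marker selection."""
    return 1 - 0.5 ** (n_backcrosses + 1)


# Made-up BC1F1 fingerprints at four neutral markers (illustration only).
bc1f1 = [
    Line("1", "AB", ["AB", "AB", "AA", "AB"]),
    Line("2", "AA", ["AA", "AA", "AA", "AB"]),
    Line("3", "AB", ["AA", "AA", "AB", "AA"]),
    Line("4", "AA", ["AB", "AA", "AA", "AA"]),
    Line("5", "AB", ["AA", "AB", "AA", "AB"]),
]

selected = select_lines(bc1f1)
print([ln.name for ln in selected])
print(expected_recipient_fraction(1), expected_recipient_fraction(2))
```

With these made-up fingerprints, lines 1, 3, and 5 pass foreground selection and lines 3 and 5 are retained after background selection, mirroring the choice illustrated in Fig. 6. In practice many more neutral markers would be scored, and the number of lines to keep depends on the programme.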

Notes

1. DNA-based molecular markers can be classified into three categories depending on how the polymorphism is revealed: hybridization-based polymorphisms, PCR-based polymorphisms, and sequence-based polymorphisms. Details about these marker systems are discussed in several reviews and book

[Figure 5: crossing scheme Genotype A (P1) × Genotype B (P2) → F1 → F2, with panels labelled P1, P2, and F1, followed by marker-assisted selection of genotypes of interest.]

Fig. 5. A hypothetical scheme for early generation selection in a breeding programme. The polymorphic molecular markers between two parental genotypes (P1 and P2) can be used to select true F1 by analyzing co-dominant markers having the alleles (bands) of both parental genotypes. After selfing the F1 lines, several hundred (sometimes thousands of) F2 lines are raised. These F2 lines can be screened with the polymorphic co-dominant markers and, based on DNA profiling, the progenies of interest are selected, for example the lines having the allele of parental genotype P2 (with higher root biomass) in the figure.

chapters (5, 46, 47). The choice of molecular marker depends on the intended use; microsatellites or simple sequence repeats (SSRs), however, have been recommended for molecular breeding as they are co-dominant, multi-allelic, and abundant in nature (48). To detect a microsatellite locus in the genome, a pair of PCR primers designed from the regions flanking the microsatellite is used. For the majority of

[Figure 6: marker-assisted backcrossing scheme. Genotype A (P1) × Genotype B (P2) → F1. Cycle 1: F1 × P1 → BC1F1, lanes P1, P2, and lines 1–7 scored with marker X (foreground selection) and by background selection (candidate genotypes 75% similar to P1). Cycle 2: selected BC1F1 lines (3, 5) × P1 → BC2F1, scored in the same way (candidate genotypes 87% similar to P1). Selected BC2F1 lines (2, 4) × P1 → BC3F1; marker analysis of BC3F1 as in cycles 1 and 2, and selected BC3F1 lines are selfed → BC3F2 → BC3F3, the MABC products ready for field trial and varietal development.]

Fig. 6. Marker-assisted backcrossing using foreground and background selection strategies. The figure shows one of the most important applications of molecular markers: introgression of a trait of interest into a genotype of interest. For example, marker–trait association studies provide the co-dominant marker X as a diagnostic or linked molecular marker for a major root trait quantitative trait locus (QTL) that contributes large phenotypic variation. For this marker, the lower band (allele B) is associated with higher root biomass, while the upper band (allele A) is associated with lower root biomass. Suppose genotype A is an elite but drought-sensitive variety, and the breeder would like to introgress the root trait QTL for higher root biomass into this genotype. The breeder needs to select a donor genotype (e.g., B) with higher root biomass. Genotype A is crossed with genotype B, and the resulting F1 lines can be screened with marker X to confirm the presence of both alleles; the true F1 lines can then be advanced. These F1 lines are backcrossed with genotype A, and the resulting BC1F1 lines can be screened with the diagnostic molecular marker X for foreground selection and with multi-locus marker system(s) for background selection. In the figure, foreground selection (identification of lines carrying the higher root biomass allele in heterozygous condition) suggests selecting lines 1, 3, and 5. In parallel, these lines are analyzed to monitor the genomic background of the recipient (A) genotype. In the figure, of the three lines 1, 3, and 5, only lines 3 and 5 fulfil the criteria of both foreground selection (having the allele of the donor genotype in heterozygous condition) and background selection (having 75% of the genome of the recipient genotype).

Therefore, these lines are advanced for further backcrossing. The figure shows the backcrossing of a selected BC1F1 line with the A genotype; the resulting BC2F1 lines are further screened by foreground selection (with the diagnostic marker X) and background selection (multi-locus fingerprinting). As a result of the second round of foreground and background selection, two BC2F1 lines (lines 2 and 4) are selected that show the allele for higher root biomass (in heterozygous condition) and 87% of the genome of the recipient genotype. Such backcrossing with foreground and background selection is continued for up to 3–4 cycles, depending on the nature of the crop species. At the end of MABC, the progeny lines are analyzed using the diagnostic marker X, and the plants carrying the higher root biomass allele are selected. Subsequently, the selected progeny lines are selfed until appropriate marker-assisted backcross (MABC) lines carrying the higher root biomass alleles in the genomic background of the recipient genotype are generated. These lines can eventually be taken to field trials, after which the other requirements of varietal development can be followed.

the major crop species, these SSR markers are available in large numbers in the public domain.

2. In general, for trait mapping based on linkage maps or BSA, several types of mapping populations, derived from crosses involving any two diverse parents, can be used. For instance, an F2 population or BC population can be derived from F1 plants by selfing or by backcrossing them to one of the parents; RILs can be derived by single-seed descent for at least five generations; and DHs can be derived from haploids obtained from F1 plants through anther/egg cell/ovule culture or distant hybridization. The simplest mapping populations are F2 or BC populations; however, these are not permanent, whereas RILs and DHs are immortal populations that can be stored and shared across laboratories.

3. Association or LD mapping is the basis for gene mapping in species where large mapping populations cannot be readily produced, such as tree species, farm animals, and humans (49, 50). The LD method, unlike the use of biparental mapping populations, uses real breeding populations; the material is diverse and relevant; and the most important genes (e.g., for adaptation) should be co-segregating in such populations (51, 52).

4. Among the different statistical analyses for QTL mapping, SMA (or single-point analysis) is the simplest method for detecting QTLs associated with single markers. The statistical methods used for SMA include t-tests, ANOVA, and linear regression. Linear regression is most commonly used because the coefficient of determination (R²) for the marker gives the proportion of phenotypic variation explained by the QTL linked to the marker. In fact, this method is generally used in the BSA approach for trait mapping. However, the main disadvantages of this method are: (i) the further a QTL is from a marker, the less likely it is to be detected, because recombination may occur between the marker and the QTL; and (ii) this recombination causes the magnitude of the effect of a QTL to be underestimated. The use of a large number of segregating markers covering the entire genome, usually at intervals of less than 15 cM, may minimize both problems (33). The linkage map-based trait mapping approach employs the SIM method, which makes use of linkage maps and analyses intervals between adjacent pairs of linked markers along chromosomes simultaneously (53). The use of linked markers under SIM is considered statistically more powerful than single-point analysis because recombination between the markers and the QTL is taken into account (34). The CIM approach, however, combines interval mapping with linear regression and includes additional molecular markers in the statistical model, in addition to the adjacent pair of linked markers used for interval mapping (54). This method is more precise and effective at mapping QTLs than single-point analysis (SMA) and SIM, especially when linked QTLs are involved.

5. In the association or LD mapping approach, the statistical power of associations is determined by the extent of LD with the causative polymorphism, as well as by the sample size used for the study (55, 56). The decay of LD over physical distance in the study population determines the marker density required and the level of resolution that may be obtained in an association study. The most commonly used summary statistic for estimating LD within the association study framework is r² (57, 58). Here r is Pearson’s (product-moment) correlation coefficient, which describes the predictive value of the allelic state at one polymorphic locus for the allelic state at another polymorphic locus; r² is the squared value of this correlation coefficient, also called the coefficient of determination, and it gives the proportion of the sample variance of a response variable that is explained by the predictor variables when a linear regression is performed (50).

Lewontin’s D is another commonly used summary statistic for LD; it describes the difference between the product of the coupling gamete frequencies and the product of the repulsion gamete frequencies at two loci. From D, a second measure of LD, the normalized D′, can also be estimated. It is important to estimate the rate of decay of LD with physical distance in order to extrapolate information gathered from a small collection of sampled loci to the whole genome under investigation. This extrapolation is essential for association mapping study design, since it may be used to determine the marker density required for scanning previously unexplored regions of the genome, as well as the maximum resolution that can be achieved for genotype–phenotype associations in the study population.
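The LD statistics discussed in this note can be computed directly from the counts of the four two-locus haplotypes. The sketch below is illustrative only: the function name and the example counts are assumptions, the loci are taken to be biallelic with phased haplotypes (alleles A/a at the first locus and B/b at the second), and D′ is reported as an absolute value.

```python
def ld_statistics(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, |D'|, r^2) from counts of the four two-locus haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_AB, p_Ab, p_aB, p_ab = (n / total for n in (n_AB, n_Ab, n_aB, n_ab))

    p_A = p_AB + p_Ab  # frequency of allele A at the first locus
    p_B = p_AB + p_aB  # frequency of allele B at the second locus
    if not (0 < p_A < 1 and 0 < p_B < 1):
        raise ValueError("both loci must be polymorphic")

    # D: product of coupling frequencies minus product of repulsion frequencies.
    d = p_AB * p_ab - p_Ab * p_aB

    # D': D scaled by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max

    # r^2: squared correlation between the allelic states at the two loci.
    r2 = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r2


# Made-up haplotype counts for 100 sampled gametes (illustration only).
d, d_prime, r2 = ld_statistics(50, 10, 10, 30)
print(round(d, 3), round(d_prime, 3), round(r2, 3))
```

Repeating such a calculation for marker pairs at increasing physical distance gives the LD decay curve on which the choice of marker density is based.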


Another important constraint on the use of association mapping in crop plants is unidentified population sub-structuring and admixture due to factors such as adaptation or domestication (43, 59). Population structure creates genome-wide LD between unlinked loci. When allele frequencies differ significantly between sub-populations of a species, owing to factors such as genetic drift, domestication, or background selection, genetic loci that have no effect whatsoever on the trait may show statistically significant co-segregation with the trait of interest (see ref. 50 for details). In cases where the population structuring is mostly due to population stratification (41, 60), three methods are often proposed for statistically controlling the effects of stratification on association tests: (a) genomic control (GC) (61–63); (b) the structured association (SA) method, including two extensions tailored to the type of association study, namely case-control (SA-model) (42) and quantitative trait association (Q-model) (32, 43); and (c) the unified mixed model approach (Q + K) (64).

After the LD decay and population structure have been analyzed and the population has been appropriately genotyped, marker–trait association studies are conducted. Whether the trait of interest is binary or quantitative also matters for the design of the association study. When a binary trait is being investigated, case-control type populations are required for association analysis: equivalent-sized sub-populations of individuals that display the phenotype of interest (cases) and that do not display it (controls) are tested for statistically significant allelic association of genetic loci with the case and control phenotypes (50). The statistical test performed is simply a hypothesis test that asks whether the allelic frequency distribution at a given locus is the same or different between the two sub-populations. Most of the statistical methods aim to detect and correct for the effects of population stratification and ancestry differences between the case and control groups (50, 65).

6. When the same set of molecular markers is used in different mapping populations of a given species to construct linkage maps, the marker orders and the linkage maps can be correlated. Therefore, in order to correlate information from one map to another, common markers are required. Common markers that are highly polymorphic in different mapping populations are called ‘anchor’ or ‘core’ markers. Generally, anchor markers are SSRs or RFLPs (31).

7. Marker validation involves testing the reliability of markers to predict phenotype and indicates whether a marker could be used in routine screening for MAS.
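The single-marker analysis described in Note 4 can be sketched as a simple linear regression of the phenotype on a marker score. This is a minimal sketch, assuming a co-dominant marker coded as the number of copies of allele B (0, 1, or 2); the function name, the coding, and the six phenotype values (e.g., root biomass in grams) are hypothetical, and a real analysis would also include a significance test and many more individuals.

```python
def single_marker_regression(genotypes, phenotypes):
    """Regress phenotype on marker score; return (slope, intercept, R^2)."""
    n = len(genotypes)
    if n != len(phenotypes) or n < 3:
        raise ValueError("need matching genotype and phenotype lists of length >= 3")

    mean_x = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, phenotypes))
    if sxx == 0 or syy == 0:
        raise ValueError("marker and phenotype must both vary")

    slope = sxy / sxx                   # estimated effect of one copy of allele B
    intercept = mean_y - slope * mean_x
    r2 = sxy ** 2 / (sxx * syy)         # share of phenotypic variance explained
    return slope, intercept, r2


# Made-up data: marker scores (copies of allele B) and a phenotype for six lines.
marker_scores = [0, 0, 1, 1, 2, 2]
root_biomass = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept, r2 = single_marker_regression(marker_scores, root_biomass)
print(slope, intercept, round(r2, 3))
```

Because recombination between the marker and the QTL weakens the marker–phenotype correlation, the R² obtained this way tends to understate the effect of a QTL that lies some distance from the marker, which is the limitation that interval mapping addresses.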


Acknowledgements

The authors (RKV, DAH, SNN) are grateful to the Generation Challenge Programme (GCP) of the Consultative Group on International Agricultural Research (CGIAR), the Indo–US Agricultural Knowledge Initiative (AKI), the National Fund of the Indian Council of Agricultural Research (ICAR), and the Department of Biotechnology (DBT), Government of India, for financial support of their research. SNN is thankful to the Council of Scientific and Industrial Research (CSIR), Government of India, for sponsoring a fellowship.

References

1. Gur, A. and Zamir, D. (2004) Unused natural variation can lift yield barriers in plant breeding. PLoS Biol. 2, e245.
2. Tanksley, S.D., Young, N.D., Paterson, A.H., and Bonierbale, M.W. (1989) RFLP mapping in plant breeding: new tools for an old science. BioTechnology 7, 257–264.
3. Phillips, R.L. and Vasil, I.K. (eds.) (2001) DNA-Based Markers in Plants (2nd ed). Kluwer Academic Publishers, Dordrecht, The Netherlands.
4. Rafalski, J.A. and Tingey, S.V. (1993) Genetic diagnostics in plant breeding: RAPDs, microsatellites and machines. Trends Genet. 9, 275–280.
5. Azhaguvel, P., et al. (2006) Methodological advancement in molecular markers to delimit the gene(s) for crop improvement, in Floriculture, Ornamental and Plant Biotechnology: Advances and Topical Issues (Vol. I) (Teixeira da Silva, J.A., ed.), Global Science Books, London, pp. 460–469.
6. Gupta, P.K. and Varshney, R.K. (2004) Cereal genomics: an overview, in Cereal Genomics (Gupta, P.K. and Varshney, R.K., eds.), Kluwer Academic Publishers, Dordrecht, The Netherlands, pp. 1–18.
7. Jahoor, A., Eriksen, L., and Backes, G. (2004) QTLs and genes for disease resistance in barley and wheat, in Cereal Genomics (Gupta, P.K. and Varshney, R.K., eds.), Kluwer Academic Publishers, Dordrecht, The Netherlands, pp. 199–252.
8. Li, W. and Gill, B.S. (2004) Genomics for cereal improvement, in Cereal Genomics (Gupta, P.K. and Varshney, R.K., eds.), Kluwer Academic Publishers, Dordrecht, The Netherlands, pp. 585–634.
9. Tuberosa, R. and Salvi, S. (2004) QTLs and genes for tolerance to abiotic stress in cereals, in Cereal Genomics (Gupta, P.K. and Varshney, R.K., eds.), Kluwer Academic Publishers, Dordrecht, The Netherlands, pp. 253–315.

10. Varshney, R.K., Hoisington, D.A., and Tyagi, A.K. (2006) Advances in cereal genomics and applications in crop breeding. Trends Biotechnol. 24, 490–499.
11. Toojinda, T., Baird, E., Booth, A., Broers, L., Hayes, P., Powell, W., Thomas, W., Vivar, H., and Young, G. (1998) Introgression of quantitative trait loci (QTLs) determining stripe rust resistance in barley: an example of marker assisted line development. Theor. Appl. Genet. 96, 123–131.
12. Langridge, P. (2005) Molecular breeding of wheat and barley, in In the Wake of Double Helix: From the Green Revolution to the Gene Revolution (Tuberosa, R., Phillips, R.L., and Gale, M., eds.), Avenue Media, Bologna, Italy, pp. 279–286.
13. Toenniessen, G.H., O’Toole, J.C., and DeVries, J. (2003) Advances in plant biotechnology and its adoption in developing countries. Curr. Opin. Plant Biol. 6, 191–198.
14. Dreher, K., Morris, M., and Khairallah, M. (2000) Is marker assisted selection cost-effective compared to conventional plant breeding methods? The case of quality protein maize, in Proc 4th Annu Conf Intern Consor on Agricultural Biotechnology Research (ICABR), The Economics of Agricultural Biotechnology, Ravello, Italy.
15. Jefferies, S.P., King, B.J., Barr, A.R., Warner, P., Logue, S.J., and Langridge, P. (2003) Marker-assisted backcross introgression of the yd2 gene conferring resistance to barley yellow dwarf virus in barley. Plant Breed. 122, 52–56.
16. Friedt, W. and Ordon, F. (2007) Molecular markers for gene pyramiding and disease resistance breeding in barley, in Genomics Assisted Crop Improvement (Varshney, R.K. and Tuberosa, R.T., eds.), Springer, The Netherlands (Vol. II), pp. 97–120.

17. Koebner, R.M.D. (2004) Marker assisted selection in the cereals: the dream and the reality, in Cereal Genomics (Gupta, P.K. and Varshney, R.K., eds.), Kluwer Academic Publishers, Dordrecht, The Netherlands, pp. 317–329.
18. Sanchez, A.C., Brar, D.S., Huang, N., Li, Z.K., and Khush, G.S. (2000) Sequence-tagged site marker-assisted selection for three bacterial blight resistance genes in rice. Crop Sci. 40, 792–797.
19. Singh, S., Sidhu, J.S., Huang, N., Vikal, Y., Li, Z., Brar, D.S., Dhaliwal, H., and Khush, G.S. (2001) Pyramiding three bacterial blight resistance genes (xa-5, xa-13 and Xa21) using marker-assisted selection into indica rice cultivar PR106. Theor. Appl. Genet. 102, 1011–1015.
20. He, Y., Li, X., Zhang, J., Jiang, G., Liu, S., Chen, S., Tu, J., Xu, C., and Zhang, Q. (2004) Gene pyramiding to improve hybrid rice by molecular-marker techniques, in New Directions for a Diverse Planet: Proc. 4th Intern. Crop Sci. Cong. Brisbane, Australia. (http://www.cropscience.org.au/icsc2004/)
21. Narayanan, N.N., Baisakh, N., Vera Cruz, C.M., Gnanamanickam, S.S., Datta, K., and Datta, S.K. (2002) Molecular breeding for the development of blast and bacterial blight resistance in rice cv. IR50. Crop Sci. 42, 2072–2079.
22. Joseph, M., Gopalakrishnan, S., Sharma, R.K., Singh, V.P., Singh, A.K., Singh, N.K., and Mohapatra, T. (2004) Combining bacterial blight resistance and Basmati quality characteristics by phenotypic and molecular-marker assisted selection in rice. Mol. Breed. 13, 377–387.
23. Rostoks, N., Borevitz, J.O., Hedley, P.E., Russell, J., Mudie, S., Morris, J., Cardle, L., Marshall, D.F., and Waugh, R. (2005) Single-feature polymorphism discovery in the barley transcriptome. Genome Biol. 6, R54.
24. Rostoks, N., Schmierer, D., Mudie, S., Drader, T., Brueggeman, R., Caldwell, D.G., Waugh, R., and Kleinhofs, A. (2006) Barley necrotic locus nec1 encodes the cyclic nucleotide-gated ion channel 4 homologous to the Arabidopsis HLM1. Mol. Genet. Genomics 275, 159–168.
25. Varshney, R.K., Graner, A., and Sorrells, M.E. (2005b) Genomics-assisted breeding for crop improvement. Trends Plant Sci. 10, 621–630.
26. Andersen, J.R. and Lübberstedt, T. (2003) Functional markers in plants. Trends Plant Sci. 8, 554–560.
27. Varshney, R.K., Graner, A., and Sorrells, M.E. (2005a) Genic microsatellite markers in plants: features and applications. Trends Biotechnol. 23, 48–55.


28. Michelmore, R.W., Paran, I., and Kesseli, R.V. (1991) Identification of markers linked to disease resistance genes by bulked-segregant analysis; a rapid method to detect markers in specific genomic regions by using segregating populations. Proc. Natl. Acad. Sci. USA 88, 9828–9832.
29. Young, N.D. (1994) Constructing a plant genetic linkage map with DNA markers, in DNA-Based Markers in Plants (Vasil, I.K. and Phillips, R.L., eds.), Kluwer, Dordrecht, pp. 39–57.
30. Mohan, M., Suresh, N., Bhagwat, A., Krishna, T.G., Yano, M., Bhatia, C.R., and Sasaki, T. (1997) Genome mapping, molecular markers and marker-assisted selection in crop plants. Mol. Breed. 3, 87–103.
31. Collard, B.C.Y., Jahufer, M.Z.Z., Brouwer, J.B., and Pang, E.C.K. (2005) An introduction to markers, quantitative trait loci (QTL) mapping and marker-assisted selection for crop improvement: the basic concepts. Euphytica 142, 169–196.
32. Camus-Kulandaivelu, L., Veyrieras, J.B., Madur, D., Combes, V., Fourmann, M., Barraud, S., Dubreuil, P., Gouesnard, B., Manicacci, D., and Charcosset, A. (2006) Maize adaptation to temperate climate: relationship between population structure and polymorphism in the Dwarf8 gene. Genetics 172, 2449–2463.
33. Tanksley, S.D. (1993) Mapping polygenes. Annu. Rev. Genet. 27, 205–233.
34. Liu, B. (1998) Statistical Genomics: Linkage, Mapping and QTL Analysis. CRC Press, Boca Raton.
35. Nelson, J.C. (1997) Qgene – software for marker-based genomic analysis and breeding. Mol. Breed. 3, 239–245.
36. Manly, K.F., Cudmore, R.H., and Meer, J.M. (2001) Map Manager QTX, cross-platform software for genetic mapping. Mamm. Genome 12, 930–932.
37. Lincoln, S., Daly, M., and Lander, E. (1993a) Constructing genetic linkage maps with MAPMAKER/EXP. Version 3.0. Whitehead Institute for Biomedical Research Technical Report, 3rd ed.
38. Lincoln, S., Daly, M., and Lander, E. (1993b) Mapping genes controlling quantitative traits using MAPMAKER/QTL. Version 1.1. Whitehead Institute for Biomedical Research Technical Report, 2nd ed.
39. Basten, C.J., Weir, B.S., and Zeng, Z.B. (1994) Zmap – a QTL cartographer, in Proceedings of the 5th World Congress on Genetics Applied to Livestock Production: Computing Strategies and Software, Guelph, Ontario, Canada (Smith J.S.G.C., Benkel, B.J., Chesnais, W.F., Gibson, J.P., Kennedy, B.W., and Burnside, E.B., eds.), Published by the Organizing Committee, 5th World Congress on Genetics Applied to Livestock Production.
40. Utz, H. and Melchinger, A. (1996) PLABQTL: A program for composite interval mapping of QTL. J. Quant. Trait Loci 2. http://probe.nalusda.gov:8000/otherdocs/jqtl
41. Pritchard, J.K. (2001) Deconstructing maize population structure. Nat. Genet. 28, 203–204.
42. Pritchard, J.K., Stephens, M., Rosenberg, N.A., and Donnelly, P. (2000) Association mapping in structured populations. Am. J. Hum. Genet. 67, 170–181.
43. Thornsberry, J.M., Goodman, M.M., Doebley, J., Kresovich, S., Nielsen, D. et al. (2001) Dwarf8 polymorphisms associate with variation in flowering time. Nat. Genet. 28, 286–289.
44. Gupta, P.K., Varshney, R., Sharma, P., and Ramesh, B. (1999) Molecular markers and their applications in wheat breeding. Plant Breed. 118, 369–390.
45. Langridge, P., Lagudah, E., Holton, T., Appels, R., Sharp, P., and Chalmers, K. (2001) Trends in genetic and genome analyses in wheat: a review. Aust. J. Agric. Res. 52, 1043–1077.
46. Gupta, P.K., Varshney, R.K., and Prasad, M. (2002) Molecular markers: principles and methodology, in Molecular Techniques in Crop Improvement (Jain, S.M., Brar, D.S., and Ahloowalia, B.S., eds.), Kluwer Academic Publishers, Dordrecht, The Netherlands, pp. 9–54.
47. Somers, D.J. (2004) Molecular marker systems and their evaluation for cereal genetics, in Cereal Genomics (Gupta, P.K. and Varshney, R.K., eds.), Kluwer Academic Publishers, Dordrecht, The Netherlands, pp. 19–34.
48. Gupta, P.K. and Varshney, R.K. (2000) The development and use of microsatellite markers for genetic analysis and plant breeding with emphasis on bread wheat. Euphytica 113, 163–185.
49. Flint-Garcia, S.A., Thornsberry, J.M., and Buckler, E.S. (2003) Structure of linkage disequilibrium in plants. Ann. Rev. Plant Biol. 54, 357–374.
50. Ersoz, E.S., Yu, J., and Buckler, E.S. (2007) Applications of linkage disequilibrium and association mapping in crop plants, in Genomics Assisted Crop Improvement (Varshney, R.K. and Tuberosa, R.T., eds.), Springer, The Netherlands (Vol. I), pp. 97–120.

51. Buckler, E.S. and Thornsberry, J. (2002) Plant molecular diversity and applications to genomics. Curr. Opin. Plant Biol. 5, 107–111.
52. Yu, J. and Buckler, E.S. (2006) Genetic association mapping and genome organization of maize. Curr. Opin. Biotechnol. 17, 155–160.
53. Lander, E. and Botstein, D. (1989) Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121, 185–199.
54. Jansen, R. and Stam, P. (1994) High resolution of quantitative traits into multiple loci via interval mapping. Genetics 136, 1447–1455.
55. Long, A.D. and Langley, C.H. (1999) The power of association studies to detect the contribution of candidate genetic loci to variation in complex traits. Genome Res. 9, 720–731.
56. Wang, Y. and Rannala, B. (2005) In silico analysis of disease-association mapping strategies using the coalescent process and incorporating ascertainment and selection. Am. J. Hum. Genet. 76, 1066–1073.
57. Hill, W.G. and Robertson, A. (1968) Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38, 226–231.
58. Lewontin, R.C. (1988) On measures of gametic disequilibrium. Genetics 120, 849–852.
59. Wright, S.I. and Gaut, B.S. (2005) Molecular population genetics and the search for adaptive evolution in plants. Mol. Biol. Evol. 22, 506–519.
60. Bamshad, M., Wooding, S., Salisbury, B.A., and Stephens, J.C. (2004) Deconstructing the relationship between genetics and race. Nat. Rev. Genet. 5, 598–609.
61. Devlin, B., Bacanu, S.A., and Röder, K. (2004) Genomic control to the extreme. Nat. Genet. 36, 1129–1130.
62. Devlin, B. and Roeder, K. (1999) Genomic control for association studies. Biometrics 55, 997–1004.
63. Devlin, B., Röder, K., and Wasserman, L. (2001) Genomic control, a new approach to genetic-based association studies. Theor. Popul. Biol. 60, 155–166.
64. Yu, J., Pressoir, G., Briggs, W.H., Vroh Bi, I., Yamasaki, M., Doebley, J., McMullen, M., Gaut, B., Holland, J.J., Kresovich, S., and Buckler, E. (2006) A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38, 203–208.
65. Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., and Reich, D. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909.

Chapter 16

Practical Delivery of Genes to the Marketplace

David A. Fischhoff and Molly N. Cline

Summary

Although new technologies in genomics are powerful tools for discovering genes and gaining insight into their function, discovery of a gene itself does not ensure its practical application. Commercialization of transgenic crop plants has now taken place for more than a decade. Plant biotechnology, which can be seen as an extension of traditional plant breeding for crop improvement, offers one way to boost food, feed, fiber, and fuel production and has provided significant environmental and economic benefits. Like plant breeding, biotechnology introduces new traits with specific benefits into plants, and does so in a selective, precise, and controlled manner. Several steps are necessary before commercializing a crop with a biotechnology trait, including not only gene discovery and product development but also regulatory clearance, stewardship evaluation, and stakeholder dialogue. Examples will be drawn from the work at Monsanto on the development and commercialization of glyphosate-tolerant soybeans, which is representative of the first wave of agronomic traits.

Key words: Genomics, Commercialization, Transgenic crop, Biotechnology, Glyphosate tolerance, Regulatory process, Safety assessment, Stewardship, Marketplace acceptance.

1. Introduction

Modern plant biotechnology, especially the introduction of heterologous genes into crop plants, has now been practiced for about 25 years. Commercialization of transgenic crop plants has now taken place for more than a decade. Plant biotechnology, which can be seen as an extension of traditional plant breeding for crop improvement, offers one way to boost food, feed, fiber, and fuel production and has provided significant environmental and economic benefits. Like plant breeding, biotechnology

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513 © Humana Press, a part of Springer Science + Business Media, LLC 2009 DOI: 10.1007/978-1-59745-427-8_16


introduces new traits with specific benefits into plants, and does so in a selective, precise, and controlled manner. The increasing development of genomics and other technologies is helping to elucidate the functions of genes and is expected to provide many additional genes of interest for crop improvement via biotechnology. As a result, it is of interest to consider, in the context of this volume, some of the necessary steps to commercialize a crop containing a biotechnology trait, including not only gene discovery and product development but also regulatory clearance, stewardship evaluation, and stakeholder dialogue. Examples will be drawn from the work at Monsanto on the development and commercialization of glyphosate-tolerant soybeans, which is representative of the first wave of agronomic traits.

Plant biotechnology research and development for products on the market today began in the early 1980s. Those initial products required the parallel development of gene discovery and optimization for useful new traits, gene-expression technology for transgenic plants, transformation techniques for the major crop plants, and the science underpinning regulatory clearance for these products. As a result, it took about 15 years from the initial reports of the production of the first transgenic plants to the commercialization of the first products. Today, many of those approaches are much more highly developed; gene discovery, in particular, is expected to accelerate owing to recent advances such as the publication of entire genome sequences, including the human genome and important plant genomes (such as rice and Arabidopsis), unlocking a new era of scientific discovery.

Plant biotechnology involves excising genetic elements from different plant or microbial sources and inserting them into plants. Once the DNA is inserted into the plant, the product development cycle is very similar to that of traditional plant breeding. While biotechnology has allowed plant breeders to broaden the scope of traits that can be incorporated in the plant genome, and therefore to broaden the value seeds can bring to agricultural production, the road to market for a biotechnology-enhanced crop is long and expensive. Although new technologies in genomics are powerful tools for discovering genes and gaining insight into their function, discovery of a gene itself does not ensure its practical application, because other steps are required; describing those steps is the purpose of this chapter.

Following extensive phases of discovery, research, testing, and worldwide regulatory review, crop seeds with built-in herbicide tolerance (including canola in 1995, soybean in 1996, cotton in 1998, and corn in 1998) and insect protection (including cotton and corn in 1996) started to become available to growers for the first time in 1996. A historic milestone was reached in 2006 when


22 countries grew biotech crops, including 6 of the 25 countries in the European Union, and over a billion biotech acres had been planted. The crops include soybeans, maize (corn), cotton, and canola. Ingredients from those commodity crops and others are in the food and feed supply today (1).

As an example, Monsanto’s herbicide-tolerant soybeans, branded as Roundup Ready® soybeans, have been modified to withstand applications of Roundup® agricultural herbicide, whose active ingredient is glyphosate, a nonselective herbicide that affects almost all vegetation it contacts. Roundup Ready soybeans allow growers to apply Roundup agricultural herbicides over the top of the crop, killing the weeds but not harming the soybeans, and thus achieving better weed control. Traditional herbicides can cause crop injury after application, potentially reducing yield. This is not the case with the Roundup Ready soybean system. Because Roundup Ready soybeans represent one of the earliest and most widely adopted products of agricultural biotechnology, we will use them as a case study in the sections that follow and, where appropriate, indicate differences for other traits or in more current practices (2, 3).

2. Materials

The method described here focuses on the steps necessary before commercializing a crop with a biotechnology trait, not on the specifics of event discovery. The required materials include
– Gene with commercial value
– Suitable crop for transformation
– Access to molecular biology laboratories
– Access to crop test sites

3. Methods

3.1. Phases

3.1.1. Discovery

During the discovery phase, a gene that might confer a useful trait on a crop plant, such as herbicide tolerance, is identified. During the discovery phase of Roundup Ready soybeans, the gene encoding 5-enolpyruvylshikimate-3-phosphate synthase (EPSPS) was introduced into soybeans to confer tolerance to glyphosate (4). For glyphosate tolerance, this gene identification included some understanding of the mode of action of glyphosate on plants and gene testing in both microbial species (which can be sensitive to glyphosate) and crop plants. In the early stages of gene identification for glyphosate tolerance, the ability to produce fertile transgenic plants in some important crop species such as soybean and corn was not yet developed. Today, it is possible to consider testing genes in the crop of interest at a much earlier stage of development. At the same time, gene testing in model systems continues to be very useful, and gene discovery using a variety of functional genomics approaches in a model system such as Arabidopsis has become an important approach. The phases of commercializing a transgenic crop plant are shown in Fig. 1.

However, beyond determining gene function for practical applications in agriculture, the questions need to be asked, “What application/utility could this gene have? Will it benefit growers (an agronomic trait), or processors (higher levels of protein for animal feed), or consumers directly (an improved food oil)?” Figure 2 demonstrates how genomics enables agricultural product development by linking gene structure to gene function for important traits in plants. For Roundup Ready soybeans, the answer to some of these practical questions was apparent during early phases of research, assuming that genes could be made to work to confer this trait. Similarly, the potential practical utility of insect-resistance genes, such as genes from Bacillus thuringiensis (Bt), was understood at an early stage. Today, with genomics-driven gene discovery, it may not be as apparent from functional genomics testing which genes will be useful in conferring a desired trait.

Fig. 1. The technology development phases of transgenic crops leading to commercialization, including approximate time and cost.

Fig. 2. Genetic and genomics information linking gene structure and gene function for agricultural product development.

3.1.2. Phase I: Proof of Concept

Once the gene is isolated, the process moves to a Proof of Concept Phase, which involves:

• Transformation of the plant
• Determining that the discovered gene has the potential to confer the trait in a crop plant through
– Evaluation of the transformed plants for the desired phenotype; this can include biochemical, gene expression, physiological, and morphological tests; and
– Greenhouse and field testing, which are typically directed at confirming that the desired trait is observed, although not necessarily at commercial levels.
• Proof of Concept might also involve early stages of gene optimization, if necessary, for example, assessing the gene’s function using different promoters.

For a new gene, this process can take 3–5 years, depending on the complexity. The Proof of Concept Phase for a known technology like the Roundup Ready trait generally takes 2–4 years, depending on the plant, because a known trait has often been observed to be transferable with few or no modifications into multiple crop species.

The first generation of plant biotechnology has delivered many products creating real benefits. In the USA alone, more than 50 agricultural biotechnology products (including canola, corn, cotton, papaya, potato, soybeans, squash, sugar beets, sweet corn, and tomato) have completed all federal regulatory requirements from all relevant agencies and may be sold commercially, although not all are currently marketed. Products with the most extensive commercial use are
• herbicide-tolerant crops such as Roundup Ready corn, canola, cotton, and soybean;
• insect-resistant crops such as YieldGard® Corn Borer, YieldGard® Rootworm, and Bollgard® and Bollgard II® cotton; and
• virus-resistant crops, primarily in fruits and vegetables.

At the time of the development of Roundup Ready soybeans, a biolistic method was used to transform soybeans. Agrobacterium transformation of soybean has also been developed. Today, transformation techniques have been developed for essentially all commercial crop plants, based on either Agrobacterium or biolistic particle delivery approaches. The time required for development of transformed regenerated plants can be one of the rate-limiting steps in testing and developing genes for commercial use. Depending on the crop, going from initiation of transformation to regenerated transgenic plants in the greenhouse can take 4–12 months or even longer. Many tests are performed on progeny plants, necessitating the development of seed from the primary transformants, which can add several more months. Thus, it can be 1–2 years in some cases before gene performance in conferring a trait can be assessed.

3.1.3. Phase II: Early Development

• Bioevaluation and trait development refer to the more advanced testing of crop plants containing the gene of interest. These typically involve specific assays for the trait. In the case of Roundup Ready soybeans, these would have included enzyme assays for the EPSP synthase and tests for viability and fertility of the plants after application of Roundup agricultural herbicide.
• If the gene confers the trait in laboratory and greenhouse assays, the next steps are to test the plants under more normal growing conditions in the field. Field trials for transgenic plants require specific approval from the US Department of Agriculture (USDA) or appropriate regulatory authorities in other countries and require that strict requirements for containment be followed. Some countries have not yet developed processes for granting regulatory approval for testing of transgenic crop plants. It is important to anticipate the need, the data likely to be required, and the time needed to apply and receive approval for field testing in order to avoid unnecessary delays in research and development. It is also important to ensure that qualified individuals are available to conduct these regulated trials.
• Preregulatory data generation: On the basis of regulatory approval for the first generation of transgenic crop products, it is possible to anticipate some of the regulatory data requirements. Even at an early stage of development, it is possible to do some preliminary assessment such as determining the likely mode of action of the introduced protein, determining if the protein has a history of safe use (as with Bt proteins, which were used in microbial insecticides before the advent of transgenic crops), and determining if the protein has properties that might make it a potential food allergen.
• Large-scale transformation: Typically, to produce a new transgenic commercial product, it will be necessary to generate a relatively large number of transgenic plants containing the gene of interest. The primary reason to do so is that in each individual transformed, regenerated plant with the same gene (referred to as individual “events”), the gene of interest will be inserted into a different location in the plant genome. These different locations lead to different gene expression properties (position effect), which in turn can lead to differences in the expressed trait. In addition, some events might show detrimental effects on the plant because of the transformation process. Numerous transformations are completed to ensure that adequate “events” can be evaluated to select the best event to move forward toward commercialization.

The first goal in advanced development is to produce an adequate number of events so that at least one event (and hopefully more) will show expression of the trait at commercial levels with no detrimental effects. While the required number of events can vary widely depending on the gene and the crop, it is not unusual to produce hundreds of transgenic events for a gene in a crop.

3.1.4. Phase III: Advanced Development

• Field testing and agronomic evaluation: Once a large number of events have been produced, they go through a similar battery of tests at the laboratory, greenhouse, and field level as was done at the early development stage. However, at the advanced development phase, the goal is to identify events that meet all the necessary criteria for commercialization, including trait performance and agronomic characteristics. Of particular importance is the assessment of yield under typical growing conditions.
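The event screening described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used for any commercial product: the `Event` fields, the threshold values, and the per-event success probability are hypothetical placeholders. The second function applies basic probability, under the simplifying assumption that events succeed independently, to estimate how many events must be produced so that at least one meets the criteria.

```python
import math
from dataclasses import dataclass


@dataclass
class Event:
    """One independent transformation event (all fields are hypothetical)."""
    event_id: str
    trait_score: float     # trait performance, scaled 0-1 (assumed scale)
    relative_yield: float  # yield relative to the untransformed control (1.0 = equal)
    detrimental: bool      # True if any detrimental effect was observed


def select_events(events, min_trait=0.9, min_yield=0.98):
    """Keep events that pass every criterion, best trait performance first.

    The thresholds are illustrative assumptions, not regulatory or
    commercial standards.
    """
    passing = [
        e for e in events
        if not e.detrimental
        and e.trait_score >= min_trait
        and e.relative_yield >= min_yield
    ]
    return sorted(passing, key=lambda e: (e.trait_score, e.relative_yield), reverse=True)


def events_needed(p_success, confidence=0.95):
    """Smallest n such that P(at least one passing event) >= confidence.

    Assumes each event passes independently with probability p_success,
    so P(at least one) = 1 - (1 - p_success) ** n.
    """
    if not 0 < p_success < 1:
        raise ValueError("p_success must be strictly between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_success))
```

Under these assumptions, a hypothetical 1% chance that any single event is commercially acceptable gives `events_needed(0.01)` = 299 for 95% confidence, which is in line with the observation that hundreds of events per gene are not unusual.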

• Trait integration into agronomically adapted cultivars: In some crop plants, the cultivars that can be transformed and regenerated to whole plants are not the highest yielding or most adapted to agricultural systems. The trait must be integrated into better cultivars and these new plants evaluated for agronomic competitiveness as well as the trait of interest. Over the last 10 years, the use of molecular breeding combined with biotechnology has greatly improved this process. “Marker-assisted backcrossing” allows the integration of a new gene into commercial cultivars more rapidly. However, it is important to note that new cultivars with the gene must be individually assessed for trait performance and for agronomic characteristics, since variations can sometimes be seen for the same gene in different genetic backgrounds.
• Regulatory data generation: Numerous studies are conducted to evaluate the genes/proteins that are introduced into the plant, as well as to evaluate the whole plant and the foods/feed produced. Studies are based on scientific principles established by international organizations and requirements set by individual regulatory agencies, are conducted by scientists as well as independent third parties, and are reviewed by regulatory agencies around the world. The multiple years of field testing data for the crop trait, characterization of the inserted DNA and its encoded product(s) (e.g., CP4 EPSPS or Bt protein), establishment of substantial equivalence to the traditional crop counterpart, and ecological assessment for weediness potential or nontarget insect toxicity are then reviewed by government agencies in the USA and by the appropriate agencies in other countries, making these crop products the most thoroughly tested on the market. Products of agricultural biotechnology must meet the same legal standards of safety as all food under the US Federal Food, Drug, and Cosmetic Act.
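Returning briefly to the trait-integration step, a small calculation helps show why backcrossing takes several generations and why markers can shorten it. The sketch below uses the standard expectation from breeding theory that, without any selection, each backcross halves the remaining donor genome, so the expected recurrent-parent fraction after n backcrosses is 1 - (1/2)^(n+1). The function names are hypothetical, and real programs deviate from this average because of linkage around the transgene and deliberate marker-based selection.

```python
from fractions import Fraction


def recurrent_parent_fraction(n_backcrosses: int) -> Fraction:
    """Expected share of the recurrent (commercial) parent genome.

    n_backcrosses = 0 corresponds to the F1 (one half). This is an
    average with no selection; it ignores linkage drag near the gene.
    """
    if n_backcrosses < 0:
        raise ValueError("n_backcrosses must be non-negative")
    return 1 - Fraction(1, 2) ** (n_backcrosses + 1)


def backcrosses_needed(target: Fraction) -> int:
    """Fewest backcrosses whose expected recovery reaches the target."""
    if not 0 <= target < 1:
        raise ValueError("target must be in the range [0, 1)")
    n = 0
    while recurrent_parent_fraction(n) < target:
        n += 1
    return n
```

On expectation alone, six backcrosses are needed to exceed 99% recurrent-parent genome. Selecting, with markers, the progeny that happen to carry more recurrent-parent genome than average is the general reason marker-assisted backcrossing can reach a comparable background in fewer generations.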
A robust safety testing strategy has been developed through international consensus [FAO/WHO, Codex, Organization for Economic Cooperation and Development (OECD)], which has been adopted by countries globally.

3.1.5. Phase IV: Regulatory Submission

• The regulatory submission packages are assembled and submitted to the appropriate agencies. In the USA, these are the USDA, the Food and Drug Administration (FDA), and the Environmental Protection Agency (EPA) (for pesticidal traits such as insect protection). The regulatory process is comprehensive and transparent. All transgenic plants destined for commercialization are evaluated by the USDA, which makes a determination of “nonregulated status” or deregulation after determining that the plant is not a pest for US agriculture. Filings made to the USDA in this process are open to public comment prior to deregulation. Examples of previous filings are available at http://www.aphis.usda.gov/brs/not_reg.html and can be useful for preparing filings for new genes. For example, the Roundup Ready soybean documents can be seen at http://www.aphis.usda.gov/brs/aphisdocs2/93_25801p_com.pdf. The three US regulatory agencies that have oversight for plant biotechnology products are shown in Fig. 3. The USDA has jurisdiction over all biotechnology-enhanced traits, and is responsible for:
– Regulating the shipment of seed and harvested grain,
– Authorizing permits and notifications for field trials, and
– Determining the nonregulated status of biotechnology-enhanced crop products.
• The EPA regulates plants that have new pesticidal properties such as insect resistance conferred by Bt genes. These are known as plant-incorporated protectants (PIPs). The EPA also regulates the application of agricultural chemicals when these will be used with transgenic crops. In the case of Roundup Ready soybeans, this means that the EPA did not directly regulate the EPSP synthase gene in these plants, but it did regulate the use of Roundup agricultural herbicide on the plants. Details on EPA regulatory filings for transgenic plants can be found at http://www.epa.gov/fedrgstr/index.html. The EPA has jurisdiction over PIPs and is responsible for:
– Authorizing field tests and experimental use permits for field testing plant pesticides or unregistered uses of chemical pesticides over biotech crops,
– Establishing tolerances of the PIP and registering the PIP as an “active ingredient” in the plant, and
– Establishing registration conditions for the PIPs (e.g., refuge requirements for Bt crop products).
The EPA does not have jurisdiction over the regulation of herbicide-tolerant crop products.
• The FDA determines the safety of transgenic crops as human food or animal feed through a voluntary consultation process with the developers of the new crops. Details on this consultation process can be found on the list of completed consultations on bioengineered foods at http://www.cfsan.fda.gov/~lrd/biocon.html. The FDA has jurisdiction over all biotechnology-enhanced traits, and is responsible for assessing food and feed safety through the consultative process.
• Requests for regulatory authorizations are also submitted to other countries with functioning regulatory systems to facilitate the export of biotech crops to countries that import biotechnology-enhanced crop products, or for the direct production of the transgenic crops in those countries, if that is part of the commercialization plan.

Fig. 3. Regulatory agencies with oversight on plant biotechnology products.

Experience to date supports the conclusion that the regulatory process for plant biotechnology products has been successful and has resulted in the marketing of products that are at least as safe as conventionally bred equivalents.

3.2. Ex-US Regulatory Schemes

Regulation of biotechnology-enhanced crops is conducted in many countries around the world. While there is a substantial degree of harmonization among major countries, the regulation of biotechnology-enhanced crop products varies:
• Some countries have evolving independent regulatory systems (e.g., Korea, Argentina, and Brazil – these are also production countries);
• Others are dependent on establishment of food/feed safety by the US FDA prior to submission to their countries’ agricultural or health ministers (e.g., Philippines, China, and Mexico); and
• Others are developing regulatory systems (e.g., Malaysia, Indonesia, and Thailand), some as a requirement of being a signatory to the Cartagena Biosafety Protocol.

In all countries, the regulatory submissions address questions pertaining to the food and/or feed safety of the biotechnology-enhanced crop product, whether the crop will be imported or produced in that country. Likewise, if viable plant material, sometimes referred to as living modified organisms, is either imported or produced in a country, an environmental assessment is typically required to establish the absence/presence of weediness potential in the country of interest.

3.3. Safety Assessment

Across all countries, regulatory agencies conducting safety assessments address two basic questions: “Is the food/feed safe for humans and animals to consume?” (safety and nutritional equivalence), and “Are the plants safe for the environment?” (no adverse effects on agriculture or negative ecological impacts).

3.3.1. Food and Feed Safety Assessment

The overall approach to the food and feed safety assessment for biotechnology-enhanced herbicide-tolerant and PIP crop products is twofold: characterize the introduced genetic material in the plant and the resultant encoded protein(s), and establish the substantial equivalence of the biotechnology-enhanced crop product to its traditional counterpart. The overall food/feed safety assessment approach for biotech products is shown in Fig. 4.

Fig. 4. The assessment of food/feed safety for plant biotechnology products.

• Characterization of the genetic material inserted into the crop plant genome includes assessing the established safety of the gene source, determining the number of inserts and the number of copies of the inserted gene, and confirming the intactness (integrity) of the gene of interest and its regulatory elements. The characterization of the consumed portion of the biotechnology-enhanced crop is an important aspect of the safety assessment of crops intended for food and feed use. In Roundup Ready soybeans, the composition of seeds and selected processing fractions from two glyphosate-tolerant soybean lines was compared with that of the parental soybean cultivar. Nutrients measured in the soybean seeds included macronutrients (protein, fat, fiber, ash, and carbohydrates), amino acids, and fatty acids. Additionally, antinutrients were measured in either the seed or the toasted meal, and proximate analyses were performed. The analytical results demonstrated that the glyphosate-tolerant soybean lines were equivalent to the parental, conventional soybean cultivar (2).
• Encoded proteins from the insert are assessed for history of consumption (e.g., CP4 EPSPS belongs to a protein class with a long history of safe consumption, i.e., EPSPS is present in all plants), functionality and specificity, and levels within the crop, and are subjected to a computer-based bioinformatics assessment of the protein’s amino acid sequence to determine if there is significant homology to known toxins or allergens.
• Substantial equivalence is established by comparing the crop characteristics, food/feed compositional constituents, and feed performance of the biotechnology-enhanced crop product to its conventional counterpart. The food/feed compositional constituents assessed vary by crop and are usually established by the OECD compositional consensus documents or country-specific requirements.

3.3.2. Environmental Assessment

• The environmental assessment for herbicide-tolerant (Roundup Ready trait) and PIP (Bt trait) crop products is twofold: assessing the ecological impact of the trait and determining the ecological impact of the plant. The environmental safety approach for biotechnology crops is shown in Fig. 5.
• Assessing the ecological impact for all introduced traits involves determining the outcrossing potential to related plant species and assessing whether there is potential for a resulting hybrid plant to be more competitive than either parent.
• For plant-incorporated pesticide traits, the ecological assessment of the introduced trait also involves determining the potential for nontarget insect toxicity and the potential for the target pest insect to develop resistance.
• Ecological assessments for all biotechnology-enhanced crop products include establishing the ecological impact of the modified plant compared to its traditional counterpart (control). As part of this assessment, comparisons are made to the conventional crop product for phenotypic/agronomic similarity, weediness potential in undisturbed (wild) ecosystems and agricultural production, and a trait expression profile (e.g., Bt trait’s target insect susceptibility and production levels of the CP4 EPSPS protein in Roundup Ready crops).
• The environmental aspects of Roundup Ready soybeans were assessed for:
– Weediness potential,
– Effects of glyphosate-tolerant soybeans on nontarget organisms,
– Potential for outcrossing, and
– Likelihood of the appearance of glyphosate-resistant weeds and volunteer soybeans (2).

Fig. 5. The assessment of environmental safety for plant biotechnology products.

3.4. Seed Bulk-Up

Once the new trait has been introduced into commercial cultivars and required regulatory authorizations obtained, the final step before commercialization is the production of large amounts of seed for commercial sale or distribution.

3.5. Commercialization

In this phase, the gene product is sold as a trait in public or private seed lines, which can be branded. An example would be the Roundup Ready trait in Asgrow and Dekalb brand soybean seeds. Product support is an ongoing process throughout the commercial life of the product and an important part of product stewardship. In addition, some regulatory clearances are for a limited time and must be periodically renewed. Important points of product stewardship are as follows:
1. Product stewardship is the legal, ethical, and moral obligation to ensure that products and technologies are safe and environmentally responsible. Product stewardship is a component of product life cycle stewardship, which also includes management of market impacts associated with product introduction, stewardship of products in the marketplace, and effective discontinuation of outdated technology. All parties putting biotechnology traits on the market should be committed to product life cycle stewardship.
2. A tool used by Monsanto to verify proper product stewardship is a checklist that includes all key stewardship questions. This checklist must be completed and approved before a product can be commercialized in the given geography. Areas of focus include gene source, vector design, early allergy and toxicology screen of proteins, event identity/purity, insert characterization, field trial compliance, environmental risk assessment, food and feed safety, regulatory submissions, seed identity and purity, trait and germplasm performance, regulatory approvals, conditions of registration, and risk management plans, including stakeholder dialogue and market impact.

3.6. Stakeholder Dialogue

Stakeholder dialogue plays a critical role in bringing new technology to the marketplace. It is a continuous process that should begin when products are still in the pipeline, to build the platform for that particular kind of application of biotechnology, and continue through the regulatory and approval phases, commercialization, and postcommercialization. Stakeholders need to be identified early on. In the case of plant biotechnology, typical stakeholders are representatives of the members of the associated food and feed value chain, as shown in Fig. 6. The stakeholder discussion process allows for capacity building in the particular area of biotechnology. Full dialogue should take place regarding the benefits of the products, how they are regulated, including the studies for food, feed, and environmental safety, how commercialization might affect the marketplace, and the appropriate product stewardship. This process also allows stakeholder groups to form policy positions, enter into the public discussion during regulatory comment periods, and develop their own educational programs and publications about the subject at hand.

Fig. 6. Typical industry stakeholders network associated with the food and feed value chain.

It is also very helpful for technology developers to publish scientific data in applied articles about the new technology, to make it easily accessible, and to accept speaking engagements in public fora. In the case of Roundup Ready soybeans, Monsanto published numerous scientific or refereed articles as well as more applied articles in related trade journals, including Feedstuffs (5–7). Additionally, the dialogue with stakeholders about Roundup Ready soybeans began long before their commercialization and will continue until after Monsanto discontinues the product. The experience with the Roundup Ready trait, now included in several crops, substantiates the thoroughness of the regulatory process and demonstrates transparency and outreach to stakeholders. Over a period of 15 years across all crop products, Monsanto has generated (8):
• over 1,000 Study Reports (10 years),
• tens of thousands of field tests (14 years),
• hundreds of thousands of compositional equivalence analyses, and
• reviews by 27 regulatory agencies in 13 countries.

3.7. Marketplace Acceptance

Food and feed products containing ingredients derived from plant biotechnology crops have a solid 10-year plus history of safe use and have been studied by top scientists and scientific organizations around the world for more than 20 years. Several organizations have issued statements or official reports documenting the safety and benefits of plant biotech crops:
• World Health Organization
• Food and Agriculture Organization (FAO) of the United Nations
• Codex Alimentarius
• National Academy of Sciences (USA)
• Royal Society (UK)
• American Medical Association (USA)
• French Academy of Medicine
• European Commission
• US Food and Drug Administration
• Society of Toxicology
• Institute of Food Technologists

As the FAO has stated, there is no reliable documentation of any food safety issues resulting from the introduction of genes, proteins, or traits through the use of plant biotechnology (9). As a matter of fact, this has been validated in a European Commission report that summarizes 81 biotech research projects and concludes “… the use of more precise technology and greater regulatory scrutiny probably make them even safer than conventional plants and foods (10).”

Consumers around the world are eating foods that contain ingredients derived from biotechnology-enhanced crops and the meat, milk, and eggs from animals that consumed feed derived from biotechnology-enhanced crops. Several billion meals containing ingredients derived from biotechnology-enhanced crops have been served around the world since 1996. Ingredients from major commodities such as soybean and corn are found in foods in just about every supermarket aisle, in product categories from soups and frozen dinners to confections, baked goods, and beverages. Market signals will always affect supply and demand for food and food ingredients globally. The marketplace has continued to provide customers with the choice of ingredients derived from nonbiotech crops since the first introduction of crop biotechnology.

In conclusion, as science advances, the methodology developed with the first wave of commercialized biotech crops will continue to apply: gene discovery, product development, regulatory clearance, stewardship evaluation, and stakeholder dialogue.

References

1. ISAAA Brief No. 35-2006: Executive summary. Global Status of Commercialized Biotech/GM Crops: 2006.
2. Padgette, S., Taylor, N., Nida, D., Bailey, M., MacDonald, J., Holden, L., and Fuchs, R. The composition of glyphosate-tolerant soybean seeds is equivalent to that of conventional soybeans. American Institute of Nutrition. 0022-3166/96, pp. 702–716.
3. Padgette, S., Re, D., Barry, G., Eichholtz, D., Delannay, X., Fuchs, R., Kishore, G., and Fraley, R. (1993) New weed control opportunities. Development of soybeans with Roundup Ready™ gene, in Herbicide-Resistant Crops: Agricultural, Environmental, Economic, Regulatory, and Technical Aspects. (Duke, S., ed.), CRC Press, Boca Raton, FL, pp. 53–84.
4. Burks, A., and Fuchs, R. (1995) Assessment of the endogenous allergens in glyphosate-tolerant and commercial soybean varieties. J. Allergy Clin. Immunol. 96, 1008–1010.
5. Cline, Molly N., Re, Diane B., and Hartnell, Gary F. Today’s message is equivalence, tomorrow’s may be nutrition. Feedstuffs. May 20, 1996.
6. Re, Diane B., Cline, Molly N., and Hartnell, Gary F. Glyphosate-tolerant soybeans found safe for use in feed. Feedstuffs. Oct. 28, 1996.
7. Cline, Molly N., and Re, Diane B. Plant biotechnology: A progress report and look ahead. Feedstuffs. Aug. 11, 1997.
8. Monsanto Compendium Reference.
9. FAO, 2001.
10. GMO research in perspective: Report of a workshop held by External Advisory Groups Quality of Life and Management of Living Resources. EU Fifth Framework Programme. Brussels. Sept. 9–10, 1999.

Chapter 17

Ecological Genomics of Natural Plant Populations: The Israeli Perspective

Eviatar Nevo

Summary

The genomic era revolutionized evolutionary population biology. The ecological genomics of the wild progenitors of wheat and barley reviewed here was central in the research program of the Institute of Evolution, University of Haifa, since 1975 (http://evolution.haifa.ac.il). We explored the following questions: (1) How much of the genomic and phenomic diversity of wild progenitors of cultivars (wild emmer wheat, Triticum dicoccoides, the progenitor of most wheat, plus wild relatives of the Aegilops species; wild barley, Hordeum spontaneum, the progenitor of cultivated barley; wild oat, Avena sterilis, the progenitor of cultivated oats; and wild lettuce species, Lactuca, the progenitor and relatives of cultivated lettuce) are adaptive and processed by natural selection at both coding and noncoding genomic regions? (2) What is the origin and evolution of genomic adaptation and speciation processes and their regulation by mutation, recombination, and transposons under spatiotemporal variables and stressful macrogeographic and microgeographic environments? (3) How much genetic resources are harbored in the wild progenitors for crop improvement? We advanced ecological genetics into ecological genomics and analyzed (regionally across Israel and the entire Near East Fertile Crescent and locally at microsites, focusing on the “Evolution Canyon” model) hundreds of populations and thousands of genotypes for protein (allozyme) and deoxyribonucleic acid (DNA) (coding and noncoding) diversity, partly combined with phenotypic diversity. The environmental stresses analyzed included abiotic (climatic and microclimatic, edaphic) and biotic (pathogens, demographic) stresses. Recently, we introduced genetic maps, cloning, and transformation of candidate genes. Our results indicate abundant genotypic and phenotypic diversity in natural plant populations.
The organization and evolution of molecular and organismal diversity in plant populations, at all genomic regions and geographical scales, are nonrandom and are positively correlated with, and partly predictable by, abiotic and biotic environmental heterogeneity and stress. Biodiversity evolution, even in small isolated populations, is primarily driven by natural selection including diversifying, balancing, cyclical, and purifying selection regimes interacting with, but, ultimately, overriding the effects of mutation, migration, and stochasticity. The progenitors of cultivated plants harbor rich genetic resources and are the best hope for crop improvement by both classical and modern biotechnological methods. Future studies should focus on the interplay between structural and functional genome organization focusing on gene regulation.

Key words: Genomics, Wild cereals, Speciation, Genetic mapping, Genetic resources, Domestication, Population genetic structure.

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513 © Humana Press, a part of Springer Science + Business Media, LLC 2009 DOI: 10.1007/978-1-59745-427-8_17

1. Introduction

The emerging field of evolutionary ecological genomics (1–8) explores, across disciplines, the structure, function, and regulation of genomic-environmental interaction. How do organisms’ genomes respond to diverse abiotic and biotic environmental stresses? Is the complex genomic response primarily adaptive to environmental challenges (9–11)? How do species arise and vary geographically in response to environmental heterogeneity and stress, and how many and what kinds of genes and regulatory mechanisms are involved? Genomic organization and diversity are the basis of evolutionary change by natural selection in natural populations. They underlie the evolution of the biodiversity of genes, genomes, populations, species, ecosystems, and biomes, and the evolution of genome–phenome diversity under environmental stress (3–6).

The genomic era revolutionized evolutionary biology. Since 1975, the enigma of genotypic–phenotypic diversity and biodiversity evolution of wild cereals (wild wheat, barley, and oats) and, later, wild lettuce has been central to the research program of the Institute of Evolution, University of Haifa. We explored the following questions: (1) How much of the genomic and phenomic diversity and organization in nature is adaptive and processed by natural selection? (2) What are the origin and evolution of adaptation and speciation processes under spatiotemporal variables and stressful macrogeographic and microgeographic environments? (3) How much of the genetic diversity of wild progenitors in nature could be harnessed to improve their genetically eroded derivative cultivars and other crops?

We advanced ecological genetics into ecological genomics and analyzed, globally, regionally, and locally, hundreds of populations and thousands of individuals of the progenitors of cultivated wheat, barley, oats, and lettuce for diversity in allozymes and in coding and noncoding deoxyribonucleic acid (DNA) regions (http://evolution.haifa.ac.il). We tested abiotic (climatic, chemical, edaphic) and biotic (pathogens) stresses and their environmental association with genomic diversity and explored how much of this diversity is predictable by environmental stresses. Recently, we introduced genetic maps and quantitative trait loci (QTL) (12–14), as well as single-nucleotide polymorphisms (SNPs) of important genes (15), to elucidate the genetic basis of adaptation, and explored speciation in the wheat relative Aegilops (16) and in wild barley (17).

Our results, reviewed below, indicate abundant genotypic and phenotypic diversity in natural populations of the progenitors of cultivated plants. The organization and evolution of genomic diversity in natural populations at global, regional, and local scales are nonrandom and heavily structured, display ecogeographical regularities, and are positively correlated with, and partly predictable by, abiotic and biotic environmental heterogeneity and stress. Biodiversity evolution, even in small isolated populations (18), is primarily driven by natural selection, including diversifying, balancing, cyclical, and purifying selection regimes that interact with, but ultimately override, the effects of mutation, migration, and stochasticity (4–6). Quantum speciation in Aegilops (16) and incipient sympatric speciation in wild barley (17) display evolution in action.

2. Wheat and Barley as Model Organisms in Evolution and Domestication

Wheat and barley are important model organisms for testing various aspects of evolutionary theory (speciation and adaptation) and a major source of human and animal nutrition. Wheat speciation involves a polyploid series (2x, 4x, and 6x). Most hexaploid bread wheat derives from wild emmer, Triticum dicoccoides (genome AABB, 2n = 28), and Aegilops species, especially of the Sitopsis section (19). Wild barley is a diploid (2n = 14) and the progenitor of cultivated barley. In 1975, the Institute of Evolution at the University of Haifa initiated a long-term multidisciplinary research program (Fig. 1) to study the genomics of wild cereals and lettuce (http://evolution.haifa.ac.il) in the center of origin and diversity of Old World agriculture, the Near East Fertile Crescent (20). The program includes evolutionary ecological genomics coupled with the exploration of genetic resources for wheat and barley improvement, genetic mapping, cloning, and transformation (Fig. 1). Both aspects, the theoretical and the applied, have proved to be of great importance for studying evolutionary theory and for cereal crop improvement (21–24).

Fig. 1. Multidisciplinary long-term research program of wild cereals at the Institute of Evolution, University of Haifa, Israel.

3. Ecological Genomic Diversity of Wild Barley and Wild Emmer for Barley and Wheat Improvement: Regional and Local Perspectives

What are the population-genetic and ecological-genomic structures of wild cereals? What role can wild relatives play in crop improvement? These questions are of great importance in view of the dramatic reduction in genetic diversity, and the consequent increased vulnerability to abiotic and biotic stresses, of some of the prime food crops for humans (23). The present ecological-genomic review of wild plant populations displays richness in adaptive diversity and supports the idea that the highest hope for future crop improvement lies in rationally and effectively exploring and exploiting the rich gene pool of the plants’ wild relatives (25). The molecular diversity and divergence of wild emmer wheat, regionally in the Near East Fertile Crescent and locally in four natural microsite populations at Qazrin, Ammiad, Tabigha, and Yehudiyya, and of wild barley in three natural microsite populations at Tabigha, Newe Yaar, and “Evolution Canyon” in northern Israel, display parallel ecological-genomic patterning (26). The regional and local results demonstrated significant adaptive spatial and temporal molecular divergence at DNA and protein levels in wild cereal populations and subpopulations. For example, the high-molecular-weight (HMW) glutenin subunits of wild emmer wheat across Israel are highly polymorphic, and this polymorphism could contribute to bread-baking quality (Fig. 2A, B).

Fig. 2. A Sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS-PAGE) of a range of wild emmers showing the extent of allelic variation in high-molecular-weight (HMW) glutenin subunits. The proteins thought to be coded by the Glu-A1 and Glu-B1 loci are prefixed with letters A and B. Grains were taken from populations Qazrin (slot 1), Yehudiyya (slots 2, 7, 8, and 10), Tabigha (slot 12), Bet-Meir (slot 14), Rosh Pinna (slot 11), Bat Shelomo (slot 9), Mt. Hermon (slot 15), Sanhedriyya (slots 3–6 and 13), and Taiyiba (slot 16). B Pie diagrams displaying the percentage for the 11 alleles of Glu-A1 and 15 alleles of Glu-B1 in 11 populations of wild emmer wheat, Triticum dicoccoides, and their geographic location in Israel. Population numbers as in Nevo and Payne (26). A Glu-A1; B Glu-B1; C key to diagrams. Alleles are numbered as in Nevo and Payne (26).

Specifically, the genetic polymorphism at both protein and DNA levels revealed the following patterns (4–6, 23):

1. Significant genomic diversity and divergence exist at single-, double-, and multilocus structures of allozymes (Fig. 2A, B), at DNA coding and noncoding genome regions [random amplified polymorphic DNAs (RAPDs), amplified fragment length polymorphisms (AFLPs), ribosomal DNAs (rDNAs), and simple sequence repeats (SSRs)] (Figs. 3 and 4), and in sequence polymorphism (15). Genome diversity is widespread regionally within and between populations. However, and most importantly, diversity is also abundant locally, over very short distances of several to a few dozen or hundreds of meters in the six microsites (24, 27–34). Figure 3 displays the allele distribution of microsatellite diversity in wild emmer along a 100-m transect subdivided into basalt and terra rossa soils at the Tabigha microsite (27); Fig. 4 displays RAPD diversity at the sun/shade microsite of Yehudiyya (28).

2. We have shown (15) that DNA sequence polymorphism in wild barley is adaptive, as are its molecular markers (allozymes, RAPDs, AFLPs, and SSRs).
Wild barley, Hordeum spontaneum, represents a significant genetic resource for crop improvement in barley, Hordeum vulgare, and for the study of the evolution and domestication of plant populations. The Isa gene from barley has a putative role in plant defense. This gene encodes a bifunctional α-amylase/subtilisin inhibitor (BASI) that inhibits the bacterial serine protease subtilisin, fungal xylanase, and the plant’s own α-amylase. The inhibition of plant α-amylases suggests that this protein may also be important for grain quality from a human perspective. We identified 16 SNPs in the coding region of the Isa locus of 178 wild barley accessions from eight climatically divergent regional and local sites across Israel (15). The pattern of SNPs suggested a large number of recombination events within this gene, indicating that the low outcrossing rate of wild barley is not a barrier to recombinant haplotypes becoming established in the population. Seven amino acid substitutions were present in the coding region. Genetic diversity for each population was calculated using Nei’s diversity index, and the Spearman rank correlation was used to test the association between gene diversity and 16 ecogeographical factors. Highly significant correlations were found between diversity at the Isa locus and key water variables (evaporation, rainfall, and humidity) as well as latitude. The pattern of association suggests selective sweeps in the wetter climates, with resulting reduced diversity, and diversifying selection in the drier climates, resulting in much higher diversity.

3. The rich genetic patterns across coding (allozymes) and largely noncoding (RAPDs, AFLPs, and SSRs) (35) genomic regions are correlated with, and predictable by, environmental stress (climatic, edaphic, and biotic) and heterogeneity, supporting the niche-width variation hypothesis (36), and display significant niche-specific and niche-unique alleles, genotypes, and regulations (24, 27, 30–34, 37).

4. The genomic organization of wild cereals is nonrandom, heavily structured, and at least partly, if not largely, adaptive. It defies explanation by genetic drift, neutrality, or near-neutrality models as the primary driving forces of wild cereal molecular evolution. The only viable model to explain the genomic organization of wild cereals is natural selection, primarily diversifying, balancing, and cyclical selection over space and time according to the double- or multiple-niche ecological models (24). Spatial models are complemented by temporal models of genetic diversity and change (38–40). Natural selection may interact with mutation, migration, and stochastic factors, but it overrides them in orienting wild cereal evolutionary processes. Based on mathematical modeling, we established that stabilizing selection with a cyclically moving optimum could efficiently protect polymorphism for linked loci that additively affect the selected trait (39, 41). In particular, unequal gene action and/or dominance effects may lead to local polymorphism stability with a substantial polymorphism-attracting domain. Moreover, under strong cyclical selection, complex dynamic patterns were revealed, including “supercycles” (with periods comprising hundreds of environmental oscillation periods) and “deterministic chaos” (38–41). These patterns could sustain polymorphism in natural populations of wild cereals and increase genetic diversity over long periods, thereby helping natural populations to overcome massive extinctions (42).

Fig. 3. Allele distribution at the GWM095, GWM120, GWM162, GWM169, and GWM218 microsatellite loci on spatially close parts of the terra rossa and basalt soils for T. dicoccoides. Allele sizes are given in base pairs, including tandem-repeated and flanking regions. The first lane on the left is of Chinese Spring (27).

Fig. 4. Histogram of frequencies of canonical scores for wild emmer wheat at the shady and sunny niches in Yehudiyya according to 25 polymorphic random amplified polymorphic DNA (RAPD) loci (microniches several meters apart) (28).
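The diversity-by-environment analysis described in item 2 above (Nei’s gene diversity per population, followed by a Spearman rank correlation against an ecogeographical variable) can be illustrated with a minimal sketch. The haplotype calls, site names, and rainfall values below are hypothetical and are not data from the Isa study (15); the sketch uses the uncorrected form H = 1 − Σp², and omits any significance test.

```python
from collections import Counter


def nei_gene_diversity(alleles):
    """Nei's gene diversity, H = 1 - sum(p_i^2), for one locus in one population."""
    n = len(alleles)
    counts = Counter(alleles)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical haplotype calls at one locus, ten accessions per site.
populations = {
    "wet_site": ["h1"] * 9 + ["h2"],
    "mesic_site": ["h1"] * 6 + ["h2"] * 3 + ["h3"],
    "dry_site": ["h1"] * 4 + ["h2"] * 3 + ["h3"] * 2 + ["h4"],
    "desert_site": ["h1", "h1", "h2", "h2", "h3", "h3", "h4", "h4", "h5", "h6"],
}
# Hypothetical mean annual rainfall (mm) for the same sites.
rainfall_mm = {"wet_site": 900, "mesic_site": 600, "dry_site": 350, "desert_site": 100}

sites = list(populations)
diversity = [nei_gene_diversity(populations[s]) for s in sites]
rho = spearman_rho(diversity, [rainfall_mm[s] for s in sites])

for site, h in zip(sites, diversity):
    print(f"{site}: H = {h:.2f}")
print(f"Spearman rho (diversity vs. rainfall) = {rho:.2f}")
```

In this toy data set diversity rises monotonically as rainfall falls, so the rank correlation is strongly negative, which mirrors the direction of the pattern reported for the drier sites. A real analysis would repeat the correlation for each of the ecogeographical factors and assess significance.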

4. Unique Population-Genetic Structures and Center of Origin of Wild Emmer Wheat and Wild Barley

Primarily wild emmer wheat and, secondarily, wild barley have unique ecological-genetic structures (24). The central populations of wild emmer wheat, in the catchment area of the upper Jordan Valley, and of wild barley, in the Golan Heights, eastern Galilee, and the Jordanian Mountains, are massive and lush and represent their center of origin and diversity. However, southwards in Israel and northwards into Turkey, wild emmer becomes fragmented into sporadic semi-isolated and isolated populations that are characterized by an archipelago genetic structure in which alleles build up locally to high frequency but are often missing in neighboring localities. This phenomenon may even occur in the central continuous populations, in which alternative fixation of up to eight alleles was described over hundreds of meters in the Golan Heights between Qazrin and Yehudiyya (24) (see Subheading 5). Dramatic local genetic divergence occurs in both wild emmer and wild barley at local sites in the Galilee (24, 43).

5. Centers of Origin and Diversity

The center of origin and diversity of wild emmer and wild barley, the progenitors of most wheats and barleys, and of other progenitors of cultivated plants, is the Near East Fertile Crescent (20, 44–48). In the Near East, and particularly in Israel with its extraordinary biotic and physical diversity, wild emmer (24, 49) and wild barley (21, 50) developed, both within and between populations, a wide range of rich adaptive diversity to multiple diseases, pests, and ecological stresses over a long evolutionary history. Most importantly, this diversity is neither random nor neutral. On the contrary, it displays, at all levels, adaptive genetic diversity for biochemical, morphological, and immunological characteristics, which contribute to the species’ ability to adapt to widely diverse climatic and edaphic conditions by multiform complex fitness syndromes. The long-lasting coevolution of wild emmer with parasites and the ecologically heterogeneous abiotic nature of Israel and the Near East Fertile Crescent led to the development of single multiallelic genes, multilocus structures, and abiotic/biotic stress genomes that are locally and regionally coadapted and regulated for both short- and long-term survival. Good examples are wild barley and emmer wheat in Israel (21).

6. Genetic Resources for Cereal Improvement

Wild barley and wild emmer wheat are rich in adaptive genomic diversity and genetic resources and represent the best hope for enriching the genetically impoverished cultivars and advancing cereal improvement. These resources include abiotic (e.g., drought, cold, heat, salt, and metal concentrations) tolerances, biotic (viral, bacterial, and fungal) and herbicide resistances, high quantity and quality of storage proteins (hordeins, glutenins, and gliadins), amylase, photosynthetic yield, and drought resistance (21–24, 51–57). A small fraction of these resources has already been used for generating disease-resistant cultivars in the USA and Europe. Most of these genetic resources are as yet untapped and provide potentially precious sources for barley and wheat improvement (22, 23, 48, 58–61). The current rich genetic map of T. dicoccoides, with 549 molecular markers and 48 significant QTLs for 11 traits of agronomic importance (12), the QTL map of wild barley (14), and the association between molecular markers and disease resistance (21, 62) permit the unraveling of beneficial alleles of candidate genes that are otherwise hidden. These beneficial alleles could be introduced into cultivated barley and wheat (while simultaneously eliminating agronomically undesirable alleles) by marker-assisted selection. We have shown differential expression of dehydrin (Dhn) in response to water stress in resistant and sensitive wild barley (63). Likewise, we transferred the Dhn1 gene from wild barley into Arabidopsis, raising its drought tolerance (64).

The genetic program of wild cereals conducted at the Institute of Evolution, University of Haifa, and elsewhere (25, 65, 66) confirmed that H. spontaneum and T. dicoccoides are very valuable wild germplasm resources for barley and wheat improvement. This program highlighted the evolution of wheat domestication (12) and expedited the genomic analysis of wild cereal relatives. It can thus provide a solid basis for introgression or cloning and transformation of agriculturally important genes (67) and QTLs (12) from the wild to the cultivated cereals and for advancing cereal improvement. This is particularly important in a world whose population is exploding, where hunger is prevalent, desertification and salinization are dramatically increasing, and water and fertile land resources are limited because of constant pollution and environmental degradation.

7. Genomic Microsatellites Adaptive Divergence of Wild Barley by Microclimatic Stress in “Evolution Canyon”

7.1. The Evolutionary Model of “EC”: Life’s Microcosm

Local, microcosmic natural laboratories, designated by us as the “Evolution Canyon” (“EC”) model (Fig. 5a, b), reinforce studies of regional and global macrocosmic ecological theaters across life (3, 18, 69). They present sharp ecological contrasts at a microscale, permitting the pursuit of observations and experiments across diverse prokaryote and eukaryote taxa that share a sharp microecological subdivision. Likewise, they generate theoretical, testable, and predictable models of biodiversity and genome evolution and permit the examination of the mode and tempo of adaptation and speciation. The south-facing slopes (SFS, or “African” slopes) in canyons north of the equator receive higher solar radiation than the nearby north-facing slopes (NFS, or “European” slopes). This solar radiation is associated with higher temperature and drought on the more stressful “African” SFS, causing dramatic physical and biotic interslope divergence. These canyons are extraordinary natural evolutionary laboratories. If rocks, soils, and topography are similar on the opposite slopes (50–100 m apart at the bottom), microclimate remains the major interslope-divergent factor. Even strongly sedentary organisms can migrate between the slopes. The interslope divergence of biodiversity (i.e., genes, sequences, genomes, populations, species, ecosystems, and biota) can be examined within each species distributed on the physically and biotically contrasting slopes. This intraspecific interslope divergence can be compared in many species across life, from prokaryotic bacteria through eukaryotic lower and higher plants, fungi, and animals, highlighting both interslope adaptive radiation and incipient sympatric speciation (3, 17, 70).

Fig. 5. a “Evolution Canyon,” lower Nahal Oren, Mount Carmel, Israel. Note the plant formations on the opposite slopes. The green, lush, live-oak, “Euroasian,” temperate, cool-mesic, north-facing slope (“European” NFS) sharply contrasts with the open park forest of warm-xeric, tropical, “Afroasian” savannah on the south-facing slope (“African” SFS); (a) cross-section view with the “European” NFS to the right, (b) aerial view with the seven assigned stations: three on the “African” SFS (1–3), one at the valley bottom (4), and three on the “European” NFS (5–7). b Examples of amplified fragment length polymorphism (AFLP) fingerprints of (a) wild barley (H. spontaneum) and (b) the fruit fly (Z. tuberculatus) populations at “EC” (68).

These genomic and phenomic multiple-taxa interslope comparisons permit slope-convergent and interslope-divergent generalizations of organism–environment relationships across life and of the relative importance of the evolutionary forces operating in adaptation and speciation. In a structural and functional genomic era, all available complete genomes or those partially sequenced, including stress genes, are comparable by microarray technology (71) on both slopes along with their proteomes and phenomes, that is, at the interrelated molecular and organismal levels. These long-lived natural evolutionary laboratories permit in-depth stress studies of genome evolution in adaptation and speciation in close sympatry and under critical tests of past, present, and future divergence. We conducted several studies on wild barley, H. spontaneum, in “EC” I in Mount Carmel, including AFLPs (68), allozymes (72), and RAPDs (73). These studies indicate higher genetic diversity on the “African” xeric SFS (Fig. 5a, b). A recent study on microsatellites in wild barley (43) revealed a large interslope genetic distance, DA = 0.481, across a distance of 200 m. This led us to test it further, and we found support for incipient sympatric speciation of wild barley on the opposite slopes by comparing interslope and intraslope hybridizations: interslope hybrids were inferior in 12 out of 13 tested traits (17). Most genetic markers (allozymes, DNA) suggest adaptive evolution by natural selection in the canyon; hence, they could be utilized in crop improvement both as markers and as adaptive genetic sources.

7.2. Mapping QTLs for Agronomically Important Traits in Triticum dicoccoides

Most crop traits are quantitatively inherited and are controlled by multiple genes. The advent of molecular markers has made it possible to dissect traits of this type genetically. By means of statistical analysis, the variation of a quantitative trait can be partitioned into the effects of individual genome regions, the QTLs, linked to markers on a molecular-marker map (see Fig. 6A) (74). Apart from localizing the loci that control quantitative traits, QTL mapping can also uncover “cryptic” genetic variation, that is, identify beneficial alleles that are otherwise hidden in a sea of deleterious alleles. This ability to detect cryptic variation can be important in the exploration of wild germplasms (75).

Wild emmer wheat, Triticum dicoccoides, is the progenitor of modern tetraploid and hexaploid cultivated wheat. Our objective was to map domestication-related QTLs in T. dicoccoides. The studied traits include brittle rachis, heading date, plant height, grain size, yield, and yield components. Our mapping population was derived from a cross between T. dicoccoides and Triticum durum. Approximately 70 domestication QTL effects were detected, distributed nonrandomly among and along chromosomes (see example in Fig. 6A, B). Seven domestication syndrome factors were proposed, each affecting 5–11 traits. We showed (1) clustering and strong effects of some QTLs, (2) remarkable genomic association of strong domestication-related QTLs with gene-rich regions, and (3) unexpected predominance of QTL effects in the A genome. The A genome of wheat may have played a more important role than the B genome during domestication evolution. The cryptic beneficial alleles at specific QTLs derived from T. dicoccoides and H. spontaneum may contribute to wheat and cereal improvement (12, 14). Our QTL mapping results may be very helpful for QTL introgression and cloning. We also found that beneficial alleles existed in wild emmer wheat for some of the traits, e.g., HD, SNP, kernel number/spikelet (KNL), and GWH [(24); Tables 8.2 and 8.3], which will be useful for wheat improvement. Therefore, our results can accelerate wheat genetic analysis and be especially helpful for marker-assisted introgression of beneficial QTL alleles from T. dicoccoides to cultivated wheat.

Wild emmer wheat has proven to be a valuable germplasm source for wheat improvement based on phenotypic performance and population-genetic analysis (24). Soller and Beckmann (75) pointed out that QTL mapping can facilitate the recovery of “cryptic” genetic variation, that is, the identification of beneficial alleles that are otherwise hidden in a sea of undesirable alleles. In the present study, one parental line, T. dicoccoides accession Hermon H52 (see the rust-resistance gene YrH52 and its linkage map in Fig. 6A), possessed stripe-rust resistance, short stature, and high tillering capacity, as well as many undesirable features for agronomic traits, such as slow growth (late heading), small grains, fewer spikes, small spikes, low spike weight, and low yield. Among 48 QTLs for the 11 traits, the T. dicoccoides alleles for seven traits (14.6%) were beneficial [(24); Table 8.2] (see Fig. 6B).

Fig. 6. A High-density molecular map of stripe-rust resistance gene YrH52 region (B) and linkage map of YrH52 stripe-rust resistance gene (A) on chromosome 1B. The vertical short bar indicates the approximate location of the centromere according to (24) (Fig. 8.1). B Map locations of domestication syndrome factors (DSFs) and their involved quantitative trait loci (QTLs) in L-version maps of wild emmer wheat, T. dicoccoides. (Upper) Short arms of chromosomes. (Right) DSFs and corresponding QTLs. Trait abbreviations: KNS, kernel number/spike; KNL, kernel number/spikelet; YLD, grain yield/plant; HT, plant height; SLS, spikelet number/spike; SSW, single spike weight; SWP, spike weight/plant; KNP, kernel weight/plant; HD, head flowering date; GWH, grain weight; SNP, spike number/plant. The regular trait name represents a single QTL; the italic trait name represents a single QTL (Q2) detected by linked-QTL analysis; the regular trait name tailed with Q1 means the first QTL and tailed with Q2, the second QTL in a pair of linked QTLs. A trait name tailed with (5) means that the QTL effect is not significant at a false discovery rate (FDR) of 5% but is significant at FDR 10%; (10) means that the effect is not significant at FDR 10% (12).
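The principle stated at the start of this subsection, partitioning the variation of a quantitative trait into effects of marker-linked genome regions, can be illustrated with the simplest form of QTL analysis, single-marker comparison of genotype classes. This is only a sketch under stated assumptions: the marker names, genotype codes, and plant heights are hypothetical, the population is assumed to have just two genotype classes per marker (as in recombinant inbred lines), and the mapping reported in (12) relied on more elaborate methods (linked-QTL analysis with FDR control) that are not reproduced here.

```python
def marker_effect(genotypes, phenotypes):
    """Single-marker analysis with two genotype classes.

    genotypes: list of 0/1 codes (0 = allele from parent A, 1 = allele from parent B).
    Returns (effect, r_squared): the difference in class means (B minus A) and
    the share of phenotypic variance explained by the marker.
    """
    group_a = [p for g, p in zip(genotypes, phenotypes) if g == 0]
    group_b = [p for g, p in zip(genotypes, phenotypes) if g == 1]
    mean_all = sum(phenotypes) / len(phenotypes)
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    ss_total = sum((p - mean_all) ** 2 for p in phenotypes)
    ss_between = (len(group_a) * (mean_a - mean_all) ** 2
                  + len(group_b) * (mean_b - mean_all) ** 2)
    return mean_b - mean_a, ss_between / ss_total


# Hypothetical data: plant height (cm) of eight lines scored at three markers.
height = [100, 102, 98, 101, 80, 82, 79, 81]
markers = {
    "m1": [0, 0, 0, 0, 1, 1, 1, 1],  # genotype classes track the trait closely
    "m2": [0, 0, 0, 1, 0, 1, 1, 1],  # partial association
    "m3": [0, 1, 0, 1, 0, 1, 0, 1],  # essentially no association
}

results = {name: marker_effect(g, height) for name, g in markers.items()}
best = max(results, key=lambda name: results[name][1])

for name, (effect, r2) in results.items():
    print(f"{name}: effect = {effect:+.2f} cm, variance explained = {r2:.3f}")
print("strongest marker:", best)
```

In this toy example the parent-B allele at the strongest marker reduces height, the kind of allele that the text describes as potentially beneficial when it comes from an otherwise agronomically poor wild parent. A genome-wide scan would repeat this test at every marker and then correct for multiple testing.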

8. Speciation

8.1. Quantum Speciation in Aegilops: Molecular Cytogenetic Evidence from rDNA Cluster Variability in Natural Populations for Edaphic Ecological Quantum Speciation

We found in Haifa Bay, Israel, evidence of quantum speciation in the Sitopsis section of the genus Aegilops (Poaceae, Monocotyledones). Two small, peripherally isolated wild populations of the annual cross-pollinated Ae. speltoides and the annual self-pollinated Ae. sharonensis are located 30 m apart on different soil types (Fig. 7a). Despite the close proximity of the two species and their close relatedness, no mixed groups are known. Comparative molecular cytogenetic analysis based on the intrapopulation variability of rRNA-encoding DNA (rDNA) chromosomal patterns of individual Ae. speltoides genotypes revealed an ongoing dynamic process of permanent chromosomal rearrangements (16) (Figs. 6 and 7b). Chromosomal mutations can arise de novo and can be eliminated. Analysis of the progeny of the investigated genotypes shows that inheritance of de novo rDNA sites happens frequently. Heterologous recombination and/or transposable element-mediated rDNA transfer seem to be the mechanisms for the observed chromosomal repatterning. Consequently, several modified genomic forms, intermediate between Ae. speltoides and Ae. sharonensis, permanently arise in the studied wild population of Ae. speltoides. This makes it possible to recognize Ae. sharonensis as a derivative species of Ae. speltoides and to propose rapidity and canalization of quantum speciation in a Sitopsis species (16).

Fig. 7. a Geographical location and photos of the investigated populations. (a) Satellite image of the eastern Mediterranean, (b) field position of the studied populations, (c) Aegilops sharonensis, and (d) Ae. speltoides. The photographs of Ae. sharonensis and Ae. speltoides were taken on the same day. Although Ae. speltoides is blossoming, Ae. sharonensis is already mature (16). b Morphological and karyotypic variability of selected genotypes of Ae. speltoides and Ae. sharonensis. Spikes of different genotypes are presented. Ae. speltoides ssp. aucheri Ts-84 and genotype 1: only terminal spikelet awned. Ae. speltoides ssp. ligustica Ts-24 and genotypes 4-1, 4-2, Ae. sharonensis: terminal spikelet and lemmas of lateral spikelets awned. Presence or absence of keel tooth on the glume is shown, respectively, by red and blue arrows in the enlargement fragments of the spikes (Ae. speltoides ssp. aucheri Ts-84 and genotype 4-1 do not have the tooth on the glume, red arrows). Simultaneous in situ hybridization of 5S (detected red) and 45S rDNA (detected green) on the somatic chromosomes. Only chromosomal rDNA patterns are observed in plants of Ae. speltoides and Ae. sharonensis from the Kishon (16).

8.2. “Evolution Canyon” as a Cradle for Species Formation: The “Israeli Galapagos” Including Wild Barley, H. spontaneum

“Evolution Canyon” (Fig. 5a) is a microsite in which incipient sympatric speciation has been discovered for model organisms across life, from bacteria (76) through the soil fungus Sordaria fimicola (77) and Drosophila flies (70) to spiny mice (3). Parnas (17) showed preliminary evidence that H. spontaneum also undergoes incipient sympatric speciation at “Evolution Canyon,” Lower Nahal Oren, Mt. Carmel (Fig. 8). F1 interslope hybrids showed inferiority in 12 out of 13 vegetative and fertility traits as compared to intraslope crosses. The extremely large interslope genetic distance of wild barley populations (43), the great interslope physiological differences (68, 78), and reciprocal transplant experiments (79) all support the hypothesis that wild barley at “Evolution Canyon” is undergoing incipient sympatric speciation.

Fig. 8. Evidence for incipient sympatric speciation in wild barley, Hordeum spontaneum, at “Evolution Canyon” based on hybridization and physiological and genetic diversity estimates (17).

9. Prospects for Crop Improvement

What will be the next step in research into wild cereals and other progenitors of cultivated plants in the genomic and postgenomic era in an attempt to improve crops? Conceptually, in-depth probing of comparative genome structure and function is the major challenge, in particular the intimate interplay of the coding and noncoding genomes and the focus on genomic regulation. This may be particularly aided by the discovery that microsatellites are preferentially associated with nonrepetitive DNA in plant genomes (35) and that their regulatory functions may be of great importance (36, 80). Such studies will unravel genome evolution and highlight the rich genetic potential for crop improvement residing in the progenitors, including Hordeum, Triticum, and Aegilops species, other Triticeae, the rich repertoire of R-genes in wild lettuce (81–84), and other diverse progenitors of cultivars. One example of QTL mapping as a basis for identifying candidate genes for wheat improvement is shown in Fig. 6B. Specifically, we could divide the prospects for future studies, somewhat arbitrarily, into theoretical and applied perspectives (24).

9.1. Theoretical Perspectives

1. Highlighting the genetic structure, function, regulation, and evolution at macro- and microgeographical scales of natural populations and the corresponding cultivars, bridging multilocus marker structures with fitness-related traits in order to obtain direct estimates of the adaptive fitness differentiation within and between populations (by using transplant experiments, mapping analysis, microarray methodology, and selection-based mapping of fitness components and domestication-contributing genes over the entire life cycle).

2. Exposing the large-scale genome organization of the progenitors and their corresponding cultivars (wheat, soybeans, rice, maize, barley, and others) using bacterial artificial chromosome (BAC) clones for complete (as in Arabidopsis and rice) or partial molecular sequence analysis of expressed sequence tags (ESTs) and open reading frames (ORFs) in the transcribed strand. Molecular cytogenetic methods could complementarily probe the structure and interactions of the nuclear, mitochondrial, and chloroplast genomes. Sequences upstream and downstream of selected ORFs (5′ and 3′ untranslated regions, UTRs, respectively) could be probed for functional regulatory polymorphisms and their function tested by genetic transformation.

3. Analysis of the progenitors’ genetic system, or the “transmission system,” which determines the genetic flexibility of the species in diverse ecological contexts, including:

(a) Breeding system (reaction norm and genetic variation of the outcrossing rate);

(b) Mutation rate in different elements of the genome;

(c) Recombination properties of the genome and their genetic and ecological control;

(d) Genomic distribution of structural genes, primarily abiotic and biotic stress genes, and their regulation, by in-depth analysis of the structure and function of the noncoding genome, including repeated and transposon elements as well as epigenetic regulation (85);

(e) Interface between ecological and genomic spatiotemporal dynamics and adaptive systems;

(f) Contribution of small RNAs to regulation; and

(g) Genome evolution in the polyploidization process.

9.2. Applied Perspective

1. Genetic fine mapping and dissection of the collected unique genetic resources of agricultural importance by molecular markers and sequencing, followed by introgression of the detected genes/alleles into elite cultivars via marker-assisted selection.

2. Molecular cloning of adaptation genes based on integrated genomic strategies and novel methodologies, including genetic and physical mapping, molecular markers, EST mapping, cloning and sequencing, genome-wide microarray expression analysis, and genetic transformation of the defined target genes/alleles.

3. Comparative genetics/genomics of cereal plants aimed at deciphering the common and specific ways of domestication evolution.

References

1. Luikart, G., England, P.R., Tallmon, D., Jordan, S., and Taberlet, P. (2003) The power and promises of population genomics: from genotyping to genome typing. Nat. Rev. Genet. 4, 981–994.
2. Feder, M. and Mitchell-Olds, T. (2003) Evolutionary and ecological functional genomics. Nat. Rev. Genet. 4, 649–654.
3. Nevo, E. (2001a) Evolution of genome-phenome diversity under environmental stress. Proc. Natl. Acad. Sci. USA 98, 6233–6240.
4. Nevo, E. (2004a) Evolution of genome dynamics under ecological stress, in Dynamical Genetics 2004 (Parisi, V., DeFonzo, V., and Alluffi-Pentini, F., eds.), Research Signpost, Kerala, India, pp. 1–27.
5. Nevo, E. (2004b) Genomic diversity in nature and domestication, in Diversity and Evolution of Plants. Genotypic and Phenotypic Variation in Higher Plants (Henry, R., ed.), CABI Publishing CAB International, Wallingford, UK, pp. 287–315.
6. Nevo, E. (2004c) Population genetic structure of wild barley and wheat in the Near East Fertile Crescent: regional and local adaptive patterns, in Cereal Genomics (Gupta, P.K. and Varshney, R.K., eds.), Springer, the Netherlands, pp. 135–163.
7. Shimizu, K. and Purugganan, M. (2005) Evolutionary and ecological genomics of Arabidopsis. Plant Physiol. 138, 578–584.
8. Kohn, M., Murphy, W., Ostrander, E., and Wayne, R. (2006) Genomics and conservation genetics. Trends Ecol. Evol. 21, 629–637.
9. Kuang, H., van Eck, H.J., Sicard, D., Michelmore, R., and Nevo, E. (2008) Evolution and genetic population structure of prickly lettuce (Lactuca serriola) and its RGC2 resistance gene cluster. Genetics 178, 1547–1558.
10. Whitehead, A. and Crawford, D.C. (2006) Neutral and adaptive variation in gene expression. Proc. Natl. Acad. Sci. USA 103, 5425–5430.
11. Eyre-Walker, A. (2006) The genomic rate of adaptive evolution. Trends Ecol. Evol. 21, 569–575.
12. Peng, J.H., Ronin, Y.I., Fahima, T., Röder, M.S., Li, Y.C., Nevo, E., and Korol, A.B. (2003) Domestication quantitative trait loci in Triticum dicoccoides, the progenitor of wheat. Proc. Natl. Acad. Sci. USA 100, 2489–2494.
13. Verhoeven, K.J.F., Vanhala, T.K., Biere, A., Nevo, E., and van Damme, J.M.M. (2004) The genetic basis of adaptive population differentiation: a QTL-analysis of fitness traits in two wild barley populations from contrasting habitats. Evolution 58, 270–283.
14. Chen, G., Krugman, T., Fahima, T., Chen, Y., Hu, Y., Röder, M., Nevo, E., and Korol, A.B. (2008) Chromosomal regions controlling seedling drought resistance of Israeli wild barley, Hordeum spontaneum populations (submitted).
15. Cronin, J.K., Bundock, P.C., Henry, R.J., and Nevo, E. (2007) Adaptive climatic molecular evolution in wild barley at the Isa defense locus. Proc. Natl. Acad. Sci. USA 104, 2773–2778.
16. Raskina, O., Belyayev, A., and Nevo, E. (2004) Quantum speciation in Aegilops: molecular cytogenetic evidence from rDNA cluster variability in natural populations. Proc. Natl. Acad. Sci. USA 101, 14818–14823.
17. Parnas, T. (2006) Evidence for incipient sympatric speciation in wild barley, Hordeum spontaneum, at “Evolution Canyon”, Mount Carmel, Israel, based on hybridization, physiological, and genetic diversity estimates. M.Sc. Thesis, University of Haifa.
18. Nevo, E. (1997) Evolution in action across phylogeny caused by microclimatic stresses at “Evolution Canyon”. Theor. Popul. Biol. 52, 231–243.
19. Kimber, G. and Feldman, M. (1987) Wild Wheat. An Introduction. University of Missouri, Columbia Spec. Rep. 353, College of Agriculture.
20. Zohary, D. and Hopf, M. (2000) Domestication of Plants in the Old World, 3rd Edn. Oxford University Press, Oxford.
21. Nevo, E. (1992) Origin, evolution, population genetics and resources for breeding of wild barley, H. spontaneum, in the Fertile Crescent, in Barley: Genetics, Biochemistry, Molecular Biology and Biotechnology (Shewry, P.R., ed.), CAB International, Wallingford, UK, pp. 19–43.
22. Nevo, E. (2001b) Genetic resources of wild emmer, Triticum dicoccoides, for wheat improvement. Isr. J. Plant Sci. 49, 77–91.
23. Nevo, E. (2006) Genome evolution of wild cereal diversity and prospects for crop improvement. Plant Genet. Res. 4(1), 36–46.
24. Nevo, E., Korol, A.B., Beiles, A., and Fahima, T. (2002) Evolution of Wild Emmer and Wheat Improvement. Population Genetics, Genetic Resources, and Genome Organization of Wheat’s Progenitor, Triticum dicoccoides. Springer-Verlag, Berlin.
25. Feldman, M. and Sears, E.R. (1981) The wild gene resources of wheat. Sci. Am. 244, 102–112.
26. Nevo, E. and Payne, P.I. (1987) Wheat storage proteins: diversity for HMW glutenin subunits in wild emmer from Israel. I. Geographical patterns and ecological predictability. Theor. Appl. Genet. 74, 827–836.
27. Li, Y.C., Fahima, T., Peng, J.H., Röder, M.S., Kirzhner, V.M., Beiles, A., Korol, A.B., and Nevo, E. (2000a) Edaphic microsatellite DNA divergence in wild emmer wheat, Triticum dicoccoides, at a microsite: Tabigha, Israel. Theor. Appl. Genet. 101, 1029–1038.
28. Li, Y.C., Fahima, T., Beiles, A., Korol, A.B., and Nevo, E. (1999) Microclimatic stress and adaptive DNA differentiation in wild emmer wheat (T. dicoccoides). Theor. Appl. Genet. 98, 873–883.
29. Li, Y.C. (2000) Microscale molecular population genetics of wild emmer wheat, Triticum dicoccoides, in Israel. Ph.D. Thesis, University of Haifa, Israel, pp. 258.
30. Li, Y.C., Fahima, T., Peng, J.H., Röder, M.S., Kirzhner, V.M., Beiles, A., Korol, A.B., and Nevo, E. (2000b) Microsatellite diversity correlated with ecological-edaphic and genetic factors in three microsites of wild emmer wheat in north Israel. Mol. Biol. Evol. 17, 851–862.
31. Li, Y.C., Röder, M.S., Fahima, T., Kirzhner, V.M., Beiles, A., Korol, A.B., and Nevo, E. (2000c) Natural selection causing microsatellite divergence in wild emmer wheat at the ecologically variable microsite at Ammiad, Israel. Theor. Appl. Genet. 100, 985–999.
32. Li, Y.C., Krugman, T., Fahima, T., Beiles, A., Röder, M.S., Korol, A.B., and Nevo, E. (2000d) Parallel microgeographic patterns of genetic diversity and divergence revealed by allozyme, RAPD, and microsatellites in Triticum dicoccoides at Ammiad, Israel. Conserv. Genet. 1, 191–207.


33. Li, Y.C., Fahima, T., Röder, M.S., Beiles, A., Korol, A.B., and Nevo, E. (2002) Climatic effects on microsatellite diversity in wild emmer wheat, Triticum dicoccoides, at the Yehudiyya microsite. Heredity 89, 127–132.
34. Li, Y.C., Krugman, T., Fahima, T., Beiles, A., Korol, A.B., and Nevo, E. (2001) Spatiotemporal allozyme divergence caused by aridity stress in a natural population of wild wheat, Triticum dicoccoides, at the Ammiad microsite, Israel. Theor. Appl. Genet. 102, 853–864.
35. Li, Y.C., Röder, M.S., Fahima, T., Kirzhner, V.M., Beiles, A., Korol, A.B., and Nevo, E. (2002) Climatic effects on microsatellite diversity in wild emmer wheat, Triticum dicoccoides, at Yehudiyya microsite. Heredity 89, 127–132.
36. Morgante, M., Hanafey, M., and Powell, W. (2002) Microsatellites are preferentially associated with non-repetitive DNA in plant genomes. Nat. Genet. 30, 194–200.
37. Van Valen, L. (1965) Morphological variation and width of ecological niche. Am. Nat. 99, 377–390.
38. Kirzhner, V.M., Korol, A., Turpeinen, T., and Nevo, E. (1995) Genetic supercycles caused by cyclical selection. Proc. Natl. Acad. Sci. USA 92, 7130–7133.
39. Kirzhner, V.M., Korol, A.B., and Nevo, E. (1996) Complex dynamics of multilocus systems subjected to cyclical selection. Proc. Natl. Acad. Sci. USA 93, 6532–6535.
40. Kirzhner, V.M., Korol, A., and Nevo, E. (1999) Abundant multilocus polymorphisms caused by genetic interaction between species on trait-for-trait basis. J. Theor. Biol. 198, 61–70.
41. Korol, A.B., Kirzhner, V.M., and Nevo, E. (1998) Dynamics of recombination modifiers caused by cyclical selection: interaction of forced and autoscillations. Genet. Res. 72, 135–147.
42. Nevo, E. (1995a) Evolution and extinction, in Encyclopedia of Environmental Biology Vol. 1 (Nierenberg, W., ed.), Academic Press Inc., New York, pp. 717–745.
43. Nevo, E., Beharav, A., Meyer, R.C., Hackett, C.A., Forster, B.P., Russell, J.R., Handley, L., and Powell, W. (2005) Genomic microsatellite adaptive divergence of wild barley by microclimatic stress in “Evolution Canyon”, Israel. Biol. J. Linn. Soc. 84, 205–224.
44. Nesbitt, M. and Samuel, D. (1998) Wheat domestication: archaeological evidence. Science 279, 1433.
45. Badr, A., Muller, K., Schafer-Pregl, R., El Rabey, H., Effgen, S., Ibrahim, H., Pozzi, C., Rohde, W., and Salamini, F. (2000) On the origin and domestication history of barley. Mol. Biol. Evol. 17, 499–510.
46. Lev-Yadun, S., Gopher, A., and Abbo, S. (2000) The cradle of agriculture. Science 288, 1602–1603.
47. Gopher, A., Abbo, S., and Lev-Yadun, S. (2002) The “when”, “where”, and the “why” of the Neolithic revolution in the Levant. Doc. Praehistorica 28, 49–62.
48. Salamini, F., Ozkan, H., Brandolini, A., Schafer-Pregl, R., and Martin, W. (2002) Genetics and geography of wild cereal domestication in the Near East. Nat. Rev. Genet. 3, 429–441.
49. Nevo, E. and Beiles, A. (1989) Genetic diversity of wild emmer wheat in Israel and Turkey: structure, evolution and application in breeding. Theor. Appl. Genet. 77, 421–455.
50. Nevo, E., Beiles, A., and Zohary, D. (1986) Genetic resources of wild barley in the Near East: structure, evolution and application in breeding. Biol. J. Linn. Soc. 27, 355–380.
51. Nevo, E. (1983) Genetic resources of wild emmer wheat: structure, evolution and applications in breeding, in Proceedings of the 6th International Wheat Genetics Symposium, Kyoto University, Kyoto, Japan, pp. 421–431.
52. Nevo, E. (1989) Genetic resources of wild emmer wheat revisited: genetic evolution, conservation and utilization, in Proceedings of the Seventh International Wheat Genetics Symposium, 13–19 July 1989 (Miller, T.E. and Koebner, R., eds.), Institute of Plant Science Research, Cambridge, pp. 121–126.
53. Snape, J.W., Nevo, E., Parker, B.B., Leckie, D., and Morgunov, A. (1991a) Herbicide response polymorphism in wild populations of emmer wheat. Heredity 66, 251–257.
54. Snape, J.W., Leckie, D., Parker, B.B., and Nevo, E. (1991b) The genetical analysis and exploitation of differential responses to herbicides in crop species, in Herbicide Resistance in Weeds and Crops (Caseley, J.C., Cussans, G.W., and Atkin, R.K., eds.), Butterworth-Heinemann, Oxford, pp. 305–317.
55. Cakmak, I., Torun, A., Millet, E., Feldman, M., Fahima, T., Korol, A., Nevo, E., Braun, H.J., and Ozkan, H. (2004) Triticum dicoccoides: an important genetic resource for increasing zinc and iron concentration in modern cultivated wheat. Soil Sci. Plant Nutr. 50, 1047–1054.
56. Peleg, Z., Fahima, T., Abbo, S., Krugman, T., Nevo, E., Yakir, D., and Saranga, Y. (2005) Genetic diversity for drought resistance in wild wheat and its ecogeographical associations. Plant Cell Environ. 28, 176–191.
57. Uauy, C., Distelfeld, A., Fahima, T., Blechl, A., and Dubcovsky, J. (2006) A NAC gene regulating senescence improves grain protein, zinc, and iron content in wheat. Science 314, 1298–1301.
58. Moseman, J.G., Nevo, E., and Zohary, D. (1983) Resistance of Hordeum spontaneum collected in Israel to infection with Erysiphe graminis hordei. Crop Sci. 23, 1115–1119.
59. Moseman, J.G., Nevo, E., El-Morshidy, M.A., and Zohary, D. (1984) Resistance of Triticum dicoccoides to infection with Erysiphe graminis tritici. Euphytica 33, 41–47.
60. Moseman, J.G., Nevo, E., Gerechter-Amitai, Z.K., El-Morshidy, M.A., and Zohary, D. (1985) Resistance of Triticum dicoccoides collected in Israel to infection with Puccinia recondita tritici. Crop Sci. 25, 262–265.
61. Fetch, T.G., Steffenson, B.J., and Nevo, E. (2003) Diversity and sources of multiple disease resistance in Hordeum spontaneum. Plant Dis. 87, 1439–1448.
62. Nevo, E. (1987) Plant genetic resources: prediction by isozyme markers and ecology, in Isozymes: Current Topics in Biological Research (Vol. 16). Agriculture, Physiology and Medicine (Rattazi, M., Scandalios, J., and Whitt, G.S., eds.), Alan R. Liss Inc., New York, pp. 247–267.
63. Suprunova, T., Krugman, T., Fahima, T., Chen, G., Shams, I., Korol, A., and Nevo, E. (2004) Differential expression of dehydrin (Dhn) in response to water stress in resistant and sensitive wild barley (Hordeum spontaneum). Plant Cell Environ. 27, 1297–1308.
64. Weining, S., Hu, Y., and Nevo, E. (2003) Toward understanding the molecular mechanism of drought resistance in wild barleys through the identification of nucleic acids polymorphisms in dehydrin genes. (Abstract) Plant and Animal Genome XI Conf. January 11–15, San Diego, CA, pp. 416.
65. Volis, S., Yakubov, B., Shulgina, E., Ward, D., Zur, V., and Mendlinger, S. (2001) Test for adaptive RAPD variation in population genetic structure of wild barley, Hordeum spontaneum Koch. Biol. J. Linn. Soc. 74, 289–303.
66. Volis, S., Mendlinger, S., and Ward, D. (2002) Differentiation in populations of Hordeum spontaneum along a gradient of environmental productivity and predictability: life history and local adaptation. Biol. J. Linn. Soc. 77, 479–490.
67. Weining, S., Xianghong, D., and Nevo, E. (2004) Transforming Arabidopsis with genes from wild barley for the analysis of drought tolerance. (Abstract) Plant and Animal Genome XII Conf. January 10–14, San Diego, CA, pp. 98.
68. Nevo, E., Lu, Z., and Pavlicek, T. (2006) Global evolutionary strategies across life caused by shared ecological stress: Fact or fancy? Isr. J. Plant Sci. 54, 1–8.
69. Nevo, E. (1995b) Asian, African, and European biota meet at “Evolution Canyon”, Israel: local tests of global biodiversity and genetic diversity patterns. Proc. Roy. Soc. Lond. B. 262, 149–155.
70. Korol, A.B., Rashkovetsky, E., Iliadi, K., and Nevo, E. (2006) Drosophila flies in “Evolution Canyon” as a model for incipient sympatric speciation. Proc. Natl. Acad. Sci. USA 103, 18184–18189.
71. Miyazaki, S., Nevo, E., Grishkan, I., Ikleman, U., Weinberg, D., and Bohnert, H. (2003) Oxidative stress responses in yeast strains, Saccharomyces cerevisiae, from “Evolution Canyon”, Israel. Monatsh. Chem. 134, 1465–1480.
72. Nevo, E., Apelbaum-Elkaher, I., Garty, J., and Beiles, A. (1997) Natural selection causes microscale allozyme diversity in wild barley and lichen at “Evolution Canyon”, Mt. Carmel, Israel. Heredity 78, 373–382.
73. Owuor, E.D., Fahima, T., Beiles, A., Korol, A.B., and Nevo, E. (1997) Population genetics response to microsite ecological stress in wild barley Hordeum spontaneum. Mol. Ecol. 6, 1177–1187.
74. Paterson, A.H., Lander, E.S., Hewitt, J.D., Peterson, S., Lincoln, S.E., and Tanksley, S.D. (1988) Resolution of quantitative trait into Mendelian factors by using a complete linkage map of restriction fragment length polymorphisms. Nature 335, 721–726.
75. Soller, M. and Beckmann, J.S. (1988) Genomic genetics and utilization for breeding purposes of genetic variation between populations, in Proc. 2nd Int. Conf. Quant. Genet. (Weir, B.S., Eisen, D.J., Goodman, M.M., and Namkoong, G., eds.), Sinauer, Sunderland, pp. 161–188.
76. Sikorski, J. and Nevo, E. (2005) Adaptation and incipient sympatric speciation of Bacillus simplex under microclimatic contrast at “Evolution Canyons” I and II, Israel. Proc. Natl. Acad. Sci. USA 102, 15924–15929.
77. Lamb, B., Kozlakidis, Z., and Saleem, M. (2000) Inter-strain cross-fertility tests on cultures from Israel, America and Canada in the homothallic fungus, Sordaria fimicola. Fungal Genet. News 47, 69–71.


78. Gutterman, Y. and Nevo, E. (1994) Germination comparison study of Hordeum spontaneum regionally and locally in Israel: a population in the Negev Desert highlands and from two opposing slopes on the Mediterranean Mount Carmel. Barley Genet. Newslet. 22, 18–19.
79. Lavie, B., Stow, V., Krugman, T., Beiles, A., and Nevo, E. (1994) Fitness in wild barley from two opposing slopes of a Mediterranean microsite at Mt. Carmel, Israel. Barley Genet. Newslet. 23, 12–14.
80. Li, Y.C., Korol, A.B., Fahima, T., and Nevo, E. (2004) Microsatellites within genes: structure, function, and evolution. Mol. Biol. Evol. 21, 991–1007.
81. Kuang, H., Woo, S.S., Meyers, B., Nevo, E., and Michelmore, R.W. (2004) Multiple genetic processes result in heterogeneous rates of evolution within the major cluster of disease resistance genes in lettuce. Plant Cell 16, 2870–2894.
82. Kuang, H., Ochoa, O.E., Nevo, E., and Michelmore, R.W. (2006) The disease resistance gene Dm3 is infrequent in natural populations of Lactuca serriola due to deletions and frequent gene conversions. Plant J. 47, 38–48.
83. Sicard, D., Woo, S.S., Arroyo-Garcia, R., Ochoa, O., Nguyen, D., Korol, A.B., Nevo, E., and Michelmore, R. (1999) Molecular diversity at the major cluster of disease resistance genes in cultivated and wild Lactuca spp. Theor. Appl. Genet. 99, 405–418.
84. Beharav, A., Lewinsohn, D., Lebeda, A., and Nevo, E. (2006) New wild Lactuca genetic resources with resistance against Bremia lactucae. Genet. Resour. Crop Evol. 53, 467–474.
85. Scott, R.J. and Spielman, M. (2006) Deeper into the maize: new insights into genomic imprinting in plants. BioEssays 28, 1167–1171.

Chapter 18

Genome Sequencing Approaches and Successes

Michael Imelfort, Jacqueline Batley, Sean Grimmond, and David Edwards

Summary

Sequence data is crucial to our understanding of crop growth and development, as differences in DNA sequence are responsible for almost all of the heritable differences between crop varieties and ecotypes. The sequence of a genome is often referred to as the genetic blueprint, and is the foundation for all additional information from the genome to the phenome. The value of DNA sequence is leading to rapid improvements in sequencing technology, increasing throughput, and reducing costs, and technological advances are accelerating with the introduction of novel approaches that are replacing the traditional Sanger-based methods. As genome sequencing becomes cheaper, it will be applied to a greater number of species with increasingly large and complex genomes. This will increase our understanding of how differences in the sequence relate to phenotypic observations, heritable traits, speciation, and evolution. Our understanding of plants will be greatly enhanced by this flow of sequence information, with direct benefit for crop improvement.

Key words: DNA sequencing, Genomics, Sanger, SOLiD, Solexa Genome Analyser, 454 FLX.

1. Introduction

Initial DNA sequencing projects focused on single-pass sequencing of expressed genes, to produce expressed sequence tags (ESTs). Genes are specifically expressed in tissues in the form of messenger RNA (mRNA). A pool of mRNA is extracted from a specific tissue and used to produce a library of complementary DNA (cDNA). Individual cDNAs from a library are then sequenced from one direction to produce the sequence tag. EST sequencing is a cost-effective method for the rapid discovery of

Daryl J. Somers et al. (eds.), Methods in Molecular Biology, Plant Genomics, vol. 513 © Humana Press, a part of Springer Science + Business Media, LLC 2009 DOI: 10.1007/978-1-59745-427-8_18


gene sequences that may be associated with development or environmental responses in the tissues from which the mRNA was extracted (1). However, cDNA libraries show redundancy, and highly expressed genes are more abundant than genes expressed at lower levels. Similarly, only genes expressed in the specific tissue, growth stage, and environmental conditions used to sample the cDNA are sequenced. As the number of EST sequences produced from a cDNA library increases, fewer sequences represent new genes that have not already been sampled. Thus, while EST sequencing is a valuable means to rapidly identify genes moderately or highly expressed in a tissue, the method rarely identifies genes that are expressed at lower levels, including many genes that encode regulatory proteins that are only produced in very small quantities.

An alternative to EST sequencing is genome sequencing, which aims to sequence either the whole genome or portions of it. This method removes the bias associated with tissue specificity or gene expression level. However, crop plant genomes tend to be very large; for example, the wheat genome is four to six times as large as the human genome. Genome sequencing only became feasible with the development of capillary-based high-throughput sequencing technology. The recent development of novel low-cost sequencing methods such as pyrosequencing and the SOLiD technology from Applied Biosystems (AB) will lead to genome sequencing becoming increasingly common.

The first plant genome to be sequenced was that of the model plant Arabidopsis thaliana (2). More recently, the sequencing of larger crop genomes has been attempted, with the completion of the genome sequence for the model monocot and major crop, rice (3, 4). With the sequencing of A. thaliana and rice, along with ongoing sequencing projects for numerous crop and non-crop plant species such as Brassica, Medicago, Lotus, tomato, potato, poplar, soybean, Capsella, papaya, Eucalyptus, grape, Mimulus guttatus, Triphysaria versicolor, banana, Sorghum, maize, and wheat (5), as well as numerous plant pathogens including Agrobacterium tumefaciens, Burkholderia cenocepacia, Clavibacter michiganensis, Erwinia carotovora, E. chrysanthemi, Leifsonia xyli, Onion yellows phytoplasma, Pseudomonas syringae, Ralstonia solanacearum, Spiroplasma kunkelii, Xanthomonas axonopodis, Xanthomonas campestris, and Xylella fastidiosa, plant genomics has truly come of age.

1.1. Sequencing Technology

The ability to sequence genomes is being advanced by increasingly high-throughput technology. One such advance is the application of pyrosequencing. The first commercially available pyrosequencing system was developed by 454 and commercialized by Roche as the GS20, capable of sequencing over 20 million base pairs in just over 4 h. This was replaced during 2007 by the GS FLX model, capable of producing over 100 million base pairs of sequence in a similar amount of time. Two alternative ultra-high-throughput sequencing systems now compete with the GS FLX: Solexa technology, which is now being commercialized by Illumina, and the SOLiD system from AB. For a summary of each of these technologies, see Table 1.

The Roche 454 FLX system performs amplification and sequencing in a highly parallelized picolitre format. In contrast with Solexa and AB SOLiD, GS FLX read lengths average 200–400 bases, and with more than 400,000 reads per run, the system is capable of producing over 100 Mbp of sequence in around 4 h with a single-read accuracy of greater than 99.5%. Emulsion PCR enables the amplification of a DNA fragment immobilized on a bead from a single fragment to 10 million identical copies, generating sufficient DNA for the subsequent sequencing reaction. Sequencing involves the sequential flow of both nucleotides and enzymes over the plate; the enzymes convert chemicals generated during nucleotide incorporation into a chemiluminescent signal that can be detected by a CCD camera. The light signal is quantified to determine the number of nucleotides incorporated during the extension of the DNA sequence.

The Solexa sequencing system, sold as the Illumina Genome Analyzer, uses reversible terminator chemistry to generate up to three thousand million bases (3 Gbp) of usable data per run. Sequencing templates are immobilized on a flow cell surface. Solid phase amplification creates clusters of up to 1,000 identical copies of each DNA molecule, with densities of more than ten million clusters per square centimetre. Sequencing uses four proprietary fluorescently labelled nucleotides to sequence the millions of clusters on the flow cell surface. These nucleotides possess a reversible termination property, allowing each cycle of the sequencing reaction to occur simultaneously in the presence of the four nucleotides. Solexa sequencing has been developed predominantly for resequencing, with more than tenfold coverage ensuring high confidence in the determination of genetic differences. In addition, each raw read base has an assigned quality score, assisting de novo assembly and sequence comparisons.
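The step in which the GS FLX light signal is quantified to give the number of nucleotides incorporated can be illustrated with a minimal sketch. It assumes that intensities have already been normalized so that a value near 1.0 corresponds to a single incorporation; the flow order `TACG`, the example intensities, and the function name `decode_flowgram` are hypothetical and are not taken from the instrument documentation.

```python
def decode_flowgram(intensities, flow_order="TACG"):
    """Convert normalized flow intensities into a base sequence.

    Each flow presents one nucleotide. An intensity close to k is read
    as k consecutive incorporations of that nucleotide (k = 0 means the
    nucleotide was not incorporated in that flow).
    """
    bases = []
    for i, signal in enumerate(intensities):
        nucleotide = flow_order[i % len(flow_order)]
        # Round to the nearest whole number of incorporated bases.
        count = int(max(signal, 0.0) + 0.5)
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical intensities for two cycles of T, A, C, G flows.
example = [1.02, 0.05, 2.10, 0.96, 0.03, 1.08, 0.00, 2.95]
print(decode_flowgram(example))  # TCCGAGGG
```

Because the homopolymer length is inferred from signal strength, an intensity that falls midway between two integers is ambiguous in this scheme, which is one reason to treat such a rounding rule only as an illustration.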

Table 1
Properties of the different sequencing methods

                Sanger     GS FLX     Solexa     AB SOLiD
nt per run      70 Kbp     100 Mbp    3 Gbp      3 Gbp
Read length     750 bp     300 bp     75 bp      35 bp
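The figures in Table 1 can be turned into rough planning numbers. The sketch below is only an illustration: it takes the table values at face value, treats every base as usable, and uses hypothetical function names. The 130 Mbp genome size and tenfold coverage in the usage lines are example inputs (the Arabidopsis estimate and the Solexa resequencing depth quoted in this chapter).

```python
# Values copied from Table 1: bases per run and read length in bases.
PLATFORMS = {
    "Sanger":   {"bases_per_run": 70_000,        "read_length": 750},
    "GS FLX":   {"bases_per_run": 100_000_000,   "read_length": 300},
    "Solexa":   {"bases_per_run": 3_000_000_000, "read_length": 75},
    "AB SOLiD": {"bases_per_run": 3_000_000_000, "read_length": 35},
}


def reads_per_run(platform):
    """Approximate number of reads in one run (whole reads only)."""
    p = PLATFORMS[platform]
    return p["bases_per_run"] // p["read_length"]


def runs_for_coverage(platform, genome_size, coverage):
    """Runs needed to reach the requested average coverage, rounded up."""
    total_bases = genome_size * coverage
    per_run = PLATFORMS[platform]["bases_per_run"]
    return -(-total_bases // per_run)  # integer ceiling division


for name in PLATFORMS:
    print(name, reads_per_run(name), runs_for_coverage(name, 130_000_000, 10))
```

Raw throughput is not the whole story: shorter reads carry less information for de novo assembly, so run counts obtained this way are a lower bound rather than a project plan.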


The AB SOLiD system enables parallel sequencing of clonally amplified DNA fragments linked to beads. The method is based on sequential ligation with dye-labelled oligonucleotides and can generate more than 3 gigabases of ‘mappable data’ per run. The system features a two-base encoding mechanism that interrogates each base twice, providing a form of built-in error detection. Mate-paired sequence generation enables detection and resolution of sequence variation, including SNPs, gene copy number variations, duplications, inversions, insertions, and deletions. The system can be used for tag-based applications such as gene expression analysis and chromatin immunoprecipitation, where a large number of reads are required and where the high throughput provides greater sensitivity for the detection of lowly expressed genes.

This rapid expansion of technology, along with supporting bioinformatics developments, makes previously intractable genomes suitable targets for genome sequencing projects.

1.2. Sequencing Approaches

Multiple approaches have been assessed for the sequencing of large genomes. The most robust method for genome sequencing is known as BAC-by-BAC sequencing (BAC stands for bacterial artificial chromosome) and involves the production of an overlapping tiling path of large genomic fragments (around 120,000 bases) maintained within BACs. Each BAC is shotgun sequenced, and the many short reads are assembled to produce the sequence of the BAC. The whole genome sequence may then be reassembled from these large sequenced regions based on sequence overlaps. An alternative approach is the whole genome shotgun (WGS) method, where the entire genome is cut into many small fragments that are individually sequenced, and computational algorithms are applied to assemble the complete genome sequence. While WGS requires less time and funding than a BAC-by-BAC approach, the assembly of the genome sequence is often problematic due to sequence repeats within the genome. This is particularly true for many plant species such as wheat, with large polyploid genomes and many repetitive elements.

One challenge of genome sequencing lies in the fact that only a small portion of the genome encodes genes, and that these genes are surrounded by repetitive DNA, as well as regulatory elements that are often difficult to characterize. Several statistical methods have been applied to identify candidate genes within genomic sequence. These methods are being refined as more gene and genome sequence becomes available, providing evidence of gene expression and conservation of functional parts of the genome.

Modifications of sequencing methods to enrich for gene-rich regions of the genome have been applied for some crops such as maize. These include methyl filtration and high-Cot analysis, in which highly repetitive regions of the genome are filtered out prior to sequencing. Further modifications to this method include restriction analysis, where genomic DNA is digested with restriction endonucleases and fractionated to remove repeat-abundant fractions. For all of these approaches, however, assembly remains problematic and certain genes remain unidentified. An alternative method for reduced genome sequencing involves the isolation of specific chromosomes or portions of chromosomes by chromosome sorting. This approach enables both the sequencing of whole chromosomes by shotgun and the production of chromosome-specific BAC libraries for characterization and tiling path sequencing.
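Both the BAC-by-BAC and WGS strategies ultimately depend on joining reads by their sequence overlaps. The toy sketch below shows that idea with a simple greedy merge on three error-free reads. It is an illustration under strong assumptions (no sequencing errors, one strand only, no repeats); the function names and the example reads are hypothetical, and production assemblers use far more elaborate methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return reads  # no overlaps left: these are the contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]


# Three overlapping reads drawn from the toy sequence ATGGCGTACGTTAGC.
contigs = greedy_assemble(["ATGGCGTA", "GCGTACGT", "TACGTTAGC"])
print(contigs)  # ['ATGGCGTACGTTAGC']
```

When a genome contains repeats longer than the reads, several merges can share the same overlap length, and a greedy choice of this kind may join the wrong flanking sequences. That is the assembly difficulty described above for repeat-rich plant genomes.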

2. Examples of Plant Genome Sequencing Projects

2.1. Arabidopsis

On 14 December 2000, the Arabidopsis Genome Initiative (AGI) announced the completion of the A. thaliana genome sequence. Arabidopsis was the first plant genome to be sequenced. At publication, the sequence covered 115.4 Mbp of the estimated 130 Mbp genome and was estimated to contain 25,498 genes encoding proteins from approximately 11,000 families (2). Collation and distribution of sequence and annotation data resulting from the sequencing project are handled by The Arabidopsis Information Resource (TAIR) (6). TAIR maintains a database of genetic and molecular biology data, and as of April 2007, the TAIR7 release represented 27,029 protein-coding genes, 3,889 pseudogenes or transposable elements, and 1,123 non-coding RNAs.

As Arabidopsis was the first plant to be sequenced and the project was undertaken by over 40 laboratories across different countries, a number of different sequencing approaches were applied, from fully random sample sequencing to highly directed localized sequencing. A WGS approach was initially suggested, but due to quality and assembly issues, it was decided to proceed with a more directed approach using either BAC or other large-insert clones that had been mapped to specific chromosomes (7). The AGI agreed to follow standards for sequence accuracy established by the human genome project to ensure less than one error per 10,000 bases in the final product. Two groups, one at Texas A&M University led by Rod Wing and another at the Max-Planck-Institut für Molekulare Pflanzenphysiologie in Golm, Germany, led by Thomas Altmann, constructed complementary BAC libraries. These two libraries were the most widely used in sequencing projects in Europe and the USA. The Kazusa DNA Research Institute in Japan developed libraries in a P1 vector and in a modified BAC vector, and these were used to sequence chromosome 5 (8). Due to the collaboration of numerous laboratories and the coordination of the AGI, the Arabidopsis sequencing project, which was originally scheduled for completion in 2004, was completed 4 years ahead of schedule.

2.2. Brachypodium

Brachypodium is a close relative of the cool-season grasses and in 2006 was chosen by the US Department of Energy Joint Genome Institute (DOE JGI) for sequencing to act as a genomic bridge species between rice and other agronomically important cereals (9, 10). Both Arabidopsis and rice provide opportunities for furthering research into cool-season grass crops, but a specific grass model will overcome problems associated with using a dicot (Arabidopsis) or a plant that is separated from the cool-season grasses by 50 million years (rice) (9). Brachypodium sylvaticum was originally suggested, but B. distachyon was later selected for sequencing. B. distachyon is a self-fertile, inbreeding annual with a life cycle of less than 4 months. It has a diploid genome of approximately 335 Mbp with 2n = 10 chromosomes. A WGS approach is being employed to determine the sequence of Brachypodium Bd21, which will be supplemented by the sequencing of 250,000 ESTs. Sequencing is being undertaken by DOE JGI and it is hoped that the sequencing will promote further research in the area of energy crops. As of August 2007, the final assembly of the sequence was underway.

2.3. Brassica

Among the six cultivated species of Brassica, B. rapa (syn. campestris, AA, n = 10), B. juncea (AABB, n = 18), and B. napus (AACC, n = 19) are agronomically important oilseeds, whereas B. oleracea (CC, n = 9) is valued as leafy vegetables (broccoli, cauliflower, cabbage, knol-khol, etc.). The other two species, B. nigra (BB, n = 8) and B. carinata (BBCC, n = 17), are largely valued as condiments. Brassica shares extensive synteny with A. thaliana, enabling comparative mapping and exploitation of the Arabidopsis genome sequence for Brassica crop improvement. The Steering Committee for the Multinational Brassica Genome Project (MBGP) selected B. rapa as the first Brassica species to be fully sequenced, because it has the smallest genome, at 550 Mbp, and because communal BAC libraries and mapping populations were available. The international project, being undertaken by groups in Korea, Australia, Germany, Canada, France, USA, and UK, will sequence the genome to Phase 2, whereby BACs are to be sequenced to produce ordered and oriented contigs, but with some gaps remaining. All sequence reads and trace files, as well as the BAC libraries, are publicly available to permit users of the sequence to finish clones of particular interest as necessary.
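
The genome designations listed above follow an additive pattern: each of the three tetraploid species combines the complete chromosome sets of two of the diploid species. The short sketch below checks this arithmetic using only the chromosome numbers quoted in the text. The dictionary and function names are illustrative and are not taken from any Brassica resource.

```python
# Haploid chromosome numbers (n) of the three diploid Brassica genomes,
# as quoted in the text: A (B. rapa), B (B. nigra), C (B. oleracea).
DIPLOID_N = {"A": 10, "B": 8, "C": 9}

# Tetraploid species and the two diploid genomes each one combines.
TETRAPLOIDS = {
    "B. juncea": ("A", "B"),
    "B. napus": ("A", "C"),
    "B. carinata": ("B", "C"),
}


def haploid_number(genomes):
    """Sum the haploid chromosome numbers of the constituent genomes."""
    return sum(DIPLOID_N[g] for g in genomes)


for species, genomes in TETRAPLOIDS.items():
    designation = "".join(g * 2 for g in genomes)  # e.g. ("A", "B") -> "AABB"
    print(species, designation, "n =", haploid_number(genomes))
```

The computed values (18, 19, and 17) agree with the chromosome numbers given for B. juncea, B. napus, and B. carinata.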

2.4. Eucalyptus

In June 2007, the DOE JGI initiated the Eucalyptus grandis genome sequencing project. The project started in August 2007 and is expected to take 2 years to complete. The international programme will be coordinated by EUCAGEN and involve more than 130 scientists from 18 countries.

2.5. Grape

The sequencing of the Vitis vinifera genome is being performed on the quasi-homozygous genotype PN40024. The genome size is estimated at 475 Mbp and is being sequenced using a WGS approach, with a minimum of 12× coverage. A French–Italian collaborative project released an 8× draft sequence in August 2007. A BAC library of 70,656 clones has also been constructed, and the sequence will be assembled using the ARACHNE assembler. The sequences can be downloaded from the NCBI Trace Archive.
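
To give a feel for what 8× and 12× coverage mean for a 475 Mbp genome, the sketch below applies the idealized Poisson (Lander-Waterman) model, in which the expected fraction of bases hit by no read at coverage c is e^(-c). It is a rough illustration under stated assumptions and does not describe the grape project's actual calculations. The read length of 700 bp is an assumed value for Sanger reads and does not come from the text. Real projects fall short of this estimate because of repeats and cloning bias.

```python
import math


def expected_uncovered_fraction(coverage):
    """Poisson estimate of the fraction of bases covered by zero reads."""
    return math.exp(-coverage)


def reads_required(genome_bp, coverage, read_length_bp):
    """Number of reads needed to reach a nominal coverage depth."""
    return math.ceil(genome_bp * coverage / read_length_bp)


GENOME_BP = 475_000_000  # genome size estimate quoted in the text
READ_LENGTH_BP = 700     # ASSUMPTION: illustrative Sanger read length

for depth in (8, 12):
    uncovered_bp = expected_uncovered_fraction(depth) * GENOME_BP
    n_reads = reads_required(GENOME_BP, depth, READ_LENGTH_BP)
    print(f"{depth}x: about {n_reads:,} reads, "
          f"about {uncovered_bp:,.0f} bp expected to be uncovered")
```

Under this model, moving from 8× to 12× reduces the expected uncovered sequence by a factor of e^4, which is one reason a draft release at lower depth is usually followed by further sequencing.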

2.6. Lotus

Lotus japonicus is a diploid self-fertile perennial pasture legume, with six chromosomes and a genome size of around 450 Mbp. The genome is organized into distinct gene-rich euchromatin chromosome arms and repeat-rich pericentromeric regions, lending itself to a sequencing approach using seed BACs or transformation-competent artificial chromosomes (TACs) (11). Large-scale genome sequencing of variety Miyakojima MG-20 began in 2000. ESTs, cDNAs, and gene segments from Lotus and other legumes were used to determine seed points, and the corresponding TAC clones were selected and sequenced.

2.7. Maize

The maize genome consists of about 2.5 Gbp of DNA maintained in ten chromosomes. The National Science Foundation (NSF), the US Department of Agriculture, and the US DOE provided US $32 million to the Washington University Genome Sequencing Centre, Cold Spring Harbor Laboratory, the Arizona Genomics Institute, and Iowa State University in 2005 to undertake a maize genome sequencing project. B73 was selected as the maize variety, and a BAC-by-BAC approach was chosen to complement the previous maize genome sequencing assessments. This effort, expected to require 3 years of work, will use a minimal tiling path of 19,000 mapped BAC clones, and focus on producing high-quality sequence coverage of all identifiable gene-containing regions of the maize genome. Regions will be ordered, oriented, and, along with all of the intergenic sequences, anchored to the maize genetic maps. Important features of the project include immediate release of both preliminary and high-quality sequence assemblies, and the development of a genome browser that will facilitate user interaction with sequence and map data. To complement the BAC-by-BAC approach, a WGS strategy is being assessed using flow-sorted chromosome 10 material of variety Mo17. Sequence is being deposited into public databases as it emerges from the sequencing pipeline.
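
The idea of a minimal tiling path can be illustrated with a simple greedy selection: working along the physical map, repeatedly choose the clone that overlaps the region already covered and extends furthest beyond it. The sketch below is a simplified illustration, not the procedure used by the maize project. The clone names and coordinates are hypothetical, and real pipelines also require a minimum overlap between neighbouring clones and take fingerprint quality into account.

```python
def minimal_tiling_path(clones):
    """Greedy selection of a minimal set of overlapping clones.

    clones: list of (name, start, end) tuples on a shared map coordinate.
    Returns the chosen clone names in map order. Where the map has a gap,
    the path simply restarts at the next available clone.
    """
    ordered = sorted(clones, key=lambda c: (c[1], -c[2]))
    path = []
    covered_to = float("-inf")  # right-most coordinate covered so far
    i = 0
    while i < len(ordered):
        # Clones starting at or before `limit` can extend the current contig.
        # At the very start, or after a gap, the limit is the next clone's start.
        limit = max(covered_to, ordered[i][1])
        best = None
        while i < len(ordered) and ordered[i][1] <= limit:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        # Keep the clone only if it adds new coverage.
        if best[2] > covered_to:
            path.append(best[0])
            covered_to = best[2]
    return path


# Hypothetical clones: (name, start, end) in arbitrary map units.
example_clones = [
    ("b1", 0, 100), ("b2", 20, 150), ("b3", 90, 220),
    ("b4", 140, 260), ("b5", 210, 300), ("b6", 400, 500),
]
print(minimal_tiling_path(example_clones))
```

In this toy example, clones b2 and b4 are skipped because their neighbours already span the same region, and b6 starts a new contig after a gap in the map.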

2.8. Medicago

Legumes represent the third largest plant family and are the second most important crop family (11). Medicago truncatula is an annual diploid with eight chromosomes. It is closely related to tetraploid alfalfa (M. sativa). A combination of cytogenetic and BAC sequence data shows that the M. truncatula genome is organized into distinct gene-rich euchromatin and repeat-rich pericentromeric regions, allowing the M. truncatula genespace to be efficiently sequenced using a BAC-by-BAC strategy (12). Six chromosomes are being sequenced in an NSF-funded project and two are being sequenced by partners in Europe funded by EU Framework VI. An international committee known as the International Medicago Genome Annotation Group is coordinating the annotation process using training sets of M. truncatula gene models supported by EST sequence data to train gene prediction algorithms (13). The sequence will be a valuable basis for genomic comparison with other plant genomes, and as a foundation for improving crop and forage legumes.

2.9. Monkey Flower

Monkey flowers have become a model system for studying ecological and evolutionary genetics. JGI commenced sequencing Mimulus guttatus in 2006 using a WGS approach. As of May 2007, more than 70% of the 430 Mbp genome was completed and sequences are available through the NCBI Trace Archive. In addition to the WGS sequence, JGI is sequencing 200,000 ESTs each from M. guttatus and M. lewisii.

2.10. Papaya

The Papaya sequencing project was founded by the Centre for Genomics, Proteomics, and Bioinformatics Research Initiative at the University of Hawaii in 2004. Papaya will be the first fruit species to be sequenced; it has nine chromosomes and a genome size of 372 Mbp.

2.11. Poplar

Trees, due to their long life span, have characteristics that distinguish them from annual, herbaceous plants. It is likely that many of these properties are based on a tree-specific genetic foundation. Poplar has been selected as a model system for trees because it has a relatively small genome. Populus trichocarpa (black cottonwood), with a paleoploid (2n = 38) genome of approximately 480 Mbp, was selected as the first tree genome to be sequenced. Sequencing this genome allows comparison between perennial and annual plant species on a whole genome basis for the first time and provides resources to help answer tree-specific questions about dormancy, development of a secondary cambium, juvenile-mature phase change, and long-term host–pest interactions (14, 15). The poplar sequencing project was funded by the Biological and Environmental Research programme in the DOE’s Office of Science, which provided US $8 million for sequencing and US $4 million for associated research. The primary European partner, Sweden’s Umeå Plant Science Centre, produced ESTs which were necessary for accurate gene prediction. The total investment in the Swedish Populus programme exceeded US $10 million, while Genome Canada and Genome BC contributed a further CDN $2 million to the project. The International Populus Genome Consortium based at Oak Ridge National Laboratory (ORNL) coordinated the 2-year international research effort with the bulk of the sequencing being carried out by JGI and ORNL. A WGS approach was used, with assembled sequence being mapped to a BAC minimum tiling path provided by Genome BC. A total of 324,000 ESTs were used to identify genes, and Stanford University played an integral part in sequence finishing and quality control. The genome annotation v1.1 includes 45,555 gene models produced through the collaboration of JGI, ORNL, and Ghent University, Belgium. Annotated genomic sequence is available for download and analysis at the JGI and ORNL websites (14). Currently, around 520 Mbp of annotated genomic sequence is available.

2.12. Potato

Potato has a variety of ploidy levels, ranging from diploid (2n = 2x = 24) to hexaploid (2n = 6x = 72), all based on a basic chromosome number of 12; the cultivated potato varieties are tetraploid. The size of the genome is 840 Mbp. The Potato Genome Sequencing Consortium will sequence the potato genome using a BAC-by-BAC approach with an expected completion date of 2009.

2.13. Rice

Rice is one of the most important cereal crops and the principal food for more than half of the world’s population. This species has the smallest genome size among major cereal crops, estimated at 430 Mbp (16). Evolutionary trees have shown that cereal crops diverged from a common ancestor some 60 million years ago (17) and whole genome organization exhibits a high degree of synteny (see Chapter 3, this volume). Two general methods have been applied for the sequencing of the rice genome. A BAC-by-BAC sequencing approach undertaken by an international consortium has been complemented by two WGS projects. The genome sequences of rice (3, 4) provide a basis for integrating and comparing biological information from rice and related cereal crops. The International Rice Genome Sequencing Project (IRGSP), a consortium of publicly funded laboratories, was established in 1997 to obtain a high-quality, map-based sequence of the rice genome using the cultivar Nipponbare of Oryza sativa ssp. japonica. The consortium is composed of ten members representing Japan, USA, China, Taiwan, Korea, India, Thailand, France, Brazil, and UK. The IRGSP adopted the BAC sequencing strategy so that each sequenced clone can be associated with a specific position on the genetic map, thus ensuring a robust genome sequence assembly. The IRGSP completed the sequencing of the rice genome in December 2004 and the high-quality, map-based sequence of the entire genome is now available in public databases. WGS sequences for the genomes of indica (93-11) and japonica (Syngenta) are also available. Syngenta produced the first genome sequence of Oryza sativa ssp. japonica using a WGS approach. The Beijing Genomics Institute, as part of the Super Hybrid Rice Genome Project to characterize the genome of rice, released a 4.2-fold coverage draft genome sequence of 93-11 (3), a cultivar of Oryza sativa ssp. indica grown widely in China and Southeast Asia. An improved version was later reported in which the coverage of the 93-11 dataset was brought up to 6.28-fold (18).

2.14. Sorghum

The Sorghum bicolor genome consists of approximately 770 Mbp in ten chromosomes (2n = 20). A WGS approach is being applied within the DOE JGI Community Sequencing Program, and Sorghum is the most complex plant genome to be sequenced by this strategy. To accelerate the release of the sequence information, preliminary scaffolds have been made available prior to their integration with genetic and physical maps. Comparison of the genome sequence with Sorghum ESTs suggests that more than 95% of known Sorghum protein-coding genes are represented in this assembly.

2.15. Tomato

In 2004, the tomato genome was selected as the reference for sequencing projects within the Solanaceae Genomics Project. The sequencing of the tomato genome marks the first step in bringing together genetic maps and genomes of all Solanaceae and related plants, including potato, eggplant, pepper, petunia, and coffee. The 950 Mbp tomato genome was found to consist of approximately three-quarters pericentromeric heterochromatin, known to be rich in repetitive sequences and poor in genes. The remaining one-quarter of the tomato genome consists of distal, euchromatic segments of chromosomes that contain mostly single copy sequences and more than 90% of the genes. It was therefore decided to only sequence this 25% of the genome (19). As only a fraction of the genome was to be sequenced, a WGS approach would not be cost-effective. An ordered BAC approach was considered more attractive as the sequence would be used as a reference for further sequencing projects. The variety Heinz 1706 was selected as BAC resources were already available. A minimal tiling path of BAC clones was constructed and BAC clones were individually anchored to a genetic map based on a single, common L. esculentum × L. pennellii F2 population. As of October 2007, 30% of sequencing was complete, with 23% of BACs reported as finished and 18% of BACs available for download.

2.16. Wheat

The size of the wheat (Triticum aestivum) genome is approximately 17,000 Mbp, much larger than related cereal genomes such as barley (Hordeum vulgare, 5,000 Mbp), rye (Secale cereale, 9,100 Mbp) and oat (Avena sativa, 11,000 Mbp). The size and hexaploid nature of the wheat genome create significant problems in elucidating its genome sequence and it may take several years before sequencing technology is fast and cheap enough to readily determine the whole genome sequence. The size of the genome is also evidenced by the scale and difficulty of BAC library production. Early wheat BAC library production focussed on individual genomes. Lijavetzky et al. (20) constructed an A genome Triticum monococcum (accession DV92) BAC library of over 276,000 clones, or 5.6 genome equivalents, while Moullet et al. (21) constructed a D genome Aegilops tauschii BAC library of 144,000 clones, or 3.7 genome equivalents. Cenci et al. (22) produced a BAC library for tetraploid durum wheat, Triticum turgidum, representing over 500,000 clones or 5.1-fold coverage of the genome. As part of a joint UK/France collaboration, Allouis et al. (23) produced a hexaploid wheat library of the variety ‘Chinese Spring’ of 1.2 million clones. This is now complemented by a 3.4-fold genome coverage Chinese Spring library of almost 400,000 clones (24). In addition, Nilmalgoda et al. (25) constructed a hexaploid wheat library of T. aestivum variety ‘Glenlea’, representing 650,000 clones or 3.1 genome equivalents. More recently, chromosome-specific BAC libraries have been produced enabling the targeted screening and sequencing of specific wheat chromosomes (26). The International Wheat Genome Sequencing Consortium (IWGSC) was established in 2005 to facilitate and coordinate international efforts towards obtaining the complete sequence of the bread wheat genome. The consortium aims to ensure that the sequence and resources resulting from it will be available to all without restriction or cost.
The IWGSC has selected the cultivar Chinese Spring as the germplasm source for the project as this variety already has significant genetic and molecular resources (27). A pilot project led by the French National Institute for Agricultural Research was initiated in 2004 to assess the BAC fingerprinting of the largest hexaploid wheat chromosome, 3B. A total of 68,000 BAC clones of a 3B chromosome-specific BAC library (28) have been fingerprinted at the French National Sequencing Centre, Genoscope, and the sequencing of these BAC clones is in progress. In an NSF-funded (USA) wheat pilot project, researchers Bennetzen, Devos, and SanMiguel plan to sequence a total of 220 large fragments of wheat DNA, cloned into BACs. These 220 BACs will also be anchored on wheat chromosome maps.
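
The genome equivalents quoted above for the BAC libraries relate clone number, mean insert size, and genome size through coverage = (number of clones × mean insert size) / genome size. The sketch below inverts this relationship for the two hexaploid libraries whose clone counts and coverage are both given in the text, and applies the standard Clarke-Carbon estimate of the chance that any given locus is present in the library. The insert sizes it prints are back-calculated illustrations that assume the 17,000 Mbp genome size and a round figure of 400,000 clones for the Chinese Spring library; they are not values reported by the cited authors.

```python
def genome_equivalents(n_clones, mean_insert_bp, genome_bp):
    """Library coverage expressed as multiples of the genome size."""
    return n_clones * mean_insert_bp / genome_bp


def implied_mean_insert_bp(n_clones, equivalents, genome_bp):
    """Mean insert size implied by a clone count and a stated coverage."""
    return equivalents * genome_bp / n_clones


def probability_of_recovery(n_clones, mean_insert_bp, genome_bp):
    """Clarke-Carbon estimate that a given locus lies in at least one clone."""
    return 1.0 - (1.0 - mean_insert_bp / genome_bp) ** n_clones


WHEAT_GENOME_BP = 17_000_000_000  # hexaploid wheat genome size quoted in the text

# (clone count, genome equivalents) as quoted in the text.
# ASSUMPTION: "almost 400,000 clones" is rounded to 400,000 here.
libraries = {
    "Chinese Spring (24)": (400_000, 3.4),
    "Glenlea (25)": (650_000, 3.1),
}

for name, (n_clones, equivalents) in libraries.items():
    insert_bp = implied_mean_insert_bp(n_clones, equivalents, WHEAT_GENOME_BP)
    p = probability_of_recovery(n_clones, insert_bp, WHEAT_GENOME_BP)
    print(f"{name}: implied mean insert about {insert_bp / 1000:.0f} kb, "
          f"P(locus present) about {p:.2f}")
```

Even at three to four genome equivalents, a few percent of loci are expected to be absent from a library under this model, which helps explain why several complementary wheat libraries have been constructed.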

Two methods for gene enrichment sequencing, high cot analysis and hypomethylated partial restriction analysis, will be tested for their efficacy in wheat. These experiments and analyses will lay the foundation for future genomic characterizations of the wheat genome. Additional sequencing projects are expected to contribute to the endeavour to sequence the complete wheat genome. Expectations are that by 2010 the majority of the gene-rich regions of hexaploid wheat will have been sequenced. However, these expectations are likely to be modified by the rapid changes in sequencing technology and the ever reducing cost of producing sequence data.

Table 2
A summary of current plant genome sequencing projects

Project                  URL
Arabidopsis              http://www.arabidopsis.org/
Banana                   http://www.musagenomics.org/index.php
Brachypodium             http://www.jgi.doe.gov/sequencing/why/CSP2007/brachypodium.html
Brassica                 http://www.brassica.info
Capsella                 http://www.jgi.doe.gov/sequencing/why/CSP2006/AlyrataCrubella.html
Eucalyptus               http://www.ieugc.up.ac.za/
Grape                    http://www.cns.fr/externe/English/Projets/Projet_ML/projet.html/
Lotus                    http://www.kazusa.or.jp/lotus/
Maize                    http://www.maizegdb.org/sequencing_project.php
Medicago                 http://www.medicago.org/
Monkey Flower            http://www.jgi.doe.gov/sequencing/why/CSP2006/mimulus.html
Papaya                   http://cgpbr.hawaii.edu/papaya/
Poplar                   http://www.ornl.gov/sci/ipgc/
Potato                   http://www.potatogenome.net/
Rice                     http://rgp.dna.affrc.go.jp/IRGSP/
Sorghum                  http://www.jgi.doe.gov/sequencing/why/CSP2006/sorghum.html; http://www.phytozome.net/sorghum
Soybean                  http://www.shigen.nig.ac.jp/legume/legumebase/
Tomato                   http://www.sgn.cornell.edu/about/tomato_sequencing.pl
Triphysaria versicolor   http://www.jgi.doe.gov/sequencing/why/CSP2006/Triphysaria.html
Wheat                    http://www.wheatgenome.org/

2.17. Other Plant Sequencing Projects

With the decreasing cost of sequencing, sequencing projects are likely to be established for many additional plant species as well as organisms that interact with plants such as pathogens and symbionts. A brief summary of current plant genome sequencing projects is listed in Table 2 and further projects are expected to be established over the next few years.

References

1. Adams, M.D., Kelley, J.M., Gocayne, J.D., Dubnick, M., Polymeropoulos, M.H., Xiao, H., Merril, C.R., Wu, A., Olde, B., Moreno, R.F., Kerlavage, A.R., McCombie, W.R., and Venter, J.C. (1991) Complementary DNA sequencing: expressed sequence tags and human genome project. Science 252, 1651–1656.
2. The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815.
3. Yu, J., Hu, S., Wang, J., Wong, G.K., Li, S., Liu, B., Deng, Y., Dai, L., Zhou, Y., Zhang, X., Cao, M., Liu, J., Sun, J., Tang, J., Chen, Y., Huang, X., Lin, W., Ye, C., Tong, W., Cong, L., Geng, J., Han, Y., Li, L., Li, W., Hu, G., Huang, X., Li, W., Li, J., Liu, Z., Li, L., Liu, J., Qi, Q., Liu, J., Li, L., Li, T., Wang, X., Lu, H., Wu, T., Zhu, M., Ni, P., Han, H., Dong, W., Ren, X., Feng, X., Cui, P., Li, X., Wang, H., Xu, X., Zhai, W., Xu, Z., Zhang, J., He, S., Zhang, J., Xu, J., Zhang, K., Zheng, X., Dong, J., Zeng, W., Tao, L., Ye, J., Tan, J., Ren, X., Chen, X., He, J., Liu, D., Tian, W., Tian, C., Xia, H., Bao, Q., Li, G., Gao, H., Cao, T., Wang, J., Zhao, W., Li, P., Chen, W., Wang, X., Zhang, Y., Hu, J., Wang, J., Liu, S., Yang, J., Zhang, G., Xiong, Y., Li, Z., Mao, L., Zhou, C., Zhu, Z., Chen, R., Hao, B., Zheng, W., Chen, S., Guo, W., Li, G., Liu, S., Tao, M., Wang, J., Zhu, L., Yuan, L., and Yang, H. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296, 79–92.
4. Goff, S.A., Ricke, D., Lan, T.H., Presting, G., Wang, R., Dunn, M., Glazebrook, J., Sessions, A., Oeller, P., and Varma, H. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296, 92–100.
5. Jackson, S., Rounsley, S., and Purugganan, M. (2006) Comparative sequencing of plant genomes: choices to make. Plant Cell 18, 1100–1104.
6. Wortman, J.R., Haas, B.J., Hannick, L.I., Smith, R.K. Jr., Maiti, R., Ronning, C.M., Chan, A.P., Yu, C., Ayele, M., Whitelaw, M., White, O.R., and Town, C.D. (2003) Annotation of the Arabidopsis Genome. Plant Physiology 132, 461–468.
7. Meinke, D.W., Cherry, M.J., Dean, C., Rounsley, S.D., and Koornneef, M. (1998) Arabidopsis thaliana: a model plant for Genome analysis. Science 282, 662–682.
8. Sato, S., Kotani, H., Nakamura, Y., Kaneko, T., Asamizu, E., Fukami, M., Miyajima, N., and Tabata, S. (1997) Structural analysis of Arabidopsis thaliana chromosome 5. I. Sequence features of the 1.6 Mb regions covered by twenty physically assigned P1 clones. DNA Research 4, 215–219.
9. Garvin, D.F. (2007) Brachypodium: a new monocot model plant system emerges. Journal of the Science of Food and Agriculture 87, 1177–1179.
10. Hasterok, R., Marasek, A., Donnison, I.S., Armstead, I., Thomas, A., King, I.P., Wolny, E., Idziak, D., Draper, J., and Jenkins, G. (2006) Alignment of the genomes of Brachypodium distachyon and temperate cereals and grasses using bacterial artificial chromosome landing with fluorescence in situ hybridization. Genetics 173, 349–362.
11. Cannon, S.B., Sterck, L., Rombauts, S., Sato, S., Cheung, F., Gouzy, J., Wang, X., Mudge, J., Vasdewani, J., Schiex, T., Spannagl, M., Monaghan, E., Nicholson, C., Humphray, S.J., Schoof, H., Mayer, K.F.X., Rogers, J., Quétier, F., Oldroyd, G.E., Debellé, F., Cook, D.R., Retzel, E.F., Roe, B.A., Town, C.D., Tabata, S., Van de Peer, Y., and Young, N.D. (2006) Legume evolution viewed through the Medicago truncatula and Lotus japonicus genomes. PNAS USA 103, 14959–14964.
12. Young, N.D., Cannon, S.B., Sato, S., Kim, D., Cook, D.R., Town, C.D., Roe, B.A., and Tabata, S. (2005) Sequencing the genespaces of Medicago truncatula and Lotus japonicus. Plant Physiology 137, 1174–1181.
13. Cannon, S.B., Crow, J.A., Heuer, M.L., Wang, X., Cannon, E.K.S., Dwan, C., Lamblin, A., Vasdewani, J., Mudge, J., Cook, A., Gish, J., Cheung, F., Kenton, S., Kunau, T.M., Brown, D., May, G.D., Kim, D., Cook, D.R., Roe, B.A., Town, C.D., Young, N.D., and Retzel, E.F. (2005) Databases and information integration for the Medicago truncatula genome and transcriptome. Plant Physiology 138, 38–46.
14. Brunner, A.M., Busov, V.B., and Strauss, S.H. (2004) Poplar genome sequence: functional genomics in an ecologically dominant plant species. Trends in Plant Science 9, 49–56.
15. Tuskan, G.A., DiFazio, S.P., and Teichmann, T. (2004) Poplar genomics is getting popular: the impact of the poplar genome project on tree research. Plant Biology 6, 2–4.
16. Zhao, W., Wang, J., He, X., Huang, X., Jiao, Y., Dai, M., Wei, S., Fu, J., Chen, Y., and Ren, X. (2004) BGI-RIS: an integrated information resource and comparative analysis workbench for rice genomics. Nucleic Acids Research 32, D377–D382.
17. Chen, M., SanMiguel, P., de Oliveira, A.C., Woo, S.S., Zhang, H., Wing, R.A., and Bennetzen, J.L. (1997) Microcolinearity in sh2-homologous regions of the maize, rice, and sorghum genomes. PNAS USA 94, 3431–3435.
18. Yu, J., Wang, J., Lin, W., Li, S., Li, H., Zhou, J., Ni, P., Dong, W., Hu, S., and Zeng, C. (2005) The Genomes of Oryza sativa: a history of duplications. PLoS Biology 3, e38.
19. Shibata, D. (2005) Genome sequencing and functional genomics approaches in tomato. Journal of General Plant Pathology 71, 1–7.
20. Lijavetzky, D., Muzzi, G., Wicker, T., Keller, B., Wing, R., and Dubcovsky, J. (1999) Construction and characterization of a bacterial artificial chromosome (BAC) library for the A genome of wheat. Genome 42, 1176–1182.
21. Moullet, O., Zhang, H.B., and Lagudah, E.S. (1999) Construction and characterisation of a large DNA insert library from the D genome of wheat. Theoretical and Applied Genetics 99, 305–313.
22. Cenci, A., Chantret, N., Kong, X., Gu, Y., Anderson, O.D., Fahima, T., Distelfeld, A., and Dubcovsky, J. (2003) Construction and characterization of a half million clone BAC library of durum wheat (Triticum turgidum ssp. durum). Theoretical and Applied Genetics 107, 931–939.
23. Allouis, S., Moore, G., Bellec, A., Sharp, R., Faivre Rampant, P., Mortimer, K., Pateyron, S., Foote, T.N., Griffiths, S., Caboche, M., and Chalhoub, B. (2003) Construction and characterisation of a hexaploid wheat (Triticum aestivum L.) BAC library from the reference germplasm ‘Chinese Spring’. Cereal Research Communications 31, 331–338.
24. Shen, B., Wang, D.M., McIntyre, C.L., and Liu, C.J. (2005) A ‘Chinese Spring’ wheat (Triticum aestivum L.) bacterial artificial chromosome library and its use in the isolation of SSR markers for targeted genome regions. Theoretical and Applied Genetics 111, 1489–1494.
25. Nilmalgoda, S.D., Cloutier, S., and Walichnowski, A.Z. (2003) Construction and characterization of a bacterial artificial chromosome (BAC) library of hexaploid wheat (Triticum aestivum L.) and validation of genome coverage using locus-specific primers. Genome 46, 870–878.
26. Janda, J., Bartos, J., Safar, J., Kubalakova, M., Valarik, M., Cihalikova, J., Simkova, H., Caboche, M., Sourdille, P., Bernard, M., Chalhoub, B., and Dolezel, J. (2004) Construction of a subgenomic BAC library specific for chromosomes 1D, 4D and 6D of hexaploid wheat. Theoretical and Applied Genetics 109, 1337–1345.
27. Gill, B.S., Appels, R., Botha-Oberholster, A., Buell, C.R., Bennetzen, J.L., Chalhoub, B., Chumley, F., Dvorák, J., Iwanaga, M., Keller, B., Li, W., McCombie, W.R., Ogihara, Y., Quetier, F., and Sasaki, T. (2004) A workshop report on wheat genome sequencing: international genome research on wheat consortium. Genetics 168, 1087–1096.
28. Safar, J., Bartos, J., Janda, J., Bellec, A., Kubalakova, M., Valarik, M., Pateyron, S., Weiserova, J., Tuskova, R., Cihalikova, J., Vrana, J., Simkova, H., Faivre Rampant, P., Sourdille, P., Caboche, M., Bernard, M., Dolezel, J., and Chalhoub, B. (2004) Dissecting large and complex genomes: flow sorting and BAC cloning of individual chromosomes from bread wheat. Plant Journal 39, 960–968.

INDEX A Affinity tags ............................ 179–181, 184, 185, 188, 193 Affymetrix ..................................................81–84, 248, 250 Agrobacterium ..........................117, 118, 132, 133, 135, 137, 138, 140, 143, 144, 146, 155, 187, 188, 310, 346 Antibiotics ..............................................121, 124, 143, 146 Arabidopsis ................................................. 3–16, 31, 43, 44, 82, 112, 132, 153–158, 161, 163, 168, 169, 171, 181, 186, 187, 200, 205, 230, 231, 233–236, 239–241, 245, 248, 251, 261, 272, 275, 276, 278, 279, 306, 330, 339, 346, 349, 350 Array..... ....................................... 20, 22, 27, 28, 30, 34–36, 81, 83, 93, 94, 204, 210, 248, 261, 262, 276 Association mapping .......................................22, 287, 288, 292, 295, 300, 301 Auxins.....................................................120, 123, 202, 215

Chromosome sorting ..................................................... 349 Cmap........ ......................................................45–49, 51, 52 Codon usage ...................................................176, 178, 184 Commercialization ........................................305, 306, 311, 312, 314, 317–319 Comparative genomics .................................................... 57 Comparative mapping ................................43–45, 257, 350 Complexity reduction ...........................................22, 34, 36 Computational biology .......................................... 207, 244 Cosmid ...................................................................... 57, 58 cRNA 81–83 Crop species .......................................... 2, 3, 15, 16, 22, 81, 112, 156, 157, 161, 276, 284, 287, 299, 308, 310 Cryo-electron microscopy ..................................... 212–214 Crystallography, time-resolved .............................. 220, 221 Cultivar, identification ............................................... 21, 22

D B BAC library .............................. 59, 65–67, 69, 76, 351, 355 Baculovirus ............................................................ 185, 186 Barley.... ................................................. 2, 5, 12, 13, 30, 36, 45, 49–53, 69, 70, 82–84, 125, 135, 204, 207–212, 215, 216, 219–221, 249, 250, 278, 284, 322–324–333, 338, 339, 355 Beta-d-glucan ........................ 204, 208–212, 216, 219–222 Beta-glucanase....................................................... 215, 216 Bioinformatics ................................. 97, 153, 158, 167, 176, 206, 207, 244, 248, 251, 261, 280, 316, 348, 352 Biolistics ......................................... 132, 137, 138, 140, 146 BLAST (Basic local Alignment Search Tool) ................ 24, 95, 98, 99, 248–251, 270–272, 275, 277, 278 Brachypodium ......................................................5, 249, 350 Brassica............. 11, 12, 43–45, 112, 116, 153, 246, 346, 350 BSA (Bovine serum albumin)............................63, 75, 159, 235, 236, 238, 287, 288, 290, 299

2-D electrophroesis ................................93, 94, 99, 104–105 gels.................................................. 93, 94, 99, 100, 104 DArT (Diversity array technology) .......... 20, 22, 34, 36, 37 Database features ............................................................ 245–247 plant................................................................. 243–262 DDBJ (DNA Databank of Japan) ......................... 248, 273 Dicot.....................................................10–13, 15, 153, 350 Disulfide bond ................................................182, 183, 189 DNA array ......................................................................... 248 binding site .............................................................. 202 concentration ..................................................66, 73, 78 extraction ..........................................158, 159, 162–164 Domestication .......................................288, 301, 323–324, 326, 331, 333, 335, 340

E C Callus................................................. 94, 98, 112–114, 116, 117, 120–122, 126, 128, 133 Capillary electrophoresis....................... 30, 32, 34, 292, 293 cDNA libraries ................................. 23, 187, 240, 276, 346 Cereals.. ......................................... 13, 34, 45, 58, 116–118, 125, 129, 134, 150, 220, 284, 285, 322–324, 326, 328–331, 335, 339, 340, 350, 353, 355 Chaperones .....................................................182, 183, 191

EBI (European Bioinformatics Institute) ...................... 248 Ecological assessment ............................................ 312, 316 EMS (Ethyl methane sulphonate)......................... 155–157 EnsEMBL......................................................... 53–54, 248 Environmental assessment..............................314, 316–317 Environmental safety ............................................. 316–318 ESI (Electrospray ionization) .....................95, 97, 108, 109 EST sequencing ......................................6, 286, 345, 346 Eucalyptus ........................................................10, 346, 351

359

360 | Index Evolutionary biology ..................................................... 322 Evolution canyon ....................................326, 331–336, 338 Explant ..................................................112–113, 116, 117, 120, 125, 128, 144–150 Expression analysis............................................................. 231, 340 cell-free .............................................178, 190–193, 203

F FACS (Fluorescence-activated cell-sorting) .................. 230 FASTA ...........................................................256, 277, 279 Field trials....................................... 298, 299, 310, 313, 318 FISH (Fluorescent in situ hybridisation)....................... 122 Flowering ..............................4, 5, 8, 10, 11, 13, 15, 44, 335 454 FLX ........................................................................ 347 Functional annotation ............ 176, 206, 264, 265, 272–277 Functional genomics ...................... 132–134, 154, 157–158, 176, 189, 199–223, 250, 262, 308, 309

G
GEM (Gene expression markers) ..... 82, 84, 85, 90
GenBank ..... 25, 245–247, 256, 265, 267–269, 273, 277, 280
GeneChip ..... 83
Gene expression
    markers ..... 82
    stable ..... 186
    transient ..... 131, 186
Genes
    colinearity ..... 12
    expression ..... 6, 16, 22, 81, 82, 90, 112, 122, 131, 134, 150, 185, 186, 229–241, 248, 250, 276, 306, 309, 311, 346, 348
    function ..... 8, 12, 154, 155, 162, 168, 200, 222, 272–274, 276, 308, 309
    heterologous ..... 131, 132, 182, 305
    ontology ..... 7, 274
    orthologous ..... 12, 206, 275
    prediction ..... 46, 269–271, 277, 280, 352, 353
    silencing ..... 8, 132, 134
    structure ..... 9, 248, 249, 264, 269, 272, 308, 309
    synteny ..... 12
Genetics
    forward ..... 155, 156, 161–167
    map ..... 20–22, 41–54, 247, 291, 322, 330, 351, 354
    mapping ..... 20, 21, 41–45, 324
    reverse ..... 2, 156–157, 162, 167–170
    structure ..... 329, 339
    transformation ..... 111, 112, 339, 340
    variation ..... 7, 23, 24, 333, 336, 340
Genome
    annotation ..... 206, 263–280, 352, 353
    browser ..... 268, 275, 277, 279, 351
    coverage ..... 59, 78, 355
    sequencing ..... 21, 23, 42, 43, 57, 58, 205, 207, 219, 264, 267, 271, 277, 345–357
    walking ..... 160, 161, 165–166
Genomics
    diversity ..... 322, 324–330
    ecological ..... 322–340
Genotyping
GFP (Green fluorescent protein) ..... 122, 124, 135, 146, 149, 150, 230
Glycoside hydrolases ..... 176, 212, 219
Glycosylation ..... 132, 182, 184, 186, 187, 189, 191, 209
Glyphosate tolerance ..... 307, 308
GMOD (Generic Model Organism Database) ..... 277
GoldenGate assay ..... 28–30
Gold particles ..... 114, 118, 119, 123–127, 138, 140–142, 145, 147–149
GrainGenes ..... 246, 247, 249, 253–256, 258
Gramene ..... 45, 46, 49, 50, 53, 247, 249, 254, 256, 257
Grape ..... 346, 351
GRIN (Germplasm Resources Information Network) ..... 250
GUS (Beta glucuronidase) ..... 117, 122, 128, 132, 133, 135, 140, 144, 146, 149, 150

H
Haplotype ..... 20, 24, 25, 86, 90, 292, 326
Herbicides ..... 124, 212, 307
Heterologous expression ..... 176–178, 181–189, 202, 203, 217, 223

I
ICIS ..... 250
IEF (Isoelectric focusing) ..... 93, 94, 96, 101, 103, 104
Infinium assay ..... 27–30
In situ hybridization ..... 229–236, 238, 239, 337
Internet ..... 24, 243, 244, 247, 250, 254, 261
Invader assay ..... 26
Isoelectric focusing ..... 93, 94, 96, 101, 103, 104

L
LCM (Laser capture micro-dissection) ..... 230, 231
LC-MS/MS ..... 100
LD decay ..... 288, 289, 292, 301
Library construction ..... 58, 59, 66, 67, 69
Linkage disequilibrium ..... 22, 287, 292
Lotus ..... 153, 346, 351

M
MADS box ..... 13, 44
Maize ..... 2, 12, 45, 53, 124, 134, 153, 157, 245, 251, 271, 272, 275, 276, 307, 339, 346, 348, 351
MALDI ..... 30–31, 95–97, 99, 102, 107–109
Map
    physical ..... 42, 45, 46, 49, 247, 354
Marker
    molecular ..... 6, 19–23, 36, 41, 43, 49–53, 155, 249, 284, 286, 288–292, 296–298, 300, 301, 326, 330, 333, 340
    technology ..... 30, 32, 34
    validation ..... 295, 301
Marker-assisted selection ..... 19, 21, 23, 36, 42, 284, 295–296, 301, 330, 340
Marketplace acceptance ..... 319–320
MAS (Marker-assisted selection) ..... 19, 21, 23, 36, 42, 284, 296, 301
Mascot ..... 97–100
Mass spectrometry ..... 26, 30, 94–100, 102, 108, 109, 113, 114, 122, 136, 137, 144, 145, 232, 235
Medicago ..... 5, 153, 346, 352
Methyl filtration ..... 348
Microarrays ..... 6, 22, 30, 36, 81, 82, 230, 231, 250, 262, 276, 333, 340, 393
Microsatellite ..... 37, 253, 289, 297, 326, 327, 331–336, 339
Microscopy ..... 212, 231, 233, 238–239
Model species ..... 1, 2, 5, 8–16, 112, 157, 230
Molecular
    breeding ..... 27, 284, 285, 293, 294, 297, 312
    markers ..... 6, 19–23, 36, 41, 43, 49–53, 155, 249, 284, 286, 288–292, 296–298, 300, 301, 326, 330, 333, 340
Monocot ..... 4, 5, 10–13, 15, 16, 153, 168, 278, 346
Mosses ..... 189
Multiallelic ..... 21, 330
Multiplex ..... 22, 26, 28, 31, 34, 35, 293
Mutagenesis
    chemical ..... 154, 155, 157
    physical ..... 155
Mutagenised populations ..... 155, 156, 161

N
NCBI (National Center for Biotechnology Information) ..... 246–248, 254, 256, 259, 271–273, 277, 280, 351, 352
NCGR (National Center for Genome Resources) ..... 251
NMR (Nuclear magnetic resonance) ..... 191, 202, 203
Northern blotting ..... 230
Nuclei ..... 58, 62–63, 67–69, 77

O
Oligonucleotide ligation ..... 32–33
Ontology ..... 7, 274

P
PACs (P1-derived artificial chromosomes) ..... 58, 63
Papaya ..... 310, 346, 352
Particle bombardment ..... 114–115, 117–120, 132, 141–143, 187
PCR primer ..... 21, 25, 28, 31, 32
Peptide
    mass ..... 95, 96
    sequence ..... 95, 97, 100, 278
pIndigoBAC-5 ..... 59, 60
Plant breeding ..... 2, 8, 11, 14, 16, 19, 22, 36, 283–302, 305, 306
PLEXdB (Plant Expression Database) ..... 250
Polymorphism ..... 6, 20, 21, 23–25, 42, 81, 82, 84, 85, 90, 157, 220, 262, 284, 286, 288–290, 292, 294–296, 300, 322, 326, 329, 332, 339
Poplar ..... 4, 205, 245, 259, 346, 352–353
Population structure ..... 287, 288, 292, 301
Positional cloning ..... 21, 25, 57, 58
Post-translation modification ..... 186
Potato ..... 133, 189, 251, 310, 346, 353, 354
Probes
    mismatch ..... 82
    perfect match ..... 82
Protein
    expression ..... 132, 175–193, 212
    folding ..... 181, 183, 186, 191, 192
    membrane ..... 184, 190–192, 204, 212–214
    sequence ..... 8, 9, 95, 200, 205, 219, 247, 248, 272, 273, 277, 279
    structure ..... 200, 202–205, 207, 219, 220, 249
Proteinase K ..... 62, 78, 232, 237, 238, 241
Protein-protein interaction ..... 177, 202, 214–215, 219, 222, 223
Proteomics ..... 94, 99, 101, 207, 218, 352
Protoplasts ..... 116, 117, 131–134, 230

Q
QTL mapping ..... 6, 295, 299, 333, 335, 336, 339

R
Recalcitrant ..... 132
Regeneration media ..... 115
Regulatory ..... 6, 9, 134, 264, 306, 310–315, 317–320, 322, 339, 346, 348
Repetitive sequences ..... 270, 354
Rice genomics ..... 3
RNA
    extraction ..... 160, 166–167, 230
    interference ..... 154, 157
RNAP ..... 233, 240
RNase ..... 161, 162, 164, 167, 170, 231, 240, 241
RoundUp ..... 307–310, 313, 315–317, 319
RT-PCR ..... 160–163, 167, 169, 230, 276

S
SAGE (Serial analysis of gene expression) ..... 231
Sanger ..... 23, 25, 278
SBE (Single base extension) ..... 30
Scutella ..... 116, 117, 121, 123, 125, 128, 140, 145, 147
SDS-PAGE ..... 93, 94, 101
Selection media ..... 115
Sequence alignment ..... 37, 90, 161, 167, 206
Single feature polymorphism (SFP) ..... 82, 84, 85, 90, 91, 286
siRNA ..... 111
SNP (Single-nucleotide polymorphisms) ..... 6, 20–21, 23–28, 30–33, 37, 42, 82, 86, 90, 262, 286, 322, 326, 335, 348
SOL ..... 251
Solanaceae ..... 11, 12, 116, 134, 251, 354
Solexa ..... 23, 347
SOLiD ..... 23, 346–348
Sorghum genome ..... 4, 275
Southern hybridisation ..... 159–161, 165, 170
Speciation ..... 44, 206, 322, 323, 331, 333, 336–338
SQL (Structured Query Language) ..... 254, 255
Stakeholder ..... 306, 318–320
Stewardship ..... 306, 317, 318, 320
Structure(al)
    annotation ..... 264, 265, 269–272
    biology ..... 199–223
    phylogenomics ..... 176, 206
Synteny ..... 11–12, 20, 41–54, 350, 353

T
TAIR (The Arabidopsis Information Resource) ..... 5–7, 248, 349
T-DNA ..... 6, 7, 111, 133, 135, 145, 154, 155, 161–163, 165–171
TIGR (The Institute for Genomic Research) ..... 248, 256, 258–260, 271, 275, 280
Tiling path ..... 349, 351, 353, 354
TILLING ..... 157, 158, 167, 171
Tomato ..... 2, 12, 36, 153, 158, 310, 346, 354–355
Totipotency ..... 112
Transformation
    system ..... 155
    transient ..... 131–150
Transgene ..... 6–8, 14, 112, 120, 131–134, 185
Transgenic
    crop ..... 305, 308, 311, 313, 314
    plants ..... 14, 27, 112, 122, 177, 306, 308, 310–313
Trypsin digestion ..... 106

V
Vaccines ..... 178, 188
VIGS (Virus-induced gene silencing) ..... 132, 134
Virus vector ..... 133

W
Wheat
    emmer ..... 325, 326, 328–330, 333, 335, 336

Y
YACs (Yeast artificial chromosomes) ..... 58

Z
Z-score ..... 88–89