Plant genome assembly

A plant genome assembly represents the complete genomic sequence of a plant species, assembled into chromosomes and organellar genomes from DNA (deoxyribonucleic acid) fragments obtained with different types of sequencing technology.

Structure

Plant genomes vary in structure and complexity, from small genomes such as those of green algae (15 Mbp)[1] to very large and complex genomes that typically have much higher ploidy, higher rates of heterozygosity and more repetitive elements than species from other kingdoms.[2] One of the most complex plant genome assemblies available is that of loblolly pine (22 Gbp).[3] Because of this complexity, plant genome sequences cannot be assembled back into chromosomes using only the short reads provided by next-generation sequencing (NGS) technologies,[4][5] and therefore most available plant genome assemblies that used NGS alone are highly fragmented, contain large numbers of contigs, and have unfinished genome regions. Highly repetitive sequences, often longer than 10 kbp, are the main challenge in plants.[6][7] Most of the chromosomal sequences are produced by the activity of mobile genetic elements (MGEs) in plant genomes.[8] MGEs are divided into two classes: class I, or retrotransposons, and class II, or DNA transposons. In plants, long terminal repeat (LTR) retrotransposons are predominant and constitute from 15%[9] to 90% of the genome.[10] Polyploidy is another challenge in assembling a plant genome; an estimated ≈80% of plants are polyploids.[11]

Assemblies

The first complete plant genome assembly, that of Arabidopsis thaliana, was finished in 2000,[12] making it the third multicellular eukaryotic genome published, after C. elegans[13] and D. melanogaster.[14] Unlike many other plants (e.g. Malus), Arabidopsis has convenient traits such as a small nuclear genome (135 Mbp) and a short generation time (8 weeks from seed to seed). The genome has five chromosomes and is approximately 4% of the size of the human genome. It was sequenced and annotated by the Arabidopsis Genome Initiative (AGI).

The initiative to sequence the genome of rice (Oryza sativa)[15] began in September 1997, when scientists from many nations agreed to an international collaboration, forming the International Rice Genome Sequencing Project (IRGSP). At an estimated size of between 400 and 430 Mb, approximately four times larger than that of A. thaliana, the rice genome is the smallest of the major cereal crop genomes.[15]

Between 2000 and 2008, a total of 10 plant genomes were published, while in 2012 alone 13 plant genomes were published. The number has increased steadily since then, and more than 400 plant genomes are now available in the NCBI genome database, of which 72 have been re-annotated [NCBI].

Databases

Ensembl Plants[16] is part of the Ensembl Genomes database and contains resources for a limited number of sequenced plant species (45 as of October 2017). It mainly provides genome sequences, gene models, functional annotations and polymorphic loci. For some plant species, additional information is provided, including population structure, individual genotypes, linkage, and phenotype data.

Gramene[17] is an online database resource for plant comparative genomics and pathway analysis, based on Ensembl technology.

Plant Genome DataBase Japan[18] (PGDBj) is a website that contains information related to genomes of model and crop plants from databases. It has three main components: ortholog db, DNA marker and linkage map db, and plant resource db, where multiple plant resources accumulated by different institutes are integrated. The aim is "to provide a platform, enabling comparative searches of different resources" (pgdbj.jp).

PlantsDB[19] is a resource for analysing and storing genetic and genomic information from various plants, and offers tools to query these data and to perform comparative analysis with the help of in-house tools.

PLAZA[20][21] is another online resource for comparative genomics that integrates plant sequence data and comparative genomic methods, and performs evolutionary analysis within the green plant lineage (Viridiplantae).

The Arabidopsis Information Resource (TAIR)[22] maintains a web database of the "model higher plant Arabidopsis thaliana".

Assembly strategies

In general, different strategies are used for sequencing and assembling large and complex genomes such as those of plants, depending on the technologies available when the project started.

Sanger clone-by-clone

Clone-by-clone sequencing strategies are based on the construction of a map for each chromosome before the sequencing, and rely on libraries made from large-insert clones. The most common type of large-insert clone is the bacterial artificial chromosome (BAC).

With BAC, the genome is first split into smaller pieces with the location recorded. The pieces of DNA are then inserted into BAC clones that are further multiplied by inserting them into bacterial cells that grow very fast. These pieces are further fragmented into overlapping smaller pieces that are placed into a vector and then sequenced. The small pieces are then assembled into contigs by overlapping them. Next, using the map from the first step the contigs are assembled back into the chromosomes.
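The step in which small sequenced pieces are merged into contigs by their overlaps can be illustrated with a toy sketch. It assumes error-free reads from a single strand and uses a simple greedy merge of the longest suffix-prefix overlap; the function names `overlap` and `greedy_assemble` are hypothetical, and real assemblers additionally handle sequencing errors, repeats and reverse complements.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the longest such overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the longest overlap.

    Stops when no pair overlaps by at least `min_len` bases and returns
    the remaining sequences (the contigs).
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        # Append the non-overlapping tail of the second sequence to the first.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    fragments = ["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]
    print(greedy_assemble(fragments))  # ['ATGGCGTACGTTAGC']
```

Greedy merging is chosen here only because it is short; in repetitive regions it can join the wrong pieces, which is one reason the clone-by-clone approach keeps each assembly problem small.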

The first complete plant genome assembly (also the first plant genome published) to use this technique was that of Arabidopsis thaliana, in 2000.[12] Different large-insert libraries, including BACs, P1 artificial chromosomes (PACs), yeast artificial chromosomes (YACs) and transformation-competent artificial chromosomes (TACs), were combined to assemble the genome. Physical maps were constructed from restriction fragment fingerprints of the clones, by comparing the fingerprint patterns and by hybridization or polymerase chain reaction (PCR). The physical maps were integrated with genetic maps to identify contig positions and orientations. End sequences from 47,788 BAC clones were used to extend contigs from anchored BACs and to select a minimum tiling path. A total of 1,569 clones in the minimum tiling path were selected and sequenced. Direct PCR products were used to clone the remaining gaps, and YACs allowed the characterization of telomere sequences. The sequenced regions covered 115.4 Mb of the 125 Mb predicted genome size and contained a total of 25,498 protein-coding genes.
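Selecting a minimum tiling path can be sketched as an interval-cover problem. The sketch below is an illustration under simplifying assumptions, not the procedure used by the Arabidopsis Genome Initiative: each clone is reduced to a hypothetical `(name, start, end)` tuple on the physical map, and a greedy rule always picks the clone that starts within the covered region and reaches furthest. Real tiling paths also require a minimum overlap between neighbouring clones, which is ignored here.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Return names of a smallest set of clones covering the region.

    `clones` is a list of (name, start, end) tuples in map coordinates.
    Raises ValueError if the clones leave a gap inside the region.
    """
    ordered = sorted(clones, key=lambda c: c[1])  # sort by start position
    path = []
    covered = region_start  # everything left of this point is covered
    i = 0
    while covered < region_end:
        best = None
        # Among clones starting inside the covered part, keep the one
        # that extends furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best[0])
        covered = best[2]
    return path


if __name__ == "__main__":
    clones = [
        ("A", 0, 100),
        ("B", 50, 120),
        ("C", 90, 200),
        ("D", 150, 260),
        ("E", 190, 300),
    ]
    print(minimum_tiling_path(clones, 0, 300))  # ['A', 'C', 'E']
```

In this example, clones B and D are redundant because C and E reach further from the same covered region, so only three of the five clones would need to be sequenced.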

To sequence and assemble the genome of Oryza sativa (japonica),[15] the same strategy was used. For Oryza sativa a total of 3,401 mapped clones in a minimum tiling path were selected from the physical map and assembled.

Maize (Zea mays), one of the most important crops in the world, is the last plant genome project based primarily on the Sanger BAC-by-BAC strategy.[23] The maize genome, at 2.3 Gb in 10 chromosomes,[23] is significantly larger than those of rice and Arabidopsis.[23] To assemble it, a set of 16,848 minimally overlapping BAC clones derived from combinations of physical and genetic maps was selected and sequenced. The assembly also drew on external data: cDNA sequences, sequences from methyl-filtered DNA libraries (libraries that exploit the fact that bases in genic sequences tend to be less heavily methylated than those in non-genic regions) and sequences obtained with high-C0t techniques.

The Sanger clone-by-clone strategy has the advantage of working in small units, which reduces complexity and computational requirements and minimizes problems associated with the misassembly of highly repetitive DNA. It is therefore an attractive solution for assembling plant genomes and other complex eukaryotic genomes. The main disadvantages of this method are its cost and the resources required. The cost of the first plant genome assemblies was estimated at between 70 million dollars[24] and 200 million dollars per assembly.[25]

Sanger whole-genome shotgun (WGS)

In the WGS strategy, the fragments are sequenced in no particular order. The DNA is randomly sheared, and the cloned fragments are sequenced and assembled using computational methods. This approach reduces the cost and time associated with constructing maps and relies instead on computational resources.

A considerable number of important plant genomes, such as those of grapevine (Vitis vinifera),[26] papaya (Carica papaya)[27] and cottonwood (Populus trichocarpa),[28] were sequenced and assembled with the Sanger WGS strategy.

The draft genome of grapevine[26] was the fourth genome published for a flowering plant and the first for a fruit crop. The sequences were obtained from different types of libraries: plasmids, fosmids and BACs. All the data were generated by paired-end sequencing of cloned inserts using Sanger technology on ABI 3730xl sequencers. In total, 6.2 million paired-end tag reads were produced. The reads were assembled with Arachne (2002),[29] software designed to analyze reads obtained from both ends of plasmid clones. It produced 20,784 contigs that were combined into 3,830 supercontigs, with an N50 value of 64 kb. The supercontigs had a total size of 498 Mb.
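The N50 value quoted for an assembly is the length of the contig (or supercontig) at which the running total of lengths, taken from longest to shortest, first reaches half of the total assembly size. A minimal sketch of that calculation follows; the function name `n50` and the example lengths are illustrative and are not the grapevine data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the
    length of the sequence at which the cumulative sum first reaches
    at least half of the total length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no sequence lengths given")


if __name__ == "__main__":
    contig_lengths_kb = [80, 70, 50, 40, 30, 20]  # total 290 kb
    print(n50(contig_lengths_kb))  # 70
```

A larger N50 indicates that more of the assembly sits in long sequences, which is why it is commonly reported next to the contig and supercontig counts.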

The supercontigs were anchored along the genome by first joining them using paired BAC end sequences. The resulting ultracontigs and the remaining supercontigs were then aligned along the genetic map of the genome. Later improvements of this strategy enabled the sequencing of Brachypodium distachyon,[30] Sorghum bicolor[31] and soybean.[32]

Next-generation sequencing

Because it is relatively cheap in comparison with previous methods, most recent plant genomes have been sequenced and assembled using data from next-generation sequencing (NGS) technology. In general, NGS data are used in combination with Sanger sequencing or with long reads obtained from third-generation sequencing.

The genome of the cucumber (Cucumis sativus)[33] was one of the plant genomes that combined Illumina NGS reads with Sanger sequences. High-quality base pairs amounting to 72.2-fold genome coverage were generated, of which Sanger reads provided 3.9-fold coverage and Illumina GA reads provided 68.3-fold coverage. From these data, two assemblies were produced according to the sequencing technology. The resulting contigs were compared with each other, giving an assembled genome with a total length of 243.5 Mb. This is about 30% smaller than the genome size estimated by flow cytometry of isolated nuclei stained with propidium iodide (367 Mb). A genetic map was constructed to anchor the assembled genome, and 72.8% of the assembled sequences were successfully anchored onto the seven chromosomes.

Another plant genome that combined NGS with Sanger sequencing was that of Theobroma cacao (2010),[34] an economically important tropical fruit tree crop and the primary source of cocoa. The genome was sequenced by a consortium, the International Cocoa Genome Sequencing consortium (ICGS), which produced a total of 17.6 million 454 single-end reads, 8.8 million 454 paired-end reads, 398.0 million Illumina paired-end reads and about 88,000 Sanger BAC reads. First, the genome assembly software Newbler was used to produce an assembly of 25,912 contigs and 4,792 scaffolds from the Roche/454 and Sanger raw data. This assembly had a total length of 326.9 Mb, which represents 76% of the estimated genome size. The Illumina reads were then used to complement the 454 assembly, by aligning the short reads to the cacao genome assembly using the SOAP software.

A similar strategy combining NGS reads and Sanger sequencing was used for other important plant species, such as the first published apple genome (Malus domestica),[35] cotton (Gossypium raimondii),[36] the draft genome of sweet orange (Citrus sinensis)[37] and the domesticated tomato (Solanum lycopersicum) genome.[38]
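The coverage and size figures quoted for cucumber and cacao can be checked with simple arithmetic. The sketch below uses only numbers stated in the text; the helper names are hypothetical, and the cacao genome size it prints is merely what the quoted 76% implies, not a figure taken from the cited paper.

```python
def combined_coverage(*fold_coverages: float) -> float:
    """Fold coverages from independent read sets add up."""
    return sum(fold_coverages)


def assembly_shortfall(assembled_mb: float, estimated_mb: float) -> float:
    """Fraction by which an assembly is smaller than the estimated genome."""
    return 1.0 - assembled_mb / estimated_mb


def implied_genome_size(assembled_mb: float, fraction_assembled: float) -> float:
    """Genome size implied by an assembly covering a given fraction of it."""
    return assembled_mb / fraction_assembled


if __name__ == "__main__":
    # Cucumber: 3.9-fold Sanger plus 68.3-fold Illumina coverage.
    print(f"{combined_coverage(3.9, 68.3):.1f}-fold")  # 72.2-fold

    # Cucumber: 243.5 Mb assembled versus 367 Mb estimated by flow cytometry.
    print(f"{assembly_shortfall(243.5, 367.0):.1%} smaller")  # 33.7% smaller

    # Cacao: 326.9 Mb assembled, stated to be 76% of the estimated size.
    print(f"{implied_genome_size(326.9, 0.76):.0f} Mb")  # 430 Mb
```

The cucumber shortfall works out to roughly one third, in line with the "about 30%" given above.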

Third-generation

With the emergence of third-generation sequencing (TGS), some of the limitations of previous methods of sequencing and assembling plant genomes have started to be addressed. This technology is characterized by the parallel sequencing of single molecules of DNA, which yields sequences up to 54 kbp in length (PacBio RS II).[39] In general, long reads from TGS have relatively high error rates (≈10% on average),[40] and therefore repeated sequencing of the same DNA fragments is required. The price of the technology is still quite high, so it is generally used in combination with short reads from NGS.

One of the first plant genomes to use long reads from TGS (Pacific Biosciences) in combination with short reads from NGS was that of spinach,[41] which has an estimated genome size of 989 Mb. For this, 60× coverage of the genome was generated, with 20% of the reads longer than 20 kb. The data were assembled using PacBio's hierarchical genome assembly process (HGAP),[42] and the long-read assembly showed a 63-fold improvement in contig size over an Illumina-only assembly.

Another recently published plant genome that used long reads in combination with short reads is the improved assembly of the apple genome.[43] This project used a hybrid approach, combining data from several sequencing technologies: PacBio RS II reads, Illumina paired-end (PE) reads and Illumina mate-pair (MP) reads. As a first step, the Illumina paired-end reads were assembled with the well-known de novo assembly software SOAPdenovo.[44] Then the hybrid assembly pipeline DBG2OLC[45] was used to combine the contigs obtained in the first step with the PacBio long reads. The assembly was polished with the Illumina paired-end reads, which were mapped to the contigs using BWA-MEM.[46] The assembly was then scaffolded by mapping the mate-pair reads onto the corrected contigs. Next, BioNano[47] optical maps with a total length of 649.7 Mb were used in the hybrid assembly pipeline together with the scaffolds obtained in the previous step. The resulting scaffolds were anchored to a genetic map constructed from 15,417 single-nucleotide polymorphism (SNP) markers. RNA sequencing (RNA-seq) data were used to better characterize the number and diversity of the genes identified. The resulting genome has a size of 643.2 Mb, closer to the estimated genome size than the previously published assembly,[35] and a smaller number of protein-coding genes.
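Why repeated sequencing compensates for a per-read error rate of about 10% can be illustrated with a binomial estimate. The sketch assumes that errors at a given position are independent between reads and, pessimistically, that all erroneous reads agree with each other, so a majority-vote consensus is wrong only when more than half of the reads are wrong. Real long-read errors are largely insertions and deletions and consensus tools work on alignments, so this is a simplified model; the function name `majority_error` is hypothetical.

```python
from math import comb


def majority_error(per_read_error: float, depth: int) -> float:
    """Probability that more than half of `depth` reads are wrong.

    Models each read as independently wrong with probability
    `per_read_error` at the position of interest (binomial model).
    """
    needed = depth // 2 + 1  # smallest number of wrong reads forming a majority
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error) ** (depth - k)
        for k in range(needed, depth + 1)
    )


if __name__ == "__main__":
    for depth in (1, 3, 5, 9, 15):
        print(f"depth {depth:2d}: consensus error = {majority_error(0.10, depth):.2e}")
```

Under these assumptions the consensus error falls from 10% with a single read to 2.8% with three reads and below 1% with five, which suggests why moderate read depth is enough to correct most random errors.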

The use of long reads in plant genome assemblies has become more popular because it reduces the number of scaffolds and increases the quality of the genome, improving the assembly and coverage in regions that are not clearly resolved by NGS assembly.

References

  1. ^ Moreau H, Verhelst B, Couloux A, Derelle E, Rombauts S, Grimsley N, et al. (August 2012). "Gene functionalities and genome structure in Bathycoccus prasinos reflect cellular specializations at the base of the green lineage". Genome Biology. 13 (8): R74. doi:10.1186/gb-2012-13-8-r74. PMC 3491373. PMID 22925495.
  2. ^ Gregory TR (January 2005). "The C-value enigma in plants and animals: a review of parallels and an appeal for partnership". Annals of Botany. 95 (1): 133–146. doi:10.1093/aob/mci009. PMC 4246714. PMID 15596463.
  3. ^ Zimin A, Stevens KA, Crepeau MW, Holtz-Morris A, Koriabine M, Marçais G, et al. (March 2014). "Sequencing and assembly of the 22-gb loblolly pine genome". Genetics. 196 (3): 875–890. doi:10.1534/genetics.113.159715. PMC 3948813. PMID 24653210.
  4. ^ Deschamps S, Campbell MA (2010-04-01). "Utilization of next-generation sequencing platforms in plant genomics and genetic variant discovery". Molecular Breeding. 25 (4): 553–570. doi:10.1007/s11032-009-9357-9. S2CID 29239452.
  5. ^ Shendure J, Ji H (October 2008). "Next-generation DNA sequencing". Nature Biotechnology. 26 (10): 1135–1145. doi:10.1038/nbt1486. PMID 18846087. S2CID 6384349.
  6. ^ Treangen TJ, Salzberg SL (November 2011). "Repetitive DNA and next-generation sequencing: computational challenges and solutions". Nature Reviews. Genetics. 13 (1): 36–46. doi:10.1038/nrg3117. PMC 3324860. PMID 22124482.
  7. ^ Harrison GE, Heslop-Harrison JS (February 1995). "Centromeric repetitive DNA sequences in the genus Brassica". Theoretical and Applied Genetics. 90 (2): 157–165. doi:10.1007/BF00222197. PMID 24173886. S2CID 20591213.
  8. ^ Lanciano S, Carpentier MC, Llauro C, Jobet E, Robakowska-Hyzorek D, Lasserre E, et al. (February 2017). "Sequencing the extrachromosomal circular mobilome reveals retrotransposon activity in plants". PLOS Genetics. 13 (2): e1006630. doi:10.1371/journal.pgen.1006630. PMC 5338827. PMID 28212378.
  9. ^ Michael TP, VanBuren R (April 2015). "Progress, challenges and the future of crop genomes". Current Opinion in Plant Biology. 24: 71–81. Bibcode:2015COPB...24...71M. doi:10.1016/j.pbi.2015.02.002. PMID 25703261.
  10. ^ Flavell RB, Gale MD, O'dell M, Murphy G, Moore G, Lucas H (1993). "Molecular organization of genes and repeats in the large cereal genomes and implications for the isolation of genes by chromosome walking". Chromosomes Today. Dordrecht: Springer. pp. 199–213. doi:10.1007/978-94-011-1510-0_16. ISBN 9789401046602.
  11. ^ Meyers LA, Levin DA (June 2006). "On the abundance of polyploids in flowering plants". Evolution; International Journal of Organic Evolution. 60 (6): 1198–1206. doi:10.1554/05-629.1. PMID 16892970. S2CID 198156503.
  12. ^ a b The Arabidopsis Genome Initiative (December 2000). "Analysis of the genome sequence of the flowering plant Arabidopsis thaliana". Nature. 408 (6814): 796–815. Bibcode:2000Natur.408..796T. doi:10.1038/35048692. PMID 11130711.
  13. ^ The C. elegans Sequencing Consortium (December 1998). "Genome sequence of the nematode C. elegans: a platform for investigating biology". Science. 282 (5396): 2012–2018. Bibcode:1998Sci...282.2012.. doi:10.1126/science.282.5396.2012. JSTOR 2897605. PMID 9851916.
  14. ^ Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, et al. (March 2000). "The genome sequence of Drosophila melanogaster". Science. 287 (5461): 2185–2195. Bibcode:2000Sci...287.2185.. CiteSeerX 10.1.1.549.8639. doi:10.1126/science.287.5461.2185. PMID 10731132.
  15. ^ a b c Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, et al. (April 2002). "A draft sequence of the rice genome (Oryza sativa L. ssp. japonica)". Science. 296 (5565): 92–100. Bibcode:2002Sci...296...92G. doi:10.1126/science.1068275. PMID 11935018. S2CID 2960202.
  16. ^ Bolser D, Staines DM, Pritchard E, Kersey P (2016). "Ensembl Plants: Integrating Tools for Visualizing, Mining, and Analyzing Plant Genomics Data". Plant Bioinformatics. Methods in Molecular Biology. Vol. 1374. Humana Press, New York, NY. pp. 115–140. doi:10.1007/978-1-4939-3167-5_6. ISBN 9781493931668. PMID 26519403.
  17. ^ Gupta P, Naithani S, Tello-Ruiz MK, Chougule K, D'Eustachio P, Fabregat A, et al. (November 2016). "Gramene Database: Navigating Plant Comparative Genomics Resources". Current Plant Biology. 7–8: 10–15. Bibcode:2016CPBio...7...10G. doi:10.1016/j.cpb.2016.12.005. PMC 5509230. PMID 28713666.
  18. ^ Nakaya A, Ichihara H, Asamizu E, Shirasawa S, Nakamura Y, Tabata S, Hirakawa H (2017). "Plant Genome DataBase Japan (PGDBJ)". Plant Genomics Databases. Methods in Molecular Biology. Vol. 1533. New York, NY: Humana Press. pp. 45–77. doi:10.1007/978-1-4939-6658-5_3. ISBN 9781493966561. PMID 27987164.
  19. ^ Spannagl M, Nussbaumer T, Bader K, Gundlach H, Mayer KF (2017). "PGSB/MIPS PlantsDB Database Framework for the Integration and Analysis of Plant Genome Data". Plant Genomics Databases. Methods in Molecular Biology. Vol. 1533. New York, NY: Humana Press. pp. 33–44. doi:10.1007/978-1-4939-6658-5_2. ISBN 9781493966561. PMID 27987163.
  20. ^ Vandepoele K (2017). "A Guide to the PLAZA 3.0 Plant Comparative Genomic Database". Plant Genomics Databases. Methods in Molecular Biology. Vol. 1533. Humana Press, New York, NY. pp. 183–200. doi:10.1007/978-1-4939-6658-5_10. ISBN 9781493966561. PMID 27987171.
  21. ^ Van Bel M, Silvestri F, Weitz EM, Kreft L, Botzki A, Coppens F, Vandepoele K (January 2022). "PLAZA 5.0: extending the scope and power of comparative and functional genomics in plants". Nucleic Acids Research. 50 (D1): D1468–D1474. doi:10.1093/nar/gkab1024. PMC 8728282. PMID 34747486.
  22. ^ Reiser L, Berardini TZ, Li D, Muller R, Strait EM, Li Q, et al. (2016-01-01). "Sustainable funding for biocuration: The Arabidopsis Information Resource (TAIR) as a case study of a subscription-based funding model". Database. 2016: baw018. doi:10.1093/database/baw018. PMC 4795935. PMID 26989150.
  23. ^ a b c Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, et al. (November 2009). "The B73 maize genome: complexity, diversity, and dynamics". Science. 326 (5956): 1112–1115. Bibcode:2009Sci...326.1112S. doi:10.1126/science.1178534. PMID 19965430. S2CID 21433160.
  24. ^ Feuillet C, Leach JE, Rogers J, Schnable PS, Eversole K (February 2011). "Crop genome sequencing: lessons and rationales". Trends in Plant Science. 16 (2): 77–88. Bibcode:2011TPS....16...77F. doi:10.1016/j.tplants.2010.10.005. PMID 21081278.
  25. ^ Saegusa A (April 1999). "US firm's bid to sequence rice genome causes stir in Japan". Nature. 398 (6728): 545. Bibcode:1999Natur.398..545S. doi:10.1038/19123. PMID 10217128.
  26. ^ a b Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, et al. (September 2007). "The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla". Nature. 449 (7161): 463–467. Bibcode:2007Natur.449..463J. doi:10.1038/nature06148. hdl:11577/2430527. PMID 17721507.
  27. ^ Ming R, Hou S, Feng Y, Yu Q, Dionne-Laporte A, Saw JH, et al. (April 2008). "The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus)". Nature. 452 (7190): 991–996. Bibcode:2008Natur.452..991M. doi:10.1038/nature06856. PMC 2836516. PMID 18432245.
  28. ^ Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, et al. (September 2006). "The genome of black cottonwood, Populus trichocarpa (Torr. & Gray)". Science. 313 (5793): 1596–1604. Bibcode:2006Sci...313.1596T. doi:10.1126/science.1128691. PMID 16973872. S2CID 7717980.
  29. ^ Swan KA, Curtis DE, McKusick KB, Voinov AV, Mapa FA, Cancilla MR (July 2002). "High-throughput gene mapping in Caenorhabditis elegans". Genome Research. 12 (7): 1100–1105. doi:10.1101/gr.208902. PMC 186621. PMID 12097347.
  30. ^ The International Brachypodium Initiative; et al. (February 2010). "Genome sequencing and analysis of the model grass Brachypodium distachyon". Nature. 463 (7282): 763–768. Bibcode:2010Natur.463..763T. doi:10.1038/nature08747. PMID 20148030.
  31. ^ Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, et al. (January 2009). "The Sorghum bicolor genome and the diversification of grasses". Nature. 457 (7229): 551–556. Bibcode:2009Natur.457..551P. doi:10.1038/nature07723. PMID 19189423.
  32. ^ Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, et al. (January 2010). "Genome sequence of the palaeopolyploid soybean". Nature. 463 (7278): 178–183. Bibcode:2010Natur.463..178S. doi:10.1038/nature08670. PMID 20075913. S2CID 4372224.
  33. ^ Huang S, Li R, Zhang Z, Li L, Gu X, Fan W, et al. (December 2009). "The genome of the cucumber, Cucumis sativus L". Nature Genetics. 41 (12): 1275–1281. doi:10.1038/ng.475. PMID 19881527.
  34. ^ Argout X, Salse J, Aury JM, Guiltinan MJ, Droc G, Gouzy J, et al. (February 2011). "The genome of Theobroma cacao". Nature Genetics. 43 (2): 101–108. doi:10.1038/ng.736. PMID 21186351. S2CID 4685532.
  35. ^ a b Velasco R, Zharkikh A, Affourtit J, Dhingra A, Cestaro A, Kalyanaraman A, et al. (October 2010). "The genome of the domesticated apple (Malus × domestica Borkh.)". Nature Genetics. 42 (10): 833–839. doi:10.1038/ng.654. PMID 20802477.
  36. ^ Wang K, Wang Z, Li F, Ye W, Wang J, Song G, et al. (October 2012). "The draft genome of a diploid cotton Gossypium raimondii". Nature Genetics. 44 (10): 1098–1103. doi:10.1038/ng.2371. PMID 22922876. S2CID 38495587.
  37. ^ Xu Q, Chen LL, Ruan X, Chen D, Zhu A, Chen C, et al. (January 2013). "The draft genome of sweet orange (Citrus sinensis)". Nature Genetics. 45 (1): 59–66. doi:10.1038/ng.2472. PMID 23179022.
  38. ^ Tomato Genome Consortium (May 2012). "The tomato genome sequence provides insights into fleshy fruit evolution". Nature. 485 (7400): 635–641. Bibcode:2012Natur.485..635T. doi:10.1038/nature11119. PMC 3378239. PMID 22660326.
  39. ^ Bleidorn C (2015). "Third generation sequencing: technology and its potential impact on evolutionary biodiversity research". Systematics and Biodiversity. 14 (1): 1. Bibcode:2016SyBio..14....1B. doi:10.1080/14772000.2015.1099575.
  40. ^ Lee H, Gurtowski J, Yoo S, Nattestad M, Marcus S, Goodwin S, McCombie WR, Schatz M (2016-04-13). "Third-generation sequencing and the future of genomics". bioRxiv: 048603. doi:10.1101/048603.
  41. ^ van Deynze A (2015). "Using spinach to compare technologies for whole genome assemblies". Plant & Animal Genomics XXIII Conference.
  42. ^ Chin CS, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, et al. (June 2013). "Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data". Nature Methods. 10 (6): 563–569. doi:10.1038/nmeth.2474. PMID 23644548. S2CID 205421576.
  43. ^ Daccord N, Celton JM, Linsmith G, Becker C, Choisne N, Schijlen E, et al. (July 2017). "High-quality de novo assembly of the apple genome and methylome dynamics of early fruit development". Nature Genetics. 49 (7): 1099–1106. doi:10.1038/ng.3886. hdl:10449/42064. PMID 28581499. S2CID 24690391.
  44. ^ Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, et al. (December 2012). "SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler". GigaScience. 1 (1): 18. doi:10.1186/2047-217X-1-18. PMC 3626529. PMID 23587118. S2CID 2681931.
  45. ^ Ye C, Hill CM, Wu S, Ruan J, Ma ZS (August 2016). "DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies". Scientific Reports. 6 (1): 31900. Bibcode:2016NatSR...631900Y. doi:10.1038/srep31900. PMC 5004134. PMID 27573208.
  46. ^ Li H (2013). "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM". arXiv:1303.3997 [q-bio.GN].
  47. ^ "Bionano: Transforming the Way the World Sees the Genome". bionanogenomics.