DNA sequencing theory

From Wikipedia, the free encyclopedia

DNA sequencing theory is the broad body of work that attempts to lay analytical foundations for determining the order of specific nucleotides in a sequence of DNA, otherwise known as DNA sequencing. The practical aspects revolve around designing and optimizing sequencing projects (known as "strategic genomics"), predicting project performance, troubleshooting experimental results, characterizing factors such as sequence bias and the effects of software processing algorithms, and comparing various sequencing methods to one another. In this sense, it could be considered a branch of systems engineering or operations research. The permanent archive of work is primarily mathematical, although numerical calculations are often conducted for particular problems too. DNA sequencing theory addresses physical processes related to sequencing DNA and should not be confused with theories of analyzing resultant DNA sequences, e.g. sequence alignment. Publications[1] sometimes do not make a careful distinction, but the latter are primarily concerned with algorithmic issues. Sequencing theory is based on elements of mathematics, biology, and systems engineering, so it is highly interdisciplinary. The subject may be studied within the context of computational biology.


Theory and sequencing strategies

Sequencing as a covering problem

All mainstream methods of DNA sequencing rely on reading small fragments of DNA and subsequently reconstructing these data to infer the original DNA target, either via assembly or alignment to a reference. The abstraction common to these methods is that of a mathematical covering problem.[2] For example, one can imagine a line segment representing the target and a subsequent process where smaller segments are "dropped" onto random locations of the target. The target is considered "sequenced" when adequate coverage accumulates (e.g., when no gaps remain).
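
This covering abstraction is easy to make concrete. The sketch below (an illustrative simulation under assumed parameter values; the function name and numbers are mine, not from the literature) drops fixed-length fragments uniformly at random onto a discrete target and reports the covered fraction and the number of gaps remaining:

```python
import numpy as np

def simulate_coverage(G=100_000, L=500, n=400, seed=0):
    """Drop n fragments of length L uniformly onto a target of length G.

    Returns the fraction of bases covered and the number of gaps
    (maximal uncovered runs) remaining.
    """
    rng = np.random.default_rng(seed)
    covered = np.zeros(G, dtype=bool)
    for s in rng.integers(0, G - L + 1, size=n):  # fragments lie fully inside the target
        covered[s:s + L] = True
    # A gap starts at an uncovered position whose predecessor is covered
    # (or at position 0 if it is uncovered).
    uncovered = ~covered
    gap_starts = uncovered & np.concatenate(([True], covered[:-1]))
    return covered.mean(), int(gap_starts.sum())

frac, gaps = simulate_coverage()
print(f"fraction covered = {frac:.3f}, gaps remaining = {gaps}")
```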

The abstract properties of covering have been studied by mathematicians for over a century.[3] However, direct application of these results has not generally been possible. Closed-form mathematical solutions, especially for probability distributions, often cannot be readily evaluated. That is, they involve inordinately large amounts of computer time for parameters characteristic of DNA sequencing. Stevens' configuration is one such example.[4] Results obtained from the perspective of pure mathematics also do not account for factors that are actually important in sequencing, for instance detectable overlap in sequencing fragments, double-stranding, edge-effects, and target multiplicity. Consequently, development of sequencing theory has proceeded more according to the philosophy of applied mathematics. In particular, it has been problem-focused and makes expedient use of approximations, simulations, etc.

Early uses derived from elementary probability theory

The earliest result may be found directly from elementary probability theory. Suppose we model the above process, taking $L$ and $G$ as the fragment length and target length, respectively. The probability of "covering" any given location on the target with one particular fragment is then $L/G$. (This presumes $L \ll G$, which is valid often, but not for all real-world cases.) The probability of a single fragment not covering a given location on the target is therefore $1 - L/G$, and $\left(1 - L/G\right)^{n}$ for $n$ fragments. The probability of covering a given location on the target with at least one fragment is therefore

$$P = 1 - \left(1 - \frac{L}{G}\right)^{n}.$$

This equation was first used to characterize plasmid libraries,[5] but it may appear in a modified form. For most projects $n \gg 1$, so that, to a good degree of approximation,

$$\left(1 - \frac{L}{G}\right)^{n} \approx e^{-nL/G} = e^{-R},$$

where $R = nL/G$ is called the redundancy. Note the significance of redundancy as representing the average number of times a position is covered with fragments. Note also that, in considering the covering process over all positions in the target, this probability is identical to the expected value of the random variable $C$, the fraction of target coverage. The final result,

$$E\langle C \rangle = 1 - e^{-R},$$

remains in widespread use as a "back of the envelope" estimator and predicts that coverage for all projects evolves along a universal curve that is a function only of the redundancy.
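
As a quick numerical check (parameter values are arbitrary assumptions), the estimator can be compared directly against simulation; at redundancy $R = 2$, both the formula and the simulated covered fraction land near 0.86:

```python
import numpy as np

# Compare the back-of-the-envelope estimator E<C> = 1 - exp(-R) with a
# direct simulation (illustrative parameter values giving R = 2).
G, L, n = 100_000, 500, 400
R = n * L / G                      # redundancy
predicted = 1 - np.exp(-R)

rng = np.random.default_rng(1)
covered = np.zeros(G, dtype=bool)
for s in rng.integers(0, G - L + 1, size=n):
    covered[s:s + L] = True

print(f"R = {R:.1f}: predicted {predicted:.4f}, simulated {covered.mean():.4f}")
```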

Lander–Waterman theory

In 1988, Eric Lander and Michael Waterman published an important paper[6] examining the covering problem from the standpoint of gaps. Although they focused on the so-called mapping problem, the abstraction to sequencing is much the same. They furnished a number of useful results that were adopted as the standard theory from the earliest days of "large-scale" genome sequencing.[7] Their model was also used in designing the Human Genome Project and continues to play an important role in DNA sequencing.

Ultimately, the main goal of a sequencing project is to close all gaps, so the "gap perspective" was a logical basis for developing a sequencing model. One of the more frequently used results from this model is the expected number of contigs, given the number of fragments sequenced. If one neglects the amount of sequence that is essentially "wasted" by having to detect overlaps, their theory yields

$$E\langle \text{contigs} \rangle = n\,e^{-R}.$$
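
A short numeric illustration of this expression (target and fragment sizes are assumed values): the expected contig count first grows with the number of fragments, peaks at $R = 1$, and then decays as contigs merge:

```python
import numpy as np

# Expected number of contigs, E<contigs> = n * exp(-R), for a fixed target
# length G and fragment length L as the number of fragments n grows.
G, L = 1_000_000, 500
for n in (500, 2000, 4000, 8000):
    R = n * L / G
    print(f"n = {n:5d}  R = {R:.2f}  expected contigs = {n * np.exp(-R):7.1f}")
```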

In 1995, Roach[8] published improvements to this theory, enabling it to be applied to sequencing projects in which the goal was to completely sequence a target genome. Michael Wendl and Bob Waterston[9] confirmed, based on Stevens' method,[4] that both models produced similar results when the number of contigs was substantial, such as in low coverage mapping or sequencing projects. As sequencing projects ramped up in the 1990s, and projects approached completion, low coverage approximations became inadequate, and the exact model of Roach was necessary. However, as the cost of sequencing dropped, parameters of sequencing projects became easier to directly test empirically, and interest and funding for strategic genomics diminished.

The basic ideas of Lander–Waterman theory led to a number of additional results for particular variations in mapping techniques.[10][11][12] However, technological advancements have rendered mapping theories largely obsolete except for organisms outside the set of highly studied model organisms (e.g., yeast, flies, mice, and humans).

Parking strategy

The parking strategy for sequencing resembles the process of parking cars along a curb. Each car is a sequenced clone, and the curb is the genomic target.[13] Each clone sequenced is screened to ensure that subsequently sequenced clones do not overlap any previously sequenced clone. No sequencing effort is redundant in this strategy. However, much like the gaps between parked cars, unsequenced gaps less than the length of a clone accumulate between sequenced clones. There can be considerable cost to close such gaps.
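
The parking process is the classic random sequential adsorption problem and is straightforward to simulate (a minimal sketch with assumed clone and target lengths; the function name is mine). With many attempts, the covered fraction stalls near the Rényi "jamming" density of roughly 75%, illustrating why the leftover sub-clone-length gaps, and the cost of closing them, dominate the endgame:

```python
import numpy as np

def parking_simulation(G=100_000, L=500, attempts=200_000, seed=2):
    """Random sequential adsorption: a candidate clone is accepted only
    if it overlaps no previously accepted clone (the screening step)."""
    rng = np.random.default_rng(seed)
    occupied = np.zeros(G, dtype=bool)
    accepted = 0
    for s in rng.integers(0, G - L + 1, size=attempts):
        if not occupied[s:s + L].any():
            occupied[s:s + L] = True
            accepted += 1
    return accepted, occupied.mean()

clones, fill = parking_simulation()
print(f"accepted clones = {clones}, fraction covered = {fill:.3f}")
# Gaps shorter than L can never be filled by another clone, so coverage
# saturates well below 100%, near the Renyi jamming density of ~0.75.
```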

Pairwise end-sequencing

In 1995, Roach et al.[14] proposed and demonstrated through simulations a generalization of a set of strategies explored earlier by Edwards and Caskey.[15] This whole-genome sequencing method became immensely popular as it was championed by Celera and used to sequence several model organisms before Celera applied it to the human genome. Today, most sequencing projects employ this strategy, often called paired end sequencing.

Post Human Genome Project advancements

The physical processes and protocols of DNA sequencing have continued to evolve, largely driven by advancements in biochemical methods, instrumentation, and automation. DNA sequencing has now made inroads into a wide range of problems, including metagenomics and medical (cancer) sequencing. These scenarios involve important factors that classical theory does not account for. Recent work has begun to focus on resolving the effects of some of these issues; the level of mathematics becomes commensurately more sophisticated.

Various artifacts of large-insert sequencing

Biologists have developed methods to filter highly repetitive, essentially unsequenceable regions of genomes. These procedures are important for organisms whose genomes consist mostly of such DNA, for example corn. They yield multitudes of small islands of sequenceable DNA products. Wendl and Barbazuk[16] proposed an extension to Lander–Waterman theory to account for "gaps" in the target due to filtering and the so-called "edge effect". The latter is a position-specific sampling bias: for example, the terminal base position can only be covered by a fragment starting exactly at that position, so its chance of coverage is $1/G$, a factor of $L$ lower than the $L/G$ chance for interior positions. For redundancies $R \lesssim 1$, classical Lander–Waterman theory still gives good predictions, but dynamics change for higher redundancies.
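
The edge effect is visible directly in simulation (an illustrative sketch; the island length, fragment length, and counts are assumed values): average read depth is flat in the island interior but falls off within one fragment length of each end:

```python
import numpy as np

# Per-position read depth on a short filtered "island": end positions
# admit fewer fragment placements, so their depth is depressed.
G, L, n = 5_000, 500, 200          # island length, fragment length, fragments
rng = np.random.default_rng(3)
trials = 200
depth = np.zeros(G)
for _ in range(trials):
    for s in rng.integers(0, G - L + 1, size=n):
        depth[s:s + L] += 1
depth /= trials                     # average over trials
print(f"mean depth: terminal base {depth[0]:.2f}, "
      f"one fragment length in {depth[L]:.2f}, interior {depth[G // 2]:.2f}")
```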

Modern sequencing methods usually sequence both ends of a larger fragment, which provides linking information for de novo assembly and improved probabilities for alignment to a reference sequence. Researchers generally believe that longer read lengths enhance performance for very large DNA targets, an idea consistent with predictions from distribution models.[17] However, Wendl[18] showed that smaller fragments provide better coverage on small, linear targets because they reduce the edge effect in linear molecules. These findings have implications for sequencing the products of DNA filtering procedures. Read-pairing and fragment size evidently have negligible influence for large, whole-genome class targets.

Individual and population sequencing

Sequencing is emerging as an important tool in medicine, for example in cancer research. Here, the ability to detect heterozygous mutations is important, and this can be done only if the sequence of the diploid genome is obtained. In the pioneering efforts to sequence individuals, Levy et al.[19] and Wheeler et al.,[20] who sequenced Craig Venter and Jim Watson, respectively, outlined models for covering both alleles in a genome. Wendl and Wilson[21] followed with a more general theory that allowed for an arbitrary number of coverings of each allele and arbitrary ploidy. These results point to the general conclusion that the amount of data needed for such projects is significantly higher than for traditional haploid projects. Generally, at least 30-fold redundancy, i.e. each nucleotide spanned by an average of 30 sequence reads, is now standard.[22] However, requirements can be even greater, depending upon what kinds of genomic events are to be found. For example, in the so-called "discordant read pairs method", DNA insertions can be inferred if the distance between read pairs is larger than expected. Calculations show that around 50-fold redundancy is needed to avoid false-positive errors at a 1% threshold.[23]
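
A back-of-the-envelope version of the diploid requirement (a Poisson sketch, not the exact theory of Wendl and Wilson[21]; the cutoff $k = 3$ is an assumed illustration): if total read depth at a site is Poisson with mean $R$ and each read samples either allele with probability 1/2, then the two allele depths are independent Poisson($R/2$) variables, and one can ask how often both alleles are seen at least $k$ times:

```python
from scipy.stats import poisson

# P(both alleles of a heterozygous site each covered by >= k reads),
# assuming Poisson(R) total depth thinned equally between alleles, so
# each allele depth is an independent Poisson(R/2) variable.
def both_alleles_covered(R, k=3):
    p_one = 1 - poisson.cdf(k - 1, R / 2)   # one allele seen >= k times
    return p_one ** 2

for R in (10, 20, 30, 50):
    print(f"R = {R:2d}: P(both alleles >= 3 reads) = {both_alleles_covered(R):.4f}")
```

At $R = 30$ the per-site probability already exceeds 0.999; genome-wide guarantees and real-world coverage biases push practical requirements higher.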

The advent of next-generation sequencing has also made large-scale population sequencing feasible, for example the 1000 Genomes Project to characterize variation in human population groups. While common variation is easily captured, rare variation poses a design challenge: too few samples with significant sequence redundancy risk not having a variant carrier in the sample group at all, but large samples with light redundancy risk not capturing a variant in the read set even when it is present in the sample group. Wendl and Wilson[24] report a simple set of optimization rules that maximize the probability of discovery for a given set of parameters. For example, to observe a rare allele at least twice (to eliminate the possibility that it is unique to an individual), a little less than 4-fold redundancy should be used, regardless of the sample size.
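
A rough per-chromosome illustration of this rule (a simplification, not the Wendl and Wilson[24] optimization itself): if a carrier chromosome is sequenced at redundancy $R$, the variant allele appears in at least two reads with probability $P(\mathrm{Poisson}(R) \geq 2)$, which already exceeds 90% near $R = 4$:

```python
from scipy.stats import poisson

# Chance that a variant on a single carrier chromosome, sequenced at
# redundancy R, is captured by at least two reads.
for R in (2, 3, 4, 5):
    p = 1 - poisson.cdf(1, R)
    print(f"R = {R}: P(variant read >= 2 times) = {p:.3f}")
```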

Metagenomic sequencing

Next-generation instruments are now also enabling the sequencing of whole uncultured metagenomic communities. The sequencing scenario is more complicated here, and there are various ways of framing design theories for a given project. For example, Stanhope[25] developed a probabilistic model for the amount of sequence needed to obtain at least one contig of a given size from each novel organism of the community, while Wendl et al. reported analysis for the average contig size or the probability of completely recovering a novel organism of a given rarity within the community.[26] Alternatively, Hooper et al. propose a semi-empirical model based on the gamma distribution.[27]

Limitations

DNA sequencing theories often invoke the assumption that certain random variables in a model are independent and identically distributed. For example, in Lander–Waterman theory, a sequenced fragment is presumed to have the same probability of covering each region of a genome, and all fragments are assumed to be independent of one another. In actuality, sequencing projects are subject to various types of bias, including differences in how well regions can be cloned, sequencing anomalies, biases in the target sequence (which is not random), and software-dependent errors and biases. In general, theory will agree well with observation up to the point that enough data have been generated to expose latent biases.[21] The kinds of biases related to the underlying target sequence are particularly difficult to model, since the sequence itself may not be known a priori. This presents a type of Catch-22 problem.

References

  1. ^ Waterman, Michael S. (1995). Introduction to Computational Biology. Boca Raton: Chapman and Hall/CRC. ISBN 978-0-412-99391-6.
  2. ^ Hall, P. (1988). Introduction to the Theory of Coverage Processes. New York: Wiley. ISBN 978-0-471-85702-0.
  3. ^ Solomon, H. (1978). Geometric Probability. Philadelphia: Society for Industrial and Applied Mathematics. ISBN 978-0-898-71025-0.
  4. ^ a b Stevens WL (1939). "Solution to a Geometrical Problem in Probability". Annals of Eugenics. 9 (4): 315–320. doi:10.1111/j.1469-1809.1939.tb02216.x.
  5. ^ Clarke L, Carbon J (1976). "A colony bank containing synthetic Col-El hybrid plasmids representative of the entire E. coli genome". Cell. 9 (1): 91–99. doi:10.1016/0092-8674(76)90055-6. PMID 788919. S2CID 2535372.
  6. ^ Lander ES, Waterman MS (1988). "Genomic mapping by fingerprinting random clones: a mathematical analysis". Genomics. 2 (3): 231–239. doi:10.1016/0888-7543(88)90007-9. PMID 3294162.
  7. ^ Fleischmann RD; et al. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae Rd". Science. 269 (5223): 496–512. Bibcode:1995Sci...269..496F. doi:10.1126/science.7542800. PMID 7542800.
  8. ^ Roach JC (1995). "Random subcloning". Genome Research. 5 (5): 464–473. doi:10.1101/gr.5.5.464. PMID 8808467.
  9. ^ Wendl MC, Waterston RH (2002). "Generalized gap model for bacterial artificial chromosome clone fingerprint mapping and shotgun sequencing". Genome Research. 12 (12): 1943–1949. doi:10.1101/gr.655102. PMC 187573. PMID 12466299.
  10. ^ Arratia R; et al. (1991). "Genomic mapping by anchoring random clones: a mathematical analysis". Genomics. 11 (4): 806–827. CiteSeerX 10.1.1.80.8788. doi:10.1016/0888-7543(91)90004-X. PMID 1783390.
  11. ^ Port E; et al. (1995). "Genomic mapping by end-characterized random clones: a mathematical analysis". Genomics. 26 (1): 84–100. CiteSeerX 10.1.1.74.4380. doi:10.1016/0888-7543(95)80086-2. PMID 7782090.
  12. ^ Zhang MQ, Marr TG (1993). "Genome mapping by nonrandom anchoring: a discrete theoretical analysis". Proceedings of the National Academy of Sciences. 90 (2): 600–604. Bibcode:1993PNAS...90..600Z. doi:10.1073/pnas.90.2.600. PMC 45711. PMID 8421694.
  13. ^ Roach JC; et al. (2000). "Parking strategies for genome sequencing". Genome Research. 10 (7): 1020–1030. doi:10.1101/gr.10.7.1020. PMC 310895. PMID 10899151.
  14. ^ Roach JC, Boysen C, Wang K, Hood L (1995). "Pairwise end sequencing: a unified approach to genomic mapping and sequencing". Genomics. 26 (2): 345–353. doi:10.1016/0888-7543(95)80219-C. PMID 7601461.
  15. ^ Edwards, A.; Caskey, T. (1991). Closure strategies for random DNA sequencing. Vol. 3. A Companion to Methods in Enzymology. pp. 41–47.
  16. ^ Wendl MC, Barbazuk WB (2005). "Extension of Lander–Waterman Theory for sequencing filtered DNA libraries". BMC Bioinformatics. 6: article 245. doi:10.1186/1471-2105-6-245. PMC 1280921. PMID 16216129.
  17. ^ Wendl MC (2006). "Occupancy modeling of coverage distribution for whole genome shotgun DNA sequencing". Bulletin of Mathematical Biology. 68 (1): 179–196. doi:10.1007/s11538-005-9021-4. PMID 16794926. S2CID 23889071.
  18. ^ Wendl MC (2006). "A general coverage theory for shotgun DNA sequencing". Journal of Computational Biology. 13 (6): 1177–1196. doi:10.1089/cmb.2006.13.1177. PMID 16901236. S2CID 17112274.
  19. ^ Levy S; et al. (2007). "The diploid genome sequence of an individual human". PLOS Biology. 5 (10): article e254. doi:10.1371/journal.pbio.0050254. PMC 1964779. PMID 17803354.
  20. ^ Wheeler DA; et al. (2008). "The complete genome of an individual by massively parallel DNA sequencing". Nature. 452 (7189): 872–876. Bibcode:2008Natur.452..872W. doi:10.1038/nature06884. PMID 18421352.
  21. ^ a b Wendl MC, Wilson RK (2008). "Aspects of coverage in medical DNA sequencing". BMC Bioinformatics. 9: article 239. doi:10.1186/1471-2105-9-239. PMC 2430974. PMID 18485222.
  22. ^ Ley TJ; et al. (2008). "DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome". Nature. 456 (7218): 66–72. Bibcode:2008Natur.456...66L. doi:10.1038/nature07485. PMC 2603574. PMID 18987736.
  23. ^ Wendl MC, Wilson RK (2009). "Statistical aspects of discerning indel-type structural variation via DNA sequence alignment". BMC Genomics. 10: article 359. doi:10.1186/1471-2164-10-359. PMC 2748092. PMID 19656394.
  24. ^ Wendl MC, Wilson RK (2009). "The theory of discovering rare variants via DNA sequencing". BMC Genomics. 10: article 485. doi:10.1186/1471-2164-10-485. PMC 2778663. PMID 19843339.
  25. ^ Stanhope SA (2010). "Occupancy modeling maximum contig size probabilities and designing metagenomics experiments". PLOS ONE. 5 (7): article e11652. Bibcode:2010PLoSO...511652S. doi:10.1371/journal.pone.0011652. PMC 2912229. PMID 20686599.
  26. ^ Wendl MC; et al. (2012). "Coverage theories for metagenomic DNA sequencing based on a generalization of Stevens' theorem". Journal of Mathematical Biology. 67 (5): 1141–1161. doi:10.1007/s00285-012-0586-x. PMC 3795925. PMID 22965653.
  27. ^ Hooper SD; et al. (2010). "Estimating DNA coverage and abundance in metagenomes using a gamma approximation". Bioinformatics. 26 (3): 295–301. doi:10.1093/bioinformatics/btp687. PMC 2815663. PMID 20008478.