
Third-generation sequencing

From Wikipedia, the free encyclopedia

Third-generation sequencing (also known as long-read sequencing) is a class of DNA sequencing methods currently under active development.[1] Third-generation sequencing works by reading nucleotide sequences at the single-molecule level, in contrast to existing methods that require breaking long strands of DNA into small segments and then inferring nucleotide sequences by amplification and synthesis.[2] Critical challenges remain in engineering the molecular instruments needed to make whole-genome sequencing with these methods commercially available.

Second-generation sequencing, often referred to as next-generation sequencing (NGS), has dominated the DNA sequencing space since its development. It has dramatically reduced the cost of DNA sequencing by enabling a massively parallel approach capable of producing large numbers of reads at exceptionally high coverage throughout the genome.[3]

Since eukaryotic genomes contain many repetitive regions, a major limitation of this class of sequencing methods is the length of the reads it produces.[3] Briefly, second-generation sequencing works by first amplifying the DNA molecule and then conducting sequencing by synthesis. The collective fluorescent signal from synthesizing a large number of amplified, identical DNA strands allows the inference of nucleotide identity. Due to random errors, however, the synthesis reactions across the amplified strands progressively fall out of sync, and signal quality deteriorates rapidly as read length grows. To preserve read quality, long DNA molecules must be broken into small segments, which is a critical limitation of second-generation sequencing technologies.[3] Computational efforts to overcome this challenge often rely on approximate heuristics that may not produce accurate assemblies.
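The effect of read length on assembly can be illustrated with a toy example (not drawn from the cited sources; the sequences, repeat length, and read lengths below are invented). Two genomes that differ only in how unique segments are arranged around a shared repeat yield identical collections of short reads, so no assembler could tell them apart from those reads alone, whereas reads longer than the repeat resolve the arrangement:

```python
# Toy illustration (invented sequences): when a repeat is longer than the read
# length, rearranging the unique segments between its copies does not change
# the set of reads produced, so the reads alone cannot distinguish the genomes.

def reads(genome: str, read_len: int) -> set[str]:
    """All error-free substrings of length read_len (exhaustive coverage)."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}

R = "ATGCATGCAT" * 3                               # a 30 bp repeat, present in 3 copies
X, Y, Z, W = "AAAAA", "CCCCC", "GGGGG", "TTTTT"    # short unique segments
genome_a = X + R + Y + R + Z + R + W
genome_b = X + R + Z + R + Y + R + W               # interior unique segments swapped

for read_len in (20, 40):
    identical = reads(genome_a, read_len) == reads(genome_b, read_len)
    print(f"read length {read_len}: identical read sets -> {identical}")
# 20 bp reads (shorter than the repeat) cannot tell the genomes apart;
# 40 bp reads span a full repeat copy and resolve the arrangement.
```

Long-read approaches avoid this ambiguity by producing reads that can span entire repeat copies.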

By enabling direct sequencing of single DNA molecules, third-generation sequencing technologies have the capability to produce substantially longer reads than second-generation sequencing.[1] This advantage has critical implications for both genome science and the study of biology in general. However, due to various technical challenges, third-generation sequencing has had error rates high enough to render the technologies impractical for certain applications, such as de novo genome assembly.[4] These technologies are under active development, and their high error rates are expected to improve. For applications that are more tolerant of errors, such as metagenomics or the calling of large structural variants, third-generation sequencing has been found to outperform existing methods.

YouTube Encyclopedic

  • Next-Generation Sequencing Technologies - Elaine Mardis (2012)
  • 3) Cell Culture - The Basics of Recombinant Lentivirus System
  • Next-Generation Sequencing Technologies - Elaine Mardis (2010)
  • Agarose Gel Electrophoresis, DNA Sequencing, PCR, Excerpt 1 | MIT 7.01SC Fundamentals of Biology
  • (About to Introduce) Third Generation Agile

Transcription

Dr. Andy Baxevanis: All right, good morning everyone and welcome to week six of this Current Topics lecture series. Before I introduce today's guest to you, just a brief program note to remind you that there is no Current Topics lecture next week, February 29th. Instead, next Wednesday morning I strongly encourage all of you to attend NHGRI's 10th annual Trent Lecture which will be held at 10:00 a.m. in the Kerstein Auditorium [spelled phonetically] over in the Natcher Building. This year's Trent Lecture will be given by Bert Vogelstein from the Johns Hopkins University School of Medicine and his lecture is entitled "Cancer Genomes and Their Implications for Basic and Applied Research". Of course, those of you who are working on questions in cancer research already know that Dr. Vogelstein was the first researcher to elucidate the molecular basis of a common human cancer and that his work on colorectal cancer forms a paradigm for much of the work that's being done in modern cancer research. If you haven't heard him before, Bert is a fantastic speaker and he brings a unique world view to the field of cancer genomics and the themes he will touch upon during his lecture will dovetail quite nicely with a lot of the themes that we are going to be addressing over the 13 weeks of this Current Topics lecture series. Please keep in mind that Dr. Vogelstein's lecture will not be videocast and it will not be videotaped. So, please mark your calendars for next week and I hope to see many of you over in the Natcher Auditorium for Bert's lecture. So, today, it's my great pleasure to introduce to you Dr. Elaine Mardis, who is a professor of Genetics and Molecular Microbiology and the co-director of the Genome Institute at the Washington University School of Medicine. Dr. Mardis' involvement in the field of genomics dates back to the beginning of genome sequencing when in 1993 she joined the -- excuse me, the Genome Institute at Wash. U. as its director of technology development. And in that role, she helped create methods and automation pipelines that were critical for sequencing the human genome. And if you think back to Dr. Green's lecture during the first week of this course, you'll recall the technological challenges that those that were working on the Human Genome Project in the early days of that project faced as they sequenced each and every one of the chromosomes that comprised the human genome, and Dr. Mardis was really one of the key players and thought leaders who helped to figure out how to best approach and operationalize such a huge biological and technological problem. In her current role as the co-director, she orchestrates the Genome Institute's efforts to explore next-generation and third-generation sequencing technologies, some of which you'll hear about today. And the goal of that is to transition these technologies into a production-sequencing environment that can serve as a very strong foundation for addressing really important questions in genetics and genomics. Dr. Mardis also has a very strong research interest involving the application of DNA sequencing approaches to the characterization of cancer genomes with a particular focus on facilitating the translation of basic science discoveries about human disease into the clinical setting. 
Her work and contributions to the field of genomics has been recognized by numerous organizations; most recently in 2010, she was awarded the Scripps Translational Research Award for her work on cancer genomics, and in 2011, she was named a Distinguished Alumna of The University of Oklahoma College of Arts and Sciences. I am very pleased that Elaine could join us today and be part of this series presenting her perspective on next-generation sequencing technologies. So, with that, please join me in welcoming today's speaker, Dr. Elaine Mardis. [applause] Dr. Elaine Mardis: Thanks, that was great. Well, good morning everybody and thanks again, Andy, for the kind invitation to be here and provide an educational lecture for you on next-gen sequencing technologies. I believe I'm supposed to flash the next slide, which is not working for some reason, to provide the fact that I have nothing to disclose. Am I doing this wrong? I think I have a pretty good amount of practice, having done this -- there we go. So, this is my non-disclosure slide -- I mean, my disclosure slide of no relevant financial relationships. So, what I'm going to do today is basically tell you all about next-gen and third-generation sequencing instruments and I sort of have a laundry list for you here -- [clears throat] excuse me -- to consider; and then I'll spend the last portion of the talk just giving you a probably somewhat limited, but hopefully a good broad brush perspective on all of the ways that these technologies are now being used, I think, to really transform the biological research enterprise. And I hope to give you a feel for that. Of course, there is so much going on that it would never be comprehensive, or we would be here for hours and neither you nor I want that. So, I'll try to give you just a few salient features, give you some references from our own work, and try to mention the work of others as I go along, assuming that I can remember to do that. So, you may be familiar with a very nice issue of Nature that came out towards the beginning of last year. In it was featured "The Roadmap for the National Genome Research Institute's Next Five Years." And I was also very honored to be asked at that point in time to provide this perspective piece, a reference to which is at the bottom here, to really look back over the past 10 years since the human genome sequence was completed at the trajectory of technology improvements toward DNA sequencing. And this is just a figure from the paper that sort of gives the timeline over that 10-year period. I'll talk about some of the highlights during this timeline today as I go through the work. But, really, also to just reflect that back around the time that we finished the genome and moving forward, there's been this explosion of ability to produce sequence data, as you can see. So, inflecting, round about 2005, in the introduction of the first next-generation sequencing instruments from 454 technologies, and now moving up and ever upward with recent announcements that I'll just briefly mention, in late January, about sequencing technologies that will now take us to the point of sequencing an entire human genome in essentially an overnight time period. So, this is a very radical transformation over a very short period of time, and it's had a tremendous amount of impact already in this short timeframe on biological research, and I'll try to give you a flavor for that. 
But, if you don't remember anything about it, just remember that the cost of data production, not the cost of data analysis, but the cost of data production has fallen dramatically. So if you look at capillary-based sequencing technology round about the time that we finished finishing the human genome in 2004, if you went back to that capillary sequencer from ABI and you ran through DNA sequences to satisfy coverage on the three billion base pair human genome, you would be talking about a $15 million price tag. So, most of us in this room, including myself, can't afford that if you wanted to have your genomes sequenced. Maybe some of you guys can, and if you can, congratulations on that. But you would have wanted to wait because in a very, very short time period, the transition of about six or seven years, if you will, that cost has fallen dramatically to around about $10,000, perhaps moving towards the mythical $1,000 figure in this calendar year; we'll see. And the time to produce that data, you know, wouldn't have been weeks and weeks, months and years, depending upon how many of these capillary sequencers you had; you can literally today on this Illumina box just provided for illustration, sequence six human genome equivalents in about a 10- or 11-day period. So, you and five of your friends can all pool your money together and get your genomes sequenced very rapidly. So, what are the basics behind all of these next-generation sequencing platforms? I mean, for years and years, all we had to choose from basically was the capillary sequencer from Applied Biosystems. So, it's kind of difficult to illustrate for you a crazy wealth of riches in terms of all of the sequencing platforms that are available; that's the good news. The bad news is that, for people who don't live, sleep, and breathe this like I do, there are some questions that may arise about what's the exact right technology for the application that I have in mind. I'll try and shed a little bit of light on that later in the talk, but let's just take some time pacing through the basics of how these things work, how they do what they do, and what they turn out in terms of the data that's produced. And I'll try and present that for you here. Now, each and every one of the manufacturers of these sequencing instruments would like you to think that their instrument is highly unique and capable and poised above all of the others that are available in the commercial space. Of course, as a skeptical scientist, you won't believe that, and that would be wise. But, what I want to walk you through first is all of the ways that these sequencers are actually the same, because there are a lot of similarities. And I do this to set the stage for them pacing through each one and telling you how they're unique, but keep in mind the similarities that go across the different platforms because that gives you a fundamental basis for understanding how they work. So, all of the shared attributes are listed here. First, we'll start with the fact that making libraries, for those of you who may have in the past done clone-based libraries for capillary sequencing, is now faster, easier, and cheaper than ever. There's no need to run through an E. coli intermediate. There's no need to do cloning. It's all a very straightforward process that begins with the random fragmentation of the starting DNA that you are interested in sequencing. 
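For concreteness, the arithmetic behind the cost and throughput figures quoted above can be sketched as follows (the numbers are the approximate, rounded ones from the talk, not exact pricing):

```python
# Back-of-the-envelope numbers behind the cost comparison in the talk.
# All figures are the approximate ones quoted by the speaker (illustrative only).
GENOME_SIZE = 3.0e9          # haploid human genome, base pairs
COVERAGE = 30                # typical whole-genome redundancy (see later in the talk)
bases_needed = GENOME_SIZE * COVERAGE
print(f"Bases needed per genome at {COVERAGE}x: {bases_needed:.1e} (~90 Gb)")

capillary_cost_2004 = 15_000_000   # quoted ~$15M per genome in the ABI capillary era
hiseq_cost_2012 = 10_000           # quoted ~$10k per genome on a HiSeq
print(f"Cost drop: roughly {capillary_cost_2004 / hiseq_cost_2012:.0f}-fold")

genomes_per_run = 6                # HiSeq-era figure from the talk
run_days = 10.5                    # quoted 10-11 day run
print(f"Throughput: ~{genomes_per_run / run_days:.2f} genomes per day per instrument")
```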
And if, for example, these are just PCR products, then there is no need for fragmentation, you can just go forward to the next step which is ligation of these fragments with custom linkers or adapters to make a library. And as you'll see with each one of these technologies, library construction is basically the same approach. Each instrument has its own specific and unique adapters, as you might guess. But nonetheless, the overall process is exactly the same and highly concordant. So, instead of spending a week producing a sub-clone library which you then pick, amplify in E. coli, and isolate all of the DNA, in the old-style process, instead with this process, you essentially in the period of a day's time, fragment the DNA, add on the adapters by ligation, do some purification and amplification steps, quantitate the library, and you're ready to go. So, the whole process literally takes less than a day's time and costs in our hands on the order of $100 or $150 to complete to the point where you now have hundreds of millions of DNA fragments that are ready to do next-gen sequencing on. So, I mentioned that there's library amplification in these processes. Depending upon the platform you're talking about, it takes place on some sort of a solid surface. So, either on a bead or on a glass surface, and I'll show you the differences between those different sequencers. But, the net impact is the same. You're taking these unique fragments and now you're starting from one fragment, amplifying it up to multiple copies. Well, sometimes when I present this lecture, I ask students, "Why do you think we do that?" I won't do that for you guys, but I'll just tell you the answer, which is: now for all but some of the single molecule sequencers, which I'll mention as we go through them, you must do library amplification in order to see the signal that's coming back from the sequencing reaction itself. So, most of these sequencers start with single molecules. They're amplified in place, either on a bead or glass, and then they're sequenced. And to see the sequencing reaction going on in real time, you actually have to do that amplification step. So, that's not a bad thing in most cases; however, with any type of enzymatic amplification, you're always going to get some aspects of biasing and some aspects of what are called -- what's called "duplication" or "jackpotting" where some of the library fragments will preferentially amplify and you'll get more of those sequences than of others. And so we have ways to adjust or ameliorate for that in our processes. Okay, and then on to the sequencing reactions themselves. For most of these technologies I'll talk about today, a direct step-by-step detection of each nucleotide base incorporated during the sequencing reaction occurs. Commonly, these approaches are referred to as "sequencing by synthesis," if you will, if they use a polymerase, or sequencing by ligation. But, they do occur in a direct, step-by-step fashion. So again, let's hearken back to days of old and capillary sequencing. If you're familiar with this, what happened procedurally was: you sequenced all of your DNA fragments, perhaps 96 or 384 at a time, and then you applied them to a sequencer after the sequencing reaction was over with and they were separated by electrophoresis and the fragments were detected, either by radioactivity if you're as old as I am, or fluorescence if you're younger than I am.
So, in contrast, next-generation sequencing, everything happens together at the same time. Sequencing and detection happen in a step-by-step fashion so that you essentially don't decouple the sequencing reaction from the detection of the sequencing reaction. And this leads to another name which I actually quite prefer over next-generation sequencing or some generation of sequencing, which more accurately represents what's going on in these sequencing instruments, as I hope you'll come to appreciate, mainly you're performing hundreds of thousands to hundreds of millions of sequencing reactions all at the same time. And so the term that's often applied to these technologies is "massively parallel sequencing," which is exactly what you're doing. You're sequencing everybody together simultaneously, performing an imaging step to detect what happened, and then moving on to the next base incorporation step over and over and over again until you generate your full sequence run. Now, the consequences of doing next-generation sequencing, with a couple of exceptions again that I'll point out, is that in general, these reads are shorter than capillary sequencers. And there are a number of reasons for this, but it mainly comes down to one word: signal versus noise. I guess that's three words technically, although I think of them as one. So, what you're always battling in this detection game is the signal to noise ratio, and in most of these technologies there's some cost to pay that ultimately limits the read lengths. And I'll give you the specifics for each platform, but just consider in principle that these are going to be shorter read lengths. And I mentioned earlier the contrast between the cost to generate data and the cost to analyze data, and this is where push comes to shove because there is a tool exacted from the fact that you can produce lots of short reads and then you have to go and analyze those. And I'll point out to you why -- the reasons why that becomes much more difficult and why the bioinformatics overhead, the analysis overhead, is quite expensive for us still. So, I already alluded to this for you, but I'll talk about it again a little later on when I show some examples of how we're actually using now the fact that these are digital reads. So, each read of a massively parallel sequencer originates from one fragment in the library, even though it's amplified. What that means is that you can literally apply counting-based methods to the analysis of these data that will tell you things, for example, as I'll show, how many tumor cells in the collective that produced DNA for a tumor genome actually contained each one of the mutations that you detected. So, you can get it down to that level of sensitivity, you can look at the number of counts for a given messenger RNA, for example, and look at quantitative aspects of sequencing as we've never been able to do before. And this is a tremendously exciting application space for next-generation sequencing. I'll try and give you a feel for that. And then lastly, one of the newest abilities that's come on board for these sequencing instruments is the use of what we refer to as paired-end reads. So, most of these technologies started out by priming a sequencing reaction, extending off of a single primer for certain read lengths, and that was it. And you got a single fragment read. 
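A minimal sketch of the digital counting idea just described, using invented read counts: the fraction of reads carrying a variant allele at a site (the variant allele fraction) can, under simplifying assumptions, be translated into the fraction of tumor cells carrying that mutation:

```python
# Sketch of "digital read counting": because each read originates from one
# library fragment, allele counts at a site estimate how common a mutation is.
# Counts and assumptions below are invented for illustration.

def variant_allele_fraction(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads carrying the variant allele at one site."""
    return alt_reads / (ref_reads + alt_reads)

ref, alt = 72, 28                        # e.g. 100 reads covering a candidate mutation
vaf = variant_allele_fraction(ref, alt)
# For a heterozygous somatic mutation in a diploid tumor with no copy-number
# change and no normal-cell contamination, roughly 2 * VAF of the cells carry it.
tumor_cell_fraction = min(1.0, 2 * vaf)
print(f"VAF = {vaf:.2f}; ~{tumor_cell_fraction:.0%} of cells carry the mutation "
      f"(under the simplifying assumptions above)")
```

The same counting logic applies to RNA sequencing, where the number of reads per transcript serves as a quantitative expression measure.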
And in most cases, that was pretty darn good and we learned to work with it, but over time what emerged was the ability to sample from not one but both ends of the fragment in that library. Namely, each technology applies a different adapter to each end of the DNA fragment that's being put into the library and you can exploit that by using one primer for one adapter, and in the second round of sequencing, a second primer for the second adapter, effectively collecting data from both ends of the fragment. And also understanding that based on the sized of fragments that went into the library that you made, let's say 300 base pairs, now that you have 100 from one end and 100 from the other end, you can actually align those back to a reference genome and expect that they will align at about 300 base pairs apart from one another. And when that doesn't work out and you have a distance further apart or closer together, or maybe even mapping to two separate chromosomes, you can actually use that information to make sense out of the genome that you're sequencing, and I'll give you a few examples of this from our work a little bit later on. So, paired-end reads had all kinds of other advantages that I've listed here. There's also a bit of a nuance to paired-end reads that I want to spend a little bit of time on because it is a major point of misunderstanding, if you will, and various literature and manufacturers, quite frankly, will try to trip you up with this one. So, you can get the paired-end read data, sequence can be derived from both ends of the library fragments, as I just mentioned. There are basically two kinds of paired-end reads, however, that go by different names. So, in my vocabulary, true paired ends mean that you have a linear fragment. It's typically, as I said, on the order of three to 500 base pairs, if you will. And you are literally going to be using two different primer extension steps sequenced at both ends in two separate reactions. So that's paired-end reads. The second type of read pair that you can generate on a next-gen sequencer is a so-called mate pair. And the nuance here is that rather than using two separate adapters, you literally circularize a large DNA fragment, typically greater than a KB in length, 3-, 8-, 20KB libraries are typically made. And by circularizing that around a common signal adapter, you actually can generate mate pairs where the ends of the DNA come together. You go through a second step to remove the extraneous DNA that's the part of the circle that you don't care about. And you either use the adapter to sequence across the DNA or you do a single reaction read -- sorry, use a single reaction read to sequence across the DNA or two separate end reads to tell you the sequences at either ends of those fragments. So, the advantage of mate pairs is that you can stretch out to much longer lengths across the DNA that you're interested in sequencing and hopefully understand better the long-range structure of that DNA. The downside of mate pairing as opposed to paired-end reads is that this approach, because DNA circularization is inherently not very efficient, is that large amounts of DNA, typically several micrograms, are required for each library that's constructed. Okay, and that just goes with the inefficiency of mate pairing. So, in general, whether they're mate pairs or paired-end reads, these offer advantages for sequencing especially when the genome is like the human, large and complex. 
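The insert-size logic just described can be sketched as a simple classifier (the coordinates and thresholds are illustrative, not from any particular pipeline): pairs whose ends map at roughly the expected spacing are concordant, while unexpected spacing or ends on different chromosomes hint at deletions, insertions, or translocations:

```python
# Sketch of how aligned read pairs are interpreted. Expected insert size and
# tolerance are made-up illustrative values.

def classify_pair(chrom1: str, pos1: int, chrom2: str, pos2: int,
                  expected_insert: int = 300, tolerance: int = 100) -> str:
    if chrom1 != chrom2:
        return "discordant: ends on different chromosomes (possible translocation)"
    observed = abs(pos2 - pos1)
    if observed > expected_insert + tolerance:
        return f"discordant: ends {observed} bp apart (possible deletion)"
    if observed < expected_insert - tolerance:
        return f"discordant: ends {observed} bp apart (possible insertion)"
    return f"concordant: ends {observed} bp apart"

print(classify_pair("chr1", 1_000_000, "chr1", 1_000_310))
print(classify_pair("chr1", 1_000_000, "chr1", 1_005_000))
print(classify_pair("chr1", 1_000_000, "chr8", 2_500_000))
```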
The reason why is you can more accurately place that read on the genome than you can a single-ended read, and the main reason for that is that, as long as both reads don't fall into a repetitive structure, you can anchor one with certainty even if the other one doesn't anchor with a high degree of mapping quality, as we call it. So, it may be a read that could place at multiple positions in the genome, but as long as the companion read to it places exactly at one place in the genome, when you go back to align the reads to the reference, you can identify exactly where that read came from. So, the net result of this is that you can use more reads towards your ultimate analysis than you might be able to with single-end reads. And this provides a huge advantage in the economy of sequencing as well. Okay, so that's sort of the introductory similarities and differences and basic terminology. Now what I want to do is use those concepts to walk you through each of the different types of approaches and keep those similarities in mind as we also examine the differences. So, this was, as I mentioned, one of the first technologies to come to us for next-generation sequencing, round about late 2004, early 2005. This is a massively parallel version of a sequencing approach you might have heard about called "pyrosequencing," which basically uses the emission of light to register the incorporation of DNA nucleotides into a growing strand. So, the 454 approach to constructing a library is exactly as I walked you through. So we do a random fragmentation step, you can see the DNA's intact here; now in small pieces; now with the adapters ligated onto either end, and actually, this is a denatured molecule which is the precursor to the amplification step. So, you take your single-stranded adapter-ligated library now, so the yellow are the DNAs that you care about; the different colors at the end are the specific adapters for this platform. And you essentially do a step called emulsion PCR. This is the amplification step that I talked about earlier, and it's a really unique way of doing this. Namely, all of the PCR occurs in an emulsion of oil and reagents that are in aqueous phase. So, if you can see this, hopefully well enough, what we try to achieve in emulsion PCR is a micelle, a sort of clear area in the center of the picture, that contains a bead, shown here, for amplification. And if you can see the little squiggles on the surface of the bead, those represent the complementary sequences to the adapter that you ligated on in your library construction step. And for each amplification, ideally we have only a single molecule of fragment from that library in association with the bead. Now, what's invisible in this aqueous micelle, of course, are the basic building blocks of DNA as well as DNA polymerases that are going to effect the amplification of this single fragment on the surface of the bead during a PCR step: namely, temperature cycling. Of course, we aren't just doing one bead at a time, we're doing several million beads at a time in a single microtube in this oil and water emulsion called emulsion PCR. Okay. So you go through the PCR cycling steps, and what you end up with is this micelle containing a bead with lots and lots of copies of this original single fragment placed on its surface.
You then go through a series of steps that I won't show you to effectively break the emulsion so you separate the oil from water, and you can extract the beads away during this step, free from the oil and ready for the deposition into the sequencing plate that's used for the 454 sequencer. So, that process is shown here. In the sequencer for 454, the picotiter plate is literally the glass structure that's going to serve as the flow cell. This is a diffusion-mediated process that occurs sort of on the upper surface of this picotiter plate. So, we are depositing these DNA-containing beads, cleaned up from our emulsion PCR, down into the wells by the use of a centrifugation step, and effectively, these wells are about the right size so that just one bead fits. They won't all be ultimately filled, but most of them will be filled with a single bead that's going to provide a sequencing reaction. So, the upper surface of the flow cell, as I mentioned, is where the reagent flow occurs, okay. So, we're going to be flowing reagents through this process, allowing them to diffuse in and out of the wells to provide the sequencing process. In the meantime, this side of the picotiter plate is the business side for imaging. So, this is optically flat, optically clear glass that sits right up against a very, very high sensitivity CCD camera and is literally going to be recording the light flashes from about a million sequencing reactions as they all occur in lockstep. So, to do this pyrosequencing reaction, we also need some helper beads, these little brown beads that are added in, and they sort of nestle down around the larger bead with the DNA on it. And their purpose is that they're linked with one of two enzymes, sulfurylase and luciferase here, that effect the sequencing reaction that I'll describe in just a moment. But they need to be down in there in the mix so that all of the reactions can take place and so the light can be produced when the base is incorporated. Let's look quickly at the sequencing by synthesis steps on the 454. We're going to imagine that we have one of our DNA capture beads here and one of the, you know, millions of copies that are hanging off of it is going to be imagined right here. And this large gray blob is the DNA polymerase that is now seated at this annealed primer and ready to go. We then add in the first nucleotide, and this is a T. The first four nucleotides for this process are always the same because these four nucleotides are always determined by the sequencing adapter, and as you'll see, there's one A, one C, one T, one G, and this is the so-called "key sequence" that now tells the downstream interpretation software what a single nucleotide incorporation looks like. Why is that important? Because what we're flowing across the surface of the flow cell are native nucleotides. So, in the case where you run into, like, four As here in a row together, all four of those As are going to get incorporated at once. There's no stopping A by A by A; they all four go together and the downstream output of light is effectively four times as high as one nucleotide. You wouldn't know that if you didn't have that key sequence at the beginning for your software to look at and gauge all of the other incorporation cycles. So, when this T gets incorporated, what happens? Well, we all know this, basic polymerase biology.
A pyrophosphate moiety is released; that goes through a series of downstream reactions that are catalyzed by these enzymes on the bead and the output is essentially light. And that light is detected now by the CCD camera, which knows all the positions of all the wells that are emitting in these first four key sequences, and now records that cycle by cycle by cycle. Okay? And so we run these cycles effectively for several hundred times to generate the read lengths that are obtained from this sequencing instrument. I present for you here just short of a trajectory of improvements, if you will, on the 454 instrumentation since this was introduced in 2005, where you can see that there have been increases in read lengths so that with this latest flex plus, which we're just testing in our laboratory over the last couple of months, you can now get close to seeing your capillary read lengths actually out of this technology. So, about 650 to 700; about one gigabase of data per run is being yielded, and this takes on the order of about 20 hours to complete. So, it's an overnight run still. The error rate is about 1 percent, so you get 99 percent accuracy out of any given read. And we know that when you have an inaccuracy, it's typically in the range of an insertion/deletion type error. These typically are occurring now at those homopolymer runs like the four-A stretch that I pointed out to you where that exceeds six or seven nucleotides of the same identity in a row. You basically max out the detection on the CCD camera and you can now no longer make that correlation back to the key sequence that I was talking about earlier. So, that's a deficiency but you can't actually typically make this up with what we call coverage, which means you don't just ever sequence through once, you actually have multiple molecules which will include that multi-homopolymer run; and the more you sequence it, the more sure that you're six versus seven nucleotides of the same type, for example. So, that's one way to get around the insertion/deletion error model here. The other advantage of this platform, which I think I've pointed out here, this is a great platform for targeted validation where you're looking for single nucleotide changes because the way the nucleotides are flowed one at a time, you almost never see a substitution error on this platform. The error rate for substitutions is extraordinarily low, and so, if you're looking for specific base in a PCR product, or whatever, you can almost always detect that that's there and not be worried that you're getting some sort of a platform-specific error. Okay, let's shift gears now to the Illumina platform. This is round about the second platform introduced, originally marketed as Solexa. Do you have a question? Male Speaker: In the previous technology -- Dr. Elaine Mardis: Yeah. Male Speaker: -- a single bead carried a single fragment -- Dr. Elaine Mardis: That's been amplified, [affirmative]. Male Speaker: -- [unintelligible] of a single fragment? Dr. Elaine Mardis: That's right. Yeah. So that provides the three prime hydroxyl for extension, if you will, and it goes and goes. Okay, so the Illumina technology, again, note the similarities now with what we've already been discussing. So, DNA's fragmented. Here we blunt the ends because they tend to be ragged, and we did that in the 454 process, I just didn't point it out. We actually phosphorylate the ends and added an A-overhang. These are all enzymatic steps that take place in quick succession. 
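A toy sketch of the flow-based base calling and key-sequence calibration described above for the 454 platform (the flow order and intensity values are invented; real flowgram processing also models noise and signal droop):

```python
# Sketch of pyrosequencing-style base calling from flow intensities. The first
# four flows cover the known "key" bases (one incorporation each), so their
# signals calibrate what a single-base light pulse looks like; later signals
# are divided by that unit and rounded, which is why long homopolymers blur.
# Flow order and values are made up for illustration.

FLOW_ORDER = "TACG"

def call_bases(intensities, key_flows=4):
    unit = sum(intensities[:key_flows]) / key_flows    # light per single base
    seq = []
    for i, signal in enumerate(intensities[key_flows:]):
        n = round(signal / unit)                       # estimated homopolymer length
        seq.append(FLOW_ORDER[i % len(FLOW_ORDER)] * n)
    return "".join(seq)

#            key flows: T     A     C     G    then template-dependent flows
flows = [1.02, 0.98, 1.05, 0.95,  0.0, 3.9, 1.1, 0.0, 2.1, 0.0, 0.0, 1.0]
print(call_bases(flows))   # -> "AAAACTTG"
# A 3.9-unit signal rounds cleanly to four As here, but for runs of six or
# seven identical bases the rounding becomes unreliable, matching the
# insertion/deletion error profile described in the talk.
```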
You ligate on the adapters utilizing this A-overhang to get the adapters ligated on. And you do a quick clean-up step, a sizing step if you're interested in getting very definitive sizes from the library, which we usually are just for uniformity's sake, and you're good to go. So this is a very straightforward process. As I mentioned earlier, in our laboratory, because of the need to sequence thousands of samples at a time in some cases for very directed sequencing projects such as looking at case control cohorts, we've actually automated this to a very large extent to where we can produce on the order of about 1,000 Illumina libraries on a weekly basis, which is a technician and a fleet of small, inexpensive pipetting robots. So, this is very automatable and works very well. In Illumina sequencing, the amplification of the library fragments now is occurring on the surface of the glass flow cell, so each of these technologies has its own nomenclature as well for the device that does the sequencing. In Illumina, it's the flow cell. And again, you can see sort of very shared characteristics here. The surface of the flow cell is decorated with the same type of adapters that are put onto the Illumina library fragments. This provides a point for hybridization of the single-stranded fragments. The amplification steps are essentially what's called a bridge amplification, where the DNA molecule will then bend over and encounter a complementary second-end primer and the polymerase essentially does multiple copies in one place, which results in the collection of fragments, times several hundred million, on the surface of the flow cell now. And this is called a "cluster." So, when you image a cluster during sequencing, it looks like this very bright little dot here, and when you image a portion or all of the flow cell, it begins to look like this star field, if you will. As the incorporated fragments are scanned with the laser, they emit a light frequency and that is detected by the camera that's coincident with the scanning by the laser. And I should point out then that there's another process that takes place when you go to the read two, or the paired-end read, which is essentially that you wash away all the fragments that you've already synthesized. You go through another round of some amplification and now you change the chemistry for release of this fragment up from sequencing and you effectively copy the other strand in the other direction from the way that it was first copied in this initial go-round. Okay, so how does the sequencing chemistry work? Now, this is fundamentally different from the 454 chemistry that I showed you in a couple of ways. First of all, we're supplying all four nucleotides into each step of the stepwise process. And in fact, the way these nucleotides are designed is very specific, so we have all four in the mix because each one, A, C, G, and T, has its own unique fluor, so they report at a specific wavelength when they're scanned by the laser. So, that's why you can have them all in there at once because you can get the identity back just based on the wavelength that's interpreted by the machine camera. In addition, at the three prime end where normally you would have a hydroxyl available for the next base incorporation, there's a chemical block that's in place.
And that chemical block doesn't allow you to incorporate another nucleotide until you go through the detection step and the de-blocking step that removes that and turns it into a hydroxyl ready for the next go-round of incorporation. In a subsequent step, the fluorescent group is also cleaved off because it has a labile connection here, and that also removes the fluorescence so that you don't get any background noise, if you will, for the next step of incorporation. Now, what I just said is absolutely not true 100 percent of the time, right? Because of one fundamental rule of life, if you don't walk out of here today with anything else, you have to understand the chemistry is never 100 percent right. You probably learned that in, you know, college. And it's also true here. So, two things can go wrong in particular. The block can fail to be there, meaning it may not have been synthesized correctly. Some portion of the molecules in this mix actually won't have a block and you'll incorporate two nucleotides, let's just say, for the sake of argument, instead of one in a single step. That puts the strands that incorporated those two nucleotides instead of one so-called "out of phase" with the rest of the nucleotides in that cluster. And if this happens multiple times throughout the hundred cycles per read, you will encounter noise that's due to molecules that are out of phase with the others. This is the source that limits read length on this particular instrument. Of course, the other thing that can go wrong here is two-fold, again, because chemistry is never 100 percent: you might not have a fluor on there, so you can't detect the molecule that's been incorporated. But, of course, this is the beauty of having several million copies of it, right? Because that's only one of several million or two, if you will. So, that's not necessarily a bad thing, but then if this cleavage step doesn't occur, the worse outcome is that it's actually going to interfere with the signal that's coming for the next go-round. And again, these are cumulative processes, so they may happen a lot over the course of the sequencing reaction. They produce noise and that can cause errors and ultimately limits the read length as well. So, just always be skeptical about how well each of these steps works because there are places where they fall apart. Okay. Illumina has been, I think, in my opinion, a pretty remarkable company also just in terms of the amount of data produced from these instruments, so I don't have a comprehensive listing here, but the early Solexa instrument basically produced about a billion base pairs per run of single-ended reads. The newer iteration once Illumina took over was the GAIIx. We then in early 2010 encountered the HiSeq 2000, which ran two flow cells coincident with one another and produced on the order of about 200 gigabases per run in about an eight-day period. And then the most recent version of the HiSeq, which was announced -- released, rather, in July of this past year, is the equivalent of six whole human genome sequences per run of about 10 to 11 days, as I mentioned. So, this is a remarkable jump up in terms of the quantity and actually also the quality of the data that results from this, and in our laboratory, this is the primary sequencing instrument that we're using. There's also a newer version called the MiSeq, which I'll talk about in a little bit, which is a personal genome sequencer, if you will. It's sort of the desktop instrument.
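The phasing effect described above, in which a small per-cycle chemistry failure accumulates and ultimately limits read length, can be sketched with a one-line model (the 0.5 percent per-cycle figure is arbitrary and purely illustrative):

```python
# Sketch of phasing: if a small fraction of strands in a cluster falls out of
# phase each cycle (failed de-blocking, double incorporation, etc.), the
# cluster's signal purity decays roughly geometrically with cycle number.

per_cycle_out_of_phase = 0.005     # illustrative per-cycle failure fraction

def in_phase_fraction(cycle: int) -> float:
    return (1 - per_cycle_out_of_phase) ** cycle

for cycle in (1, 50, 100, 150, 250):
    print(f"cycle {cycle:3d}: ~{in_phase_fraction(cycle):.1%} of strands still in phase")
# The in-phase signal shrinks while the out-of-phase background grows, which is
# why base-call quality degrades toward the end of a read.
```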
Much lower scale, as you can see from the numbers here, but I'll get to that in just a moment. And then the last thing to mention on the Illumina technology is, as I mentioned, there were some recent announcements in January about improvements to the different sequencing platforms. Illumina's announcement was for the 2500 sequencing instrument, which is basically now just a modification of the HiSeq instrument. In fact, you would be able to upgrade the instrument itself, and that will produce less data: about 120 gigabase pairs per run, but the run requires only 25 hours to complete. So, now you're talking about the rough equivalent of covering a human genome in a one-day period to generate the data. And we don't have those instruments yet, but we're beginning to look at data from them. I should point out that nobody has them yet. They're somewhat vaporware at this particular point in time. We shall see. Female Speaker: [inaudible] Dr. Elaine Mardis: The error rate is -- yes, thank you, I forgot to mention that. The error rate has been improving pretty dramatically over time. Originally, we were round about 1 percent error rate on this platform when we first started working with it. The recent version three chemistry that I mentioned is down around .3 percent error on the reads, and we also are seeing a much better coverage on the G plus C content where in the past, very high G plus C sequences of 95 percent or higher actually did not represent well in the Illumina datasets, and that's now been addressed by some changes to the chemistry that we saw with the version three release, so that's improved the coverage overall on the genome as well, which has been a relief because it was pretty easy to see. We published a paper in 2008 that actually showed that you couldn't see these sequences. So, the third large sequencing technology for human sequencing and other approaches is from a company: Life Technologies. This is a different beast, sequencing by ligation, than what we've been talking about earlier, which is sequencing by polymerase. We use a custom adapter library, as I mentioned. This is also an emulsion PCR-based sequencing based instrument. And Life Technologies actually has some nice modular equipment that you can buy to facilitate the emulsion PCR stuff, because they are pretty manual if you don't have that instrumentation and subject to some errors and failure points. And those instruments seem to help. So, this is sequencing by ligation. The bottom line is that we have florescent probes. They are about nine bases in length and they have a very defined content. And we are also priming from a common primer once we have these emulsion PCR beads to sequence from. So, rather than sequential rounds of basing incorporation, this is now sequential rounds of ligation of the primer, a detection step takes place, there's some enzymatic trimming of the primer that -- or of the N-mer that was added on and then a second round of ligation follows, et cetera, et cetera. So, we go to sequential rounds of ligation. We also go through a sequential round of primer. So, when your first sequencing primer sits down on your adapter, it's at the N minus zero position, and your sequence base is 5-10-15-20, et cetera. The second go round, your primer sits down at N minus 1, and your sequence base is 4-9, et cetera. So, it's sequential and sequential, if you will. 
The beauty of this approach is that, in the design of the ligated nine-mers, you've effectively had the first two bases identified with the specific known sequence, and those correspond to the fluorescent group that's on that particular nine-mer. So, why is that important? Well, what it means is that you're effectively sequencing every base two times because these first two nucleotides are fixed and their sequence is known from the fluorescent group that's there. So, effectively, if you can look at this particular little diagram here, and if you can't see these well on your slides, or on my slides, for any of these approaches I would suggest very strongly going to the manufacturers' websites because they have extraordinary, sometimes animated visual aids to help you understand the unique attributes of their sequencing process. But, effectively, by sequencing two bases per cycle on each incorporation, you end up overlapping the bases that are read from the fragment. And so you effectively sample every base twice. I like to refer to this for a common analogy, is when you write a check, you write in the number of the dollars and cents that you want to pay on the check, and you write it out secondarily in longhand; so there's that ability to sort of cross compare, if you will, from one read to the next read to make sure you've gotten it right. And this yields an extraordinarily high -- extraordinarily low, rather, substitution error rate for this technology. I'm going to have to change that analogy, because not many people write checks anymore, but so far it's still working, I guess. So, this is the SOLiD instrument, is the name of the platform. These are the two most recent versions with the 5500xl being most recently introduced just last year. And these are some of the attributes; as you can see here, the error rate is extraordinarily low. We have this front-end automation, a six-lane flow chip that actually allows you to use some lanes and not other lanes sort of to ameliorate this need to sort of load up everything at once if you don't have enough samples, et cetera, and a very high accuracy, as I mentioned. And they're introducing some new primer chemistries, I'm not sure that those are actually out yet, that would increase the accuracy pretty wildly high. So that could be, I think, very interesting. Okay. So, let's shift gears now away from next-generation massively parallel instruments to what are commonly referred to by some as third-generation sequencers. And this is really, more than anything, just to note a time point which was sort of the beginning of last year when these sequencers started to hit the early axis or to actually hit the marketplace. So, these include the Pacific Biosciences sequencer, which is the first true single-molecule sequencing instrument that I'm going to talk about. So, we'll see what the specific attributes are for that system which basically marries nanotechnology with molecular biology. The Ion Torrent System now is a variation of pyrosequencing and, instead of detecting light, it actually detects changes in pH, because not only is the pyrophosphate released when you incorporate a base, but a hydrogen ion is too, so you effectively can monitor base-by-base changes in terms of whether an incorporation has happened by monitoring the pH and effectively a little modified semiconductor apparatus. 
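Returning briefly to the two-base encoding just described for the ligation-based (SOLiD) chemistry, here is a sketch of how a color-space read can be decoded once the first base is known from the adapter. The mapping used is the commonly described di-base code (an XOR of two-bit base encodings); treat the details as illustrative rather than as the vendor's exact specification:

```python
# Sketch of two-base ("color space") encoding and decoding. Each color encodes
# a pair of adjacent bases; given the known first base, the base sequence is
# reconstructed color by color, so every base is effectively sampled twice.

BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_BASE = {v: k for k, v in BASE_BITS.items()}

def encode_colors(seq: str) -> list[int]:
    return [BASE_BITS[a] ^ BASE_BITS[b] for a, b in zip(seq, seq[1:])]

def decode_colors(first_base: str, colors: list[int]) -> str:
    bases = [first_base]
    for color in colors:
        bases.append(BITS_BASE[BASE_BITS[bases[-1]] ^ color])
    return "".join(bases)

read = "TACGGATC"
colors = encode_colors(read)
print(colors)                       # color-space representation
print(decode_colors("T", colors))   # recovers TACGGATC
# Note: a single miscalled color corrupts every downstream base on naive
# decoding, which is one reason analysis is typically done in color space and
# the two-color signature is used to separate true variants from errors.
```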
And then the MiSeq I've already mentioned, which is really just a scaled-down version of the HiSeq with very great similarities in terms of the process and the chemistry, et cetera. These all offer some shared attributes as well. Faster run times, as we'll talk about. Lower cost per run. And a reduced amount of data generated relative to the second-gen or next-gen platforms that I talked about. And also, some are touting a potential to address genetic questions in a clinical setting because of the low cost and speed with which results can be returned from these instruments. And I probably won't talk about that, but during Q&A we can address that if you're interested. So, I put these in with the other systems just to place them in context along the lines of what I just talked about. Different detection methodologies, how the libraries are made, and you can see here for the Pacific Biosciences, that we don't really require any of these concerted amplification steps. Although, in the true sense of full disclosure, there are some PCR steps involved here as you make the libraries with the specific adapters. And then the run times, as you can see, are quite low: 45 minutes, two hours, and on the order of about 19 hours now for the MiSeq platform. So, let's go through these step by step. I just put the PacBio here first because that was the first third-gen instrument that we received in our laboratory. And you can see again just sort of very common shared steps here relative to the next-gen platforms. So, the sample prep requires shearing, polishing of the ends, and ligation of the specific adapter that's called the SMRTbell. I'll show you in the next slide exactly what that is. SMRTbell is obviously just a marketing name for it, but it's kind of a clever little adaptation. And then sequencing primer annealing takes place, where you actually bind the DNA polymerase to the library molecules first and then introduce them onto the surface of the SMRT cell, which is a little nanodevice, essentially, that contains on the order of about 150,000 so-called zero-mode waveguides; I'll tell you what those are in just a moment. But you effectively image half of the chip at a time, so you look at the possibility of sequencing about 75,000 single-molecule reactions, then you actually switch the imaging, the chip physically moves on the platform, and you image the other 75,000 zero-mode waveguides, and that's one run of the SMRT cell, which is what this little device is called. So, this just shows that you're actually sequencing the first half of the SMRT cell and then the second half of the SMRT cell. And the reason that it says "movie" here is that this is a true real-time sequencer. So, what you're doing in this sequencing instrument is you have the DNA complexed to DNA polymerase, and you're providing it with fluorescent nucleotides. Once it nestles down in the bottom of that zero-mode waveguide, you use the camera and optics and laser system to effectively watch every one of those DNA polymerases in real time as it accepts the fluorescent nucleotides into the active site, which is where the optics is focused, and incorporates them; and in the process of incorporation, the fluorescent moiety diffuses away and the strand translocates so that you can watch the next nucleotide float in and sample: if it's not the right one, it floats out too quickly to be detected, in the ideal world; if it is, it sits there long enough to be detected and then floats away, and this is the source of errors, et cetera.
The difficulty in single-molecule sequencing as opposed to amplified-molecule sequencing is just that: You have one shot to get it right, and there are a variety of sources of error that are highly unique to single-molecule sequencing that don't occur in amplified-molecule sequencing, such as dark bases, such as a nucleotide sampling in for too long so that it gets detected but it's not really the right nucleotide, so it doesn't get incorporated, and multiple nucleotides getting incorporated too quickly to distinguish the individual pulses that result. So, the SMRTbell is shown here, and you can see the source of the name. It's essentially just a DNA lollipop, if you will, that gets adapted onto the ends of double-stranded fragments and gets primed with the primer so the sequence is known; it's complementary at its ends but not in its middle, so it forms that open, single-stranded portion of the molecule. When a denaturing agent such as sodium hydroxide is applied, the molecule then opens up and becomes a circle. So, the beauty of this is that, for very short fragments complexed with the DNA polymerase, in a 40-or-so-minute run time you can sample around multiple times, both the Watson and Crick strands of that circle, going through the adapter each time. And then in the bioinformatics phase you can align all of those reads and end up with a much higher quality consensus sequence for that short fragment. So, that's one application of the PacBio. If you want to run this sequencing for longer, you take these VLRs as we call them, very large fragments, also with the SMRTbell adapter, and you essentially just sequence as long as you can during that 45-minute movie. If you're wondering, the read length here is really limited by the amount of data storage capacity on the movies because they are actually very, very large. The movies themselves never get stored because it wouldn't work, storage-wise. But, they get converted very quickly into a down-sampled data file that then gets operated on by the instrument software to do the base calling. Okay. So that's the limitation, but as you'll see in the next slide these can be very, very long reads as we've experienced them. So, now we're really out into outer space in terms of read length compared to everything I've shown you so far, and compared to the capillary sequencers; we're looking, with the latest chemistries, at read lengths that average about 3,500 base pairs for these VLR libraries. And you can see that some are actually very, very long. But, these would be, you know, the outliers in the distribution curve, if you will. And here's just some exemplary data down here from a 45-minute movie where you're looking here at about 8,000 nucleotides in length at the extreme on that distribution curve, and you can see that there are also some failure sequences as well. So, that all sounds great. Because of the sensitivity of single-molecule sequencing to errors, as I just mentioned, the error rate on this is still quite high, about 15 percent, so 15 out of 100 bases are incorrect, and most of those errors, confoundingly, are actually insertion errors. So, this makes the alignment of the reads difficult to do back to a known template or genome reference, and it also complicates the ability to assemble these reads, but I'll show you some of the ways that we're trying to address that here in just a minute or two. Shifting now to the Ion Torrent: this was released last year.
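A toy sketch of the circular-consensus idea behind the SMRTbell: if the same short insert is read several times, a per-position vote over the passes cleans up the high single-pass error rate. Real consensus calling aligns the passes and models the insertion-dominated errors; this simplified version assumes pre-aligned passes with substitution errors only, and the subreads shown are invented:

```python
# Sketch of circular consensus: take a per-position majority vote across
# multiple passes over the same insert (alternating Watson and Crick strands
# in the real instrument). Assumes the passes are already aligned.

from collections import Counter

def consensus(passes: list[str]) -> str:
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))

subreads = [                       # three error-containing passes, invented
    "ACGTTAGCCAGTACGGATCA",
    "ACGTTAGCCCGTACGGATCA",
    "ACGATAGCCAGTACGGATCA",
]
print(consensus(subreads))         # majority at each column: ACGTTAGCCAGTACGGATCA
```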
This is the pyrosequencing approach that doesn't use light, but rather uses the release of hydrogen ions. Again, this is very similar to pyrosequencing. The guy who developed pyrosequencing is also the guy who developed the Ion Torrent, you won't be surprised. And this is bead-based, amplification-based. You're sitting in a well here on a semiconductor chip and you're basically on top of a sensor plate that when the hydrogen ion release occurs at each flow of the nucleotide, again this is the same sort of shtick as the 454 approach, you get the release of hydrogen, and this is impacting the detector there that's sensing changes in pH. Some advantages for this are potentially that the linear dynamic range on a pH meter is much better than on a camera, so you could have a better sensitivity to long homopolymers. We haven't really seen that yet, but in fairness to the instrument, these are very new days, and also, we should see a lower substitution rate on this instrument as well. This has been on a trajectory of improvements over time since we first received the instrument commercial release in late, well it was actually early in this past year in 2011, and these chips are increasingly bumping up the yield on the run of the sequencer. Keep in mind, this is about a two-hour to three-hour run time, and we're just recently now experimenting with these 318 chips where the read length is 200 base pairs. Still not paired-end reads, but their working on that, so this is a 200-base pair fragment, and they're hoping in this calendar year to go up to 400 base pairs. I should also point out that Ion Torrent, Life Technologies is one of the other companies that also mentioned a new version of this instrument at the J.P. Morgan meeting in January, and that's the Ion Proton, which will also move using the same technology with a slightly larger chip with more wells, et cetera, et cetera, will move you to this mythical $1,000 genome in a 24-hour period on a completely different instrument that's relatively low cost. Again, vaporware, we have to make sure that it actually comes to fruition, but they're projecting towards the end of 2012 that this would actually be available for the coverage of a whole human genome, $1,000, a 24-hour period. This just shows some of the work that we've been doing on this platform. You can see the different chips listed here. These are all bacterial genomes that we're working with, Enterococcus faecalis or Escherichia coli: these are different emulsion approaches, manual versus these now scaled-down modules that were the same as what were developed essentially for the SOLiD technology that I mentioned. And some enrichment approaches also can be automated as well, and of course you'll immediately see that kicking in the automation actually bumps up the output of this sequencer inordinately, so, you know, it's kind of, I think, probably worth the money for the automation to get that kind of yield increase, just my personal opinion. Now, last week was the Advances in Genome Biology and Technology meeting that I've organized now for 13 years down in Florida, which is really kind of the showcase for sequencing technology. And at that meeting we had a late-breaking abstract provided by this company, Oxford Nanopore, on their Nanapore sequencing device that should, if it, again, is real, revolutionize everything that I've talked about today. We have to be skeptical, however; we are scientists. So, this essentially could use two different processes for sequencing through nanopores. 
Not trivial technological feed, I might point out, that many have been pursuing at the academic level for well over 15 years now with no tangible commercial success having resulted. So, if these guys can do it, they're really bucking the trend. But, we'll see. So, there are two flavors, if you will, that are being proposed by this company. The one that wasn't talked about is this exonuclease-aided sequencing where you can see a lipid bilayer here, a nanopore with a sensor of some sort, and an exonuclease poised right at the top which would, of course, routinely and uniformly cleave off the DNA basis in this strand shown here, and electrical field would suck them through and detect them one at a time in a neat and orderly fashion. And if you can detect the sarcasm in my voice, you can imagine the ways in which that might go wrong. But, that remains to be seen because I don't think biology is always that neat and orderly, for starters. But, I am a skeptic, as you probably can tell by now. So, the type of sequencing that was talked about is being sort of near term, second half of 2012, a device would be available that looks for all intents and purposes, like the data stick that I loaded my talk from today, is this pore translocation sequencing approach. Here you have a double-stranded molecule that's held by an enzyme, probably a helicase that essentially single strands it and preferentially feeds one strand down through the pore with a tether on this end to make it behave because the biggest problem, or one of the biggest problems in nanopores, is that the DNA looks nice and straight in these beautiful pictures, but, of course, the tendency to form secondary structure, even base-stacking interactions, sometimes kiboshes the fact that it's actually going to go through that pore in a nice orderly fashion. So, these guys have apparently solved that problem, they're reporting base -- length reads of hundreds of thousands of bases having sequenced through Lambda in its entirety for an example. All highly purified DNAs, I might add, so it remains to be seen. And in their big version of this, which looks -- you should check out their website if you're intrigued, it looks for all intents and purposes, like a server rack in a data center. This would be the consumable that you would put into each one of those server modules, if you will, to perform large-scale sequencing. Two to 8,000 nanopores per membrane that goes into this, you add on your DNA, you push the button and you walk away as it complexes it together with the enzyme; they find the nanopores and the sequencing begins. I should point out that this is all detection by current fluctuation, so as the DNA translocates through the pore, the current fluctuation differs depending on the triplet codon that's occupying that pore at the time. This group has studied that extensively for every triplet codon and they have, supposedly, a hidden mark-off model of what each triplet looks like; and you can infer the DNA-based sequence using that hidden mark-up model base collar. Okay. So, that's that. Let's spend just the last five or 10 minutes here talking about some applications of next-gen sequencing. 
I've made sort of a perfunctory but probably incomplete list here, and I would refer you to an old review that I wrote and then also to this Nature paper that I mentioned at the beginning of the talk which is a little more up to date, and not just to sort of toot my own horn, there are lots of reviews out there on what people are doing with next-gen sequencing, it's just that I didn't list them all on the slide, but they're pretty easy to find. So, let's just talk through a few of these examples and I'll show you some papers from the literature on our work. So, one of the things that we've been doing a lot, and really pioneered in many ways, is whole genome sequencing. These just show the Illumina and SOLiD 5500s because these are really the high throughput here today, whole human genome sequencing instruments that are out there on the marketplace. In the stepwise progression to generate a whole genome is actually pretty simple. It wasn't when we first started trying to do this, mind you, but I think we've got it pretty well down to an art at this point in time that's highly reproducible, and as you'll see in a minute, we do hundreds of whole genomes a year. You prepared the paired-end libraries using the processes I've walked you through. You've produced paired-end data about 30-fold, so about 100 GBs of data per three-gigabase haploid human genome is sufficient. You could go deeper, but this is an economic decision in some ways, because this is, of course, the most expensive application towards the human genome. And then you use computer programs to align the read paired sequences. Back on to that human genome reference sequence that we generated in the early 2000s and we used different algorithms, as I'll show you, to discover variants, genome wide, of all types. We first published the initial description of the whole genome tumor normal comparison in 2008 in this Nature paper with Tim Ley, our AML collaborator, using back then the Solexa technology, 32 based-pair unpaired reads gigabase at a time. It took us about 90 runs on six sequencers, 90 runs total on six sequencers, to produce the full equivalent of the tumor genome. At the time that we did this, we couldn't get any funding to do it, so we went to a private donor and he contributed a million dollars to this project, which when it was all said and done, we figured it probably cost us about a million six, because at this point in time, none of the bio informatics that I'll tell you about had ever been developed. So, we kind of scaled this up since then, I say, tongue in cheek. These are the number of tumor genomes, just at a fairly out-of-date look, you can see an up-to-date look if you check our website, that have been sequenced. Keep in mind, each one of these cases reflects, at minimum, a tumor normal pair that have been sequenced by whole genome methods. You can see the most of our work has been in AML and breast cancer, and then we have a very large amount of work that is now entering the publication phase with the St. Jude Pediatric Cancer Project that just reported the first two -- three papers in Nature II and Nature Genetics I within the last three weeks. So, this is now ramping up to the point where we have over 500 whole genome sequences alone. 
As you can imagine, sequencing lots of genomes exacts a higher toll than just taking each one and finding out all the somatic mutations, so we've actually developed now a software that's available for download through our website or SourceForge called Music, which allows you to take all this information from genome sequencing and start to make sense of it across different types of functionality, like significantly mutated genes, pathway-based analysis, mutation rate analysis, and looking out to the databases such as COSMIC and OMIM to identify previously identified mutations, and then also taking in any clinical data that might be available for those samples to do clinical correlation analyses to the different mutations that are identified. So, that's Music. Most recently, we've moved forward, exploiting the digital nature of the technology in a paper that was published in Nature January 12th of this year looking at patients at their initial presentation of AML and at their relapse of AML, and basically showing that, unlike that before, this is actually often an oligoclonal presentation of the disease. I mentioned to you earlier that we can map each one of these tumor mutations into a specific subset of the tumor cells that originally contributed DNA, showing for this patient that four sub-clones are originally present in her tumor. The effect of chemotherapy is to whittle away most, but not all, of those sub-clones. They acquire additional mutations, often through the DNA-damaging impact of chemotherapy, and come out on the other side as a flourishing, usually monoclonal presentation that kills the patient. We've also spent a lot of time developing software looking at read pair analysis for structural variant detection. We've known since the 1970s that lots of translocations occur in cancer for example, and this is one way of getting at it using that read pairing information that I told you about earlier, where you look specifically at how and what orientations the read pairs map, and you can use this BreakDancer algorithm to interpret different types of structural variants, including deletions, insertions, inversions on the chromosome, and intra- and inter-chromosomal translocations. And this is actually quite a widely-used software now, by many groups, to do this kind of interpretation. One of the things that we often want to do once we've identified a structural variant, is we want to understand the base sequence at that structural variant so we really understand exactly where it took place down to the nucleotide. You need an assembly algorithm to do that, this is just one example from our work, called TIGRA_SV, which allows us to assemble many but not all structural variants down to single nucleotide resolution. And here is an example of how we use that in a clinical case of a patient whose genome was sequenced: this is a patient presenting with a pathologic examination, acute promyelocytic leukemia, often found under the microscope by cytogenetics as a translocation between chromosomes 15 and 17. However, cytogenetics in this patient did not allow her to take the standard of care, which is all trans-retinoic acid consolidation, because her cytogenetics showed that she did not have the 15-17 translocation. So, she was referred in to us. We decided to do sequencing on her genome because this is a very important consideration, as you can imagine. Her referral in to us was for stem cell transplant. 
This is because the cytogenetic examination did reveal that she had complex cytogenetics and was therefore a high-risk patient. This is the standard of care for high-risk patients. Nonetheless, this is very expensive, about half a million dollars, unless there are complications, then it's more: associated mortality and morbidity. And if we can make the determination of whether the 15-17 translocation was really there, we could allow her to avoid stem cell transplant and go instead back onto the normal paradigm of care. So, this is now trying to ameliorate conventional pathology and cytogenetics with an intermediary that involves whole genome sequencing. And basically what we found is shown here. Rather than the translocation, this patient actually encountered an insertional mutation, if you will. A portion of 15 containing the PML gene inserted physically into chromosome 17, producing the net PML-RAR alpha fusion that is the hallmark of the normal 15-17 translocation. So, the mechanism was different; the net result was exactly the same. In using TIGRA, we were actually able to assemble this break, and this break down to nucleotide resolution to predict the open reading frames on the proteins were being conserved here, but not in the other fusion -- sorry, insertion products, and this was reported in the Journal of the American Medical Association in mid-last year. And you can see a diagram here from the paper that shows you the net result of her insertional change. So, just a few other sort of illustrations here of using next-gen and third-gen sequencing. I mentioned the very long reads available from next-gen, PacBio reads. This is now showing an Illumina assembly, which are all of these colored short reads, and we're spanning a gap here in the assembly where we lack coverage using some very long reads from the PacBio that we're able to span across that gap. So, this just shows the power of contiguating assemblies with the very long read technologies. There are actually now mechanisms that have been published from Mike Shatt's [spelled phonetically] lab at Cold Spring Harbor where you could use the high quality of the Illumina sequence to actually improve the quality of the PacBio reads once you have those aligned. We're also using the Ion Torrent sequencer that I mentioned earlier, as are many people, to just do rapid genotyping, specifically, small PCR products that can be quickly sequenced in a two-hour timeframe and analyzed for the presence of mutations. And the company has just announced recently some defined content [unintelligible] that will allow you to generate amplicons and then sequence them on this platform. And that's also available for MiSeq. One last note here on hybrid capture technology. This is sort of an offshoot that many people prefer as opposed to whole genome sequencing. It came along a little bit later, but it's been very rapidly advancing. And this is really just taking a very clever approach where you begin with your standard whole genome library, but instead of sequencing, you take it through a hybrid capture approach with a biotinylated probe set. In this example, the biotinylated probes are representing most of the human exome, namely all of the exons that are present in the known genes in the human genome. 
You can complex the library with the capture probes during a hybridization step which allows you to have a biotinylated probe and the captured DNA molecule, those can be separated away from everything that didn't capture during streptavidin magnetic beads that bind to the biotin, apply a magnet, wash everything else away, and then release the bound fragment, which is already adapted and ready for whatever next-gen sequencing approach you'd like. So, this is a way of down-sampling complex genomes to only look at the genic content and this has now been exploited beyond the exome to custom capture reagents that can be synthesized through these same manufacturers, and you can do the same approach over and over and over again. Here's just an example from our work. A really hard target, which is Merkel cell polyomavirus, which is a very small genome, about five KB. It is a virus that inserts in the human genome, but you can't go in routinely by PCR and amplify it out because it deletes its own genome in specific and unknown places that are not uniform, and the site of integration is not known either. So, we published this in last year in the Journal of Molecular Diagnostics with some of our pathology colleagues. This is basically looking at four different cases in this initial set. Out of formalin-fixed, paraffin-embedded tissue, we're actually able to show the differences in the genomes as they were captured relative to a reference full-length Merkel polyoma cell. And we we're also able to use a bioinformatics approach called SLOPE to look at the paired reads, and identify the exact junction fragment into the human genome where these viruses were able to insert, and predict which genes were being interrupted by them, and so on. And then, just one last word RNA sequencing, which turns out to be pretty important. If you take an RNA isolate, you can treat it in a multitude of different ways, some of which are shown here. Adapter ligate the obtained fragments, and perform next-generation sequencing. This usually starts like whole genome sequencing with alignment of the reads to reference databases and discovery efforts, but the multitude of different types of analysis that one can do on RNA as opposed to DNA, because there are so many things that happen to RNA, such as looking at expression levels, looking at novel splicing events where exons are added or deleted, looking at allelic bias, for example where one allele is preferentially transcribed over the other, and in cancer, looking at the known fusion transcripts to identify them in a cell population. All are possible using the right bioinformatics. I would say that this is one of the most intriguing and tricky problems in analysis, all of these things that we currently face today, and we're working very, very hard on it. Just a couple of quick examples from our tumor sequencing, you can expect anything to happen with RNA. Here's a tumor normal pair predicting a mutation, but when we look in the RNA-Seq data, it's a very high depth. This gene isn't even expressed in the tissue at all; this is from the tumor. Here's an example of very allele-specific expression, where the wild-type allele only is being expressed in spite of very high prevalence of reads in the tumor genome. 
And here's an example of a splice-site mutation showing alternative splicing that results from an accepter site mutation detected in the whole genome data, and verified by the mapping of reads from RNA-Seq across this region, and the links between mate pairs that show that there are some problems in terms of the splice site being missed by the transcription machinery, and so on. And then, another way that this technology is being used is not just for human sequencing, but also for looking at microbes. So, there's been a very large project that's been funded by the NIH Common Fund on the Human Microbiome Project. I won't go into the details of it for you today, but just to say that this has been a site of -- our medical center has been a site of collection for healthy volunteers being sampled across multiple body sites; sequencing takes place, you can do whole genome sequencing or 16S ribosomal RNA sequencing. And the bottom line is, the question is, "Who's there, and how can we determine that by examining DNA sequencing data?" This, as you can imagine, presents huge challenges in terms of bioinformatic interpretation of the data and has been a major cornerstone of a lot of bioinformatic development. Here's just a couple of quick examples of what you can do with microbial sequencing. Looking at the stability of the virome over time, for multiple body sites that are sampled on different individuals, you can identify different viral types that are shared and similar between a person when they visit the first time to clinic, and a person when they visit the second time to clinic, according to these different body sites. This is one of the ways that we can monitor healthy individuals and how their microbiome changes over time. This gives us a beautiful baseline from now understanding the impact of the microbiome changes when a diseased individual, if you will, comes into clinic. And we really, I think, did this project right by looking at healthy volunteers first and now moving into disease. Here is just one example of that, which I think doesn't show it particularly well, but shows the combined power of next-gen sequencing. All 400 of these MRSA genomes were essentially sequenced on one run of a Illumina sequencer. And then using phylogenetic analysis, you can see that most of the 400, these ones in green, can form to the common ST8 strain of MRSA, but there are a variety here that group together phylogenetically with sequence differences that distinguish them from the ST8 subtype and are also distinguished when we specifically look at the MLST subtyping as well. So, I'm running out of time. I won't have time to cover this last bit, but we are working very hard on envisioning a clinical sequencing pipeline using some of the human analysis tools that I've told you about today towards individuals that consent for return of information about targetable therapies for helping to treat their cancers. We have some examples already published. I mentioned the JAMA paper with the individual with acute promyelocytic leukemia who's now in remission two and a half years after we sequenced her genome and she was treated with standard induction and consolidation therapy, and is healthy and alive and back at work today. We've also sequenced a number of patients, now many of them metastatic, including a patient shown here, which is a HER2-positive disease patient, and by sequencing her genome, we can detect the extreme amplification of HER2 by sequencing her transcriptome. 
We can also show that this transcript is wildly over-expressed relative to ER-positive patients. And we can also do, if you will, conventional pathology with our sequence data from RNA showing that she's also PR-negative and ER-negative. Interestingly, this patient was already known to be HER2-positive; what we were looking for was some potential therapeutic options. And you can see a similar picture here for chromosome 6 amplification in the extreme on the DNA, amplification of expression extreme on the RNA. And this turns out to be a druggable target, a histone deacetylase, which should respond to Vorinostat, and her oncologist now has this at the ready as she may progress out of her currently stable brain metastasis that was her last data point in the clinic. We've done this for additional metastatic patients, and this is just to illustrate that, going through the genome, we can actually predict a large number of potential targets for each one of these patients. We return this information to the oncologist; it's then up to them to decide what's the best option for the patient. But, the bottom line is, is that most of the prescribed therapies are going to be so-called "off target." I think this presents a challenge for clinical paradigms, but one that we should start thinking about facing up to in the near term. And these are just some other examples of off-target drugs now for estrogen receptor-positive disease that were identified from sequencing through 50 patients in a clinical cohort trial from the American College of Surgeons Oncology Group which we're just now getting ready to publish. So, I want to thank you for your attention and just leave you with these conclusions. Hopefully I've convinced you that these approaches are revolutionizing biological research. The earliest impacts have been on cancer genomics and metagenomics, but many other types of impact that I haven't been able to cover just because it would be overwhelming. I think I also really want to emphasize, and this will make Andy happy, you know, the extreme need for bioinformatics-based analytical approaches is really still a big challenge for this. It's getting better, but it's not quite there yet, and this is the most expensive part of the sequencing. So when people say sequencing is cheap, they mean one thing, and that's generating sequence data. I would say that this analysis is still the most expensive and complicated part. We're looking in that context at now integrating through multiple data types for cancer patients, as I illustrated, and I think the clinical applications of these technologies are pressing, they're real, they're happening, and they really increase the ante, if you will, on needing good bioinformatics. Now, not just for analysis, but for what I would call interpretation of the data so that the physician ordering the test gets the maximum benefit back from having ordered that test to the benefit, ultimately, of the patient. And I think that's the biggest challenge in front of us right now, today. So, I will finish then by saying thanks to all of my colleagues back at the Genome Institute, clinical collaborators, only a few of who are listed here, and without whom we could not do our work. And I will take any final questions in the remaining minutes. Thanks for your attention. [applause] Male Speaker: [inaudible] Dr. Elaine Mardis: Yeah, you're welcome to come up and ask questions if you like, because I know people have to get on with the rest of their day. Thanks.


Current technologies

Sequencing technologies with an approach different from that of second-generation platforms were first described as "third-generation" in 2008–2009.[5]

Two companies are currently at the heart of third generation sequencing technology development: Pacific Biosciences and Oxford Nanopore Technologies. The two companies are taking fundamentally different approaches to sequencing single DNA molecules.

PacBio developed the single molecule real time (SMRT) sequencing platform, which is based on the properties of zero-mode waveguides; signals are detected as pulses of fluorescent light emitted as nucleotides are incorporated.

Oxford Nanopore's technology involves passing a DNA molecule through a nanoscale pore and measuring the changes in ionic current through the pore as the molecule translocates.
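
The principle can be illustrated with a toy sketch: the raw output of such an instrument is a current (or light) trace over time, and the first computational step is to segment that trace into discrete events before calling bases. The following Python sketch is purely illustrative; the signal values, threshold and segmentation rule are invented and bear no relation to any vendor's actual basecalling software.

```python
# Toy illustration of the single-molecule measurement principle: as successive k-mers
# occupy the pore, the current settles at different levels, and sequencing starts by
# segmenting the raw trace into those levels ("events").
# The signal values and threshold below are invented for illustration only.

def segment_events(signal, jump_threshold=5.0, min_len=3):
    """Split a raw current trace into events wherever the level jumps."""
    events, start = [], 0
    for i in range(1, len(signal)):
        if abs(signal[i] - signal[i - 1]) > jump_threshold and i - start >= min_len:
            chunk = signal[start:i]
            events.append(sum(chunk) / len(chunk))  # mean current of the event
            start = i
    chunk = signal[start:]
    events.append(sum(chunk) / len(chunk))
    return events

# A fake trace with three current levels (e.g. three k-mers moving through the pore).
trace = [100.1, 99.8, 100.3, 100.0, 85.2, 84.9, 85.1, 85.0, 92.4, 92.1, 92.3]
print(segment_events(trace))  # -> roughly [100.05, 85.05, 92.27]
```

Real basecallers replace this simple thresholding with statistical models trained on reference signal data.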

Advantages

Longer reads

In comparison to the current generation of sequencing technologies, third generation sequencing has the obvious advantage of producing much longer reads. These longer read lengths are expected to alleviate numerous computational challenges surrounding genome assembly, transcript reconstruction, and metagenomics, among other important areas of modern biology and medicine.[1]

It is well known that eukaryotic genomes, including those of primates and humans, are complex and contain large numbers of long repeated regions. Because short reads from second generation sequencing cannot span these repeats, assembly and genetic variant calling must resort to approximative strategies to infer sequences over long ranges. Paired-end reads have been leveraged by second generation sequencing to combat these limitations; however, the exact fragment lengths of read pairs are often unknown and must themselves be approximated. By making long read lengths possible, third generation sequencing technologies have clear advantages.
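
The ambiguity caused by repeats longer than the read length can be shown with a small worked example. In the sketch below (Python, with invented sequences and idealized error-free reads), two genomes that differ only in the ordering of the unique segments between repeat copies yield identical sets of short reads, while reads long enough to span a repeat distinguish them.

```python
def reads(genome, read_len):
    """All substrings of length read_len (an idealized error-free read set)."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}

R = "ATATATAT"                                         # a repeat longer than the short reads
genome_a = "GGG" + R + "TTT" + R + "CCC" + R + "AAA"
genome_b = "GGG" + R + "CCC" + R + "TTT" + R + "AAA"   # unique segments reordered

print(reads(genome_a, 5) == reads(genome_b, 5))    # True:  short reads cannot tell them apart
print(reads(genome_a, 15) == reads(genome_b, 15))  # False: reads spanning the repeat can
```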

Epigenetics

Epigenetic markers are stable and potentially heritable modifications to the DNA molecule that do not alter its sequence. An example is DNA methylation at CpG sites, which has been found to influence gene expression; histone modifications are another example. The current generation of sequencing technologies relies on laboratory techniques such as ChIP-sequencing for the detection of epigenetic markers. These techniques involve tagging the DNA strand, breaking and filtering fragments that contain markers, followed by sequencing. Third generation sequencing may enable direct detection of these markers because their signals are distinguishable from those of the four unmodified nucleotide bases.[6]

Portability and speed

Other important advantages of third generation sequencing technologies include portability and sequencing speed.[7] Since minimal sample preprocessing is required in comparison to second generation sequencing, smaller instruments can be designed. Oxford Nanopore Technologies has commercialized the MinION sequencer, which is roughly the size of a regular USB flash drive and can be used readily by connecting it to a laptop. In addition, since the sequencing process is not parallelized across regions of the genome, data can be collected and analyzed in real time. These advantages make third generation sequencing well suited to hospital settings, where quick, on-site data collection and analysis are in demand.

Challenges

Third generation sequencing, as it currently stands, faces important challenges mainly surrounding the accurate identification of nucleotide bases; error rates are still much higher than in second generation sequencing.[4] This is generally due to instability of the molecular machinery involved. For example, in PacBio's single molecule real time sequencing technology, the DNA polymerase molecule becomes increasingly damaged as the sequencing process proceeds.[4] Additionally, since the process happens quickly, the signals given off by individual bases may be blurred by signals from neighbouring bases. This poses a new computational challenge for deciphering the signals and consequently inferring the sequence. Methods such as hidden Markov models have been leveraged for this purpose with some success.[6]
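
As a concrete, heavily simplified illustration of the hidden Markov model idea, the sketch below decodes a sequence of discretized signal levels into the most probable underlying bases with the Viterbi algorithm. Every probability and signal level in it is invented for illustration; production basecallers use much richer, trained models (for example with k-mer states).

```python
# A minimal hidden Markov model decoder (Viterbi) over a toy model in which the
# hidden states are bases and the observations are discretized signal levels.
# All probabilities are invented for illustration.

import math

STATES = "ACGT"
LEVEL = {"A": "lo", "C": "mid-lo", "G": "mid-hi", "T": "hi"}
START = {s: 0.25 for s in STATES}
TRANS = {s: {t: (0.4 if s == t else 0.2) for t in STATES} for s in STATES}
EMIT = {s: {lvl: (0.7 if lvl == LEVEL[s] else 0.1) for lvl in LEVEL.values()} for s in STATES}

def viterbi(observations):
    """Return the most probable base path for a sequence of signal levels."""
    trellis = [{s: math.log(START[s] * EMIT[s][observations[0]]) for s in STATES}]
    backpointers = [{}]
    for obs in observations[1:]:
        scores, back = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: trellis[-1][p] + math.log(TRANS[p][s]))
            scores[s] = trellis[-1][prev] + math.log(TRANS[prev][s] * EMIT[s][obs])
            back[s] = prev
        trellis.append(scores)
        backpointers.append(back)
    # Trace the best path backwards through the stored pointers.
    state = max(STATES, key=lambda s: trellis[-1][s])
    path = [state]
    for back in reversed(backpointers[1:]):
        state = back[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi(["lo", "lo", "mid-hi", "hi", "hi"]))  # most probable path, here "AAGTT"
```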

On average, different individuals of the human population share about 99.9% of their genomic sequence. In other words, only approximately one out of every thousand bases differs between any two people. The high error rates involved with third generation sequencing are inevitably problematic for the purpose of characterizing such individual differences between members of the same species.
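
A back-of-the-envelope calculation makes the problem concrete. Assuming, for simplicity, a 15% per-read error rate, independent errors, and a simple majority-vote consensus, the sketch below compares raw errors to the expected rate of true differences and shows how additional coverage suppresses consensus errors; the numbers are illustrative only.

```python
# Illustrative calculation (idealized, independent errors assumed) of why a high
# per-read error rate swamps the ~0.1% true variation between individuals, and how
# consensus over multiple reads shrinks the error.

from math import comb

PER_READ_ERROR = 0.15      # assumed raw single-pass error rate (illustrative)
TRUE_VARIANT_RATE = 0.001  # ~1 difference per 1,000 bases between two people

print(f"Errors per 1,000 bases in one read: ~{1000 * PER_READ_ERROR:.0f}")
print(f"True variants per 1,000 bases:      ~{1000 * TRUE_VARIANT_RATE:.0f}")

def consensus_error(coverage, p=PER_READ_ERROR):
    """P(a strict majority of reads is wrong at a site), assuming independent errors.
    This is a simplification: real consensus calling also depends on which wrong
    base each erroneous read reports."""
    return sum(comb(coverage, k) * p**k * (1 - p)**(coverage - k)
               for k in range(coverage // 2 + 1, coverage + 1))

for cov in (1, 5, 15, 30):
    print(f"coverage {cov:2d}: consensus error ~ {consensus_error(cov):.2e}")
```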

Genome assembly

Genome assembly is the reconstruction of whole genome DNA sequences. This is generally done with two fundamentally different approaches.

Reference alignment

When a reference genome is available, as in the case of humans, newly sequenced reads can simply be aligned to the reference genome in order to characterize the properties of the sample. Such reference-based assembly is quick and easy but has the disadvantage of "hiding" novel sequences and large copy number variants. In addition, reference genomes do not yet exist for most organisms.
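
Conceptually, reference alignment amounts to finding, for each read, the position on the reference it most plausibly originated from. The toy Python sketch below does this with exact k-mer seeds and a simple vote; real long-read aligners (minimap2, for example) add seed chaining, gapped extension and quality modelling, and the sequences here are invented for illustration.

```python
# Toy sketch of reference-based placement of a read: index the reference by k-mers,
# then vote on the most consistent start position implied by the read's k-mer seeds.

from collections import Counter, defaultdict

def index_reference(ref, k):
    idx = defaultdict(list)
    for i in range(len(ref) - k + 1):
        idx[ref[i:i + k]].append(i)
    return idx

def place_read(read, idx, k):
    """Return the best-supported start position of the read on the reference."""
    votes = Counter()
    for j in range(len(read) - k + 1):
        for i in idx.get(read[j:j + k], []):
            votes[i - j] += 1                 # implied read start on the reference
    return votes.most_common(1)[0] if votes else None

reference = "ACGTTAGGCCATTACGGATCCGTA"
read = "GGCCATGACG"                           # contains a mismatch relative to the reference
idx = index_reference(reference, k=4)
print(place_read(read, idx, k=4))             # -> (6, 3): placed at reference position 6
```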

De novo assembly

De novo assembly is the alternative genome assembly approach to reference alignment. It refers to the reconstruction of whole genome sequences entirely from raw sequence reads. This method would be chosen when there is no reference genome, when the species of the given organism is unknown as in metagenomics, or when there exist genetic variants of interest that may not be detected by reference genome alignment.

Given the short reads produced by the current generation of sequencing technologies, de novo assembly is a major computational problem. It is normally approached by an iterative process of finding and connecting sequence reads with sensible overlaps. Various computational and statistical techniques, such as de Bruijn graphs and overlap-layout-consensus graphs, have been leveraged to solve this problem. Nonetheless, due to the highly repetitive nature of eukaryotic genomes, accurate and complete reconstruction of genome sequences in de novo assembly remains challenging. Paired-end reads have been posed as a possible solution, though exact fragment lengths are often unknown and must be approximated.[8]
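
The de Bruijn idea mentioned above can be sketched in a few lines: reads are decomposed into k-mers, (k-1)-mers become nodes, and assembly becomes a walk through the resulting graph. The example below uses tiny invented, error-free reads and a naive walk; it is meant only to show the data structure, not a usable assembler.

```python
# Minimal de Bruijn graph sketch: nodes are (k-1)-mers, edges are k-mers drawn from
# the reads. Assemblers then look for paths through this graph; repeats show up as
# nodes with multiple outgoing edges.

from collections import defaultdict

def de_bruijn(reads, k):
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])   # edge: prefix -> suffix
    return graph

def walk(graph, start):
    """Greedy walk that follows unused edges; enough for this tiny repeat-free example."""
    contig, node = start, start
    edges = {n: list(vs) for n, vs in graph.items()}
    while edges.get(node):
        node = edges[node].pop()
        contig += node[-1]
    return contig

reads = ["ACGTC", "CGTCA", "GTCAG", "TCAGT"]
graph = de_bruijn(reads, k=4)
print(dict(graph))
print(walk(graph, "ACG"))   # reconstructs "ACGTCAGT" from the overlapping reads
```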

 Hybrid assembly - the use of reads from third-generation sequencing platforms together with short reads from second-generation platforms - may be used to resolve ambiguities that exist in genomes previously assembled using second generation sequencing. Short second generation reads have also been used to correct errors that exist in the long third generation reads.

Hybrid assembly

Long read lengths offered by third generation sequencing may alleviate many of the challenges currently faced by de novo genome assembly. For example, if an entire repetitive region can be sequenced unambiguously in a single read, no computational inference is required. Computational methods have also been proposed to alleviate the issue of high error rates; for example, one study demonstrated that de novo assembly of a microbial genome using PacBio sequencing alone outperformed assembly from second generation sequencing.[9]

Third generation sequencing may also be used in conjunction with second generation sequencing, an approach often referred to as hybrid sequencing. For example, long reads from third generation sequencing may be used to resolve ambiguities that exist in genomes previously assembled using second generation sequencing. Conversely, short second generation reads have been used to correct errors that exist in the long third generation reads. In general, this hybrid approach has been shown to improve de novo genome assemblies significantly.[10]
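
The error-correction half of the hybrid approach can be caricatured as a pileup vote: accurate short reads placed on a noisy long read decide, column by column, which base is correct. The sketch below assumes the placement has already been computed and that errors are substitutions only, both strong simplifications of what real hybrid-correction tools do; all sequences are invented.

```python
# Sketch of the idea behind hybrid error correction: accurate short reads that have
# been placed onto a noisy long read vote, column by column, on the correct base.

from collections import Counter

def polish(long_read, placed_short_reads):
    """placed_short_reads: list of (offset_on_long_read, sequence)."""
    columns = [Counter() for _ in long_read]
    for offset, seq in placed_short_reads:
        for i, base in enumerate(seq):
            columns[offset + i][base] += 1
    corrected = []
    for original, column in zip(long_read, columns):
        # Take the pileup consensus where short reads cover the position;
        # keep the original base where they do not.
        corrected.append(column.most_common(1)[0][0] if column else original)
    return "".join(corrected)

noisy_long = "ACGTTAGCAATCGGA"        # suppose (0-based) positions 3 and 9 are errors
short_reads = [(0, "ACGATAGC"), (4, "TAGCATTC"), (7, "CATTCGGA")]
print(polish(noisy_long, short_reads))  # -> "ACGATAGCATTCGGA"
```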

Epigenetic markers

DNA methylation (DNAm) – the covalent modification of DNA at CpG sites resulting in attached methyl groups – is the best understood component of the epigenetic machinery. DNA modifications, and the resulting gene expression, can vary across cell types and developmental stages, differ with genetic ancestry, change in response to environmental stimuli, and are heritable. Since the discovery of DNAm, researchers have also found correlations with diseases such as cancer and autism.[11] In this disease etiology context, DNAm is an important avenue of further research.

Advantages

The most common current methods for examining methylation state require an assay that fragments DNA before standard second generation sequencing on the Illumina platform. As a result of the short read length, information about longer-range patterns of methylation is lost.[6] Third generation sequencing technologies offer the capability of single molecule real-time sequencing of longer reads, and of detecting DNA modification without the aforementioned assay.[12]

 PacBio SMRT technology and Oxford Nanopore can use unaltered DNA to detect methylation.

Oxford Nanopore Technologies’ MinION has been used to detect DNAm. As each DNA strand passes through a pore, it produces electrical signals which have been found to be sensitive to epigenetic changes in the nucleotides, and a hidden Markov model (HMM) was used to analyze MinION data to detect 5-methylcytosine (5mC) DNA modification.[6] The model was trained using synthetically methylated E. coli DNA and the resulting signals measured by the nanopore technology. Then the trained model was used to detect 5mC in MinION genomic reads from a human cell line which already had a reference methylome. The classifier has 82% accuracy in randomly sampled singleton sites, which increases to 95% when more stringent thresholds are applied.[6]

Other methods address different types of DNA modifications using the MinION platform. Stoiber et al. examined 4-methylcytosine (4mC) and 6-methyladenine (6mA), along with 5mC, and also created software to visualize the raw MinION data in a human-friendly way.[13] They found that in E. coli, which has a known methylome, event windows of five base pairs can be used to divide and statistically analyze the raw MinION electrical signals. A straightforward Mann-Whitney U test can detect modified portions of the E. coli sequence, as well as further classify the modifications into 4mC, 6mA or 5mC regions.[13]
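
The following sketch reproduces the spirit of that window-based test on synthetic numbers: signal values from a window in a native sample are compared against the same window in an unmodified control with a Mann-Whitney U test, and the window is flagged if the distributions differ. The values, window and significance cut-off are invented and are not drawn from real MinION data.

```python
# Window-based comparison of raw signal distributions between a native sample and an
# unmodified control, using a Mann-Whitney U test. All numbers are synthetic.

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
control_window = rng.normal(loc=100.0, scale=2.0, size=50)   # unmodified DNA
native_window = rng.normal(loc=103.0, scale=2.0, size=50)    # shifted by a putative 5mC

stat, p_value = mannwhitneyu(control_window, native_window, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.2e}")
if p_value < 0.01:
    print("Window flagged as potentially modified")
```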

It seems likely that in the future, MinION raw data will be used to detect many different epigenetic marks in DNA.

PacBio sequencing has also been used to detect DNA methylation. On this platform the pulse width - the width of a fluorescent light pulse - corresponds to a specific base. In 2010 it was shown that the interpulse distance in control and methylated samples differs, and that there is a "signature" pulse width for each methylation type.[12] In 2012, the binding sites of DNA methyltransferases were characterized using the PacBio platform.[14] The detection of N6-methyladenine in C. elegans was shown in 2015.[15] DNA methylation on N6-adenine in mouse embryonic stem cells was demonstrated using the PacBio platform in 2016.[16]
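
A simplified version of this kinetic comparison is sketched below: per-position mean interpulse durations (IPDs) in a native sample are divided by those in a modification-free control, and positions with an elevated ratio are flagged as candidate modifications. The IPD values and the cut-off are invented; production tools apply far more careful statistics.

```python
# Simplified illustration of kinetic detection: compare the mean interpulse duration
# at each position in a native sample to that in an amplified (modification-free)
# control, and flag positions with an elevated ratio. Numbers and the 2.0 cut-off
# are invented for illustration.

def ipd_ratios(native_ipds, control_ipds):
    """Both inputs: {position: [IPD observations in seconds]}."""
    ratios = {}
    for pos in native_ipds:
        native_mean = sum(native_ipds[pos]) / len(native_ipds[pos])
        control_mean = sum(control_ipds[pos]) / len(control_ipds[pos])
        ratios[pos] = native_mean / control_mean
    return ratios

native = {101: [0.9, 1.1, 1.0], 102: [2.6, 2.4, 2.9], 103: [1.0, 0.8, 1.1]}
control = {101: [1.0, 0.9, 1.1], 102: [1.0, 1.1, 0.9], 103: [0.9, 1.0, 1.0]}

for pos, ratio in ipd_ratios(native, control).items():
    flag = "candidate modification" if ratio > 2.0 else ""
    print(pos, round(ratio, 2), flag)
```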

Other forms of DNA modifications – from heavy metals, oxidation, or UV damage – are also possible avenues of research using Oxford Nanopore and PacBio third generation sequencing.

Drawbacks

Processing of the raw data – such as normalization to the median signal – was needed on MinION raw data, reducing the real-time capability of the technology.[13] Consistency of the electrical signals is still an issue, making it difficult to accurately call a nucleotide. MinION throughput is also low; since multiple overlapping reads are hard to obtain, this further compounds the accuracy problems of downstream DNA modification detection. Both the hidden Markov model and the statistical methods used with MinION raw data require repeated observations of DNA modifications for detection, meaning that individual modified nucleotides need to be consistently present in multiple copies of the genome, e.g. in multiple cells or plasmids in the sample.

For the PacBio platform too, coverage requirements vary depending on the type of methylation to be detected. As of March 2017, other epigenetic factors such as histone modifications could not be detected using third-generation technologies. Longer-range patterns of methylation are often lost because smaller contigs still need to be assembled.

Transcriptomics

Transcriptomics is the study of the transcriptome, usually by characterizing the relative abundances of messenger RNA (mRNA) molecules in the tissue under study. According to the central dogma of molecular biology, genetic information flows from double stranded DNA molecules to single stranded mRNA molecules, which can be readily translated into functional protein molecules. By studying the transcriptome, one can gain valuable insight into the regulation of gene expression.

While expression levels at the gene level can be more or less accurately depicted by second generation sequencing, transcript-level information remains an important challenge.[17] As a consequence, the role of alternative splicing in molecular biology remains largely elusive. Third generation sequencing technologies hold promising prospects for resolving this issue by enabling sequencing of mRNA molecules at their full lengths.

Alternative splicing

Alternative splicing (AS) is the process by which a single gene may give rise to multiple distinct mRNA transcripts and consequently different protein translations.[18] Some evidence suggests that AS is a ubiquitous phenomenon and may play a key role in determining the phenotypes of organisms, especially in complex eukaryotes; all eukaryotes contain genes consisting of introns that may undergo AS. In particular, it has been estimated that AS occurs in 95% of all human multi-exon genes.[19] AS has undeniable potential to influence myriad biological processes. Advancing knowledge in this area has critical implications for the study of biology in general.

Transcript reconstruction

The current generation of sequencing technologies produces only short reads, placing tremendous limitations on the ability to detect distinct transcripts; short reads must be reverse engineered into the original transcripts that could have given rise to the observed reads.[20] This task is further complicated by the highly variable expression levels across transcripts, and consequently variable read coverage across the sequence of the gene.[20] In addition, exons may be shared among individual transcripts, rendering unambiguous inferences essentially impossible.[18] Existing computational methods make inferences based on the accumulation of short reads at various sequence locations, often by making simplifying assumptions.[20] Cufflinks takes a parsimonious approach, seeking to explain all the reads with the fewest possible number of transcripts.[21] StringTie, on the other hand, attempts to estimate transcript abundances while simultaneously assembling the reads.[20] These methods, while reasonable, may not always identify real transcripts.
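
The parsimony idea can be caricatured with a greedy set-cover sketch: choose the fewest candidate isoforms that explain every observed exon-exon junction. This is not the algorithm Cufflinks actually uses (it formulates a minimum path cover over a fragment overlap graph), and the candidate isoforms and junctions below are invented for illustration.

```python
# Greedy caricature of the parsimony objective: given candidate transcripts (as the
# sets of exon-exon junctions they would produce) and the junctions observed in the
# reads, pick the fewest candidates that explain every observation.

def fewest_transcripts(candidates, observed_junctions):
    """candidates: {name: set of junctions the transcript would produce}."""
    chosen, uncovered = [], set(observed_junctions)
    while uncovered:
        # Pick the candidate explaining the most still-unexplained junctions.
        name = max(candidates, key=lambda n: len(candidates[n] & uncovered))
        if not candidates[name] & uncovered:
            break                      # nothing explains the rest
        chosen.append(name)
        uncovered -= candidates[name]
    return chosen, uncovered

candidates = {
    "isoform_1": {("e1", "e2"), ("e2", "e3"), ("e3", "e4")},
    "isoform_2": {("e1", "e3"), ("e3", "e4")},
    "isoform_3": {("e1", "e2"), ("e2", "e4")},
}
observed = {("e1", "e2"), ("e2", "e3"), ("e3", "e4"), ("e1", "e3")}
print(fewest_transcripts(candidates, observed))
# -> (['isoform_1', 'isoform_2'], set()): two isoforms explain all observed junctions
```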

A study published in 2013 compared 25 different existing transcript reconstruction protocols.[17] Its evidence suggested that existing methods are generally weak in assembling transcripts, though the ability to detect individual exons is relatively intact.[17] According to the estimates, the average sensitivity for detecting exons across the 25 protocols is 80% for Caenorhabditis elegans genes.[17] In comparison, transcript identification sensitivity decreases to 65%. For human genes, the study reported an exon detection sensitivity averaging 69%, and a transcript detection sensitivity averaging a mere 33%.[17] In other words, for human, existing methods are able to identify less than half of all existing transcripts.

Third generation sequencing technologies have demonstrated promising prospects for solving the problem of transcript detection as well as mRNA abundance estimation at the level of transcripts. While error rates remain high, third generation sequencing technologies have the capability to produce much longer read lengths.[22] Pacific Biosciences has introduced the Iso-Seq protocol, proposing to sequence mRNA molecules at their full lengths.[22] It is anticipated that Oxford Nanopore will put forth similar technologies. The trouble with higher error rates may be alleviated by supplementary high quality short reads. This approach has been previously tested and reported to reduce the error rate by more than threefold.[23]

Metagenomics

Metagenomics is the analysis of genetic material recovered directly from environmental samples.

Advantages

The main advantage of third-generation sequencing technologies in metagenomics is their speed of sequencing in comparison to second generation techniques. Speed of sequencing is important, for example, in the clinical setting of pathogen identification, where it allows for efficient diagnosis and timely clinical action.

Oxford Nanopore's MinION was used in 2015 for real-time metagenomic detection of pathogens in complex, high-background clinical samples. The first Ebola virus (EBOV) read was sequenced 44 seconds after data acquisition began.[24] Mapping of reads to the genome was uniform, with at least one read mapping to more than 88% of the genome. The relatively long reads allowed a near-complete viral genome to be sequenced to high accuracy (97–99% identity) directly from a primary clinical sample.[24]
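
The breadth-of-coverage figure quoted above (at least one read mapping to more than 88% of the genome) is straightforward to compute from mapped read intervals, as in the sketch below; the genome length used is roughly that of the Ebola virus genome, and the read intervals are invented for illustration.

```python
# Sketch of a "breadth of coverage" calculation: given the reference length and the
# intervals covered by mapped reads, compute the fraction of positions touched by at
# least one read. Interval coordinates are invented for illustration.

def breadth_of_coverage(genome_length, mapped_intervals):
    """mapped_intervals: list of (start, end) half-open reference coordinates."""
    covered = 0
    last_end = 0
    for start, end in sorted(mapped_intervals):
        start = max(start, last_end)          # ignore overlap with what's already counted
        if end > start:
            covered += end - start
            last_end = end
    return covered / genome_length

# Toy viral genome of ~19 kb with three mapped long reads.
intervals = [(0, 7200), (6900, 13500), (14000, 18959)]
print(f"{breadth_of_coverage(18959, intervals):.1%} of the genome covered by >= 1 read")
```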

A common phylogenetic marker for microbial community diversity studies is the 16S ribosomal RNA gene. Both MinION and PacBio's SMRT platform have been used to sequence this gene.[25][26] In this context the PacBio error rate was comparable to that of shorter reads from 454 and Illumina's MiSeq sequencing platforms.

Drawbacks

MinION's high error rate (~10–40%) prevented the identification of antimicrobial resistance markers, for which single nucleotide resolution is necessary. For the same reason, eukaryotic pathogens were not identified.[24] Ease of carryover contamination when re-using the same flow cell (standard wash protocols do not work) is also a concern. Unique barcodes may allow for more multiplexing. Furthermore, performing accurate species identification for bacteria, fungi and parasites is very difficult, as they share larger portions of their genomes, and some differ only by less than 5%.

The per-base sequencing cost is still significantly higher than that of MiSeq. However, third generation sequencing offers the prospect of supplementing reference databases with full-length sequences from organisms below the limit of detection of the Sanger approach;[25] this could greatly help the identification of organisms in metagenomics.

Before third generation sequencing can be used reliably in the clinical context, there is a need for standardization of lab protocols. These protocols are not yet as optimized as PCR methods.

References

  1. ^ a b c Bleidorn, Christoph (2016-01-02). "Third generation sequencing: technology and its potential impact on evolutionary biodiversity research". Systematics and Biodiversity. 14 (1): 1–8. doi:10.1080/14772000.2015.1099575. ISSN 1477-2000. 
  2. ^ "Illumina sequencing technology" (PDF). 
  3. ^ a b c Treangen, Todd J.; Salzberg, Steven L. (2012-01-01). "Repetitive DNA and next-generation sequencing: computational challenges and solutions". Nature Reviews Genetics. 13 (1): 36–46. doi:10.1038/nrg3117. ISSN 1471-0056. PMC 3324860. PMID 22124482.
  4. ^ a b c Gupta, Pushpendra K. (2008-11-01). "Single-molecule DNA sequencing technologies for future genomics research". Trends in Biotechnology. 26 (11): 602–611. doi:10.1016/j.tibtech.2008.07.003. 
  5. ^ Check Hayden, Erika (2009-02-06). "Genome sequencing: the third generation". Nature News. doi:10.1038/news.2009.86. 
  6. ^ a b c d e Simpson, Jared T.; Workman, Rachael; Zuzarte, Philip C.; David, Matei; Dursi, Lewis Jonathan; Timp, Winston (2016-04-04). "Detecting DNA Methylation using the Oxford Nanopore Technologies MinION sequencer". bioRxiv 047142.
  7. ^ Schadt, E. E.; Turner, S.; Kasarskis, A. (2010-10-15). "A window into third-generation sequencing". Human Molecular Genetics. 19 (R2): R227–R240. doi:10.1093/hmg/ddq416. ISSN 0964-6906. 
  8. ^ Li, Ruiqiang; Zhu, Hongmei; Ruan, Jue; Qian, Wubin; Fang, Xiaodong; Shi, Zhongbin; Li, Yingrui; Li, Shengting; Shan, Gao (2010-02-01). "De novo assembly of human genomes with massively parallel short read sequencing". Genome Research. 20 (2): 265–272. doi:10.1101/gr.097261.109. ISSN 1088-9051. PMC 2813482. PMID 20019144.
  9. ^ Chin, Chen-Shan; Alexander, David H.; Marks, Patrick; Klammer, Aaron A.; Drake, James; Heiner, Cheryl; Clum, Alicia; Copeland, Alex; Huddleston, John (2013-06-01). "Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data". Nature Methods. 10 (6): 563–569. doi:10.1038/nmeth.2474. ISSN 1548-7091. 
  10. ^ Goodwin, Sara; Gurtowski, James; Ethe-Sayers, Scott; Deshpande, Panchajanya; Schatz, Michael C.; McCombie, W. Richard (2015-11-01). "Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome". Genome Research. 25 (11): 1750–1756. doi:10.1101/gr.191395.115. ISSN 1088-9051. PMC 4617970. PMID 26447147.
  11. ^ Fraser, Hunter B.; Lam, Lucia L.; Neumann, Sarah M.; Kobor, Michael S. (2012-02-09). "Population-specificity of human DNA methylation". Genome Biology. 13 (2): R8. doi:10.1186/gb-2012-13-2-r8. ISSN 1474-760X. PMC 3334571. PMID 22322129.
  12. ^ a b Flusberg, Benjamin A.; Webster, Dale R.; Lee, Jessica H.; Travers, Kevin J.; Olivares, Eric C.; Clark, Tyson A.; Korlach, Jonas; Turner, Stephen W. (2010-06-01). "Direct detection of DNA methylation during single-molecule, real-time sequencing". Nature Methods. 7 (6): 461–465. doi:10.1038/nmeth.1459. PMC 2879396. PMID 20453866.
  13. ^ a b c Stoiber, Marcus H.; Quick, Joshua; Egan, Rob; Lee, Ji Eun; Celniker, Susan E.; Neely, Robert; Loman, Nicholas; Pennacchio, Len; Brown, James B. (2016-12-15). "De novo Identification of DNA Modifications Enabled by Genome-Guided Nanopore Signal Processing". bioRxiv 094672.
  14. ^ Clark, T. A.; Murray, I. A.; Morgan, R. D.; Kislyuk, A. O.; Spittle, K. E.; Boitano, M.; Fomenkov, A.; Roberts, R. J.; Korlach, J. (2012-02-01). "Characterization of DNA methyltransferase specificities using single-molecule, real-time DNA sequencing". Nucleic Acids Research. 40 (4): e29–e29. doi:10.1093/nar/gkr1146. ISSN 0305-1048. PMC 3287169. PMID 22156058.
  15. ^ Greer, Eric Lieberman; Blanco, Mario Andres; Gu, Lei; Sendinc, Erdem; Liu, Jianzhao; Aristizábal-Corrales, David; Hsu, Chih-Hung; Aravind, L.; He, Chuan. "DNA Methylation on N6-Adenine in C. elegans". Cell. 161 (4): 868–878. doi:10.1016/j.cell.2015.04.005. PMC 4427530. PMID 25936839.
  16. ^ Wu, Tao P.; Wang, Tao; Seetin, Matthew G.; Lai, Yongquan; Zhu, Shijia; Lin, Kaixuan; Liu, Yifei; Byrum, Stephanie D.; Mackintosh, Samuel G. (2016-04-21). "DNA methylation on N6-adenine in mammalian embryonic stem cells". Nature. 532 (7599): 329–333. doi:10.1038/nature17640. ISSN 0028-0836. PMC 4977844. PMID 27027282.
  17. ^ a b c d e Steijger, Tamara; Abril, Josep F.; Engström, Pär G.; Kokocinski, Felix; The RGASP Consortium; Hubbard, Tim J.; Guigó, Roderic; Harrow, Jennifer; Bertone, Paul (2013-12-01). "Assessment of transcript reconstruction methods for RNA-seq". Nature Methods. 10 (12): 1177–1184. doi:10.1038/nmeth.2714. ISSN 1548-7091. PMC 3851240. PMID 24185837.
  18. ^ a b Graveley, Brenton R. "Alternative splicing: increasing diversity in the proteomic world". Trends in Genetics. 17 (2): 100–107. doi:10.1016/s0168-9525(00)02176-4. 
  19. ^ Pan, Qun; Shai, Ofer; Lee, Leo J.; Frey, Brendan J.; Blencowe, Benjamin J. (2008-12-01). "Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing". Nature Genetics. 40 (12): 1413–1415. doi:10.1038/ng.259. ISSN 1061-4036. 
  20. ^ a b c d Pertea, Mihaela; Pertea, Geo M.; Antonescu, Corina M.; Chang, Tsung-Cheng; Mendell, Joshua T.; Salzberg, Steven L. (2015-03-01). "StringTie enables improved reconstruction of a transcriptome from RNA-seq reads". Nature Biotechnology. 33 (3): 290–295. doi:10.1038/nbt.3122. ISSN 1087-0156. PMC 4643835. PMID 25690850.
  21. ^ Trapnell, Cole; Williams, Brian A.; Pertea, Geo; Mortazavi, Ali; Kwan, Gordon; van Baren, Marijke J.; Salzberg, Steven L.; Wold, Barbara J.; Pachter, Lior (2010-05-01). "Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation". Nature Biotechnology. 28 (5): 511–515. doi:10.1038/nbt.1621. ISSN 1087-0156. PMC 3146043. PMID 20436464.
  22. ^ a b Abdel-Ghany, Salah E.; Hamilton, Michael; Jacobi, Jennifer L.; Ngam, Peter; Devitt, Nicholas; Schilkey, Faye; Ben-Hur, Asa; Reddy, Anireddy S. N. (2016-06-24). "A survey of the sorghum transcriptome using single-molecule long reads". Nature Communications. 7. doi:10.1038/ncomms11706. ISSN 2041-1723. PMC 4931028. PMID 27339290.
  23. ^ Au, Kin Fai; Underwood, Jason G.; Lee, Lawrence; Wong, Wing Hung (2012-10-04). "Improving PacBio Long Read Accuracy by Short Read Alignment". PLOS ONE. 7 (10): e46679. doi:10.1371/journal.pone.0046679. ISSN 1932-6203. PMC 3464235. PMID 23056399.
  24. ^ a b c Greninger, Alexander L.; Naccache, Samia N.; Federman, Scot; Yu, Guixia; Mbala, Placide; Bres, Vanessa; Stryke, Doug; Bouquet, Jerome; Somasekar, Sneha (2015-01-01). "Rapid metagenomic identification of viral pathogens in clinical samples by real-time nanopore sequencing analysis". Genome Medicine. 7: 99. doi:10.1186/s13073-015-0220-9. ISSN 1756-994X. PMC 4587849. PMID 26416663.
  25. ^ a b Schloss, Patrick D.; Jenior, Matthew L.; Koumpouras, Charles C.; Westcott, Sarah L.; Highlander, Sarah K. (2016-01-01). "Sequencing 16S rRNA gene fragments using the PacBio SMRT DNA sequencing system". PeerJ. 4: e1869. doi:10.7717/peerj.1869. PMC 4824876. PMID 27069806.
  26. ^ Benítez-Páez, Alfonso; Portune, Kevin J.; Sanz, Yolanda (2016-01-01). "Species-level resolution of 16S rRNA gene amplicons sequenced through the MinION™ portable nanopore sequencer". GigaScience. 5: 4. doi:10.1186/s13742-016-0111-z. ISSN 2047-217X. PMC 4730766. PMID 26823973.