Biomedical Computation Review

From Wikipedia, the free encyclopedia

Biomedical Computation Review
Discipline: Computational biology
Language: English
Publication details
History: 2005 to present
Publisher: Mobilize Center, Stanford University (United States)
Frequency: Quarterly
Open access: Yes
Standard abbreviations
ISO 4: Biomed. Comput. Rev.
Indexing
ISSN: 1557-3192

Biomedical Computation Review (BCR) is a quarterly, open-access magazine funded by the National Institutes of Health[1] and published by the Mobilize Center at Stanford University.[2] First published in 2005, BCR covers such topics as molecular dynamics, genomics, proteomics, physics-based simulation, systems biology, and other research involving computational biology. BCR's articles are targeted to those with a general science or biology background, in order to build a community among biomedical computational researchers who come from a variety of disciplines.

YouTube Encyclopedic

  • Dr. Daoud Meerzaman: Computational Tools for Cancer Genome Analysis
  • Novartis Uses AWS to Conduct 39 Years of Computational Chemistry In 9 Hours
  • Gibbons Series 2016 01: Computational Physiology

Transcription

>> George Komatsoulis: Thank you for joining us today for the NCI CBIIT Speaker Series. I'm George Komatsoulis, Interim Director here at the NCI Center for Biomedical Informatics Information Technology. The Speaker Series is intended to be a knowledge sharing forum featuring both internal and external speakers on topics of interest to the biomedical informatics and to the broader research community. Today we are happy to welcome our very own Dr. Daoud Meerzaman, Section Head of Computational Genomics Research at CBIIT. The title of Dr. Meerzaman's presentation is Computational Tools for Cancer Genome Analysis. Dr. Meerzaman previously served as the Scientific Project Manager at the Center for Cancer Research at NCI and is also an adjunct faculty member at the George Washington University. Let me also remind you that this presentation will be available on the Wiki for the Speaker Series as a screen cast with voiceover and also posted on the NCI YouTube channel and if you Google NCI Speaker Series you'll find that Wiki page. Just to let everyone know there will be a slight delay in the availability of these materials as there are collaborators who must approve the release of this information before we can make it generally available to the public. And also we have to spend some time making these materials Section 508 compliant and generally available for all to download. I have two sort of very, very brief sort of housekeeping things for those are physically here. Every power jack on the table is live, so if you need to charge your laptops or whatever, go ahead. And if you're going to ask questions later on and you are in the room, please you're going to need to push the button on the microphone to ask a question. And with that I'll turn the event over to Dr. Meerzaman. >> Dr. Daoud Meerzaman: Alright, thanks very much George. Before I actually dive in into my presentation can you hand-- okay. >> Yeah, speak up. Dr. Daoud Meerzaman: Yeah, I will. 
Not only am I going to speak up but I'm going to probably go a little faster also. So, what I wanted to actually do is before I start my presentation I wanted to go ahead and acknowledge my team that I've been working with and I'm lucky enough to have such a great and talented group of people in my team. Chunhua, Qing-Rong, Ying, Chih Hao, Cu, Richard Finney, Xiaoping and our two new member interns Brian and Jeffrey and, of course, none of this would happen without the support of George Komatsoulis, Interim Director. We really appreciate his help in supporting our projects and our vision. So, with that said again, I think I have a little more than I was supposed to put in here. So, I might go a little fast in terms of covering most of the material. So, what I will do is I will give a brief overview of what it is we are doing in general and then I can give you a little more detail about some of the tools that we are actually-- we have developed and we are in the process of developing. So, the CBIIT Computational Genomics Research is part of the-- it is part of the CBIIT support group and our main focus is to carry out comprehensive and integrative data analyses both for array and next generation sequencing. I didn't put it there but we do a great deal of statistical analysis as well. And we do this through collaboration with NCI, NIH as well as extramural groups outside of NIH and I'll speak a little bit about each one of those later. We also develop bioinformatics tools, and this is a big part of our group actually. In the past, we have also done this and now we're still in the process of developing new algorithms and new tools for genomic analysis, specifically cancer genomic analysis. So, just very quickly I'd like to actually go and talk a little bit about some of our collaborators, some of our collaborations.
The very first one is the TARGET and it's an NCI initiative project that we're dealing with AML (acute myelogenous leukemia), NBL (neuroblastoma), Wilms tumor and the type of analysis that we're carrying out is including RNA Seq, Exome, SNP6, methylation and in addition to that we also are dealing with some of TCGA groups, specifically we'll be working with the GBM, the Glioblastoma multiforme as well as the breast cancer consortium. And again we have done analysis for the RNA Seq, whole genome sequencing and SNP6. We are also very actively involved with the CCR and just getting involved with DCEG. CCR is specifically with Maxwell Lee, sitting in the audience and also we've been working with Gordon Hager trying to set up some work, but we're waiting for some sequencing. But we have been also working with Dena Singer trying to provide some bioinformatic support as well as other-- Jabi Kahn [assumed spelling] is also part of that group actually. And so the DCEG again we're just initiating this collaboration and it's at work. We also have specific collaboration with James Dorsch's [assumed spelling] group and Munk [assumed spelling] who is actually working on the NCI 60 cell line and NCI-Frederick and basically what we are looking for is the anti-cancer drug response. And we have done a lot of expression arrays and SNP6 analysis for them as well. Genotype to Phenotype is one of the newest projects that we got involved in. This is headed by Mickey Williams and again we are looking at the ovarian cancer and trying to actually analyze a lot of the Exomes, about 68, 67 samples that they did Exomes on. Last, but not least, is the Tata Memorial Cancer Centre, one of the largest cancer centers in India, that George has initiated this and we are basically working on the progesterone treated breast cancer patient and we're specifically looking at the RNA Seq and trying to determine the effect of the progesterone on these patients.
So this is just an overview of some of the projects that we are dealing with, but here is a little more detail; again, not to share any of the data yet, but just to give you a snapshot of what kind of analysis we're carrying out. So, we're using our in-house, our in-house tool called Bambino for variant calls. And this is for the TARGET Wilms tumor by the way. It's headed by Dr. Elizabeth Pearlman in Northwestern. We also have and I think Warren probably knows her and then we have the-- we have then also-- we have used the R software packages in order to carry out methylation profiling for these samples and also we have used the Broad CBS algorithm as well as the GISTIC trying to determine the copy number and the loss of heterozygosity and copy number neutral loss of heterozygosity. Now once we get all that done then what we need to do is to actually do an integrative analysis, a system biology type of integrative analysis where we look at the correlation between methylation gene expression, mutation and copy number. That is the final goal obviously. Finally, TCGA GBM Imaging Project this is headed by Dr. Max Wintermark and he is a PI at University of Virginia as part of the TCGA GBM group and what he wanted us to do was he actually provided us with some data and he wanted us to do some statistical analysis and create a model to actually see the correlation between patient survival and various type of features, one of them being clinical attributes, which includes age, race and date of diagnosis, relapse and so on and so forth. Also VASRI features, these are radiological imaging, genomic variations, which include copy number, mutation and expression. And what we did was we actually did all of those association correlation but then we did different iterations of those associations. 
We also looked at the VASRI feature with basically looking at the patient with the clinical, patient with the VASRI, patient with Genomic variation and then combined them as step-on Cox model and then check the VASRI and clinical and VASRI and genomic, but at the end what we found was when you combine all this together VASRI, clinical features as well as genomic variation you get the best predictor model for the survival. This paper is right now being reviewed in Journal of Radiology and hoping that it will be accepted soon. Our AML group is the target AML it's in Fred Hutchinson it's headed by Soheil Meshinchi and we have actually worked with the RNA Seq project and here we are actually trying to do a landscape of the transcript and basically trying to identify all the species of transcript including the messenger RNA, non- coding RNA, small RNA and mRNA, whatever RNA actually falls within that category. And within that we also are hoping and we have actually done this already, the differentially expressed genes at the gene level, at the exome level and most importantly something that the RNA Seq allows us is the usage of the exomes. This is something we weren't able to do in the past with arrays. So, we are all-- we have actually done this analysis and we have submitted the results of them and they seem to be very happy with that waiting for additional information. We also are waiting for normal which will allow us to do base substitution, insertion, deletion and allele expression. What we have also done for this project is the actually the fusion protein. We have been able to identify a total of 66 different fusion proteins that they have actually done with lab verification or validation and they seem to be validated. The question right now is whether there's a real fusion or they're just read-through and we're in the process of actually figuring that out. But right now that is where the actual project is sitting at this moment. 
So, I don't usually like to use cartoons but I thought this was kind of cute because somebody spent all their lifetime trying to figure something out and the, of course, one guy came with Twitter and with just 140 characters they were well, you're wrong. [laughter] With that said, what I'm hoping to actually convey here is that what I'm going to show you is specifically our tools and our algorithms that we developed here in our-- in here in CBIIT. So, this by no means, by no means, it actually means that we are trying to promote this or we think that anybody else's product is not good or ours is better. We're just going to show you what we have developed and we're using this along with many other tools and algorithms that are available in the scientific community. But again, please take that in account, this is just our tools, that's why we like to use them and there is-- I'm not surveying other tools that are available. Some of them I do mention, but not a whole lot of them are available. So, in-house tools and algorithms and I'm going to actually go through each one of these briefly. So, the CGR has one tool that is called the Pathway Interaction Database and I'll go into a little bit more detail on that. We also have another tool that is called PathOlogist. And this is not the traditional pathologist, but it's a pathway analysis. We also have Pathway of Distinction Analysis also known as PoDA. We also have developed another tool called Bambino and Cancer Genomic Work Bench and I'm going to actually go into a little bit more detail on each one of them so you don't have to worry about what they are, because some of the names are kind of odd. Gene Fusion Viewer, as well as OmicCircos, this is the final, the newest member of the tools and algorithms. So, let me just go ahead and dive in and talk about the Pathway Interaction analysis. 
So, this was initiated or developed initially by Carl Schaefer who was part of the CBIIT and now he's retired and he actually worked with the Nature Publishing Group in trying to develop this and what is exactly PID. And by the way, it's a published unit so if I miss something here or if you feel like you really need additional information you can go to this publication and it should be useful. So, here is what PID actually represents. It's basically a free collection of curated, peer-reviewed knowledge about the molecular signaling and regulatory events in pathways. So, the Nature Publishing Group had been doing this until September of last year, but they stopped now curating it and we actually took that job now. I have actually hired or am in the process of hiring a fellow who is going to take care of that process of curation and hopefully this will happen within a few weeks and we'll continue the curation. It's going to be a bit slower, but the hope is that it will continue and actually this is probably a great time to also mention that we are inviting the community, scientific community to actually tell us what are the pathways that they are interested in and we'll work with the program at CBIIT to identify those specific pathways in order to prioritize those in terms of which ones to focus on. So, PID provides information about transcription, translocation as well as modification and interaction. It also provides a predetermined pathway or you can actually create interaction network using one single gene or potentially put a list of genes and you can identify pathways for an entire list of genes. And what I want to do is actually very quickly, I've always been against this, I never, ever told people to do this but I'm going to show you-- at least I'm going to the website, hopefully this will work, very quickly show you some of the features that it has. If not then that's okay. We can actually skip that part.
It's not that big of a deal, but here is the-- I thought I had it up so hopefully-- okay so here is what it looks like basically. And one of the things that I actually, I guess is this a pointer also? >> I have no idea. >> No. >> Oh, yes it is. >> George Komatsoulis: Yes, yes there's a dot on the wall. >> Yes it is pointing. >> Dr. Daoud Meerzaman: Oh well I guess it doesn't point in here, but that's okay. So-- >> George Komatsoulis: There it is. >> But it won't show up on the visual. >> George Komatsoulis: Oh okay. >> Dr. Daoud Meerzaman: Which is perfectly fine, yeah, that's fine. Actually it consists of 137 pathways that are curated by NCI and Nature and about 322 pathways that were basically transferred from BioCarta and Reactome. And what you can do is simply put the name of the gene of interest, you know let's say SHH at this point is one of the genes that I was working on and you can actually click and you have a number of different pathways that will show up. And this one is specifically the one that I am interested in. You can just go in and basically it gives you an incredibly great deal of information that is all curated, at least at this point, by Nature Publishing. So, some of the things that are important here is that you can actually see the green showing the positive regulator and then this part the tool is basically a transcription right here. This is an actual complex formation. What I really like about this tool is that it also gives its translocation which is kind of critically important. If you paid attention right here, if there is an N that means that this CDC2 cycle if we wanted to actually accomplish that is in the nucleus but when it gets translocated into the membrane and it adds itself to the Patched receptor. And I'll come back to that in a minute. So, this is the kind of information that you can actually extract from the PID. You can also again put in a batch, I can't really see my [inaudible] but I think this is a batch.
So you can actually put in a list of genes and you can potentially rather than looking at one or two genes you can actually look at a list of genes that you know you identify by whatever means you have. And now you want to actually look at them, see where are those genes located? And here's what happens. It gives you, I had like ten genes and I wanted to see the group 1 and group 2 where would they fall in terms of their pathways. So, there seems to be part of a pathway connection. Anyway that's I think all I want to talk about this-- right now. I want to move back to the presentation you see. So again, if there's any question I'll be happy to address that later. If not, if you have a chance to look at it I think it's really a great place to go and check your pathway analysis. So, sorry guys this was supposed to be-- okay so since we are in the business of pathway right now I thought it would be a great idea to go ahead and talk about PathOlogist. And the reason why I wanted to bring in the PathOlogist mix is as we're all pretty much aware is that unlike other disease such as cystic fibrosis and MS, cancer is a very complicated disease and it requires a really, really multiple network in genes involvement. So, for that reason we thought there is a paradigm shift in terms of actually the scientific community looking more into the pathway analysis rather than the single gene effect. So, what I want to do is then go ahead and discuss it. Again this paper is already published, so if I can't cover some of the things that you need to know, you can always go and check that out if you want to please. What is PathOlogist? It is an alternative to single gene-to-phenotype association and it is actually-- it is done by the molecular network interactions. And I'll get into a little bit of detail for that in a minute. PathOlogist transforms large set of data. 
It transforms large set of data from gene expression, RNA seq and it would actually change into quantitative identifier or descriptor which will eventually be superimposed into pathways. And also PathOlogist does this-- carry out this analysis by looking at two very important matrix. One is the activity score and the other one is the consistency score. Now for the entire 500 pathways that I just mentioned in the PID, so it's a very intense tool and again I don't really intend to go over this you know, expect people to really understand this in the next couple of minutes, but I will give my best shot to see if it really works or not. So, here is what's known as a logical potential pathway, pathway network that you can actually, the probability of gene A and gene B. If you have gene A and gene B, you expect gene C to be present. Now, the blue color basically shows the promoters. They are the promoters so if you have gene A and gene B and, of course, this node of interaction will be active and you will end up with the gene C. That is one of the probability or one of the way that the nodes at the interaction level is being calculated. Same thing is here, that's another way of calculating the interaction node is that if you have gene A and gene B, but in this case gene B is short and is red meaning this is an inhibitor or a negative regulator of this process. So, if gene A is present and gene B is present than obviously you don't expect gene C to be present, that's just the game, the type of probability that you are calculating for each interaction one more. And this can go on and on for many, many different iterations. You have gene A and gene B. Again gene B is an inhibitor. If gene B is absent you will definitely see that gene C will be present. So, what you can do is then based on that criteria and, of course, there's a lot more involved in that; I just gave you a snapshot of what it looks like. 
You can actually superimpose this red and green that represents basically the expression analysis that was taken away from either an array or from RNAC. So a green means that gene is expressed and red means that gene is not expressed. So, if you will look at this again gene AB is expressed. Obviously you'll expect gene C to be expressed, because that's what it's supposed to do. And you have also here in thsi case gene A is present. Gene B is absent. You expect gene C to be present because it's an inhibitor and when you don't have this present gene C is showing up. So, this is how it goes on. The key here is that you actually do this for all of the interactions and eventually for all the pathways and this is what you're then finding out. So, basically using this structure above for the network you actually score. You score all these, basically the nodes and then eventually the pathways for both activity and consistency. A consistency is, for example, here gene A is absent. Gene B is present and this is absent. And that makes sense because again this is an inhibitor if you have this present. This will be absent. This is a different scenario. You have this one now red and this one being green. That is inconsistent. So, there's a lot of again various type of calculations that's going on and what happens is at the end of the day you come up with two scores: activity score and consistency score for each sample for each pathway. So, here is what it may look like. So you calculate the activity and the consistency score for all the pathways for each sample and then compare. Now this is the key about the PathOlogist. You actually compare the scores between normal or tumor, treated or untreated, relapsed or non-relapsed. So you can now look at the changes that took place between normal and cancer. Let's say that there was one pathway that was highly active and consistent in cancer and it was not in the case of the normal or vice versa. 
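[Editor's illustration, not part of the talk.] The node logic the speaker walks through (promoters must be present, inhibitors must be absent, then compare the expected output with what was measured) can be sketched in a few lines of Python. This is a deliberately simplified Boolean version: the speaker describes probabilities, and the real PathOlogist scoring is not reproduced here. The names `Interaction`, `expected_active`, `is_consistent` and `pathway_scores` are hypothetical, and defining the two pathway scores as plain fractions is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    """One pathway node: input genes with their roles, plus the output gene."""
    inputs: dict   # gene name -> "promoter" or "inhibitor" (hypothetical encoding)
    output: str


def expected_active(node, expressed):
    """Boolean simplification: the node fires only if every promoter is
    expressed and no inhibitor is expressed."""
    for gene, role in node.inputs.items():
        on = expressed[gene]
        if role == "promoter" and not on:
            return False
        if role == "inhibitor" and on:
            return False
    return True


def is_consistent(node, expressed):
    """A node is consistent when the measured output matches the expectation."""
    return expected_active(node, expressed) == expressed[node.output]


def pathway_scores(nodes, expressed):
    """Assumed scoring: fraction of active nodes and fraction of consistent nodes."""
    n = len(nodes)
    activity = sum(expected_active(x, expressed) for x in nodes) / n
    consistency = sum(is_consistent(x, expressed) for x in nodes) / n
    return activity, consistency


# Gene A promotes and gene B inhibits the production of gene C.
node = Interaction(inputs={"A": "promoter", "B": "inhibitor"}, output="C")
tumor = {"A": True, "B": False, "C": True}     # A present, B absent, C present
normal = {"A": False, "B": True, "C": False}   # A absent, B present, C absent
odd = {"A": True, "B": True, "C": True}        # inhibitor present, yet C is seen

for label, sample in [("tumor", tumor), ("normal", normal), ("odd", odd)]:
    print(label, pathway_scores([node], sample))
```

Comparing the score pairs between groups (tumor versus adjacent tissue, for instance) is the step the speaker calls the key to the method; here the tumor sample is active and consistent, the normal sample is inactive but consistent, and the third sample is the inconsistent case.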
So, the key here is you are actually identifying a specific pathway that are overrepresented in terms of the activity or consistency when you compare the two normal tumor relapse/ non-relapse. We took that into account. We wanted to actually test this hypothesis. We wanted to take it to the lab and what we did was we took expression array of 48 samples. In that 48 samples we had 24 hepatocellular carcinoma samples and 24 tumor adjacent samples. We did expression array on them and did the same thing that we have done in the past where we calculated activity and consistency scores for all of these and then simply trying to find out if there was any difference between normal tumor in terms of pathway activation. Our interests, we found that one of the pathways and let me just quickly tell you this is the activity score right here. This is just basically explaining the samples and the blue is the tumor adjacent and the red is the tumor sample. So, as you can see basically about 83 percent of all tumor seems to have very activity score for a specific pathway called Sonic Hedgehog pathway. Now Sonic Hedgehog, that's a very funny name. So, the Sonic Hedgehog pathway the question was what is the significance of that? We'll get into that in a minute but this was kind of interesting. We thought this pathway must have something to do with the-- at least development if not progression of these tumors. So, what we wanted to do was to make sure this was an artifact or a fluke from an algorithm we actually took another experimental approach where we used the siRNA knockdown, transfected the SNU. It's a hepatocellular carcinoma cell like SNU-449 and we transfected you with three type of siRNAs. One is a positive regulator of cell cycle. It's got a positive control. The second one is SHH siRNA of interest because we have the Sonic Hedgehog and then in a Scramble siRNA, which is a negative control we wanted to see if this was specific targeted or not. 
So, we took this approach and we transfected the cell and harvest the cell in 24, 48, 72, 96 hours and we did two assays. One was simply looking at the RNA to make sure that siRNA is really knocking down the RNA here with the RASGRF1 RNA. Day one, day two, day three you see about 80 percent of the RNA is now knocked down and, of course, after day four because it still gets diluted you get back that expression. A similar thing happened with the SHH. So, we know that the RNA was actually down regulated or destroyed or inhibited by the siRNA. Now the more important question was what is the function of this? What is it doing? You're looking at this white panel the top one is again for the positive regulator of cell cycle. So this actually is enhanced its cell proliferation. It's imported cell proliferation. When you looked at these two top ones these are the scrambled negative control and also a mock transfection. So, we don't expect any affect at all and you can see they are growing nicely. Here with the RASGRF1 both siRNAs are corresponding to the RASGRF1, showed some degree of reduction in cell proliferation. But, of course, this is expected. Our intention was to find out what happened with the change. And as you can see the negative controls looked really beautiful here but the SHH RNA, both of them actually, the SHH are corresponding to siRNA. Both of them seem to be retarding the cell proliferation by about 40 to 60 percent. So, this was quite nice and then, of course, the question as how is it that this was involved? How is it that the SHH is involved? Now this is a very busy slide but I actually will get your attention right up here. You can ignore this part. So, the Sonic Hedgehog pathway has other players, one of them being patch 1 and then another one is the CDC2 and Cyclin B1. Cyclin B1 and CDC2 actually combine and form a complex called mitosis promoting factor. 
So, what happens is that this complex goes around and doesn't stay and then finally it comes in here and it actually gets sequestered, not a very popular word in Washington, D.C. at this point because of sequestration, but it literally gets sequestered within patch 1 and it stays there as long as there is no SHH available. SHH is Sonic Hedgehog. As soon as the Sonic Hedgehog shows up the Sonic Hedgehog binds to the patch 1 or PCC 1 receptor. It degrades this receptor and allows this now to go into nucleus and start the process of cell proliferation. So here is what we saw in our pathway analysis. We found out that Sonic Hedgehog was very highly active. Number one if you have a lot of activities in Sonic Hedgehog, then that means the process will be continuously going. So, that is exactly what happened that you get a lot of cell proliferation. And if you basically destroy this Sonic Hedgehog by siRNA, which we did, what you are doing is that you're causing this sequestration at all time and make you this complex the sequester and cell arrests in here and doesn't go into the next phase. So, that again simply proved the concept of the point that you know that we're making Sonic Hedgehog is actually picked up by the PathOlogist and it is an important player in terms of the cell proliferation. So, I think as far as the pathway knowledge is concerned, I think that's enough. I wanted to-- we're not going to talk about the Pathway of Distinction Analysis, PoDA, because of the interest of time, but I'll go ahead and jump onto the next subject which is our next tool which is the Bambino. The Bambino was initially developed here again by Michael Edmonson and it is-- so I guess what is Bambino? It is a-- it is a tool that has two very important features. One feature is the assembly viewer and assembly viewer has the ability to align BAM reads after your sequencing, next generation sequencing, get the BAM files and then you can align these using this assembly viewer. 
You can actually have database annotation using this assembly viewer. Also you have a really nice summary for all of these databases that is displayed by this assembly viewer. And last but not least is the display of the filter tools that allows you to make changing in terms of the filtering of your data. But, what we are most interested in is we are using this more often is for the variant detector. Now Bambino is a variant detector and it simply finds SNVs, small insertions and deletions and this works both interactive with assembly and a command line for the variant detector. So, let me just give you a quick first look of what does the assembly viewer look like. Here is what actually it does. It gives you a nice view of the coverage of your next generation sequencing. If you have a BAM file it actually did a histogram to tell you how much coverage is. It also right here is what you're focusing and zooming in in here. It gives you a great deal of information including whether these are exons whether are entrons or the variant frequency which I'll come back in a minute and also this gives you the reference sequence that shows in yellow and the tumor that's shown in red. The normal is shown in blue. So, among other again features, this is one of the things that I was mentioning before, that the summary of the allele frequency, the alternative allele frequency, here it is showing that it is red and it's a really nice coverage of this specific allele. And what this basically tells us that in the tumor you have almost 100 percent of this alternative allele. And when you look at this in the normal you have-- I don't know if you guys can see this basically what you see is a change of an A to a C and it shows in red and it has to go on as representing the actual frequency of it. And if you look at the normal sample on the contrary you have some-- about maybe 50 percent of the samples shows the presence of that alternative allele. 
What then this allows us to speculate that this could potentially be a loss of heterozygocity since you have 50 percent here and now you totally lost it and in some ways it gives you that ability to make that assumption. Again, also the Bambino has the ability to show you the orientation of which direction the actual transcript or the actual sequence is going. It has a couple more features that I'm actually going to show here. Let's say you were looking at the variant detector. It tells you what to reference, sequence that you're looking at the NM number as well what exon, so it's really, really focused on specific and again for kind of information you're looking for. So this one is exon 5, for example, for this specific reference TP53 and you can at the-- you can actually look at the dbSNP site. In other words if you have identified some variant it can simply tell you whether it is found in dbSNP or not and if it is found whether it is important or not. Because I know we have learned now that some of the stuff in dbSNP is not always you know-- the way they are reported, they could be important. So they are shown by little dollar signs and in addition to that we also have the protein translation and another important feature of this assembly viewer is actually the quality of the read. And that become important for those of you who are working with variant detectors. The darker the reads are the less quality they are. The lighter, the brighter or the whiter the reads are the better quality the reads are. So, if you have let's say something that is called the variant and it's located in a position where it's very dark gray you may want to think twice about it, because it may not be a real variant. And that's a good feature to have. In addition to what I have shown you so far, all of the ones that I showed you was including the exons here is what it will look like when you have a intronic or genomic sequence. 
It shows the exon, and it shows you the actual skipping of the sequence, which gives you a gap, and that is an intronic sequence. It shows whether the intronic sequence runs in the 5 prime to 3 prime direction, so it gives you the direction as well as the presence of the intron. It also has the ability to identify specific small insertions. Here is the reference sequence, the consensus, and you've got four bases that have actually been added. So, when you look at the sequence in the tumor, you find out that there are four additional A's here, two in here, three in here, so at least in some cases there are four bases that have been inserted, and that can again be displayed as a gap using this assembly viewer. So, so much for the assembly viewer. As for the variant detector, which we use more often than the assembly viewer, it is this one. One of the best parts of Bambino, at least in our hands, is that it is highly configurable. And I think that's important for people who are using tools to actually do their variant calls. It is not like a black box, and I'm not saying anything else is, but because we know how configurable this is, it's easy to maneuver. It also provides a lot of sophisticated filtering and allows you to make changes in terms of the noise or false positive filtering. You can do the variant calling with single or pooled BAM data, and that's important because you can run, for example, a pooled tumor and normal, which will allow you to look at the germline and the somatic variants and determine the possibility of loss of heterozygosity, or LOH. And, as I mentioned before, it has the ability to identify dbSNP variants, so that way you can again find out whether this SNP is worth pursuing or not.
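The tumor/normal comparison described in the talk (about 50 percent alternative allele in the normal, almost 100 percent in the tumor) can be illustrated with a small Python sketch. This is not Bambino's algorithm: the function names and every cutoff below are hypothetical values chosen only to make the reasoning concrete.

```python
def alt_allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads supporting the alternative allele (0.0 if no coverage)."""
    return alt_reads / total_reads if total_reads else 0.0


def classify_site(normal_af: float, tumor_af: float,
                  het_low: float = 0.3, het_high: float = 0.7,
                  hom_cutoff: float = 0.9, absent_cutoff: float = 0.05) -> str:
    """Label a site from paired normal/tumor alternative-allele frequencies.

    All cutoffs are illustrative assumptions, not Bambino defaults.
    """
    normal_is_het = het_low <= normal_af <= het_high
    # Heterozygous in normal, but one allele (nearly) gone in tumor.
    if normal_is_het and (tumor_af >= hom_cutoff or tumor_af <= absent_cutoff):
        return "candidate LOH"
    # Absent in normal, present in tumor.
    if normal_af <= absent_cutoff and tumor_af > absent_cutoff:
        return "candidate somatic"
    # Present in normal: inherited variant.
    if normal_af > absent_cutoff:
        return "germline"
    return "no variant"


if __name__ == "__main__":
    normal = alt_allele_frequency(alt_reads=12, total_reads=24)
    tumor = alt_allele_frequency(alt_reads=30, total_reads=30)
    print(classify_site(normal, tumor))
```

A real caller would also weigh coverage, base quality and tumor purity before making such a call; the sketch only captures the frequency comparison.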
In addition to that, while you're actually calling the variant, it allows you to look at the frequency of that specific variant in the tumor and in the normal. And oops-- and when you want to do it that way [inaudible], then we use the ANNOVAR variant annotation. It's a fantastic program and we like it a lot. We use it for the subsequent filtering and post-processing, and then eventually to identify whether something is really non-synonymous, whether it is a frame shift, and whether it is really worth pursuing for validation or not. I'm not going to go into the detail of this, but this is just one example of the settings that control the detection process. You can pick out specific filters, such as a minimum nucleotide quality of 10, a minimum mapping quality of 1, a minimum coverage of 4 and a minimum frequency of the alternative allele of 0.05. These are some of the ones that we have as defaults, and you can change and manipulate them as you please, and that's what makes it really, really great. Okay. So, as I'm sure most of you who work with variant detectors know, one of the most critical aspects of a variant detector is whether it is accurate or not, the actual rate of accuracy. So, we wanted to check Bambino's accuracy, and we have done this with our own samples. We had three liver cancer samples; we identified 55 variants and validated 50, so about 90.9 percent. And this was a very heterogeneous sample. But when we looked at the TCGA validated variants found in next generation sequencing, there was a total of 440 samples and we looked at 1,739 variants; out of those, 1,704 were detected, so we have an accuracy of 97.9 percent. And when we did the same thing with the TCGA variants that were identified by SNP6, with the seven samples and that many SNPs, we had a 99.3 percent accuracy. So, we feel that the accuracy of base calling is really, really fantastic with Bambino.
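The four thresholds quoted above (minimum nucleotide quality 10, minimum mapping quality 1, minimum coverage 4, minimum alternative allele frequency 0.05) can be expressed as a simple filter. The Python sketch below is only an illustration of how such thresholds combine: the `Candidate` record and its field names are hypothetical and do not reflect Bambino's internal data model or command-line options.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical per-site summary; field names are illustrative only."""
    base_quality: int      # quality of the variant-supporting base call
    mapping_quality: int   # mapping quality of the supporting read
    coverage: int          # total reads covering the site
    alt_reads: int         # reads supporting the alternative allele


# Thresholds quoted in the talk as the team's defaults.
MIN_BASE_QUALITY = 10
MIN_MAPPING_QUALITY = 1
MIN_COVERAGE = 4
MIN_ALT_FREQUENCY = 0.05


def passes_filters(c: Candidate) -> bool:
    """Return True when a candidate meets all four thresholds."""
    if c.coverage < MIN_COVERAGE:
        return False  # also guarantees coverage > 0 for the division below
    alt_frequency = c.alt_reads / c.coverage
    return (c.base_quality >= MIN_BASE_QUALITY
            and c.mapping_quality >= MIN_MAPPING_QUALITY
            and alt_frequency >= MIN_ALT_FREQUENCY)


if __name__ == "__main__":
    candidates = [
        Candidate(base_quality=30, mapping_quality=60, coverage=40, alt_reads=10),
        Candidate(base_quality=5, mapping_quality=60, coverage=40, alt_reads=10),
        Candidate(base_quality=30, mapping_quality=60, coverage=3, alt_reads=3),
    ]
    kept = [c for c in candidates if passes_filters(c)]
    print(f"{len(kept)} of {len(candidates)} candidates kept")
```

Keeping the thresholds as named constants mirrors the point made in the talk: the values are defaults that can be changed per project.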
And you don't have to take my word for it. There was a paper published in PLOS One by a group that we did not know at the time, and they compared a number of these tools, and these were the criteria for their comparison. They had seven aligners, which are the tools you use to generate your BAM files: Bowtie, Smalt, Stampy, Ssaha, Novoalign, Bwa and Bfast. They also looked into SNP callers, and that included Samtools, GATK, Bambino and Freebayes. So, of course, GATK is from the Broad. It's one of the most well recognized, one of the best tools out there. People use it on a regular basis. In fact, TCGA uses it on a regular basis. So, when they compared these callers using these aligners, they had some very interesting findings, and I'd like to share them. They looked at the 28 different combinations of the 7 aligners and 4 variant callers. They found that 2 aligners, Novoalign and Stampy, were the best, with the two best variant callers being GATK and Bambino. So, that really made us feel great, because GATK is one of the best tools, and when they compared it with Bambino, we were like, okay, cool. GATK also has the highest quality variants when you look at the transitions and transversions, and that was also true of Bambino. So, it seemed to have worked in terms of the quality of the calls, and regardless of which aligner they were using, Bambino and GATK came out head to head. And this is basically a picture from that work, and you're looking at the different aligners right here. If you look at GATK, the rediscovery rate is almost 100 percent. It doesn't matter which aligner you're using. And a very similar thing happened with Bambino. Regardless of which aligner you use, the variant-call rediscovery was really, really close to GATK, close to 100 percent.
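The transition/transversion comparison mentioned here rests on a simple count. As a minimal Python sketch (the function names are made up for this example, and it is not taken from GATK, Bambino or the PLOS One study), a transition is a purine-to-purine or pyrimidine-to-pyrimidine change, and every other single-base substitution is a transversion:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snvs):
    """Return transitions / transversions for an iterable of (ref, alt) pairs.

    Returns None when there are no transversions, since the ratio is undefined.
    """
    transitions = 0
    transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            continue  # not a substitution
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions if transversions else None


if __name__ == "__main__":
    # Three transitions (A>G, C>T, G>A) and one transversion (A>C).
    calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
    print(ti_tv_ratio(calls))
```

The A-to-C change shown earlier in the assembly viewer would count as a transversion under this definition.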
Samtools and Freebayes seemed to have a little bit of an issue in terms of using different aligners. And I think it is also worth mentioning here the transition and transversion ratio for all the tools we're showing here. It's important to know that the transition and transversion ratios are numbers that usually fall between 2.5 and 2.8, and GATK continues to always be in the neighborhood of 2.73738677 and so on. And ours were in the 2.558928876 and 5. So we are again closest to GATK compared to the other tools present in this study. So, the take-home message was basically that Bambino is a pretty decent tool in terms of making the variant call, and again we use it, we like to use it, and we've been using it because we have the ability to make changes as we go along with our project. So, now let me quickly discuss the Cancer Genome Workbench. The Cancer Genome Workbench is actually a visualization tool. It is based on the UCSC code, but we have our own-- I'm going to skip right here. We have our own custom tools and tracks, and you can visualize the sample of interest in the CGWB by SNP id, gene name, term or genomic location. I don't want to actually show this because it can be a little bit sneaky because of the number of samples that are there, so just to tell you how many samples we have here-- I'm going to skip this part, here it is. So, we have about 100 terabytes of samples or more, and there are about 14,000 BAM files that can be accessed or looked at but cannot be disseminated or shared with anyone else. And, of course, the data come in two flavors. We have the publicly available TCGA data, which is available to the public, and you can look at it and actually visualize it.
We also have the specific TARGET data, which are only for the people who have credentials for the TARGET data, so that they can go ahead and look at the data and do the analysis, I mean do their visualization. Among the things it can do, I picked two that we are actually pursuing here. One is the visualization of RNA-Seq, and this is just some liver cancer samples, four of them. I hid the ids, but basically, if you're visualizing this on the CGWB, you're looking at the blue being the tumor. I'm sorry, the blue is the normal and the red is the tumor. Here you see that there is expression of the gene in both normal and tumor. In both samples you have normal and tumor being expressed. So, when you're looking at this specific gene, which is the telomerase [inaudible], this is the only gene that is found in the tumor. So, that gives you a sort of expression profile of what this gene looks like and the coverage, and again similar [inaudible] is right here, where you see only normal expression data. So, this is the tool where you can actually dump your sequence of interest and simply visualize it. And in some cases you can also see big gaps of amplification or deletion. We also have visualization ability for fusion proteins or translocations. What I'm showing you here is this little gray bar that you can see with a black line going through it. It's simply the LOH, and right there the blue is the lesion. And since this is a translocation, we also have the purple, which is the right junction of the translocation, and the left junction, which is shown here in green. And again, I couldn't share the actual sample here with you guys because this is part of AML data that we don't want to share.
You can actually sort those inversions, I'm sorry, those fusion proteins, according to some of the phenotypes, you know, whether it's the gender, race, ethnicity, perhaps the type of cytogenetic criteria, or whatever phenotype or clinical attribute is of interest. You can sort by that. I'll be going a little fast so we might be okay with time. Okay. >> About quarter of. >> Dr. Daoud Meerzaman: Okay. So, last but not least is the newest member of our genomic tools; it's called OmicCircos. But before I get into our tool, I wanted to mention that there is really no shortage of Circos tools, as you can see. It has made it to the covers of most of the very popular scientific journals and has even made it to the New York Times, so you know that it's something that's obviously important. One deficiency of these tools is that they're all done in Perl. What we have done here is to see if we can actually generate the Circos plot using an R package. We looked at the published top mat as well as Bioconductor. We found two potential candidates that have already done this. One is right here, which shows a linear view and is really great; it's called Gviz. The other one is a fantastic tool right here known as Ggbio. It has some gene mutation and rearrangement features, but unfortunately there are some limitations to it. So, what we wanted to do was to see if we could add some improvements to what already existed. So we called this OmicCircos, and basically it is an R package.
It generates high-quality plots for genomic variation data and, of course, just like every other Circos tool, you can put in chromosome position-based mutation, copy number variation, gene expression and methylation, and you can also determine the relationship between genomic features with different types of plots, whether it's a box plot, scatter plot or heat map. This has already been submitted to Bioconductor and is under review, and we got an email saying that it will be accepted with minor revisions. So, why do we think OmicCircos is worth spending the time to get it published? One of the most important features of this is that it is a finger command. In Perl, as I mentioned, let's say you have a Circos plot and you want to change one of the circles inside or somewhere in the middle of it: you have to start from the beginning. This R package, OmicCircos, gives you the flexibility that if you want to change one track into gene expression, you can leave everything else alone and change that specific track independently, or you can go inside and change the fusion proteins, and so on and so forth. So, that is of critical importance in terms of making modifications that affect only one track and not the others. And I think this is also of critical importance: again, I could be wrong, but I don't know if this has ever been done before, where you can carry out integrated statistical analysis using the Circos plot and display it at the same time. That is what I think is really of critical importance. And I know that some of the other Circos tools also have this. We are trying to make it a little bit more complete in terms of visualizing multiple samples simultaneously. So, let me just go ahead and show some of the features of this.
Some of the features, for example for statistical analysis, could be the 90 percent quantile, median and variance, mean and scaled value, heat map, or median and standard deviation, as well as the mean and confidence intervals, 95 percent confidence intervals. You can display the statistical analyses I mentioned either as a box plot, a histogram, a scatter plot or a dot plot, or whatever is your favorite way of showing the analysis. You can also have text annotation. You can have the annotation either outside or inside, and that's again an important feature, because if you find that another gene is important you can go back and change the gene without starting all over. So it has the ability to annotate as you please. We were part of the TCGA breast cancer consortium analysis group; the paper was published in Nature and we are co-authors. So, we took some of the data from that published paper and we tried to see if we could display it using our new OmicCircos tool, to see if it helps us visualize some of this. And here is some information about the actual study, but here is where I wanted to spend a little bit of time. So, the Circos plot that I'm going to show you on the next slide has the basal type of breast cancer. We have included 15 samples here. We have included gene expression from Affymetrix, using the U133plus2, copy number data from Affymetrix SNP6 and the correlation between these two, as well as the RNA-Seq that was used to determine, or actually generate in-house, some fusion proteins. So, here is what it looks like. Now some of you obviously won't be able to see this, but take my word for it, please. And what you are seeing here, if I can put up my little arrows, is that the very outer layer is chromosomes 1 to 22.
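The summary statistics listed at the start of this passage (90 percent quantile, median, variance, mean and a 95 percent confidence interval) are standard quantities. The Python sketch below computes them for one set of per-sample values using only the standard library. It is not OmicCircos code, which is an R package; the `summarize` function is a hypothetical helper, and the confidence interval uses a normal approximation (mean ± 1.96 standard errors), which is an assumption of this example.

```python
import math
import statistics


def summarize(values):
    """Summary statistics for one genomic position across samples.

    The 95% interval uses a normal approximation, an assumption made
    for this sketch; small sample sizes would call for a t-based interval.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # sample standard deviation
    half_width = 1.96 * sd / math.sqrt(n)  # 1.96 standard errors
    return {
        "mean": mean,
        "median": statistics.median(values),
        "variance": statistics.variance(values),
        # quantiles(n=10) returns the nine deciles; the last is the 90% quantile.
        "q90": statistics.quantiles(values, n=10)[-1],
        "ci95": (mean - half_width, mean + half_width),
    }


if __name__ == "__main__":
    # Made-up expression values for 15 samples at one gene.
    expression = [5.1, 4.8, 6.2, 5.5, 5.9, 4.4, 6.8, 5.0,
                  5.3, 6.1, 4.9, 5.7, 6.4, 5.2, 5.6]
    for name, value in summarize(expression).items():
        print(name, value)
```

Each of these values would then be drawn as one element of a track, for example a box plot or a point with an interval, at the corresponding chromosome position.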
And then just inside of that are the actual cytobands, which allow you to identify the centromere here and so locate the p-arm and the q-arm of the chromosome. And right inside of that is the gene. So, you have the chromosome, the cytobands and the gene, and then the 15-sample heat map expression profile is located right there. And here is the copy number variation: amplification is in red and deletion is in blue. And what you're seeing here is the correlation between the copy number and the expression and, of course, this is the p-value. My animation wasn't working, but here are the fusion proteins that we identified in-house. This wasn't published in the paper, and since we didn't validate them, we don't know if they're real, but clearly they seem to be meeting the criteria for fusion proteins. But what is important here is that if you are interested in a specific region in this breast cancer, for example, what you could do is zoom in specifically on that region. Let's assume that we wanted to look at chromosome 11 and chromosome 17 and see if there is anything interesting going on. So, what you're seeing in this other half is the zoomed version of chromosome 11 and chromosome 17. So now you can see some focal amplification and deletion and things like that. So, when you are looking at chromosome 11 and chromosome 17, one thing that really jumped out at us was cyclin D1: it's really, really highly correlated and there is a definite focal amplification there. And, of course, cyclin D1 is involved-- again, I don't want to say anything wrong, Barbara is here. She is a breast cancer specialist [laughter], but I know this has something to do with tamoxifen treatment resistance.
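The correlation track described here relates copy number to expression for each gene across the samples. A minimal Python sketch of one such per-gene calculation follows, using the Pearson correlation coefficient. Whether the analysis in the talk used Pearson or another measure is not stated, so that choice, the function name and the example numbers are all assumptions of this illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    # Made-up values for one gene across six samples.
    copy_number = [2.0, 2.1, 3.5, 4.0, 1.9, 5.2]
    expression = [5.0, 5.3, 7.1, 7.9, 4.8, 9.4]
    print(round(pearson(copy_number, expression), 3))
```

A gene with a focal amplification that also drives expression, as described for cyclin D1, would show a coefficient close to 1 in such a calculation.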
But you also have p53 as well as ERBB2, and, of course, everyone knows that ERBB2 is a very, very critical breast cancer biomarker, and CDC6, which has a very important role in terms of survival, in both its expression and copy number. Taking that one step further, we looked at the six-- we looked at the four different subtypes. We looked at the basal, Her2, LuminalA and LuminalB. These are the four subtypes of the-- she gave me a five minute sign so I'm panicking now [laughter]. Gotcha, no, no, no, that's all perfect. So, what we did was again look at the copy number and expression, and what you are seeing here is simply the four subtypes of breast cancer. I'm going to speed up a little bit more than I have. What you're seeing here is a common amplification among all four subtypes. From the outside you have the basal, Her2, LuminalA and LuminalB, and this is a PCA for the expression, which categorizes them nicely. But here is what I'm interested in. You see amplification of that region, specifically in the four subtypes, and you also see a very specific 5q deletion in the basal and, of course, right here is 7p11, which is the location of the EGFR amplification, and that is again very specific. It's very hard to see, but if you really look at this there's a really focal amplification of EGFR in the basal. And you also have amplification of the q arm, which is c-MYC. It's amplified all over. And this is the PTEN deletion. PTEN is a tumor suppressor gene. And you also have two very important ones, ERBB2 and cyclin E1, both very important in terms of survival for breast cancer patients. And again you can identify this very easily, all in one shot, with this sort of bird's eye view of all of that data.
And what I will do is skip all of this, because I think this is the same exact thing. This is some of the future work that we're doing, and I'm just going to very quickly summarize it with the last slide and say that our team has expertise mostly in DNA-Seq, Exome-Seq and much of RNA sequencing, you know, the mRNA-Seq and the miRNA-Seq, and what we do is the analysis. We have done many, many different types of alignment, and eventually we found out that Novoalign was the best way to do this. We also know that our Bambino seems to be working well, and we use Bambino to make the SNP calls for both RNA and DNA and then use ANNOVAR to annotate the calls. But for the RNA, in order to look at the fusion proteins, Cufflinks or TopHat are the actual software that we use to try to identify the fusion proteins and the alternative splicing. And, of course, the goal here is always to do integrative analysis, whether it's messenger RNA, miRNA expression, copy number, mutations, splicing or fusion. So, to find out exactly whether they are playing together and what the significance is of correlating those different features. Of course, once we have that we can also look at the impact on the proteins. Okay, the CAVART and SIFT, LogE is one of the two reaction generated, as well as PolyPhen. PolyPhen is not ours. And at last, when we've done all of the analysis, we can use, you know, PathOlogist. We can also visualize it in CGWB, or we have another tool that I didn't go over, which is the Gene Fusion viewer, or we can show it with OmicCircos. And with that, I hope I didn't go too fast [laughter] for you guys. I'll be happy to take any questions. Thank you very much.

Notes

  1. ^ "About | Biomedical Computation Review". biomedicalcomputationreview.org. Retrieved 2018-11-05.
  2. ^ "Biomedical computation review. - NLM Catalog - NCBI". www.ncbi.nlm.nih.gov. Retrieved 2018-11-05.

This page was last edited on 4 August 2023, at 13:20
Basis of this page is in Wikipedia. Text is available under the CC BY-SA 3.0 Unported License. Non-text media are available under their specified licenses. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc. WIKI 2 is an independent company and has no affiliation with Wikimedia Foundation.