From Wikipedia, the free encyclopedia

GenBank

Content
    Description: Nucleotide sequences for more than 300,000 organisms with supporting bibliographic and biological annotation.
    Data types captured: Nucleotide sequence; Protein sequence
    Organisms: All
Contact
    Research center: NCBI
    Primary citation: PMID 21071399
    Release date: 1982
Access
    Data format:
    Website: NCBI
    Download URL: NCBI FTP
    Web service URL:
Tools
    Web: BLAST
    Standalone: BLAST
Miscellaneous
    License: Unclear[1]
The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced and maintained by the National Center for Biotechnology Information (NCBI; a part of the National Institutes of Health in the United States) as part of the International Nucleotide Sequence Database Collaboration (INSDC).

GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 500,000 formally described species.[2] The database was started in 1982 by Walter Goad at Los Alamos National Laboratory. GenBank has become an important database for research in the biological sciences and has grown in recent years at an exponential rate, doubling roughly every 18 months.[3][4]

Release 250.0, published in June 2022, contained over 17 trillion nucleotide bases in more than 2.45 billion sequences.[5] GenBank is built from direct submissions by individual laboratories as well as bulk submissions from large-scale sequencing centers.


Submissions

Only original sequences can be submitted to GenBank. Direct submissions are made to GenBank using BankIt, a Web-based form, or the stand-alone submission program Sequin. Upon receipt of a sequence submission, the GenBank staff examines the originality of the data, assigns an accession number to the sequence, and performs quality assurance checks. The submission is then released to the public database, where the entries are retrievable with Entrez or downloadable by FTP. Bulk submissions of Expressed Sequence Tag (EST), Sequence-tagged site (STS), Genome Survey Sequence (GSS), and High-Throughput Genome Sequence (HTGS) data are most often made by large-scale sequencing centers. The GenBank direct submissions group also processes complete microbial genome sequences.[6][7]
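
To illustrate what "retrievable by Entrez" means in practice, the following minimal Python sketch pulls one released record in GenBank flat-file format through the NCBI E-utilities efetch endpoint. The accession number and the helper function name are placeholders chosen for illustration, not details drawn from the article.

    # Minimal sketch: retrieve a released GenBank record as a flat file
    # via the NCBI E-utilities "efetch" endpoint.
    from urllib.parse import urlencode
    from urllib.request import urlopen

    EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

    def fetch_genbank_record(accession: str) -> str:
        """Download one GenBank flat-file record as plain text."""
        params = urlencode({
            "db": "nucleotide",  # GenBank nucleotide (nuccore) database
            "id": accession,     # accession number assigned at submission
            "rettype": "gb",     # GenBank flat-file format
            "retmode": "text",
        })
        with urlopen(f"{EFETCH}?{params}") as response:
            return response.read().decode()

    if __name__ == "__main__":
        # Placeholder accession; any released GenBank entry would work here.
        print(fetch_genbank_record("U49845")[:500])

The same record can also be fetched by pasting the equivalent URL into a browser, which is often the quickest way to inspect a single entry.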

History

Walter Goad of the Theoretical Biology and Biophysics Group at Los Alamos National Laboratory (LANL) and others established the Los Alamos Sequence Database in 1979, which culminated in 1982 with the creation of the public GenBank.[8] Funding was provided by the National Institutes of Health, the National Science Foundation, the Department of Energy, and the Department of Defense. LANL collaborated on GenBank with the firm Bolt, Beranek and Newman, and by the end of 1983 the database held more than 2,000 sequences.

In the mid-1980s, the bioinformatics company IntelliGenetics at Stanford University managed the GenBank project in collaboration with LANL.[9] As one of the earliest bioinformatics community projects on the Internet, the GenBank project started the BIOSCI/Bionet newsgroups to promote open communication among bioscientists. From 1989 to 1992, the GenBank project transitioned to the newly created National Center for Biotechnology Information (NCBI).[10]

GenBank and EMBL: Nucleotide Sequences 1986/1987, Volumes I to VII
CD-ROM of GenBank v100

Growth

Growth in GenBank base pairs, 1982 to 2018, on a semi-log scale

The GenBank release notes for release 250.0 (June 2022) state that "from 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months".[5][11] As of 15 June 2022, GenBank release 250.0 comprises over 239 million loci and 1.39 trillion nucleotide bases from 239 million reported sequences.[5]
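
As a back-of-the-envelope illustration of what an 18-month doubling time implies (the projections below are illustrative arithmetic, not figures from the release notes), growth of this kind can be modelled as bases(t) = bases_0 * 2^(months / 18):

    # Rough sketch of exponential growth with an 18-month doubling time.
    # The starting count is the release 250.0 figure quoted above; the
    # projected values are illustrative only.

    def projected_bases(bases_now: float, months_ahead: float,
                        doubling_months: float = 18.0) -> float:
        """Project a base count forward assuming a fixed doubling time."""
        return bases_now * 2 ** (months_ahead / doubling_months)

    if __name__ == "__main__":
        bases_2022 = 1.39e12  # traditional-division bases, release 250.0
        for years in (1, 3, 5):
            print(f"+{years} years: {projected_bases(bases_2022, 12 * years):.2e} bases")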

The GenBank database also includes additional data sets that are constructed mechanically from the main sequence data collection and are therefore excluded from this count.

Top 20 organisms in GenBank (Release 250)[5]
Organism base pairs
Triticum aestivum 2.15443744183×10^11
SARS-CoV-2 1.65771825746×10^11
Hordeum vulgare subsp. vulgare 1.01344340096×10^11
Mus musculus 3.0614386913×10^10
Homo sapiens 2.7834633853×10^10
Avena sativa 2.1127939362×10^10
Escherichia coli 1.5517830491×10^10
Klebsiella pneumoniae 1.1144687122×10^10
Danio rerio 1.0890148966×10^10
Bos taurus 1.0650671156×10^10
Triticum turgidum subsp. durum 9.981529154×10^9
Zea mays 7.412263902×10^9
Avena insularis 6.924307246×10^9
Secale cereale 6.749247504×10^9
Rattus norvegicus 6.548854408×10^9
Aegilops longissima 5.920483689×10^9
Canis lupus familiaris 5.776499164×10^9
Aegilops sharonensis 5.272476906×10^9
Sus scrofa 5.179074907×10^9
Rhinatrema bivittatum 5.178626132×10^9

Incomplete identifications

Public databases that may be searched using the National Center for Biotechnology Information Basic Local Alignment Search Tool (NCBI BLAST) lack peer-reviewed sequences of type strains and sequences of non-type strains. Commercial databases, on the other hand, potentially contain high-quality filtered sequence data but hold only a limited number of reference sequences.
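
For readers unfamiliar with how such a BLAST search is typically run, here is a minimal sketch using Biopython's qblast helper to query the public nucleotide database remotely; it assumes Biopython is installed, and the query sequence is a short placeholder rather than data from any study cited here.

    # Minimal sketch of a remote nucleotide BLAST search against the public
    # "nt" database using Biopython. The query sequence is a placeholder.
    from Bio.Blast import NCBIWWW, NCBIXML

    query = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"    # placeholder sequence
    result_handle = NCBIWWW.qblast("blastn", "nt", query)  # remote web BLAST

    record = NCBIXML.read(result_handle)  # parse the XML report
    for alignment in record.alignments[:5]:
        hsp = alignment.hsps[0]
        print(alignment.title[:60], hsp.expect)  # hit description and E-value

Remote searches of this kind go through the same NCBI BLAST servers as the Web interface, so the results are subject to the same database limitations discussed above.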

A paper published in the Journal of Clinical Microbiology[12] evaluated 16S rRNA gene sequencing results analyzed with GenBank in conjunction with other freely available, quality-controlled, Web-based public databases, such as the EzTaxon-e[13] and BIBI[14] databases. The results showed that analyses performed using GenBank combined with EzTaxon-e (kappa = 0.79) were more discriminative than those using GenBank (kappa = 0.66) or other databases alone.
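
The kappa values quoted above are chance-corrected agreement statistics; assuming they follow the usual Cohen's kappa definition, kappa = (p_observed - p_expected) / (1 - p_expected), the small sketch below with invented counts shows how such a value is computed.

    # Sketch of Cohen's kappa for a square confusion matrix.
    # The example counts are invented purely for illustration.

    def cohens_kappa(matrix: list[list[int]]) -> float:
        """Agreement beyond chance between two classifications."""
        total = sum(sum(row) for row in matrix)
        p_obs = sum(matrix[i][i] for i in range(len(matrix))) / total
        row_marginals = [sum(row) for row in matrix]
        col_marginals = [sum(col) for col in zip(*matrix)]
        p_exp = sum(r * c for r, c in zip(row_marginals, col_marginals)) / total ** 2
        return (p_obs - p_exp) / (1 - p_exp)

    if __name__ == "__main__":
        # Rows: identifications from one database; columns: reference identifications.
        example = [[45, 5],
                   [10, 40]]
        print(f"kappa = {cohens_kappa(example):.2f}")  # prints kappa = 0.70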

GenBank, being a public database, may contain sequences wrongly assigned to a particular species because the initial identification of the organism was incorrect. An article published in Genome showed that 75% of mitochondrial cytochrome c oxidase subunit I sequences were wrongly assigned to the fish Nemipterus mesoprion, a result of the continued use of sequences from initially misidentified individuals.[15] The authors provide recommendations on how to avoid further distribution of publicly available sequences with incorrect scientific names.

Numerous published studies have identified erroneous sequences in GenBank.[16][17][18] These include not only incorrect species assignments (which can have various causes) but also chimeras and accession records with sequencing errors. A study of the quality of all cytochrome b records of birds further showed that 45% of the identified erroneous records lack a voucher specimen, which prevents reassessment of the species identification.[19]

See also

References

  1. ^ The download page at UCSC says "NCBI places no restrictions on the use or distribution of the GenBank data. However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted. NCBI is not in a position to assess the validity of such claims, and therefore cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in GenBank."
  2. ^ Eric W Sayers; Mark Cavanaugh; Karen Clark; Kim D Pruitt; Conrad L Schoch; Stephen T Sherry; Ilene Karsch-Mizrachi (7 January 2022). "GenBank". Nucleic Acids Research. 50 (D1): D161–D164. doi:10.1093/nar/gkab1135. PMC 8690257. PMID 34850943.
  3. ^ Benson D; Karsch-Mizrachi, I.; Lipman, D. J.; Ostell, J.; Wheeler, D. L.; et al. (2008). "GenBank". Nucleic Acids Research. 36 (Database): D25–D30. doi:10.1093/nar/gkm929. PMC 2238942. PMID 18073190.
  4. ^ Benson D; Karsch-Mizrachi, I.; Lipman, D. J.; Ostell, J.; Sayers, E. W.; et al. (2009). "GenBank". Nucleic Acids Research. 37 (Database): D26–D31. doi:10.1093/nar/gkn723. PMC 2686462. PMID 18940867.
  5. ^ a b c d "GenBank release notes (Release 250)". NCBI. 15 June 2022. Retrieved 20 July 2022.
  6. ^ "How to submit data to GenBank". NCBI. Retrieved 20 July 2022.
  7. ^ "GenBank Submission Types". NCBI. Retrieved 20 July 2022.
  8. ^ Hanson, Todd (2000-11-21). "Walter Goad, GenBank founder, dies". Newsbulletin: obituary. Los Alamos National Laboratory.
  9. ^ LANL GenBank History
  10. ^ Benton D (1990). "Recent changes in the GenBank On-line Service". Nucleic Acids Research. 18 (6): 1517–1520. doi:10.1093/nar/18.6.1517. PMC 330520. PMID 2326192.
  11. ^ Benson, D. A.; Cavanaugh, M.; Clark, K.; Karsch-Mizrachi, I.; Lipman, D. J.; Ostell, J.; Sayers, E. W. (2012). "GenBank". Nucleic Acids Research. 41 (Database issue): D36–D42. doi:10.1093/nar/gks1195. PMC 3531190. PMID 23193287.
  12. ^ Kyung Sun Park; Chang-Seok Ki; Cheol-In Kang; Yae-Jean Kim; Doo Ryeon Chung; Kyong Ran Peck; Jae-Hoon Song; Nam Yong Lee (May 2012). "Evaluation of the GenBank, EzTaxon, and BIBI Services for Molecular Identification of Clinical Blood Culture Isolates That Were Unidentifiable or Misidentified by Conventional Methods". J. Clin. Microbiol. 50 (5): 1792–1795. doi:10.1128/JCM.00081-12. PMC 3347139. PMID 22403421.
  13. ^ EzTaxon-e Database eztaxon-e.ezbiocloud.net (archive accessed 25 March 2021)
  14. ^ leBIBI V5 pbil.univ-lyon1.fr (archive accessed 25 March 2021)
  15. ^ Ogwang, Joel; Bariche, Michel; Bos, Arthur R. (2021). "Genetic diversity and phylogenetic relationships of threadfin breams (Nemipterus spp.) from the Red Sea and eastern Mediterranean Sea". Genome. 64 (3): 207–216. doi:10.1139/gen-2019-0163. PMID 32678985.
  16. ^ van den Burg, Matthijs P.; Herrando-Pérez, Salvador; Vieites, David R. (13 August 2020). "ACDC, a global database of amphibian cytochrome-b sequences using reproducible curation for GenBank records". Scientific Data. 7 (1): 268. Bibcode:2020NatSD...7..268V. doi:10.1038/s41597-020-00598-9. eISSN 2052-4463. PMC 7426930. PMID 32792559.
  17. ^ Li, Xiaobing; Shen, Xuejuan; Chen, Xiao; Xiang, Dan; Murphy, Robert W.; Shen, Yongyi (6 February 2018). "Detection of Potential Problematic Cytb Gene Sequences of Fishes in GenBank". Frontiers in Genetics. 9: 30. doi:10.3389/fgene.2018.00030. eISSN 1664-8021. PMC 5808227. PMID 29467794.
  18. ^ Heller, Philip; Casaletto, James; Ruiz, Gregory; Geller, Jonathan (7 August 2018). "A database of metazoan cytochrome c oxidase subunit I gene sequences derived from GenBank with CO-ARBitrator". Scientific Data. 5 (1). Bibcode:2018NatSD...580156H. doi:10.1038/sdata.2018.156. eISSN 2052-4463. PMC 6080493. PMID 30084847.
  19. ^ Van Den Burg, Matthijs P.; Vieites, David R. (22 September 2022). "Bird genetic databases need improved curation and error reporting to NCBI". Ibis. doi:10.1111/ibi.13143. eISSN 1474-919X. hdl:10261/282622. ISSN 0019-1019.


External links

This page is based on a Wikipedia article. Text is available under the CC BY-SA 3.0 Unported License. Non-text media are available under their specified licenses. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc. WIKI 2 is an independent company and has no affiliation with the Wikimedia Foundation.