The CNGBdb scientific databases build data applications in different fields on top of the underlying data structures and data of CNGBdb. They aim to provide scientific data services for research areas such as biodiversity, microbes, cancer, immunity, reproductive health, and pathogens, to meet the needs of researchers in these fields, to enhance the value of data, and to promote data development and application.
The 1,000 Plants project (OneKP or 1KP) is an international multidisciplinary consortium that has generated large-scale genome sequencing data for over 1,000 species of plants. We built an online BLAST platform based on the OneKP datasets in the database. By June 21, 2017, BLAST for OneKP had 947 users and had completed more than 74,726 jobs.
Species: 1199 Samples: 1504 Scaffold data: 81G
The millet database is based on data from the millet genome project conducted by BGI and the Zhangjiakou Academy of Agricultural Sciences. The database records genotype-phenotype information for millet. Users can retrieve millet genotypes by phenotype, and the corresponding phenotypes by genotype. In addition, the database applies big data technology and machine learning methods to build a genotype-phenotype model that supports intelligent molecular breeding.
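The two-way retrieval described above can be pictured as a pair of inverted indexes over the same records. The following is only a minimal sketch under stated assumptions: the record layout, accession names, locus labels, and phenotype labels are all hypothetical and do not reflect the actual schema or interface of the millet database.

```python
from collections import defaultdict

# Hypothetical records: (accession, genotype at one locus, phenotype label).
# None of these values come from the millet database itself.
RECORDS = [
    ("acc_001", "locus_A:GG", "early heading"),
    ("acc_002", "locus_A:GG", "early heading"),
    ("acc_003", "locus_A:AA", "late heading"),
    ("acc_004", "locus_A:AG", "early heading"),
]


def build_indexes(records):
    """Build one index per lookup direction from the same records."""
    by_phenotype = defaultdict(set)
    by_genotype = defaultdict(set)
    for _accession, genotype, phenotype in records:
        by_phenotype[phenotype].add(genotype)
        by_genotype[genotype].add(phenotype)
    return by_phenotype, by_genotype


BY_PHENOTYPE, BY_GENOTYPE = build_indexes(RECORDS)


def genotypes_for(phenotype):
    """Return the genotypes observed with a phenotype, sorted."""
    return sorted(BY_PHENOTYPE.get(phenotype, ()))


def phenotypes_for(genotype):
    """Return the phenotypes observed with a genotype, sorted."""
    return sorted(BY_GENOTYPE.get(genotype, ()))
```

Keeping both indexes in step means either direction of the query is a single dictionary lookup, at the cost of storing each association twice.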
The Bird 10,000 Genomes (B10K) Project plans to generate representative draft genome sequences from all extant bird species within five years (2015-2020). The B10K project will complete a genome-level tree of all bird species, decode the relationship between genetic variation and phenotypic variation, uncover how genetic evolution correlates with biogeographical and biodiversity patterns, evaluate the impact of ecological factors and human influence on species evolution, and reveal the demographic history of birds. As of March 14, 2017, B10K had processed 2,500 samples, representing 2,400 species from 1,370 genera, 300 families, and 36 orders.
Species: 933 Samples: 940 Scaffold data: 86G
The FishT1K (Transcriptomes of 1,000 Fishes) project was officially launched by BGI in November 2013, with the aim of generating genome-wide transcriptome sequences for 1,000 diverse species of fishes using RNA-seq. The FishT1K database will establish the first platform for storing, applying, and sharing data for fish research, greatly advancing the study of fish biology and eventually contributing to global fish biodiversity conservation and the sustainable use of natural resources. In addition, the database will promote the development of new technologies and software for transcriptome sequencing, data analysis, annotation, and storage.
Species: 129 Samples: 158 Scaffold data: 21G
The Microbiome Database (MDB) provides sample and microbial data. The Human Microbial Database currently covers 83G of sequencing data and phenotypic information from 1,443 stool samples collected in 8 human intestinal microbiome research projects. It also contains the most complete human intestinal microbial gene set in the world. By the end of 2017, MDB phase II will collect 3,000 samples (blood, saliva, plaque, skin, genital tract flora, and so on), and the amount of data will increase by 250GB. The corresponding gene sets will also be made available.
DISSECT (Data Integration Solution for Systematic Exploration of Cancer Traits) is a comprehensive data integration platform for cancer research. It includes the first mirror site of the ICGC Data Portal in China, which provides important resources for domestic researchers. Building on big data research, we aim to establish the most comprehensive cancer big data integration system by constructing a large-scale, standardized data platform. DISSECT already stores genomic and clinical data from around 20,000 cancer cases and will continue to release updated data. The second version of DISSECT will integrate WGS data from 10,000 Chinese cancer patients provided by the BGI cancer research institute. The system's greatest value lies in integrating omics data and supporting in-depth mining of large sample sets, for a single cancer or across multiple cancers, to support the development of precision cancer medicine in China.
Cancers: 37 Projects: 112 Samples: 46299
The Pan Immune Repertoire Database (PIRD) focuses on immune data related to the human body. It collects BCR and TCR sequencing data for various diseases, together with experimental and phenotype information for the corresponding individuals. PIRD V1 stores data from 1,923 samples and 554,696,060 sequences. PIRD V2 will integrate more samples and data. By the end of 2017, it will add information on 5,000 samples, and the total data will reach 10TB. The PIRD provides data comparison and visualization services for researchers and clinicians in the fields of disease and public health.
Projects: 6 Samples: 1824 Individuals: 1608 Raw data: 1TB Sequences: 564057891
The Chinese Millionome Database (CMDB) is a unique large-scale Chinese genomics database produced by BGI and hosted in the National GeneBank. The CMDB periodically delivers useful variation information and scientific insights derived from the analysis of sequencing data from millions of Chinese individuals. The results aim to promote genetic research and precision medicine initiatives in China.
GDRD is an integrated platform for genetic disease and rare disease research and application. It focuses on the collection, storage, analysis, and mining of human genetic data and phenotype data. In GDRD phase I, around 7,000 papers, 10,000 causative variants, and 300 families with rare disorders from the BGI, ClinVar, and OMIM databases have been organized and presented on the website. GDRD phase II will then aggregate sequencing data and phenotypic traits from a variety of genetic disease and rare disease research projects at CNGB and its collaborators. By the end of 2017, it will add WGS data from 6,000 samples, and the amount of generated data will reach 20 to 30TB. In addition, it will provide an automated interpretation process to help clinicians interpret genetic test results, greatly improving the efficiency of interpretation and avoiding human error.
Papers: 25 Families: 286 Variations: 11104
The Pathogen Variation Database (PVD) focuses on the identification and detection of unknown pathogens in human samples, and contains a variety of pathogen genomic data and related annotation information. The PVD presents results clearly through data analysis and visualization, and will provide toxicity identification and drug resistance information for some pathogens (HBV/HIV/HCV/HP). In this way, we offer fast and comprehensive detection services for clinicians, patients, and researchers.
Samples: 1501 Ribose types: 164 Virulence gene representative samples: 135