All projects of Data Center
We aim to integrate big biological data; to build topic-specific databases covering tumor diseases, population polymorphism, biodiversity, microbiology, and other areas; to provide data-sharing systems and communities that meet the needs of researchers in different fields; and to enhance the value of data and promote its development and application.
The millet database is based on data from the millet genome project conducted by BGI and the Zhangjiakou Academy of Agricultural Sciences. The database records genotype-phenotype information for millet. Users can retrieve millet genotypes by phenotype and, conversely, retrieve the corresponding phenotypes by genotype. In addition, the database applies big data technology and machine learning methods to build a genotype-phenotype model that supports intelligent molecular breeding.
The 1,000 Plants project (OneKP or 1KP) is an international multi-disciplinary consortium that has generated large-scale genome sequencing data for over 1,000 species of plants. We built an online BLAST platform based on the OneKP datasets in the database. As of June 21, 2017, BLAST for OneKP had 947 users and had completed more than 74,726 jobs.
Scaffold data: 81G
The Bird 10,000 Genomes (B10K) Project plans to generate representative draft genome sequences from all extant bird species within five years (2015-2020). The B10K project will build a genome-level tree of all bird species, decode the relationship between genetic variation and phenotypic variation, uncover correlations among genetic evolution, biogeography, and biodiversity patterns, evaluate the impact of ecological factors and human influence on species evolution, and reveal demographic history. As of March 14, 2017, B10K has processed 2,500 samples, representing 2,400 species from 1,370 genera, 300 families, and 36 orders.
Scaffold data: 86G
Marine Life Genome Database (MLGD) is an online database that aims to provide comprehensive knowledge and analysis of the genomes of marine life. We have collected the genome, transcriptome, and proteome data and information of the marine species that have been sequenced and published so far. This information is organized according to the taxonomy of marine life, and each species can be searched and viewed in the taxon tree. The information for each species comprises a description, references, images, genetic information, and data. Data sets can be downloaded directly or accessed through a redirect to the NCBI Genome database, and online analysis tools will be added in future editions. We welcome collaborations to sequence and analyze new marine species for a better understanding of the genomes of marine life. At present, MLGD holds information on 472,547 species, 7,538 genomic data records, and 25,514 images.
The FishT1K (Transcriptomes of 1,000 Fishes) project was officially launched by BGI in November 2013, with the aim of generating genome-wide transcriptome sequences for 1,000 diverse species of fishes using RNA-seq. The FishT1K database will establish the first platform for data storage, application, and sharing for fish research, greatly advancing the study of fish biology and eventually contributing to global fish biodiversity conservation and the sustainable use of natural resources. In addition, the database will promote the development of new technologies and software for transcriptome sequencing, data analysis, annotation, and storage.
Scaffold data: 21G
Insects are one of the most species-rich groups of metazoan organisms. They play a pivotal role in most non-marine ecosystems, and many insect species are of enormous economic and medical importance. Unraveling the evolution of insects is essential for understanding how life in terrestrial and polar environments evolved. The 1KITE (1K Insect Transcriptome Evolution) project aims to study the transcriptomes (that is, the entirety of expressed genes) of more than 1,000 insect species encompassing all recognized insect orders.
Contig sequence data: 4.3G
The 10K Mitochondrial Genome Project (MT10K) is a global research project initiated by the China National GeneBank. The project plans to build a mitochondrial database of more than 10,000 mitochondrial genomes, covering all families of animals, to create a truly comprehensive mitochondrial resource. At present, the database contains 5,157 species and about 35Gb of mitochondrial genomic data. The database also incorporates the BLAST and PhyML tools to provide data and analytical support for researchers in the field.
Data sources: 3
The 10,000 Plants project (tenKP or 10KP) aims to sequence over 10,000 genomes representing every major clade of plants and eukaryotic microbes. The project will generate large-scale plant genome data over five years (2017-2022), addressing fundamental questions about plant evolution. Major supporters include Beijing Genomics Institute in Shenzhen (BGI-Shenzhen) and China National GeneBank (CNGB). BGI will support the project by developing new tools for de novo genome sequencing and assembly on BGISEQ platforms.
The Biodiversity Comparative Genomics Database aims to bring together knowledge and omics data sets for different species on Earth, including high-quality data sets from major international projects such as B10K and 1KITE, in order to understand species diversity and to construct phylogenetic trees that reveal the evolutionary relationships between species through cross-species comparative analysis. It also uses species knowledge, biological data, barcodes, pictures, and other data to build a system for species identification and information queries.
Agriculture BioDiversity Database (ADD) is a database that integrates data from agricultural genome projects with knowledge about different species, providing agricultural researchers with a platform that has a user-friendly interface. ADD offers a visualized taxonomy structure to organize the integrated information, along with a search service. At present, the database archives the description, images, references, genomic data, and available gene structure and function annotation for each agriculture-related species, especially plants. In the future, online analysis tools will be added, and the content of the database will be expanded to include transcriptomes, SNPs, and related microorganisms. As the name ADD suggests, we hope the database will keep adding information and become more helpful to agricultural researchers. We therefore welcome collaborations to study new species and to explore the species already sequenced, for a better understanding of agriculture.
Pathogen Variation Database (PVD) focuses on the identification and detection of unknown pathogens in human samples, and contains genomic data for various pathogens together with related annotation information. PVD presents results clearly through data analysis and visualization, and will provide toxicity identification and drug-resistance information for some pathogens (HBV/HIV/HCV/HP). In this way, it offers fast and comprehensive detection services for clinicians, patients, and researchers.
No available statistics.
GDRD is an integrated platform for genetic disease and rare disease research and application, focusing on the collection, storage, analysis, and mining of human genetic data and phenotype data. In GDRD phase I, around 7,000 papers, 10,000 causative variants, and 300 families with rare disorders from BGI and the ClinVar and OMIM databases have been organized and presented on the website. GDRD phase II will then aggregate sequencing data and phenotypic characteristics from a variety of genetic disease and rare disease research projects at CNGB and its collaborators. By the end of 2017, it will add WGS data for 6,000 samples, and the amount of generated data will reach 20 to 30TB. In addition, it will provide an automated interpretation process to help clinicians interpret genetic test results, greatly improving the efficiency of interpretation and avoiding human error.
Pan Immune Repertoire Database (PIRD) focuses mainly on immune data related to the human body. It collects BCR and TCR sequencing data for various diseases, together with the experimental information and phenotype information of the corresponding individuals. PIRD V1 stores data for 1,923 samples and 554,696,060 sequences. PIRD V2 will integrate more samples and data. By the end of 2017, it will add information for 5,000 samples, and the total data will reach 10TB. PIRD provides data comparison and visual analysis services for disease and health researchers and clinicians.
Raw data: 4.7T
Analysis data (compressed): 37G
The Human Microbiome Database (HMD) provides relevant sample and microbial data. The database currently covers 83G of sequencing data and phenotypic information from 1,443 stool samples collected in 8 human intestinal microbiome research projects. It also contains the most complete human intestinal microbial gene set in the world. By the end of 2017, HMD phase II will collect 3,000 samples (blood, saliva, plaque, skin, genital tract flora, and so on), and the amount of data will increase by 250GB. The corresponding gene sets will be made available as well.
GeMap is a comprehensive database that integrates genome data from 27 populations in 18 countries and collects data from six authoritative databases, including 38,659 genes and millions of mutation records. GeMap provides a data retrieval service: users can search by rs id, gene name, disease name, chromosome position, and so on. In the future, GeMap will integrate disease data, so that users can retrieve phenotype information by disease name or retrieve related diseases by gene name. In addition, GeMap will build a personal data analysis workstation where users can upload personal sequencing data and obtain analysis results for their own genome. GeMap not only provides large-scale data support for researchers and medical practitioners, but will also provide the public with easy-to-use personal genome analysis tools and platforms.
Data sources: 6
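The query types listed for GeMap (rs id, chromosome position, gene or disease name) can be told apart with simple pattern matching before a search is dispatched. The sketch below is only an illustration under stated assumptions: the patterns, the `classify_query` function, and the returned field names are hypothetical and do not describe GeMap's actual query syntax or interface.

```python
import re

# Hypothetical patterns; GeMap's real query syntax is not specified in the text.
RS_ID = re.compile(r"^rs\d+$", re.IGNORECASE)
CHROM_POS = re.compile(
    r"^(?:chr)?([0-9]{1,2}|X|Y|MT?):(\d+)(?:-(\d+))?$", re.IGNORECASE
)


def classify_query(query: str) -> dict:
    """Guess which kind of search term the user typed."""
    text = query.strip()
    if RS_ID.match(text):
        return {"type": "rs_id", "value": text.lower()}
    match = CHROM_POS.match(text)
    if match:
        chrom = match.group(1).upper()
        start = int(match.group(2))
        # A single coordinate is treated as a one-base interval.
        end = int(match.group(3)) if match.group(3) is not None else start
        if end < start:
            raise ValueError(f"end before start in {text!r}")
        return {"type": "position", "chrom": chrom, "start": start, "end": end}
    # Anything else is treated as free text: a gene name or a disease name.
    return {"type": "name", "value": text}


if __name__ == "__main__":
    for term in ["rs429358", "chr7:117559590-117559600", "BRCA1"]:
        print(term, "->", classify_query(term))
```

Gene names and disease names cannot be separated reliably by pattern alone, so this sketch leaves both in a single free-text category to be resolved by the search backend.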
DISSECT (Data Integration Solution for Systematic Exploration of Cancer Traits) is a comprehensive data integration platform for cancer research. It includes the first mirror site of the ICGC Data Portal in China, which provides important resources for domestic researchers. Building on big data research, we aim to establish the most comprehensive cancer data integration system through large-scale, standardized platform construction. DISSECT already stores genomic and clinical data from around 20,000 cancer cases and will continue to release updated data. The second version of DISSECT will integrate cancer WGS data from 10,000 Chinese individuals, provided by the BGI cancer research institute. The main value of the system lies in integrating omics data and enabling in-depth mining and analysis of large sample sets for a single cancer or multiple cancers, in support of the development of precision cancer medicine in China.
All data: 18G
The Database of Human Genetic Variations (DHGV) advances scientific research and human health by providing the variations of different human populations around the world. So far, the database contains more than 10,145 human samples from around the world and more than 170 million mutations. The database will continue to collect genetic variation data from the Chinese population. In the first release of DHGV, you can search for any variant to obtain its variation information, including alleles, distribution, frequency, and genome annotations. The second version is intended to add information on the relationship between mutations and disease. In addition, DHGV's free data service will help drive population research worldwide, especially on the Chinese population, including research and applications in evolutionary origins, genetic diseases, and precision medicine.
The single cell database will integrate the atlas of human cells, catalog all types of cells in the body including their subtypes, build a complete list of human cells, define human cells, and construct the cell framework. The first version of the single cell database presents four projects comprising 46 samples, 30,854 cells, and 470GB of data.
Birth Defects Database focuses on birth defects of genetic or partially genetic origin. It collects the genetic data and phenotype data of samples related to birth defects, with the goal of accelerating data sharing, exchange, and cooperation in this field. By searching, researchers can obtain general information about a particular disease and basic clinical information about cases.
ICGC Data Portal China Mirror provides visualization, query and download services of tumor data, covering 70 tumor research projects, 19,290 samples, 46,429,997 individual mutations and 57,658 mutant genes. The mirror site is regularly synchronized with the ICGC master station, providing more rapid service to Chinese researchers.
Cancer primary sites: 21
Simple somatic mutations: 46,693,172
Mutated genes: 57,658
Details to follow.
No available statistics.
Children's Cancer Database was established to support research on and improve treatments for children with cancer. Cancer is the second leading cause of death (following accidents) in children aged 0 to 14 years. The most common types of cancer diagnosed in children aged 0 to 14 years are leukemias and pediatric brain tumors. In the last 40 years, the overall survival rate for children's cancer has increased from 10% to nearly 90%. DNA and other tissue samples from childhood cancer patients were collected at Beijing Tiantan Hospital and then sequenced at BGI. This database is a collection of genotype and phenotype data for research on childhood cancer. Applying next-generation sequencing to the analysis of childhood cancers can provide an unprecedented understanding of their molecular biology and clinically relevant alterations.
Details to follow.
No available statistics.
CNGB Nucleotide Sequence Archive (CNSA) is a service for the online submission of data and metadata for projects, samples, experiments, and sequencing. It meets the need to share small batches of sequencing data. CNSA is committed to storing and sharing data of any type and from any phase of projects, samples, and experiments. It archives raw data as well as intermediate and final data files produced by a wide variety of sequencing platforms. As a data storage and sharing platform, CNSA adopts the International Nucleotide Sequence Database Collaboration (INSDC) standard, accepts data submissions from all over the world, and shares them with INSDC. In addition, each complete dataset in CNSA (including phenotype, omics, experimental analysis methods, and other data) will be assigned a DOI that can be used as a standard citation for future use of these data. CNSA makes biological sequence data available to researchers to enhance reproducibility, improve utilization, and allow new discoveries through the comparison of data sets.
No available statistics.
BLAST stands for Basic Local Alignment Search Tool. The BLAST service of CNGB is built on the standalone version of NCBI BLAST+ 2.6.0, downloaded from the NCBI FTP server, and provides sequence searching over public data from CNGB applications, BGI projects, and external data sources. In the name "the BLAST service of CNGB", the word BLAST stands for sequence search methods in general. More types of sequence search algorithms will be integrated in the future.
CNGB Projects: 4
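Because the service is described as built on standalone NCBI BLAST+, a local search of the same kind can be sketched as follows. This is a minimal illustration, not the service's actual implementation: the helper names `build_blastn_command` and `parse_outfmt6`, the database name `example_db`, and the file names are hypothetical. The flags used (`-query`, `-db`, `-evalue`, `-outfmt 6`, `-out`) are standard `blastn` options, and tabular format 6 has 12 default columns.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """A subset of the 12 default columns of BLAST+ tabular output (-outfmt 6)."""
    qseqid: str
    sseqid: str
    pident: float
    length: int
    evalue: float
    bitscore: float


def build_blastn_command(query_fasta: str, db: str, out_path: str,
                         evalue: float = 1e-5) -> List[str]:
    """Assemble a blastn call; run it with subprocess.run(cmd, check=True)."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", db,               # hypothetical local database name
        "-evalue", str(evalue),
        "-outfmt", "6",          # tab-separated, 12 default columns
        "-out", out_path,
    ]


def parse_outfmt6(text: str) -> List[Hit]:
    """Parse tabular BLAST output into Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 12:
            raise ValueError(f"expected 12 columns, got {len(cols)}")
        hits.append(Hit(cols[0], cols[1], float(cols[2]), int(cols[3]),
                        float(cols[10]), float(cols[11])))
    return hits


if __name__ == "__main__":
    print(" ".join(build_blastn_command("query.fa", "example_db", "hits.tsv")))
    sample = "q1\ts1\t98.50\t200\t3\t0\t1\t200\t50\t249\t1e-100\t360\n"
    print(parse_outfmt6(sample))
```

Keeping command construction separate from output parsing lets the parser be checked against saved result files without BLAST+ being installed.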
GigaDB primarily serves as a repository to host data and tools associated with articles in GigaScience. However, it also includes a subset of datasets that are not associated with GigaScience articles. GigaDB defines a dataset as a group of files (e.g., sequencing data, analyses, imaging files, software programs) that are related to and support an article or study. Through the association with DataCite, each dataset in GigaDB will be assigned a DOI that can be used as a standard citation for future use of these data in other articles by the authors and other researchers. Datasets in GigaDB all require a title that is specific to the dataset, an author list, and an abstract that provides information specific to the data included within the set. We encourage detailed information about the data we host to be submitted by their creators in ISA-Tab, a format used by the BioSharing and ISA Commons communities that we work with to maintain the highest data and metadata standards in our journal. To maximize its utility to the research community, all datasets in GigaDB are placed under a CC0 waiver (for more information on the issues surrounding CC0 and data see Hrynaszkiewicz and Cockerill, 2012). Datasets that are not affiliated with a GigaScience article are approved for inclusion by the Editors of GigaScience. The majority of such datasets are from internal projects at the BGI, given their sponsorship of GigaDB. Many of these datasets may not have another discipline-specific repository suitably able to host them or have been rapidly released prior to any publications for use by the research community, whilst enabling their producers to obtain credit through data citation. The GigaScience Editors may also consider the inclusion of particularly interesting, previously unpublished datasets in GigaDB, especially if they meet our criteria and inclusion as Data Note articles in the journal.
No available statistics.
China National GeneBank provides a bio-data search engine, BioMiGo, for data sharing. BioMiGo integrates comprehensive resources including 1,000 Plants (OneKP), 10K Mitochondrion Genome (MT10K), 1K Insect Transcriptome Evolution (1KITE), Transcriptomes of 1,000 Fishes (FishT1K), Bird 10K Genomes (B10K), and ICGC Data Portal China Mirror (ICGC). It makes more than 200 million entries searchable, covering more than 7 thousand species, 27 populations, 70 thousand samples, millions of genes, and PB-level downloadable data.
CNGB Projects: 14