The Data Application Team of the Big Data Center
The China National GeneBank (CNGB) serves as a platform for high-performance computing, full data retrieval, and automated data management and analysis. It opens data storage through an API at several levels of granularity, spanning a data layer (data warehouse, data mart, database cluster, index cluster and computing cluster) and an application layer (search engine, data analysis, data visualization, authorization management, data submission and download services), to support PB-level biological data sharing. At present, a series of multi-domain, multi-application biological databases has been constructed. In addition, a data archiving system (CNSA, CNGB Nucleotide Sequence Archive) has been released to provide scientific data storage and sharing. Building on these, a unified platform (CGA, CNGB Global Archive) for biological data sharing and data application will be constructed. The data archiving system will be improved to offer better archiving and submission services. The platform will establish automated data archiving and data authorization management workflows and provide full data retrieval and download services covering data such as genes, mutations, expression, proteins, and phenotypes. Based on the underlying biological data warehouse, cloud computing technology and high-performance automated analysis services, it will provide intelligent computing services for biological data while implementing automated data management and efficient data authorization management.
CNGB Nucleotide Sequence Archive
Biomigo is a basic database of biological data built on CGA (CNGB Global Archive). It provides a full-data retrieval service, integrates a large number of biological data resources, and covers data on genes, mutations, expression, proteins and phenotypes.
CNGB Global Archive is a unified platform for biological data sharing and data application. It provides scientific data storage and sharing services, automated management and archiving of biological data, full data retrieval and download, and intelligent computing services for biological big data.
The databases cover important topics in human health (disease), agriculture, animals, plants, viruses and more.
VizMusée, Visualize Atlas of Lives, supports visualization of all datasets included in BigData Application Center.
The computing tools platform collects a variety of computing tools commonly used in genomics research, such as BLAST.
The Birth Defects Database focuses on birth defects of genetic or partially genetic origin. It collects genotype and phenotype data from birth-defect-related samples to accelerate data sharing, communication and cooperation in this field. By searching, researchers can obtain general information about a particular disease and basic clinical information about cases.
The Database of Human Genetic Variations collects the largest public catalogue of human genetic variation, especially for Chinese populations. Up to now, the database has collected over 220 million human variants identified from more than 30,000 individuals. After removing unnecessary variants, the database holds 120 million human variants. It will continue to collect more human variants, and submissions of personal genetic variation data are warmly welcomed.
Data Integration Solution for Systematic Exploration of Cancer Traits (DISSECT) provides an integrated platform with multi-omics data and various analytic tools, helping users dissect data from common databases as well as their own. The analyses, ranging from a single cluster and single data type to cross-cluster and multi-data-type comparisons, are all designed to be easy to use.
In version 1.1 of HMD (Human Microbiome Database), a new bacterial strain has been added and the retrieval and filtering function for data from 1443 samples has been optimized. Phenotypic data items of interest can be selected from the query results, and statistics and visualization functions have been added. The path of the gene profile (internal data) is also provided.
The Cell-free DNA (cfDNA) database collects sequencing data from the large-scale high-throughput sequencing platform BGI-SEQ 500, providing strong support for quantitative cell-free DNA studies. The database contains not only the detection data and related phenotypic data (for internal visitors) of 10,000 samples, but also a visual presentation of NIPT data.
Children's Cancer DataBase is a collection of phenotype information and multi-omics sequencing data for childhood cancer research, supporting the study of the molecular biology and clinically relevant genomic alterations of childhood cancer.
The Biodiversity Comparative Genomics Database aims to bring together knowledge and omics data sets for different species on Earth, including data sets from major international projects such as B10K and 1KITE. Its goals are to understand species diversity, to construct phylogenetic trees that reveal the evolutionary relationships between species through cross-species comparative analysis, and to build a system for species identification and information query based on species knowledge, biological data, barcodes, pictures and other data.
Agriculture BioDiversity Database (ADD) integrates data from agricultural genome projects with knowledge about different species, providing agricultural researchers with a platform that has a user-friendly interface. ADD organizes the integrated information in a visualized taxonomy structure and also offers a search service. At present, the database includes species introductions, genome data, gene structures and functional annotation information for all plants and some animals whose genomes have been sequenced.
The Human Infertility and Fetal Tissue Database mainly contains genomic sequencing information for several diseases related to human reproductive function, including infertility, recurrent miscarriage, azoospermia, polycystic ovary syndrome, uterine fibroids, endometriosis and adenomyosis. In addition, the database includes sequencing data from fetal tissue and from normal individuals. It provides SNPs (single nucleotide polymorphisms), InDels (small insertions and deletions, shorter than 50 bp), CNVs (copy number variants, including deletions and insertions) and SVs (structural variants, including translocations and inversions) for clinicians and researchers.
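The size-based distinction between the variant classes above can be illustrated with a minimal sketch. It assumes variants are given as simple reference/alternate allele strings; the function name `classify_variant` and the returned labels are hypothetical and are not part of the database's interface. Only the 50 bp InDel threshold comes from the description above.

```python
def classify_variant(ref: str, alt: str, indel_max_bp: int = 50) -> str:
    """Classify a ref/alt allele pair by size (illustrative only).

    Assumption: alleles are plain nucleotide strings and ref != alt.
    """
    # A single-base substitution is treated as a SNP.
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    # Net length change between the two alleles, in base pairs.
    size = abs(len(ref) - len(alt))
    # Small insertions/deletions are those shorter than the threshold.
    if 0 < size < indel_max_bp:
        return "InDel"
    # Longer events would be handled as CNV/SV candidates.
    if size >= indel_max_bp:
        return "LARGE"
    # Same-length, multi-base substitutions fall outside these classes.
    return "OTHER"


print(classify_variant("A", "G"))    # single-base substitution
print(classify_variant("A", "ATG"))  # 2 bp insertion
```

Translocations and inversions cannot be recognized from allele lengths alone, so a real pipeline would need breakpoint information for the SV class.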
Pan Immune Repertoire Database (PIRD) includes the original and analyzed sequences of immunoglobulins (IGs) and T cell receptors (TCRs) of vertebrate species with different phenotypes. Version 1.2 adds more samples and data (1809 samples are now publicly available) as well as data visualization. Users can submit their own immunological data for comparative analysis and visualization. A BLAST sequence alignment service is also provided.
The 10,000 Plants project (tenKP or 10KP) aims to sequence over 10,000 genomes representing every major clade of plants and eukaryotic microbes. The project will generate large-scale plant genome data over five years (2017-2022), addressing fundamental questions about plant evolution. Major supporters include the Beijing Genomics Institute in Shenzhen (BGI-Shenzhen) and the China National GeneBank (CNGB). BGI will support the project by developing new tools for de novo genome sequencing and assembly on BGISEQ platforms.
CNGB Nucleotide Sequence Archive (CNSA) is a convenient and fast online submission system for data on biological research projects, samples, experiments and related information. CNSA is committed to the storage and sharing of biological sequencing information and data, and is designed to provide researchers worldwide with comprehensive data and information resources that are easy to access and use.
The Single Cell DataBase (SCDB) aims to create an atlas of human cells: cataloguing all cells of the body and their subtypes, building a complete list of human cells, defining cell types and constructing a human cell framework. So far, the database has pooled and presented a single-cell project group comprising 46 samples, 30,854 cells, and 470G of single-cell omics data.
Pathogen Variation Database (PVD) focuses on the identification and detection of millions of pathogens in human samples, and contains various pathogen genomic data and related annotation information. PVD presents the results clearly through data analysis and visualization, and will provide toxicity identification and drug resistance information for some pathogens (HBV/HIV/HCV/HP).
CNGB is developing a high-performance sequence searching service for researchers. The current public beta version is based on NCBI BLAST+ 2.6.0 and is integrated with most NCBI BLAST databases and some CNGB public data. In the second half of this year, we will focus on optimizing the sequence searching service with parallel computing methods, collecting new high-quality datasets, integrating visualization functions, and releasing a stable version.
Marine Life Genome Database (MLGD) is an online database that aims to provide comprehensive knowledge and analysis of the genomes of marine life. It collects genome, transcriptome and proteome data and information for the marine species that have been sequenced and published so far. This information is organized according to the taxonomy of marine life, and each species can be searched for and viewed in the taxon tree. At present, MLGD holds information on 472,547 species, 7,538 genomic data sets, and 25,514 images.
Version 1.1 of Pan Immune Repertoire Database (PIRD) is integrated with a knowledge repository that includes records of CDR3 sequences, specific diseases and the corresponding CDR3 information, providing support for immunological disease research.