To provide a uniform, externally shared portal for biological data sharing and application services, CNGB has built a platform consisting of a data layer (data warehouse, data mart, database cluster, index cluster and computing cluster) and an application layer (search engine, data analysis, authorization management, data submission and download services). The data storage is exposed through APIs at several levels of granularity to support petabyte-scale biological data sharing. In addition, we provide a convenient and fast online submission platform, the CNGB Nucleotide Sequence Archive (CNSA), which archives raw sequencing data together with projects, samples, experiments, assemblies and other supporting data. It offers data submission, data download and data management services to researchers all over the world.
1 Pan Immune Repertoire Database (PIRD V1.1)
The Pan Immune Repertoire Database (PIRD), which focuses on human immune research, has collected information on 1,923 samples and 554,696,060 sequences. All of them are reads from BCR and TCR data, accompanied by experimental and phenotype information from various diseases. This release, PIRD V1.1, incorporates a repository that records CDR3 sequences together with the specific diseases associated with them, providing support for immunological disease research. In the new version, which is under development and will be released this year, the samples and data will increase to 5,000 individuals and 10 TB, respectively. PIRD aims to provide data analysis and visualization services for researchers and clinicians in the field of disease and public health who have few computing resources, analysis tools and data.
2 Pathogen Variation Database
The Pathogen Variation Database (PVD) focuses on the identification and detection of millions of pathogens in human samples, and contains various pathogen genomic data and related annotation information. PVD presents results clearly through data analysis and visualization, and will provide toxicity identification and drug-resistance information for some pathogens (HBV/HIV/HCV/HP). In the future, we will offer fast and comprehensive detection services for clinicians, patients and researchers.
3 Single Cell Database (SCDB)
The Single Cell Database will create an atlas of human cells: it will catalog all types of cells in the body, including subtypes, build a complete list of human cells, define human cells and construct a cell framework. The first version of the database presents four projects comprising 46 samples, 30,854 cells and 470 GB of data.
4 Marine Life Genome Database (MLGD)
The Marine Life Genome Database (MLGD) is an online database that aims to provide comprehensive knowledge and analysis of the genomes of marine life. We have collected genome, transcriptome and proteome data and information for the marine species that have been sequenced and published so far. This information is organized according to the taxonomy of marine life, and each species can be found and reviewed in the taxon tree. The information for each species comprises a description, references, images, genetic information and data. Data sets can be downloaded directly or via a redirect to the NCBI Genome database, and we will add online analysis tools in future editions. We warmly welcome cooperation to sequence and analyze new marine species for a better understanding of the genomes of marine life.
The species are organized according to their taxonomy, with categories ranging from kingdom to subsection. Each category is colored differently, as described in the legend. A category can be selected by searching in the form or by clicking on a node in the taxon tree. The scale of the taxon tree can be adjusted by scrolling the mouse wheel, and the tree can be moved by clicking and dragging. At present, MLGD holds information on 472,547 species, 7,538 genomic data records and 25,514 images.
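As a rough illustration of how a taxon tree like the one described above can be represented and searched, here is a minimal Python sketch. It is not MLGD's implementation: the `TaxonNode` class, the `find_lineage` function and the `build_example_tree` helper are hypothetical names introduced here, and the single lineage (Atlantic cod) is only an example entry, not data taken from MLGD.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TaxonNode:
    """One node of a taxon tree (hypothetical structure, not MLGD's schema)."""
    name: str
    rank: str
    children: List["TaxonNode"] = field(default_factory=list)

    def add(self, child: "TaxonNode") -> "TaxonNode":
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child


def find_lineage(node: TaxonNode, name: str,
                 path: Optional[List[str]] = None) -> Optional[List[str]]:
    """Depth-first search; returns the names from the root down to `name`,
    or None if no node with that name exists (case-insensitive match)."""
    path = (path or []) + [node.name]
    if node.name.lower() == name.lower():
        return path
    for child in node.children:
        found = find_lineage(child, name, path)
        if found is not None:
            return found
    return None


def build_example_tree() -> TaxonNode:
    """Build a one-lineage example tree for the Atlantic cod."""
    root = TaxonNode("Animalia", "kingdom")
    node = root
    for taxon, rank in [
        ("Chordata", "phylum"),
        ("Actinopterygii", "class"),
        ("Gadiformes", "order"),
        ("Gadidae", "family"),
        ("Gadus", "genus"),
        ("Gadus morhua", "species"),
    ]:
        node = node.add(TaxonNode(taxon, rank))
    return root


if __name__ == "__main__":
    tree = build_example_tree()
    print(" > ".join(find_lineage(tree, "gadus morhua")))
```

Searching by name returns the full lineage, which corresponds to locating a species in the tree through the search form; clicking a node would correspond to starting the traversal from that node instead of the root.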
5 BLAST
CNGB is developing a high-performance sequence search service for researchers. The current public beta version is based on NCBI BLAST+ 2.6.0 and integrates most of the NCBI BLAST databases and some of the CNGB public data. In the second half of this year, we will focus on optimizing the sequence service with parallel computing methods, collecting new high-quality datasets, providing visualization functions, and releasing the stable version.
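For readers who want to see what a BLAST+ search looks like outside the web interface, the following Python sketch assembles a local `blastn` command and parses its tabular output (`-outfmt 6`). It says nothing about how the CNGB service is implemented or called: the file names, the database name and the sample hit line are made-up placeholders, and `build_blastn_command` and `parse_tabular` are hypothetical helper names. Only the `blastn` flags and the 12 default columns of output format 6 come from BLAST+ itself.

```python
from typing import List, NamedTuple


class BlastHit(NamedTuple):
    """The 12 default columns of BLAST+ tabular output (-outfmt 6)."""
    qseqid: str
    sseqid: str
    pident: float
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


def build_blastn_command(query_fasta: str, database: str, out_path: str,
                         evalue: float = 1e-5, threads: int = 4) -> List[str]:
    """Return an argument list suitable for subprocess.run (not executed here)."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", database,
        "-out", out_path,
        "-outfmt", "6",
        "-evalue", str(evalue),
        "-num_threads", str(threads),
    ]


def parse_tabular(text: str) -> List[BlastHit]:
    """Parse tab-separated -outfmt 6 text, skipping blank and comment lines."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        if len(f) != 12:
            raise ValueError(f"expected 12 columns, got {len(f)}: {line!r}")
        hits.append(BlastHit(
            f[0], f[1], float(f[2]),
            int(f[3]), int(f[4]), int(f[5]),
            int(f[6]), int(f[7]), int(f[8]), int(f[9]),
            float(f[10]), float(f[11]),
        ))
    return hits


if __name__ == "__main__":
    # Placeholder paths and database name, for illustration only.
    print(" ".join(build_blastn_command("query.fa", "my_db", "hits.tsv")))
    # Made-up hit line, for illustration only.
    sample = "read1\tref1\t98.50\t200\t3\t0\t1\t200\t1001\t1200\t1e-100\t360\n"
    for hit in parse_tabular(sample):
        print(hit.sseqid, hit.pident, hit.evalue)
```

The command is built as an argument list rather than a shell string so that it can be passed to `subprocess.run` without quoting issues once BLAST+ is installed locally.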
CNGB constructs topic-specific databases covering tumor diseases, population polymorphism, biodiversity, microbiology and other areas, providing data sharing systems and communities that meet the needs of researchers in different fields, enhance the value of data, and promote the development and application of data.
Biological diversity
1KITE: 1K Insect Transcriptome Evolution
B10K: Bird 10K Genomes
FishT1K: Transcriptomes of 1,000 Fishes
MilletDB: Millet DataBase
MLGD: Marine Life Genome Database
OneKP: 1000 Plants
MT10K: 10K Mitochondrion Genome
ADD: Agriculture BioDiversity Database
10KP: 10,000 plants
Health & Disease
BDDB: Birth Defects Database
DHGV: Database of Human Genetic Variations
DISSECT: Data Integration Solution for Systematic Exploration of Cancer Traits
GDRD: Genetic Disease and Rare Disease database
GeMap: Human omics-scale annotation system
MDB: Microbiome Database
ICGC: ICGC Data Portal China Mirror
PIRD: Pan Immune Repertoire Database
PVD: Pathogen Variation Database
SCDB: Single Cell DataBase
Service
Biomigo: Biomigo
BLAST: The public beta version of high-performance sequence alignment service
CNSA: CNGB Nucleotide Sequence Archive
GigaDB: GigaDB