China National GeneBank (CNGB) is developing a new system architecture to rebuild the platform's data layer (data warehouse, data mart, database cluster, index cluster, and computing cluster) and application layer (search engine, data analysis, data visualization, authorization management, and data submission service) so that it can support the rapid growth of petabyte-scale biological data. We also use API services to manage and expose the system's capabilities, including storage, computing, networking, search, and analysis. CNGB has already upgraded some existing database applications and developed new ones on the new architecture, and will release more powerful applications with new features (including a data search engine and submission services) in the second half of this year.
The Single Cell DataBase (SCDB) aims to create an atlas of human cells: cataloging all cell types and subtypes in the body, building a complete list of human cells, defining cells, and constructing a human cell framework. To date, the database has collected and presents a single-cell project group comprising 46 samples, 30,854 cells, and 470 GB of single-cell omics data.
The Pathogen Variation Database (PVD) focuses on identifying and detecting millions of pathogens in human samples, and contains a variety of pathogen genomic data and related annotation information. PVD presents its results clearly through data analysis and visualization, and will provide toxicity identification and drug-resistance information for some pathogens (HBV, HIV, HCV, and HP).
CNGB is developing a high-performance sequence search service for researchers. The current public beta is based on NCBI BLAST+ 2.6.0 and integrates most of the NCBI BLAST databases and some CNGB public data. In the second half of this year, we will focus on optimizing the service with parallel computing methods, collecting new high-quality datasets, integrating visualization functions, and releasing a stable version.
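As a rough illustration of what a BLAST+-based search involves, the sketch below builds a `blastn` command line and parses its tabular output. It is not the CNGB service's implementation: the function names, the default E-value, and the thread count are assumptions chosen for the example. Only standard BLAST+ options (`-query`, `-db`, `-evalue`, `-num_threads`, `-outfmt 6`) are used.

```python
from typing import Dict, List

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def build_blastn_command(query_fasta: str, database: str,
                         evalue: float = 1e-5, threads: int = 4) -> List[str]:
    """Build a blastn argument list (hypothetical helper).

    The list can be passed to subprocess.run on a machine where
    BLAST+ and the named database are installed.
    """
    return [
        "blastn",
        "-query", query_fasta,
        "-db", database,
        "-evalue", str(evalue),
        "-num_threads", str(threads),  # BLAST+'s built-in multithreading
        "-outfmt", "6",                # tab-separated, one hit per line
    ]


def parse_outfmt6(text: str) -> List[Dict[str, str]]:
    """Parse tabular BLAST output into one dict per hit (values kept as strings)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hits.append(dict(zip(OUTFMT6_FIELDS, line.split("\t"))))
    return hits
```

Keeping command construction separate from output parsing means the parser can be exercised on saved result files without a BLAST installation.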
The Marine Life Genome Database (MLGD) is an online database that aims to provide comprehensive knowledge and analysis of the genomes of marine organisms. We have collected genome, transcriptome, and proteome data and information for the marine species that have been sequenced and published so far. This information is organized according to the taxonomy of marine life, and each species can be searched and viewed in the taxon tree. At present, MLGD holds information on 472,547 species, 7,538 genomic data records, and 25,514 image records.
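Searching a species in a taxon tree can be sketched as a depth-first lookup that returns the lineage from the root to the matching node. The `Taxon` class, the `find_lineage` function, and the tiny example tree below are hypothetical and heavily simplified (intermediate ranks are omitted); they do not reflect MLGD's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Taxon:
    """One node of a taxonomy tree (hypothetical structure)."""
    name: str
    rank: str
    children: List["Taxon"] = field(default_factory=list)


def find_lineage(root: Taxon, target: str) -> Optional[List[str]]:
    """Depth-first search; return names from the root down to target, or None."""
    if root.name == target:
        return [root.name]
    for child in root.children:
        path = find_lineage(child, target)
        if path is not None:
            return [root.name] + path
    return None


# Tiny illustrative tree, not MLGD data.
tree = Taxon("Chordata", "phylum", [
    Taxon("Actinopterygii", "class", [
        Taxon("Gadus morhua", "species"),        # Atlantic cod
    ]),
    Taxon("Mammalia", "class", [
        Taxon("Tursiops truncatus", "species"),  # bottlenose dolphin
    ]),
])

lineage = find_lineage(tree, "Gadus morhua")
```

For a tree with hundreds of thousands of species, a production system would more likely rely on an index keyed by name than on a full traversal; the recursive search is used here only to show the structure.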
Version 1.1 of the Pan Immune Repertoire Database (PIRD) integrates a knowledge repository that includes records of CDR3 sequences, specific diseases, and the corresponding CDR3 information, and supports immunological disease research.