CNGB Nucleotide Sequence Archive (CNSA) released
October 25, 2017
On October 25, 2017, China National GeneBank released the CNGB Nucleotide Sequence Archive (CNSA). CNSA is a convenient and fast online submission system for biological research projects, samples, experiments, and related data. CNSA is committed to storing and sharing biological sequencing data and information, and is designed to provide researchers worldwide with comprehensive data and information resources that are easy to access and use in depth.
With the development of biotechnology, a large volume of biological research data has been produced. These results need to be shared, which creates bottlenecks in data security management and efficient transmission. CNSA, established by China National GeneBank and built for life science big data, is designed to address these problems.
Adopting authoritative international data structure standards to support global sharing of scientific research
CNSA accepts submissions of raw reads and other supporting data and integrates the INSDC and DataCite standards, so that research data of different types and scales can be shared.
Following international open data policies and complementing the publication process for research results worldwide
CNSA follows international open data policies such as the Fort Lauderdale Agreement, the NHGRI Rapid Data Release Policies, the Joint Data Archiving Policy, and CC0 (No Rights Reserved). It accepts sequencing data from research worldwide, including raw data and other supporting data, and its submission service can supplement the literature publishing process to support early data sharing.
Respecting the data permissions and rights constraints stated by the user
CNSA follows the "Interim Measures for the Management of Human Genetic Resources" and the ethical norms of users' countries. Researchers need to send an electronic copy of the Ethics Committee's approval document to firstname.lastname@example.org. For data that involves approval for the collection, sale, export, or exit of human genetic resources, researchers also need to send an electronic copy of the approval issued by the relevant human genetic resources management department (for example, in a region or country that has human genetic resources management regulations).
Ensuring a level of security that takes the categories of data into account
CNSA applies technical and management measures matched to each data type and processing method, so that data at each level receives the appropriate security.
Using a high-performance distributed archiving system
CNSA uses a high-performance distributed system for data archiving, with an independent, highly available backup storage system for secure data storage.
Having the high-speed internet network and logistics network
CNSA relies on the high-speed internet network and logistics network of BGI and CNGB, which cover multiple centers worldwide, to synchronize data quickly to the global public databases.
Having a full-text search engine
CNSA has a full-text search engine that supports petabytes of data, arbitrary keyword combinations, and fast location of results.
Providing localized Chinese-language services, fast feedback, and zero-distance communication
CNSA provides human-staffed bilingual service in Chinese and English. Users can contact us by phone, email, and other channels for barrier-free, zero-distance communication.
CNSA Quick Start Guide:
1. Raw sequence data submission
Raw data refers to all of the original data generated by a sequencing run, in principle without any filtering. For raw sequence data submission, CNSA adopts the INSDC data standards and structure for data review and archiving, covering project, sample, experiment, and data submission. After the raw data and related metadata have been submitted and reviewed by a data administrator, CNSA, acting as an ENA broker, by default synchronizes the data to the ENA (European Nucleotide Archive) public database to obtain ENA accession IDs. The IDs are returned automatically to CNSA, where submitters can view them on the overview page of the relevant module. If the submitted data requires permission control, or needs to be uploaded to NCBI SRA (Sequence Read Archive, National Center for Biotechnology Information) or DDBJ DRA (Sequence Read Archive, DNA Data Bank of Japan), please contact the administrator at email@example.com.
2. Other support data submission
Other support data means data other than raw reads that relates to an article or research project. It includes, but is not limited to, process and result data, analysis methods, software programs, image files, audio files, video files, imaging files, electronic charts, and Word documents. CNSA cooperates with GigaScience GigaDB to archive support data. Through a link to DataCite, each dataset is assigned a DOI that can be referenced directly (Fig. 2).
3. Data search and download
With the full-text search engine on the CNSA home page, users can search with any combination of keywords, obtain results quickly, and locate and download the related data. To download data, find the accession ID of a Run or Assembly through the full-text search, then click it to open the Run or Assembly page.
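Step 2 above notes that each support dataset receives a DOI through DataCite. As a minimal sketch of what referencing a dataset by DOI looks like in practice, the Python snippet below turns a DOI string into a resolver URL on the public https://doi.org/ proxy. The function name `doi_to_url` and the example DOI are hypothetical placeholders for illustration, not identifiers issued by CNSA or GigaDB, and the pattern is only a loose sanity check.

```python
import re
from urllib.parse import quote

# A DOI has the form "10.<registrant>/<suffix>". This pattern is a loose
# sanity check, not a full validation of DOI syntax.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org/ resolver URL for a DOI string."""
    doi = doi.strip()
    # Accept the common "doi:" prefix and drop it.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return "https://doi.org/" + quote(doi, safe="/")


if __name__ == "__main__":
    # Hypothetical DOI, used only for illustration.
    print(doi_to_url("doi:10.1234/example-dataset"))
```

Opening the resulting URL in a browser redirects to the landing page registered for that DOI, which is why a DOI can be cited in a paper without pointing at a specific host.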
China National GeneBank (CNGB) joins Data Center Alliance and Open Data Center Committee
September 2017
In September 2017, the China National GeneBank (CNGB) joined the Data Center Alliance as a member of the board of directors and took on the responsibilities of an associate member. Xun Xu, executive director of CNGB and dean of BGI, serves as director of the Alliance. He will give priority to the work of the Alliance's infrastructure working group, IT equipment working group, internet security working group, international cooperation committee, and related groups, and will gradually extend to other work.
At the same time, the CNGB will join the Open Data Center Committee (ODCC), participate fully in all working groups of the Alliance, and gradually improve the standardization of the CNGB's construction and development. Joining domestic and foreign scientific research organizations and standardization agencies, and participating in related projects, will promote and support the rapid development of the CNGB. It will also help turn the distinctive technologies of the life sciences field into transferable technology standards, accelerate the integration of biotechnology (BT) and information technology (IT), and build a broader development platform for cross-disciplinary talent.
China National GeneBank (CNGB) database series adds new members
May 8, 2018
To provide a unified external portal, CNGB has built a platform for biological data sharing and application services. It consists of a data layer (data warehouse, data mart, database cluster, index cluster, and computing cluster) and an application layer (search engine, data analysis, authorization management, and data submission and download services). The data store is opened through APIs at several levels of granularity to support petabyte-scale biological data sharing. In addition, we have launched a convenient and fast online submission platform, the CNGB Nucleotide Sequence Archive (CNSA), to archive raw sequencing data, including projects, samples, experiments, and assemblies, along with other support data. It provides data submission, download, and management services for researchers all over the world.
1 Pan Immune Repertoire Database (PIRD V1.1)
The Pan Immune Repertoire Database (PIRD), which focuses on human immune research, has collected information on 1,923 samples and 554,696,060 sequences. All of them are BCR- and TCR-related reads, accompanied by experimental and phenotype information from various diseases. This release, PIRD V1.1, incorporates a repository that records CDR3 sequences together with the specific diseases they correspond to, providing support for immunological disease research. In the new version, which is under development and will be released this year, the samples and data will increase to 5,000 individuals and 10 TB, respectively. PIRD aims to provide data analysis and visualization services for researchers and clinicians in the field of disease and public health who have few computing resources, analysis tools, or data.
2 Pathogen Variation Database
The Pathogen Variation Database (PVD) focuses on the identification and detection of millions of pathogens in human samples, and contains a variety of pathogen genomic data and related annotation information. PVD presents results clearly through data analysis and visualization, and will provide toxicity identification and drug resistance information for some pathogens (HBV, HIV, HCV, HP). In the future, we will offer fast and comprehensive detection services for clinicians, patients, and researchers.
3 Single Cell Database (SCDB)
The Single Cell Database will create an atlas of human cells: it will catalog all kinds of cells in the body, including subtypes, build a complete list of human cells, define human cells, and construct a cell framework. The first version of the Single Cell Database presents four projects comprising 46 samples, 30,854 cells, and 470 GB of data.
4 Marine Life Genome Database (MLGD)
Marine Life Genome Database (MLGD) is an online database that aims to provide comprehensive knowledge and analysis of the genomes of marine life. We have collected genome, transcriptome, and proteome data and information for the marine species that have been sequenced and published so far. This information is organized according to the taxonomy of marine life, and each species can be found and reviewed in the taxon tree. The information for each species includes a description, references, images, genetic information, and data. Datasets can be downloaded directly or through a redirect to the NCBI Genome database, and we will add online analysis tools in future editions. We welcome cooperation to sequence and analyze new marine species for a better understanding of the genomes of marine life.
The species are organized according to taxonomic information, with categories from kingdom to subsection. Each category is colored differently, as described in the legend. A category can be selected by searching in the form or by clicking a node in the taxon tree. The taxon tree can be zoomed by scrolling the mouse wheel and moved by clicking and dragging. At present, MLGD holds 472,547 species records, 7,538 genomic datasets, and 25,514 images.
CNGB is developing a high-performance sequence search service for researchers. The current public beta is based on NCBI BLAST+ 2.6.0 and integrates most of the NCBI BLAST databases and some CNGB public data. In the second half of this year, we will focus on optimizing the sequence service with parallel computing, collecting new high-quality datasets, providing visualization functions, and releasing a stable version.
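Since the beta service is described as based on NCBI BLAST+ 2.6.0, the following Python sketch shows one way to read BLAST+ results in the default tabular layout produced by the `-outfmt 6` option. It assumes results are available in that format; whether the CNGB service exports it is not stated here. The names `BlastHit` and `parse_outfmt6` and the sample line are hypothetical and used only for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class BlastHit(NamedTuple):
    # The 12 default columns of BLAST+ tabular output (-outfmt 6).
    qseqid: str
    sseqid: str
    pident: float
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


def parse_outfmt6(lines: Iterable[str]) -> Iterator[BlastHit]:
    """Yield one BlastHit per non-empty, non-comment tab-separated line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 12:
            raise ValueError(f"expected 12 columns, got {len(fields)}")
        yield BlastHit(
            fields[0], fields[1], float(fields[2]),
            *(int(f) for f in fields[3:10]),  # length .. send
            float(fields[10]), float(fields[11]),
        )


if __name__ == "__main__":
    # Made-up hit line, not real search output.
    sample = ["query1\tsubjectA\t98.50\t200\t3\t0\t1\t200\t101\t300\t1e-100\t360\n"]
    for hit in parse_outfmt6(sample):
        print(hit.sseqid, hit.pident, hit.evalue)
```

Typed fields make it straightforward to filter hits afterwards, for example by keeping only those below an E-value threshold.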
CNGB is building topic-specific databases covering tumor diseases, population polymorphism, biodiversity, microbiology, and other areas. They provide data sharing systems and communities that meet the needs of researchers in different fields, enhance the value of the data, and promote the development of data applications.
1KITE: 1K Insect Transcriptome Evolution
B10K: Bird 10K Genomes
FishT1K: Transcriptomes of 1,000 Fishes
MilletDB: Millet DataBase
MLGD: Marine Life Genome Database
OneKP: 1000 Plants
MT10K: 10K Mitochondrion Genome
ADD: Agriculture BioDiversity Database
10KP: 10,000 plants
Health & Disease
BDDB: Birth Defects Database
DHGV: Database of Human Genetic Variations
DISSECT: Data Integration Solution for Systematic Exploration of Cancer Traits
GDRD: Genetic Disease and Rare Disease database
GeMap: Human omics-scale annotation system
HMD: Human Microbiome Database
ICGC: ICGC Data Portal China Mirror
PIRD: Pan Immune Repertoire Database
PVD: Pathogen Variation Database
SCDB: Single Cell DataBase
BLAST: The public beta version of high-performance sequence alignment service
CNSA: CNGB Nucleotide Sequence Archive
The China National GeneBank (CNGB) officially joins the American Children's Brain Tumor Tissue Consortium (CBTTC) to establish an International Children's Brain Tumor Disease Data Center in China
May 8, 2018
On May 8, 2017, the CNGB and the Neurosurgery Center of Beijing Tian Tan Hospital joined the CBTTC, which also officially announced that the CNGB had become a new satellite member. The CBTTC now has 15 members from Europe, Asia, and America. The member countries will work together to advance research on children’s brain tumors and open a new chapter in children’s health.
The CNGB will establish the China Children's Brain Tumor Disease Data Center with the CBTTC to support the effective accumulation of data on children's brain tumors in China. This will help researchers better manage, share, and analyze children's cancer data. The CNGB calls on more Chinese hospitals to join the program, share research results, promote the rapid development of the life sciences, work to solve globally shared health challenges, and contribute actively to human well-being. The CNGB will draw on its own large sample collection and big data platform to accelerate scientific research on, and the clinical translation of findings about, children's brain tumors and other childhood tumors, help eliminate disease, and work toward genetic resources that are “owned by all, shared by all, and completed by all”.