Handbook

CNGBdb hosts a vast amount of molecular data and other information, all indexed by CNGBdb Search. These data include literature, projects, samples, experiments, runs, assemblies, variations, genes, genomes, proteins, sequences and taxonomy, among others.

On the CNGBdb homepage, you can enter any meaningful word or number to find relevant information, for example a literature number (CNL_PMID24971553), a gene name (TP53), a species or a disease. CNGBdb supports whole-word search: a search for "homo" returns results that match "homo", but not results that match only "ho" or "hom". More complex query syntax will be added in a future version of CNGBdb.
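The whole-word behaviour described above can be illustrated with a minimal Python sketch. It is not the CNGBdb implementation: the function name `matches_whole_word`, the tokenization rule and the case-insensitive comparison are assumptions made only for illustration.

```python
import re

def matches_whole_word(query: str, text: str) -> bool:
    """Return True if `query` equals one complete word of `text`.

    Assumption: words are runs of letters, digits or underscores, and the
    comparison ignores case. The real CNGBdb tokenizer may differ.
    """
    words = re.findall(r"\w+", text.lower())
    return query.lower() in words

# Hypothetical indexed values, chosen to mirror the example in the text.
records = ["Homo sapiens", "hom", "ho"]
hits = [r for r in records if matches_whole_word("homo", r)]
print(hits)  # ['Homo sapiens']
```

Only the record containing the complete word "homo" is returned; the fragments "hom" and "ho" are not.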

Query examples

Following the aforementioned query syntax, users can search according to data content and characteristics.

A few examples of queries that can be performed using CNGBdb Search are listed below.

Search for literature CNL_PMID24971553.

Search for gene TP53.

Search for protein Ovarian cancer-related protein 1.

Search for project CNP0000028.

Search for sample CNS00000027.

Search results

CNGBdb organizes its information into 12 data structures: project, sample, experiment, run, assembly, literature, variation, taxonomy, protein, sequence, gene and genome. A keyword search on the homepage returns results from all 12 data structures by default. The search results page shows the 3 most relevant results for each sub-database; to view more, click “More results” below that sub-database. To search a single sub-database, select it from the drop-down list on the left side of the homepage search bar, and only results from that sub-database will be returned. As you scroll down the page, more results are loaded, up to a maximum of 100; results beyond the first 100 are not displayed. If you still cannot find what you want within those 100 results, modify the search terms and search again.
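The two limits described above (3 results per sub-database on the overview page, and at most 100 results when scrolling) can be sketched as follows. This is an illustrative sketch only: the function names, the `(database, accession, score)` tuple layout and the relevance scores are hypothetical and do not describe how CNGBdb ranks results internally.

```python
from collections import defaultdict

TOP_PER_DATABASE = 3   # results shown per sub-database on the overview page
MAX_LOADED = 100       # results that can be loaded by scrolling

def overview(hits):
    """Group (database, accession, score) hits; keep the 3 best per database."""
    grouped = defaultdict(list)
    for database, accession, score in hits:
        grouped[database].append((score, accession))
    return {
        db: [acc for _, acc in
             sorted(items, key=lambda item: item[0], reverse=True)[:TOP_PER_DATABASE]]
        for db, items in grouped.items()
    }

def scroll_results(ranked_accessions):
    """Return the part of a ranked result list that scrolling can reach."""
    return ranked_accessions[:MAX_LOADED]

# Hypothetical hits with made-up relevance scores.
hits = [
    ("project", "CNP0000028", 0.9),
    ("project", "CNP0000049", 0.7),
    ("project", "CNP0000001", 0.5),
    ("project", "CNP0000002", 0.3),
    ("gene", "TP53", 0.8),
]
print(overview(hits))
```

With these sample hits, the fourth project falls outside the overview and would only appear after clicking “More results”.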

Users can also search again from the search results page: its search bar works the same way as the search bar on the homepage.

Filter

The navigation filter on the left gives users a compact view of the results and easy navigation across the different databases. It groups the search results by database and lets users narrow the scope of the results.

Detailed data and Related data

If you click the accession number of a record, you go to its details page, which shows more detailed information. For example, clicking the literature number (CNL_PMID24971553) on the search results page opens the literature details page (/search/literature/CNL_PMID24971553/).
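The path of a details page can be assembled from the data structure name and the accession number. The sketch below is a hedged illustration: only the literature path is confirmed by the example above, and the assumption that other data structures follow the same `/search/<structure>/<accession>/` pattern is ours. The function name `detail_path` is hypothetical.

```python
from urllib.parse import quote

def detail_path(structure: str, accession: str) -> str:
    """Build a details-page path of the form /search/<structure>/<accession>/.

    Assumption: every data structure uses the pattern shown for literature.
    """
    return f"/search/{quote(structure, safe='')}/{quote(accession, safe='')}/"

print(detail_path("literature", "CNL_PMID24971553"))
# /search/literature/CNL_PMID24971553/
```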

If you click Related data for a particular entry, you can explore its cross-references to other database resources of CNGBdb. For example, in the genome database search results, clicking the organism of a record takes you to the organism information page of the Taxonomy Database.

Synonym conversion for CNGBdb search

CNGBdb Search is configured with synonyms for organisms (the synonym table comes mainly from the taxonomy database) and for medical subject terms (the synonym table comes mainly from MeSH). When you search for a keyword, records matching its synonyms are retrieved as well. Take Oryza sativa in the taxonomy database as an example: its scientific name is Oryza sativa L, its GenBank common name is rice, and its inherited BLAST name is monocots. When you search for Oryza sativa, records matching any of its synonyms (Oryza sativa L, rice, monocots) are also retrieved.
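Synonym conversion can be pictured as expanding one search term into a list of terms before matching. The following is a minimal sketch using the Oryza sativa example from the text; the table layout, the function name `expand_query` and the case-insensitive lookup are assumptions, not the actual CNGBdb configuration.

```python
# Hypothetical synonym table: keyword (lower-cased) -> synonyms from the text.
SYNONYMS = {
    "oryza sativa": ["Oryza sativa L", "rice", "monocots"],
}

def expand_query(term: str) -> list:
    """Return the search term followed by any configured synonyms."""
    key = term.strip().lower()
    return [term] + SYNONYMS.get(key, [])

print(expand_query("Oryza sativa"))
# ['Oryza sativa', 'Oryza sativa L', 'rice', 'monocots']
print(expand_query("TP53"))
# ['TP53']
```

A term without an entry in the table is searched as entered.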

Search fields

The 12 data structures of CNGBdb support different search fields. The search fields are as follows.

Structure | Search fields
Variation | Variant ID, HGVS/Genome variation, Location, Organism, Gene(s), Condition(s), Phenotype(s), Literature(s), Project, Identifier(s)
Project | Project ID, Accession in other database, Related accession, Title, Description, Data type
Sample | Sample ID, Accession in other database, Organism, Related accession, Sample name, Sample title, Sample type
Experiment | Experiment ID, Accession in other database, Related accession, Platform, Strategy, Selection
Run | Run ID, Accession in other database, Related accession
Assembly | Assembly ID, Accession in other database, Related accession, Assembly name, Molecule type, Sequencing technology, Assembly method
Gene | Gene ID, Identifier(s), Organism, Symbol, Title, Also known as
Genome | Genome ID, Title, Submitter, Project ID, Other project ID, Organism, Lineage, Description, Literature ID, Identifier(s)
Literature | Literature ID, Title, Author, Journal, Publication type, Identifier(s), Abstract, Keywords, Available
Organism | Organism ID, Organism, Identifier, Rank, Reference, Related name
Protein | Protein ID, Protein name(s), Identifier(s), Entry name, Organism, Gene(s), Status, Keywords
Sequence | Sequence ID, Source database, Source ID, Title, Organism, Taxonomic division, Molecule type, Gene(s), Reference, Related accession

Numbering rules

To standardize the data and make retrieval easier, CNGBdb has developed version 1.0 of its numbering rules. The rules differ by data source. CNGBdb currently draws on three kinds of data source: external open sources such as NCBI and EBI; CNGB data sources for various research directions, such as the Pan Immune Repertoire Database (PIRD), the Genetic Disease and Rare Disease database (GDRD), the Human Microbiome Database (HMD), the Yanhuang database (YH) and the Millet database; and CNGBdb archive databases such as the CNGB Nucleotide Sequence Archive (CNSA). You can use the CNGBdb accession number to retrieve data of any type. The detailed numbering rules are as follows:

Data structure | Numbering rule for archive databases | Numbering rule for data sources of NCBI, EBI, etc. | Numbering rule for data sources of CNGB
Literature | CNL+numbers | CNL+PMID+numbers (CNL_PMID988776) | None
Gene | CNGN+numbers | CNGN+GENE+numbers (CNGN_GENE9887776) | CNGN+database+numbers (CNGN_GDRD78876)
Genome | CNG+numbers | CNG+GENOME+numbers (CNG_GENOME9887776) | CNG+database abbreviation+numbers (CNG_GDRD78876)
Protein | CNNP+numbers | CNNP+USP_+numbers (CNNPUSP_B5BAY2) | CNNP+database number+numbers (CNNP_PIRD8776776)
Sequence | SEN_+string (nucleotide); SEN_+string (amino acid) | SEN+[REF/GB]_+numbers (GenBank: SENGB_AJ007012.1; RefSeq: SENREF_NM_001164717.1); SEP+[GB/REF/USP/PDB]_+numbers (SEPGB_KCP8877766.1 / SEPUSP_B5BAY2 / SEPPDB_1VZM_C) | SEN+database number+numbers (SENPIRD_8776776); SEP+database number+numbers (SEPPIRD_8776776)
Organism | CNO+numbers | CNO+TAXON+numbers (CNO_TAXON8877666) | CNO+database number+numbers (CNO_PIRD8877666)
Variation | var01+numbers (less than or equal to 50 bp); var02+numbers (more than 50 bp) | var+rs+numbers (var_rs887776); var+esv+numbers (var_esv887776); var+nsv+numbers (var_nsv887776); var+clin+numbers (var_clin887776) | var+database number+numbers (var_GDRD887776)
Project | CNP+7 digits (CNP0000049) | CNP+his+7 digits (CNPhis0000049) | None
Sample | CNS+7 digits (CNS0000110) | CNS+his+7 digits (CNShis0000110) | None
Experiment | CNX+7 digits (CNX0000080) | CNX+his+7 digits (CNXhis0000080) | None
Run | CNR+7 digits (CNR0000036) | CNR+his+7 digits (CNRhis0000036) | None
Assembly | CNA+7 digits (CNA0000007) | CNA+his+7 digits (CNAhis0000007) | None
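Because the numbering rules are regular, an accession number can be classified with a few regular expressions. The sketch below covers only those rules for which the table gives an unambiguous example; the function name `classify`, the labels "archive" and "external", and the exact patterns are our reading of the table, not an official CNGBdb validator.

```python
import re

# Archive prefixes from the table: prefix + 7 digits (archive databases),
# or prefix + "his" + 7 digits (data sources of NCBI, EBI, etc.).
ARCHIVE_PREFIXES = {
    "CNP": "project",
    "CNS": "sample",
    "CNX": "experiment",
    "CNR": "run",
    "CNA": "assembly",
}

# (data structure, source, pattern), as interpreted from the table.
RULES = [
    ("literature", "external", re.compile(r"CNL_PMID\d+")),
    ("gene", "external", re.compile(r"CNGN_GENE\d+")),
    ("genome", "external", re.compile(r"CNG_GENOME\d+")),
    ("organism", "external", re.compile(r"CNO_TAXON\d+")),
    ("variation", "external", re.compile(r"var_(?:rs|esv|nsv|clin)\d+")),
]
for prefix, structure in ARCHIVE_PREFIXES.items():
    RULES.append((structure, "archive", re.compile(prefix + r"\d{7}")))
    RULES.append((structure, "external", re.compile(prefix + r"his\d{7}")))

def classify(accession: str):
    """Return (data structure, source) for a recognized accession, else None."""
    for structure, source, pattern in RULES:
        if pattern.fullmatch(accession):
            return structure, source
    return None

print(classify("CNL_PMID988776"))  # ('literature', 'external')
print(classify("CNP0000049"))      # ('project', 'archive')
print(classify("CNRhis0000036"))   # ('run', 'external')
print(classify("TP53"))            # None
```

Accessions from CNGB data sources (for example CNGN_GDRD78876) are left out on purpose, since several database abbreviations contain digits and cannot be separated from the numeric part without the abbreviation list.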

Abbreviations of databases:

Database | Abbreviation
The 10,000 plants | 10KP
1K Insect Transcriptome Evolution | 1KITE
Agriculture BioDiversity Database | ADD
The Bird 10,000 Genomes | B10K
Database of Human Genetic Variations | DHGV
Data Integration Solution for Systematic Exploration of Cancer Traits | DISSECT
Transcriptomes of 1,000 Fishes | FishT1K
Genetic Disease and Rare Disease database | GDRD
Human omics-scale annotation system | GeMap
Human infertility and fetal tissue database | HIFTD
Human Microbiome Database | microbiome
ICGC Data Portal China Mirror | ICGC
Millet DataBase | MilletDB
Marine Life Genome Database | MLGD
10K Mitochondrion Genome | MT10K
The 1,000 Plants project | OneKP
Pan Immune Repertoire Database | PIRD
Pathogen Variation Database | PVD
Single Cell DataBase | SCDB
chicken | chicken
pig | pig
rice;rice2 | rice
yanhuang | yh
silkworm | silkworm
panada | panada
macaque | monkey
sheep | sheep
pepper sequence | pepper
cucumber | cucumber
foxtail millet | millet
naked mole rat | rat

Data source of CNGBdb

Data source of CNGB

External data sources

Reference of data source

1. Millet: Jia G, Huang X, Zhi H, et al. A haplotype map of genomic variations and genome-wide association studies of agronomic traits in foxtail millet (Setaria italica). Nature genetics. 2013;45(8):957-61.

2. 1KP: Matasci N, Hung LH, Yan Z, et al. Data access for the 1,000 Plants (1KP) project. GigaScience. 2014;3:17.

3. 1KITE: Misof B, Liu S, Meusemann K, et al. Phylogenomics resolves the timing and pattern of insect evolution. Science. 2014;346(6210):763-7.

4. HPO: Kohler S, Vasilevsky NA, Engelstad M, et al. The Human Phenotype Ontology in 2017. Nucleic acids research. 2017;45(D1):D865-D76.

5. NCBI: Coordinators NR. Database resources of the National Center for Biotechnology Information. Nucleic acids research. 2018;46(D1):D8-D13.

6. dbSNP: Smigielski EM, Sirotkin K, Ward M, et al. dbSNP: a database of single nucleotide polymorphisms. Nucleic acids research. 2000;28(1):352-5.

7. SRA: Kodama Y, Shumway M, Leinonen R, et al. The Sequence Read Archive: explosive growth of sequencing data. Nucleic acids research. 2012;40(Database issue):D54-6.

8. Assembly: Kitts PA, Church DM, Thibaud-Nissen F, et al. Assembly: a resource for assembled genomes at NCBI. Nucleic acids research. 2016;44(D1):D73-80.

9. RefSeq: Pruitt KD, Tatusova T, Brown GR, et al. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic acids research. 2012;40(Database issue):D130-5.

10. Gene: Brown GR, Hem V, Katz KS, et al. Gene: a gene-centered information resource at NCBI. Nucleic acids research. 2015;43(Database issue):D36-42.

11. Taxonomy: Federhen S. The NCBI Taxonomy database. Nucleic acids research. 2012;40(Database issue):D136-43.

12. GEO: Barrett T, Wilhite SE, Ledoux P, et al. NCBI GEO: archive for functional genomics data sets--update. Nucleic acids research. 2013;41(Database issue):D991-5.

13. dbVar: Lappalainen I, Lopez J, Skipper L, et al. dbVar and DGVa: public archives for genomic structural variation. Nucleic acids research. 2013;41(Database issue):D936-41.

14. ClinVar: Landrum MJ, Lee JM, Benson M, et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic acids research. 2016;44(D1):D862-8.

15. OMIM: Amberger JS, Bocchini CA, Schiettecatte F, et al. OMIM.org: Online Mendelian Inheritance in Man (OMIM(R)), an online catalog of human genes and genetic disorders. Nucleic acids research. 2015;43(Database issue):D789-98.

16. dbGaP: Mailman MD, Feolo M, Jin Y, et al. The NCBI dbGaP database of genotypes and phenotypes. Nature genetics. 2007;39(10):1181-6.

17. EBI: Park YM, Squizzato S, Buso N, et al. The EBI search engine: EBI search as a service-making biological data accessible for all. Nucleic acids research. 2017;45(W1):W545-W9.

18. BIGD: Members BIGDC. Database Resources of the BIG Data Center in 2018. Nucleic acids research. 2018;46(D1):D14-D20.

Data archive

The data archive services of CNGBdb include the CNGB Nucleotide Sequence Archive (CNSA), the Pan Immune Repertoire Database (PIRD) and GigaDB. They support the submission, storage and sharing of data from biological sequencing research, including projects, samples, experiments, assemblies and variations. They are designed to give researchers around the world comprehensive, up-to-date data and information resources, so that researchers can work with authoritative data.

CNSA: CNGB Nucleotide Sequence Archive

PIRD: Pan Immune Repertoire Database

GigaDB

SciRAID

The CNGBdb Scientific Research Application Databases (SciRAID) will build data applications for different fields on top of the underlying data structures and data of CNGBdb. They aim to provide scientific data services for research areas such as biodiversity, microbes, cancer, immunity, reproductive health and pathogens, to meet the needs of researchers in those fields, to enhance the value of the data, and to promote data development and application.

PIRD: Pan Immune Repertoire Database

GDRD: Genetic Disease and Rare Disease database

DISSECT: Data Integration Solution for Systematic Exploration of Cancer Traits

HMD: Human Microbiome Database

PVD: Pathogen Variation Database

Data analysis

Based on the underlying data, CNGBdb builds a distributed high-performance computing platform and deploys application services such as BLAST, Cancer Data Analysis and Pathogen Identification.

BLAST: BLAST service of CNGB

DISSECT: Data Integration Solution for Systematic Exploration of Cancer Traits

PVD: Pathogen Variation Database

Data visualization

The visualization service presents the biological data of CNGBdb using multiple visualization techniques, covering genomes, transcriptomes, proteomes and so on.

Data management

CNGB Data Access (CDA) provides users with the services of approval, authorization, and distribution of controlled data. Whether data is authorized for access is determined by the data owner/organization.

E-BioBank

China National GeneBank E-BioBank is committed to building a global biobank inventory, creating an environment for sharing bio-resource information, and stimulating the use of bio-resources in scientific research.

EBB: E-BioBank

Data standard

CNGBdb integrates international data structures and standards for omics, health and medicine, such as those of the International Nucleotide Sequence Database Collaboration (INSDC), the Global Alliance for Genomics and Health (GA4GH), the Global Genome Biodiversity Network (GGBN) and the American College of Medical Genetics and Genomics (ACMG), and builds widely compatible, standardized data structures on that basis.