CNGB Nucleotide Sequence Archive

CNGB Nucleotide Sequence Archive (CNSA) is an online submission system for biological research projects, samples, experiments and related data. It follows the International Nucleotide Sequence Database Collaboration (INSDC) and DataCite standards and accepts sequencing data (raw data and other supporting data) from researchers worldwide. Its submission service can complement the literature publishing process by supporting early data sharing. CNSA is committed to storing and sharing biological sequencing data, and aims to give researchers worldwide comprehensive data resources that are easy to access and use.

Metadata objects

Currently, CNSA contains six types of metadata object: Submitter, Project, Sample, Experiment, Run and Assembly.

Submitter

A Submitter contains submission actions to be performed by the archive. A submission can add more objects to the archive, update already submitted objects or make objects publicly available.

Project

An overall description of a single research initiative; a project will typically relate to multiple samples and datasets.

Sample

Description of biological source material; each physically unique specimen should be registered as a single Sample with a unique set of attributes.

Experiment

A description of sample-specific sequencing library, instrument and sequencing methods. An Experiment references 1 Project and 1 Sample.

Run

Runs describe the files that belong to previously created Experiments. They specify the data files for a specific sample. Note that paired-end data files MUST be listed in a single run so that the two files are processed correctly as paired-end.

Assembly

An assembly contains genome assembly results computed from the primary sequencing reads.

Data model

[Figure: CNSA data model (data-modal.png)]

The CNSA data model has five main components: project, sample, experiment, run and assembly. Projects and samples can be submitted independently; experiments and assemblies link them together. Assembly data is associated, through its project and sample, with experiment and run data that have already been submitted.
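The relationships described above can be sketched as plain Python dataclasses. This is only an illustration: the class fields, accession strings and file names are hypothetical and do not reflect CNSA's actual schema or identifier format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Project:
    accession: str  # hypothetical identifier
    title: str


@dataclass
class Sample:
    accession: str  # hypothetical identifier
    name: str


@dataclass
class Run:
    # Data files for one sample; paired-end files stay together in one run.
    files: List[str]


@dataclass
class Experiment:
    # An Experiment references exactly one Project and one Sample.
    project: Project
    sample: Sample
    title: str
    runs: List[Run] = field(default_factory=list)


@dataclass
class Assembly:
    # An Assembly is also tied to one Project and one Sample.
    project: Project
    sample: Sample
    name: str


project = Project(accession="PRJ_EXAMPLE", title="Example project")
sample = Sample(accession="SAM_EXAMPLE", name="liver tissue")
experiment = Experiment(project=project, sample=sample, title="WGS library")
experiment.runs.append(Run(files=["reads_1.fq.gz", "reads_2.fq.gz"]))
assembly = Assembly(project=project, sample=sample, name="example_v1")
```

The project and sample are created first and on their own; the experiment and the assembly are what tie them together.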

Metadata

Fields marked with * are required.

Fields without * are optional or conditionally required.

Submitter

*First name

First (given) name of the submitter.

Middle name

Middle name of the submitter.

*Last name

Last (family) name of the submitter.

*Primary E-mail

Primary Email address of the submitter.

Secondary E-mail

Secondary/Alternate Email address of the submitter.

*Submitting organization

Full name of submitter’s organization.

Submitting organization URL

The URL of submitter’s organization.

*Department

The department of the submitter.

Phone

The phone number of the submitter.

Fax

The Fax number of the submitter.

*Street

The street name of the submitter.

*City

The city name of the submitter.

State/Province

The state/province of the submitter.

*Country

The Country/Region of the submitter.

*Postal code

The Postal code of the submitter.

Project

A project defines a set of related data. The definition is deliberately flexible, so that a project can be defined using different parameters. For example, Project records can be established for:

  • Genome sequencing and assembly
  • Metagenomes
  • Transcriptome sequencing and expression
  • Targeted locus sequencing
  • Genetic or RH Maps
  • Epigenetics
  • Phenotype or Genotype
  • Variation detection

Project represents a submission, initiative, or group of data that is logically related in some manner, or is of interest to retrieve as a distinct dataset. A project may be identified in terms of distinctions in the type of data produced.

General info

Basic information

Project name

A short name for the study.

*Project title

Short descriptive name of the project such as a phrase or short sentence for public display.

*Public description

A description (a paragraph) of the study goals and relevance. Provide enough information (more than 100 characters) in the description for other users to interpret the data.

*Relevance

The primary general relevance of the project.

Relevance | Description
Agricultural
Medical
Industrial | Could include bio-remediation, bio-fuels and other areas of research where there are areas of mass production.
Environmental
Evolution
Model organism
Other | Unspecified major impact categories to be defined in the "Relevance description".

*Relevance description

Describe the relevance when Other is selected.

*Functional annotation

You are asked if the project will contain functional annotation. If yes, then a unique locus tag prefix will be created.

*Locus tag prefix

Locus tag prefixes are only associated with projects that provide functional genome annotation.
A locus tag prefix must have the following format:
  • starts with a letter
  • is at least 3 characters long
  • is upper case
  • contains only alphanumeric characters and no symbols such as -_*
Separate multiple prefixes with a comma. Locus tags can be added, but existing locus tags cannot be removed or edited.
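The format rules above can be checked before submission with a short script. The sketch below is not CNSA's own validator; the function name is hypothetical, and it assumes that spaces around the separating commas are ignored.

```python
import re

# Starts with an upper-case letter, then at least two more upper-case
# letters or digits (3+ characters in total, no symbols).
LOCUS_TAG_PREFIX = re.compile(r"[A-Z][A-Z0-9]{2,}")


def invalid_locus_tag_prefixes(field_value):
    """Return the comma-separated prefixes that break the format rules."""
    prefixes = [prefix.strip() for prefix in field_value.split(",")]
    return [p for p in prefixes if not LOCUS_TAG_PREFIX.fullmatch(p)]
```

For example, `invalid_locus_tag_prefixes("ABC, X1Y2")` returns an empty list, while `"AB"` (too short), `"abc"` (lower case), `"1AB"` (starts with a digit) and `"A_B"` (contains a symbol) are all reported.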

External links

The web sites that are related to this project.

Link description

Display name of web site that is related to this project.

URL

URL of web site that is related to this project.

Related projects

The projects that are related to this project.

Accession

Related Project accession ID

Description

Description of related Project

Grants

The funding sources of this project.

Grant number

Grant number is collected to support searches (e.g., publications often cite Grant numbers).

Grant title

Grant title may also support searches.

Agency

The agency that funded the project.

Agency abbr.

Agency abbreviation.

Consortium

If project is carried out as part of a consortium, please provide the related consortium information.

Consortium name

If project is carried out as part of a consortium, provide the consortium name.

Consortium URL

If the consortium maintains a web site, provide the URL.

Data providers

Indicate the data provider (data submitter) if it is someone other than the submitting organization or consortium.

Data provider

Name of the data provider.

Data provider URL

If the data provider maintains a web site, provide the URL.

Project release

Release date

When this submission should be released to the public.

Specific Info

*Project data type

A general label indicating the primary study goal. Select appropriate types.

Project data type | Description
Genome sequencing and assembly | Whole or partial genome sequencing project (with or without a genome assembly)
Raw sequence reads | Submission of raw sequencing information as it comes out of the machine.
Genome sequencing | Genome sequencing
Assembly | Assembly
Clone ends | Clone-end sequencing project
Epigenomics | DNA methylation, histone modification, chromatin accessibility datasets
Exome | Exome resequencing project
Map | Project that results in non-sequence map data such as genetic map, radiation hybrid map, cytogenetic map, optical map, etc.
Metagenome | Sequence analysis of environmental samples
Metagenomic assembly | Metagenomic assembly
Phenotype or Genotype | Project correlating phenotype and genotype
Proteome | Large scale proteomics experiment including mass spec. analysis
Random survey | Sequence generated from a random sampling of the collected sample; not intended to be comprehensive sampling of the material.
Targeted loci cultured | Targeted loci cultured
Targeted loci environmental | Targeted loci environmental
Targeted Locus (Loci) | Project to sequence specific loci, such as 16S rRNA sequencing
Transcriptome or Gene expression | Large scale RNA sequencing or expression analysis. Includes cDNA, EST, RNA_seq, and microarray.
Variation | Project with a primary goal of identifying large or small sequence variation across populations.
Other | A free text description is provided to indicate Other data type

* Project data type description

Describe the project data type when Other is selected.

*Sample scope

The scope and purity of the biological sample used for the study.

Sample scope | Description
Monoisolate | A single animal, cultured cell-line, inbred population (or possibly a heterogeneous population when a single genome assembly is generated from the pooled sample; not preferred).
Multiisolate | Multiple individuals, a population (representative of a species). To be used for variation or other sequence comparison projects, not when multiple genomes will be annotated. Make separate monoisolate projects when more than one genome will be annotated.
Multispecies | The sample represents multiple species.
Environment | The species content of the sample is not known.
Synthetic | The sample is synthetically created by a machine.
Other | Specify the sample scope that was used.

* Target description

Describe the target when Other is selected.

Publications

PubMed ID

The PubMedID will be used to populate the publication information.

DOI

Provide a DOI if a PubMed ID is not available. Provide the additional reference information.
If you choose DOI, you need to fill in the following information.

*Reference title

A title of reference.

*Journal title

A title of journal.

*Year

Publication year.

*Volume

Journal volume.

*Issue

Journal issue.

*Page from

Reference start page.

*Page to

Reference end page.

*Author

Name of author.

*Consortium

Consortium of author

Sample

Description of biological source material; each physically unique specimen should be registered as a single Sample with a unique set of attributes.

General info

Submission type

*Batch/Multiple Samples

Users will be asked to upload a text file that describes each of your samples and their attributes.

*Single Sample

Users will be asked to manually complete a web form to describe one sample and its attributes.

Sample release

*Release date

When this submission should be released to the public.

Sample type

In preparing your submission, please refer to the attributes list and Sample examples and fill in the relevant fields. Select the package that best describes your samples.

Sample type | Description
Clinical or host-associated pathogen
Environmental, food or other pathogen
Combined pathogen | Batch submissions that include both clinical and environmental pathogens.
Microbial sample | Use for bacteria or other unicellular microbes when it is not appropriate or advantageous to use MIxS, Pathogen or Virus packages.
Model organism or animal sample | Use for multicellular samples or cell lines derived from common laboratory model organisms, e.g., mouse, rat, Drosophila, worm, fish, frog, or large mammals including zoo and farm animals.
Metagenome or environmental sample | Use for metagenomic and environmental samples when it is not appropriate or advantageous to use MIxS packages.
Invertebrate sample | Use for any invertebrate sample.
Human sample | Only use for human samples or cell lines that have no privacy concerns. For samples isolated from humans use the Pathogen, Microbe or appropriate MIxS package.
Plant sample | Use for any plant sample or cell line.
Virus sample | Use for all virus samples not directly associated with disease.
GSC MIxS air
GSC MIxS built environment
GSC MIxS host associated
GSC MIxS human associated
GSC MIxS human gut
GSC MIxS human oral
GSC MIxS human skin
GSC MIxS human vaginal
GSC MIxS microbial mat biofilm
GSC MIxS miscellaneous natural or artificial environment
GSC MIxS plant associated
GSC MIxS sediment
GSC MIxS soil
GSC MIxS wastewater sludge
GSC MIxS water
Beta-lactamase | Use for beta-lactamase gene transformants that have antibiotic resistance data.

Sample attributes

A major component of a Sample record is the sample attributes section. Attributes define the material under investigation and can include sample characteristics such as cell type, collection site and phenotypic information like disease state.
Sample attributes are captured as structured name: value pairs, for example, tissue:liver. The first targeted dictionaries implemented in the Sample submission are the MIxS minimum information checklists for standardizing descriptions of genomes, metagenomes and targeted locus sequences, developed by the Genomic Standards Consortium.
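As an illustration of the name: value structure, the sketch below turns a few attribute lines into a dictionary. It is not the CNSA batch file format; the function name and the "cell type" value are hypothetical, and only the first colon on a line is treated as the separator.

```python
def parse_attributes(lines):
    """Parse 'name: value' lines into a dict, rejecting malformed lines."""
    attributes = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, separator, value = line.partition(":")
        if not separator or not name.strip() or not value.strip():
            raise ValueError(f"not a name: value pair: {line!r}")
        attributes[name.strip()] = value.strip()
    return attributes


attributes = parse_attributes(["tissue:liver", "cell type: hepatocyte"])
```

Here `attributes` holds `{"tissue": "liver", "cell type": "hepatocyte"}`.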

Experiment

A description of sample-specific sequencing library, instrument and sequencing methods. An Experiment references 1 Project and 1 Sample.

General info

Submission type

Batch/Multiple Submission

Users will be asked to upload a text file that describes each of your experiments and runs.

Single Experiment/Run(s)

Users will be asked to manually complete a web form to describe your sequencing experiment and upload your raw sequencing reads.

*Project accession

Select the project this experiment belongs to.

*Sample accession

Select the sample this experiment uses.

*Release date

When should this experiment/run metadata be released to the public.

Metadata

Files

*File type

The sequence data file format.

File type | Description
fastq | fastq files
sff | 454 Standard Flowgram Format file
hdf5 | PacBio hdf5 Format file
bam | Binary SAM format for use by loaders that combine alignment and sequencing data

Experiment initialization

Copy from an existing experiment

Reuse information of experiment that has already been submitted. The existing experiment information will be automatically populated into cells so that users can quickly submit.

General information

* Platform

Select a sequencing platform

* Title

Short text that can be used to call out experiment records in searches or in displays.

Library

* Library name

Provide a name for your library if you have any.

* Strategy

The library strategy specifies the sequencing technique intended for this library.

Strategy | Description
WGA | Random sequencing of the whole genome following non-PCR amplification
WGS | Random sequencing of the whole genome
WXS | Random sequencing of exonic regions selected from the genome
RNA-Seq | Random sequencing of whole transcriptome
miRNA-Seq | Random sequencing of small miRNAs
WCS | Random sequencing of a whole chromosome or other replicon isolated from a genome
CLONE | Genomic clone based (hierarchical) sequencing
POOLCLONE | Shotgun of pooled clones (usually BACs and Fosmids)
AMPLICON | Sequencing of overlapping or distinct PCR or RT-PCR products
CLONEEND | Clone end (5', 3', or both) sequencing
FINISHING | Sequencing intended to finish (close) gaps in existing coverage
ChIP-Seq | Direct sequencing of chromatin immunoprecipitates
MNase-Seq | Direct sequencing following MNase digestion
DNase-Hypersensitivity | Sequencing of hypersensitive sites, or segments of open chromatin that are more readily cleaved by DNaseI
Bisulfite-Seq | Sequencing following treatment of DNA with bisulfite to convert cytosine residues to uracil depending on methylation status
Tn-Seq | Sequencing from transposon insertion sites
EST | Single pass sequencing of cDNA templates
FL-cDNA | Full-length sequencing of cDNA templates
CTS | Concatenated Tag Sequencing
MRE-Seq | Methylation-Sensitive Restriction Enzyme Sequencing strategy
MeDIP-Seq | Methylated DNA Immunoprecipitation Sequencing strategy
MBD-Seq | Direct sequencing of methylated fractions sequencing strategy
Synthetic-Long-Read | Binning and barcoding of large DNA fragments to facilitate assembly of the fragment
ATAC-seq | Assay for Transposase-Accessible Chromatin (ATAC) strategy is used to study genome-wide chromatin accessibility. An alternative method to DNase-seq that uses an engineered Tn5 transposase to cleave DNA and to integrate primer DNA sequences into the cleaved genomic DNA
ChIA-PET | Direct sequencing of proximity-ligated chromatin immunoprecipitates.
FAIRE-seq | Formaldehyde Assisted Isolation of Regulatory Elements. Reveals regions of open chromatin
Hi-C | Chromosome Conformation Capture technique where a biotin-labeled nucleotide is incorporated at the ligation junction, enabling selective purification of chimeric DNA ligation junctions followed by deep sequencing
ncRNA-Seq | Capture of other non-coding RNA types, including post-translation modification types such as snRNA (small nuclear RNA) or snoRNA (small nucleolar RNA), or expression regulation types such as siRNA (small interfering RNA) or piRNA/piwi/RNA (piwi-interacting RNA).
RAD-Seq | Restriction Site Associated DNA Sequence
RIP-Seq | Direct sequencing of RNA immunoprecipitates (includes CLIP-Seq, HITS-CLIP and PAR-CLIP).
SELEX | Systematic Evolution of Ligands by EXponential enrichment
ssRNA-seq | Strand-specific RNA sequencing
Targeted-Capture
Tethered Chromatin Conformation Capture
OTHER | Library strategy not listed (please include additional info in the "design description")

* Source

The library source specifies the type of source material that is being sequenced.

Source | Description
GENOMIC | Genomic DNA (includes PCR products from genomic DNA)
TRANSCRIPTOMIC | Transcription products or non genomic DNA (EST, cDNA, RT-PCR, screened libraries)
METAGENOMIC | Mixed material from metagenome
METATRANSCRIPTOMIC | Transcription products from community targets
SYNTHETIC | Synthetic DNA
VIRAL RNA | Viral RNA
OTHER | Other, unspecified, or unknown library source material

* Selection

The library selection specifies whether any method was used to select for or against, enrich, or screen the material being sequenced.

Selection | Description
RANDOM | Random selection by shearing or other method
PCR | Source material was selected by designed primers
RANDOM PCR | Source material was selected by randomly generated primers
RT-PCR | Source material was selected by reverse transcription PCR
HMPR | Hypo-methylated partial restriction digest
MF | Methyl Filtrated
MDA | Multiple displacement amplification
MSLL | Methylation Spanning Linking Library
cDNA | Complementary DNA
ChIP | Chromatin immunoprecipitation
MNase | Micrococcal Nuclease (MNase) digestion
DNase | Deoxyribonuclease (DNase) digestion
Hybrid Selection | Selection by hybridization in array or solution
Reduced Representation | Reproducible genomic subsets, often generated by restriction fragment size selection, containing a manageable number of loci to facilitate re-sampling
Restriction Digest | DNA fractionation using restriction enzymes
5-methylcytidine antibody | Selection of methylated DNA fragments using an antibody raised against 5-methylcytosine or 5-methylcytidine (m5C)
MBD2 protein methyl-CpG binding domain | Enrichment by methyl-CpG binding domain
CAGE | Cap-analysis gene expression
RACE | Rapid Amplification of cDNA Ends
size fractionation | Physical selection of size appropriate targets
Padlock probes capture method | Circularized oligonucleotide probes
Oligo-dT | Enrichment of messenger RNA (mRNA) by hybridization to Oligo-dT.
repeat fractionation | Selection for less repetitive (and more gene rich) sequence through Cot filtration (CF) or other fractionation techniques based on DNA kinetics.
other | Other library enrichment, screening, or selection process (please include additional info in the "design description")
unspecified | Library enrichment, screening, or selection is not specified (please include additional info in the "design description")

* Layout

The library layout specifies whether to expect single, paired, or other configuration of reads. In the case of paired reads, information about the relative distance and orientation is specified.

Layout | Description
single | Single read
paired | Paired reads

* Nominal size(bp)

The average insert size for paired reads.

Nominal standard deviation(bp)

The standard deviation of the fragment lengths about the mean (insert size).

Spot layout

Some examples for the spot layout.

A[TTACG]F*: Single reads with adapter (A) of sequence [TTACG] followed by biological forward read (F).
A[TTACG]B[ATGC]F*: Single reads with adapter (A) of sequence [TTACG] followed by barcode (B) of sequence [ATGC] and biological forward read (F).
A[TTACG]B[ATGC]P[CGTTT]F*: Single reads with adapter (A) of sequence [TTACG] followed by barcode (B) of sequence [ATGC], primer (P) of sequence [CGTTT] and biological forward read (F).
A[TTACG]P[CGTTT]B[ATGC]F*A[GGTATTC]: Single reads with adapter (A) of sequence [TTACG] followed by primer (P) of sequence [CGTTT], barcode (B) of sequence [ATGC], 100bp biological forward read (F) and adapter (A) of sequence [GGTATTC].
A[TTACG]F*L[linker_sequence]F*: Mate pair reads with adapter (A) of sequence [TTACG] followed by first forward read (F), linker (L) of sequence [linker_sequence] and second forward (F) read.
A[TTACG]F*P[CGTTT]L[linker_sequence]P[AGGCTC]F*A[GGTATTC]: Mate pair reads with adapter (A) of sequence [TTACG] followed by biological forward read (F), primer (P) of sequence [CGTTT], linker (L) of sequence [linker_sequence], primer (P) of sequence [AGGCTC], second biological forward read (F) and adapter (A) of sequence [GGTATTC].
A[TTACG]B[ATGC]F*L[linker_sequence]F*: Mate pair reads with adapter (A) of sequence [TTACG] followed by barcode (B) of sequence [ATGC], first forward read (F), linker (L) of sequence [linker_sequence] and second forward (F) read.
The insert size is the size of the DNA fragments after fragmentation (i.e. it is NOT the fragment size minus forward read size, minus reverse read size).
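The spot layout notation in the examples above can be broken into its segments with a small parser. The sketch below covers only the tokens that appear in those examples (A, B, P and L followed by a bracketed sequence, and F*); the function name is hypothetical and this is not CNSA's own parser.

```python
import re

SEGMENT_NAMES = {"A": "adapter", "B": "barcode", "P": "primer", "L": "linker"}

# Either a letter with a bracketed sequence, e.g. A[TTACG], or a forward read F*.
TOKEN = re.compile(r"([ABPL])\[([^\]]+)\]|F\*")


def parse_spot_layout(layout):
    """Split a spot layout string into (segment type, sequence) tuples."""
    segments = []
    position = 0
    while position < len(layout):
        match = TOKEN.match(layout, position)
        if match is None:
            raise ValueError(
                f"unexpected text at offset {position}: {layout[position:]!r}"
            )
        if match.group(1):
            segments.append((SEGMENT_NAMES[match.group(1)], match.group(2)))
        else:
            # A forward read carries no fixed sequence.
            segments.append(("forward read", None))
        position = match.end()
    return segments
```

For the layout `A[TTACG]B[ATGC]F*`, the function returns an adapter with sequence TTACG, a barcode with sequence ATGC, and a forward read.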

Design description

The goal and setup of the individual library.

Library construction protocol

Describes the protocol by which the sequencing library was constructed.

Pipeline

Index

The index of the programs or algorithms.

Program

The programs or algorithms used.

Version

The version of the programs or algorithms.

Files

Data files

*File name

The name of a sequence data file.

*MD5

MD5 checksum of a sequence data file.

Status

The status of file uploaded to FTP servers.

Status | Description
Uploaded | File uploaded
Uploading | File has been uploaded, but its MD5 has not yet been verified
Not upload | File not uploaded

*Data publish

When should this sequencing file(s) be released to the public.

Assembly

An assembly contains genome assembly results computed from the primary sequencing reads.

General Info

Submission type

Batch/Multiple Submission

You will be asked to upload a text file that describes your metadata and submit your data files in batches.

Single Experiment/Run(s)

You will be asked to manually complete a web form to describe your Assembly and upload your data.

*Project accession

Select the project this assembly belongs to.

*Sample accession

Select the sample this assembly uses.

*Release date

When should this assembly metadata be released to the public.

Metadata

Assembly metadata

Assembly name

The name of the assembly (e.g. GRCh37.p5).

*Molecule type

This field should contain the in vivo molecule type of the sequence to be submitted.

*Coverage

The average coverage of the assembly. Example: 12x

*Sequencing Technology

The platform used for the sequencing.

*Sequencing technology description

Describe the sequencing technology when Other is selected.

Minimum gap length

The minimum length of a run of Ns that is treated as a gap.

*Partial

Field type should be 'true' if genome is Partial and 'false' if genome is Complete.

Assembly methods

* Assembly method

The program used to generate the assembly.

* Version

The version of the program used to generate the assembly.

* Assembly method description

Describe the assembly method when Other is selected.

Files

Data files

*File type

The assembly data file format.

File type | Description
fasta | Sequence data format indicating sequence base calls. The format is simple: a header line initiated with the > character, followed by data lines with base calls.

*File name

The name of an assembly data file.

*MD5

MD5 checksum of an assembly data file

Status

The status of file uploaded to FTP servers.

Status | Description
Uploaded | File uploaded
Uploading | File has been uploaded, but its MD5 has not yet been verified
Not upload | File not uploaded

*Data publish

When should this assembly file(s) be released to the public.

Data format

Sequencing Data format

CNSA accepts sequencing data in four formats: Fastq, BAM, SFF and PacBio HDF5.

Fastq file

We recommend that read data be submitted in Fastq format. Single and paired reads are accepted as Fastq files. Compress each Fastq file individually using gzip or bzip2; do not combine multiple files into a single archive for upload. All file names in your account folder must be unique.

  • Quality scores must be in Phred scale.
  • Both ASCII and space-delimited decimal encoding of quality scores are supported. We will automatically detect the Phred quality offset of either 33 or 64.
  • No technical reads (adapters, linkers, barcodes) are allowed.
  • Single reads must be submitted using a single Fastq file and can be submitted with or without read names.
  • Paired reads must be split and submitted using two Fastq files. The read names must have a suffix identifying the first and second read from the pair, for example '/1' and '/2' (regular expression for the reads: "^@([a-zA-Z0-9_-]+:[0-9]+:[a-zA-Z0-9]+:[0-9]+:[0-9]+:[0-9-]+:[0-9-]+) ([12]):[YN]:[0-9]*[02468]:[ACGTN]+$").
  • The first line for each read must start with '@'.
  • The base calls and quality scores must be separated by a line starting with '+'.
  • The Fastq files must be compressed using gzip or bzip2.
  • The regular expression for bases is "^([ACGTNactgn.]*?)$".
    Example of Fastq file containing single reads:
    @read_name
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%++)(%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
    ...
    Example of Fastq file containing paired reads:
    @read_name/1
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%++)(%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
    @read_name/2
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%++)(%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
    ...
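Several of the requirements listed above can be checked locally before upload. The sketch below is a hedged illustration, not CNSA's validator: the function names are hypothetical, it assumes four-line Fastq records with no line wrapping, it checks paired read names only against the '/1' and '/2' suffix convention, and it does not check quality score encoding.

```python
import gzip
import re

# Base-call pattern quoted in the list above.
BASES = re.compile(r"^([ACGTNactgn.]*?)$")


def read_lines(path):
    """Read a gzip-compressed Fastq file into a list of lines."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in handle]


def check_fastq(lines, mate=None):
    """Return a list of problems found; an empty list means none were found.

    mate is 1 or 2 for paired files (read names must end in /1 or /2),
    or None for single reads.
    """
    problems = []
    for start in range(0, len(lines), 4):
        record = lines[start:start + 4]
        number = start // 4 + 1
        if len(record) < 4:
            problems.append(f"record {number}: incomplete record")
            continue
        header, bases, separator, _quality = record
        if not header.startswith("@"):
            problems.append(f"record {number}: first line must start with '@'")
        elif mate is not None and not header.split()[0].endswith(f"/{mate}"):
            problems.append(f"record {number}: read name must end with '/{mate}'")
        if not BASES.match(bases):
            problems.append(f"record {number}: unexpected character in base calls")
        if not separator.startswith("+"):
            problems.append(f"record {number}: separator line must start with '+'")
    return problems
```

A typical use would be `check_fastq(read_lines("reads_1.fq.gz"), mate=1)`, where the file name is a placeholder for one of your own files.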

BAM file

Submitted BAM files must be readable with Samtools.
BAM file names must end with the .bam suffix (e.g. 'a.bam').
Do not combine multiple files into a single archive for upload.
All file names in your account folder must be unique.

SFF format

The SFF format is supported for the 454 and Ion Torrent platforms.
SFF file names must end with the .sff suffix.
Do not combine multiple files into a single archive for upload.
All file names in your account folder must be unique.

PacBio

PacBio data submissions are supported in the platform specific native format.
One run consists of *.bax.h5, *.bas.h5 and xml files. Tar and compress the files of each run, and submit them as a single tar.gz file per run.
Do not combine the files of multiple runs into a single archive for upload.
All file names in your account folder must be unique.

Genome Assembly Data format

Genome assembly submissions include plasmids, organelles, complete virus genomes, viral segments/replicons, bacteriophages, prokaryotic and eukaryotic genomes. Chromosomes include organelles (e.g. mitochondrion and chloroplast), plasmids and viral segments.

Sequences should be submitted as a Fasta file. These sequences can be either contig, scaffold or chromosome sequences.
The submitted fasta file must be gz compressed and should specify the classification (contig, scaffold or chromosome) in the file name.
All file names in your account folder must be unique.
Do not combine multiple files into a single archive for upload.

Fasta file

The sequence name is extracted from the header line starting with >.
For example, the following sequence has name contig1:
>contig1
AAACCCGGG...
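A minimal sketch of how sequence names can be read from the header lines of a Fasta file follows. The function name is hypothetical, and taking only the first whitespace-delimited word after ">" as the name is an assumption rather than a documented CNSA rule.

```python
def fasta_sequence_names(lines):
    """Return the sequence names found on header lines starting with '>'."""
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


names = fasta_sequence_names([">contig1", "AAACCCGGG", ">contig2", "TTTGGGCCC"])
```

Here `names` is `["contig1", "contig2"]`.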

Data Upload

You should upload data files into your private FTP file upload area at CNSA.
Upload your data files to your private directory on the FTP server before submitting your metadata. Verifying a data file takes time, so if the file has not been uploaded in advance, the system may display "File not uploaded" when you submit the metadata. Wait for data verification to finish and check the upload status by clicking the upload status button. Once the status of the files changes to "Uploaded" and the information is verified, you can proceed to the next submission step.

FTP data storage

CNSA keeps uploaded files in the user's directory on the FTP server until all files are successfully submitted and archived. This directory is only a temporary area and is not suitable for long-term data storage. Files should not be kept on the FTP server for more than 2 months; files retained for longer than 2 months are periodically deleted.

FTP data upload

Upload files to your private FTP upload area through the ftp://ftp.cngb.org/ service. Authentication uses your FTP user name and password.
General instructions for uploading files using an FTP client:

  • Use your favourite ftp client.
  • Use binary mode for file transfers.
  • Use ftp://ftp.cngb.org/ as the target host.
  • Login with your FTP username and password.
  • Upload files to your private FTP upload area.
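The steps above can also be scripted. The following sketch uses Python's standard ftplib module; the function names are hypothetical, the credentials and file names are placeholders, and it assumes that files go directly into the directory you land in after login.

```python
from ftplib import FTP
from pathlib import Path


def upload_file(ftp, local_path):
    """Upload one file over an already logged-in FTP connection."""
    path = Path(local_path)
    with path.open("rb") as handle:
        # storbinary transfers the file in binary mode.
        ftp.storbinary(f"STOR {path.name}", handle)


def upload_all(host, user, password, local_paths):
    """Log in to the FTP host and upload each file in turn."""
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        for local_path in local_paths:
            upload_file(ftp, local_path)


# Example call (placeholders, not executed here):
# upload_all("ftp.cngb.org", "my_ftp_user", "my_ftp_password",
#            ["reads_1.fq.gz", "reads_2.fq.gz"])
```

Each file is uploaded separately under its own name, which keeps file names in the account folder distinct as long as the local names are.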

File MD5 checksums

Large file transfers do not always complete successfully over the internet.
An MD5 checksum can be computed for a file before and after transfer to verify that the file was transmitted successfully.
MD5 (Message Digest Algorithm 5) is a hash function that calculates a hash value (the MD5 number, a string of 32 hexadecimal characters) for a given file.
You must provide an MD5 checksum for each file submitted to the archive. We will re-compute and verify the MD5 checksum to make sure that the file transfer was completed without any changes to the file contents.
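Where no command-line tool is at hand, the checksum can also be computed with Python's standard hashlib module. This is a minimal sketch; the function name and the chunk size are arbitrary choices, and the file is read in chunks so that large sequence files do not have to fit in memory.

```python
import hashlib


def md5_checksum(path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file as a 32-character hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

`hexdigest()` returns lower-case hexadecimal, so compare checksums case-insensitively if another tool reports them in upper case.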

Obtain MD5 number (Linux)

Obtain the MD5 numbers of the files by executing:

$ md5sum file1 file2
9f6e6800cfae7749eb6c486619254b9c  file1
b636e0063e29709b6082f324c76d0911  file2

Obtain MD5 number (Mac OS X)

Obtain the MD5 numbers of the files by executing:

$ md5 file1 file2
MD5 (file1) = 9f6e6800cfae7749eb6c486619254b9c
MD5 (file2) = b636e0063e29709b6082f324c76d0911

Obtain MD5 number (Windows)

Install and run the Fsum Frontend (sourceforge.net/projects/fsumfe/).
First, tick the "md5" checkbox.

After clicking the [+] button, open the sequence data files that you need. You can select multiple files at the same time.

Click the [Calculate hashes] button. The MD5 numbers of the files are displayed.

By clicking the [Export] button, you can obtain the list of the MD5 numbers as an HTML, a CSV, or an XML file.

Ethics and regulations

When submitting data from human subjects (human data) to the CNSA, it is the submitter's responsibility to ensure that the dignity and rights of the human subjects are protected in accordance with all applicable laws, ordinances, guidelines and policies of the submitter's institution. In principle, remove any direct personal identifiers of human subjects from the data before submission.

When submitting data to the CNSA, users must follow the Interim Measures for the Human Genetic Resources Regulations and the ethical norms in their countries, submit real organization and contact information, and take responsibility for the legality and compliance of the data they upload. CNSA accepts raw data and assembly data from animals, plants, microorganisms, etc., and does not accept data covered by the Human Genetic Resources Regulations.

Submission to the CNSA

Login database

All CNGB databases use a unified user registration and login platform. The registered account applies to all CNGB databases. Users can access the homepage of CNSA through the website (http://db.cngb.org/cnsa) as shown in Figure-1. Please read and agree to the user notice. Click the "LOGIN/SIGNUP" tab on the right side of the page to enter the registration and login page (Figure-2). (Note: You need to register before you can log in and submit data.)

Figure-1 homepage

Figure-2 Login page

Fill in submitter information

The CNSA system obtains part of the submitter information from the user's account information. The submitter information filled in by the user is bound to the submitted Project, Sample, Experiment, Sequence, and Assembly data. When project, sample, experiment, sequence, or assembly data is submitted and the submitter information is incomplete, the system redirects to the submitter page and asks the submitter to complete it. If the submitter information is already complete, the system skips this page and proceeds to the next step. Users who need to change the submitter information can do so on the submitter page. The modified submitter information applies to data that is currently being submitted or will be submitted in the future.

Create submissions and upload files

Data of Project, Sample, Experiment, Sequence, and Assembly can be submitted through their respective submission portals. Users can click the corresponding submission portal in Figure-3 to submit the data. The release date of data submitted in the same batch should be kept consistent. Users can upload data via FTP, and it is recommended to upload the data files before submitting the metadata of Experiment, Sequence and Assembly.

Figure-3 Submission portal

Project submission

1.1 Submission portal

Users can create a new Project through the Project portal of the homepage (Figure-4) or my submission page (Figure-5).

Figure-4 Project submission portal in the home page

Figure-5 Project submission portal in my submission page

1.2 General info

General info page of Project (Figure-6) requires the basic information of the project, such as name, related fields, public description, external links, related items, access addresses in other databases, funding information, cooperation information, data provider, and release date. Note that the items marked with the asterisk (*) symbol are required and the rest are optional.

Figure-6 The general info page

1.3 Specific Info

After the General info of the project is completed, please click the Save and Continue button to enter the Specific Info page (Figure-7). This page requires users to fill in the project type, sample range, material, capture scope and method, and target data type. Note that the items marked with the asterisk (*) symbol are required and the rest are optional.

Figure-7 The specific info page

1.4 Overview

After the project type has been filled in, users can click the “Save and Continue” button to go to the overview page (Figure-8). This page summarizes the data entered by the user. If anything is wrong, users can return to any of the previous pages to make the corresponding changes. If no errors are reported, users can click the “Submit” button to submit. After the project is submitted, an accession number such as CNP0000006 is generated automatically, which can be used to view the details of the public project. Note that the information can still be modified after clicking the “Submit” button. If users exit the system partway through filling in the project information, the system retains the last completed result.

Figure-8 The overview page

Sample submission

2.1 Submission portal

Users can create a new Sample through the Sample portal of the homepage (Figure-9) or my submission page (Figure-10).

Figure-9 Sample submission portal in the home page

Figure-10 Sample submission portal in my submission page

2.2 General info

The General info page of samples (Figure-11) requires the selection of the submission type. Users can choose to submit samples via batch submission or single submission. If the single submission is selected, users need to fill in the release date of the sample on this page.

Figure-11 The general info page

2.2.1 Batch submission

2.2.1.1 Sample type

After selecting batch submission on the sample General info page, users can click the “Save and Continue” button to enter the sample type page (Figure-12). This page requires users to select the sample type.

Figure-12 The sample type page

2.2.1.2 Attribute page

After selecting the sample type, users can click the “Save and Continue” button to enter the Attribute page (see Figure-13). Different sample types require different attribute information. The template for the selected sample type can be downloaded by clicking the “Download Template” button on the page. Users should fill in the sample attribute information according to the template and set the release date of the batch of samples on the page. The completed template can be uploaded by clicking the “Upload Sample” button. The uploaded sample information is displayed on the page as an Excel-like table. Users can click the “+” sign in the first column to see all fields and their error information, and then edit and save the fields. After editing and reviewing the data, users can click the “check” button below the table to validate the information. If the reported errors can be ignored, the system displays an “Ignore errors and submit” button. If the submitted information is completely correct, the system displays a “submit” button. If there are errors that cannot be ignored, users must correct the data before submitting.

Figure-13 The attribute page

2.2.1.3 Overview

After the attribute information is filled in, users can click the “Save and Continue” button to enter the overview page (Figure-14). This page summarizes the data entered by the user. If anything is wrong, users can return to any of the previous pages to make the corresponding changes. If no errors are reported, users can click the “submit” button to submit. After the sample is submitted, an accession number such as CNS0000006 is generated automatically, which can be used to view the details of the public sample. Note that the information can still be modified after clicking the “Submit” button. If users exit the system partway through filling in the sample information, the system retains the last completed result.

Figure-14 The overview page

2.2.2 Single submission

2.2.2.1 Sample type

After selecting single submission on the sample General info page, users can click the “Save and Continue” button to enter the sample type page (Figure-15). This page requires users to select the sample type.

Figure-15 The sample type page

2.2.2.2 Attribute page

After selecting the sample type, users can click the “Save and Continue” button to enter the Attribute page (Figure-16). Different sample types require different attribute information, such as the sample name, tissue type, separation description, age of the human sample, and gender and organization of providers. Additional attributes can also be added. Note that the items marked with the asterisk (*) symbol are required and the rest are optional.

Figure-16 The attribute page

2.2.2.3 Overview

After the attribute information is filled in, users can click the “Save and Continue” button to enter the overview page (Figure-17). This page summarizes the data entered by the user. If anything is wrong, users can return to any of the previous pages to make the corresponding changes. If no errors are reported, users can click the “submit” button to submit. After the sample is submitted, an accession number such as CNS0000006 is generated automatically, which can be used to view the details of the public sample. Note that the information can still be modified after clicking the “Submit” button. If users exit the system partway through filling in the sample information, the system retains the last completed result.

Figure-17 The overview page

Experiment/run submission

3.1 Submission portal

Users can create a new Experiment/run through the Experiment/run portal of the homepage (Figure-18) or my submission page (Figure-19).

Figure-18 Experiment/run submission portal in the home page

Figure-19 Experiment/run submission portal in my submission page

3.2 General info

The General info page of the experiment/run (Figure-20) requires the selection of the submission type. Users can choose to submit the experiment/run via batch submission or single submission. If the single submission is selected, users need to fill in the release date of the metadata of experiment/run on this page.

Figure-20 The general info page

3.2.1 Batch submission

3.2.1.1 Metadata and files

After selecting the batch submission of experiment/run, users can click the “Save and Continue” button to enter the metadata and files submission page (Figure-21). The template can be downloaded by clicking the “Download template” button. Users need to fill in the experiment/run attribute information according to the template and set the release date of the batch on the page. The completed file can be uploaded by clicking the “Upload experiment/run” button. The uploaded experiment/run information is displayed on the page as an Excel-like table. Users can click the “+” sign in the first column to see all fields and their error information, and then edit and save the fields. After editing and reviewing the data, users can click the “Check” button below the table to validate the information. If the reported errors can be ignored, the system displays an “Ignore errors and submit” button. If the submitted information is completely correct, the system displays a “Submit” button. If there are errors that cannot be ignored, users must correct the data before submitting.

Figure-21 The metadata and files submission page

3.2.1.2 Overview

After the metadata and sequencing files are uploaded and verified, users can click the “Save and Continue” button to enter the overview page (Figure-22). This page summarizes the data entered by the user. If anything is wrong, users can return to any of the previous pages to make the corresponding changes. If no errors are reported, users can click the “submit” button to submit. After the experiment/run is submitted, accession numbers such as CNX0000006/CNR0000006 are generated automatically, which can be used to view the details of the public experiment/run. Note that the information can still be modified after clicking the “Submit” button. If users exit the system partway through filling in the experiment/run information, the system retains the last completed result.

Figure-22 The overview page

3.2.2 Single submission

3.2.2.1 Metadata

After selecting the single submission, users can click the “Save and Continue” button to enter the Metadata page (Figure-23). The data file type must be filled in first. The experiment information can be initialized from an existing experiment: if the number of a previously submitted experiment is selected, the system automatically fills in the information of that experiment, and users can then modify the copied information as needed. If the experiment is not initialized, users need to fill in the relevant experiment information themselves. The basic information includes the sequencing platform and experiment title. The library information includes the library name, library strategy, sample source, selection and enrichment methods, sequencing design (single-end or paired-end sequencing), and process description. The process information includes the programs and versions used. The release date also needs to be set. Project and sample information must be submitted before experiment information, because an experiment can only be submitted with the corresponding project accession and sample accession. Note that the items marked with the asterisk (*) symbol are required and the rest are optional.

Figure-23 The metadata page

3.2.2.2 Files

After filling in the information, users can click the “Save and Continue” button to enter the file page (Figure-24). The database provides an FTP account and password, which users can use to upload data files to the FTP server. Users need to fill in the file name, the corresponding MD5 value and the release date. The integrity of the data uploaded to FTP is verified against the MD5 value. The upload status of the data can be confirmed by clicking the “Check” button. If the data has not been uploaded or is still uploading, users cannot proceed to the next step.

Figure-24 The file page
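
Any FTP client can be used for the upload step. As one possibility, the following minimal sketch uses `ftplib` from the Python standard library. The host name passed to `upload_with_login` and the helper names are placeholders chosen for illustration; use the server address, account and password shown on the file page.

```python
import ftplib
import os


def upload_files(ftp, paths):
    """Upload each local file in binary mode; return the uploaded file names."""
    uploaded = []
    for path in paths:
        name = os.path.basename(path)
        with open(path, "rb") as handle:
            # STOR writes the file into the current remote directory.
            ftp.storbinary(f"STOR {name}", handle)
        uploaded.append(name)
    return uploaded


def upload_with_login(host, user, password, paths):
    """Connect, log in and upload; all arguments come from the file page."""
    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        return upload_files(ftp, paths)
```

The names returned by `upload_files` are the file names to enter on the file page, together with the MD5 value of each file.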

3.2.2.3 Overview

After completing the information and confirming that the files have been uploaded, users can click the “Save and Continue” button to enter the overview page (Figure-25). If no errors are reported, users can click the “Submit” button to submit. After the information is submitted, accession numbers such as CNX0000006/CNR0000006 are generated automatically, which can be used to view the details of the public experiment/run. The information can still be modified after clicking the “submit” button. If users exit the system partway through filling in the experiment/run information, the system retains the last completed result.

Figure-25 The overview page

Assembly submission

4.1 Submission portal

Users can create a new Assembly through the Assembly portal of the homepage (Figure-26) or my submission page (Figure-27).

Figure-26 Assembly submission portal in the home page

Figure-27 Assembly submission portal in my submission page

4.2 General info

The General info page of assembly (Figure-28) requires the selection of the submission type. Users can choose to submit assembly via batch submission or single submission. If the single submission is selected, users need to fill in the release date of the assembly on this page.

Figure-28 The general info page

4.2.1 Batch submission

4.2.1.1 Metadata and files

After selecting the batch submission of assembly, users can click the “Save and Continue” button to enter the metadata and files submission page (see Figure-29). The template can be downloaded by clicking the “Download template” button. Users need to fill in the assembly attribute information according to the template and set the release date of the batch on the page. The completed template can be uploaded by clicking the “Upload assembly” button. The uploaded assembly information is displayed on the page as an Excel-like table. Users can click the “+” sign in the first column to see all fields and their error information, and then edit and save the fields. After editing and reviewing the data, users can click the “Check” button below the table to validate the information. At the same time, the system checks the metadata against the files uploaded through FTP. If the files have not been uploaded, users cannot proceed to the next step. If the reported errors can be ignored, the system displays an “Ignore errors and submit” button. If the submitted information is completely correct, the system displays a “Submit” button. If there are errors that cannot be ignored, users must correct the data before submitting.

Figure-29 The metadata and files submission page

4.2.1.2 Overview

After the metadata and sequencing files are uploaded and verified, users can click the “Save and Continue” button to enter the overview page (Figure-30). This page summarizes the data entered by the user. If anything is wrong, users can return to any of the previous pages to make the corresponding changes. If no errors are reported, users can click the “submit” button to submit. After the assembly is submitted, an accession number such as CNA0000006 is generated automatically, which can be used to view the details of the public assembly. Note that the information can still be modified after clicking the “Submit” button. If users exit the system partway through filling in the assembly information, the system retains the last completed result.

Figure-30 The overview page

4.2.2 Single submission

4.2.2.1 Metadata

After selecting the single submission, users can click the “Save and Continue” button to enter the Metadata page (Figure-31). This page requires users to fill in the relevant assembly information and assembly metadata information, including assembly name, molecular type, coverage, sequencing technology, assembly method, and library construction process. It should be noted that before submitting assembly information, it is necessary to submit project information and sample information first. The assembly information can only be submitted with the corresponding project number and sample number. Note that the items marked with asterisk (*) symbol are required and the rest are optional.

Figure-31 The metadata page

4.2.2.2 File

After filling in the metadata information, users can click the “Save and Continue” button to enter the file page (Figure-32). The database provides an FTP account and password, which users can use to upload data files to the FTP server. Users need to fill in the file name, the corresponding MD5 value and the release date. The integrity of the data uploaded to FTP is verified against the MD5 value. The upload status of the data can be confirmed by clicking the “Check” button. If the data has not been uploaded or is still uploading, users cannot proceed to the next step.

Figure-32 The file page

4.2.2.3 Overview

After completing the information and confirming that the files have been uploaded, users can click the “Save and Continue” button to enter the overview page (Figure-33). If no errors are reported, users can click the “Submit” button to submit. After the information is submitted, an accession number such as CNA0000006 is generated automatically, which can be used to view the details of the public assembly. The information can still be modified after clicking the “submit” button. If users exit the system partway through filling in the assembly information, the system retains the last completed result.

Figure-33 The overview page
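
The example numbers quoted in the sections above suggest one prefix per object type: CNP for projects, CNS for samples, CNX for experiments, CNR for runs and CNA for assemblies. The following minimal sketch sorts numbers by type on that basis. It assumes that every number is a three-letter prefix followed only by digits, which is inferred from the examples and is not a documented rule; the function name is illustrative.

```python
import re

# Prefixes taken from the example numbers quoted in this handbook.
PREFIX_TO_TYPE = {
    "CNP": "project",
    "CNS": "sample",
    "CNX": "experiment",
    "CNR": "run",
    "CNA": "assembly",
}

# Assumption: a known three-letter prefix followed only by digits.
ACCESSION_RE = re.compile(r"(CN[PSXRA])(\d+)")


def classify_accession(accession):
    """Return the object type for a number, or None if it does not fit the pattern."""
    match = ACCESSION_RE.fullmatch(accession.strip())
    if match is None:
        return None
    return PREFIX_TO_TYPE[match.group(1)]
```

For example, `classify_accession("CNX0000006")` returns `"experiment"`, while a string with an unknown prefix returns `None`.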

Variation Archive

Variations

CNSA accepts genomic variations from any species, including SNP and SV, and provides long-term stable archive numbers and data. The variation data information includes Analysis, Samplesets, Subject, Call, VCF file and Region.

Submission template

There are two templates for submission of variations.

SNP_submission_template.v1.0.xlsx is the submission template for variations less than or equal to 50 bp in length. It includes Analysis, Samplesets, Subject, Call, and VCF file. The Analysis and Subject information must be submitted, and at least one of Call and VCF file must be submitted.

SV_submission_template.v1.0.xlsx is the submission template for variations more than 50 bp in length. It includes Analysis, Subject, Samplesets, Call, VCF file and Region. The Analysis and Subject information must be submitted, and at least one of Call and VCF file must be submitted. The Region information is optional.
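
The requirements for the two templates can be expressed as a small pre-submission check. The following is a minimal sketch that encodes only the rules stated above; the function name `check_sections` and the short template keys `"SNP"` and `"SV"` are illustrative, and the check does not read the Excel files themselves.

```python
# Section names as listed in this handbook.
REQUIRED = {"Analysis", "Subject"}
ONE_OF = {"Call", "VCF file"}
ALLOWED = {
    "SNP": {"Analysis", "Samplesets", "Subject", "Call", "VCF file"},
    "SV": {"Analysis", "Samplesets", "Subject", "Call", "VCF file", "Region"},
}


def check_sections(template, filled_sections):
    """Return a list of problems; an empty list means the stated rules are met."""
    filled = set(filled_sections)
    problems = []
    for name in sorted(REQUIRED - filled):
        problems.append(f"missing required section: {name}")
    if not (ONE_OF & filled):
        problems.append("at least one of Call or VCF file must be provided")
    for name in sorted(filled - ALLOWED[template]):
        problems.append(f"section not part of the {template} template: {name}")
    return problems
```

For example, `check_sections("SV", ["Analysis", "Subject", "VCF file"])` returns an empty list, because Region and Samplesets are not mandatory.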

Coding rules

Coding rules for submission

Call

varc+01+numbers (01 means the variation is less than or equal to 50 bp in length, and the following numbers are assigned cumulatively. For example, varc012341 represents the 2341st variation.)

varc+02+numbers (02 means the variation is more than 50 bp in length, and the following numbers are assigned cumulatively. For example, varc022341 represents the 2341st variation.)

Analysis

CVA0000001 (The following numbers are cumulatively presented.)

File

CVF0000001 (The following numbers are cumulatively presented.)

Subject

CVS0000001 (The following numbers are cumulatively presented.)

Region

varr+02+numbers (02 means the variation is more than 50 bp in length, and the following numbers are assigned cumulatively. For example, varr022341 represents the 2341st variation.)

Coding rules for archive

var+01+numbers (01 means the variation is less than or equal to 50 bp in length, and the following numbers are assigned cumulatively. For example, var012341 represents the 2341st variation.)

var+02+numbers (02 means the variation is more than 50 bp in length, and the following numbers are assigned cumulatively. For example, var022341 represents the 2341st variation.)

var+09+numbers (09 means the length class of the variation cannot be determined, and the following numbers are assigned cumulatively. For example, var092341 represents the 2341st variation.)
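
The prefix, length code and running number of these identifiers can be separated mechanically. The following is a minimal sketch based only on the rules listed above; the function name, the returned dictionary keys and the wording of the labels are illustrative. It treats `varc` and `varr` as submission codes and `var` as the archive code, and rejects combinations that the rules do not list (for example `varr` with `01`).

```python
import re

# Prefix, two-digit length code, then the cumulative running number.
ID_RE = re.compile(r"(varc|varr|var)(01|02|09)(\d+)")

KIND = {"varc": "call", "varr": "region", "var": "archive"}
LENGTH_CLASS = {
    "01": "<=50bp",
    "02": ">50bp",
    "09": "undetermined",
}
# Length codes that the coding rules list for each prefix.
ALLOWED_CODES = {"varc": {"01", "02"}, "varr": {"02"}, "var": {"01", "02", "09"}}


def parse_variation_id(identifier):
    """Split a variation identifier into kind, length class and running number."""
    match = ID_RE.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a variation identifier: {identifier!r}")
    prefix, code, serial = match.groups()
    if code not in ALLOWED_CODES[prefix]:
        raise ValueError(f"length code {code} is not listed for prefix {prefix}")
    return {
        "kind": KIND[prefix],
        "length_class": LENGTH_CLASS[code],
        "serial": int(serial),
    }
```

For example, `parse_variation_id("varc012341")` yields the kind `"call"`, the length class `"<=50bp"` and the running number 2341.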

Variation data format

CNSA currently accepts variation data in VCF format only, so variation data in other formats must be converted to VCF before submission. To ensure that the format of the VCF file is correct, you are advised to refer to the VCF guide (http://samtools.github.io/hts-specs/VCFv4.3.pdf) on the samtools website when converting data to VCF.

VCF file

VCF is a text file format (most likely stored in a compressed manner). It contains meta-information lines (prefixed with "##"), a header line (prefixed with "#"), and data lines each containing information about a position in the genome and genotype information on samples for each position (text fields separated by tabs). Zero length fields are not allowed; a dot (".") must be used instead. In order to ensure interoperability across platforms, VCF compliant implementations must support both LF (\n) and CR+LF (\r\n) newline conventions.

For example:

##fileformat=VCFv4.3 
##fileDate=20090805 
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> 
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,. 
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3 
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4 
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2 
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
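
Before uploading, the basic layout described above can be checked with a few lines of code. The following is a minimal sketch, not a full VCF validator: it separates meta-information lines, the header line and data lines, splits fields on tabs, and rejects zero-length fields and rows whose column count differs from the header. The function name `parse_vcf_text` is illustrative, and the text is assumed to be already decompressed.

```python
def parse_vcf_text(text):
    """Split VCF text into meta lines, header columns and data records."""
    meta, header, records = [], None, []
    # splitlines() handles both LF and CR+LF line endings.
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line[2:])
        elif line.startswith("#"):
            header = line[1:].split("\t")
        else:
            if header is None:
                raise ValueError(f"line {line_number}: data line before the header line")
            fields = line.split("\t")
            if "" in fields:
                raise ValueError(f"line {line_number}: empty field, use '.' instead")
            if len(fields) != len(header):
                raise ValueError(f"line {line_number}: expected {len(header)} columns")
            # Key each value by its column name from the header line.
            records.append(dict(zip(header, fields)))
    return meta, header, records
```

Each record is a dictionary keyed by column name, so `records[0]["POS"]` gives the position of the first variant as a string.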

Variation data submission to CNSA

1. Login database

All CNGB databases use a unified user registration and login platform. The registered account applies to all CNGB databases. Users can access the homepage of CNSA through the website (http://db.cngb.org/cnsa/) as shown in Figure 1. Please read and agree to the user notice. Click the "LOGIN/SIGNUP" tab on the right side of the page to enter the registration and login page (Figure 2). (Note: You need to register before you can log in and submit data.)

home.jpg

Figure 1 Home page

login.jpg

Figure 2 Login page

2. Projects and samples submission

Before submitting the variation data, you need to create the project and sample in the CNSA (http://db.cngb.org/cnsa/). For the help manual, please refer to https://db.cngb.org/cnsa/handbook/

3. Variation submission

Variation data can be submitted via the submission portal on the CNSA homepage. Click the submission portal shown in Figure 3, then click the Entrance button to go to the Variation Data Submission page (Figure 4). Click the Download Template button to open the template download pop-up, and download the variation data submission template you need (Figure 5). Fill in the Excel template and the other information on the page shown in Figure 4, then upload the variation template. After that, click the “View FTP account and password” button to display the FTP account and password created by CNSA, and use them to upload the file to FTP. After the upload succeeds, click the submit button. The administrator will then verify your file and the meta information in the Excel table you submitted, and send you the result of the submission and the CVAR variation number.

submit.jpg

Figure 3 Submission portal

submition.jpg

Figure 4 Submission page

download.jpg

Figure 5 Variation data submission template