CNSA Handbook - About

About CNSA

CNSA is an archiving system for multi-omics data in the life sciences. It archives raw sequencing reads and downstream analysis results. CNSA follows international standards for omics data and supports online and batch submission of multiple data types, including Project, Sample, Experiment/Run, Assembly, Variation, Metabolism, Single cell, and Sequence. Its data submission service can supplement the literature publishing process and support early data sharing. CNSA aims to build a comprehensive, high-quality system for storing, managing, and sharing omics data, giving researchers worldwide convenient access to data and information resources and promoting the development of the life sciences.

Handbook (simple version)

Register/Login

Register or log in on the registration page with your email address or mobile number, then fill in the submitter’s information.

Enter the submission portal

Click “Submit” on the CNSA homepage or click “Submission portal” on the homepage navigation bar to enter the Submission portal page.

Submit project

Enter the submission process
Click “Project” on the Submission portal page to enter the submission process.
Submit project information
Select Data access manner -> Fill in the basic information -> Fill in the details -> Overview -> Submit

Notes

The first step of project submission is choosing the Data access manner. If you choose "Public" or "Controlled", you can base the release date on the expected publication date of the article; we recommend setting it later than that date.
Article information can also be added after the article is published.
After the project is submitted successfully, you can find the project accession assigned by CNSA (prefixed with CNP) in “My submission-Project”.

Submit review materials

Once the project is submitted successfully, your data submission application is complete. If additional materials are needed during the compliance review, the data administrator (datasubs@genomics.cn) will ask you to submit them, so please watch for emails from datasubs@genomics.cn. The material review generally takes 3 working days. After the materials pass review, the data administrator will conduct a project review.

Submit sample

Enter the submission process
Click “Sample” on the Submission portal page to enter the submission process.
Submit sample information
To submit one sample, we recommend single submission. To submit multiple samples at once, we recommend batch submission.
1. Single submission: Select "Submit a single sample" -> Select sample type -> Fill in sample attributes -> Fields pass check -> Overview -> Submit
2. Batch submission: Select "Submit batch samples" -> Select sample type -> Download template -> Upload completed template -> Template pass check -> Submit

Notes

Please select the sample type carefully; you cannot change it yourself after submission.
Sample names must be unique.
Before filling out the batch template file, please read the related description and field comments. If the value of a required field is unavailable, you can enter 'not collected', 'not applicable' or 'missing'. If you are unsure of the taxonomy ID or scientific name of the organism, you can search for it in the single submission process to make sure the information is correct.
Collection date supports four formats: YYYY, YYYY-MM, YYYY-MM-DD, and YYYY-YYYY.
The uploaded file cannot exceed 2000 lines. If you have more, please split them across multiple submissions.
After the sample is submitted successfully, you can find the sample accession assigned by CNSA (prefixed with CNS) in “My submission-Sample”.
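
The four collection date formats listed in the notes can be checked before a batch template is uploaded. The following is a minimal sketch in Python, not the check that CNSA itself runs; the function name is_valid_collection_date is hypothetical, and the rule that the start year of a YYYY-YYYY range must not be later than the end year is an assumption.

```python
import re
from datetime import date

# Patterns for the four documented formats.
_YEAR = re.compile(r"^\d{4}$")
_YEAR_MONTH = re.compile(r"^(\d{4})-(\d{2})$")
_FULL_DATE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")
_YEAR_RANGE = re.compile(r"^(\d{4})-(\d{4})$")


def is_valid_collection_date(value):
    """Return True if value is YYYY, YYYY-MM, YYYY-MM-DD or YYYY-YYYY."""
    value = value.strip()
    if _YEAR.match(value):
        return True
    match = _YEAR_MONTH.match(value)
    if match:
        return 1 <= int(match.group(2)) <= 12
    match = _FULL_DATE.match(value)
    if match:
        try:
            # date() rejects impossible days such as 30 February.
            date(int(match.group(1)), int(match.group(2)), int(match.group(3)))
            return True
        except ValueError:
            return False
    match = _YEAR_RANGE.match(value)
    if match:
        # Assumption: the start year must not be later than the end year.
        return int(match.group(1)) <= int(match.group(2))
    return False
```

For example, is_valid_collection_date("2021-06") returns True, while is_valid_collection_date("30/06/2021") returns False.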

Submit experiment/run

Enter the submission process
Click “Experiment/run” on the Submission portal page to enter the submission process.
Submit data files and metadata
To submit one experiment/run, we recommend single submission. To submit multiple experiments/runs at once, we recommend batch submission.
1. Single submission: Select submission type (Submit a single experiment/run) -> Fill in basic information -> Fill in metadata -> Metadata pass check -> Submit data files -> Data files pass check -> Overview -> Submit
2. Batch submission: Select submission type (Submit batch experiments/runs) -> Upload data files -> Download metadata template -> Upload completed metadata template -> Metadata pass check -> Data files pass check -> Submit

Notes

We recommend uploading the data files first. All users can upload data via FTP or mail a hard drive.
1. The upload is complete once the data is in your personal FTP directory.
2. The FTP server, username and password can be viewed in the submission process or in "My service". Each user has a unique FTP account.
Before filling out the batch template file, please read the related instructions and field comments. Each line represents one run. If a sample is associated with multiple data files, enter them on multiple lines, and make sure that the experiment information is consistent and the library name is unique. The file name and MD5 value of each data file must be unique.
The metadata file cannot exceed 2000 lines. If you have more, please split them across multiple submissions.
After the experiment/run is submitted successfully, you can find the accessions assigned by CNSA in “My submission-Experiment/run” (Experiment: prefixed with CNX; Run: prefixed with CNR).
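
The metadata template asks for the file name and MD5 value of each data file, and both must be unique. The following is a minimal sketch in Python, using only the standard library, for computing MD5 values locally and spotting duplicates before filling in the template. The helper names md5_of_file, find_duplicates and check_files are hypothetical and are not part of any CNSA tool.

```python
import hashlib
from collections import Counter
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(values):
    """Return the sorted values that occur more than once."""
    counts = Counter(values)
    return sorted(value for value, count in counts.items() if count > 1)


def check_files(paths):
    """Map file name to MD5 and report duplicated names or MD5 values."""
    paths = [Path(p) for p in paths]
    names = [p.name for p in paths]
    md5s = [md5_of_file(p) for p in paths]
    return {
        "md5": dict(zip(names, md5s)),
        "duplicate_names": find_duplicates(names),
        "duplicate_md5": find_duplicates(md5s),
    }
```

Reading in chunks keeps memory use low for large sequencing files. A non-empty "duplicate_md5" list means that two files have identical content, which the template would reject as non-unique.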

Submit assembly

Enter the submission process
Click “Assembly” on the Submission portal page to enter the submission process.
Submit data files and metadata
To submit one assembly, we recommend single submission. To submit multiple assemblies at once, we recommend batch submission.
1. Single submission: Select submission type (Submit a single assembly) -> Fill in basic information -> Fill in metadata -> Metadata pass check -> Submit data files -> Data files pass check -> Overview -> Submit
2. Batch submission: Select submission type (Submit batch assemblies) -> Upload data files -> Download metadata template -> Upload completed metadata template -> Metadata pass check -> Data files pass check -> Submit

Notes

We recommend uploading the data files first (currently only the FASTA format is supported). All users can upload data via FTP or mail a hard drive.
1. The upload is complete once the data is in your personal FTP directory.
2. The FTP server, username and password can be viewed in the submission process or in "My service". Each user has a unique FTP account.
Before filling out the batch template file, please read the related instructions and field comments. Each line represents one assembly. Make sure that each assembly name is unique and that the file name and MD5 value of each data file are unique.
The metadata file cannot exceed 2000 lines. If you have more, please split them across multiple submissions.
After the assembly is submitted successfully, you can find the assembly accession assigned by CNSA (prefixed with CNA) in “My submission-Assembly”.
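
Data files can be uploaded to the personal FTP directory with any FTP client. The following is a minimal sketch using Python's standard ftplib module. The host name, username and password shown in the comments are placeholders, and the use of plain FTP (rather than FTP over TLS) is an assumption; take the real connection details from the submission process or "My service". The helper upload_files is hypothetical.

```python
from ftplib import FTP
from pathlib import Path


def upload_files(ftp, local_paths):
    """Upload each file in binary mode and return the uploaded file names."""
    uploaded = []
    for local_path in map(Path, local_paths):
        with open(local_path, "rb") as handle:
            # STOR saves the file under its own name in the current directory.
            ftp.storbinary(f"STOR {local_path.name}", handle)
        uploaded.append(local_path.name)
    return uploaded


# Example usage (placeholder values, replace with your own account details):
#
# with FTP("ftp.example.org") as ftp:
#     ftp.login(user="your_ftp_username", passwd="your_ftp_password")
#     upload_files(ftp, ["assembly1.fasta"])
```

Files are sent in binary mode so that their content, and therefore their MD5 value, is unchanged by the transfer.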

Submit variation

Enter the submission process
Click “Variation” on the Submission portal page to enter the submission process.
Submit data files and metadata
1. Submit SNP: Select variation type (SNP) -> Upload data files to FTP -> Download metadata template -> Upload completed metadata template -> Metadata pass check -> Data files pass check -> Submit
2. Submit SV: Select variation type (SV) -> Upload data files to FTP (optional) -> Download template -> Upload completed template -> Pass check -> Data files pass check (if submitted) -> Submit
3. Submit CAHV: Select variation type (CAHV) -> Download template -> Upload completed template -> Pass check -> Submit

Notes

If the selected variation type is SNP, we recommend uploading the data files first (currently only the VCF format is supported). All users can upload data via FTP or mail a hard drive.
1. The upload is complete once the data is in your personal FTP directory.
2. The FTP server, username and password can be viewed in the submission process or in "My service". Each user has a unique FTP account.
Before filling out the batch template file, please read the description and field comments. If you need to upload VCF files, make sure that the file name and MD5 value of each data file are unique.
After the data has been reviewed, you can find the variation accessions assigned by CNSA (prefixed with varc) in “My submission-Variation”.

Submit metabolism

Enter the submission process
Click "Metabolism" on the Submission portal page to enter the submission process.
Submit data files and metadata
Upload data files -> Download “Descriptions” template -> Upload completed “Descriptions” template -> Add “Assay” -> Download the template of the created “Assay” -> Upload the completed “Assay” template -> Metadata pass check -> Data files pass check -> Submit

Notes

We recommend uploading the data files first. All users can upload data via FTP or mail a hard drive.
1. The upload is complete once the data is in your personal FTP directory.
2. The FTP server, username and password can be viewed in the submission process or in "My service". Each user has a unique FTP account.
The "Descriptions" information is required, and at least one assay must be created. A new assay can be added after the previous assay is uploaded.
Each sheet of the metadata file cannot exceed 2000 lines. If you have more, please split them across multiple submissions.
Before filling out the template file, please read the description and field comments, and make sure that the file name and MD5 value of each data file are unique.
After the data has been reviewed, you can find the metabolism accession assigned by CNSA (prefixed with METM) in “My submission-Metabolism”.
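
When a template would exceed the 2000-line limit, the rows can be split into several submissions in advance. The following is a minimal sketch in Python, assuming the rows of one sheet have already been loaded into a list and that the limit counts data rows (the handbook does not say whether header lines are included). The helper split_rows is hypothetical.

```python
def split_rows(rows, max_rows=2000):
    """Split a list of metadata rows into consecutive batches of at most max_rows."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]
```

For example, 4500 rows are split into three batches of 2000, 2000 and 500 rows, each of which can be written to its own copy of the template.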

Submit single cell

Enter the submission process
Click “Single cell” on the Submission portal page to enter the submission process.
Select the associated project accession and fill in the number of cells.
Submit data files and metadata
Upload data files -> Download metadata template -> Upload completed metadata template -> Metadata pass check -> Data files pass check -> Submit

Notes

We recommend uploading the data files first. All users can upload data via FTP or mail a hard drive.
1. The upload is complete once the data is in your personal FTP directory.
2. The FTP server, username and password can be viewed in the submission process or in "My service". Each user has a unique FTP account.
The expression matrix file is required; other file types are optional. If your gene expression files, metadata files, and cluster files need to be grouped, please add the group name to the file name.
Each sheet of the metadata file cannot exceed 2000 lines. If you have more, please split them across multiple submissions.
Before filling out the template file, please read the description and field comments, and make sure that the file name and MD5 value of each data file are unique.
After the data has been reviewed, you can find the single cell accession assigned by CNSA (prefixed with CSE) in “My submission-Single cell”.

Submit virus sequence

Enter the submission process
Click "Virus sequence" on the Submission portal page to enter the submission process.
Select the release date of this submission.
Submit data files and metadata
Upload data files to FTP -> Download metadata template -> Upload completed metadata template -> Metadata pass check -> Data files pass check -> Submit

Notes

We recommend uploading the data files first (currently only the FASTA format is supported). All users can upload data via FTP or mail a hard drive.
1. The upload is complete once the data is in your personal FTP directory.
2. The FTP server, username and password can be viewed in the submission process or in "My service". Each user has a unique FTP account.
Before filling out the template file, please read the description and field comments, and make sure that the file name and MD5 value of each data file are unique.
After the data has been reviewed, you can find the virus sequence accessions assigned by CNSA (prefixed with N_) in “My submission-Virus sequence”.

Submit sequence

Enter the submission process
Click “Sequence” on the Submission portal page to enter the submission process.
Select the release date of this submission.
Submit data files and metadata
Upload data files -> Download file list template -> Upload completed file list template -> File list pass check -> Data files pass check -> Submit

Notes

We recommend uploading the data files first (currently only the GBFF format is supported). All users can upload data via FTP or mail a hard drive.
1. The upload is complete once the data is in your personal FTP directory.
2. The FTP server, username and password can be viewed in the submission process or in "My service". Each user has a unique FTP account.
Before filling out the template file, please read the description and field comments, and make sure that the file name and MD5 value of each data file are unique.
After the data has been reviewed, you can find the sequence accessions assigned by CNSA (prefixed with N_) in “My submission-Sequence”.

Memo

Data release
The release date can be set under “Data Management” in the project submission process. You can select only today or a date within two years from today. Data is made public only after it has been reviewed and the release date set by the user has been reached. When the release date approaches, the system sends a reminder email 15 days in advance.
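
The release date window and the reminder date can be worked out ahead of time. The following is a minimal sketch in Python, assuming that "within two years" means up to and including the same calendar day two years later, and that the reminder is sent exactly 15 days before the release date. The function names are hypothetical and do not describe how the CNSA system is implemented.

```python
from datetime import date, timedelta


def add_years(d, years):
    """Return d shifted by whole years; 29 February falls back to 28 February."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def is_selectable_release_date(candidate, today=None):
    """Return True if candidate lies between today and two years from today."""
    today = today or date.today()
    return today <= candidate <= add_years(today, 2)


def reminder_date(release):
    """Return the date 15 days before the release date."""
    return release - timedelta(days=15)
```

For example, for a release date of 16 January 2025, reminder_date returns 1 January 2025.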
Modify and delete
On the "My submission" page, click the "pencil icon" in the status column to modify a submission. If the status column has no "pencil icon", please email datasubs@genomics.cn with the submission ID or data accession and the reason for the modification.
1. Modify release date
  If the status of the project is “Unfinished”, go to “My submission” and click the “pencil icon” in the project status column to enter the modification process.
  If the status of the project is "Processing" or "Processed", you can modify the release date by clicking the date in the release date column or the "pencil icon".
  If the status of the project is “Public” or “Controlled”, you cannot modify the release date yourself. If you need to change it, please email datasubs@genomics.cn with the project accession and the reason for the change.
2. Delete submission
  If the status is “Unfinished”, click the “trash can icon” to delete the submission. For any other status, please email datasubs@genomics.cn with the submission ID and the reason for the deletion.
Data association
A sample is associated with its project only after an experiment/run or assembly is submitted. Once the experiment/run or assembly has been reviewed and the data is public, all information associated with the project can be retrieved by project accession. Until then, project and sample information can only be retrieved separately, with no association between them.
MD5 check
Fill in the file name and MD5 value of the uploaded data file, then click "Check". There are four possible statuses:
1. Not uploaded: The data file has not been uploaded or is still uploading. If you have already uploaded the file and the status still shows "Not uploaded", please click "Check" again later.
2. Calculating: The data file has been uploaded, but its MD5 value has not been calculated yet or is being calculated. Please click "Check" again later.
3. MD5 mismatch: The MD5 value calculated by the system does not match the one you filled in. This also happens when the check runs on a partially uploaded file; in that case, please click "Check" again later. If the status still shows "MD5 mismatch" after a long time (such as half an hour), please recalculate the MD5 of the data file and fill it in again. If the mismatch persists, please contact datasubs@genomics.cn and include the file name in the email.
4. Check finished: The data file has been uploaded and passed the check.
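
The four statuses and the suggested responses can be summarized as a small decision function. The following is a minimal sketch in Python that only restates the guidance above; the function next_action is hypothetical, and the 30-minute threshold comes from the handbook's "such as half an hour" example rather than from a fixed rule.

```python
def next_action(status, minutes_waited=0):
    """Suggest what to do for a status shown after clicking "Check"."""
    if status in ("Not uploaded", "Calculating"):
        return "wait and click Check again"
    if status == "MD5 mismatch":
        # "Half an hour" is only given as an example of "a long time".
        if minutes_waited < 30:
            return "wait and click Check again"
        return ("recalculate the MD5 and fill it in again; if it still fails, "
                "email datasubs@genomics.cn with the file name")
    if status == "Check finished":
        return "nothing to do"
    raise ValueError(f"unknown status: {status!r}")
```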
View accessions and submitted metadata
On the “My submission” page, you can directly view the accession of a single submission, download the attribute files of batch submissions (with accessions) from the status column, or click a completed submission ID to view its details.
Contact us
If you have any questions, please contact the administrator at datasubs@genomics.cn or 0755-36307296.

Metadata

Metadata is data that describes an information resource or a data object.

Currently, CNSA metadata includes 11 data objects: Submitter, Project, Sample, Experiment, Run, Assembly, Variation, Metabolism, Single cell, Virus sequence, and Sequence. Below is an introduction to each data type and its fields (required fields are marked with *).

Submitter

The submitter submits project, sample, experiment, run, assembly, and variation data to CNSA. A submitter can submit multiple data types, update and modify data, set the Data access manner, and so on.

Field	Description
*First name	First (given) name of the submitter.
Middle name	Middle name of the submitter.
*Last name	Last (family) name of the submitter.
*Primary E-mail	Primary Email address of the submitter.
Secondary E-mail	Secondary Email address of the submitter.
*Submitting organization	Full name of submitter’s organization.
Submitting organization URL	The URL of submitter’s organization.
*Department	The department of the submitter.
Phone	The phone number of the submitter.
Fax	The Fax number of the submitter.
*Street	The street name of the submitter.
*City	The city name of the submitter.
State/Province	The state/province of the submitter.
*Country	The Country/Region of the submitter.
*Postal code	The Postal code of the submitter.

Project

A project is a set of related data. The definition of a 'project' is flexible, so a project can be defined using different parameters. For example, Project records can be established for:

Genome sequencing and assembly
Metagenomes
Transcriptome sequencing and expression
Targeted locus sequencing
Genetic or RH Maps
Epigenetics
Phenotype or Genotype
Variation detection

Project represents a submission, initiative, or group of data that is logically related in some manner, or is of interest to retrieve as a distinct dataset. A project may be identified in terms of distinctions in the type of data produced.

Data access manner

CNSA has three Data access manners: Public, Controlled and Private. The data submitter chooses a Data access manner when submitting a project.

Public: Public Data refers to data whose data access manner is "Public". That is, the metadata and data files associated with the project will be public. Public Data is open to everyone for browsing and download. You need to set a release date, and all metadata and data files associated with the project will become public on that date.

Controlled: Controlled Data refers to data whose data access manner is "Controlled". That is, the metadata associated with the project will be public and the data files will be controlled. Users can apply for access to Controlled Data. You need to set the release date of metadata, and all metadata associated with the project will be public on that date.

Private: Private Data refers to data whose data access manner is "Private". That is, the metadata and data files associated with the project are controlled. Private Data is not accessible, and no access or download application is accepted.

General info

*Project title

Short descriptive name of the project such as a phrase or short sentence for public display.

Project name

A short name for the study.

*Public description

A description (a paragraph) of the study goals and relevance. Provide enough information (more than 100 characters) in the description for other users to interpret the data.

*Relevance

The primary general relevance of the project.

Relevance	Description
Agricultural
Environmental
Evolution
Industrial	Could include bio-remediation, bio-fuels and other areas of research where there are areas of mass production.
Medical
Model organism
Other	Unspecified major impact categories to be defined in the "Relevance description".

*Relevance description

Describe the relevance when "Other" is selected.

*Functional annotation

Indicate whether the project will contain functional annotation. If yes, a unique locus tag prefix will be created.

*Locus tag prefix

The prefix of a locus tag. Locus_tags are identifiers that are systematically applied to every gene in a genome. All components of a project (such as multiple chromosomes or plasmids, etc) should use the same locus_tag prefix.

Format requirements:

It can contain only alphanumeric characters and must be at least 3 characters long.
All letters must be uppercase. It must start with a letter; numerals may appear in the 2nd position or later (e.g. A1C).
It must not contain symbols such as -, _ or *.
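
The format requirements above can be expressed as a single regular expression. The following is a minimal sketch in Python that covers only the rules listed here (CNSA may apply further checks, such as a maximum length, that this handbook does not state); the function name is hypothetical.

```python
import re

# One uppercase letter, then at least two more uppercase letters or digits.
_LOCUS_TAG_PREFIX = re.compile(r"[A-Z][A-Z0-9]{2,}")


def is_valid_locus_tag_prefix(prefix):
    """Return True if prefix satisfies the documented format requirements."""
    return _LOCUS_TAG_PREFIX.fullmatch(prefix) is not None
```

For example, "A1C" is accepted, while "1AC" (starts with a numeral), "AB" (too short), "a1c" (lowercase) and "A_C" (contains a symbol) are rejected.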

External links

The web sites that are related to this project.

Field	Description
URL	URL of web site that is related to this project.
Link description	Display name of web site that is related to this project.

Related projects

The projects that are related to this project.

Field	Description
Project accession	Related Project accession ID
Project description	Description of related Project

Grants

The funding sources of this project.

Field	Description
Grant number	The number of the grant that supports the research.
Grant title	The title of the grant that supports the research.
Institution abbreviation	The abbreviation of the institution that funded the research.
Institution	The institution that funded the research.

Consortium

If project is carried out as part of a consortium, please provide the related consortium information.

Field	Description
Consortium name	If project is carried out as part of a consortium, provide the consortium name.
Consortium URL	If the consortium maintains a web site, provide the URL.

Data providers

Indicate the data provider (data submitter) if it is someone other than the submitting organization or consortium.

Field	Description
Data provider	Data provider
Data provider URL	If the data provider maintains a web site, provide the URL.

Detailed information

Project type

*Project data type

A general label indicating the primary study goal. Select appropriate types.

Project data type	Description
Genome sequencing and assembly	Whole, or partial, genome sequencing project (with or without a genome assembly).
Raw sequence reads	Submission of raw sequencing information as it comes out of machine.
Genome sequencing	Genome sequencing
Assembly	Assembly
Clone ends	Clone-end sequencing project
Epigenomics	DNA methylation, histone modification, chromatin accessibility datasets
Exome	Exome resequencing project
Map	Project that results in non-sequence map data such as genetic map, radiation hybrid map, cytogenetic map, optical map, etc.
Metagenome	Sequence analysis of environmental samples
Metagenomic assembly	Metagenomic assembly
Phenotype or Genotype	Project correlating phenotype and genotype
Proteome	Large scale proteomics experiment including mass spec. analysis
Random survey	Sequence generated from a random sampling of the collected sample; not intended to be comprehensive sampling of the material.
Targeted loci cultured	Targeted loci cultured
Targeted loci environmental	Targeted loci environmental
Targeted Locus (Loci)	Project to sequence specific loci, such as a 16S rRNA sequencing
Transcriptome or Gene expression	Large scale RNA sequencing or expression analysis. Includes cDNA, EST, RNA_seq, and microarray.
Variation	Project with a primary goal of identifying large or small sequence variation across populations.
Other	A free text description is provided to indicate Other data type

* Project data type description

Describe the project data type when "Other" is selected.

*Sample scope

The scope and purity of the biological sample used for the study.

Choose “Multiisolate” as the scope when the goal of the research is to compare multiple individuals or strains of the same species, e.g., in a Variation or Genome sequencing and assembly project.
Choose “Multispecies” when different species are being examined.
Choose “Monoisolate” if the goal is to make a single genome or transcriptome assembly, even if more than one individual was the source of the DNA or RNA.

Sample scope	Description
Monoisolate	a single animal, cultured cell-line, inbred population (or possibly a heterogeneous population when a single genome assembly is generated from the pooled sample; not preferred).
Multiisolate	multiple individuals, a population (representative of a species). To be used for variation or other sequence comparison projects, not when multiple genomes will be annotated. Make separate monoisolate projects when more than one genome will be annotated.
Multispecies	sample represents multiple species.
Environment	the species content of the sample is not known.
Synthetic	the sample is synthetically created by a machine.
Other	specify the sample scope that was used.

* Target description

Describe the target when "Other" is selected.

Publications

Field	Description
PubMed ID	The PubMedID will be used to populate the publication information.
DOI	Provide a DOI if a PubMed ID is not available. If you provide a DOI, you must also fill in the reference information below.
*Reference title	A title of reference.
*Journal title	A title of journal.
*Year	Year of publication.
*Volume	Journal volume.
*Issue	Journal issue.
*Start page number	Start page number of publication.
*End page number	End page number of publication.
*Author	Name of author.
*Institution	Institution of author.

Sample

Description of biological source material; each physically unique specimen should be registered as a single Sample with a unique set of attributes.

General information

Submission type

Submission type	Description
Submit batch samples	You will be asked to upload a text file that describes each of your samples and their attributes.
Submit a single sample	You will be asked to manually complete a web form to describe one sample and its attributes.

Sample type

When preparing your submission, please refer to the attributes list below and the Sample examples, and fill in the relevant fields. Select the package that best describes your samples.

Attributes list

Sample type	Description
Clinical or host-associated pathogen
Environmental, food or other pathogen
Combined pathogen	Batch submissions that include both clinical and environmental pathogen.
Microbial sample	Use for bacteria or other unicellular microbes when it is not appropriate or advantageous to use MIxS, Pathogen or Virus packages.
Model organism or animal sample	Use for multicellular samples or cell lines derived from common laboratory model organisms, e.g., mouse, rat, Drosophila, worm, fish, frog, or large mammals including zoo and farm animals.
Metagenome or environmental sample	Use for metagenomic and environmental samples when it is not appropriate or advantageous to use MIxS packages.
Invertebrate sample	Use for any invertebrate sample.
Human sample	Only use for human samples or cell lines that have no privacy concerns. For samples isolated from humans use the Pathogen, Microbe or appropriate MIxS package.
Plant sample	Use for any plant sample or cell line.
Virus sample	Use for all virus samples not directly associated with disease.
GSC MIxS air
GSC MIxS built environment
GSC MIxS host associated
GSC MIxS human associated
GSC MIxS human gut
GSC MIxS human oral
GSC MIxS human skin
GSC MIxS human vaginal
GSC MIxS microbial mat biofilm
GSC MIxS miscellaneous natural or artificial environment
GSC MIxS plant associated
GSC MIxS sediment
GSC MIxS soil
GSC MIxS waste water sludge
GSC MIxS water
Beta-lactamase	Use for beta-lactamase gene transformants that have antibiotic resistance data.

Sample attributes

A major component of a Sample record is the sample attributes section. Attributes define the material under investigation and can include sample characteristics such as cell type, collection site and phenotypic information like disease state.

Sample attributes are captured as structured name: value pairs, for example, tissue: liver. The first targeted dictionaries implemented in the Sample submission are the MIxS minimum information checklists for standardizing descriptions of genomes, metagenomes and targeted locus sequences, as developed by the Genomic Standards Consortium.

Experiment

A description of sample-specific sequencing library, instrument and sequencing methods. An Experiment references 1 Project and 1 Sample.

General information

Submission type

Submission type	Description
Submit batch experiments/runs	You will be asked to upload a text file that describes each of your experiments and runs.
Submit a single experiment/run(s)	You will be asked to manually complete a web form to describe your sequencing experiment and upload your raw sequencing reads.

*Project accession

Select the project this experiment belongs to.

*Sample accession

Select the sample this experiment uses.

Metadata

Experiment reuse

Reuse the information of an experiment that has already been submitted. The existing experiment information is filled in automatically so that you can submit quickly.

*Data files type

The format of sequencing data files.

Data files type	Description
bam	Binary SAM format for use by loaders that combine alignment and sequencing data
cram	SAM recoding using reference genome
sff	454 Standard Flowgram Format file
fastq	fastq files
PacBio_HDF5	PacBio hdf5 Format file
Oxford_Nanopore	Oxford Nanopore native data containing basecalled fast5 files

General information

* Platform

The sequencing platform and instrument model.

Platform	Sequencer
_LS454	454 GS
	454 GS 20
	454 GS FLX
	454 GS FLX+
	454 GS FLX Titanium
	454 GS Junior
ILLUMINA	HiSeq X Five
	HiSeq X Ten
	Illumina Genome Analyzer
	Illumina Genome Analyzer II
	Illumina Genome Analyzer IIx
	Illumina HiScanSQ
	Illumina HiSeq 1000
	Illumina HiSeq 1500
	Illumina HiSeq 2000
	Illumina HiSeq 2500
	Illumina HiSeq 3000
	Illumina HiSeq 4000
	Illumina HiSeq X
	Illumina MiniSeq
	Illumina MiSeq
	Illumina NovaSeq 6000
	Illumina NovaSeq X
	Illumina NovaSeq X Plus
	Illumina iSeq 100
	NextSeq 1000
	NextSeq 2000
	NextSeq 500
	NextSeq 550
HELICOS	Helicos HeliScope
ABI_SOLID	AB 5500 Genetic Analyzer
	AB 5500xl Genetic Analyzer
	AB 5500x-Wl Genetic Analyzer
	AB 5500xl-W Genetic Analysis System
	AB SOLiD 3 Plus System
	AB SOLiD 4 System
	AB SOLiD 4hq System
	AB SOLiD PI System
	AB SOLiD System
	AB SOLiD System 2.0
	AB SOLiD System 3.0
COMPLETE_GENOMICS	Complete Genomics
PACBIO_SMRT	PacBio RS
	PacBio RS II
	Revio
	Sequel
	Sequel II
	Sequel IIe
	Onso
ION_TORRENT	Ion Torrent PGM
	Ion Torrent Proton
	Ion Torrent S5 XL
	Ion Torrent S5
	Ion Torrent Genexus
	Ion GeneStudio S5
	Ion GeneStudio S5 Plus
	Ion GeneStudio S5 Prime
CAPILLARY	AB 310 Genetic Analyzer
	AB 3130 Genetic Analyzer
	AB 3130xL Genetic Analyzer
	AB 3500 Genetic Analyzer
	AB 3500xL Genetic Analyzer
	AB 3730 Genetic Analyzer
	AB 3730xL Genetic Analyzer
OXFORD_NANOPORE	GridION
	MinION
	PromethION
BGISEQ	BGISEQ-500
	BGISEQ-50
	BGISEQ-1000
	BGISEQ-100
DNBSEQ	DNBSEQ-E25
	DNBSEQ-G50(MGISEQ-200)
	DNBSEQ-G400(MGISEQ-2000)
	DNBSEQ-G400 FAST
	DNBSEQ-G99
	DNBSEQ-T1
	DNBSEQ-T5
	DNBSEQ-T7
	DNBSEQ-T10
	DNBSEQ-T10×4
	DNBSEQ-T20
	DNBSEQ-T20×2
	DNBSEQ-G800
GENEMIND	GenoCare 1600
	GenoLab M
	FASTASeq 300
	SURFSeq 5000
	SURFSeq Q
ELEMENT	Element AVITI
GENAPSYS	GS111
TAPESTRI	Tapestri
ULTIMA	UG 100
VELA_DIAGNOSTICS	Sentosa SQ301
CAPITALBIO	BioelectronSeq 4000
CycloneSEQ	CycloneSEQ-WT02

* Title

Short text that can be used to call out experiment records in searches or in displays.

Library

* Library name

Provide a name for your library if you have any.

* Strategy

The library strategy specifies the sequencing technique intended for this library.

Strategy	Description
WGA	Random sequencing of the whole genome following non-pcr amplification
WGS	Random sequencing of the whole genome
WXS	Random sequencing of exonic regions selected from the genome
RNA-Seq	Random sequencing of whole transcriptome
miRNA-Seq	Random sequencing of small miRNAs
WCS	Random sequencing of a whole chromosome or other replicon isolated from a genome
CLONE	Genomic clone based (hierarchical) sequencing
POOLCLONE	Shotgun of pooled clones (usually BACs and Fosmids)
AMPLICON	Sequencing of overlapping or distinct PCR or RT-PCR products
CLONEEND	Clone end (5', 3', or both) sequencing
FINISHING	Sequencing intended to finish (close) gaps in existing coverage
ChIP-Seq	Direct sequencing of chromatin immunoprecipitates
MNase-Seq	Direct sequencing following MNase digestion
DNase-Hypersensitivity	Sequencing of hypersensitive sites, or segments of open chromatin that are more readily cleaved by DNaseI
Bisulfite-Seq	Sequencing following treatment of DNA with bisulfite to convert cytosine residues to uracil depending on methylation status
Tn-Seq	Sequencing from transposon insertion sites
EST	Single pass sequencing of cDNA templates
FL-cDNA	Full-length sequencing of cDNA templates
CTS	Concatenated Tag Sequencing
MRE-Seq	Methylation-Sensitive Restriction Enzyme Sequencing strategy
MeDIP-Seq	Methylated DNA Immunoprecipitation Sequencing strategy
MBD-Seq	Direct sequencing of methylated fractions sequencing strategy
Synthetic-Long-Read	binning and barcoding of large DNA fragments to facilitate assembly of the fragment
ATAC-seq	Assay for Transposase-Accessible Chromatin (ATAC) strategy is used to study genome-wide chromatin accessibility. An alternative method to DNase-seq that uses an engineered Tn5 transposase to cleave DNA and to integrate primer DNA sequences into the cleaved genomic DNA
ChIA-PET	Direct sequencing of proximity-ligated chromatin immunoprecipitates
FAIRE-seq	Formaldehyde Assisted Isolation of Regulatory Elements. Reveals regions of open chromatin
Hi-C	Chromosome Conformation Capture technique where a biotin-labeled nucleotide is incorporated at the ligation junction, enabling selective purification of chimeric DNA ligation junctions followed by deep sequencing
ncRNA-Seq	Capture of other non-coding RNA types, including post-translation modification types such as snRNA (small nuclear RNA) or snoRNA (small nucleolar RNA), or expression regulation types such as siRNA (small interfering RNA) or piRNA/piwi/RNA (piwi-interacting RNA).
RAD-Seq
RIP-Seq	Direct sequencing of RNA immunoprecipitates (includes CLIP-Seq, HITS-CLIP and PAR-CLIP)
SELEX	Systematic Evolution of Ligands by EXponential enrichment
ssRNA-seq	Strand-specific RNA sequencing
snRNA-seq	Single nucleus RNA sequencing is a method for profiling gene expression in cells which are difficult to isolate.
Targeted-Capture	Enrichment of a targeted subset of loci.
Tethered Chromatin Conformation Capture
DIP-Seq	DNA immunoprecipitation sequencing (DIP-Seq)
GBS	Genotyping by sequencing is a method to discover single nucleotide polymorphisms for genotyping studies.
Inverse rRNA	Depletion of ribosomal RNA by oligo hybridization
NOMe-Seq	Nucleosome Occupancy and Methylome sequencing.
Ribo-Seq	Ribosome profiling (also named ribosome footprinting) that uses specialized messenger RNA (mRNA) sequencing to determine which mRNAs are being actively translated. It produces a "global snapshot" of all the ribosomes active in a cell at a particular moment, known as a translatome.
VALIDATION	CGHub special request: Independent experiment to re-evaluate putative variants
ChM-Seq	ChIPmentation combines chromatin immunoprecipitation with sequencing library preparation by Tn5 transposase
OTHER	Library strategy not listed (please include additional info in the “design description”)

* Source

The library source specifies the type of source material that is being sequenced.

Source	Description
GENOMIC	Genomic DNA (includes PCR products from genomic DNA)
TRANSCRIPTOMIC	Transcription products or non genomic DNA (EST, cDNA, RT-PCR, screened libraries)
METAGENOMIC	Mixed material from metagenome
METATRANSCRIPTOMIC	Transcription products from community targets
SYNTHETIC	Synthetic DNA
VIRAL RNA	Viral RNA
GENOMIC SINGLE CELL
TRANSCRIPTOMIC SINGLE CELL
TRANSCRIPTOMIC SPATIAL
OTHER	Other, unspecified, or unknown library source material (please include additional info in the “design description”)

*Selection

The library selection specifies whether any method was used to select for or against, enrich, or screen the material being sequenced.

Selection	Description
RANDOM	Random selection by shearing or other method
PCR	Source material was selected by designed primers
RANDOM PCR	Source material was selected by randomly generated primers
RT-PCR	Source material was selected by reverse transcription PCR
HMPR	Hypo-methylated partial restriction digest
MF	Methyl Filtrated
MDA	Multiple displacement amplification
MSLL	Methylation Spanning Linking Library
cDNA	Complementary DNA
ChIP	Chromatin immunoprecipitation
MNase	Micrococcal Nuclease (MNase) digestion
DNase	Deoxyribonuclease (DNase) digestion
Hybrid Selection	Selection by hybridization in array or solution
Reduced Representation	Reproducible genomic subsets, often generated by restriction fragment size selection, containing a manageable number of loci to facilitate re-sampling
Restriction Digest	DNA fractionation using restriction enzymes
5-methylcytidine antibody	Selection of methylated DNA fragments using an antibody raised against 5-methylcytosine or 5-methylcytidine (m5C)
MBD2 protein methyl-CpG binding domain	Enrichment by methyl-CpG binding domain
CAGE	Cap-analysis gene expression
RACE	Rapid Amplification of cDNA Ends
size fractionation	Physical selection of size appropriate targets
Padlock probes capture method	Circularized oligonucleotide probes
Oligo-dT	Enrichment of messenger RNA (mRNA) by hybridization to Oligo-dT
repeat fractionation	Selection for less repetitive (and more gene rich) sequence through Cot filtration (CF) or other fractionation techniques based on DNA kinetics
Inverse rRNA	Depletion of ribosomal RNA by oligo hybridization
Inverse rRNA selection	Depletion of ribosomal RNA by inverse oligo hybridization.
PolyA	PolyA selection or enrichment for messenger RNA (mRNA); should replace cDNA enumeration
cDNA_oligo_dT
cDNA_randomPriming
other	Other library enrichment, screening, or selection process (please include additional info in the “design description”)
unspecified	Library enrichment, screening, or selection is not specified (please include additional info in the “design description”)

* Layout

The library layout specifies whether to expect single or paired reads.

Layout	Description
fragment/single	Single-end read
paired	Paired-end reads

Design description

The goal and setup of the individual library.

Data files

*File name

The name of a sequence data file.

*MD5 value

MD5 checksum of a sequence data file.

Status

The status of file uploaded to FTP servers.

Status	Description
Not uploaded	The system has not detected the data file. Either detection is still in progress (it can take from a few minutes to a few hours, so please check again later), or the file has not been uploaded, or the uploaded file name does not match the file name in your metadata.
Calculating	The data file has been uploaded, but the MD5 value of the file has not been calculated or is being calculated.
MD5 mismatch	The data file has been fully or partially uploaded, but the MD5 value calculated by the system does not match the MD5 value filled in by the user.
Check finished	The data file has been uploaded and the check has finished.

Assembly

An assembly is a collection of genomic sequences that are used to represent the genome of an organism.

General information

Submission type

Field	Description
Submit batch assemblies	You will be asked to upload a text file that describes your metadata and submit your data files in batches.
Submit a single assembly	You will be asked to manually complete a web form to describe your Assembly and upload your data.

*Project accession

Select the project to which this assembly belongs.

*Sample accession

Select the sample this assembly uses.

Metadata

Assembly metadata

Field	Description
*assembly_name	Assembly name (e.g. GRCh37.p5).
*assembly_method	The software used for the genome assembly.
*assembly_method_version	The version of the software used for the genome assembly.
*sequencing_technology	Sequencing platform.
*sequencing_depth	Sequencing depth.
assembly_min_gap_length	The minimum length of a run of Ns to be considered a gap.
*assembly_mol_type	DNA assembly, RNA assembly, or virus assembly.
*genome_type	Such as whole genome assembly, chloroplast genome assembly, metagenome assembly, etc.
*assembly_level	Such as Chromosome level, scaffold level, etc.

Data files

*File type

The assembly data file format.

File type	Description
Fasta	Sequence data format indicating sequence base calls. Format: a header line initiated with the > character, data lines following with base calls.

*File name

The name of an assembly data file.

*MD5 value

MD5 checksum of an assembly data file.

Status

The status of file uploaded to FTP servers.

Status	Description
Not detected	The system has not detected the data file. Either detection is still in progress (it can take from a few minutes to a few hours, so please check again later), or the file has not been uploaded, or the uploaded file name does not match the file name in your metadata.
Calculating	The data file has been uploaded, but the MD5 value of the file has not been calculated or is being calculated.
MD5 mismatch	The data file has been fully or partially uploaded, but the MD5 value calculated by the system does not match the MD5 value filled in by the user.
Check finished	The data file has been uploaded and the check has finished.

Variation

CNSA accepts genomic variations from any species, including single nucleotide polymorphisms, short insertions/deletions and genomic structural variations, etc., and provides long-term stable archive accessions and data. The variation data includes Analysis, Samplesets, Subject, Call, File and Region.

Submission template

There are three templates for submission of variations.

SNP_submission_template.v1.1.xlsx is for the submission of simple and small-scale genomic variations <= 50 bp, such as single nucleotide polymorphisms (SNP), short insertions and deletions (INDEL), microsatellites, etc., which includes four parts: Analysis, Samplesets, Subject, File, and they are all required.

SV_submission_template.v1.1.xlsx is for the submission of complex and large-scale genomic structural variations (SV) > 50 bp, such as insertions, deletions, duplications, inversions, translocations, mobile elements, etc., which includes six parts: Analysis, Samplesets, Subject, Call, Region, File. Analysis, Samplesets and Subject are required, and at least one of Call and File is required. Region is optional.

CAHV_Submission_template.v1.0.xlsx is for the submission of Clinically Associated Human Variations (CAHV), including genomic variations and related phenotypes and clinical significance, etc., which includes four parts: Analysis, Samplesets, Subject, Call, and they are all required.

Metabolism

Metabolomics data, including metadata such as protocols, assays, samples, and raw data files for metabolomics study.

Metabolic metadata includes descriptions and assays:

"Descriptions" is a description of the research designs, factors and samples, etc.

"Assay" is a description of the research metabolites and instrument parameters, etc.

Submission template

The submission of metabolic metadata contains four templates:

Template_Metabolism_Descriptions_MS_NMR.xlsx is used to submit the description information for metabolism.

Template_Metabolism_Assay_GC-MS.xlsx is used to submit assay information using gas chromatography-mass spectrometry (GC-MS) technology.

Template_Metabolism_Assay_LC-MS.xlsx is used to submit assay information using liquid chromatography-mass spectrometry (LC-MS) technology.

Template_Metabolism_Assay_NMR.xlsx is used to submit assay information using nuclear magnetic resonance (NMR) technology.

Single cell

Analysis results of data generated using single cell technology.

Single cell metadata mainly includes description information of gene expression files, metadata files, cluster files and other files.

Submission template

Template_Single_Cell.xlsx

Virus sequence

Viral sequence data, including assembled or non-assembled virus sequences.

Virus sequence metadata mainly includes sequence information and related sample information, experiment information, submission lab, etc.

Submission template

Template_Virus_Sequence.xlsx

Sequence

Sequence data other than the genome assembly sequences of the species, including sequence data of ribosomal RNA (rRNA), rRNA-ITS, metazoan COX1, mRNA, genomic DNA, organelle, ncRNA, plasmids, phages, synthetic constructs, etc.

For specific field information of sequence metadata, please refer to CNSA_Sequence_Submission_Instructions.docx.

Submission template

Templete_Sequence_File_List.xlsx is used to submit a list of data files.

Template_sequence_v1.0.gb is used to submit metadata and sequences.

Data file format

Run

CNSA accepts six data file formats: FASTQ, BAM, SFF, PacBio_HDF5, Oxford_Nanopore and CRAM.

FASTQ format

We recommend FASTQ format. Single and paired reads are accepted. Please note that multiple files must not be bundled into one compressed file for upload. All file names in your account folder must be unique.

Quality scores must be in Phred scale.
Both ASCII and space-delimited decimal encodings of quality scores are supported. We will automatically detect the Phred quality offset of either 33 or 64.
No technical reads (adapters, linkers, barcodes) are allowed.
Single reads must be submitted using a single Fastq file and can be submitted with or without read names.
Paired reads must be split and submitted using two Fastq files. The read names must have a suffix identifying the first and second read from the pair, for example '/1' and '/2' (regular expression for the reads: "^@([a-zA-Z0-9_-]+:[0-9]+:[a-zA-Z0-9]+:[0-9]+:[0-9]+:[0-9-]+:[0-9-]+) ([12]):[YN]:[0-9]*[02468]:[ACGTN]+$").
The first line for each read must start with '@'.
The base calls and quality scores must be separated by a line starting with '+'.
The Fastq files must be compressed using gzip or bzip2.
The regular expression for bases is "^([ACGTNactgn.]*?)$"
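The structural rules in the list above can be checked before upload with a few lines of Python. The following is only a sketch under stated assumptions, not the check CNSA itself runs: it assumes the usual four-line record layout, and the function name check_fastq_records is hypothetical. The pattern for base calls is copied from the list above.

```python
import re

# Pattern for base calls, copied from the rule list above.
BASES_RE = re.compile(r"^([ACGTNactgn.]*?)$")


def check_fastq_records(lines):
    """Return a list of problems found in FASTQ text given as a list of lines.

    Assumption: each record spans exactly four lines
    (read name, base calls, '+' separator, quality scores).
    """
    problems = []
    if len(lines) % 4 != 0:
        problems.append("line count is not a multiple of 4")
    # Only inspect complete four-line records.
    for i in range(0, len(lines) - len(lines) % 4, 4):
        name, bases, sep, _quals = lines[i:i + 4]
        record = i // 4 + 1
        if not name.startswith("@"):
            problems.append(f"record {record}: name line does not start with '@'")
        if not BASES_RE.match(bases):
            problems.append(f"record {record}: unexpected characters in base calls")
        if not sep.startswith("+"):
            problems.append(f"record {record}: separator line does not start with '+'")
    return problems
```

For a gzip-compressed file, the lines can be obtained with gzip.open(path, "rt") and stripped of their trailing newlines before being passed to the function.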

Example of FASTQ file containing single reads:

@read_name
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
...

Example of FASTQ file containing paired reads:

@read_name/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@read_name/2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
...
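The '/1' and '/2' naming convention for paired reads can be verified once the read names of both files have been collected. This is a minimal sketch, assuming both files list the reads in the same order; the function name check_pair_names is hypothetical and the check is not part of CNSA.

```python
def check_pair_names(names_1, names_2):
    """Check that read names from two FASTQ files pair up via '/1' and '/2'.

    Assumption: both lists hold the names (without the leading '@')
    in the same order as the reads appear in the two files.
    """
    problems = []
    if len(names_1) != len(names_2):
        problems.append("the two files contain different numbers of reads")
    for index, (first, second) in enumerate(zip(names_1, names_2), start=1):
        if not first.endswith("/1") or not second.endswith("/2"):
            problems.append(f"pair {index}: expected '/1' and '/2' suffixes")
        elif first[:-2] != second[:-2]:
            # The part before the suffix must be identical for both mates.
            problems.append(f"pair {index}: read names differ")
    return problems
```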

BAM file

Submitted BAM files must be readable with Samtools.

BAM file names must end with the .bam suffix.

Multiple files must not be bundled into one compressed file for upload.

All file names in your account folder must be unique.

SFF format

The SFF format is supported for the 454 and Ion Torrent platforms.

SFF file names must end with the .sff suffix.

Multiple files must not be bundled into one compressed file for upload.

All file names in your account folder must be unique.

PacBio_HDF5 format

PacBio_HDF5 data submissions are supported in the platform-specific native format.

One run consists of *.bax.h5, *.bas.h5 and xml files. These files should be tarred and compressed.

PacBio_HDF5 data must be submitted as a single tar.gz or tar.bz file.

Files from different runs must not be bundled into one compressed file for upload.

All file names in your account folder must be unique.

CRAM format

CRAM is a sequencing read file format that is highly space efficient by using reference-based compression of sequence data and offers both lossless and lossy modes of compression. Please refer to CRAMv3.0 for the specific format.

CRAM file names must end with the .cram suffix.

Multiple files must not be bundled into one compressed file for upload.

All file names in your account folder must be unique.

Assembly

Genome assembly submissions include plasmids, organelles, complete virus genomes, viral segments/replicons, bacteriophages, prokaryotic and eukaryotic genomes. Chromosomes include organelles (e.g. mitochondrion and chloroplast), plasmids and viral segments.

Sequences should be submitted as a Fasta file. These sequences can be either contig, scaffold or chromosome sequences.

The submitted Fasta file must be gzip-compressed and should specify the classification (contig, scaffold or chromosome) in the file name.

All file names in your account folder must be unique. Multiple files must not be bundled into one compressed file for upload.

Fasta format

Format:

The sequence name is extracted from the header line starting with >.

For example, the following sequence has name contig1:

>contig1
AAACCCGGG...
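As an illustration of how sequence names follow from header lines, the sketch below collects them from Fasta text. It is written under assumptions: the name is taken to be the first whitespace-separated token after '>', and the function name fasta_sequence_names is hypothetical.

```python
def fasta_sequence_names(lines):
    """Return the sequence names taken from header lines that start with '>'."""
    names = []
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            # Assumption: the name is the first whitespace-separated token.
            names.append(header.split()[0] if header else "")
    return names
```

Because submitted Fasta files are gzip-compressed, the lines can be read with gzip.open(path, "rt"). Comparing len(names) with len(set(names)) is a quick way to spot duplicate sequence names.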

Variation

CNSA currently only accepts variation data in VCF format. Please note that your variation data needs to be converted to VCF file format. To ensure that the format of VCF file is correct, you are advised to refer to VCFv4.3.

VCF file

VCF is a text file format (most likely stored in a compressed manner). It contains meta-information lines (prefixed with “##”), a header line (prefixed with “#”), and data lines each containing information about a position in the genome and genotype information on samples for each position (text fields separated by tabs). Zero length fields are not allowed, a dot (“.”) must be used instead. In order to ensure interoperability across platforms, VCF compliant implementations must support both LF (\n) and CR+LF (\r\n) newline conventions.

For example:

##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
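The three kinds of VCF lines described above can be told apart by their prefixes. The sketch below separates them and rejects zero-length fields. It is an illustration rather than a full VCF validator, and the function name split_vcf is hypothetical. It expects real tab-separated VCF text; the example above is shown with spaces for readability.

```python
def split_vcf(lines):
    """Separate VCF text into meta-information lines, header columns and data rows."""
    meta, header, rows = [], None, []
    for line in lines:
        line = line.rstrip("\r\n")  # accept both LF and CR+LF line endings
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line)
        elif line.startswith("#"):
            header = line[1:].split("\t")
        else:
            fields = line.split("\t")
            if "" in fields:
                # Zero-length fields are not allowed; a dot must be used instead.
                raise ValueError("zero-length field; use '.' instead")
            rows.append(fields)
    return meta, header, rows
```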

Metabolism

CNSA accepts various metabolic data file formats, such as RAW, ZIP, JDX, CDF, WIFF, JCAMP, TXT, etc.

Single cell

The single cell data files accepted by CNSA include Gene expression file, Metadata file, Cluster file and Other files.

Gene expression file

Gene expression scores can be represented either as an "Expression matrix" or "MM coordinate matrix".

Expression matrix

gene    cell_1    cell_2    cell_3    ...
Trp53   0         0         0         ...
Apoe    0         5.098     0         ...
Tlr4    0         0         0         ...
Lep     0         0         0.123     ...
Il6     1.234     0         0         ...
...

An Expression matrix file is a dense matrix file that has a header row containing the value "gene" followed by the single cell names. The filename suffix can be .txt, .txt.gz, .tsv, .tsv.gz, .csv or .csv.gz.
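To make the dense layout concrete, the following sketch reads an expression matrix into a nested dictionary. It assumes tab-delimited text by default (pass delimiter="," for .csv files); the function name read_expression_matrix is hypothetical.

```python
import csv
import io


def read_expression_matrix(text, delimiter="\t"):
    """Parse a dense expression matrix into {gene: {cell: score}}."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    if header[0] != "gene":
        raise ValueError('first header cell must be "gene"')
    cells = header[1:]
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, values = row[0], row[1:]
        if len(values) != len(cells):
            raise ValueError(f"{gene}: expected {len(cells)} values, got {len(values)}")
        matrix[gene] = {cell: float(v) for cell, v in zip(cells, values)}
    return matrix
```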

MM coordinate matrix

%%MatrixMarket    matrix    coordinate    real    general
%
12345             32456     234567
1                 56        1
1                 12        2
...

An MM coordinate matrix file is a Matrix Market file that contains a sparse matrix in coordinate form. The filename suffix can be .mtx, .mtx.gz, .mm, .mm.gz, .txt or .txt.gz. Please note that the MM coordinate matrix file must be sorted before upload, and that the Genes file and Barcodes file must be uploaded separately; their filename suffix can be .csv, .csv.gz, .tsv or .tsv.gz. A Genes file lists all annotated genes, one gene per row, with the gene id in the first column and the gene name in the second column. A Barcodes file contains the barcodes represented in the MM coordinate matrix file.

The top three lines of MM coordinate matrix file are header lines.
The third line contains the total number of rows in each of the three files (Genes file, Barcodes file, MM coordinate matrix file).
The next lines (line number 4 onwards) have three columns:
- The first column refers to the "gene id" index. The "gene id" indices correspond to the entries in the Genes file.
- The second column refers to "cell id" index. The "cell id" indices correspond to the entries in the Barcodes file.
- The third column represents the total Unique Molecular Identifiers (UMI) count per cell and gene combination.
- The index in the MM coordinate matrix file is 1-based.
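The relationship between the three files can be sketched as follows. The code assumes the usual Matrix Market convention that the size line lists the number of genes, the number of barcodes and the number of entries, in that order, and that lines starting with '%' are header or comment lines; the function name read_mm_coordinate is hypothetical.

```python
def read_mm_coordinate(lines, genes, barcodes):
    """Return (gene, barcode, count) tuples from Matrix Market coordinate lines.

    genes and barcodes are lists read from the Genes file and the Barcodes file.
    """
    data = [ln for ln in (raw.strip() for raw in lines)
            if ln and not ln.startswith("%")]
    # Assumption: the size line holds genes, barcodes and entries, in that order.
    n_genes, n_cells, n_entries = (int(x) for x in data[0].split())
    if n_genes != len(genes) or n_cells != len(barcodes):
        raise ValueError("size line does not match the Genes/Barcodes files")
    entries = data[1:]
    if len(entries) != n_entries:
        raise ValueError("size line does not match the number of entries")
    result = []
    for entry in entries:
        gene_idx, cell_idx, value = entry.split()
        # Indices are 1-based in the file, so subtract 1 for Python lists.
        result.append((genes[int(gene_idx) - 1], barcodes[int(cell_idx) - 1], float(value)))
    return result
```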

Metadata file

name      cluster    sub_cluster    average_intensity    sample_name         experiment_accession
type      group      group          numeric              sample_attribute    experiment_attribute
cell_1    clst_A     clst_A_1       6.687                sample1             CNXxxxxxx
cell_2    clst_A     clst_A_1       -12.625              sample1             CNXxxxxxx
...

A Metadata file is a tab-delimited text file containing cell-level annotations. The filename suffix can be .txt or .txt.gz.

A metadata file has at least 2 columns.
A header row containing the value "name" and column names for cluster-level annotations.
A second row with:
- The header of "type" to declare metadata types (see below).
- A value for each metadata column declaring its datatype
  - "group" (set membership) values are treated as literal strings.
  - "numeric" (continuous scores) values are treated as floating-point numbers.
  - "sample_name" (sample name) is a sample name you submitted in CNSA, required.
  - "experiment_accession" (experiment accession) is an experiment accession you submitted in CNSA, optional.

Cluster file

name      X          Y          Z          category    intensity
type      numeric    numeric    numeric    group       numeric
cell_1    34.472     32.211     60.035     C           0.719
cell_2    15.975     10.043     21.424     B           0.904
...

A Cluster file contains any cluster ordinations and optional cluster-specific metadata. The filename suffix can be .txt or .txt.gz.

A cluster file has at least 3 columns.
A header row containing the value "name", "X", "Y", optionally "Z", and columns containing cell-level annotations
A second row with:
- The header of "type" to declare metadata types (see below).
- A value for each metadata column declaring its datatype
  - "group" (set membership) values are treated as literal strings.
  - "numeric" (continuous scores) values are treated as floating-point numbers.
  - The values for the "X", "Y", and "Z" columns must be set to "numeric".
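The header rules for a cluster file can be checked on its first two rows before upload. The sketch below is an illustration of those rules only, not the validation performed by CNSA, and the function name check_cluster_header is hypothetical.

```python
def check_cluster_header(header_row, type_row):
    """Return a list of problems in the first two rows of a cluster file."""
    problems = []
    if len(header_row) < 3:
        problems.append("a cluster file needs at least 3 columns")
    if len(header_row) != len(type_row):
        problems.append("header row and type row have different lengths")
    if not header_row or header_row[0] != "name":
        problems.append('first header cell must be "name"')
    if not type_row or type_row[0] != "type":
        problems.append('first cell of the second row must be "type"')
    # Map each column name to its declared datatype.
    types = dict(zip(header_row[1:], type_row[1:]))
    for axis in ("X", "Y"):
        if axis not in types:
            problems.append(f'missing required column "{axis}"')
    for column, datatype in types.items():
        if datatype not in ("group", "numeric"):
            problems.append(f'column "{column}": unknown type "{datatype}"')
        elif column in ("X", "Y", "Z") and datatype != "numeric":
            problems.append(f'column "{column}" must be "numeric"')
    return problems
```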

Other files

Any documentation or other supporting files you have.

Virus sequence

CNSA currently accepts virus data files in FASTA format.

The file extensions can be: .fa.gz, .fa.bz2, .fasta.gz, .fasta.bz2, .fasta, .fa.

Sequence

CNSA currently accepts sequence data in GenBank Flat File Format (GBFF). Please do not change the sequence file format at will, otherwise the format check will fail. For the specific format, please refer to the sequence template and sequence submission instructions.

If a sample has multiple sequences, please put all the sequences of the sample in a file, and a single file allows up to 10,000 sequences.

Data upload

FTP data upload

Users can upload data files to their personal directories via FTP. For data security, CNSA uses FTPS (FTP over TLS) for data upload.

General instructions for uploading files using an FTP client

Use your favorite ftp client, such as FileZilla.
Use binary mode for file transfers.
Use ftp.cngb.org as the target host.
Login with your FTP username and password (available in the data submission process or "My service").
Upload files to your private FTP upload area.
Please select Passive for Transmission Mode in Transmission Settings, and select "Use explicit FTP over TLS" or "FTPS" for Encryption in General Settings.
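For scripted uploads, the same settings (explicit FTP over TLS, passive mode, binary transfer) can be expressed with the ftplib module of the Python standard library. This is a sketch under assumptions, not an official CNSA client: the function names connect and upload_files are hypothetical, and the username and password are the FTP credentials mentioned above.

```python
from ftplib import FTP_TLS
from pathlib import Path


def connect(host, username, password):
    """Open an explicit FTP-over-TLS session in passive mode."""
    ftps = FTP_TLS(host)
    ftps.login(username, password)  # secures the control connection before login
    ftps.prot_p()                   # encrypt the data channel as well
    ftps.set_pasv(True)             # passive transfer mode
    return ftps


def upload_files(ftps, paths):
    """Upload each path in binary mode over an already connected session."""
    uploaded = []
    for path in map(Path, paths):
        with path.open("rb") as handle:
            # STOR stores the file under its own name in the current directory.
            ftps.storbinary(f"STOR {path.name}", handle)
        uploaded.append(path.name)
    return uploaded
```

A typical call would be upload_files(connect("ftp.cngb.org", username, password), ["reads_1.fq.gz"]), where the file name is only an example.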

General instructions for uploading files using FTP Command Line Client On Linux/Mac

Go to the folder where the files for submission are.
Use the following command to establish an FTP connection:
lftp FTP_username:password@ftp.cngb.org (e.g. ngb_xxx:password@ftp.cngb.org)
FTP username and password are available in the data submission process or "My service".
Use the following command to copy the data files to your private FTP upload directory:
Copy a single file: put file
Copy multiple files: mput files

Note: In the user's personal FTP directory, CNSA will retain the user-uploaded data files until all data files have been successfully submitted and archived. The FTP directory provided to the user for uploading data is a temporary directory and is not suitable for long-term data storage. Data files will be deleted if they have been on the FTP server for more than 1 month without the relevant metadata being submitted.

Data download

FTP download

Click "Download" in the navigation bar of the CNSA homepage to enter the data download page.
You can directly click "FTP" on the page to enter CNSA FTP, select the required file, and click to download.

MD5 check

Large file transfers do not always complete successfully over the internet.

An MD5 checksum can be computed for a file before and after transfer to verify that the file was transmitted successfully.

MD5 (Message Digest Algorithm 5) is a hash function which calculates a hash value (the MD5 value, a string of 32 hexadecimal digits and letters) of a given file.

You must provide an MD5 checksum for each file submitted to the archive. We will re-compute and verify the MD5 checksum to make sure that the file transfer was completed without any changes to the file contents.
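Besides the command-line tools, the MD5 value can also be computed with the hashlib module of the Python standard library, which works the same way on every platform. The following is a minimal sketch; the function name md5_of_file is hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file as a 32-character hexadecimal string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large sequencing files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running the function on the local file before upload gives the value to enter in the MD5 value field.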

Obtain MD5 value (Linux)

Obtain the MD5 values of the files by executing:

$ md5sum file1 file2
9F6E6800CFAE7749EB6C486619254B9C file1
B636E0063E29709B6082F324C76D0911 file2

Obtain MD5 value (Mac OS X)

Obtain the MD5 values of the files by executing:

$ md5 file1 file2
MD5 (file1) = 9f6e6800cfae7749eb6c486619254b9c
MD5 (file2) = b636e0063e29709b6082f324c76d0911

Obtain MD5 value (Windows)

Method 1:

First, press [Win] + [R] on the keyboard to open the Run window, and then enter “cmd” in the window that pops up.

Press "Enter" to open the cmd command line interface.

Use the following command to calculate the MD5 value:

CertUtil -hashfile Path\filename MD5

Method 2 (For Win10):

First, search for Windows PowerShell.

Next, open Windows PowerShell.

Use the following command to calculate the MD5 value:

Get-FileHash Path\filename -Algorithm MD5 | Format-List

Method 3:

Install and run the Fsum Frontend (sourceforge.net/projects/fsumfe/).

First, tick the "md5" checkbox.

After clicking the [+] button, open the sequence data files that you need. You can select multiple files at the same time.

Click the [Calculate hashes] button. The MD5 values of the files are displayed.

By clicking the [Export] button, you can obtain the list of the MD5 values as an HTML, CSV, or XML file.

Ethics and regulations on human genetic resources

When submitting data from human subjects (human data) to CNSA, it is the submitter's responsibility to ensure that the dignity and rights of the human subjects are protected in accordance with all applicable laws, ordinances, guidelines and policies of the submitter's institution. In principle, remove any direct personal identifiers of human subjects from the data to be submitted.

When submitting data to CNSA, users must follow the Interim Measures for the Human Genetic Resources Regulations and the ethical norms of their countries, provide real organization and contact information, and take responsibility for the legality and compliance of the data they upload. CNSA will receive raw data and assembly data from animals, plants, microorganisms, etc.

Numbering rules

Numbering rules for projects, samples, experiments, runs, and assemblies

Data type	Numbering rules	Example
Project	“CNP”+ 7 numerals	CNP0000063
Sample	“CNS”+ 7 numerals	CNS0001796
Experiment	“CNX”+ 7 numerals	CNX0002218
Run	“CNR”+ 7 numerals	CNR0002529
Assembly	“CNA”+ 7 numerals	CNA0001632
Metabolism	“METM”+ 7 numerals	METM0001234
Single cell	“CSE”+ 7 numerals	CSE0001234
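The numbering rules in the table above translate directly into a pattern check, which can be handy when accessions are copied into scripts or spreadsheets. The sketch below is illustrative only; the names ACCESSION_PREFIXES and accession_type are hypothetical.

```python
import re

# Prefixes taken from the table above; each is followed by 7 numerals.
ACCESSION_PREFIXES = {
    "CNP": "Project",
    "CNS": "Sample",
    "CNX": "Experiment",
    "CNR": "Run",
    "CNA": "Assembly",
    "METM": "Metabolism",
    "CSE": "Single cell",
}
ACCESSION_RE = re.compile(r"(CNP|CNS|CNX|CNR|CNA|METM|CSE)([0-9]{7})")


def accession_type(accession):
    """Return the data type of an accession, or None if it matches no rule."""
    match = ACCESSION_RE.fullmatch(accession)
    return ACCESSION_PREFIXES[match.group(1)] if match else None
```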

Numbering rules for variations

Variation data type	Numbering rules
Call (SNP)	varc+01+numbers (01 means the variation is less than or equal to 50 bp in length, and the following numbers are assigned cumulatively. For example, varc012341 represents the 2341st variation.)
Call (SV)	varc+02+numbers (02 means the variation is more than 50 bp in length, and the following numbers are assigned cumulatively. For example, varc022341 represents the 2341st variation.)
Call (CAHV)	varc+03+numbers (03 means clinically associated human variations, and the following numbers are assigned cumulatively. For example, varc032341 represents the 2341st variation.)
Analysis	CVA0000001 (the numbers are assigned cumulatively)
File	CVF0000001 (the numbers are assigned cumulatively)
Subject	CVS0000001 (the numbers are assigned cumulatively)
Region	varr+02+numbers (02 means the variation is more than 50 bp in length, and the following numbers are assigned cumulatively. For example, varr022341 represents the 2341st variation.)

Numbering rules for sequences

Data Type	Data source	Accession format	Accession example
Nucleotide	Direct submissions	“N” + underscore + 9 numerals	N_000001234
Nucleotide	WGS/TSA	“N” + underscore + 6 letters + 2 numerals for WGS assembly version + 7 numerals. Scope of the 6 letters: genome assembly AAAAAA-TZZZZZ; transcriptome assembly UAAAAA-ZZZZZZ	N_AAADBH010000000, N_UAADBH010000000

Data submission

Notes

Please read the "Data Submission Guide" carefully before submitting the data.
CNSA currently accepts online submissions of projects, samples, experiments/runs, assemblies, and variations.
Before submitting data, you need to register/login and fill in the submitter information.
The project and sample must be submitted prior to submitting the experiment/run, assembly, and variation.
The sample can be submitted independently, but the sample is only associated with the project after the relevant data has been submitted.
In the data submission process, fields marked with * are required; all other fields are optional.
If your submission includes data files, it is recommended to upload the data files before submitting the metadata, so that the submission process can be completed more quickly.
After the data submission is completed, the system will automatically jump to the corresponding data type under “My submission” after 10 seconds.
1. You can click the submission ID of a completed submission to view its details.
2. In the Status column, you can directly view the accession of a single submission. For a batch submission, you can download the submitted file with attributes and accessions.
3. For public data of a single submission, you can click on the data accession in the status column to access the public details page. For public data of a batch submission, you can search the data accessions on the CNSA home page to access the details page.
4. You can click the "pencil icon" in the status column to modify a submission. If the status column does not have a "pencil icon", please send an email to datasubs@genomics.cn and indicate the submission ID or data accession and the reason for modification.
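
The ordering rule in the notes above (project and sample before experiment/run, assembly, and variation) can be expressed as a small lookup. This is only an illustrative sketch of that rule; the names `PREREQUISITES` and `missing_prerequisites` are hypothetical and do not correspond to any CNSA interface.

```python
# Prerequisites as stated in the notes above: a project and a sample must
# exist before an experiment/run, assembly, or variation is submitted.
PREREQUISITES = {
    "project": set(),
    "sample": set(),
    "experiment/run": {"project", "sample"},
    "assembly": {"project", "sample"},
    "variation": {"project", "sample"},
}


def missing_prerequisites(data_type, already_submitted):
    """Return the sorted data types that still need to be submitted first."""
    return sorted(PREREQUISITES[data_type] - set(already_submitted))
```

For instance, asking about an assembly when only a project exists reports that the sample is still missing.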

Signup/Login

Users can reach the CNSA homepage (Fig. 1) at https://db.cngb.org/cnsa. Click the “Login/Signup” tab on the right side of the page to enter the Login/Signup page (Fig. 2). (Note: You need to register before you can log in and submit data.)

Submitter information

CNSA obtains part of the submitter information from the user's account. The submitter information filled out by the user is bound to the submitted project, sample, experiment/run, assembly, and variation data. When you submit data, if the submitter information has not been filled out, the system will jump to the CNSA homepage and ask you to add it; if the submitter information is complete, the system will skip this page and enter the submission page directly. To change the submitter information, click "My CNGBdb" in the home navigation, select "Submit" in the drop-down menu, and go to the CNSA homepage (Figure 3) to modify it. The modified submitter information will be bound to data that is being submitted or will be submitted in the future.

Submission portal

Projects, samples, experiments/runs, assemblies, and variations can be submitted through their respective submission portals. Please click "Submit" on the home page or "Submission portal" in the navigation bar to enter the Submission portal page (Fig. 4), and then click the submission portal for the corresponding data type to submit the data. You can also enter the corresponding submission process by clicking the "New submission" button under each data type in “My submission”.

Submit project

Project submission portal

Click on "Project" (Fig. 5) on the Submission portal page to enter the submission process.

Data management

On the data management page (Fig. 6), you need to choose a Data access manner. If you choose “Public” or “Controlled”, you need to set a release date. Then click "Save and continue" to proceed to the next step.

Note:

When selecting the Data access manner, you need to carefully read the prompt information under the option. After submitting, you cannot change it yourself. If you need to make changes, please send an email to datasubs@genomics.cn with the project accession and the reason for the change. If you need to change the controlled data to public, please send an email to “datasubs@genomics.cn” to apply and prepare the corresponding review materials.
If you need to change the release date, click “My submission”, find the submission under the project, and click the "pencil icon" in the release date column to edit it.

General information

On the general information page (Fig. 7), fill in the information of the Project title, Relevance, Public description, External links, Related projects, etc., and then click "Save and Continue" to proceed to the next step.

Detailed information

On the detailed information page (Fig. 8), select the project type (you can choose more than one) and sample scope. The literature information is optional. Then click "Save and Continue" to proceed to the next step. If your article has not been published yet, you can add the article information after publication: click “My submission” in the navigation bar, find the corresponding submission ID, and click the "pencil icon" in the status column to enter the modification process. If the status column does not have a "pencil icon", please contact datasubs@genomics.cn and indicate the project accession in the email.

Overview

The overview page (Fig. 9) summarizes the information filled in during the previous steps. If you find any errors, please click “Previous” to return to the relevant page and make the corresponding changes. If everything is correct, please click “Submit”. While you fill in the project information, the system retains your most recent entries.

My submission-Project

After the project is submitted, the system will automatically assign a project accession (CNPXXXXXXX) and jump to "My submission-Project" after 10 seconds. You can view the project accession on this page (Fig. 10).

Submit sample

Sample submission portal

Click on "Sample" (Fig. 11) on the Submission portal page to enter the submission process.

Submission type

Select a submission type (Fig. 12). If you submit multiple samples at a time, we recommend that you choose the batch submission, which is more convenient and faster than a single submission. You will need to download the batch submission template for the samples first in the submission process, then fill it out and upload it. If you submit only one sample at a time, we recommend that you choose a single submission. You need to fill out the sample information online in the submission process.

Submit batch samples

Select a sample type (Fig. 13). First select a main category from the drop-down list in the left input box, then select a subcategory from the drop-down list in the right input box. Please select the sample type carefully: once the submission is complete, you cannot modify it yourself. If you need to modify it, please send an email to datasubs@genomics.cn with the submission ID and the reason for the modification.

Download the batch submission template for samples, fill it out and upload it (Fig. 13). Each sample must be distinguishable from every other submitted sample. To distinguish your samples, add or modify attributes other than the three fields "sample_name, sample_title, description" (optional attributes can be added).
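
Before uploading a filled-in template, it can help to look for rows that differ only in the three fields named above. The sketch below is an assumption-laden illustration, not the actual CNSA check: it treats each template row as a dictionary, ignores `sample_name`, `sample_title` and `description`, and reports pairs of rows whose remaining attributes are identical. The function name and the example attribute names used in any call are hypothetical.

```python
# The three fields that, per the text above, are not enough to distinguish samples.
IGNORED_FIELDS = {"sample_name", "sample_title", "description"}


def find_duplicate_samples(rows):
    """Return (first_index, duplicate_index) pairs for rows that are
    identical once the ignored fields are left out."""
    seen = {}
    duplicates = []
    for index, row in enumerate(rows):
        # Field names are unique within a row, so sorting never compares values.
        key = tuple(sorted((name, value) for name, value in row.items()
                           if name not in IGNORED_FIELDS))
        if key in seen:
            duplicates.append((seen[key], index))
        else:
            seen[key] = index
    return duplicates
```

Rows reported by this sketch would need at least one additional or modified attribute before they can be told apart.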

If the file fails the check, modify it according to the check rule and error line number shown in the pop-up box (Fig. 14), and then upload it again.

After the check passes, click “Submit”. The system will automatically assign the sample accessions (CNSXXXXXXX) and jump to “My submission-Sample” after 10 seconds. The metadata file with accessions can be downloaded in the status column of this page (Fig. 15).

Submit a single sample

Sample type

On the sample type page (Fig. 16), please carefully select the sample type, then click “Save and continue” to proceed to the next step. Once the process is submitted, you cannot modify it yourself. If you need to modify it, please send an email to datasubs@genomics.cn with the submission ID and the reason for the modification.

Sample attributes

On the sample attributes page (Fig. 17), fill in the attributes for your sample type (different sample types require different attributes), then click "Save and continue". If some fields do not pass the check, please modify them according to the prompt information, then click "Save and continue" again to proceed to the next step.

Overview

This page (Fig. 18) summarizes the information filled in during the previous steps. If you find any errors, please click “Previous” to return to the relevant page and make the corresponding changes. Throughout the submission process, the system retains the last saved result when you quit the system. If everything is correct, please click “Submit”.

My submission-Sample

After the sample is submitted, the system will automatically assign a sample accession such as CNSXXXXXXX and jump to “My submission-Sample”. The sample accession can be viewed in the status column of that page (Fig. 19).

Submit experiment/run

You need to submit the metadata and data files for the experiment/run. Before submitting the experiment/run data, please create the project and sample first.

Experiment/run submission portal

Click on "Experiment/run" (Figure 20) on the Submission portal page to enter the submission process.

Submission type

Select a submission type (Fig. 21). If you submit multiple experiments/runs at a time, we recommend that you choose the batch submission method, which is more convenient and faster than a single submission. You will need to download the batch submission template for the experiments/runs metadata first in the submission process, then fill it out and upload it. If you submit only one experiment/run(s) at a time, we recommend that you choose a single submission method. You need to fill out the experiment/run(s) metadata online in the submission process.

Submit batch experiments/runs

Upload data files according to the data upload method.

Download the batch submission template for experiments/runs, fill it out and upload it (Fig. 22). If the file fails the check, modify it according to the check rule and error line number shown in the pop-up box, and then upload it again. After the metadata passes the check, the system will check the MD5 values of the data files. If any files have not passed the check, please handle them according to the prompt information in the pop-up box (Fig. 23).

When the status of the data file is "Check finished", click “Submit”. The system will automatically assign the experiment/run accessions (CNXXXXXXXX/CNRXXXXXXX) and jump to “My submission- Experiment/run” after 10 seconds. The metadata file with accessions can be downloaded in the status column of this page (Fig. 24).

Submit a single experiment/run(s)

General information

On the general information page (Fig. 25), select the project accession and sample accession associated with the experiment/run(s) in the drop-down list, then click “Save and continue” to proceed to the next step. If you have not submitted the project and sample, create a new project and sample first.

Metadata

On the metadata page (Figure 26), you can select a previously submitted experiment accession in the experiment reuse section, and the system will automatically fill in the experiment information. The copied information can then be modified, which helps you fill in the form quickly. If you do not reuse an experiment, fill in the experiment information yourself, then click "Save and continue". If some fields do not pass the check, please modify them according to the prompt information, then click "Save and continue" again to proceed to the next step.

Data files

On the data files page (Fig. 27), please upload the data file according to the data upload method, and fill in the file name and MD5 value of the data file in the input box of the “Data files” section, then click "Check". If the input box turns red, modify it according to the error message in the question mark, and then click "Check" again. If the status of the data file is “Not uploaded”, “Calculating” or “MD5 mismatch”, please handle it according to the prompt information in this part of the page. If the status of the data file is "Check finished", click "Save and continue" to proceed to the next step.
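
The MD5 value entered on this page has to be computed locally from the data file. One way to do this is with Python's standard `hashlib` module, as in the minimal sketch below; the helper names `md5_of_chunks` and `md5_of_file` are illustrative, and the file is read in chunks so that large sequencing files do not need to fit in memory.

```python
import hashlib


def md5_of_chunks(chunks):
    """Return the hexadecimal MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        return md5_of_chunks(iter(lambda: handle.read(chunk_size), b""))
```

The resulting 32-character hexadecimal string is the value to enter next to the file name. An “MD5 mismatch” status would then suggest that the uploaded file differs from the local copy, for example because of an incomplete upload.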

Overview

This overview page (Fig. 28) summarizes the information filled in during the previous steps. If you find any errors, please click “Previous” to return to the relevant page and make the corresponding changes. Throughout the submission process, the system retains the last saved result when you quit the system. If everything is correct, please click “Submit”.

My submission- Experiment/run

After the experiment/run is submitted, the system will automatically assign the experiment/run accessions (CNXXXXXXXX/CNRXXXXXXX) and jump to “My submission- Experiment/run” after 10 seconds. The experiment accession can be viewed in the status column of that page (Fig. 29). The run accession can be viewed by clicking on the submission ID.

Submit assembly

You need to submit the metadata and data files for the assembly. Before submitting the assembly data, please create the project and sample first.

Assembly submission portal

Click on "Assembly" (Figure 30) on the Submission portal page to enter the submission process.

Submission type

Select a submission type (Fig. 31). If you submit multiple assemblies at a time, we recommend that you choose the batch submission method, which is more convenient and faster than a single submission. You will need to download the batch submission template for the assemblies’ metadata first in the submission process, then fill it out and upload it. If you submit only one assembly at a time, we recommend that you choose a single submission method. You need to fill out the assembly metadata online in the submission process.

Submit batch assemblies

Upload data files according to the data upload method (Fig. 32).

Download the batch submission template for assemblies, fill it out and upload it (Fig. 32). If the file fails the check, modify it according to the check rule and error line number shown in the pop-up box (Fig. 33), and then upload it again. After the metadata passes the check, the system will check the MD5 values of the data files. If any files have not passed the check, please handle them according to the prompt information in the pop-up box.

When the status of the data file is "Check finished", click “Submit”. The system will automatically assign the assembly accessions (CNAXXXXXXX) and jump to “My submission-Assembly” after 10 seconds. The metadata file with accessions can be downloaded in the status column of this page (Fig. 34).

Submit a single assembly

General information

Select the project accession and sample accession associated with the assembly in the drop-down list (Fig. 35), then click “Save and continue” to proceed to the next step. If you have not submitted the project and sample, create a new project and sample.

Metadata

On the metadata page (Figure 36), you need to fill in the metadata of assembly, then click "Save and continue". If some fields do not pass the check, please modify it according to the prompt information, then click "Save and continue" to proceed to the next step.

Data files

On the data files page (Fig. 37), please upload the data file according to the data upload method, and fill in the file name and MD5 value of the data file in the input box of the “Data files” section, then click "Check". If a field has not passed the check, please modify it according to the error prompt under the field, and then click "Check" again. If the status of the data file is “Not uploaded”, “Calculating” or “MD5 mismatch”, please handle it according to the prompt information in this part of the page. If the status of the data file is "Check finished", click "Save and continue" to proceed to the next step.

Overview

The overview page (Fig. 38) summarizes the information filled in during the previous steps. If you find any errors, please click “Previous” to return to the relevant page and make the corresponding changes. Throughout the submission process, the system retains the last saved result when you quit the system. If everything is correct, please click “Submit”.

My submission-Assembly

After the assembly is submitted, the system will automatically assign the assembly accession (CNAXXXXXXX) and jump to “My submission-Assembly” after 10 seconds. The assembly accession can be viewed in the status column of that page (Fig. 39).

Submit variation

Before submitting the variation data, please create the project and sample first; the experiment/run data is optional.

Variation submission portal

Click on "Variation" (Figure 40) on the Submission portal page to enter the submission process.

Submit variation data

Select a variant type (Fig. 41).

Upload the VCF file of variations.

If you choose SNP, you need to upload the VCF file to FTP, then download the SNP submission template, fill it out and upload it (Fig. 42).

If you choose SV, you need to download the SV submission template, fill it out and upload it. VCF files can be optionally submitted (Fig. 43).

If you choose CAHV, you only need to download the CAHV submission template, fill it out and upload it (Fig. 44).

After the template file is filled in and uploaded, the system will check each sheet in the template in turn. If a field fails the check, modify it according to the check rules and the error line numbers in the pop-up box (Fig. 45), then re-upload the file and click "check".

If the check is passed, click “Submit”. The system will jump to “My submission-Variation” after 10 seconds. After the data passes the review, the submitted file with accessions can be downloaded in the status column of this page (Fig. 46).

Submit metabolism

Before submitting the metabolic data, please create the project and sample first.

Metabolism submission portal

Click on "Metabolism" (Figure 47) on the Submission portal page to enter the submission process.

Submit metabolic data files and metadata

Check the data file format requirements first, and then upload the data files (Figure 48).

Metabolic metadata includes descriptions and assays. The descriptions must be submitted, and an assay can be added after the descriptions are uploaded. A new assay can be added after the previous assay is uploaded. Please download the descriptions template first, fill it in and upload it (Figure 49).

The system will check each sheet in the submitted file in turn. If the field check fails, please modify it according to the check rules and the error line number prompted by the pop-up box, and then upload it again (Figure 50).

Click "Add assay" to create an assay (Figure 51). The system will provide the corresponding assay template according to your choice. Please download the template, fill in and upload it. If the field check fails, please modify it according to the check rules and error line number prompted by the pop-up box, and then upload again.

After the metadata check is passed, the system will check the MD5 value of the data files. You can check the status of the data files in the “Data files status” module (Figure 52). If there is a file that has not passed the check, click "View" and deal with it according to the prompt information in the pop-up box.

When the “Data files status” is "Check finished", please click "Submit". The system will automatically assign a metabolism accession (METMXXXXXXX) and, after 10 seconds, jump to "My submission-Metabolism". You can view the metabolism accession and download the metadata file in the “Metadata status” column of this page (Figure 53).

Submit single cell

Before submitting the single cell data, please create the project and sample first.

Single cell submission portal

Click on "Single cell" (Figure 54) on the Submission portal page to enter the submission process.

Associated project

Fill in the project accession and cell number associated with the data (Figure 55).

Submit single cell data files and metadata

First upload the data files according to the data file format requirements and data upload options (Figure 56). Note: Gene expression files are required, other types of files are optional. If your gene expression files, metadata files, and cluster files need to be grouped, please add the group name to the file name.

Download the single cell metadata submission template, fill it in and upload it (Figure 57).

Fig. 58 Check results of single cell metadata

After the metadata check is passed, the system will check the MD5 value of the data files. You can check the status of the data files in the “Data files status” module (Figure 59). If there is a file that has not passed the check, click "View" and deal with it according to the prompt information in the pop-up box.

When the “Data files status” is "Check finished", please click "Submit". The system will automatically assign a single cell accession (CSEXXXXXXX) and, after 10 seconds, jump to "My submission-Single cell". You can view the single cell accession and download the metadata file in the “Metadata status” column of this page (Figure 60).

Submit virus sequence

Before submitting the virus sequence data, please create the project first; the sample is optional.

Virus sequence submission portal

Click on "Virus sequence" on the submission portal page (Figure 61), or click "Submit data" on the Virus Data Integration Platform (VirusDIP) (Figure 62) to enter the submission process.

Fig. 61 Submission portal-Virus sequence portal

Select a release date

Please select the release date of the virus sequence data submitted this time (Figure 63). Note that the release date of this data must be later than the release date of the associated project. Otherwise, the release date is invalid, and the system will release this data on the project’s release date. If you need to release this data in advance, please adjust the release date of the project.
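
The release-date rule described above amounts to taking the later of the two dates. The following is a minimal sketch of that rule as stated in this section, not of the system's internal logic; the function name `effective_release_date` is hypothetical.

```python
from datetime import date


def effective_release_date(data_release: date, project_release: date) -> date:
    """Return the date on which the virus sequence data would be released.

    Per the rule above, a data release date that is not later than the
    project's release date is invalid, and the project's date applies.
    """
    return max(data_release, project_release)
```

For example, data dated 1 January 2024 under a project released on 1 June 2024 would be released on 1 June 2024; releasing it earlier requires moving the project's release date forward first.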

Submit virus data files and metadata

Upload the sequence data file in FASTA format to FTP according to the data file format requirements and data upload method (Figure 64).

Download the virus metadata submission template, fill it in and upload it (Figure 65).

The system will check the submitted file. If the field check fails, please modify it according to the check rules and the error line number prompted by the pop-up box (Figure 66), and then upload it again.

After the metadata check is passed, the system will check the MD5 value of the data files. You can check the status of the data files in the “Data files status” module (Figure 67). If there is a file that has not passed the check, click "View" and deal with it according to the prompt information in the pop-up box.

When the “Data files status” is "Check finished", please click "Submit". The system will automatically assign an accession to each virus sequence (for example, N_AAADBH010000000) and, after 10 seconds, jump to "My submission-Virus sequence". You can download the metadata file with accessions in the “Metadata status” column of this page (Figure 68).

Submit sequence

Before submitting the sequence data, please create the project first; the sample is optional. Note: please submit the genome data of a species through the "Assembly" portal, and viral sequence data through the "Virus sequence" portal.

Sequence submission portal

Click on "Sequence" on the submission portal page (Figure 69) to enter the submission process.

Submit sequence data files and file list

Upload the sequence data files according to the data file format requirements and data upload method (Figure 70).

Download the submission template of the sequence data file list, fill it in and upload it (Figure 71).

The system will check the submitted file. If the field check fails, please modify it according to the check rules and the error line number prompted by the pop-up box (Figure 72), and then upload it again.

After the file list check is passed, the system will check the MD5 value of the data files. You can check the status of the data files in the “Data files status” module (Figure 73). If there is a file that has not passed the check, click "View" and deal with it according to the prompt information in the pop-up box.

After the MD5 check is passed, the system will check the sequence format. Please make sure that the format of the sequence you submit is GenBank Flat File Format (GBFF). If there is a file that has not passed the format check, please click "View" and modify the sequence file with the wrong format as prompted by the pop-up box (Figure 74), then upload it again, and the system will check it again.

Fig. 74 Data file status-Sequence format check
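
A quick local sanity check can catch obviously malformed files before upload. The sketch below relies only on two general properties of GenBank flat files, namely that each record starts with a `LOCUS` line and ends with a `//` line; it is far less strict than a real format validator and is not the check that CNSA performs. The function name `looks_like_gbff` is hypothetical.

```python
def looks_like_gbff(text: str) -> bool:
    """Rough structural check for GenBank flat file text.

    Only verifies that the text starts with a LOCUS line, ends with a
    record terminator, and has one terminator per LOCUS line.
    """
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith("LOCUS"):
        return False
    locus_count = sum(1 for line in lines if line.startswith("LOCUS"))
    terminator_count = sum(1 for line in lines if line == "//")
    return lines[-1] == "//" and locus_count == terminator_count
```

A file that passes this sketch can still fail the system's format check, so the messages in the pop-up box remain the authoritative guide for corrections.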

When the “Data files status” is "Check finished", please click "Submit". The system will automatically assign an accession to each sequence (for example, N_000001234) and, after 10 seconds, jump to "My submission-Sequence". You can download the file list with accessions in the “Metadata status” column of this page (Figure 75). If the submission has just been completed and "Download file list with accessions" does not yet appear in the "Status" column of the submission, the system is still assigning sequence accessions. Please refresh the page.