The single-cell database consists mainly of single-cell expression matrices from the China National GeneBank DataBase (CNGBdb). It contains uniformly re-analysed single-cell expression data across different species and provides interactive visualizations for exploring that data.
1. The workspace
This workspace contains single-cell expression data and workflows for performing single-cell analysis.
2. The data
The data in this workspace consist of single-cell expression matrices from CNGBdb. The other files belonging to each dataset project can be found under Datasets in CDCP.
3. Workflow
3.1 single_cell_scanpy.wdl
This workflow is implemented with scanpy. You only need to provide a single-cell expression matrix (csv/tsv) to start the analysis. Optional parameters are exposed so that each step can be tuned. Each step produces an hdf5 file that serves as input for the next step. Each step is also provided as a separate workflow so that users can debug it on its own. The flow chart is as follows:
The adjustable parameters are listed under the individual steps below.
The following workflows are included in this workspace. For each one, the inputs, outputs, example values and a description are given below.
3.1.1 qc.wdl
This workflow loads the input file and calculates quality control metrics with scanpy.pp.calculate_qc_metrics.
Main parameters
Input | Description | Example |
---|---|---|
infile | input file | "obs://xxx/xxx.csv" |
project_name | project name | test |
filetype | filetype of input files | csv |
Optional parameters
Parameter | Description | Value |
---|---|---|
min_gene | Minimum number of expressed genes required for a cell to pass filtering | Int (optional, default = 200) |
min_cell | Minimum number of cells in which a gene must be expressed for the gene to pass filtering | Int (optional, default = 3) |
genes | Maximum number of expressed genes allowed for a cell to pass filtering | (optional, default = 2500) |
per_mt | Maximum percentage of mitochondrial gene counts allowed for a cell to pass filtering | Int (optional, default = 5) |
Output |
---|
outputfile |
pngfile |
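
The per-cell thresholds above can be illustrated with a small plain-Python sketch. This is not the workflow's own code (the workflow relies on scanpy). The toy count matrix, the `MT-` prefix used to recognise mitochondrial genes, the helper name `passes_qc`, and the inclusive comparisons are all assumptions made for this example. The `genes` parameter is called `max_genes` here for readability, and the per-gene `min_cell` filter is left out for brevity.

```python
# Toy count matrix: rows are cells, columns are genes (values are raw counts).
genes = ["MT-CO1", "GeneA", "GeneB", "GeneC"]
counts = [
    [1, 5, 0, 4],  # cell 0: 3 genes detected, 10% mitochondrial counts
    [0, 0, 7, 0],  # cell 1: only 1 gene detected
    [6, 2, 1, 1],  # cell 2: 4 genes detected, 60% mitochondrial counts
]


def passes_qc(cell, gene_names, min_gene, max_genes, per_mt):
    """Return True if a cell satisfies all three per-cell thresholds."""
    n_genes = sum(1 for c in cell if c > 0)
    total = sum(cell)
    # Assumption: mitochondrial genes are recognised by an "MT-" prefix.
    mt = sum(c for c, g in zip(cell, gene_names) if g.startswith("MT-"))
    pct_mt = 100.0 * mt / total if total else 0.0
    return min_gene <= n_genes <= max_genes and pct_mt <= per_mt


# Thresholds are scaled down to suit the toy data
# (the workflow defaults listed above are 200, 2500 and 5).
kept = [
    i
    for i, cell in enumerate(counts)
    if passes_qc(cell, genes, min_gene=2, max_genes=4, per_mt=20)
]
print(kept)
```

Only cell 0 survives: cell 1 expresses too few genes and cell 2 has too high a mitochondrial fraction.
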
3.1.2 norm.wdl
This workflow normalizes counts per cell and log-transforms the data matrix.
Main parameters
Input | Example |
---|---|
infile | "obs://xxx/xxx.h5ad" |
project_name | test |
Optional parameters
Parameter | Description | Value |
---|---|---|
target_sum | Total count that each cell is scaled to by normalization. In scanpy, if this is None, each cell ends up with a total count equal to the median of the total counts per cell before normalization | "Float (optional, default = 10000)" |
Output |
---|
outputfile |
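
As a rough illustration of what this step computes, the following plain-Python sketch scales every cell to `target_sum` total counts and then applies a natural-log `log1p` transform. The function name `normalize_and_log` and the toy matrix are assumptions for this example; the workflow itself uses scanpy rather than this code.

```python
import math


def normalize_and_log(counts, target_sum=10000.0):
    """Scale each cell (row) to target_sum total counts, then apply log(1 + x)."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Cells with no counts are left as all zeros.
        scale = target_sum / total if total else 0.0
        normalized.append([math.log1p(c * scale) for c in cell])
    return normalized


# Two toy cells with very different sequencing depth.
counts = [[1, 3, 0], [10, 0, 10]]
result = normalize_and_log(counts, target_sum=10000.0)
for row in result:
    print([round(v, 3) for v in row])
```

After this transformation the two cells are comparable even though their raw totals differ by a factor of five.
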
3.1.3 hvg.wdl
This workflow accepts an h5ad file, annotates highly variable genes, regresses out (mostly) unwanted sources of variation, and then scales the data to unit variance and zero mean, using sc.pp.highly_variable_genes, sc.pp.regress_out and sc.pp.scale.
Main parameters
Input | Example |
---|---|
infile | "obs://xxx/xxx.h5ad" |
project_name | test |
Optional parameters
Parameter | Description | Value |
---|---|---|
min_mean | Lower cutoff for mean expression. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored | "Float (optional, default = 0.125)" |
flavor | Method used to compute the normalized dispersion. In their default workflows, Seurat passes the cutoffs whereas Cell Ranger passes n_top_genes | "String? (optional)" |
n_top_genes | Number of highly variable genes to keep | "Int? (optional)" |
subset | If True, subset the data in place to the highly variable genes; otherwise only flag the highly variable genes | "Boolean? (optional)" |
max_mean | Upper cutoff for mean expression. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored | "Float (optional, default = 3)" |
min_disp | Lower cutoff for normalized dispersion. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored | "Float (optional, default = 0.5)" |
max_disp | Upper cutoff for normalized dispersion. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Default is np.inf | "Float? (optional)" |
max_value | Clip (truncate) to this value after scaling | "Float (optional, default = 10)" |
Output |
---|
outputfile |
pngfile |
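
The final scaling step, together with the `max_value` clip, can be sketched in plain Python for a single gene. This covers only the scaling part of the workflow, not the selection of highly variable genes or the regression. The helper name `scale_gene`, the use of the sample standard deviation, and capping only the upper tail are simplifying assumptions for this example; scanpy's own behaviour may differ in those details.

```python
import statistics


def scale_gene(values, max_value=10.0):
    """Centre one gene to zero mean, scale to unit variance, cap at max_value."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    if std == 0:
        # A constant gene carries no variation; return zeros.
        return [0.0 for _ in values]
    return [min((v - mean) / std, max_value) for v in values]


# Without clipping: a gene with values 1, 2, 3 becomes -1, 0, 1.
print(scale_gene([1.0, 2.0, 3.0]))

# With a low cap, the outlier cell is truncated to max_value.
print(scale_gene([1.0, 2.0, 3.0, 4.0, 100.0], max_value=1.0))
```

Clipping keeps a few extreme cells from dominating downstream steps such as PCA.
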
3.1.4 pca.wdl
This workflow performs principal component analysis.
Main parameters
Input | Example |
---|---|
infile | "obs://xxx/xxx.h5ad" |
project_name | test |
Optional parameters
Parameter | Description | Value |
---|---|---|
n_comps | Number of principal components to compute | "Int? (optional)" |
use_highly_variable | Whether to use highly variable genes only, stored in .var['highly_variable'] | "Boolean? (optional)" |
zero_center | If True, compute standard PCA from the covariance matrix. If False, omit zero-centering of variables (uses TruncatedSVD), which allows sparse input to be handled efficiently | "Boolean (optional, default = true)" |
svd_solver | SVD solver to use | "String (optional, default = "arpack")" |
Output |
---|
outputfile |
pngfile |
3.1.5 neighbors.wdl
This workflow computes a neighborhood graph of observations.
Main parameters
Input | Example |
---|---|
infile | "obs://xxx/xxx.h5ad" |
project_name | test |
Optional parameters
Parameter | Description | Value |
---|---|---|
knn | If True, use a hard threshold to restrict the number of neighbors to n_neighbors, that is, consider a knn graph. Otherwise, use a Gaussian Kernel to assign low weights to neighbors more distant than the n_neighbors nearest neighbor | "Boolean (optional, default = true)" |
use_rep | Use the indicated representation. 'X' or any key for .obsm is valid. If None, the representation is chosen automatically: For .n_vars < 50, .X is used, otherwise ‘X_pca’ is used. If ‘X_pca’ is not present, it’s computed with default parameters. | "String? (optional)" |
key_added | If not specified, the neighbors data is stored in .uns[‘neighbors’], distances and connectivities are stored in .obsp[‘distances’] and .obsp[‘connectivities’] respectively. If specified, the neighbors data is added to .uns[key_added], distances are stored in .obsp[key_added+’_distances’] and connectivities in .obsp[key_added+’_connectivities’]. | "String? (optional)" |
n_pcs | Use this many PCs. If n_pcs==0 use .X if use_rep is None | "Int? (optional)" |
method | Use ‘umap’ or ‘gauss’ with adaptive width for computing connectivities. Use ‘rapids’ for the RAPIDS implementation of UMAP (experimental, GPU only) | "String (optional, default = "umap")" |
n_neighbors | The size of the local neighborhood (in terms of number of neighboring data points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general, values should be in the range 2 to 100. If knn is True, this is the number of nearest neighbors to be searched. If knn is False, a Gaussian kernel width is set to the distance of the n_neighbors-th neighbor | "Int (optional, default = 10)" |
Output |
---|
outputfile |
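
For the `knn = true` case, the idea of restricting each cell to its `n_neighbors` closest cells can be sketched with a brute-force search in plain Python. This is only an illustration under simplifying assumptions: the function name `k_nearest_neighbors` and the 2-D toy points are made up for this example, Euclidean distance is assumed, and no connectivity weights are computed. The workflow itself relies on scanpy for this step.

```python
import math


def k_nearest_neighbors(points, n_neighbors):
    """Return, for each point, the indices of its n_neighbors closest other points."""
    result = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, paired with that point's index.
        distances = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        distances.sort()
        result.append([j for _, j in distances[:n_neighbors]])
    return result


# Toy "cells" in a 2-D representation (for real data this would be e.g. PCA space).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (0.4, 0.2)]
neighbors = k_nearest_neighbors(points, n_neighbors=2)
for i, nbrs in enumerate(neighbors):
    print(i, nbrs)
```

A larger `n_neighbors` connects each cell to more distant cells, which is why it yields a more global view of the data.
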
3.1.6 umap.wdl/tsne.wdl
This workflow embeds the neighborhood graph with UMAP or t-SNE (nonlinear dimensionality reduction).
Main parameters
Input | Example |
---|---|
infile | "obs://xxx/xxx.h5ad" |
project_name | test |
UMAP optional parameters
Parameter | Description | Value |
---|---|---|
negative_sample_rate | The number of negative edge/1-simplex samples to use per positive edge/1-simplex sample in optimizing the low dimensional embedding | "Int (optional, default = 5)" |
alpha | The initial learning rate for the embedding optimization. | "Float (optional, default = 1.0)" |
min_dist | The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result on a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out | "Float (optional, default = 0.5)" |
init_pos | How to initialize the low-dimensional embedding, for example 'spectral' or 'random' | "String (optional, default = "spectral")" |
gamma | Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples | "Float (optional, default = 1.0)" |
spread | The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are | "Float (optional, default = 1.0)" |
n_components | The number of dimensions of the embedding | "Int (optional, default = 2)" |
t-SNE optional parameters
Parameter | Description | Value |
---|---|---|
perplexity | The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. The choice is not extremely critical since t-SNE is quite insensitive to this parameter | "Float (optional, default = 30)" |
early_exaggeration | Controls how tight natural clusters in the original space are in the embedded space and how much space will be between them. For larger values, the space between natural clusters will be larger in the embedded space. Again, the choice of this parameter is not very critical. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high | "Float (optional, default = 12)" |
learning_rate | Note that the R-package “Rtsne” uses a default of 200. The learning rate can be a critical parameter. It should be between 100 and 1000. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high. If the cost function gets stuck in a bad local minimum increasing the learning rate helps sometimes | "Float (optional, default = 1000)" |
Output |
---|
clustfile |
3.1.7 leiden.wdl/louvain.wdl
These workflows cluster cells into subgroups.
Main parameters
Input | Example |
---|---|
infile | "obs://xxx/xxx.h5ad" |
project_name | test |
Leiden optional parameters
Parameter | Description | Value |
---|---|---|
use_weights | If True, edge weights from the graph are used in the computation (placing more emphasis on stronger edges) | "Boolean (optional, default = true)" |
n_iterations | How many iterations of the Leiden clustering algorithm to perform. Positive values above 2 define the total number of iterations to perform, -1 has the algorithm run until it reaches its optimal clustering | "Float (optional, default = -1)" |
directed | Whether to treat the graph as directed or undirected. | "Boolean (optional, default = true)" |
resolution | A parameter value controlling the coarseness of the clustering. Higher values lead to more clusters. Set to None if overriding partition_type to one that doesn’t accept a resolution_parameter | "Float (optional, default = 1)" |
Louvain optional parameters
Parameter | Description | Value |
---|---|---|
use_weights | Use weights from knn graph | "Boolean (optional, default = true)" |
directed | Interpret the adjacency matrix as directed graph? | "Boolean (optional, default = true)" |
flavor | Choose between two packages for computing the clustering. 'vtraag' is much more powerful, and is the default | "String (optional, default = "vtraag")" |
resolution | For the default flavor ('vtraag'), you can provide a resolution (higher resolution means finding more and smaller clusters), which defaults to 1.0 | "Float (optional, default = 1)" |
Output |
---|
outputfile |
3.1.8 maker.wdl
This workflow ranks genes for characterizing groups, that is, it ranks the most differentially expressed genes in each cluster.
Main parameters
Input | Description | Example |
---|---|---|
infile | input file | "obs://xxx/xxx.csv" |
project_name | project name | test |
groupby | The key of the observations grouping to consider | leiden |
Optional parameters
Parameter | Description | Value |
---|---|---|
n_genes | The number of genes that appear in the returned tables | "Int? (optional, default = 100)" |
Output |
---|
outputfile |
pngfile |
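
The parameters in the tables above are typically passed to a WDL workflow through an inputs file in JSON format. The following sketch builds such a file for this last step with Python's standard `json` module. The key prefix `marker` is a hypothetical workflow name chosen for illustration (the name declared inside the WDL file may differ), and the values are the example values from the tables, including the placeholder `obs://` path.

```python
import json

# Hypothetical workflow name; WDL input keys usually take the form
# "<workflow_name>.<input_name>", so adjust this to the name used in the WDL file.
WORKFLOW_NAME = "marker"

inputs = {
    f"{WORKFLOW_NAME}.infile": "obs://xxx/xxx.csv",  # placeholder path from the table
    f"{WORKFLOW_NAME}.project_name": "test",
    f"{WORKFLOW_NAME}.groupby": "leiden",
    f"{WORKFLOW_NAME}.n_genes": 100,  # optional; shown here with its default
}

inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```

Optional parameters that are left out of the file fall back to the defaults listed in the tables.
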