Single-cell Database

The single-cell database mainly consists of single-cell expression matrices from CNGBdb. It contains uniformly re-analysed single-cell expression data across different species and provides interactive visualizations for exploring that data.

Data size: 6,305

Last modified: 2024-06-12

1. The workspace

This workspace contains single-cell expression data and workflows that enable you to perform single-cell analysis.

2. The data

The data in this workspace consists of single-cell expression matrices from CNGBdb. Additional files for each project can be found under Datasets in CDCP.

3. Workflow

3.1 single_cell_scanpy.wdl

This workflow is built on scanpy. You only need to provide a single-cell expression matrix file (csv/tsv) to start the analysis. Optional parameters are exposed so that each step can be tuned. Each step produces an HDF5 file that serves as input to the next step. Each step is also available as a standalone workflow for debugging. The flow chart is as follows:

Scanpy workflow

The adjustable parameters are listed under the individual steps below.

The following workflows are included in this workspace. For each one, the inputs, outputs, example values and a description are given below.

3.1.1 qc.wdl

This workflow loads the input file and calculates quality control metrics with scanpy.pp.calculate_qc_metrics.

Main parameters

Input	Description	Example
infile	Input file	"obs://xxx/xxx.csv"
project_name	Project name	test
filetype	File type of the input file	csv

Optional parameters

Parameter	Description	Value
min_gene	Minimum number of expressed genes required for a cell to pass filtering	Int (optional, default = 200)
min_cell	Minimum number of cells in which a gene must be expressed for the gene to pass filtering	Int (optional, default = 3)
genes	Maximum number of expressed genes allowed for a cell to pass filtering	(optional, default = 2500)
per_mt	Maximum percentage of mitochondrial gene counts allowed for a cell to pass filtering	Int (optional, default = 5)

Output
outputfile
pngfile
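
The thresholds above can be illustrated without scanpy. The following is a minimal pure-Python sketch, not the code that qc.wdl runs. It assumes a small dense count matrix (rows are cells, columns are genes), that mitochondrial genes are recognised by an `MT-` name prefix, and that the bounds are inclusive; none of this is stated in this document. The function name `qc_filter` and the argument `max_gene` (standing in for the `genes` option) are hypothetical.

```python
def qc_filter(counts, gene_names, min_gene=200, min_cell=3, max_gene=2500, per_mt=5):
    """Return (kept_cell_indices, kept_gene_indices) for a dense count matrix.

    counts[i][j] is the count of gene j in cell i.
    """
    # Assumption: mitochondrial genes carry an "MT-" prefix.
    mt_idx = [j for j, name in enumerate(gene_names) if name.startswith("MT-")]

    kept_cells = []
    for i, row in enumerate(counts):
        n_genes = sum(1 for c in row if c > 0)  # genes detected in this cell
        total = sum(row)
        pct_mt = 100.0 * sum(row[j] for j in mt_idx) / total if total else 0.0
        if min_gene <= n_genes <= max_gene and pct_mt <= per_mt:
            kept_cells.append(i)

    # Gene filter, evaluated here over all cells (an assumption about ordering).
    kept_genes = [
        j for j in range(len(gene_names))
        if sum(1 for row in counts if row[j] > 0) >= min_cell
    ]
    return kept_cells, kept_genes


genes = ["MT-CO1", "ACTB", "GAPDH", "CD3E"]
counts = [
    [0, 5, 3, 1],  # 3 genes detected, 0% mitochondrial counts
    [8, 1, 1, 0],  # 3 genes detected, 80% mitochondrial counts
    [0, 4, 0, 0],  # 1 gene detected
]
cells, kept_genes = qc_filter(counts, genes, min_gene=2, min_cell=2, max_gene=3, per_mt=5)
```

With these toy thresholds only the first cell survives, and only ACTB and GAPDH are detected in enough cells to be kept.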

3.1.2 norm.wdl

This workflow normalizes counts per cell and log-transforms the data matrix.

Main parameters

Input	Example
infile	"obs://xxx/xxx.h5ad"
project_name	test

Optional parameters

Parameter	Description	Value
target_sum	Total count that each cell is normalized to. In scanpy, if this is None, each observation (cell) ends up with a total count equal to the median of total counts for observations (cells) before normalization	Float (optional, default = 10000)

Output
outputfile
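
As an illustration of what this step computes, here is a minimal pure-Python sketch of per-cell normalization followed by a log(1 + x) transform. It is not the workflow's code; the function name `normalize_log1p` is hypothetical, and the handling of all-zero cells is an assumption.

```python
import math


def normalize_log1p(counts, target_sum=10000.0):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    out = []
    for row in counts:
        total = sum(row)
        # Assumption: cells with no counts are left at zero.
        scale = target_sum / total if total > 0 else 0.0
        out.append([math.log1p(c * scale) for c in row])
    return out


# Two cells with the same composition but different sequencing depth
normalized = normalize_log1p([[1, 3], [10, 30]], target_sum=100.0)
```

After normalization the two example cells become identical, which is the point of the step: differences in sequencing depth are removed before cells are compared.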

3.1.3 hvg.wdl

This workflow accepts an h5ad file, annotates highly variable genes, regresses out (mostly) unwanted sources of variation, and then scales the data to unit variance and zero mean, using sc.pp.highly_variable_genes, sc.pp.regress_out and sc.pp.scale.

Main parameters

Input	Example
infile	"obs://xxx/xxx.h5ad"
project_name	test

Optional parameters

Parameter	Description	Value
min_mean	If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored	Float (optional, default = 0.125)
flavor	Choose the flavor for computing normalized dispersion. In their default workflows, Seurat passes the cutoffs whereas Cell Ranger passes n_top_genes	String? (optional)
n_top_genes	Number of highly variable genes to keep	Int? (optional)
subset	If True, subset in place to the highly variable genes; otherwise only flag the highly variable genes	Boolean? (optional)
max_mean	If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored	Float (optional, default = 3)
min_disp	If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored	Float (optional, default = 0.5)
max_disp	If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Default is np.inf	Float? (optional)
max_value	Clip (truncate) to this value after scaling	Float (optional, default = 10)

Output
outputfile
pngfile
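
The final scaling step and the effect of max_value can be sketched in plain Python. This is an illustrative sketch rather than the workflow's code: the function name `scale_genes` is hypothetical, and the use of the sample standard deviation, the treatment of constant genes, and clipping only the upper side are assumptions.

```python
import statistics


def scale_genes(matrix, max_value=10.0):
    """Scale each gene (column) to zero mean and unit variance, then clip at max_value."""
    n_genes = len(matrix[0])
    means = [statistics.mean(row[j] for row in matrix) for j in range(n_genes)]
    stds = [statistics.stdev(row[j] for row in matrix) for j in range(n_genes)]
    # Assumption: constant genes get a divisor of 1 to avoid division by zero.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [
        [min((row[j] - means[j]) / stds[j], max_value) for j in range(n_genes)]
        for row in matrix
    ]


# Gene 0 varies across cells, gene 1 is constant
scaled = scale_genes([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]], max_value=0.5)
```

In this toy input the scaled value of the third cell for gene 0 would be 1.0, so it is clipped to the chosen max_value of 0.5, while the constant gene maps to zero everywhere.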

3.1.4 pca.wdl

This workflow runs principal component analysis (PCA).

Main parameters

Input	Example
infile	"obs://xxx/xxx.h5ad"
project_name	test

Optional parameters

Parameter	Description	Value
n_comps	Number of principal components to compute	Int? (optional)
use_highly_variable	Whether to use highly variable genes only, stored in .var['highly_variable']	Boolean? (optional)
zero_center	If True, compute standard PCA from the covariance matrix. If False, omit zero-centering of variables (uses TruncatedSVD), which allows sparse input to be handled efficiently	Boolean (optional, default = true)
svd_solver	SVD solver to use	String (optional, default = "arpack")

Output
outputfile
pngfile
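
To make the role of zero_center concrete, the sketch below estimates the leading principal axis of a small matrix with power iteration. Power iteration is swapped in here purely for illustration; it is not one of the SVD solvers the workflow exposes, and the function name `first_principal_component` and its `n_iter` argument are hypothetical.

```python
import math


def first_principal_component(data, zero_center=True, n_iter=200):
    """Estimate the leading principal axis of `data` (rows = cells) by power iteration."""
    n, d = len(data), len(data[0])
    if zero_center:
        means = [sum(row[j] for row in data) / n for j in range(d)]
        data = [[row[j] - means[j] for j in range(d)] for row in data]
    # d x d matrix X^T X (proportional to the covariance matrix when centered)
    gram = [[sum(row[a] * row[b] for row in data) for b in range(d)] for a in range(d)]
    vec = [1.0] * d
    for _ in range(n_iter):
        vec = [sum(gram[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec


# Points lying exactly on the line y = 2x
pc = first_principal_component([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
```

For these points the returned axis is proportional to (1, 2), the direction along which the data vary.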

3.1.5 neighbors.wdl

This workflow computes a neighborhood graph of observations.

Main parameters

Input	Example
infile	"obs://xxx/xxx.h5ad"
project_name	test

Optional parameters

Parameter	Description	Value
knn	If True, use a hard threshold to restrict the number of neighbors to n_neighbors, that is, consider a knn graph. Otherwise, use a Gaussian kernel to assign low weights to neighbors more distant than the n_neighbors nearest neighbor	Boolean (optional, default = true)
use_rep	Use the indicated representation. 'X' or any key for .obsm is valid. If None, the representation is chosen automatically: for .n_vars < 50, .X is used, otherwise 'X_pca' is used. If 'X_pca' is not present, it is computed with default parameters	String? (optional)
key_added	If not specified, the neighbors data is stored in .uns['neighbors'], and distances and connectivities are stored in .obsp['distances'] and .obsp['connectivities'] respectively. If specified, the neighbors data is added to .uns[key_added], distances are stored in .obsp[key_added+'_distances'] and connectivities in .obsp[key_added+'_connectivities']	String? (optional)
n_pcs	Use this many PCs. If n_pcs==0 use .X if use_rep is None	Int? (optional)
method	Use 'umap' or 'gauss' with adaptive width for computing connectivities. Use 'rapids' for the RAPIDS implementation of UMAP (experimental, GPU only)	String (optional, default = "umap")
n_neighbors	The size of the local neighborhood (in terms of number of neighboring data points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general, values should be in the range 2 to 100. If knn is True, this is the number of nearest neighbors to be searched. If knn is False, a Gaussian kernel width is set to the distance of the n_neighbors neighbor	Int (optional, default = 10)

Output
outputfile
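
The idea behind the knn and n_neighbors options can be shown with a brute-force k-nearest-neighbor search. This is a minimal sketch under stated assumptions: it uses exact Euclidean distances on a handful of points, whereas a real run works on the PCA representation and also derives connectivities, which this omits. The function name `knn_graph` is hypothetical.

```python
import math


def knn_graph(points, n_neighbors=10):
    """Return {i: [(j, distance), ...]} with the n_neighbors nearest points to each i."""
    graph = {}
    for i, p in enumerate(points):
        # Distances from point i to every other point, nearest first
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph[i] = [(j, d) for d, j in dists[:n_neighbors]]
    return graph


# Two well-separated pairs of points on a line
graph = knn_graph([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)], n_neighbors=1)
nearest = [graph[i][0][0] for i in range(4)]
```

With n_neighbors=1 each point is linked only to its partner in the same pair; raising n_neighbors would start to connect the two pairs, which is the "more global view" described in the table.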

3.1.6 umap.wdl/tsne.wdl

This workflow embeds the neighborhood graph using UMAP or t-SNE (non-linear dimensionality reduction).

Main parameters

Input	Example
infile	"obs://xxx/xxx.h5ad"
project_name	test

umap optional parameters

Parameter	Description	Value
negative_sample_rate	The number of negative edge/1-simplex samples to use per positive edge/1-simplex sample in optimizing the low dimensional embedding	Int (optional, default = 5)
alpha	The initial learning rate for the embedding optimization	Float (optional, default = 1.0)
min_dist	The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out	Float (optional, default = 0.5)
init_pos	How to initialize the low dimensional embedding, for example 'spectral' (a spectral embedding of the graph) or 'random'	String (optional, default = "spectral")
gamma	Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples	Float (optional, default = 1.0)
spread	The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are	Float (optional, default = 1.0)
n_components	The number of dimensions of the embedding	Int (optional, default = 2)

tsne optional parameters

Parameter	Description	Value
perplexity	The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. The choice is not extremely critical since t-SNE is quite insensitive to this parameter	Float (optional, default = 30)
early_exaggeration	Controls how tight natural clusters in the original space are in the embedded space and how much space will be between them. For larger values, the space between natural clusters will be larger in the embedded space. Again, the choice of this parameter is not very critical. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high	Float (optional, default = 12)
learning_rate	Note that the R package "Rtsne" uses a default of 200. The learning rate can be a critical parameter. It should be between 100 and 1000. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high. If the cost function gets stuck in a bad local minimum, increasing the learning rate sometimes helps	Float (optional, default = 1000)

Output
clustfile

3.1.7 leiden.wdl/louvain.wdl

This workflow clusters cells into subgroups.

Main parameters

Input	Example
infile	"obs://xxx/xxx.h5ad"
project_name	test

leiden optional parameters

Parameter	Description	Value
use_weights	If True, edge weights from the graph are used in the computation (placing more emphasis on stronger edges)	Boolean (optional, default = true)
n_iterations	How many iterations of the Leiden clustering algorithm to perform. Positive values above 2 define the total number of iterations to perform; -1 has the algorithm run until it reaches its optimal clustering	Float (optional, default = -1)
directed	Whether to treat the graph as directed or undirected	Boolean (optional, default = true)
resolution	A parameter value controlling the coarseness of the clustering. Higher values lead to more clusters. Set to None if overriding partition_type to one that doesn't accept a resolution_parameter	Float (optional, default = 1)

louvain optional parameters

Parameter	Description	Value
use_weights	Use weights from the knn graph	Boolean (optional, default = true)
directed	Whether to interpret the adjacency matrix as a directed graph	Boolean (optional, default = true)
flavor	Choose between two packages for computing the clustering. 'vtraag' is much more powerful, and the default	String (optional, default = "vtraag")
resolution	For the default flavor ('vtraag'), you can provide a resolution (higher resolution means finding more and smaller clusters), which defaults to 1.0	Float (optional, default = 1)

Output
outputfile
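
Why a higher resolution yields more clusters can be seen from a modularity-style quality function with a resolution factor, the kind of score that Louvain and Leiden try to maximise. The sketch below is illustrative only: it scores a given partition rather than searching for one, and whether the workflows use exactly this quality function is an assumption. The function name `modularity` is hypothetical.

```python
def modularity(edges, labels, resolution=1.0):
    """Modularity of a partition of an undirected weighted graph.

    edges: list of (u, v, weight); labels: dict mapping node -> cluster id.
    """
    total_weight = sum(w for _, _, w in edges)
    degree = {}
    for u, v, w in edges:
        degree[u] = degree.get(u, 0.0) + w
        degree[v] = degree.get(v, 0.0) + w

    inside = {}          # edge weight falling inside each cluster
    cluster_degree = {}  # summed node degree of each cluster
    for u, v, w in edges:
        if labels[u] == labels[v]:
            inside[labels[u]] = inside.get(labels[u], 0.0) + w
    for node, k in degree.items():
        cluster_degree[labels[node]] = cluster_degree.get(labels[node], 0.0) + k

    return sum(
        inside.get(c, 0.0) / total_weight
        - resolution * (cluster_degree[c] / (2.0 * total_weight)) ** 2
        for c in cluster_degree
    )


# Two triangles joined by a single edge
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 1.0)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}

prefers_split_at_1 = modularity(edges, split, 1.0) > modularity(edges, merged, 1.0)
prefers_merged_at_0_1 = modularity(edges, merged, 0.1) > modularity(edges, split, 0.1)
```

At resolution 1.0 the two-cluster partition scores higher, while at resolution 0.1 the single merged cluster wins, which matches the table's statement that higher values lead to more clusters.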

3.1.8 maker.wdl

This workflow ranks genes for characterizing groups, that is, it ranks the highly differential genes in each cluster.

Main parameters

Input	Description	Example
infile	Input file	"obs://xxx/xxx.csv"
project_name	Project name	test
groupby	The key of the observations grouping to consider	leiden

Optional parameters

Parameter	Description	Value
n_genes	The number of genes that appear in the returned tables	Int? (optional, default = 100)

Output
outputfile
pngfile
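
One common way to rank genes for a group is to compare the group against all other cells with a t-statistic. This document does not say which test maker.wdl uses, so the sketch below is an assumption: it uses Welch's t-statistic on a tiny dense matrix, and the function name `rank_genes` is hypothetical. Each group needs at least two cells for the variance to be defined.

```python
import math
import statistics


def rank_genes(matrix, gene_names, labels, group, n_genes=100):
    """Rank genes by Welch's t-statistic comparing `group` against all other cells."""
    scores = []
    for j, name in enumerate(gene_names):
        inside = [row[j] for row, lab in zip(matrix, labels) if lab == group]
        outside = [row[j] for row, lab in zip(matrix, labels) if lab != group]
        var_term = (statistics.variance(inside) / len(inside)
                    + statistics.variance(outside) / len(outside))
        diff = statistics.mean(inside) - statistics.mean(outside)
        t = diff / math.sqrt(var_term) if var_term > 0 else 0.0
        scores.append((t, name))
    scores.sort(reverse=True)  # highest score first
    return [name for _, name in scores[:n_genes]]


matrix = [
    [5.0, 1.0, 2.0],
    [6.0, 1.5, 2.0],
    [1.0, 1.0, 2.5],
    [0.5, 1.2, 1.5],
]
labels = ["0", "0", "1", "1"]  # e.g. cluster labels from the groupby key
top = rank_genes(matrix, ["A", "B", "C"], labels, group="0", n_genes=2)
```

Here gene A is clearly higher in cluster "0" and ranks first, gene B shows a small difference, and gene C shows none, so n_genes=2 returns A and B.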