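Introduction
Scanpy is a widely used, mainstream Python toolkit for single-cell transcriptome analysis. It is built jointly with anndata and is used to analyze single-cell gene expression data. It includes preprocessing, visualization, clustering, trajectory inference, and differential expression testing, and it scales efficiently to datasets of more than one million cells.
This WDL workflow is adapted and extended from the official Scanpy 3k PBMC tutorial. It exposes the most commonly tuned parameters so that they can be adjusted for different datasets. The workflow is also split, following the steps in the figure below, into 10 sub-workflows so that users can debug the result of each step on its own. You can search for the corresponding sub-workflow in the tool collection and debug it with the matching output H5AD file. You can also debug code with the Scanpy Notebook tool that we provide.
For detailed parameter descriptions, see the input section below.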
3.1.1qc.wdl
This workflow loads the input files and calculates quality control metrics with scanpy.pp.calculate_qc_metrics. It can optionally filter out cells that express few genes and genes that are detected in few cells.
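The filtering described above can be sketched in plain Python to show how the `min_gene`, `min_cell`, `genes`, and `per_mt` inputs interact. This is an illustration, not the workflow's actual code: the function name `qc_filter`, the `max_genes` argument (standing in for the `genes` input), the order of the steps, the strict versus inclusive comparisons, and the `MT-` prefix for mitochondrial genes (the human gene naming used in the PBMC tutorial) are all assumptions.

```python
def qc_filter(counts, gene_names, min_gene, min_cell, max_genes, per_mt):
    """Filter a cells x genes count matrix; returns (kept cell indices, kept gene indices)."""
    # 1. Keep cells that express at least `min_gene` genes.
    cells = [i for i in range(len(counts))
             if sum(1 for v in counts[i] if v > 0) >= min_gene]

    # 2. Keep genes detected in at least `min_cell` of the remaining cells.
    genes = [j for j in range(len(gene_names))
             if sum(1 for i in cells if counts[i][j] > 0) >= min_cell]

    kept = []
    for i in cells:
        # 3. Per-cell metrics computed on the filtered matrix.
        total = sum(counts[i][j] for j in genes)
        n_genes = sum(1 for j in genes if counts[i][j] > 0)
        mt = sum(counts[i][j] for j in genes if gene_names[j].startswith("MT-"))
        pct_mt = 100.0 * mt / total if total > 0 else 0.0
        # 4. Drop cells with too many genes or too high a mitochondrial share.
        if n_genes < max_genes and pct_mt < per_mt:
            kept.append(i)
    return kept, genes


gene_names = ["MT-CO1", "CD3D", "LYZ", "NKG7"]
counts = [
    [0, 5, 3, 0],    # 2 genes, no mitochondrial counts
    [6, 2, 2, 0],    # 60% mitochondrial counts
    [0, 1, 0, 0],    # only 1 gene expressed
    [1, 9, 10, 11],  # 5% mitochondrial counts once NKG7 is removed
]
kept_cells, kept_genes = qc_filter(
    counts, gene_names, min_gene=2, min_cell=2, max_genes=4, per_mt=10
)
```

In this toy matrix the third cell fails `min_gene`, NKG7 fails `min_cell`, and the second cell fails `per_mt`, so two cells and three genes remain.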
3.1.2norm.wdl
This workflow normalizes the counts of each cell and log-transforms the data matrix.
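What this step computes can be sketched without any dependencies. The function name `normalize_log1p` is hypothetical, and the handling of `target_sum=None` (median total of the cells with nonzero counts) is an assumption based on the `target_sum` description in the input table, not a copy of the Scanpy implementation.

```python
import math
from statistics import median


def normalize_log1p(counts, target_sum=None):
    """Scale every cell (row) to the same total count, then apply log(1 + x)."""
    totals = [sum(row) for row in counts]
    if target_sum is None:
        # Assumption: use the median total of cells with nonzero counts.
        target_sum = median(t for t in totals if t > 0)
    result = []
    for row, total in zip(counts, totals):
        # Empty cells stay all zeros instead of dividing by zero.
        factor = target_sum / total if total > 0 else 0.0
        result.append([math.log1p(v * factor) for v in row])
    return result


cells = [[1, 3, 0], [10, 0, 10], [2, 2, 4]]
fixed = normalize_log1p(cells, target_sum=100)  # each cell sums to 100 before the log
auto = normalize_log1p(cells)                   # target is the median total (8)
```

Zero counts stay zero after the transform because log(1 + 0) = 0, which keeps sparse matrices sparse.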
3.1.3hvg.wdl
This workflow accepts an H5AD file as input, annotates highly variable genes, regresses out (mostly) unwanted sources of variation, and then scales the data to unit variance and zero mean, using sc.pp.highly_variable_genes, sc.pp.regress_out, and sc.pp.scale.
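The final scaling step, together with the `max_value` input, can be illustrated with a small dependency-free sketch. The function name `scale_genes` is hypothetical, and two details are assumptions rather than confirmed behavior of the workflow: the sample standard deviation is used, and only values above `max_value` are clipped.

```python
from statistics import mean, stdev


def scale_genes(matrix, max_value=None):
    """Z-score each gene (column) of a cells x genes matrix, optionally clipping high values."""
    n_genes = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_genes)]
    scaled_cols = []
    for col in columns:
        mu, sd = mean(col), stdev(col)
        # Constant genes have zero spread; divide by 1 so they become all zeros.
        sd = sd if sd > 0 else 1.0
        z = [(v - mu) / sd for v in col]
        if max_value is not None:
            # Assumption: clip only the upper side.
            z = [min(v, max_value) for v in z]
        scaled_cols.append(z)
    # Transpose back to cells x genes.
    return [list(row) for row in zip(*scaled_cols)]


data = [[1.0, 2.0], [2.0, 2.0], [3.0, 2.0], [10.0, 2.0]]
scaled = scale_genes(data, max_value=1.0)
```

Clipping limits the influence of a few extreme cells on later steps such as PCA.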
3.1.4pca.wdl
This workflow performs principal component analysis (PCA).
3.1.5neighbors.wdl
This workflow computes a neighborhood graph of observations.
umap.wdl/tsne.wdl
This workflow embeds the neighborhood graph using UMAP or t-SNE, that is, it performs nonlinear dimensionality reduction.
leiden.wdl/louvain.wdl
This workflow clusters cells into subgroups.
maker.wdl
This workflow ranks genes for characterizing groups, that is, it ranks the highly differential genes in each cluster.
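As a rough illustration of the ranking idea behind the default 't-test' method, the sketch below scores each gene with Welch's t-statistic for one cluster against all other cells and sorts the genes by that score. The function name `rank_genes_for_group` and the toy data are hypothetical, and the real step also reports p-values and fold changes that are omitted here.

```python
from math import sqrt
from statistics import mean, variance


def rank_genes_for_group(expr, labels, group, gene_names):
    """Rank genes by Welch's t-statistic: cells in `group` versus all other cells."""
    scores = []
    for j, gene in enumerate(gene_names):
        inside = [row[j] for row, lab in zip(expr, labels) if lab == group]
        outside = [row[j] for row, lab in zip(expr, labels) if lab != group]
        se = sqrt(variance(inside) / len(inside) + variance(outside) / len(outside))
        # A zero standard error means the gene is constant in both groups.
        t = (mean(inside) - mean(outside)) / se if se > 0 else 0.0
        scores.append((gene, t))
    # Highest t-statistic first: genes most up-regulated in the group.
    return sorted(scores, key=lambda s: s[1], reverse=True)


gene_names = ["CD3D", "LYZ", "ACTB"]
expr = [
    [5.0, 1.0, 2.0],
    [6.0, 0.0, 2.5],
    [1.0, 4.0, 2.0],
    [0.0, 5.0, 2.5],
]
labels = ["0", "0", "1", "1"]
ranked = rank_genes_for_group(expr, labels, "0", gene_names)
```

Each group needs at least two cells, because the sample variance is undefined for a single value.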
Contact us
This tool is provided by the China National GeneBank team. If you have any questions or concerns, please contact CNGBdb@cngb.org
Task name | Attribute name | Type | Description |
---|---|---|---|
* scanpy | project_name | String | project name |
* scanpy | infile | File | input matrix file |
* scanpy | filetype | String | File type of the input files: csv or tsv |
scanpy.umap | spread | Float | The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are |
scanpy.umap | negative_sample_rate | Int | The number of negative edge/1-simplex samples to use per positive edge/1-simplex sample in optimizing the low dimensional embedding |
scanpy.umap | n_components | Int | The number of dimensions of the embedding |
scanpy.umap | min_dist | Float | The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out. |
scanpy.umap | memory | String | Memory allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.umap | init_pos | String | How to initialize the low-dimensional embedding. Options: ['paga', 'spectral', 'random'], an ndarray, or None (default: 'spectral'). |
scanpy.umap | gamma | Float | Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples. |
scanpy.umap | cpu | String | Number of CPU cores allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.umap | alpha | Float | The initial learning rate for the embedding optimization |
scanpy.tsne | perplexity | Float | The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. The choice is not extremely critical since t-SNE is quite insensitive to this parameter. |
scanpy.tsne | memory | String | Memory allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.tsne | learning_rate | Float | Note that the R package 'Rtsne' uses a default of 200. The learning rate can be a critical parameter. It should be between 100 and 1000. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high. If the cost function gets stuck in a bad local minimum, increasing the learning rate sometimes helps. |
scanpy.tsne | early_exaggeration | Float | Controls how tight natural clusters in the original space are in the embedded space and how much space will be between them. For larger values, the space between natural clusters will be larger in the embedded space. Again, the choice of this parameter is not very critical. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high. |
scanpy.tsne | cpu | String | Number of CPU cores allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.qc | per_mt | Int | Maximum percentage of counts from mitochondrial genes allowed for a cell to pass filtering. |
scanpy.qc | min_gene | Int | Minimum number of expressed genes required for a cell to pass filtering. |
scanpy.qc | min_cell | Int | Minimum number of cells in which a gene must be expressed for the gene to pass filtering. |
scanpy.qc | memory | String | Memory allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.qc | genes | Int | Maximum number of expressed genes allowed for a cell to pass filtering. |
scanpy.qc | cpu | String | Number of CPU cores allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.pca | zero_center | Boolean | If True, compute standard PCA from the covariance matrix. If False, omit zero-centering of variables (uses TruncatedSVD), which allows sparse input to be handled efficiently. Passing None decides automatically based on the sparseness of the data. |
scanpy.pca | use_highly_variable | Boolean | Whether to use highly variable genes only, stored in .var['highly_variable']. By default uses them if they have been determined beforehand. |
scanpy.pca | svd_solver | String | SVD solver to use. |
scanpy.pca | n_comps | Int | Number of principal components to compute. Defaults to 50, or to the minimum dimension size of the selected representation minus 1. |
scanpy.pca | memory | String | Memory allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.pca | cpu | String | Number of CPU cores allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.normalize | target_sum | Float | If None, after normalization, each observation (cell) has a total count equal to the median of total counts for observations (cells) before normalization. |
scanpy.normalize | memory | String | Memory allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.normalize | cpu | String | Number of CPU cores allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.neighbors | use_rep | String | Use the indicated representation. 'X' or any key for .obsm is valid. If None, the representation is chosen automatically: For .n_vars < 50, .X is used, otherwise ‘X_pca’ is used. If ‘X_pca’ is not present, it’s computed with default parameters. |
scanpy.neighbors | n_pcs | Int | Use this many PCs. If n_pcs==0 use .X if use_rep is None |
scanpy.neighbors | n_neighbors | Int | The size of local neighborhood (in terms of number of neighboring data points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100. If knn is True, number of nearest neighbors to be searched. If knn is False, a Gaussian kernel width is set to the distance of the n_neighbors neighbor. |
scanpy.neighbors | method | String | Use 'umap' or 'gauss' for computing connectivities. Use 'rapids' for the RAPIDS implementation of UMAP (experimental, GPU only). |
scanpy.neighbors | memory | String | Memory allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.neighbors | knn | Boolean | If True, use a hard threshold to restrict the number of neighbors to n_neighbors, that is, consider a knn graph. Otherwise, use a Gaussian Kernel to assign low weights to neighbors more distant than the n_neighbors nearest neighbor. |
scanpy.neighbors | key_added | String | If not specified, the neighbors data is stored in .uns[‘neighbors’], distances and connectivities are stored in .obsp[‘distances’] and .obsp[‘connectivities’] respectively. If specified, the neighbors data is added to .uns[key_added], distances are stored in .obsp[key_added+’_distances’] and connectivities in .obsp[key_added+’_connectivities’]. |
scanpy.neighbors | cpu | String | Number of CPU cores allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.marker | n_genes | Int | The number of genes that appear in the returned tables. Defaults to all genes. |
scanpy.marker | method | String | The default method is 't-test'. 't-test_overestim_var' overestimates the variance of each group, 'wilcoxon' uses the Wilcoxon rank-sum test, and 'logreg' uses logistic regression. See [Ntranos18] for why this is meaningful. |
scanpy.marker | memory | String | Memory allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.marker | cpu | String | Number of CPU cores allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.louvain | use_weights | Boolean | Use weights from knn graph. |
scanpy.louvain | resolution | Float | For the default flavor ('vtraag'), you can provide a resolution (higher resolution means finding more and smaller clusters), which defaults to 1.0. |
scanpy.louvain | memory | String | Memory allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.louvain | flavor | String | Allowed values: ['vtraag', 'igraph', 'rapids']. Choose between packages for computing the clustering. 'vtraag' is much more powerful, and the default. |
scanpy.louvain | directed | Boolean | Interpret the adjacency matrix as directed graph? |
scanpy.louvain | cpu | String | Number of CPU cores allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.leiden | use_weights | Boolean | If True, edge weights from the graph are used in the computation (placing more emphasis on stronger edges). |
scanpy.leiden | resolution | Float | A parameter value controlling the coarseness of the clustering. Higher values lead to more clusters. Set to None if overriding partition_type to one that doesn’t accept a resolution_parameter. |
scanpy.leiden | n_iterations | Float | How many iterations of the Leiden clustering algorithm to perform. Positive values above 2 define the total number of iterations to perform, -1 has the algorithm run until it reaches its optimal clustering. |
scanpy.leiden | memory | String | Memory allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.leiden | directed | Boolean | Whether to treat the graph as directed or undirected. |
scanpy.leiden | cpu | String | Number of CPU cores allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.hvg | subset | Boolean | Inplace subset to highly-variable genes if True otherwise merely indicate highly variable genes. |
scanpy.hvg | n_top_genes | Int | Number of highly-variable genes to keep. Mandatory if flavor='seurat_v3'. |
scanpy.hvg | min_mean | Float | Lower cutoff for the mean expression. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor='seurat_v3'. |
scanpy.hvg | min_disp | Float | Lower cutoff for the normalized dispersion. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor='seurat_v3'. |
scanpy.hvg | memory | String | Memory allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy.hvg | max_value | Float | Clip (truncate) to this value after scaling. If None, do not clip. |
scanpy.hvg | max_mean | Float | Upper cutoff for the mean expression. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor='seurat_v3'. |
scanpy.hvg | max_disp | Float | Upper cutoff for the normalized dispersion. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor='seurat_v3'. |
scanpy.hvg | flavor | String | One of ['seurat', 'cell_ranger', 'seurat_v3']. Choose the flavor for identifying highly variable genes. For the dispersion-based methods in their default workflows, Seurat passes the cutoffs whereas Cell Ranger passes n_top_genes. |
scanpy.hvg | cpu | String | Number of CPU cores allocated to the running task. Notice: 1. The CPU value range is 0.25-32 cores; in addition, 48 cores and 64 cores can be selected, and the CPU must be an integer multiple of 0.25 cores. 2. The memory value range is 1GB-512GB, and the memory must be an integer multiple of 1GB. 3. The CPU/memory ratio must be between 1:2 and 1:8. |
scanpy | embed_method | String | Embedding method; choose 'umap' or 'tsne'. |
scanpy | cluster_method | String | Clustering method; choose 'leiden' or 'louvain'. |
Task name | Attribute name | Type | Description |
---|---|---|---|
scanpy | cluster | File | -- |
scanpy | fina_h5ad | File | -- |
scanpy | metadata | File | -- |
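The three resource rules repeated in the cpu and memory rows of the input table can be checked before a run is submitted. The sketch below is a hypothetical helper, not part of the workflow. It assumes that CPU is given as a number of cores, that memory is given as a whole number of GB, and that a ratio of 1:2 to 1:8 means 2 to 8 GB of memory per core.

```python
from fractions import Fraction


def check_resources(cpu, memory_gb):
    """Return the list of violated rules for a CPU (cores) / memory (GB) request."""
    problems = []
    # Exact arithmetic avoids floating-point surprises with 0.25 steps.
    cpu = Fraction(str(cpu))
    # Rule 1: 0.25-32 cores in steps of 0.25, or exactly 48 or 64 cores.
    step_ok = cpu % Fraction(1, 4) == 0
    if not ((Fraction(1, 4) <= cpu <= 32 and step_ok) or cpu in (48, 64)):
        problems.append("cpu")
    # Rule 2: memory is a whole number of GB between 1 and 512.
    if not (isinstance(memory_gb, int) and 1 <= memory_gb <= 512):
        problems.append("memory")
    # Rule 3: between 2 GB and 8 GB of memory per core.
    if not (2 * cpu <= memory_gb <= 8 * cpu):
        problems.append("ratio")
    return problems


ok = check_resources(2, 8)         # valid request
too_small = check_resources(4, 4)  # only 1 GB per core
bad_step = check_resources(0.3, 1) # not a multiple of 0.25 cores
```

An empty list means that the request satisfies all three rules.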