Codeplot
single_cell_scanpy
Introduction

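Scanpy is a widely used, mainstream Python toolkit for single-cell transcriptome analysis. It analyzes single-cell gene expression data and is built jointly with anndata. It covers preprocessing, visualization, clustering, trajectory inference, and differential expression testing, and it efficiently handles datasets of more than one million cells.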

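This workflow is a WDL workflow condensed and extended from the official Scanpy 3k PBMC tutorial. The commonly tuned parameters are exposed so that they can be adjusted for different datasets. Following the steps in the figure below, the pipeline is also split into 10 sub-workflows so that users can debug the result of a single step on its own. You can search for the corresponding workflow in the tool collection and debug it with the corresponding output H5AD file. You can also debug the code with the Scanpy Notebook tool we provide.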

For detailed parameter descriptions, see the Input section below.
[Figure: workflow steps]

3.1.1qc.wdl

This workflow loads the input files and calculates quality control metrics with scanpy.pp.calculate_qc_metrics. It can optionally filter the raw matrix, removing cells that express few genes and genes that are detected in few cells.

3.1.2norm.wdl

This workflow normalizes the counts per cell and logarithmizes the data matrix.
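As a rough illustration of the arithmetic, the sketch below rescales each cell to a common total and then applies log(1 + x). The helper name `normalize_log1p` is hypothetical, and using the median over non-empty cells when `target_sum` is None is an assumption of this sketch that follows the description of the `target_sum` input.

```python
import math
from statistics import median
from typing import List, Optional


def normalize_log1p(
    counts: List[List[float]], target_sum: Optional[float] = None
) -> List[List[float]]:
    """Scale every cell (row) to `target_sum` total counts, then log1p."""
    totals = [sum(row) for row in counts]
    if target_sum is None:
        # Fall back to the median total count of the non-empty cells.
        target_sum = median(t for t in totals if t > 0)

    normalized = []
    for row, total in zip(counts, totals):
        factor = target_sum / total if total else 0.0
        normalized.append([math.log1p(c * factor) for c in row])
    return normalized


# Toy matrix: 3 cells x 2 genes (made-up numbers for illustration only).
toy_counts = [[1, 3], [2, 2], [10, 0]]
log_norm = normalize_log1p(toy_counts, target_sum=1e4)
for row in log_norm:
    print([round(v, 3) for v in row])
```

Undoing the log transform with `math.expm1` and summing a row recovers `target_sum`, which is a quick way to check that a matrix has been normalized this way.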

3.1.3hvg.wdl

This workflow accepts an h5ad file, annotates highly variable genes, regresses out (mostly) unwanted sources of variation, and then scales the data to unit variance and zero mean, using sc.pp.highly_variable_genes, sc.pp.regress_out, and sc.pp.scale.
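The final scaling step can be pictured with the plain-Python sketch below; gene selection and regression are not covered. The function name `scale_genes` is hypothetical. Using the sample standard deviation, leaving constant genes at zero, and clipping only values above `max_value` are assumptions of this sketch, not statements about what the workflow does internally.

```python
from statistics import mean, stdev
from typing import List, Optional


def scale_genes(
    matrix: List[List[float]], max_value: Optional[float] = None
) -> List[List[float]]:
    """Center each gene (column) to zero mean and scale to unit variance."""
    n_genes = len(matrix[0])
    scaled_columns = []
    for j in range(n_genes):
        column = [row[j] for row in matrix]
        mu = mean(column)
        sd = stdev(column)  # sample standard deviation (assumption)
        if sd == 0:
            sd = 1.0        # constant gene: avoid division by zero
        z = [(v - mu) / sd for v in column]
        if max_value is not None:
            z = [min(v, max_value) for v in z]  # clip large values
        scaled_columns.append(z)
    # Transpose back to a cells x genes layout.
    return [list(row) for row in zip(*scaled_columns)]


# Toy matrix: 3 cells x 2 genes; the second gene is constant.
toy = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
print(scale_genes(toy))
print(scale_genes(toy, max_value=0.5))
```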

3.1.4pca.wdl

This workflow runs principal component analysis (PCA).

3.1.5neighbors.wdl

This workflow computes a neighborhood graph of the observations.
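To make the idea of a neighborhood graph concrete, here is a brute-force k-nearest-neighbor sketch in plain Python. It is only an illustration: the function name `knn_graph` is hypothetical, Euclidean distance is assumed, a point is not counted as its own neighbor, and no connectivity weights are computed.

```python
import math
from typing import Dict, List, Sequence


def knn_graph(
    points: Sequence[Sequence[float]], n_neighbors: int
) -> Dict[int, List[int]]:
    """Map each point index to the indices of its nearest neighbors."""
    graph = {}
    for i, p in enumerate(points):
        # Distances from point i to every other point, nearest first.
        ranked = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph[i] = [j for _, j in ranked[:n_neighbors]]
    return graph


# Toy coordinates, e.g. cells in a 2-D reduced space (made-up numbers).
cells = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (0.2, 0.0)]
neighbors = knn_graph(cells, n_neighbors=2)
for cell, nbrs in neighbors.items():
    print(cell, "->", nbrs)
```

A brute-force search like this scales quadratically with the number of cells, so it only suits small examples.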

umap.wdl/tsne.wdl

This workflow embeds the neighborhood graph using UMAP or t-SNE (nonlinear dimensionality reduction).

leiden.wdl/louvain.wdl

This workflow clusters cells into subgroups.

maker.wdl

This workflow ranks genes for characterizing groups, that is, it ranks the highly differential genes in each cluster.
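One way to picture the ranking is a one-versus-rest comparison for each gene. The sketch below scores genes with a Welch t-statistic between one cluster and all remaining cells. It is a simplified illustration in the spirit of a t-test method, not the workflow's implementation; the function name `rank_genes` and the toy data are hypothetical, and no p-values or multiple-testing correction are computed.

```python
import math
from statistics import mean, variance
from typing import List, Optional, Tuple


def rank_genes(
    expr: List[List[float]],    # rows are cells, columns are genes
    labels: List[str],          # cluster label of each cell
    group: str,
    gene_names: List[str],
    n_genes: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Rank genes by Welch t-statistic of `group` versus all other cells."""
    inside = [row for row, lab in zip(expr, labels) if lab == group]
    outside = [row for row, lab in zip(expr, labels) if lab != group]

    scores = []
    for j, name in enumerate(gene_names):
        a = [row[j] for row in inside]
        b = [row[j] for row in outside]
        # Standard error of the difference in means (unequal variances).
        se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
        t = (mean(a) - mean(b)) / se if se > 0 else 0.0
        scores.append((name, t))

    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:n_genes]  # n_genes=None keeps all genes


# Toy log-expression values for 4 cells in 2 clusters (made-up numbers).
marker_genes = ["CD3D", "ACTB", "LYZ"]
expr = [
    [5.0, 1.0, 0.0],
    [6.0, 1.2, 0.1],
    [0.5, 1.1, 4.0],
    [0.4, 0.9, 5.0],
]
labels = ["0", "0", "1", "1"]
for name, score in rank_genes(expr, labels, "0", marker_genes):
    print(name, round(score, 2))
```

Each group needs at least two cells here, because the sample variance is undefined for a single value.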

Contact us

This tool is provided by the China National GeneBank team. For any questions or concerns, please contact CNGBdb@cngb.org

Script
Input
Task name | Attribute name | Type | Description
* scanpy | project_name | String | Project name.
* scanpy | infile | File | Input matrix file.
* scanpy | filetype | String | File type of the input file: csv or tsv.
* scanpy.umap | spread | Float | The effective scale of embedded points. In combination with min_dist, this determines how clustered/clumped the embedded points are.
* scanpy.umap | negative_sample_rate | Int | The number of negative edge/1-simplex samples to use per positive edge/1-simplex sample in optimizing the low-dimensional embedding.
* scanpy.umap | n_components | Int | The number of dimensions of the embedding.
* scanpy.umap | min_dist | Float | The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.
* scanpy.umap | memory | String | Memory allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.umap | init_pos | String | How to initialize the low-dimensional embedding: one of 'paga', 'spectral', 'random', an ndarray, or None (default: 'spectral').
* scanpy.umap | gamma | Float | Weighting applied to negative samples in low-dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.
* scanpy.umap | cpu | String | Number of CPU cores allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.umap | alpha | Float | The initial learning rate for the embedding optimization.
* scanpy.tsne | perplexity | Float | The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. The choice is not extremely critical since t-SNE is quite insensitive to this parameter.
* scanpy.tsne | memory | String | Memory allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.tsne | learning_rate | Float | Note that the R package 'Rtsne' uses a default of 200. The learning rate can be a critical parameter. It should be between 100 and 1000. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high. If the cost function gets stuck in a bad local minimum, increasing the learning rate sometimes helps.
* scanpy.tsne | early_exaggeration | Float | Controls how tight natural clusters in the original space are in the embedded space and how much space will be between them. For larger values, the space between natural clusters will be larger in the embedded space. Again, the choice of this parameter is not very critical. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high.
* scanpy.tsne | cpu | String | Number of CPU cores allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.qc | per_mt | Int | Maximum percentage of mitochondrial gene expression allowed for a cell to pass filtering.
* scanpy.qc | min_gene | Int | Minimum number of expressed genes required for a cell to pass filtering.
* scanpy.qc | min_cell | Int | Minimum number of cells in which a gene must be expressed to pass filtering.
* scanpy.qc | memory | String | Memory allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.qc | genes | Int | Maximum number of expressed genes allowed for a cell to pass filtering.
* scanpy.qc | cpu | String | Number of CPU cores allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.pca | zero_center | Boolean | If True, compute standard PCA from the covariance matrix. If False, omit zero-centering variables (uses TruncatedSVD), which allows sparse input to be handled efficiently. Passing None decides automatically based on the sparseness of the data.
* scanpy.pca | use_highly_variable | Boolean | Whether to use highly variable genes only, stored in .var['highly_variable']. By default, uses them if they have been determined beforehand.
* scanpy.pca | svd_solver | String | SVD solver to use.
* scanpy.pca | n_comps | Int | Number of principal components to compute. Defaults to 50, or to the minimum dimension size of the selected representation minus 1.
* scanpy.pca | memory | String | Memory allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.pca | cpu | String | Number of CPU cores allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.normalize | target_sum | Float | If None, after normalization, each observation (cell) has a total count equal to the median of total counts for observations (cells) before normalization.
* scanpy.normalize | memory | String | Memory allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.normalize | cpu | String | Number of CPU cores allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.neighbors | use_rep | String | Use the indicated representation. 'X' or any key for .obsm is valid. If None, the representation is chosen automatically: for .n_vars < 50, .X is used, otherwise 'X_pca' is used. If 'X_pca' is not present, it is computed with default parameters.
* scanpy.neighbors | n_pcs | Int | Use this many PCs. If n_pcs==0, use .X if use_rep is None.
* scanpy.neighbors | n_neighbors | Int | The size of the local neighborhood (in terms of number of neighboring data points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general, values should be in the range 2 to 100. If knn is True, this is the number of nearest neighbors to be searched. If knn is False, a Gaussian kernel width is set to the distance of the n_neighbors neighbor.
* scanpy.neighbors | method | String | Use 'umap' or 'gauss' for computing connectivities. Use 'rapids' for the RAPIDS implementation of UMAP (experimental, GPU only).
* scanpy.neighbors | memory | String | Memory allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.neighbors | knn | Boolean | If True, use a hard threshold to restrict the number of neighbors to n_neighbors, that is, consider a knn graph. Otherwise, use a Gaussian kernel to assign low weights to neighbors more distant than the n_neighbors nearest neighbor.
* scanpy.neighbors | key_added | String | If not specified, the neighbors data is stored in .uns['neighbors'], and distances and connectivities are stored in .obsp['distances'] and .obsp['connectivities'] respectively. If specified, the neighbors data is added to .uns[key_added], distances are stored in .obsp[key_added+'_distances'], and connectivities in .obsp[key_added+'_connectivities'].
* scanpy.neighbors | cpu | String | Number of CPU cores allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.marker | n_genes | Int | The number of genes that appear in the returned tables. Defaults to all genes.
* scanpy.marker | method | String | The default method is 't-test'; 't-test_overestim_var' overestimates the variance of each group; 'wilcoxon' uses the Wilcoxon rank-sum test; 'logreg' uses logistic regression. See [Ntranos18] for why this is meaningful.
* scanpy.marker | memory | String | Memory allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.marker | cpu | String | Number of CPU cores allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.louvain | use_weights | Boolean | Use weights from the knn graph.
* scanpy.louvain | resolution | Float | For the default flavor ('vtraag'), you can provide a resolution (higher resolution means finding more and smaller clusters), which defaults to 1.0.
* scanpy.louvain | memory | String | Memory allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.louvain | flavor | String | Allowed values: 'vtraag', 'igraph', 'rapids'. Choose between packages for computing the clustering. 'vtraag' is much more powerful, and the default.
* scanpy.louvain | directed | Boolean | Whether to interpret the adjacency matrix as a directed graph.
* scanpy.louvain | cpu | String | Number of CPU cores allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.leiden | use_weights | Boolean | If True, edge weights from the graph are used in the computation (placing more emphasis on stronger edges).
* scanpy.leiden | resolution | Float | A parameter value controlling the coarseness of the clustering. Higher values lead to more clusters. Set to None if overriding partition_type to one that doesn't accept a resolution_parameter.
* scanpy.leiden | n_iterations | Float | How many iterations of the Leiden clustering algorithm to perform. Positive values above 2 define the total number of iterations to perform; -1 runs the algorithm until it reaches its optimal clustering.
* scanpy.leiden | memory | String | Memory allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.leiden | directed | Boolean | Whether to treat the graph as directed or undirected.
* scanpy.leiden | cpu | String | Number of CPU cores allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.hvg | subset | Boolean | If True, subset in place to the highly variable genes; otherwise only flag the highly variable genes.
* scanpy.hvg | n_top_genes | Int | Number of highly variable genes to keep. Mandatory if flavor='seurat_v3'.
* scanpy.hvg | min_mean | Float | Minimum mean cutoff. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor='seurat_v3'.
* scanpy.hvg | min_disp | Float | Minimum normalized dispersion cutoff. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor='seurat_v3'.
* scanpy.hvg | memory | String | Memory allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy.hvg | max_value | Float | Clip (truncate) to this value after scaling. If None, do not clip.
* scanpy.hvg | max_mean | Float | Maximum mean cutoff. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor='seurat_v3'.
* scanpy.hvg | max_disp | Float | Maximum normalized dispersion cutoff. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor='seurat_v3'.
* scanpy.hvg | flavor | String | One of 'seurat', 'cell_ranger', 'seurat_v3'. Choose the flavor for identifying highly variable genes. For the dispersion-based methods in their default workflows, Seurat passes the cutoffs whereas Cell Ranger passes n_top_genes.
* scanpy.hvg | cpu | String | Number of CPU cores allocated to the task. Notice: 1. The CPU value range is 0.25-32 cores; 48 and 64 cores can also be selected, and the CPU value must be an integer multiple of 0.25 cores. 2. The memory value range is 1 GB-512 GB, and the memory must be an integer multiple of 1 GB. 3. The CPU/memory ratio must be between 1:2 and 1:8.
* scanpy | embed_method | String | Embedding method: 'umap' or 'tsne'.
* scanpy | cluster_method | String | Clustering method: 'leiden' or 'louvain'.
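The resource rules attached to every cpu and memory input can be checked before a run is submitted. The sketch below is a hypothetical helper, not part of the workflow: the function name `check_resources` is invented, and it takes numeric values because the source does not say how the String inputs are formatted. It reads the 1:2 to 1:8 CPU/memory ratio as 2 to 8 GB of memory per core.

```python
from typing import List


def check_resources(cpu: float, memory_gb: float) -> List[str]:
    """Return a list of rule violations; an empty list means valid."""
    problems = []
    # Rule 1: CPU is a multiple of 0.25, within 0.25-32, or exactly 48 or 64.
    if cpu <= 0 or cpu * 4 != int(cpu * 4):
        problems.append("CPU must be a positive integer multiple of 0.25 cores")
    if not (0.25 <= cpu <= 32 or cpu in (48, 64)):
        problems.append("CPU must be 0.25-32 cores, or 48 or 64 cores")
    # Rule 2: memory is a whole number of GB within 1-512.
    if memory_gb != int(memory_gb) or not 1 <= memory_gb <= 512:
        problems.append("memory must be an integer between 1 and 512 GB")
    # Rule 3: CPU/memory ratio between 1:2 and 1:8 (interpretation: GB per core).
    if cpu > 0 and not 2 <= memory_gb / cpu <= 8:
        problems.append("CPU/memory ratio must be between 1:2 and 1:8")
    return problems


for cpu, mem in [(2, 8), (4, 4), (40, 160), (0.3, 1)]:
    issues = check_resources(cpu, mem)
    print(cpu, mem, "OK" if not issues else issues)
```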
Output
Task name | Attribute name | Type | Description
* scanpy | cluster | File | --
* scanpy | fina_h5ad | File | --
* scanpy | metadata | File | --