Codeplot
国家基因库单细胞训练营 -- 单细胞ATAC数据下游分析

scATAC-seq Advanced data analysis

1. 获取Peak Set

1.1 获取pseudo bulk重复样本

因为 scATAC-seq 数据本质上是二元的——这意味着任何单个基因座都可以开放或者是不开放的,基于在单个细胞上进行的peakcalling几乎上是不可能。此外,许多分析都需要重复才具有统计显着性。在单细胞ATAC数据中,我们通过创建 pseudo bulk 来解决这些问题。pseudo-bulk是指将来自同一组数据组合成一个类似于bulk ATAC-seq 实验的模拟样本。ArchR 为每个所需的细胞分组制作多个这样的pseudo bulk样本,这个过程的基本假设是,被组合在一起的单个细胞非常相似,因此不考虑它们之间的差异。
为了创建pseudo bulk 重复样本,ArchR 采用了决策树方法。需要指定 (i) 所需的最小和最大重复样本数,(ii) 每个模拟bulk样本的最小和最大细胞数,以及 (iii) 如果特定分组缺乏足够的细胞时,使用的不放回的随机抽样的方法获取pseudo bulk 样本时的采样率。例如,0.8 的采样率意味着对分组中总细胞进行不放回的随机抽样,每次抽取的细胞数最多可达总细胞数的80%。在这种情况下,将会导致多个重复样本中可能包含一些相同的细胞,但如果想从缺乏足够细胞的细胞组中生成pseudo bulk重复样本,这是必要的牺牲。

首先,需要确定要使用的group——可以是Cluster 也可以是celltype。然后,对group中每一个Cluster/cell type,ArchR尝试创建所需的bulk ATAC-seq。理想的模拟bulk ATAC-seq样本将由单个样品中的足够数量的细胞组成。这保持了样本多样性和重复之间的生物学变化。实际上,在此过程中有5个可能的结果:

  1. 数据拥有足够多的样本(至少要等于maxRep定义的重复),且每个样本的细胞数都大于 minCell, 于是每个样本都可以作为拟混池重复,每个重复的细胞都来自于同一个样本。
  2. 一些样本的细胞数超过minCell,因此能够单独形成一个重复。剩下的重复则是通过将余下的细胞进行混合,然后通过无放回抽样得到。
  3. 数据里没有一个样本的细胞数超过minCell, 但是总细胞数超过minCells * minReps。因此将所有的细胞进行混合,然后进行无放回抽样,抽样时不考虑细胞来源。
  4. 一个细胞分组中的总细胞数低于 minCells * minReps,但是大于minCells / Sample Ratio。此时单个样本的构建采取无放回抽样,重复间则需要有放回抽样,降低多个拟混池重复间的相同细胞数。
  5. 一个细胞分组中的总细胞数低于 minCells / Sample Ratio 。这意味着我们必须在单个重复和跨重复中都采取有放回抽样策略。这是最糟糕的情况,后续在使用这些模拟的bulk样本做下游分析分析要特别小心。后续可以通过设置ArchR的minCells参数进行淘汰。

为了说明此过程,我们将使用以下示例数据集:

Sample Cluster1 Cluster2 Cluster3 Cluster4 Cluster5
A 800 600 900 100 75
B 1000 50 400 150 25
C 600 900 100 200 50
D 1200 500 50 50 25
E 900 100 50 150 50
F 700 200 100 100 25

设置minRep = 3,maxRep = 5, minCells = 300,maxCells = 1000,sampleRatio = 0.8
Cluster1

对于Cluster 1,我们有6个样品(超过maxRep),所有样本都比minCells细胞(300个单元)更多。这说明了上面的选项#1,我们将以示例意识的方式进行5个模拟bulk ATAC-seq:

  • Rep1 = 800 cells from SampleA
  • Rep2 = 1000 cells from SampleB
  • Rep3 = 1000 cells from SampleD
  • Rep4 = 900 cells from SampleE
  • Rep5 = 700 cells from SampleF

关于这些重复的有两件事要注意:(i)被排除在外,因为我们有足够多的样品来使maxRep样品吸引的伪bulk-seq,而SampleC的细胞数量最少。(ii)从采样中仅使用了1000个单元,因为这是maxCells值。

Cluster2

对于cluster2,我们有3个样品,所有样本都比minCells细胞多,还有一些其他样品。这说明了上面的选项#2,我们将进行以下伪bulk重复:

  • Rep1 = 600 cells from SampleA
  • Rep2 = 900 cells from SampleC
  • Rep3 = 500 cells from SampleD
  • Rep4 = 350 cells [50 cells from SampleB + 100 from SampleE + 200 from SampleF]

在此示例中,Rep4通过不放回抽样的方式创建。
Cluster3

对于cluster3,我们只有2个样品,这些样品的minCells细胞大于所需的细胞minReps。但是,如果我们将剩余样品中的单元组合在一起,则可以使一个额外的复制量超过minCells。这为我们提供了总共3个伪蓝色复制,并表示上面选项3所示的情况。我们将进行以下重复:

  • Rep1 = 900 cells from SampleA
  • Rep2 = 400 cells from SampleB
  • Rep3 = 250 cells [100 cells from SampleC + 50 from SampleD + 50 from SampleE + 50 from SampleF]

与上面的cluster2相似,cluster3 rep3是通过样本不可知的方式创建的,而无需替换多个样本。
Cluster4

对于Cluster4,细胞总数为750,小于minCells * minReps(900个单元)。minReps在这种情况下,至少minCells没有某种形式的采样,我们没有足够的细胞来制作。但是,总细胞仍然大于minCells / sampleRatio(375个细胞),这意味着我们只需要在不同的伪膨化重复中进行替换,而不是在单个复制中进行采样。这表示上面选项4中所示的情况,因此我们将进行以下重复:

  • Rep1 = 300 cells [250 unique cells + 25 cells overlapping Rep2 + 25 cells overlapping Rep3]
  • Rep2 = 300 cells [250 unique cells + 25 cells overlapping Rep1 + 25 cells overlapping Rep3]
  • Rep3 = 300 cells [250 unique cells + 25 cells overlapping Rep1 + 25 cells overlapping Rep2]

在这种情况下,ARCHR将最大程度地减少任何两个伪卵形复制之间重叠的单元数量。
Cluster 5

对于Cluster5,细胞总数为250,小于minCells * minReps(900个细胞),小于minCells / sampleRatio(375个细胞)。这意味着我们必须在每个样本中和跨不同的重复次数中进行替换,以进行伪块复制。这代表了上面选项5中所示的最不可取的情况,因此,我们应该在下游分析中使用这些伪膨胀重复。因此,我们将进行以下重复:

  • Rep1 = 300 cells [250 unique cells + 25 cells overlapping Rep2 + 25 cells overlapping Rep3]
  • Rep2 = 300 cells [250 unique cells + 25 cells overlapping Rep1 + 25 cells overlapping Rep3]
  • Rep3 = 300 cells [250 unique cells + 25 cells overlapping Rep1 + 25 cells overlapping Rep2]
    In [19]:
library(ArchR)
library(parallel)
library(Seurat)

library(pryr) 
addArchRThreads(threads = 10)
set.seed(1)
addArchRGenome("hg38")
library(BSgenome.Hsapiens.UCSC.hg38)

Out [19]:
Setting default number of Parallel threads to 10.

Setting default genome to Hg38.

Loading required package: BSgenome

Loading required package: rtracklayer

In [23]:

# 读入做完细胞类型注释的ArchR object


proj <- readRDS("/opt/example/00.scATAC_data/proj.rds")
setwd('/home/jovyan')

In [24]:

proj

Out [24]:

       ___      .______        ______  __    __  .______      
      /   \     |   _  \      /      ||  |  |  | |   _  \     
     /  ^  \    |  |_)  |    |  ,----'|  |__|  | |  |_)  |    
    /  /_\  \   |      /     |  |     |   __   | |      /     
   /  _____  \  |  |\  \\___ |  `----.|  |  |  | |  |\  \\___.
  /__/     \__\ | _| `._____| \______||__|  |__| | _| `._____|

In [25]:

proj$celltype <- proj$predictedsubGroup
table(proj$celltype)

In [30]:

valid_celltype <- names(table(proj$celltype)[table(proj$celltype) > 100])
valid_celltype
proj <- proj[which(proj$celltype  %in% valid_celltype),]
proj
, ,
  1. 'B'
  2. 'CD14+ Mono'
  3. 'CD8 T'
  4. 'Memory CD4 T'
  5. 'Naive CD4 T'
  6. 'NK'
,

Out [30]:
Dropping ImputeWeights Since You Are Subsetting Cells! ImputeWeights is a cell-x-cell Matrix!

       ___      .______        ______  __    __  .______      
      /   \     |   _  \      /      ||  |  |  | |   _  \     
     /  ^  \    |  |_)  |    |  ,----'|  |__|  | |  |_)  |    
    /  /_\  \   |      /     |  |     |   __   | |      /     
   /  _____  \  |  |\  \\___ |  `----.|  |  |  | |  |\  \\___.
  /__/     \__\ | _| `._____| \______||__|  |__| | _| `._____|

In [31]:

proj <- addGroupCoverages(
  ArchRProj = proj ,
  groupBy = "celltype",
  minRep=3,
  useLabels = TRUE,
  force = TRUE,
  threads = 1
)

Out [31]:
ArchR logging to : ArchRLogs/ArchR-addGroupCoverages-45f1fd2182f-Date-2022-08-24_Time-02-54-48.log
If there is an issue, please report to github with logFile!

B (1 of 6) : CellGroups N = 2

CD8 T (2 of 6) : CellGroups N = 2

CD14+ Mono (3 of 6) : CellGroups N = 2

Memory CD4 T (4 of 6) : CellGroups N = 2

Naive CD4 T (5 of 6) : CellGroups N = 2

NK (6 of 6) : CellGroups N = 2

2022-08-24 02:54:48 : Creating Coverage Files!, 0.008 mins elapsed.

2022-08-24 02:54:48 : Batch Execution w/ safelapply!, 0.008 mins elapsed.

2022-08-24 02:54:48 : Group B..Rep1 (1 of 12) : Creating Group Coverage File : B..Rep1.insertions.coverage.h5, 0.008 mins elapsed.

Number of Cells = 310

Coverage File Exists!

Added Coverage Group

Added Metadata Group

Added ArrowCoverage Class

Added Coverage/Info

Added Coverage/Info/CellNames

2022-08-24 02:55:11 : Group B..Rep2 (2 of 12) : Creating Group Coverage File : B..Rep2.insertions.coverage.h5, 0.393 mins elapsed.

Number of Cells = 80

Coverage File Exists!

Added Coverage Group

Added Metadata Group

Added ArrowCoverage Class

Added Coverage/Info

Added Coverage/Info/CellNames

2022-08-24 02:55:33 : Group CD8 T..Rep1 (3 of 12) : Creating Group Coverage File : CD8.T..Rep1.insertions.coverage.h5, 0.756 mins elapsed.

Number of Cells = 500

Coverage File Exists!

Added Coverage Group

Added Metadata Group

Added ArrowCoverage Class

Added Coverage/Info

Added Coverage/Info/CellNames

2022-08-24 02:55:57 : Group CD8 T..Rep2 (4 of 12) : Creating Group Coverage File : CD8.T..Rep2.insertions.coverage.h5, 1.149 mins elapsed.

Number of Cells = 80

Coverage File Exists!

Added Coverage Group

Added Metadata Group

Added ArrowCoverage Class

Added Coverage/Info

Added Coverage/Info/CellNames

2022-08-24 02:56:18 : Group CD14+ Mono..Rep1 (5 of 12) : Creating Group Coverage File : CD14..Mono..Rep1.insertions.coverage.h5, 1.511 mins elapsed.

Number of Cells = 254

Coverage File Exists!

Added Coverage Group

Added Metadata Group

Added ArrowCoverage Class

Added Coverage/Info

Added Coverage/Info/CellNames

2022-08-24 02:56:41 : Group CD14+ Mono..Rep2 (6 of 12) : Creating Group Coverage File : CD14..Mono..Rep2.insertions.coverage.h5, 1.893 mins elapsed.

Number of Cells = 80

Coverage File Exists!

Added Coverage Group

Added Metadata Group

Added ArrowCoverage Class

Added Coverage/Info

Added Coverage/Info/CellNames

2022-08-24 02:57:03 : Group Memory CD4 T..Rep1 (7 of 12) : Creating Group Coverage File : Memory.CD4.T..Rep1.insertions.coverage.h5, 2.257 mins elapsed.

Number of Cells = 500

Coverage File Exists!

Added Coverage Group

Added Metadata Group

Added ArrowCoverage Class

Added Coverage/Info

Added Coverage/Info/CellNames

2022-08-24 02:57:27 : Group Memory CD4 T..Rep2 (8 of 12) : Creating Group Coverage File : Memory.CD4.T..Rep2.insertions.coverage.h5, 2.659 mins elapsed.

Number of Cells = 80

Coverage File Exists!

Added Coverage Group

Added Metadata Group

Added ArrowCoverage Class

Added Coverage/Info

Added Coverage/Info/CellNames

2022-08-24 02:57:49 : Group Naive CD4 T..Rep1 (9 of 12) : Creating Group Coverage File : Naive.CD4.T..Rep1.insertions.coverage.h5, 3.02 mins elapsed.

Number of Cells = 500

Coverage File Exists!

Added Coverage Group

Added Metadata Group

Added ArrowCoverage Class

Added Coverage/Info

Added Coverage/Info/CellNames

2022-08-24 02:58:13 : Group Naive CD4 T..Rep2 (10 of 12) : Creating Group Coverage File : Naive.CD4.T..Rep2.insertions.coverage.h5, 3.417 mins elapsed.

Number of Cells = 80

Coverage File Exists!

Added Coverage Group

Added Metadata Group

Added ArrowCoverage Class

Added Coverage/Info

Added Coverage/Info/CellNames

2022-08-24 02:58:34 : Group NK..Rep1 (11 of 12) : Creating Group Coverage File : NK..Rep1.insertions.coverage.h5, 3.779 mins elapsed.

Number of Cells = 500

Coverage File Exists!

Added Coverage Group

Added Metadata Group

Added ArrowCoverage Class

Added Coverage/Info

Added Coverage/Info/CellNames

2022-08-24 02:58:58 : Group NK..Rep2 (12 of 12) : Creating Group Coverage File : NK..Rep2.insertions.coverage.h5, 4.171 mins elapsed.

Number of Cells = 80

Coverage File Exists!

Added Coverage Group

Added Metadata Group

Added ArrowCoverage Class

Added Coverage/Info

Added Coverage/Info/CellNames

2022-08-24 02:59:20 : Adding Kmer Bias to Coverage Files!, 4.531 mins elapsed.

Kmer Bias chr1 (1 of 24)

chr1
Coverage File chr1 (1 of 12)

Coverage File chr1 (2 of 12)

Coverage File chr1 (3 of 12)

Coverage File chr1 (4 of 12)

Coverage File chr1 (5 of 12)

Coverage File chr1 (6 of 12)

Coverage File chr1 (7 of 12)

Coverage File chr1 (8 of 12)

Coverage File chr1 (9 of 12)

Coverage File chr1 (10 of 12)

Coverage File chr1 (11 of 12)

Coverage File chr1 (12 of 12)

Kmer Bias chr10 (2 of 24)

chr10
Coverage File chr10 (1 of 12)

Coverage File chr10 (2 of 12)

Coverage File chr10 (3 of 12)

Coverage File chr10 (4 of 12)

Coverage File chr10 (5 of 12)

Coverage File chr10 (6 of 12)

Coverage File chr10 (7 of 12)

Coverage File chr10 (8 of 12)

Coverage File chr10 (9 of 12)

Coverage File chr10 (10 of 12)

Coverage File chr10 (11 of 12)

Coverage File chr10 (12 of 12)

Kmer Bias chr11 (3 of 24)

chr11
Coverage File chr11 (1 of 12)

Coverage File chr11 (2 of 12)

Coverage File chr11 (3 of 12)

Coverage File chr11 (4 of 12)

Coverage File chr11 (5 of 12)

Coverage File chr11 (6 of 12)

Coverage File chr11 (7 of 12)

Coverage File chr11 (8 of 12)

Coverage File chr11 (9 of 12)

Coverage File chr11 (10 of 12)

Coverage File chr11 (11 of 12)

Coverage File chr11 (12 of 12)

Kmer Bias chr12 (4 of 24)

chr12
Coverage File chr12 (1 of 12)

Coverage File chr12 (2 of 12)

Coverage File chr12 (3 of 12)

Coverage File chr12 (4 of 12)

Coverage File chr12 (5 of 12)

Coverage File chr12 (6 of 12)

Coverage File chr12 (7 of 12)

Coverage File chr12 (8 of 12)

Coverage File chr12 (9 of 12)

Coverage File chr12 (10 of 12)

Coverage File chr12 (11 of 12)

Coverage File chr12 (12 of 12)

Kmer Bias chr13 (5 of 24)

chr13
Coverage File chr13 (1 of 12)

Coverage File chr13 (2 of 12)

Coverage File chr13 (3 of 12)

Coverage File chr13 (4 of 12)

Coverage File chr13 (5 of 12)

Coverage File chr13 (6 of 12)

Coverage File chr13 (7 of 12)

Coverage File chr13 (8 of 12)

Coverage File chr13 (9 of 12)

Coverage File chr13 (10 of 12)

Coverage File chr13 (11 of 12)

Coverage File chr13 (12 of 12)

Kmer Bias chr14 (6 of 24)

chr14
Coverage File chr14 (1 of 12)

Coverage File chr14 (2 of 12)

Coverage File chr14 (3 of 12)

Coverage File chr14 (4 of 12)

Coverage File chr14 (5 of 12)

Coverage File chr14 (6 of 12)

Coverage File chr14 (7 of 12)

Coverage File chr14 (8 of 12)

Coverage File chr14 (9 of 12)

Coverage File chr14 (10 of 12)

Coverage File chr14 (11 of 12)

Coverage File chr14 (12 of 12)

Kmer Bias chr15 (7 of 24)

chr15
Coverage File chr15 (1 of 12)

Coverage File chr15 (2 of 12)

Coverage File chr15 (3 of 12)

Coverage File chr15 (4 of 12)

Coverage File chr15 (5 of 12)

Coverage File chr15 (6 of 12)

Coverage File chr15 (7 of 12)

Coverage File chr15 (8 of 12)

Coverage File chr15 (9 of 12)

Coverage File chr15 (10 of 12)

Coverage File chr15 (11 of 12)

Coverage File chr15 (12 of 12)

Kmer Bias chr16 (8 of 24)

chr16
Coverage File chr16 (1 of 12)

Coverage File chr16 (2 of 12)

Coverage File chr16 (3 of 12)

Coverage File chr16 (4 of 12)

Coverage File chr16 (5 of 12)

Coverage File chr16 (6 of 12)

Coverage File chr16 (7 of 12)

Coverage File chr16 (8 of 12)

Coverage File chr16 (9 of 12)

Coverage File chr16 (10 of 12)

Coverage File chr16 (11 of 12)

Coverage File chr16 (12 of 12)

Kmer Bias chr17 (9 of 24)

chr17
Coverage File chr17 (1 of 12)

Coverage File chr17 (2 of 12)

Coverage File chr17 (3 of 12)

Coverage File chr17 (4 of 12)

Coverage File chr17 (5 of 12)

Coverage File chr17 (6 of 12)

Coverage File chr17 (7 of 12)

Coverage File chr17 (8 of 12)

Coverage File chr17 (9 of 12)

Coverage File chr17 (10 of 12)

Coverage File chr17 (11 of 12)

Coverage File chr17 (12 of 12)

Kmer Bias chr18 (10 of 24)

chr18
Coverage File chr18 (1 of 12)

Coverage File chr18 (2 of 12)

Coverage File chr18 (3 of 12)

Coverage File chr18 (4 of 12)

Coverage File chr18 (5 of 12)

Coverage File chr18 (6 of 12)

Coverage File chr18 (7 of 12)

Coverage File chr18 (8 of 12)

Coverage File chr18 (9 of 12)

Coverage File chr18 (10 of 12)

Coverage File chr18 (11 of 12)

Coverage File chr18 (12 of 12)

Kmer Bias chr19 (11 of 24)

chr19
Coverage File chr19 (1 of 12)

Coverage File chr19 (2 of 12)

Coverage File chr19 (3 of 12)

Coverage File chr19 (4 of 12)

Coverage File chr19 (5 of 12)

Coverage File chr19 (6 of 12)

Coverage File chr19 (7 of 12)

Coverage File chr19 (8 of 12)

Coverage File chr19 (9 of 12)

Coverage File chr19 (10 of 12)

Coverage File chr19 (11 of 12)

Coverage File chr19 (12 of 12)

Kmer Bias chr2 (12 of 24)

chr2
Coverage File chr2 (1 of 12)

Coverage File chr2 (2 of 12)

Coverage File chr2 (3 of 12)

Coverage File chr2 (4 of 12)

Coverage File chr2 (5 of 12)

Coverage File chr2 (6 of 12)

Coverage File chr2 (7 of 12)

Coverage File chr2 (8 of 12)

Coverage File chr2 (9 of 12)

Coverage File chr2 (10 of 12)

Coverage File chr2 (11 of 12)

Coverage File chr2 (12 of 12)

Kmer Bias chr20 (13 of 24)

chr20
Coverage File chr20 (1 of 12)

Coverage File chr20 (2 of 12)

Coverage File chr20 (3 of 12)

Coverage File chr20 (4 of 12)

Coverage File chr20 (5 of 12)

Coverage File chr20 (6 of 12)

Coverage File chr20 (7 of 12)

Coverage File chr20 (8 of 12)

Coverage File chr20 (9 of 12)

Coverage File chr20 (10 of 12)

Coverage File chr20 (11 of 12)

Coverage File chr20 (12 of 12)

Kmer Bias chr21 (14 of 24)

chr21
Coverage File chr21 (1 of 12)

Coverage File chr21 (2 of 12)

Coverage File chr21 (3 of 12)

Coverage File chr21 (4 of 12)

Coverage File chr21 (5 of 12)

Coverage File chr21 (6 of 12)

Coverage File chr21 (7 of 12)

Coverage File chr21 (8 of 12)

Coverage File chr21 (9 of 12)

Coverage File chr21 (10 of 12)

Coverage File chr21 (11 of 12)

Coverage File chr21 (12 of 12)

Kmer Bias chr22 (15 of 24)

chr22
Coverage File chr22 (1 of 12)

Coverage File chr22 (2 of 12)

Coverage File chr22 (3 of 12)

Coverage File chr22 (4 of 12)

Coverage File chr22 (5 of 12)

Coverage File chr22 (6 of 12)

Coverage File chr22 (7 of 12)

Coverage File chr22 (8 of 12)

Coverage File chr22 (9 of 12)

Coverage File chr22 (10 of 12)

Coverage File chr22 (11 of 12)

Coverage File chr22 (12 of 12)

Kmer Bias chr3 (16 of 24)

chr3
Coverage File chr3 (1 of 12)

Coverage File chr3 (2 of 12)

Coverage File chr3 (3 of 12)

Coverage File chr3 (4 of 12)

Coverage File chr3 (5 of 12)

Coverage File chr3 (6 of 12)

Coverage File chr3 (7 of 12)

Coverage File chr3 (8 of 12)

Coverage File chr3 (9 of 12)

Coverage File chr3 (10 of 12)

Coverage File chr3 (11 of 12)

Coverage File chr3 (12 of 12)

Kmer Bias chr4 (17 of 24)

chr4
Coverage File chr4 (1 of 12)

Coverage File chr4 (2 of 12)

Coverage File chr4 (3 of 12)

Coverage File chr4 (4 of 12)

Coverage File chr4 (5 of 12)

Coverage File chr4 (6 of 12)

Coverage File chr4 (7 of 12)

Coverage File chr4 (8 of 12)

Coverage File chr4 (9 of 12)

Coverage File chr4 (10 of 12)

Coverage File chr4 (11 of 12)

Coverage File chr4 (12 of 12)

Kmer Bias chr5 (18 of 24)

chr5
Coverage File chr5 (1 of 12)

Coverage File chr5 (2 of 12)

Coverage File chr5 (3 of 12)

Coverage File chr5 (4 of 12)

Coverage File chr5 (5 of 12)

Coverage File chr5 (6 of 12)

Coverage File chr5 (7 of 12)

Coverage File chr5 (8 of 12)

Coverage File chr5 (9 of 12)

Coverage File chr5 (10 of 12)

Coverage File chr5 (11 of 12)

Coverage File chr5 (12 of 12)

Kmer Bias chr6 (19 of 24)

chr6
Coverage File chr6 (1 of 12)

Coverage File chr6 (2 of 12)

Coverage File chr6 (3 of 12)

Coverage File chr6 (4 of 12)

Coverage File chr6 (5 of 12)

Coverage File chr6 (6 of 12)

Coverage File chr6 (7 of 12)

Coverage File chr6 (8 of 12)

Coverage File chr6 (9 of 12)

Coverage File chr6 (10 of 12)

Coverage File chr6 (11 of 12)

Coverage File chr6 (12 of 12)

Kmer Bias chr7 (20 of 24)

chr7
Coverage File chr7 (1 of 12)

Coverage File chr7 (2 of 12)

Coverage File chr7 (3 of 12)

Coverage File chr7 (4 of 12)

Coverage File chr7 (5 of 12)

Coverage File chr7 (6 of 12)

Coverage File chr7 (7 of 12)

Coverage File chr7 (8 of 12)

Coverage File chr7 (9 of 12)

Coverage File chr7 (10 of 12)

Coverage File chr7 (11 of 12)

Coverage File chr7 (12 of 12)

Kmer Bias chr8 (21 of 24)

chr8
Coverage File chr8 (1 of 12)

Coverage File chr8 (2 of 12)

Coverage File chr8 (3 of 12)

Coverage File chr8 (4 of 12)

Coverage File chr8 (5 of 12)

Coverage File chr8 (6 of 12)

Coverage File chr8 (7 of 12)

Coverage File chr8 (8 of 12)

Coverage File chr8 (9 of 12)

Coverage File chr8 (10 of 12)

Coverage File chr8 (11 of 12)

Coverage File chr8 (12 of 12)

Kmer Bias chr9 (22 of 24)

chr9
Coverage File chr9 (1 of 12)

Coverage File chr9 (2 of 12)

Coverage File chr9 (3 of 12)

Coverage File chr9 (4 of 12)

Coverage File chr9 (5 of 12)

Coverage File chr9 (6 of 12)

Coverage File chr9 (7 of 12)

Coverage File chr9 (8 of 12)

Coverage File chr9 (9 of 12)

Coverage File chr9 (10 of 12)

Coverage File chr9 (11 of 12)

Coverage File chr9 (12 of 12)

Kmer Bias chrX (23 of 24)

chrX
Coverage File chrX (1 of 12)

Coverage File chrX (2 of 12)

Coverage File chrX (3 of 12)

Coverage File chrX (4 of 12)

Coverage File chrX (5 of 12)

Coverage File chrX (6 of 12)

Coverage File chrX (7 of 12)

Coverage File chrX (8 of 12)

Coverage File chrX (9 of 12)

Coverage File chrX (10 of 12)

Coverage File chrX (11 of 12)

Coverage File chrX (12 of 12)

Kmer Bias chrY (24 of 24)

chrY
Coverage File chrY (1 of 12)

Coverage File chrY (2 of 12)

Coverage File chrY (3 of 12)

Coverage File chrY (4 of 12)

Coverage File chrY (5 of 12)

Coverage File chrY (6 of 12)

Coverage File chrY (7 of 12)

Coverage File chrY (8 of 12)

Coverage File chrY (9 of 12)

Coverage File chrY (10 of 12)

Coverage File chrY (11 of 12)

Coverage File chrY (12 of 12)

Completed Kmer Bias Calculation

Adding Kmer Bias (1 of 12)

Adding Kmer Bias (2 of 12)

Adding Kmer Bias (3 of 12)

Adding Kmer Bias (4 of 12)

Adding Kmer Bias (5 of 12)

Adding Kmer Bias (6 of 12)

Adding Kmer Bias (7 of 12)

Adding Kmer Bias (8 of 12)

Adding Kmer Bias (9 of 12)

Adding Kmer Bias (10 of 12)

Adding Kmer Bias (11 of 12)

Adding Kmer Bias (12 of 12)

2022-08-24 03:03:48 : Finished Creation of Coverage Files!, 9.001 mins elapsed.

ArchR logging successful to : ArchRLogs/ArchR-addGroupCoverages-45f1fd2182f-Date-2022-08-24_Time-02-54-48.log

1.2 peak set 确定

ArchR 选用501-bp的等宽peak,原因:

  1. 下游处理不再需要根据peak宽度进行标准化从而简化了计算
  2. ATAC-seq的大部分peak宽度都低于501-bp。如果使用不定宽peak,后续不同样本的peak合并就会变得特别复杂

不同peak合并方法:

  • 有1bp以上overlap就合并(bedtools merge)
  • 重叠区域选取显著性最高作为最终peak(bedtools cluster)
  • 首先按peak显著性排名。第一步,删除显著性最高的peak有overlap的peak;第二步,比较余下peak显著性排名第二的peak,删除与其有overlap的部分,以此类推,保留没有overlap的peak

三种策略的比较

In [32]:

pathToMacs2 <- findMacs2()
proj <- addReproduciblePeakSet(
    ArchRProj = proj,
    groupBy = "celltype",
    cutOff = 0.1,
    pathToMacs2 = pathToMacs2,
    #genomeAnnotation = genomeAnnotation,
    #geneAnnotation = geneAnnotation,
    #genomeSize = 2.5e9,
    maxPeaks = 150000,
    force = TRUE
)
# 如果是自己生成的genomeAnnotation与geneAnnotation,需要genomeAnnotation与geneAnnotation加入两个参数;
# maxPeaks = 1500000,设置之后如果peak数超过该阈值,将会保留top maxPeaks数目,默认为 150000

Out [32]:
Searching For MACS2..

Found with $path!

ArchR logging to : ArchRLogs/ArchR-addReproduciblePeakSet-45f11a94b1c-Date-2022-08-24_Time-03-03-48.log
If there is an issue, please report to github with logFile!

Calling Peaks with Macs2

2022-08-24 03:03:48 : Peak Calling Parameters!, 0.002 mins elapsed.

Out [32]:
Group nCells nCellsUsed nReplicates nMin nMax maxPeaks
B B 390 390 2 80 310 150000
CD8 T CD8 T 1313 580 2 80 500 150000
CD14+ Mono CD14+ Mono 334 334 2 80 254 150000
Memory CD4 T Memory CD4 T 617 580 2 80 500 150000
Naive CD4 T Naive CD4 T 1895 580 2 80 500 150000
NK NK 1554 580 2 80 500 150000

Out [32]:
2022-08-24 03:03:48 : Batching Peak Calls!, 0.002 mins elapsed.

2022-08-24 03:03:48 : Batch Execution w/ safelapply!, 0 mins elapsed.

2022-08-24 03:05:07 : Identifying Reproducible Peaks!, 1.307 mins elapsed.

2022-08-24 03:05:11 : Creating Union Peak Set!, 1.386 mins elapsed.

Converged after 4 iterations!

Plotting Ggplot!

2022-08-24 03:05:16 : Finished Creating Union Peak Set (95289)!, 1.465 mins elapsed.

1.3 添加PeakMatrix到ArrowFiles中

In [33]:

proj <- addPeakMatrix(proj)

Out [33]:
ArchR logging to : ArchRLogs/ArchR-addPeakMatrix-45f6e2ed86b-Date-2022-08-24_Time-03-05-16.log
If there is an issue, please report to github with logFile!

2022-08-24 03:05:16 : Batch Execution w/ safelapply!, 0 mins elapsed.

.createArrowGroup : Arrow Group already exists! Dropping Group from ArrowFile! This will take ~10-30 seconds!

.dropGroupsFromArrow : Initializing Temp ArrowFile

.dropGroupsFromArrow : Adding Metadata to Temp ArrowFile

.dropGroupsFromArrow : Adding SubGroups to Temp ArrowFile

.dropGroupsFromArrow : Move Temp ArrowFile to ArrowFile

2022-08-24 03:05:29 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (1 of 23)!, 0.005 mins elapsed.

2022-08-24 03:05:33 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (2 of 23)!, 0.074 mins elapsed.

2022-08-24 03:05:36 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (3 of 23)!, 0.13 mins elapsed.

2022-08-24 03:05:39 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (4 of 23)!, 0.18 mins elapsed.

2022-08-24 03:05:42 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (5 of 23)!, 0.223 mins elapsed.

2022-08-24 03:05:45 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (6 of 23)!, 0.269 mins elapsed.

2022-08-24 03:05:48 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (7 of 23)!, 0.32 mins elapsed.

2022-08-24 03:05:51 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (8 of 23)!, 0.366 mins elapsed.

2022-08-24 03:05:53 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (9 of 23)!, 0.408 mins elapsed.

2022-08-24 03:05:56 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (10 of 23)!, 0.452 mins elapsed.

2022-08-24 03:05:58 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (11 of 23)!, 0.497 mins elapsed.

2022-08-24 03:06:01 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (12 of 23)!, 0.546 mins elapsed.

2022-08-24 03:06:04 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (13 of 23)!, 0.595 mins elapsed.

2022-08-24 03:06:06 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (14 of 23)!, 0.632 mins elapsed.

2022-08-24 03:06:09 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (15 of 23)!, 0.675 mins elapsed.

2022-08-24 03:06:11 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (16 of 23)!, 0.714 mins elapsed.

2022-08-24 03:06:14 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (17 of 23)!, 0.759 mins elapsed.

2022-08-24 03:06:17 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (18 of 23)!, 0.811 mins elapsed.

2022-08-24 03:06:19 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (19 of 23)!, 0.846 mins elapsed.

2022-08-24 03:06:23 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (20 of 23)!, 0.906 mins elapsed.

2022-08-24 03:06:25 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (21 of 23)!, 0.943 mins elapsed.

2022-08-24 03:06:27 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (22 of 23)!, 0.976 mins elapsed.

2022-08-24 03:06:29 : Adding 0607-ATAC-PBMC-ZYH-0302-2 to PeakMatrix for Chr (23 of 23)!, 1.013 mins elapsed.

ArchR logging successful to : ArchRLogs/ArchR-addPeakMatrix-45f6e2ed86b-Date-2022-08-24_Time-03-05-16.log

1.4 自定义peak set

In []:

#Proj <- addPeakSet() 

1.5 获取每种cluster/celltype特异的marker peaks

获取每个cluster/celltype 特有的peak
In [22]:

library(parallel)

In [34]:

markersPeaks <- getMarkerFeatures(
   ArchRProj = proj,
   useMatrix = "PeakMatrix",
   groupBy = "celltype",
   bias = c("TSSEnrichment", "log10(nFrags)"),
   testMethod = "wilcoxon"
)

markerList <- getMarkers(markersPeaks, cutOff = "FDR <= 0.01 & Log2FC >= 1")
#saveRDS(markerList,"marker_peaks_list.rds")

Out [34]:
ArchR logging to : ArchRLogs/ArchR-getMarkerFeatures-45f89d940e-Date-2022-08-24_Time-03-10-40.log
If there is an issue, please report to github with logFile!

MatrixClass = Sparse.Integer.Matrix

2022-08-24 03:10:41 : Matching Known Biases, 0.002 mins elapsed.

###########
2022-08-24 03:11:07 : Completed Pairwise Tests, 0.439 mins elapsed.
###########

ArchR logging successful to : ArchRLogs/ArchR-getMarkerFeatures-45f89d940e-Date-2022-08-24_Time-03-10-40.log

1.6 获取两两cluster,或者两两celltype之间的差异peak

In [35]:

markerTest <- getMarkerFeatures(
  ArchRProj = proj, 
  useMatrix = "PeakMatrix",
  groupBy = "celltype",
  testMethod = "wilcoxon",
  bias = c("TSSEnrichment", "log10(nFrags)"),
  useGroups = "CD14+ Mono",
  bgdGroups = "B"
)

Out [35]:
ArchR logging to : ArchRLogs/ArchR-getMarkerFeatures-45f47869e0b-Date-2022-08-24_Time-03-11-56.log
If there is an issue, please report to github with logFile!

MatrixClass = Sparse.Integer.Matrix

2022-08-24 03:11:56 : Matching Known Biases, 0.002 mins elapsed.

2022-08-24 03:11:57 : Computing Pairwise Tests (1 of 1), 0.018 mins elapsed.

Pairwise Test CD14+ Mono : Seqnames chr1

Pairwise Test CD14+ Mono : Seqnames chr10

Pairwise Test CD14+ Mono : Seqnames chr11

Pairwise Test CD14+ Mono : Seqnames chr12

Pairwise Test CD14+ Mono : Seqnames chr13

Pairwise Test CD14+ Mono : Seqnames chr14

Pairwise Test CD14+ Mono : Seqnames chr15

Pairwise Test CD14+ Mono : Seqnames chr16

Pairwise Test CD14+ Mono : Seqnames chr17

Pairwise Test CD14+ Mono : Seqnames chr18

Pairwise Test CD14+ Mono : Seqnames chr19

Pairwise Test CD14+ Mono : Seqnames chr2

Pairwise Test CD14+ Mono : Seqnames chr20

Pairwise Test CD14+ Mono : Seqnames chr21

Pairwise Test CD14+ Mono : Seqnames chr22

Pairwise Test CD14+ Mono : Seqnames chr3

Pairwise Test CD14+ Mono : Seqnames chr4

Pairwise Test CD14+ Mono : Seqnames chr5

Pairwise Test CD14+ Mono : Seqnames chr6

Pairwise Test CD14+ Mono : Seqnames chr7

Pairwise Test CD14+ Mono : Seqnames chr8

Pairwise Test CD14+ Mono : Seqnames chr9

Pairwise Test CD14+ Mono : Seqnames chrX

###########
2022-08-24 03:12:18 : Completed Pairwise Tests, 0.374 mins elapsed.
###########

ArchR logging successful to : ArchRLogs/ArchR-getMarkerFeatures-45f47869e0b-Date-2022-08-24_Time-03-11-56.log

In [37]:

#绘制MA图
pma <- markerPlot(seMarker = markerTest, name = "CD14+ Mono", cutOff = "FDR <= 0.1 & abs(Log2FC) >= 1", plotAs = "MA")
pma
#绘制火山图
pv <- markerPlot(seMarker = markerTest, name = "CD14+ Mono", cutOff = "FDR <= 0.1 & abs(Log2FC) >= 1", plotAs = "Volcano")
pv

Out [37]:
Warning message:
“'markerPlot' is deprecated.
Use 'plotMarkers' instead.
See help("Deprecated")”
Warning message:
“Removed 30 rows containing missing values (geom_point_rast).”
Warning message:
“'markerPlot' is deprecated.
Use 'plotMarkers' instead.
See help("Deprecated")”
Warning message:
“Removed 30 rows containing missing values (geom_point_rast).”

img
img

2. 获取motif Annotation

在确定了一个PeakSet后,我们经常想要预测哪些转录因子可能会介导产生那些可访问染色质位点的结合事件。这有助于评估标记峰或差异峰,以了解这些峰组是否富含特定转录因子的结合位点。例如,我们经常发现在细胞类型特异性可接近的染色质区域中,关键谱系定义 TF 的富集。以类似的方式,我们可能想要测试各种峰组以丰富其他已知特征。例如,我们可能想知道细胞类型 A 的细胞类型特异性peak是否富含另一组基因组区域,例如 ChIP-seq 峰。本章详细介绍了如何在 ArchR 中执行这些扩充
In [39]:

proj <- addMotifAnnotations(ArchRProj = proj, motifSet = "cisbp", name = "Motif")

Out [39]:
ArchR logging to : ArchRLogs/ArchR-addMotifAnnotations-45f59daec28-Date-2022-08-24_Time-03-30-00.log
If there is an issue, please report to github with logFile!

2022-08-24 03:30:00 : Gettting Motif Set, Species : Homo sapiens, 0.001 mins elapsed.

Using version 2 motifs!

2022-08-24 03:30:02 : Finding Motif Positions with motifmatchr!, 0.03 mins elapsed.

2022-08-24 03:32:26 : Filtering Motif Annotations with 0 overlaps :

ENSG00000250542_156

, 2.433 mins elapsed.

2022-08-24 03:32:26 : Creating Motif Overlap Matrix, 2.434 mins elapsed.

2022-08-24 03:32:28 : Finished Getting Motif Info!, 2.474 mins elapsed.

ArchR logging successful to : ArchRLogs/ArchR-addMotifAnnotations-45f59daec28-Date-2022-08-24_Time-03-30-00.log

2.1 在差异peaks中motif的富集

In [40]:

motifsUp <- peakAnnoEnrichment(
    seMarker = markerTest,
    ArchRProj = proj,
    peakAnnotation = "Motif",
    cutOff = "FDR <= 0.1 & Log2FC >= 0.5"
  )

Out [40]:
ArchR logging to : ArchRLogs/ArchR-peakAnnoEnrichment-45f437a914b-Date-2022-08-24_Time-04-18-16.log
If there is an issue, please report to github with logFile!

2022-08-24 04:18:19 : Computing Enrichments 1 of 1, 0.054 mins elapsed.

ArchR logging successful to : ArchRLogs/ArchR-peakAnnoEnrichment-45f437a914b-Date-2022-08-24_Time-04-18-16.log

peakAnnoEnrichment() 的输出是一个 SummarizedExperiment 包含多个assays,这些assays中存储着超几何富集的结果
In [41]:

motifsUp

然后,我们创建一个data.frame对象用于ggplot作图,包括motif名,矫正的p值和显著性排序
In [42]:

df <- data.frame(TF = rownames(motifsUp), mlog10Padj = assay(motifsUp)[,1])
df <- df[order(df$mlog10Padj, decreasing = TRUE),]
df$rank <- seq_len(nrow(df))

head(df)
, , , , , , , , , , , , , , ,
A data.frame: 6 × 3
TFmlog10Padjrank
<chr><dbl><int>
335SPIB_336 630.84761
155CEBPA_155559.23232
321SPI1_322 538.58513
140CEBPB_140488.08414
152CEBPD_152434.60415
139JUNB_139 355.37156
,

使用ggplot展示结果,以ggrepel来标识每个TF motif名。
In [43]:

ggUp <- ggplot(df, aes(rank, mlog10Padj, color = mlog10Padj)) + 
  geom_point(size = 1) +
  ggrepel::geom_label_repel(
        data = df[rev(seq_len(30)), ], aes(x = rank, y = mlog10Padj, label = TF), 
        size = 1.5,
        nudge_x = 2,
        color = "black"
  ) + theme_ArchR() + 
  ylab("-log10(P-adj) Motif Enrichment") + 
  xlab("Rank Sorted TFs Enriched") +
  scale_color_gradientn(colors = paletteContinuous(set = "comet"))

ggUp

Out [43]:
Warning message:
“ggrepel: 8 unlabeled data points (too many overlaps). Consider increasing max.overlaps”

img
通过设置Log2FC <= -0.5我们可以挑选出在"Progenitor"里更加开放的peak,然后分析其中富集的motif。
In [45]:

motifsDo <- peakAnnoEnrichment(
    seMarker = markerTest,
    ArchRProj = proj,
    peakAnnotation = "Motif",
    cutOff = "FDR <= 0.1 & Log2FC <= -0.5"
  )
motifsDo

df <- data.frame(TF = rownames(motifsDo), mlog10Padj = assay(motifsDo)[,1])
df <- df[order(df$mlog10Padj, decreasing = TRUE),]
df$rank <- seq_len(nrow(df))



ggDo <- ggplot(df, aes(rank, mlog10Padj, color = mlog10Padj)) + 
  geom_point(size = 1) +
  ggrepel::geom_label_repel(
        data = df[rev(seq_len(30)), ], aes(x = rank, y = mlog10Padj, label = TF), 
        size = 1.5,
        nudge_x = 2,
        color = "black"
  ) + theme_ArchR() + 
  ylab("-log10(FDR) Motif Enrichment") +
  xlab("Rank Sorted TFs Enriched") +
  scale_color_gradientn(colors = paletteContinuous(set = "comet"))

ggDo

Out [45]:
ArchR logging to : ArchRLogs/ArchR-peakAnnoEnrichment-45f6ccbbbcb-Date-2022-08-24_Time-04-18-58.log
If there is an issue, please report to github with logFile!

2022-08-24 04:19:01 : Computing Enrichments 1 of 1, 0.052 mins elapsed.

ArchR logging successful to : ArchRLogs/ArchR-peakAnnoEnrichment-45f6ccbbbcb-Date-2022-08-24_Time-04-18-58.log

Out [45]:
Warning message:
“ggrepel: 20 unlabeled data points (too many overlaps). Consider increasing max.overlaps”

img

2.2 在细胞类型特异的peak中motif的富集

In [48]:

enrichMotifs <- peakAnnoEnrichment(
    seMarker = markersPeaks,
    ArchRProj = proj,
    peakAnnotation = "Motif",
    cutOff = "FDR <= 0.1 & Log2FC >= 0.5"
  )

enrichMotifs

heatmapEM <- plotEnrichHeatmap(enrichMotifs, n = 7, transpose = TRUE)
ComplexHeatmap::draw(heatmapEM, heatmap_legend_side = "bot", annotation_legend_side = "bot")

Out [48]:
ArchR logging to : ArchRLogs/ArchR-peakAnnoEnrichment-45fb8de5b4-Date-2022-08-24_Time-04-25-53.log
If there is an issue, please report to github with logFile!

2022-08-24 04:25:56 : Computing Enrichments 1 of 6, 0.056 mins elapsed.

2022-08-24 04:25:56 : Computing Enrichments 2 of 6, 0.06 mins elapsed.

2022-08-24 04:25:57 : Computing Enrichments 3 of 6, 0.063 mins elapsed.

2022-08-24 04:25:57 : Computing Enrichments 4 of 6, 0.068 mins elapsed.

2022-08-24 04:25:57 : Computing Enrichments 5 of 6, 0.071 mins elapsed.

2022-08-24 04:25:57 : Computing Enrichments 6 of 6, 0.075 mins elapsed.

ArchR logging successful to : ArchRLogs/ArchR-peakAnnoEnrichment-45fb8de5b4-Date-2022-08-24_Time-04-25-53.log

Out [48]:
ArchR logging to : ArchRLogs/ArchR-plotEnrichHeatmap-45f5f98c668-Date-2022-08-24_Time-04-25-58.log
If there is an issue, please report to github with logFile!

Adding Annotations..

Preparing Main Heatmap..

'magick' package is suggested to install to give better rasterization.

Set ht_opt$message = FALSE to turn off this message.

img

3. 用ChromVAR进行motif偏离富集分析

TF 基序富集可以帮助我们预测哪些调节因子在我们感兴趣的细胞类型中最活跃。然而,这些富集不是基于每个细胞计算的,也没有考虑到 Tn5 转座酶的插入序列偏差。为解决此问题,Greenlead Lab开发R包chromVAR根据稀疏的染色质可及性数据预测每个细胞的 TF 活性富集。chromVAR 的两个主要输出是:

"deciations" - 偏差是对给定特征(即motif)的每个细胞开放性性与基于细胞或者样本的平均值的预期开放性偏离的偏差校正值。
"Z-score" - 也称为“偏差分数”,是所有单元格中每个偏差校正偏差的 z 分数。偏差分数的绝对值与每个细胞的reads的深度相关。这是因为,reads数越多,就越有信心相信motif的每个单元格可访问性与预期的差异大于偶然发生的差异。
chromVAR 的主要限制之一是它是在 scATAC-seq 数据生成的早期设计的,当时实验由几百个细胞组成。在这个实验规模下,chromVAR 可以轻松地将整个cell-peak矩阵读取到内存中,以快速计算 TF 偏差。然而,当细胞数目达到数万数百万级别时,很难将peak-cell矩阵读入。对于 50,000 个细胞大小的数据集都会导致运行时间和内存使用量的显著增加。

chromVAR流程:

为了规避这些限制,ArchR 通过独立分析样本子矩阵来实现相同的 chromVAR 分析工作流程。

首先,确定已经在proj中添加motif annotation
In [49]:

if("Motif" %ni% names(proj@peakAnnotation)){
    proj <- addMotifAnnotations(ArchRProj = proj, motifSet = "cisbp", name = "Motif")
}
# motifset

此外,我们还需要添加一组background peak用于计算偏离。background peak通过chromVAR::getBackgroundPeaks()函数进行选择,该函数根据根据GC含量相似性和样本中的fragment数计算马氏距离然后对peak进行抽样。

In [50]:

proj <- addBgdPeaks(proj)

Out [50]:
Identifying Background Peaks!

接下来,就可以使用addDeviatonMatrix()函数根据所有的motif注释计算每个细胞的偏离值。该函数有一个可选参数matrixName,用于定义该偏离值矩阵在Arrow文件里的名字。在下面的例子,函数会以"peakAnnotation"里设置的参数为基础,额外在后面添加字符串"Matrix",因此下面函数运行结束后会为每个Arrow文件都创建了一个"MotifMatrix"的偏离值矩阵

In [51]:

proj <- addDeviationsMatrix(
  ArchRProj = proj, 
  peakAnnotation = "Motif",
  force = TRUE
)

Out [51]:
Using Previous Background Peaks!

ArchR logging to : ArchRLogs/ArchR-addDeviationsMatrix-45f4d75f47f-Date-2022-08-24_Time-04-26-57.log
If there is an issue, please report to github with logFile!

Out [51]:
NULL

Out [51]:
2022-08-24 04:26:59 : Batch Execution w/ safelapply!, 0 mins elapsed.

2022-08-24 04:27:00 : chromVAR deviations 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) Schep (2017), 0.019 mins elapsed.

2022-08-24 04:27:58 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 43 of 869, 0.801 mins elapsed.

2022-08-24 04:28:41 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 86 of 869, 1.519 mins elapsed.

2022-08-24 04:29:22 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 129 of 869, 2.209 mins elapsed.

2022-08-24 04:30:17 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 172 of 869, 3.121 mins elapsed.

2022-08-24 04:31:21 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 215 of 869, 4.181 mins elapsed.

2022-08-24 04:32:20 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 258 of 869, 5.167 mins elapsed.

2022-08-24 04:33:10 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 301 of 869, 6.007 mins elapsed.

2022-08-24 04:34:01 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 344 of 869, 6.855 mins elapsed.

2022-08-24 04:34:48 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 387 of 869, 7.642 mins elapsed.

2022-08-24 04:35:23 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 430 of 869, 8.223 mins elapsed.

2022-08-24 04:35:59 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 473 of 869, 8.823 mins elapsed.

2022-08-24 04:36:35 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 516 of 869, 9.418 mins elapsed.

2022-08-24 04:37:12 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 559 of 869, 10.04 mins elapsed.

2022-08-24 04:37:51 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 602 of 869, 10.687 mins elapsed.

2022-08-24 04:38:35 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 645 of 869, 11.426 mins elapsed.

2022-08-24 04:39:21 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 688 of 869, 12.18 mins elapsed.

2022-08-24 04:40:07 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 731 of 869, 12.951 mins elapsed.

2022-08-24 04:40:53 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 774 of 869, 13.725 mins elapsed.

2022-08-24 04:41:44 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 817 of 869, 14.577 mins elapsed.

2022-08-24 04:42:33 : 0607-ATAC-PBMC-ZYH-0302-2 (1 of 1) : Deviations for Annotation 860 of 869, 15.385 mins elapsed.

2022-08-24 04:42:49 : Finished Computing Deviations!, 15.837 mins elapsed.

###########
2022-08-24 04:42:49 : Completed Computing Deviations!, 15.876 mins elapsed.
###########

ArchR logging successful to : ArchRLogs/ArchR-addDeviationsMatrix-45f4d75f47f-Date-2022-08-24_Time-04-26-57.log

使用getVarDeviations()函数提取这些偏离值矩阵。如果需要它返回一个ggplot对象,那么只需要设置plot=TRUE即可,函数会返回一个DataFrame对象。函数运行后,会默认展示该DataFrame对象的前几行。
In [52]:

plotVarDev <- getVarDeviations(proj, name = "MotifMatrix", plot = TRUE)

Out [52]:
DataFrame with 6 rows and 6 columns
seqnames idx name combinedVars combinedMeans rank

f335 z 335 SPIB_336 27.0488 -0.324648 1
f321 z 321 SPI1_322 16.8260 -0.241240 2
f155 z 155 CEBPA_155 16.0397 -0.271544 3
f124 z 124 JUND_124 15.9445 -0.134996 4
f139 z 139 JUNB_139 15.4794 -0.130890 5
f142 z 142 FOSL1_142 15.3420 -0.113450 6

从上面DataFrame的输出信息中,可以发现MotifMatrix的seqnames并不是chromosome(染色体名)。通常而言,TileMatrix, PeakMatrix, GeneScoreMatrix,seqnames中记录均是染色体信息。MotifMatrix并没有任何对应的位置信息,而是会在相同的矩阵里记录chromVAR输出的"devations"和"z-scores"信息,即deviations和z。如果后续想在getMarkerFeatures()这类函数中使用MotifMatrix(Sparse.Assays.Matrix类)的话,那么这些信息就是非常重要的。在这类操作中,ArchR会希望你从MotifMatrix提取其中一个seqnames(例如,选择z-scores或deviations执行计算)
In [53]:

plotVarDev

Out [53]:
Warning message:
“ggrepel: 16 unlabeled data points (too many overlaps). Consider increasing max.overlaps”

img
假如想提取部分motif用于下游分析,就需要用到getFeatures()函数。并使用paste(motifs, collapse="|")语句会以"逻辑或"连接所有motifs里的值,用于选择给定的motif。
In [55]:

motifs <- c("GATA1", "CEBPA")
markerMotifs <- getFeatures(proj, select = paste(motifs, collapse="|"), useMatrix = "MotifMatrix")
markerMotifs
, ,
  1. 'z:GATA1_383'
  2. 'z:CEBPA_155'
  3. 'deviations:GATA1_383'
  4. 'deviations:CEBPA_155'
,

正如之前所提到的,MotifMatrix的seqnames包含z-scores(z:)和deviations(deviations:)。为了只提取对应特征的z-scores, 我们需要用到grep。此外,在之前的选择中由于"EBF1"会匹配到"SREBF1",而后者并不是我们所需要的,因此我们还需要一步过滤。ArchR提供了%ni表达式,它是R提供的%ni%的反义词,表示反向选择。
In [56]:

markerMotifs <- grep("z:", markerMotifs, value = TRUE)

markerMotifs
, ,
  1. 'z:GATA1_383'
  2. 'z:CEBPA_155'
,

既然,我们已经有了我们感兴趣的特征,我们可以为每个cluster绘制chromVAR偏离得分。注,我们提供的是之前基因得分分析里计算的推断权重。考虑到scATAC-seq数据的稀疏性,推断权重利用邻近细胞对信号进行平滑处理。
In [59]:

p <- plotGroups(ArchRProj = proj, 
  groupBy = "celltype", 
  colorBy = "MotifMatrix", 
  name = markerMotifs,
  imputeWeights = getImputeWeights(proj)
)
#使用cowplot将不同的moitfs的分布组合在一张图中
p2 <- lapply(seq_along(p), function(x){
  if(x != 1){
    p[[x]] + guides(color = FALSE, fill = FALSE) + 
    theme_ArchR(baseSize = 6) +
    theme(plot.margin = unit(c(0.1, 0.1, 0.1, 0.1), "cm")) +
    theme(
        axis.text.y=element_blank(), 
        axis.ticks.y=element_blank(),
        axis.title.y=element_blank()
    ) + ylab("")
  }else{
    p[[x]] + guides(color = FALSE, fill = FALSE) + 
    theme_ArchR(baseSize = 6) +
    theme(plot.margin = unit(c(0.1, 0.1, 0.1, 0.1), "cm")) +
    theme(
        axis.ticks.y=element_blank(),
        axis.title.y=element_blank()
    ) + ylab("")
  }
})
do.call(cowplot::plot_grid, c(list(nrow = 1, rel_widths = c(2, rep(1, length(p2) - 1))),p2))

Out [59]:
Getting ImputeWeights

No imputeWeights found, returning NULL

Getting Matrix Values...

2022-08-24 04:52:40 :

1

1
2

Warning message:
guides(<scale> = FALSE) is deprecated. Please use guides(<scale> = "none") instead.”
Warning message:
guides(<scale> = FALSE) is deprecated. Please use guides(<scale> = "none") instead.”
Picking joint bandwidth of 0.248

Picking joint bandwidth of 0.492

img
除了检查z-scores的分布,我们也可以和之前展示基因得分一样将z-scores在UMAP嵌入图中进行展示
In [61]:

p <- plotEmbedding(
    ArchRProj = proj, 
    colorBy = "MotifMatrix", 
    name = sort(markerMotifs), 
    embedding = "UMAP",
    imputeWeights = getImputeWeights(proj)
)
#可以使用cowplot将motif UMAP放在一张图上展示
p2 <- lapply(p, function(x){
    x + guides(color = FALSE, fill = FALSE) + 
    theme_ArchR(baseSize = 6.5) +
    theme(plot.margin = unit(c(0, 0, 0, 0), "cm")) +
    theme(
        axis.text.x=element_blank(), 
        axis.ticks.x=element_blank(), 
        axis.text.y=element_blank(), 
        axis.ticks.y=element_blank()
    )
})
do.call(cowplot::plot_grid, c(list(ncol = 3),p2))

Out [61]:
Getting ImputeWeights

No imputeWeights found, returning NULL

ArchR logging to : ArchRLogs/ArchR-plotEmbedding-45f6f613b7b-Date-2022-08-24_Time-04-53-10.log
If there is an issue, please report to github with logFile!

Getting UMAP Embedding

ColorBy = MotifMatrix

Getting Matrix Values...

2022-08-24 04:53:11 :

1

Plotting Embedding

1
2

ArchR logging successful to : ArchRLogs/ArchR-plotEmbedding-45f6f613b7b-Date-2022-08-24_Time-04-53-10.log

Warning message:
guides(<scale> = FALSE) is deprecated. Please use guides(<scale> = "none") instead.”
Warning message:
guides(<scale> = FALSE) is deprecated. Please use guides(<scale> = "none") instead.”

img
为了比较TF deviation z-scores和根据对应TF基因的基因得分推断的基因表达量,我们可以把这两者画在同一个UMAP中。
In [63]:

markerRNA <- getFeatures(proj, select = paste(motifs, collapse="|"), useMatrix = "GeneScoreMatrix")

# 获取GeneScoreMatrix
p <- plotEmbedding(
    ArchRProj = proj, 
    colorBy = "GeneScoreMatrix", 
    name = sort(markerRNA), 
    embedding = "UMAP",
    imputeWeights = getImputeWeights(proj)
)
# 同时处理的多个图
p2 <- lapply(p, function(x){
    x + guides(color = FALSE, fill = FALSE) + 
    theme_ArchR(baseSize = 6.5) +
    theme(plot.margin = unit(c(0, 0, 0, 0), "cm")) +
    theme(
        axis.text.x=element_blank(), 
        axis.ticks.x=element_blank(), 
        axis.text.y=element_blank(), 
        axis.ticks.y=element_blank()
    )
})
#可以使用cowplot将gene activity score的图放在一张图上展示拼图
do.call(cowplot::plot_grid, c(list(ncol = 3),p2))

Out [63]:
Getting ImputeWeights

No imputeWeights found, returning NULL

ArchR logging to : ArchRLogs/ArchR-plotEmbedding-45f68166c0f-Date-2022-08-24_Time-04-54-00.log
If there is an issue, please report to github with logFile!

Getting UMAP Embedding

ColorBy = GeneScoreMatrix

Getting Matrix Values...

2022-08-24 04:54:00 :

1

Plotting Embedding

1
2
3

ArchR logging successful to : ArchRLogs/ArchR-plotEmbedding-45f68166c0f-Date-2022-08-24_Time-04-54-00.log

Warning message:
guides(<scale> = FALSE) is deprecated. Please use guides(<scale> = "none") instead.”
Warning message:
guides(<scale> = FALSE) is deprecated. Please use guides(<scale> = "none") instead.”
Warning message:
guides(<scale> = FALSE) is deprecated. Please use guides(<scale> = "none") instead.”

img
之前将对应的scRNA-seq数据和scATAC-seq数据进行了关联,我们也可以在UMAP图上绘制每个TF对应的基因表达量。
In [64]:

markerRNA <- getFeatures(proj, select = paste(motifs, collapse="|"), useMatrix = "GeneIntegrationMatrix")
markerRNA <- markerRNA[markerRNA %ni% c("SREBF1","CEBPA-DT")]
p <- plotEmbedding(
    ArchRProj = proj, 
    colorBy = "GeneIntegrationMatrix", 
    name = sort(markerRNA), 
    embedding = "UMAP",
    continuousSet = "blueYellow",
    imputeWeights = getImputeWeights(proj)
)
p2 <- lapply(p, function(x){
    x + guides(color = FALSE, fill = FALSE) + 
    theme_ArchR(baseSize = 6.5) +
    theme(plot.margin = unit(c(0, 0, 0, 0), "cm")) +
    theme(
        axis.text.x=element_blank(), 
        axis.ticks.x=element_blank(), 
        axis.text.y=element_blank(), 
        axis.ticks.y=element_blank()
    )
})
do.call(cowplot::plot_grid, c(list(ncol = 3),p2))

4. ArchR的足迹分析

转录因子(Transcripts factor, TF)足迹分析使得我们能够预测特定位点中TF的精确结合位置。这是因为该位置被TF结合避免了转座酶的切割,而TF结合位点的邻近位置处于开放状态。

理想情况下,TF足迹分析需要在单个位置上分析从而确定TF的准确结合位置。但实际上,这需要非常高的测序深度,甚至超过bulk ATAC-seq或者scATAC-seq的所有数据。为了解决这个问题,我们可以把和待预测的TF结合相关的Tn5插入位置进行合并。例如,我们可以提取所有包含CTCF motif的peak,制作一个全基因组的CTCF的聚合TF足迹。

为了保证足迹的可靠性,我们需要确保能够可靠的预测出目标TF所对应的结合位点。ArchR使用自带的addMotifAnnotations()函数对peak区域进行搜索,寻找能够匹配的DNA序列。考虑到motif的简并性,无法保证每个motif都有足够的peak。添加到ArchRProject的motif注释以二值矩阵表示(0=无motif, 1=有motif)。一旦你有了这些motif注释,ArchR使用getFootPrints()函数分析足迹,它以一个ArchRProject对象和一个GenomicRanges对象(记录motif的位置)作为输入。可以使用getPositions()函数从ArchRProject中提取这些位置。之后足迹可以使用plotFootprints()函数可视化。

更重要的是,ArchR的足迹分析能够抵消已知的Tn5插入序列偏好性。ArchR使用一个hexmer位置频率矩阵和一个目标Tn5插入位置上的k-mer频率矩阵来实现该功能。

### 4.1 motif 足迹分析

在足迹分析过程中,我们需要做的第一件事是获取相关motif的位置,可使用getPositions()函数进行调取。
In [65]:

motifPositions <- getPositions(proj)
motifPositions
# motifPositions 为一个GRangesList对象,每个TF motif以不同的GRanges对象进行区分

In [66]:

# 提取部分感兴趣的TF motifs用于展示
motifs <- c("GATA1", "CEBPA")
markerMotifs <- unlist(lapply(motifs, function(x) grep(x, names(motifPositions), value = TRUE)))

markerMotifs
, ,
  1. 'GATA1_383'
  2. 'CEBPA_155'
,

为了准确找到TF足迹,我们需要大量的reads。因此,细胞需要进行分组生成拟bulk ATAC-seq才能用于TF足迹分析。这些拟bulk ATAC-s之前在peak calling时已经使用addGroupCoverages()函数生成。 如果没有在ArchRProject添加,则运行如下命令
In [68]:

#proj <- addGroupCoverages(ArchRProj = proj, groupBy = "celltype")

在计算分组覆盖度后,我们可以为之前getFootprints()挑选的一组标记motif计算足迹。即便ArchR已经优化了足迹分析流程,我们也建议先对一部分motif分析足迹,而不是直接分析所有motif。 我们通过positions参数来选择motif。
In [69]:

seFoot <- getFootprints(
  ArchRProj = proj, 
  positions = motifPositions[markerMotifs], 
  groupBy = "celltype"
)

Out [69]:
ArchR logging to : ArchRLogs/ArchR-getFootprints-45fdeb88ee-Date-2022-08-24_Time-05-01-35.log
If there is an issue, please report to github with logFile!

2022-08-24 05:01:35 : Computing Kmer Bias Table, 0.002 mins elapsed.

2022-08-24 05:01:38 : Finished Computing Kmer Tables, 0.041 mins elapsed.

2022-08-24 05:01:38 : Computing Footprints, 0.043 mins elapsed.

2022-08-24 05:01:43 : Computing Footprints Bias, 0.122 mins elapsed.

2022-08-24 05:01:47 : Summarizing Footprints, 0.188 mins elapsed.

4.2 Tn5插入偏好性标准化

使用ATAC-seq数据分析TF足迹的一大挑战就是Tn5转座酶的插入序列偏好性,这会导致TF足迹的错误分类。为了降低Tn5插入偏好性的影响,ArchR识别每个Tn5插入位置附近的k-mer序列(k由用户提供,默认是6).

对于该项分析,ArchR为每个拟bulk-seq识别单碱基分辨率的Tn5插入位点,将这些1-bp位点调整为k-bp窗口(-k/2和+(k/2-1)bp),然后使用Biostrings包中的oligonucleotidefrequency(w=k, simplify.as="collapse")函数创建k-mer频率表。然后,ArchR使用与BSgenome相关的基因组文件,以相同的函数计算出全基因组范围预期的k-mers。

为了计算拟混池足迹的插入偏差,ArchR创建了一个k-mer频率矩阵,该矩阵表示为从motif中心到窗口+/-N bp(用户定义,默认为250 bp)的所有可能k-mer。然后,遍历每个motif位点,ArchR将定位的k-mer填充到k-mer频率矩阵中。然后在全基因组范围内计算每个motif位置。利用样本的k-mer频率表,ArchR可以通过将k-mer位置频率表乘以观察/期望 Tn5 k-mer频率来计算预期的Tn5插入。

4.2.1 去除Tn5偏好性

一个标准化方式就是从足迹信号中减去Tn5偏好。该标准化方法通过设置plotFootprints()的normMethod = "Subtract"实现
In [70]:

plotFootprints(
  seFoot = seFoot,
  ArchRProj = proj, 
  normMethod = "Subtract",
  plotName = "Footprints-Subtract-Bias",
  addDOC = FALSE,
  smoothWindow = 5
)

Out [70]:
ArchR logging to : ArchRLogs/ArchR-plotFootprints-45f59d47958-Date-2022-08-24_Time-05-02-32.log
If there is an issue, please report to github with logFile!

2022-08-24 05:02:32 : Plotting Footprint : GATA1_383 (1 of 2), 0.001 mins elapsed.

Applying smoothing window to footprint

Normalizing by flanking regions

NormMethod = Subtract

2022-08-24 05:02:33 : Plotting Footprint : CEBPA_155 (2 of 2), 0.02 mins elapsed.

Applying smoothing window to footprint

Normalizing by flanking regions

NormMethod = Subtract

ArchR logging successful to : ArchRLogs/ArchR-plotFootprints-45f59d47958-Date-2022-08-24_Time-05-02-32.log

这些图会默认保存在ArchRProjectoutputDirectory。如果你需要绘制所有motif, 可以将其返回为ggplot2对象,需要注意这个ggplot对象会非常大。下面是一个从motif足迹中减去Tn5偏好信号的结果

4.2.2 除以Tn5偏好

第二种标准化方法就是将足迹除以Tn5偏好信号。该标准化方法通过设置plotFootprints()的normMethod = "Divide"实现
In [72]:

plotFootprints(
  seFoot = seFoot,
  ArchRProj = proj, 
  normMethod = "Divide",
  plotName = "Footprints-Divide-Bias",
  addDOC = FALSE,
  smoothWindow = 5
)

Out [72]:
ArchR logging to : ArchRLogs/ArchR-plotFootprints-45f3d3f4065-Date-2022-08-24_Time-05-05-47.log
If there is an issue, please report to github with logFile!

2022-08-24 05:05:47 : Plotting Footprint : GATA1_383 (1 of 2), 0.001 mins elapsed.

Applying smoothing window to footprint

Normalizing by flanking regions

NormMethod = Divide

2022-08-24 05:05:48 : Plotting Footprint : CEBPA_155 (2 of 2), 0.019 mins elapsed.

Applying smoothing window to footprint

Normalizing by flanking regions

NormMethod = Divide

ArchR logging successful to : ArchRLogs/ArchR-plotFootprints-45f3d3f4065-Date-2022-08-24_Time-05-05-47.log

4.2.3 无Tn5偏好标准化的足迹

可以通过设置plotFootprints()的normMethod = "None"来省去标准化。

4.3 特征足迹

除了motif足迹分析,ArchR还允许用户分析任意定义的特征数据集。为了对功能进行说明,我们将会使用plotFootprints()函数创建TSS插入谱(在之前数据质量控制一节中引入)。一个TSS插入谱本质上就是特殊的足迹。

我们在之前小节讨论过,足迹会用到来源于拟混池重复的分组覆盖文件。我们在之前peak鉴定时创建过这些文件。如果你没有在ArchRProject加入分组覆盖信息, 那么需要运行如下代码
In []:

#proj <- addGroupCoverages(ArchRProj = proj, groupBy = "celltype")

我们接着创建一个没有经过Tn5偏好性校正的TSS插入谱。和之前分析一个主要不同是,我们设置了flank=2000, 将TSS向前向后分别延伸2000 bp.
In [73]:

seTSS <- getFootprints(
  ArchRProj = proj, 
  positions = GRangesList(TSS = getTSS(proj)), 
  groupBy = "celltype",
  flank = 2000
)
#plotFootprints()对每个细胞分组绘制TSS插入谱。
plotFootprints(
  seFoot = seTSS,
  ArchRProj = proj, 
  normMethod = "None",
  plotName = "TSS-No-Normalization",
  addDOC = FALSE,
  flank = 2000,
  flankNorm = 100
)   

Out [73]:
ArchR logging to : ArchRLogs/ArchR-getFootprints-45f79ff1524-Date-2022-08-24_Time-05-06-43.log
If there is an issue, please report to github with logFile!

2022-08-24 05:06:43 : Computing Kmer Bias Table, 0.002 mins elapsed.

2022-08-24 05:07:50 : Finished Computing Kmer Tables, 1.117 mins elapsed.

2022-08-24 05:07:50 : Computing Footprints, 1.118 mins elapsed.

2022-08-24 05:07:59 : Computing Footprints Bias, 1.273 mins elapsed.

2022-08-24 05:08:04 : Summarizing Footprints, 1.355 mins elapsed.

ArchR logging to : ArchRLogs/ArchR-plotFootprints-45f5b5c4a00-Date-2022-08-24_Time-05-08-06.log
If there is an issue, please report to github with logFile!

2022-08-24 05:08:06 : Plotting Footprint : TSS (1 of 1), 0.001 mins elapsed.

Normalizing by flanking regions

NormMethod = None

ArchR logging successful to : ArchRLogs/ArchR-plotFootprints-45f5b5c4a00-Date-2022-08-24_Time-05-08-06.log

5. 使用ArchR进行整合分析

ArchR的一个优势能够整合多种水平的信息从而提供新的洞见。我们可以只用ATAC-seq数据进行分析,如识别peak之间的共开放性来预测调控相互作用,或整合scRNA-seq数据,如通过peak-基因的连锁分析预测增性子活性。无论是哪种情况,ArchR都可以很容易地从scATAC seq数据中获得更深入的见解。

5.1 创建细胞低重叠聚集

ArchR能方便许多特征间相关性的分析。在这些相关分析中,使用稀疏的单细胞数据进行这些计算会导致大量的噪声。为规避这一挑战,我们采用了一种由Cicero引入的方法,在这些分析之前创建单细胞的低重叠聚集。我们过滤与任何其他聚集重叠超过80%的聚集以减少偏差。同时为提升该方法的速度,我们开发了一个优化的迭代重叠检查流程,借由"Rcpp"包通过C++实现了快速特征相关性运算。在ArchR中,这些优化方法被用于计算peak共开放性、peak-基因连锁和其他连锁分析。这些低重叠聚集运算都是在内部完成,为清楚起见,我们先在这里对其介绍。

5.2 ArchR共开放分析

共开放是多个单细胞中两个peak之间开放性的相关性。换句话说,当peak A在某个单细胞中是开放状态时,peak B通常也是开放状态。下图可以直观地说明了这一概念,说明增强子E3通常与启动子P是共开放的。

有一点需要注意,共开放分析找到的peak通常都是细胞类型特异的peak。这是因为这些peak在一种细胞类型中都是开放的,而另一种细胞类型通常都是关闭的。虽然这些peak之间有很强的相关性,但是不意味着这些peak之间存在调控关系。

在ArchR中,我们使用addCoAccessibility()函数计算共开放性,计算结束得到的开放性信息保存在ArchRProject中

In [74]:

proj <- addCoAccessibility(
    ArchRProj = proj,
    reducedDims = "IterativeLSI"
)   

Out [74]:
ArchR logging to : ArchRLogs/ArchR-addCoAccessibility-45f54d1d6c8-Date-2022-08-24_Time-05-08-29.log
If there is an issue, please report to github with logFile!

2022-08-24 05:08:29 : Computing KNN, 0.001 mins elapsed.

2022-08-24 05:08:29 : Identifying Non-Overlapping KNN pairs, 0.003 mins elapsed.

2022-08-24 05:08:31 : Identified 484 Groupings!, 0.041 mins elapsed.

2022-08-24 05:08:32 : Computing Co-Accessibility chr1 (1 of 23), 0.052 mins elapsed.

2022-08-24 05:08:37 : Computing Co-Accessibility chr2 (2 of 23), 0.128 mins elapsed.

2022-08-24 05:08:40 : Computing Co-Accessibility chr3 (3 of 23), 0.194 mins elapsed.

2022-08-24 05:08:44 : Computing Co-Accessibility chr4 (4 of 23), 0.252 mins elapsed.

2022-08-24 05:08:47 : Computing Co-Accessibility chr5 (5 of 23), 0.303 mins elapsed.

2022-08-24 05:08:50 : Computing Co-Accessibility chr6 (6 of 23), 0.358 mins elapsed.

2022-08-24 05:08:54 : Computing Co-Accessibility chr7 (7 of 23), 0.418 mins elapsed.

2022-08-24 05:08:57 : Computing Co-Accessibility chr8 (8 of 23), 0.474 mins elapsed.

2022-08-24 05:09:00 : Computing Co-Accessibility chr9 (9 of 23), 0.526 mins elapsed.

2022-08-24 05:09:04 : Computing Co-Accessibility chr10 (10 of 23), 0.579 mins elapsed.

2022-08-24 05:09:07 : Computing Co-Accessibility chr11 (11 of 23), 0.633 mins elapsed.

2022-08-24 05:09:10 : Computing Co-Accessibility chr12 (12 of 23), 0.691 mins elapsed.

2022-08-24 05:09:14 : Computing Co-Accessibility chr13 (13 of 23), 0.748 mins elapsed.

2022-08-24 05:09:17 : Computing Co-Accessibility chr14 (14 of 23), 0.795 mins elapsed.

2022-08-24 05:09:20 : Computing Co-Accessibility chr15 (15 of 23), 0.848 mins elapsed.

2022-08-24 05:09:23 : Computing Co-Accessibility chr16 (16 of 23), 0.898 mins elapsed.

2022-08-24 05:09:26 : Computing Co-Accessibility chr17 (17 of 23), 0.952 mins elapsed.

2022-08-24 05:09:30 : Computing Co-Accessibility chr18 (18 of 23), 1.013 mins elapsed.

2022-08-24 05:09:32 : Computing Co-Accessibility chr19 (19 of 23), 1.058 mins elapsed.

2022-08-24 05:09:36 : Computing Co-Accessibility chr20 (20 of 23), 1.124 mins elapsed.

2022-08-24 05:09:39 : Computing Co-Accessibility chr21 (21 of 23), 1.172 mins elapsed.

2022-08-24 05:09:42 : Computing Co-Accessibility chr22 (22 of 23), 1.216 mins elapsed.

2022-08-24 05:09:45 : Computing Co-Accessibility chrX (23 of 23), 1.265 mins elapsed.

ArchR logging successful to : ArchRLogs/ArchR-addCoAccessibility-45f54d1d6c8-Date-2022-08-24_Time-05-08-29.log

使用getCoAccessibility()函数可以从ArchRProject对象中提取共开放性信息,当设置returnLoops=FALSE时会返回一个DataFrame对象。
In [76]:

cA <- getCoAccessibility(
    ArchRProj = proj,
    corCutOff = 0.5,
    resolution = 1,
    returnLoops = FALSE
)
cA
# 该DataFrame对象包含几个重要的信息。queryHits和subjectHits列记录的是两个存在相关性的peak的索引。correlation则是两个peak之间的开放状态相关的数值。

这个共开放DataFrame还有一个元数据成员,包含一个相关peak的GRanges对象。上面提到的queryHits和subjectHits的索引适用于这个GRanges对象。
In [77]:

metadata(cA)[[1]]

如果我们设置retureLoops=TRUE, 那么getCoAccessibility()会以loop track的形式返回共开放数据。在这个GRanges对象中,IRanges的起始和结束对应的是每个存在互作关系中两个共开放peak的位置。resolution参数设置这些loop的碱基对分辨率。当resolution=1时,它会输出连接每个peak中心的loop.

In []:

cA <- getCoAccessibility(
    ArchRProj = projHeme5,
    corCutOff = 0.5,
    resolution = 1,
    returnLoops = TRUE
)
cA[[1]]

如果我们将loops的分辨率调整为resolution = 1000, 这能够避免绘制过多的共开放互作事件。从下面的输出的GRanges对象中,我们可以看到条目少了很多
In [79]:

cA <- getCoAccessibility(
    ArchRProj = proj,
    corCutOff = 0.5,
    resolution = 1000,
    returnLoops = TRUE
)

cA[[1]]

5.2.1 在browser track中绘制共开放

当我们在ArchRProject里增加共开放信息后,我们可以用这个信息在基因组浏览器里绘制loop track。具体做法就是在plotBrowserTrack()函数中设置loops参数。我们这里以getCoAccessibility()默认参数输出结果为例,即corCutOff = 0.5, resolution = 1000, returnLoops = TRUE
In [82]:

markerGenes  <- c(
    "CD34", #Early Progenitor
    "GATA1", #Erythroid
    "PAX5", "MS4A1", #B-Cell Trajectory
    "CD14", #Monocytes
    "CD3D", "CD8A", "TBX21", "IL7R" #TCells
  )

p <- plotBrowserTrack(
    ArchRProj = proj, 
    groupBy = "celltype", 
    geneSymbol = markerGenes, 
    upstream = 50000,
    downstream = 50000,
    loops = getCoAccessibility(proj)
)

#最后用grid.draw函数绘制结果,通过$选择特定的标记基因。
grid::grid.newpage()
grid::grid.draw(p$CD14)

Out [82]:
ArchR logging to : ArchRLogs/ArchR-plotBrowserTrack-45f378513ba-Date-2022-08-24_Time-05-12-43.log
If there is an issue, please report to github with logFile!

2022-08-24 05:12:43 : Validating Region, 0.001 mins elapsed.

Out [82]:
GRanges object with 9 ranges and 2 metadata columns:
seqnames ranges strand | gene_id symbol
|
[1] chr1 207880972-207911402 - | 947 CD34
[2] chrX 48786554-48794311 + | 2623 GATA1
[3] chr9 36833275-37034185 - | 5079 PAX5
[4] chr11 60455752-60470760 + | 931 MS4A1
[5] chr5 140631728-140633701 - | 929 CD14
[6] chr11 118338954-118342744 - | 915 CD3D
[7] chr2 86784610-86808396 - | 925 CD8A
[8] chr17 47733244-47746119 + | 30009 TBX21
[9] chr5 35852695-35879603 + | 3575 IL7R

seqinfo: 24 sequences from hg38 genome

Out [82]:
2022-08-24 05:12:43 : Adding Bulk Tracks (1 of 9), 0.002 mins elapsed.

Getting Region From Arrow Files 1 of 1

2022-08-24 05:12:44 : Adding Feature Tracks (1 of 9), 0.011 mins elapsed.

2022-08-24 05:12:44 : Adding Loop Tracks (1 of 9), 0.012 mins elapsed.

2022-08-24 05:12:44 : Adding Gene Tracks (1 of 9), 0.013 mins elapsed.

2022-08-24 05:12:44 : Plotting, 0.016 mins elapsed.

2022-08-24 05:12:45 : Adding Bulk Tracks (2 of 9), 0.028 mins elapsed.

Getting Region From Arrow Files 1 of 1

2022-08-24 05:12:45 : Adding Feature Tracks (2 of 9), 0.034 mins elapsed.

2022-08-24 05:12:45 : Adding Loop Tracks (2 of 9), 0.035 mins elapsed.

2022-08-24 05:12:45 : Adding Gene Tracks (2 of 9), 0.036 mins elapsed.

2022-08-24 05:12:45 : Plotting, 0.039 mins elapsed.

2022-08-24 05:12:46 : Adding Bulk Tracks (3 of 9), 0.049 mins elapsed.

Getting Region From Arrow Files 1 of 1

2022-08-24 05:12:46 : Adding Feature Tracks (3 of 9), 0.055 mins elapsed.

2022-08-24 05:12:47 : Adding Loop Tracks (3 of 9), 0.056 mins elapsed.

2022-08-24 05:12:47 : Adding Gene Tracks (3 of 9), 0.069 mins elapsed.

2022-08-24 05:12:47 : Plotting, 0.072 mins elapsed.

2022-08-24 05:12:48 : Adding Bulk Tracks (4 of 9), 0.086 mins elapsed.

Getting Region From Arrow Files 1 of 1

2022-08-24 05:12:49 : Adding Feature Tracks (4 of 9), 0.093 mins elapsed.

2022-08-24 05:12:49 : Adding Loop Tracks (4 of 9), 0.094 mins elapsed.

2022-08-24 05:12:49 : Adding Gene Tracks (4 of 9), 0.096 mins elapsed.

2022-08-24 05:12:49 : Plotting, 0.1 mins elapsed.

2022-08-24 05:12:50 : Adding Bulk Tracks (5 of 9), 0.111 mins elapsed.

Getting Region From Arrow Files 1 of 1

2022-08-24 05:12:50 : Adding Feature Tracks (5 of 9), 0.117 mins elapsed.

2022-08-24 05:12:50 : Adding Loop Tracks (5 of 9), 0.118 mins elapsed.

2022-08-24 05:12:51 : Adding Gene Tracks (5 of 9), 0.136 mins elapsed.

2022-08-24 05:12:51 : Plotting, 0.139 mins elapsed.

2022-08-24 05:12:52 : Adding Bulk Tracks (6 of 9), 0.155 mins elapsed.

Getting Region From Arrow Files 1 of 1

2022-08-24 05:12:53 : Adding Feature Tracks (6 of 9), 0.162 mins elapsed.

2022-08-24 05:12:53 : Adding Loop Tracks (6 of 9), 0.162 mins elapsed.

2022-08-24 05:12:53 : Adding Gene Tracks (6 of 9), 0.165 mins elapsed.

2022-08-24 05:12:53 : Plotting, 0.168 mins elapsed.

2022-08-24 05:12:54 : Adding Bulk Tracks (7 of 9), 0.183 mins elapsed.

Getting Region From Arrow Files 1 of 1

2022-08-24 05:12:55 : Adding Feature Tracks (7 of 9), 0.19 mins elapsed.

2022-08-24 05:12:55 : Adding Loop Tracks (7 of 9), 0.191 mins elapsed.

2022-08-24 05:12:56 : Adding Gene Tracks (7 of 9), 0.206 mins elapsed.

2022-08-24 05:12:56 : Plotting, 0.21 mins elapsed.

2022-08-24 05:12:57 : Adding Bulk Tracks (8 of 9), 0.226 mins elapsed.

Getting Region From Arrow Files 1 of 1

2022-08-24 05:12:57 : Adding Feature Tracks (8 of 9), 0.233 mins elapsed.

2022-08-24 05:12:57 : Adding Loop Tracks (8 of 9), 0.233 mins elapsed.

2022-08-24 05:12:58 : Adding Gene Tracks (8 of 9), 0.251 mins elapsed.

2022-08-24 05:12:58 : Plotting, 0.254 mins elapsed.

2022-08-24 05:12:59 : Adding Bulk Tracks (9 of 9), 0.271 mins elapsed.

Getting Region From Arrow Files 1 of 1

2022-08-24 05:13:00 : Adding Feature Tracks (9 of 9), 0.279 mins elapsed.

2022-08-24 05:13:00 : Adding Loop Tracks (9 of 9), 0.279 mins elapsed.

2022-08-24 05:13:00 : Adding Gene Tracks (9 of 9), 0.289 mins elapsed.

2022-08-24 05:13:01 : Plotting, 0.292 mins elapsed.

ArchR logging successful to : ArchRLogs/ArchR-plotBrowserTrack-45f378513ba-Date-2022-08-24_Time-05-12-43.log