科普

RNA-seq数据不仅仅是表达量

2019-12-02 48动手实验室

RNA-seq数据毫无疑问是目前NGS领域被使用最频繁的了,但是大部分科研人员对它的理解,还停留在表达量层面,尤其是基于基因的表达量,无非就是分组,然后走差异分析这样的统计学检验,绘制火山图和差异基因热图,上下调的通路。先不说大家对RNA-seq数据的标准分析是否一定是对的,这样的简陋的分析其实是对数据的暴殄天物!

首先可以分析差异转录本,可变剪切

看到一篇2019年5月发表在Molecular Neurodegeneration杂志的文章:TREM2 brain transcript-specific studies in AD and TREM2 mutation carriers 把普通的RNA-seq数据根据自己的生物学背景挖掘了一下。背景知识需要去搜索了解Triggering Receptor Expressed in Myeloid cells 2 (TREM2)这个基因,以及它的3个转录本。

都是European-Americans,测序数据是:
AD cases with TREM2 variants (n = 33)
AD cases (n = 195)
healthy controls (n = 118)

来源于3个不同的机构:
Washington University in St. Louis Knight-ADRC Brain Bank (51 participants)
MSBB-BM36, (132 participants)
MCBB (162 participants)
每个样本平均测序数据量是  134.9 million ,是2 × 101bp的测序策略。

其中2个机构的数据是已有的,数据下载方式:
Mayo Clinic Brain Bank RNA-seq data was downloaded from the AMP-AD portal (synapse ID = 5,550,404; accessed January 2017).
Mount Sinai Brain Bank RNA-seq data was downloaded from the AMP-AD portal (synapse ID = https://www.synapse.org/#!Synapse:syn3157743; accessed January 2017).

转录组数据分析流程,主要是软件选择,参考基因组版本:

FastQC [49] was used to assess sequencing quality.
The RNA-seq was aligned to the human GRCh37 primary assembly using STAR (ver 2.5.2b). Read alignments were further evaluated by using PICARD CollectRnaSeqMetrics (ver 2.8.2) to examine read distribution across the genome.
We employed Kallisto (v0.42.5) and tximport to determine the read count for each transcript and quantified transcript abundance as transcripts per kilobase per million reads mapped (TPM), using gene annotation of Homo sapiens reference genome (GENCODE GRCh37) for each participant from Knight-ADRC, MCB and MSBB-BM36 independently, with the following parameters: -t 10 -b 100.
Then we summed the read counts and TPM of all alternative splicing transcripts of a gene to obtain gene expression levels. Due to the positive skewness of TPM values, we calculate their logarithm10 (log10TPM) for further analysis.

关于转录本的差异分析,我们分享过salmon+DRIMseq流程,在前些天的推文里面,见:每月一生信流程之rnaseqDTU(差异转录本)

在文章导论大量介绍了TREM2)这个基因,以及它的3个转录本。同时看了3个队列的这个基因的3个转录本的表达量情况。

We were able to detect and quantify the levels of three TREM2 transcripts ENST00000373113, ENST00000373122 and ENST00000338469 using RNA-seq data from AD and control brains from three different, independent studies.

RNA-seq_1

不过这样的分析仍然是片面的,因为作者仅仅是关心自己生物学背景的基因,下面的全局比较的总结表格其实是不可或缺的。

RNA-seq_2

然后可以分析融合基因

看到 Transcriptome analysis offers a comprehensive illustration of the genetic background of pediatric acute myeloid leukemia. Blood Adv 文章就是日本研究团队的 [RNA-seq]  in 139 of the 369 patients with de novo pediatric AML ,这样文章落脚点就是基因融合事件,54 in-frame gene fusions and 1 RUNX1 out-of-frame fusion in 53 of 139 patients.

在大的病人队列里面,提供实验验证了  258 gene fusions in 369 patients (70%) 。

RNA-seq_3

因为有RNA-seq数据的只有139个病人,所以 突变全景图如下:

RNA-seq_4

甚至找到的基因融合事件,可以当做是病人的一种表型信息进行分析:

RNA-seq_5

RNA-seq_6

▍本文版权属于“生信技能树”(微信公众号:biotrainee),禁止二次转载

上一篇下一篇