Assembly and gene annotation of the 1000 plant transcriptomes

The 1000 Plant transcriptomes initiative (1KP) sequenced and analysed transcribed RNA from 1,342 samples representing 1,173 green plant and chloroplast-bearing species, including examples of all major taxa within the Viridiplantae: streptophyte and chlorophyte green algae, bryophytes, ferns, angiosperms, and gymnosperms.

Data size: 1,448
Last modified: 2020-10-21

1. Background

The 1000 Plant transcriptomes initiative (1KP) sequenced and analysed transcribed RNA from 1,342 samples representing 1,173 green plant and chloroplast-bearing species. The analysis comprised transcript assembly and prediction of the encoded protein products. The samples include examples of all major taxa within the Viridiplantae: streptophyte and chlorophyte green algae, bryophytes, ferns, angiosperms, and gymnosperms.

2. Data description

2.1 Data processing

Raw sequencing data from every sample were filtered and quality-controlled, and the resulting clean reads were assembled into transcripts with the SOAPdenovo-Trans transcript assembler using a k-mer length of 25. Gaps in the resulting contigs were then filled with the internal FillGap module and the external GapCloser post-processor (supplied with SOAPdenovo-Trans) to produce scaffolds.

Open reading frames and protein translations were predicted with TransPipe by comparing the assembled transcripts with protein sequences from 22 sequenced and well-annotated plant genomes in Phytozome (RRID:SCR_006507).
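To make the k-mer parameter concrete, the sketch below splits a read into overlapping substrings of length 25, which is the kind of decomposition a de Bruijn graph assembler such as SOAPdenovo-Trans performs before building its graph. This is an illustration only, not the assembler's own code; the function names and the example read are hypothetical.

```python
from collections import Counter


def kmers(read, k=25):
    """Yield every overlapping substring of length k from a read."""
    # A read of length L contains L - k + 1 k-mers (none if L < k).
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers(reads, k=25):
    """Count how often each k-mer occurs across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read.upper(), k))
    return counts


# Hypothetical 40-base read, used only to show the arithmetic.
example_read = "ACGT" * 10
print(len(list(kmers(example_read))))  # 40 - 25 + 1 = 16 k-mers
```

A shorter k gives more k-mers per read and more overlaps between reads, while a longer k is more specific but needs longer error-free stretches; the value of 25 used here is the one stated for the 1KP assemblies.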

References

Carpenter et al. (2019). Access to RNA-sequencing data from 1,173 plant species: The 1000 Plant transcriptomes initiative (1KP). GigaScience. DOI: 10.1093/gigascience/giz126

Leebens-Mack et al. (2019). One thousand plant transcriptomes and the phylogenomics of green plants. Nature. DOI: 10.1038/s41586-019-1693-2

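2.2 Metadata

Field        Description
Sample_Code  Sample code
Clade        Clade name
Family       Family name
Species      Species name
Protein      Predicted protein sequence file
Assembly     Transcript sequence file
nucleotides  Coding-region nucleotide sequence file
Prefix       Unique sample identifier code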
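The fields above can be read programmatically once the metadata table is exported. The following is a minimal sketch that assumes the table is a tab-delimited file whose header row uses exactly these field names; the delimiter, the helper names, and the two example rows are assumptions made for illustration and are not taken from the dataset.

```python
import csv
import io

# Column names as listed in the metadata table.
FIELDS = ["Sample_Code", "Clade", "Family", "Species",
          "Protein", "Assembly", "nucleotides", "Prefix"]


def load_metadata(handle):
    """Read a tab-delimited metadata table into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [f for f in FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"metadata table lacks columns: {missing}")
    return list(reader)


def protein_files_by_clade(records):
    """Group the predicted protein file names by clade."""
    by_clade = {}
    for rec in records:
        by_clade.setdefault(rec["Clade"], []).append(rec["Protein"])
    return by_clade


# Hypothetical placeholder rows; real sample codes and file names will differ.
EXAMPLE = "\n".join([
    "\t".join(FIELDS),
    "\t".join(["AAAA", "Clade_1", "Family_1", "Species_1",
               "AAAA.faa", "AAAA.fa", "AAAA.fna", "AAAA"]),
    "\t".join(["BBBB", "Clade_1", "Family_2", "Species_2",
               "BBBB.faa", "BBBB.fa", "BBBB.fna", "BBBB"]),
]) + "\n"

records = load_metadata(io.StringIO(EXAMPLE))
print(protein_files_by_clade(records))
```

For a real file, pass an open file handle (for example `open(path, newline="")`) in place of the `io.StringIO` object.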

3. Workflows

Gene homology discovery using Hidden Markov Models

HMMER is widely used to search sequence databases for homologous protein or nucleotide sequences using profile hidden Markov models (profile HMMs) built from multiple sequence alignments. Its main uses include searching a target sequence database with a single protein sequence, a multiple protein sequence alignment, or a profile HMM as the query.

Here, HMMER was used to identify all members of a given gene family in the predicted protein datasets generated by the 1000 Plant transcriptomes initiative. We plan to provide more comprehensive datasets later for characterizing the diversity of all functional gene families.
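As an illustration of the search described above, the sketch below builds an `hmmsearch` command that writes a per-target table (`--tblout`) and parses that table in Python. It assumes HMMER 3 is installed and on the PATH, that a profile HMM for the gene family already exists, and that the protein file is one of the predicted protein FASTA files; the file names, the E-value cutoff, and the helper names are hypothetical and do not describe the platform's actual workflow.

```python
import subprocess


def hmmsearch_command(hmm_file, protein_fasta, tblout, evalue=1e-5):
    """Build the argument list: hmmsearch [options] <hmmfile> <seqdb>."""
    return ["hmmsearch", "--tblout", tblout, "-E", str(evalue),
            hmm_file, protein_fasta]


def parse_tblout(lines):
    """Parse hmmsearch --tblout lines into a list of hit dicts."""
    hits = []
    for line in lines:
        # Comment and blank lines carry no hit information.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        hits.append({
            "target": fields[0],         # matched protein sequence
            "query": fields[2],          # name of the profile HMM
            "evalue": float(fields[4]),  # full-sequence E-value
            "score": float(fields[5]),   # full-sequence bit score
        })
    return hits


def run_search(hmm_file, protein_fasta, tblout="hits.tbl", evalue=1e-5):
    """Run hmmsearch (requires HMMER on PATH) and return the parsed hits."""
    cmd = hmmsearch_command(hmm_file, protein_fasta, tblout, evalue)
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    with open(tblout) as handle:
        return parse_tblout(handle)
```

A call such as `run_search("family.hmm", "AAAA.faa")` would then return one entry per protein that matches the family profile; both file names here are placeholders.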

Reference

Zhang, Z., Wood, W.I. (2003). A profile hidden Markov model for signal peptides generated by HMMER. Bioinformatics. DOI: 10.1093/bioinformatics/19.2.307