The One Index For All: Kraken2 index for metagenomes

The One Index For All (toifa) is a comprehensive k-mer hash index compiled with Kraken2. Toifa contains all genomes curated in GTDB (bacteria and archaea), most fungal genomes in NCBI (some were removed by manual curation), and all protistan genomes in NCBI. The human genome GRCh38 is included for host signal removal. Community profiles of multiple domains (prokaryotes, fungi, protists) can be inferred from unassembled reads in a single pass through the index.

Data volume: 9
Updated: 2020-09-21

1. Background

The One Index For All (aka toifa) is a Kraken2 index for resolving the community profile of multiple kingdoms in one shot. Toifa includes the genomes of all prokaryotes (bacteria and archaea), fungi, and protists that are publicly available so far. The human genome (GRCh38) is also included for removing the host signal if necessary.

Kraken2 is a taxonomic classification system that uses exact k-mer matches to achieve high accuracy and fast classification speeds. The standard Kraken2 index contains only genomes from NCBI RefSeq, which covers bacteria, archaea, and viruses, but no fungi or other organisms that could appear in environmental samples.
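
To make the idea of exact k-mer matching concrete, here is a minimal toy sketch in Python. It is not Kraken2's implementation (Kraken2 works on minimizers stored in a compact hash table and resolves shared k-mers through the taxonomy tree); this sketch simply ignores k-mers shared between taxa. The names `build_index`, `classify`, `taxon_A`, and `taxon_B` are hypothetical and used for illustration only.

```python
from collections import Counter


def build_index(genomes, k):
    """Map each k-mer to the set of taxa whose genome contains it."""
    index = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(taxon)
    return index


def classify(read, index, k):
    """Return the taxon with the most unambiguous exact k-mer hits, or None."""
    votes = Counter()
    for i in range(len(read) - k + 1):
        taxa = index.get(read[i:i + k], set())
        if len(taxa) == 1:  # toy simplification: skip k-mers shared between taxa
            votes[next(iter(taxa))] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy genomes; real genomes are millions of bases long.
genomes = {"taxon_A": "ACGTACGTGGA", "taxon_B": "TTGACCATGCA"}
index = build_index(genomes, k=5)
```

A read whose k-mers all occur in a single genome is assigned to that taxon, and a read with no matching k-mer stays unclassified. A read from a fungus that is absent from the index ends up unclassified in the same way, which is the gap a broader index is meant to close.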

2. Data description

Genomes of bacteria and archaea are from GTDB r95, an expert-curated database of prokaryotic genomes.

Fungal genomes were manually curated from all available genomes at NCBI. Pairwise MinHash distances were computed between RefSeq genomes and all other genomes. Genomes with similar taxonomy but a MinHash distance > 0.8 were manually checked.
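
The distance check described above can be illustrated with a small Python sketch. The source does not say which tool or distance definition was used, so this is an assumption: it estimates Jaccard similarity from bottom-`size` MinHash sketches and reports `1 - Jaccard` as the distance. The function names, the default `k` and `size`, and the use of `blake2b` as the hash are all hypothetical choices made for illustration.

```python
import hashlib


def sketch(seq, k=21, size=1000):
    """Return the `size` smallest 64-bit hashes of the k-mers in `seq`."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    hashes = {
        int.from_bytes(hashlib.blake2b(km.encode(), digest_size=8).digest(), "big")
        for km in kmers
    }
    return set(sorted(hashes)[:size])


def minhash_distance(sketch_a, sketch_b, size=1000):
    """Estimate 1 - Jaccard from the smallest hashes of the merged sketches."""
    merged = sorted(sketch_a | sketch_b)[:size]
    if not merged:
        raise ValueError("both sketches are empty")
    shared = sum(1 for h in merged if h in sketch_a and h in sketch_b)
    return 1.0 - shared / len(merged)


def needs_review(distance, threshold=0.8):
    """Flag a genome pair with similar taxonomy whose distance exceeds the threshold."""
    return distance > threshold
```

In this sketch, two genomes assigned to the same taxon that share almost no k-mers would get a distance near 1 and be flagged for manual inspection, while near-identical genomes would get a distance near 0.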

Protistan genomes were included unchanged.

The human genome is GRCh38.

Below is a summary of the number of taxa at each taxonomic level included in toifa:

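| Domain     | Kingdom  | Phylum | Class | Order | Family | Genus | Species |
|------------|----------|--------|-------|-------|--------|-------|---------|
| Prokaryote | Bacteria | 111    | 327   | 917   | 2,282  | 8,778 | 30,238  |
| Prokaryote | Archaea  | 18     | 42    | 103   | 276    | 650   | 1,672   |
| Eukaryote  | Fungi    | 9      | 48    | 117   | 314    | 710   | 2,114   |
| Eukaryote  | Protozoa | 18     | 45    | 81    | 115    | 174   | 377     |
| Eukaryote  | Human    | 1      | 1     | 1     | 1      | 1     | 1       |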

3. Workflows

After downloading the toifa index, use kraken2 to assign taxonomy to all reads. More than 500 GB of memory is recommended:

kraken2 --paired --threads 4 --output kk2.output --db path/to/toifa/theOneIndexForAll r1.fq r2.fq
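
As a possible follow-up step, the per-read output file can be tallied into read counts per taxon. The sketch below assumes the default Kraken2 per-read output layout (tab-separated: classification status `C` or `U`, read ID, taxonomy ID, read length, k-mer mapping) and that `--use-names` was not set. The function name `tally_reads` is hypothetical, and the taxonomy IDs in toifa depend on how the index was built, so no specific ID is assumed here.

```python
from collections import Counter


def tally_reads(lines):
    """Count reads per taxonomy ID from Kraken2 per-read output lines.

    Unclassified reads (status 'U') are pooled under the key 'unclassified'.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:  # skip blank or malformed lines
            continue
        status, taxid = fields[0], fields[2]
        counts["unclassified" if status == "U" else taxid] += 1
    return counts
```

Passing an open file handle for `kk2.output` to `tally_reads` returns a `Counter` keyed by taxonomy ID. For abundance tables at a given rank, Kraken2's own `--report` option may be more convenient, since it aggregates counts along the taxonomy.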