Taxonomic Profiling and Visualization of Metagenomic Data

Which species (or genera, families, …) are present in my sample? What are the different approaches and tools to get the community profile of my sample? How can we visualize and compare community profiles?

The investigation of microorganisms present at a specific site and their relative abundance is also called microbial community profiling”. The main objective is to identify the microorganisms that are present within the given sample. This can be achieved for all known microbes, where the DNA sequence specific for a certain species is known.

For that we try to identify the taxon to which each individual read belongs.

The taxonomic hierarchy includes eight levels: Domain, Kingdom, Phylum, Class, Order, Family, Genus and Species.

Let’s explore taxonomy in the Tree of Life, using Lifemap

For metagenomic data analysis we start with sequences derived from DNA fragments that are isolated from the sample of interest. Ideally, the sequences from all microbes in the sample are present.

The underlying idea of taxonomic assignment is to compare the DNA sequences found in the sample (reads) to DNA sequences of a database. When a read matches a database DNA sequence of a known microbe, we can derive a list with microbes present in the sample.

When talking about taxonomic assignment or taxonomic classification, most of the time we actually talk about two methods, that in practice are often used interchangeably:

taxonomic binning: the clustering of individual sequence reads based on similarities criteria and assignation of clusters to reference taxa
taxonomic profiling: classification of individual reads to reference taxa to extract the relative abundances of the different taxa

The reads can be obtained from amplicon sequencing (e.g. 16S, 18S, ITS), where only specific gene or gene fragments are targeted (using specific primers) or shotgun metagenomic sequencing, where all the accessible DNA of a mixed community is amplified (using random primers).

This tutorial focuses on reads obtained from shotgun metagenomic sequencing.

Taxonomic profiling

Tools for taxonomic profiling can be divided into three groups

DNA-to-DNA: comparison of sequencing reads with genomic databases of DNA sequences with tools like Kraken
DNA-to-Protein: comparison of sequencing reads with protein databases (more computationally intensive because all six frames of potential DNA-to amino acid translations need to be analyzed) with tools like DIAMOND
Marker based: searching for marker genes (e.g. 16S rRNA sequence) in reads, which is quick, but introduces bias, with tools like MetaPhlAn

The comparison of reads to database sequences can be done in different ways, leading to three different types of taxonomic assignment:

Genome based approach

Reads are aligned to reference genomes. Considering the coverage and breadth, genomes are used to measure genome abundance. Furthermore, genes can be analyzed in genomic context. Advantages of this method are the high detection accuracy, that the unclassified percentage is known, that all SNVs can be detected and that high-resolution genomic comparisons are possible. This method takes medium compute cost.
Gene based approach

Reads are aligned to reference genes. Next, marker genes are used to estimate species abundance. Furthermore, genes can be analyzed in isolation for presence or absence in a specific condition. The major advantage is the detection of the pangenome (entire set of genes within a species). Major disadvantages are the high compute cost, low detection accuracy and that the unclassified percentage is unknown. At least intragenic SNVs can be detected and low-resolution genomic comparison is possible.
k-mer based approach

Databases as well as the samples DNA are broken into strings of length k for comparison. From all the genomes in the database, where a specific k-mer is found, a lowest common ancestor (LCA) tree is derived and the abundance of k-mers within the tree is counted. This is the basis for a root-to-leaf path calculation, where the path with the highest score is used for classification of the sample. By counting the abundance of k-mers, also an estimation of relative abundance of taxa is possible. The major advantage of k-mer based analysis is the low compute cost. Major disadvantages are the low detection accuracy, that the unclassified percentage is unknown and that there is no gene detection, no SNVs detection and no genomic comparison possible. An example for a k-mer based analysis tool is Kraken, which will be used in this tutorial

A typical workflow consists of:

Processing with quality control/trimming (FastQC and Trim Galore! or Cutadapt) and (optionally) dereplication (VSearch)
Taxonomic analyses with assignation (Kraken2 or MetaPhlAn2) and visualization (KRONA, GraPhlAn)

Prepare Galaxy and data

The dataset we will use for this tutorial comes from an oasis in the Mexican desert called Cuatro Ciénegas.

The researchers were interested in genomic traits that affect the rates and costs of biochemical information processing within cells. They performed a whole-ecosystem experiment, thus fertilizing the pond to achieve nutrient enriched conditions.

Here we will use 2 datasets:

JP4D: a microbiome sample collected from the Lagunita Fertilized Pond
JC1A: a control samples from a control mesocosm.

The reads have been trimmed using cutadapt as explained in the Quality control tutorial.

NOTE! Please, use:

Create and rename new history

Upload datasets

https://zenodo.org/record/7871630/files/JC1A_R1.fastqsanger.gz
https://zenodo.org/record/7871630/files/JC1A_R2.fastqsanger.gz
https://zenodo.org/record/7871630/files/JP4D_R1.fastqsanger.gz
https://zenodo.org/record/7871630/files/JP4D_R2.fastqsanger.gz

Create a paired collection

Kraken2: k-mer based taxonomic assignment

To find out which microorganisms are present, we will compare the reads of the sample to a reference database, i.e. sequences of known microorganisms stored in a database, using Kraken2

In the k-mer approach for taxonomy classification, we use a database containing DNA sequences of genomes whose taxonomy we already know. On a computer, the genome sequences are broken into short pieces of length k (called k-mers), usually 30bp.

Kraken2:

examines the k-mers within the query sequence,
searches for them in the database,
looks for where these are placed within the taxonomy tree inside the database,
makes the classification with the most probable position,
then maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain the given k-mer.

For this tutorial, we will use the PlusPF database which contains the Standard (archaea, bacteria, viral, plasmid, human, UniVec_Core), protozoa and fungi data.

Assign taxonomic labels with Kraken2

Kraken2 Tool with the following parameters:
- Single or paired reads: Paired Collection
  - Collection of paired reads: Input paired collection
- Confidence: 0.1
- In Create Report:
  - Print a report with aggregrate counts/clade to file: Yes
- Select a Kraken2 database: Prebuilt Refseq indexes: PlusPF

A confidence score of 0.1 means that at least 10% of the k-mers should match entries in the database. This value can be reduced if a less restrictive taxonomic assignation is desired.

Kraken2 will create two outputs for each dataset: Classification and Report

Classification

Classification: tabular files with one line for each sequence classified by Kraken and 5 columns:
1. C/U: a one letter indicating if the sequence classified or unclassified
2. Sequence ID as in the input file
3. NCBI taxonomy ID assigned to the sequence, or 0 if unclassified
4. Length of sequence in bp (read1|read2 for paired reads)
5. A space-delimited list indicating the lowest common ancestor (LCA) mapping of each k-mer in the sequence
For example, 562:13 561:4 A:31 0:1 562:3 would indicate that:
1. The first 13 k-mers mapped to taxonomy ID #562
2. The next 4 k-mers mapped to taxonomy ID #561
3. The next 31 k-mers contained an ambiguous nucleotide
4. The next k-mer was not in the database
5. The last 3 k-mers mapped to taxonomy ID #562

|:| indicates end of first read, start of second read for paired reads

For JC1A:

C1  C2                                                      C3      C4      C5
U   MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:13417:1998    0       151|190 A:18 0:14 2055:5 0:1 2220095:5 0:74 |:| 0:3 A:54 2:1 0:32 204455:1 2823043:5 0:60
U   MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:15782:2187    0       169|173 0:101 37329:1 0:33 |:| 0:10 2751189:5 0:30 1883:2 0:39 2609255:5 0:48
U   MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:11745:2196    0       235|214 0:173 2282523:5 2746321:2 0:21 |:| 0:65 2746321:2 2282523:5 0:108
U   MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:18358:2213    0       251|251 0:35 281093:5 0:3 651822:5 0:145 106591:3 0:21 |:| 0:64 106591:3 0:145 651822:5
U   MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:14892:2226    0       68|59   0:34 |:| 0:25
U   MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:18764:2247    0       146|146 0:112 |:| 0:112
C   MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:12147:2252    9606    220|220 9606:148 0:19 9606:19 |:| 9606:19 0:19 9606:148
U   MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:14052:2291    0       118|85  0:84 |:| 0:7 5679:5 0:39
C   MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:18387:2294    9606    241|241 9606:207 |:| 9606:207
U   MISEQ-LAB244-W7:91:000000000-A5C7L:1:1101:13378:2301    0       164|98  0:130 |:| 0:64 

Question For JC1A sample:

Is the first sequence in the file classified or unclassified?
What is the taxonomy ID assigned to the first classified sequence?
What is the corresponding taxon

Report

Report: tabular files with one line per taxon and 6 columns or fields:
1. Percentage of fragments covered by the clade rooted at this taxon
2. Number of fragments covered by the clade rooted at this taxon
3. Number of fragments assigned directly to this taxon
4. A rank code, indicating:
  - (U)nclassified
  - (R)oot
  - (D)omain
  - (K)ingdom
  - (P)hylum
  - (C)lass
  - (O)rder
  - (F)amily
  - (G)enus
  - (S)pecies
5. NCBI taxonomic ID number
6. Indented scientific name

Column 1    Column 2    Column 3    Column 4    Column 5    Column 6
86 	    105399      105399      U 	        0           unclassified
14 	    31740       1197 	    R 	        1           root
20 	    30448       312 	    R1 	        131567      cellular organisms
58 	    17254       3767 	    D 	        2           Bacteria
77 	    12027       2867 	    P 	        1224        Proteobacteria
94 	    6779        3494 	    C 	        28211       Alphaproteobacteria
30 	    1782        1085 	    O 	        204455      Rhodobacterales
43 	    593         461 	    F 	        31989       Rhodobacteraceae
05 	    74          53          G 	        265         Paracoccus
01 	    10          10          S 	        82367       Paracoccus pantotrophus
00 	    6           6           S 	        1545044     Paracoccus sanguinis
00 	    5           0           G1 	        2688777     unclassified Paracoccus
00 	    4           4           S 	        2589076     Paracoccus sp. AK26
00 	    1           1           S 	        2895796     Paracoccus sp. MA
01 	    17          16          G 	        1060        Rhodobacter
00 	    1           0           G1 	        196779      unclassified Rhodobacter
00 	    1           1           S 	        2033869     Rhodobacter sp. CZR27
01 	    17          15          G 	        34008       Rhodovulum
00 	    2           0           G1 	        2631432     unclassified Rhodovulum
00 	    2           2           S 	        1564506     Rhodovulum sp. P5 

Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank.

E.g., G2 is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.

Question

What are the percentage on unclassified for JC1A and JP4D?
What are the kindgoms found for JC1A and JP4D?
Where might the eukaryotic DNA come from?
How is the diversity of Proteobacteria in JC1A and JP4D?

Getting an overview of the assignation is not straightforward with the Kraken2 outputs directly. We can use visualisation tools for that.

Estimate species abundance with Bracken

A simple and worthwile addition to Kraken for better abundance estimates is called Bracken (Bayesian Reestimation of Abundance after Classification with Kraken).

Instead of only using proportions of classified reads, it takes a probabilistic approach to generate final abundance profiles. It works by re-distributing reads in the taxonomic tree: Reads assigned to nodes above the species level are distributed down to the species nodes, while reads assigned at the strain level are re-distributed upward to their parent species

Estimate Abundance at Taxonomic Level Tool with the following parameters:
- Kraken report file: Report output of Kraken
- Select a kmer distribution: PlusPF (same as for Kraken)
- Level: Species
- Produce Kraken-Style Bracken report: yes

NOTE! It is important to choose the same database that you also chose for Kraken2

Bracken output file format (tabular):

Taxon name
Taxonomy ID
Level ID (S=Species, G=Genus, O=Order, F=Family, P=Phylum, K=Kingdom)
Kraken assigned reads
Added reads with abundance re-estimation
Total reads after abundance re-estimation
Fraction of total reads

Kraken style report

Visualization of taxonomic assignment

Getting an overview of the assignation is not straightforward with the Kraken2 outputs directly. Once we have assigned the corresponding taxa to each sequence, the next step is to properly visualize the data. There are several tools for that:

Krona
Phinch
Pavian

Visualisation using Krona

Krona creates an interactive HTML file allowing hierarchical data to be explored with zooming, multi-layered pie charts.

With this tool, we can easily visualize the composition of the bacterial communities and compare how the populations of microorganisms are modified according to the conditions of the environment.

Convert Kraken report file

Kraken outputs can not be given directly to Krona, they first need to be converted.

Krakentools is a suite of tools to work on Kraken outputs. It include a tool designed to translate results of the Kraken metagenomic classifier to the full representation of NCBI taxonomy. The output of this tool can be directly visualized by the Krona tool.

Krakentools: Convert kraken report file Tool with the following parameters:
- Kraken report file: Report collection of Kraken
Inspect the generated output for JC1A

105399 Unclassified 						
k__Bacteria 						
k__Bacteria p__Proteobacteria 					
k__Bacteria p__Proteobacteria c__Alphaproteobacteria 				
k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rhodobacterales 			
k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rhodobacterales f__Rhodobacteraceae 		
k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rhodobacterales f__Rhodobacteraceae g__Paracoccus 	
k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rhodobacterales f__Rhodobacteraceae g__Paracoccus s__Paracoccus_pantotrophus
k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rhodobacterales f__Rhodobacteraceae g__Paracoccus s__Paracoccus_sanguinis
k__Bacteria p__Proteobacteria c__Alphaproteobacteria o__Rhodobacterales f__Rhodobacteraceae g__Paracoccus s__Paracoccus_sp._AK26 

Question

What are the different columns?
What are the lines?

Generate Krona visualisation

Let’s now run Krona

Krona pie chart Tool with the following parameters:
- Type of input data: Tabular
- Input file: output of Krakentools
Inspect the generated file

Question

What are the percentage on unclassified for JC1A and JP4D?
What are the kindgoms found for JC1A and JP4D?
Where might the eukaryotic DNA come from?
How is the diversity of Proteobacteria in JC1A and JP4D?

Visualisation using Phinch

Phinch is another tools to visualize large biological datasets like our taxonomic classification

As a first step, we need to convert the Kraken output file into a kraken-biom file to make it accessible for Phinch. For this, we need to add a metadata file. When generating a metadata file for your own data, you can take this as an example and apply the general guidelines:

The first column header must be #SampleID, and the data in this column must contain unique (short and meaningful) sample identifiers containing only alphanumeric and period (.) characters.
The second column header must be BarcodeSequence, where each value in that column corresponds to the barcode used for each sample
The third column header must be “LinkerPrimerSequence”, where each value in that column corresponds to the primer used to amplify that sample.
All subsequent column headers (except the last one) are metadata headers. For example, a Smoker column would include either Yes or No. Note that the data in each column is assumed to be categorical unless specified otherwise. Categorical data columns must include at least 2 unique values per column. All metadata must be composed of:
- only alphanumeric,
- underscore (_),
- period (.),
- minus sign (-),
- plus sign (+),
- percentage (%),
- space ( ),
- semicolon (;),
- colon (:),
- comma (,),
- and/or forward slash (/) characters.
- For missing data, write NA; do not leave blanks.
The last column of the mapping file must be named Description. Information in this column includes information that is unique to each sample, such as the medications taken by the patient, or any other descriptive information. The same character restrictions that apply to the metadata columns in guideline four apply to sample descriptions. Sample/Run Description should be kept brief, if possible.
Information that applies to all samples in a mapping file should go in the run description section, which is defined as lines starting with a # character, immediately following the header line (See example format below.)
There should be no empty lines or comment lines (starting with #) throughout the metadata, with the exception of any additional run description lines that immediately follow the initial header line.

#SampleID BarcodeSequence LinkerPrimerSequence Treatment DOB Description
#Example mapping file for the QIIME analysis package. 
#These 9 samples are from a study of the effects of
#exercise and diet on mouse cardiac physiology (Crawford, et al, PNAS, 2009).
PC.354 AGCACGAGCCTA YATGCTGCCTCCCGTAGGAGT Control 20061218 Control_mouse__I.D._354
PC.355 AACTCGTCGATG YATGCTGCCTCCCGTAGGAGT Control 20061218 Control_mouse__I.D._355
PC.356 ACAGACCACTCA YATGCTGCCTCCCGTAGGAGT Control 20061126 Control_mouse__I.D._356
PC.481 ACCAGCGACTAG YATGCTGCCTCCCGTAGGAGT Control 20070314 Control_mouse__I.D._481
PC.593 AGCAGCACTTGT YATGCTGCCTCCCGTAGGAGT Control 20071210 Control_mouse__I.D._593
PC.607 AACTGTGCGTAC YATGCTGCCTCCCGTAGGAGT Fast 20071112 Fasting_mouse__I.D._607
PC.634 ACAGAGTCGGCT YATGCTGCCTCCCGTAGGAGT Fast 20080116 Fasting_mouse__I.D._634
PC.635 ACCGCAGAGTCA YATGCTGCCTCCCGTAGGAGT Fast 20080116 Fasting_mouse__I.D._635
PC.636 ACGGTGAGTGTC YATGCTGCCTCCCGTAGGAGT Fast 20080116 Fasting_mouse__I.D._636
    

Import the metadata tabular from Zenodo:

https://zenodo.org/record/7871630/files/metadata.tabular

Kraken-biom Tool (to convert Kraken2 report into the correct format for Phinch) with the following parameters:
- Input: Report output of Kraken2
- Sample Metadata file: Metadata tabular
- Output format: JSON
Download BIOM output file and upload it in https://usegalaxy.eu/phinch/index.html
The first pages shows an overview of your samples. Here, you have the possibility to further filter your data, for example by date or location, depending on which information you provided in your metadata file.
- Question
- How many sequence reads do the samples contain?
Click on Proceed to gallery to see an overview of all visualization options

Taxonomy Bar Charts: the abundance of different taxa depicted in different colors in your samples. On top of the chart you can select which rank is supposed to be shown in the chart. You can also change the display options to for example switch between value und percentage.
- Question
- What information can you get from hovering over a sample?
- How many percent of the sample reads are bacteria and how many are eukaryota?
Bubble Charts: the distribution of taxa across the whole dataset at the rank that you can choose above the chart. When hovering over the bubbles, you get additional information concerning the taxon.
- Question
- Which is the most abundant Class?
- How many reads are found in both samples?
Sankey Diagrams: that is depicting the abundance of taxonomies as a flow chart. Again, you can choose the taxonomy level that you want to show in your diagram. When clicking on one bar of the diagram, this part is enlarged for better view.
Donut Partitions: summarizes the microbial community according to non-numerical attributes. In the drop-down menu at the top right corner, you can switch between the different attributes provided in the metadata file.
- In our case, you can for example choose the ‘environmental medium’ to see the difference between sediment and water (It doesn’t really make a lot of sense in our very simple dataset, as this will show the same result as sorting them by sample 0 and 1, but if attributes change across different samples this might be an interesting visualization option).
- When clicking on one part of the donut you will also find the distribution of the taxon across the samples.
- On the right hand side you can additionally choose if you’d like to have dynamic y axis or prefer standard y axis to compare different donuts with each other.
Attributes Column Chart: summarizes the microbial community according to numerical attributes. In the drop-down menu at the top right corner, you can switch between the different attributes provided in the metadata file.
- In our case, you can for example choose the ‘geographic location’ to (again, it doesn’t really make a lot of sense in our very simple dataset, as this will show the same result as sorting them by sample 0 and 1, but if attributes change across different samples this might be an interesting visualization option).

Visualization using Pavian

Pavian (pathogen visualization and more) is an interactive visualization tool for metagenomic data. It was developed for the clinical metagenomic problem to find a disease-causing pathogen in a patient sample, but it is useful to analyze and visualize any kind of metagenomics data.

Download Report files from Report collection of Kraken
Upload Report files on Pavian server: https://fbreitwieser.shinyapps.io/pavian/
Click on Save table
Click on Result Overview. This page shows the summary of the classifications in the selected sample set:

Question

Does both sample have same size?
What are the percentage of classified reads for JC1A and JP4D?
Are the percentage of bacterial reads similar?

Inspect samples with Pavian

Let’s now inspect assignements to reads per sample.

Click on Sample in the left panel
Select JC1A in the Select sample drop-down on the top

The first view gives a Sankey diagram for one sample:

A sankey diagram is a visualization used to depict a flow from one set of values to another
The different set of values represented as the horizontal axis represents the taxonomy hierarchy from domain on the left to species on the right

Click on Proteobacteria in the Sankey plot
Inspect the created graph on the right

Question

Are the number of reads assigned to Proteobacteria similar for both samples?
Why?

Compare samples with Pavian

Click on Comparison in the left panel
Select % and unclick Reads in the blue area drop-down on the top
Click on Domain green button
- Is there similar proportion of Bacteria in both samples?
Select Homo sapiens in the Filter taxa box below the green buttons
- Question
- Is there similar proportion of Bacteria in both samples?
Click on Class green button
- Question
- How are the diversities of classes in both samples?

Marker gene based approach using MetaPhlAn

MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea and Eukaryotes) from metagenomic shotgun sequencing data (i.e. not 16S) at species-level.

It allows:

unambiguous taxonomic assignments;
accurate estimation of organismal relative abundance;
species-level resolution for bacteria, archaea, eukaryotes and viruses;
strain identification and tracking
orders of magnitude speedups compared to existing methods.
microbiota strain-level population genomics
MetaPhlAn clade-abundance estimation

The basic usage of MetaPhlAn consists in the identification of the clades (from phyla to species and strains in particular cases) present in the microbiota obtained from a microbiome sample and their relative abundance.

Assign taxonomic labels with MetaPhlAn

MetaPhlAn in Galaxy can not directly take as input a paired collection but expect 2 collections: 1 with forward data and 1 with reverse. Before launching MetaPhlAn, we need to split our input paired collection.

Unzip collection Tool with the following parameters:
- Paired input to unzip: Input paired collection
MetaPhlAn Tool with the following parameters:
- In Inputs
  - Input(s): Fasta/FastQ file(s) with microbiota reads
    - Fasta/FastQ file(s) with microbiota reads: Paired-end files
      - Forward paired-end Fasta/FastQ file with microbiota reads: output of Unzip collection with forward in the name
      - *Reverse paired-end Fasta/FastQ file with microbiota reads”: output of Unzip collection with reveerse in the name
- In Outputs:
  - Output for Krona?: Yes

5 files and a collection are generated by MetaPhlAn:

Predicted taxon relative abundances: tabular files with the community profile

Each line contains 4 columns:
1. the lineage with different taxonomic levels, from high level taxa (kingdom: k__) to more precise taxa
2. the NCBI taxon ids of the lineage taxonomic level
3. the relative abundance found for our sample for the lineage
4. any additional species
Question
1. Which kindgoms have been identified for JC1A?
2. For JP4D?
3. How is the diversity for JP4D compared to Kraken results?
  - !! Diversity is really reduced for JP4D using MetaPhlAn, compared to the one identified with Kraken !!
Predicted taxon relative abundances for Krona: same information as the previous files but formatted for > visualization using Krona

Each line represent an identified taxons with 9 columns:
- Column 1: The percentage of reads assigned the taxon
- Column 2-9: The taxon description at the different taxonomic levels from Kindgom to more precise taxa
A collection with the same information as in the tabular file but splitted into different files, one per taxonomic level
BIOM file with community profile in BIOM format
- Download it and then dupload on https://usegalaxy.eu/phinch/
Bowtie2 output and SAM file with the results of the sequence mapping on the reference database

Generate Krona visualisation

Let’s now run Krona to visualize the communities

Krona pie chart Tool with the following parameters:
- Type of input data: tabular
- Input file: Krona output of Metaphlan

The community looks a lot less diverse than with Kraken.

This is probably due to the reference database, which is potentially not complete enough yet to identify all taxons.
Or there are too few reads in the input data to cover enough marker genes. Indeed, no taxon has been identified for JC1A, w hich contains much less reads than JP4D.

However, Kraken is also known to have high number of false positive. A benchmark dataset or mock community where the community DNA content is known would be required to correctly judge which tool provides the better taxonomic classification.

Generate GraPhlAn visualisation

GraPhlAn is another software tool for producing high-quality circular representations of taxonomic and phylogenetic trees.

We first need to convert the MetaPhlAn output using Export to GraPhlAn. This conversion software tool produces both annotation and tree file for GraPhlAn.

Export to GraPhlAn need the MetaPhlAn table with predicted taxon abundance with only 2 columns: the lineage and the abundance.

Cut Tool with the following parameters:
- Cut columns: c1,c3
- From: Predicted taxon relative abundances (output of MetaPhlAn)
Rename as Cut predicted taxon relative abundances table
Export to GraPhlAn Tool with the following parameters:
- Input file: Cut predicted taxon relative abundances table
- List which levels should be annotated in the tree: 1,2
- List which levels should use the external legend for the annotation: 3,4,5
- List which levels should be highlight with a shaded background: 1
Generation, personalization and annotation of tree Tool with the following parameters:
- Input tree: Tree (output of Export to GraPhlAn)
- Annotation file: Annotation (output of Export to GraPhlAn)
GraPhlAn Tool with the following parameters:
- Input tree: Tree in PhyloXML
- Output format: PNG
Inspect GraPhlAn output

Conclusion

In this tutorial, we look how to get the community profile from microbiome data.

There are 3 different types of taxonomic assignment (genome, gene or k-mer based)
Kraken or MetaPhlAn can be used in Galaxy for taxonomic assignment
- Kraken:
  - k-mer approach;
  - Fastest
  - Most memory efficient;
  - Good performance metrics;
  - Very fast on large numbers of samples;
  - Allow custom databases when high amounts of memory (>100 Gb) are available
- MetaPhlAn:
  - Marker Genes approach;
  - Recommended for low computational requirements (< 2 Gb of memory)
Visualization of community profiles can be done using Krona, Pavian or GraPhlAn
The taxonomic classification tool to use depends on the data

Hands-On

Now, it’s your turn!