Analyses of metagenomics data

How to analyze metagenomics data? What information can be extracted of metagenomics data? What is the difference between amplicon and shotgun data? What are the difference in the analyses of amplicon and shotgun data?

In metagenomics, information about micro-organisms in an environment can be extracted with two main techniques:

Amplicon sequencing, which sequences only the rRNA or ribosomal DNA of organisms
Shotgun sequencing, which sequences full genomes of the micro-organisms in the environment

In this lecture, we will introduce the two main types of analyses with their general principles and differences.

We will use two datasets (one amplicon and one shotgun) from the same project on the Argentinean agricultural pampean soils.

In this project, three different geographic regions that are under different types of land uses and two soil types (bulk and rhizospheric) were analyzed using shotgun and amplicon sequencing.

We will focus on data from the Argentina Anguil and Pampas Bulk Soil: the original study included one more geographical regions (Rascovan et al. 2013.)

Amplicon data

Amplicon sequencing is a highly targeted approach for analyzing genetic variation in specific genomic regions. In the metagenomics fields, amplicon sequencing refers to capture and sequence of rRNA data in a sample. It can be 16S for bacteria or archea or 18S for eukaryotes.

The 16S rRNA gene has several properties that make it ideally suited to our purposes

Present in all living organisms
Highly conserved + highly variable regions
Huge reference databases

With amplicon data, we can determine the micro-organisms from which the sequences in our sample are coming from. This is called taxonomic assignation. We try to assign sequences to taxons and then classify or extract the taxonomy in our sample.

Importing the data

16S rDNA V4 region

Create a new history

Upload datasets

https://zenodo.org/record/815875/files/SRR531818_pampa.fasta
https://zenodo.org/record/815875/files/SRR651839_anguil.fasta

Rename them

Preparing data

We will perform a multisample analysis with mothur. To do this, we will merge all reads into a single file, and create a group file, indicating which reads belong to which samples.

Prepare multisample analysis

Merge.files tool with the following parameters:
- Merge: fasta files
- Inputs: the two sample fasta files
Make.group tool with the following parameters:
- Method to create group file: Manually specify fasta files and group names
- Additional: Add two elements to this repeat
  - First element
    - fasta - Fasta to group: SRR531818_pampa file
    - group - Group name: pampa
  - Second element (click on Insert Additional)
    - fasta - Fasta to group: SRR651839_anguil file
    - group - Group name: anguil
View the resulting datasets.

It contains two columns: the first contains the read names, the second contains the group (sample) name, in our case pampa or anguil.

Optimization of files for computation

Because we are sequencing many of the same organisms, we anticipate that many of our sequences are duplicates of each other.

Remove duplicate sequences

Unique.seqs tool with the following parameters:
- fasta: the merged fasta file
- output format: Name File
View the resulting datasets.

Unique.seqs tool outputs two files:
1. a fasta file containing only the unique sequences
2. a names file consisting of two columns:
  - the first contains the sequence names for each of the unique sequences,
  - the second column contains all other sequence names that are identical to the representative sequence in the first column.

Question

How many sequences were unique?
How many duplicates were removed?

Count sequences

Count.seqs tool with the following parameters:
- name: the name file from Unique.seqs
- Use a group file: yes
- group: the group file from Make.group
View the resulting datasets.

The Count.seqs file keeps track of the number of sequences represented by each unique representative across multiple samples. We will pass this file to many of the following tools to be used or updated as needed.

Quality Control

Summarize data

Summary.seqs tool with the following parameters:
- fasta: the fasta from Unique.seqs
- count: count table from Count.seqs
- output logfile?: yes
View the resulting datasets.
- The summary output files give information per read.
- The logfile outputs also contain some summary statistics:

We have a total of 19,502 unique sequences, representing 20,000 total sequences that vary in length between 80 and 275 bases.
At least some of our sequences had some ambiguous base calls.
At least one read had a homopolymer stretch of 31 bases, this is likely an error so we would like to filter such reads out as well.

Filter reads based on quality, length and maximum homopolymer length

We will remove any sequences with ambiguous bases (maxambig parameter), homopolymer stretches of 9 or more bases (maxhomop parameter) and any reads longer than 275 bp or shorter than 225 bp.

Screen.seqs tool with the following parameters:
- fasta: the fasta file from Unique.seqs
- minlength: 225
- maxlength 275
- maxambig: 0
- maxhomop: 8
- count: the count file from Count.seqs
View the resulting datasets.

Question

How many reads were removed in this screening step?

Sequence Alignment

Aligning our sequences to a reference helps improve OTU assignment Schloss et. al., so we will now align our sequences to an alignment of the V4 variable region of the 16S rRNA.

This alignment has been created as described in mothur’s MiSeq SOP from the Silva reference database.

Import the reference file in your history

     https://zenodo.org/record/815875/files/silva.v4.fasta

Align.seqs tool with the following parameters:
- fasta: the good.fasta output from Screen.seqs
- Select Reference Template from: Your history
- reference: the silva.v4.fasta reference file
- flip: Yes This step may take a few minutes, please be patient.
Summary.seqs tool with the following parameters:
- fasta: the aligned output from Align.seqs
- count: count_table output from Screen.seqs
View the resulting datasets.

Question

How many sequences have been aligned?
Between which positions most of the reads are aligned to this references?

Remove poorly aligned sequences

To make sure that everything overlaps the same region we’ll re-run Screen.seqs to get sequences that start at or before position 3,080 and end at or after position 13,424.

Screen.seqs tool with the following parameters:
- fasta: the aligned fasta file
- start: 3080
- end: 13424
- count: the group file created by the previous run of Screen.seqs

Question

How many reads were removed in this screening step?

Filter sequences

Now we know our sequences overlap the same alignment coordinates, we want to make sure they only overlap that region. So we’ll filter the sequences to remove the overhangs at both ends.

In addition, there are many columns in the alignment that only contain external gap characters (i.e. “.”), while columns containing only internal gap characters (i.e., “-“) are not considered. These can be pulled out without losing any information.

Filter.seqs tool with the following parameters:
- fasta: good.fasta output from Screen.seqs
- trump: .

Extraction of taxonomic information

The main questions when analyzing amplicon data are:

Which micro-organisms are present in an environmental samples?
And in what proportion?
What is the structure of the community of the micro-organisms?

The idea is to take the sequences and assign them to a taxon. To do that, we group (or cluster) sequences based on their similarity to defined Operational Taxonomic Units (OTUs): groups of similar sequences that can be treated as a single “genus” or “species” (depending on the clustering threshold).

In 16S metagenomics approaches, OTUs are clusters of similar sequence variants of the 16S rDNA marker gene sequence.

Each of these clusters is intended to represent a taxonomic unit of a bacterial species or genus depending on the sequence similarity threshold.

OTU clusters are defined by a 97% identity threshold of the 16S gene sequence variants at genus level.
98% or 99% identity is suggested for species separation.

Perform preliminary clustering of sequences and remove undesired sequences

The first thing we want to do is to further de-noise our sequences from potential sequencing errors, by pre-clustering the sequences using the Pre.cluster tool, allowing for up to 2 differences between sequences.

Pre.cluster will split the sequences by group and then sort them by abundance, then go from most abundant to least and identify sequences that differ by no more than 2 nucleotides from on another.

If this is the case, then they get merged.

We generally recommend allowing 1 difference for every 100 basepairs of sequence

Pre.cluster tool with the following parameters:
- fasta: the fasta output from the last Filter.seqs run
- name file or count table: the count table from the last Screen.seqs step
- diffs: 2

Question

How many unique sequences are we left with after this clustering of highly similar sequences?

Classify the sequences into phylotypes

We would like to classify the sequences using a training set, which is again is provided on mothur’s MiSeq SOP.

Import the trainset16_022016.pds.fasta and trainset16_022016.pds.tax in your history
Classify.seqs tool with the following parameters:
- fasta: the fasta output from Pre.cluster
- Select Reference Template from: History
- reference: trainset16_022016.pds.fasta from your history
- Select Taxonomy from to History
- taxonomy: trainset16_022016.pds.tax from your history
- count: the count table from Pre.cluster
This step may take a couple of minutes
View the resulting datasets.

Every read now has a classification.

The next step is to use this information to determine the abundances of the different found taxa:

Assign sequences to OTUs

Cluster.split tool with the following parameters:
- Split by: Classification using fasta
- fasta: the fasta output from Pre.cluster
- taxonomy: the taxonomy output from Classify.seqs
- count: the count table output from Pre.cluster
- Clustering method: Average Neighbour
- cutoff: 0.15
View the resulting datasets.

We obtain a table with the columns being the different identified OTUs, the rows the different distances and the cells the ids of the sequences identified for these OTUs for the different distances.

Estimate OTU abundance

Now, we want to know how many sequences are in each OTU from each group with a distance of 0.03 (97% similarity).

Make.shared tool with the following parameters:
- Select input type: OTU list
- list: list output from Cluster.split
- supply group or count table: the count table from Pre.cluster
- label: 0.03

Classify the OTUs

We want to know the taxonomy for each of our OTUs.

Classify.otu tool with the following parameters:
- list: list output from Cluster.split
- count: he count table from Pre.cluster
- Select Taxonomy from: History
- taxonomy the taxonomy output from Classify.seqs
- label to 0.03

Question

How many OTUs with taxonomic assignation are found for the Anguil sample? And for the Pampa sample? (tax.summary)
What is the annotation of first OTU and its size? (taxonomy)

Visualization

We have now determined our OTUs and classified them, but looking at a long text file is not very informative. Let’s visualize our data using Krona:

Krona

First we convert our mothur taxonomy file to a format compatible with Krona

Taxonomy-to-Krona tool with the following parameters:
- Taxonomy file: the taxonomy output from Classify.otu (note: this is a collection input)
Krona pie chart tool with the following parameters:
- Type of input: Tabular
- Input file: taxonomy output from Taxonomy-to-Krona (collection)

This produced a single plot for both your samples, but what if you want to compare the two samples?

Per-sample Krona plots

Classify.otu tool
- Hit the rerun button on the Classify.otu job in your history and see if you can find settings that will give you per-sample taxonomy data
Krona tool
- Now use this new output collection to create per-sample Krona plots
Taxonomy-to-Krona tool with the following parameters:
- Taxonomy file: the taxonomy output from Classify.otu (note: this is a collection input)
Krona pie chart tool with the following parameters:
- Type of input: Tabular
- Input file: taxonomy output from Taxonomy-to-Krona (collection)

In this new Krona output you can switch between the combined plot and the per-sample plots via the selector in the top-left corner.

Question

Which soil sample had a higher percentage of Acidobacteria, anguil or pampa?
what were the respective percentages?

Visualization of the community structure with Phinch

To further explore the community structure, we can visualize it with dedicated tools such as Phinch.

Make.biom tool with the following parameters:
- shared: Make.shared output
- constaxonomy: taxonomy output from the first run of Classify.otu (collection)
Download and rename it to name-file.biom
Upoload on https://usegalaxy.eu/phinch/

Downstream Analysis

Once we have information about the community structure (OTUs with taxonomic structure), we can do more analysis on it:

estimation of the diversity of micro-organism,
comparison of diversity between samples,
analysis of populations

Shotgun metagenomics data

In shotgun data, full genomes of the micro-organisms in the environment are sequenced (not only the 16S or 18S).

We can then have access to the rRNA (only a small part of the genomes), but also to the other genes of the micro-organisms.

Using this information, we can try to answer questions such as “What are the micro-organisms doing?” in addition to the question “What micro-organisms are present?”.

https://usegalaxy.eu/

Importing the data

Create a new history

Upload datasets

https://zenodo.org/record/815875/files/SRR606451_pampa.fasta

Rename them

Extraction of taxonomic information

As for amplicon data, we can extract taxonomic and community structure information from shotgun data.

Different approaches can be used:

Same approach as for amplicon data with identification and classification of OTUs
- Such an approach requires a first step of sequence sorting to extract only the 16S and 18S sequences, then using the same tools as for amplicon data. However, rRNA sequences represent a low proportion (< 1%) of the shotgun sequences so such an approach is not the most statistically supported
Assignation of taxonomy on the whole sequences using databases with marker genes

In this tutorial, we use the second approach with MetaPhlAn2. See MetaPhlAn2

This tools is using a database of ~1M unique clade-specific marker genes (not only the rRNA genes) identified from ~17,000 reference (bacterial, archeal, viral and eukaryotic) genomes.

Taxonomic assignation with MetaPhlAn2

MetaPhlAN2 tool with:
- *Input file: the imported file
- Database with clade-specific marker genes: locally cached
- Cached database with clade-specific marker genes: MetaPhlAn2 clade-specific marker genes

This step may take a couple of minutes.

3 files are generated:

A tabular file with the community structure Each line contains a taxa and its relative abundance found for our sample. The file starts with high level taxa (k__) and go to more precise taxa.
A BIOM file with the same information as the previous file but in BIOM format It can be used then by mothur and other tools requiring community structure information in BIOM format
A SAM file with the results of the mapping of the sequences on the reference database

Question

What is the most precise level we have access to with MetaPhlAn2?
What are the two orders found in our sample?
What is the most abundant family in our sample?

Interactive visualization with KRONA

Format MetaPhlAn2 output for Krona tool with:
- Input file to Community profile output of MetaPhlAn2
KRONA pie chart tool with:
- What is the type of your input data: MetaPhlan
- Input file: the output of Format MetaPhlAn2

Extraction of functional information

What are the micro-organisms doing?
Which functions are performed by the micro-organisms in the environment?

In the shotgun data, we have access to the gene sequences from the full genome. We use that to identify the genes, associate them with a function, build pathways, etc., to investigate the functional part of the community.

Metabolism function identification

HUMAnN2 tool with:
- Input sequence file: the imported sequence file
- Use of a custom taxonomic profile: Yes
- Taxonomic profile file: Community profile output of MetaPhlAn2
- Nucleotide database: Locally cached
- Nucleotide database: Full
- Protein database: Locally cached
- Protein database: Full UniRef50
- Search for uniref50 or uniref90 gene families?: uniref50
- Database to use for pathway computations: MetaCyc
- Advanced Options
  - Remove stratification from output: Yes
This step is long so we generated the output for you!

Import the 3 files whose the name is starting with “humann2”

https://zenodo.org/record/815875/files/humann2_gene_families_abundance.tsv
https://zenodo.org/record/815875/files/humann2_pathways_abundance.tsv
https://zenodo.org/record/815875/files/humann2_pathways_coverage.tsv

HUMAnN2 generates 3 files

A file with the abundance of gene families

Gene family abundance is reported in RPK (reads per kilobase) units to normalize for gene length. It reflects the relative gene (or transcript) copy number in the community.

The “UNMAPPED” value is the total number of reads which remain unmapped after both alignment steps (nucleotide and translated search). Since other gene features in the table are quantified in RPK units, “UNMAPPED” can be interpreted as a single unknown gene of length 1 kilobase recruiting all reads that failed to map to known sequences.
A file with the coverage of pathways

Pathway coverage provides an alternative description of the presence (1) and absence (0) of pathways in a community, independent of their quantitative abundance.
A file with the abundance of pathways

Question

How many gene families and pathways have been identified?

Normalize the gene family abundances

The RPK for the gene families are quite difficult to interpret in term of relative abundance. We decide then to normalize the values

Renormalize a HUMAnN2 generated table tool with:
- Gene/pathway table: the gene family table generated with HUMAnN2
- Normalization scheme: Relative abundance
- Normalization level: Normalization of all levels by community total

Question

What percentage of sequences has not be assigned to a gene family?
What is the most abundant gene family?

Conclusion

With amplicon data, we can extract information about the studied community structure
With shotgun data, we can extract information about the studied community structure and also the functions realised by the community
The tools used to analyze amplicon and shotgun data are different, except for the visualisation
Metagenomics data analyses are complex and time-consuming