Introduction to Metagenomics

Why study the microbiome? Amplicon and shotgun sequencing. Analysis pipelines and Visualisation options

Why study the microbiome?

Health care research

Humans are full of microorganisms
Skin, gut, oral cavity, nasal cavity, eyes, ..
Affects health, drug efficacy, etc
Sometimes referred to as your second genome
- ~10 times more cells than you
- ~100 times more genes than you
- ~1000s different species

Go Up

Environmental studies

Microbes in the soil affect plants and animals
Improve agriculture

Go Up

2 different approaches

Puzzle analogy: think of each organism as a puzzle

Shotgun

We consider all pieces (DNA) from all puzzles (organisms); we throw the pieces in a big pile, and need to reconstruct each individual puzzle from this. It is more complex, but you get full details (functional information) about the picture on each puzzle.

Sequence all DNA
More information
Higher complexity
Higher cost

Go Up

Amplicon

We only look at the corner pieces, these are easy to spot, and all puzzles have them. It will not tell us all the details (e.g. functional information) of the picture on the puzzle, but might be enough to tell us whether it is a landscape or a portrait or abstract art for instance (taxonomy).

Sequence only specific gene
No functional information
Less complex to analyse
Cheaper

Go Up

Amplicon Metagenomics

Targeted approach, e.g. 16S (bacteria), 18S (eukaryotes), ITS (fungi)
Amplify gene of specific taxonomic group, i.e. not host or environment

Go Up

Highly conserved gene: easy to target across all bacteria
With variable regions: distinguish between genus

V1, V2 etc are the variable regions

Go Up

Pros
- Well-established
- Inexpensive ($50-$100/sample)
Cons
- V-region choice can bias results
- Is based on a very well conserved gene, making it hard to resolve species and strains

Go Up

Shotgun metagenomics

Aims to sequence the “whole” metagenome

Pros:
- Not biased by amplicon primer set
- Not limited by conservation of the amplicon
- Can also provide functional information
Cons:
- Environmental contamination, including host
- More expensive ($1000+/sample)
- Complex data analysis
- Requires high performance computing, high memory, high compute capacity

Go Up

Analysis pipelines

OTU: operational taxonomic units
ASV: amplicon sequence variants

Go Up

Pre-processing

There are a lot of ways to filter and trim your data
Trade-off between quality and amount of information retained

Go Up

OTU Clustering

Operational Taxonomic Unit (OTU) clustering is a method used to group similar sequences of DNA together into operational taxonomic units.

When clustering the dataset, OTUs will be generated by combining sequences that are similar to a set percentage level (traditionally 97%), with the most abundant sequence identified as the true sequence.

Cluster on 97% sequence similarity for genus-level differentiation
OTUs depend on the data set: no comparability of data sets

Go Up

ASV Denoising

Amplicon sequence variant (ASV) denoising is a method used to remove errors from amplicon sequencing data. The goal of ASV denoising is to identify true biological sequences and remove errors introduced during PCR amplification and sequencing.

ASV denoising is a newer method than OTU clustering and is based on a different algorithm based on a statistical model to estimate the error rate of each nucleotide position in the sequence, which is then used to correct errors in the data.

Discriminate sequence variations from sequencing errors
Based on an error model and the assumption that sequences containing errors are less likely observed repeatedly

Go Up

Chimera Removal

Chimera removal is a process used to identify and remove chimeric sequences from a dataset. Chimeras are sequences that are formed by two or more biological sequences incorrectly joined together. This often occurs during PCR reactions.

The goal of chimera removal is to identify and remove these sequences so that they do not interfere with downstream analysis.

During PCR multiple sequences can combine to form a hybrid
Must be removed from your data for better results

Go Up

Search marker database and taxonomy assignment

A marker database is a database of genes or proteins that are used to identify organisms in a sample. Taxonomy assignment is the process of assigning taxonomic labels to metagenomic sequences based on their similarity to sequences in reference databases.

This process is essential for understanding the composition of microbial communities in environmental samples.

Homology with reference databases
- Amplicon
- Shotgun: MetaPhlAn2 database
  - ~1M unique clade-specific marker genes
  - ~17,000 reference genomes (bacterial and archaeal, viral and eukaryotic)
Accuracy depends on quality and completeness of database used
Databases are inevitable incomplete

Go Up

Search functional database

Functional databases are databases that contain information about the functions of genes or proteins. These databases can be used to identify the functions of genes or proteins in a sample. Functional analysis is a main component of metagenomic analysis and is used to figure out what the organisms in a sample are actually doing.

One common approach is to query a specific set of known sequences that have an associated function, or can be related to a series of function.

Databases to identify gene families
- UniRef50
- UniRef90
Grouping in other functional categories
- MetaCyc Reactions
- KEGG Orthogroups (KOs)
- Pfam domains
- Level-4 enzyme commission (EC) categories
- EggNOG (including COGs)
- Gene Ontology (GO)
- Informative GO
Pathway reconstruction