Introduction to Metagenomics

Why study the microbiome? Amplicon and shotgun sequencing. Analysis pipelines and Visualisation options


Why study the microbiome?

Health care research

  • Humans are full of microorganisms
  • Skin, gut, oral cavity, nasal cavity, eyes, ..
  • Affects health, drug efficacy, etc
  • Sometimes referred to as your second genome
    • ~10 times more cells than you
    • ~100 times more genes than you
    • ~1000s different species

Go Up


Environmental studies

  • Microbes in the soil affect plants and animals
  • Improve agriculture

Go Up


2 different approaches

Puzzle analogy: think of each organism as a puzzle

Shotgun

We consider all pieces (DNA) from all puzzles (organisms); we throw the pieces in a big pile, and need to reconstruct each individual puzzle from this. It is more complex, but you get full details (functional information) about the picture on each puzzle.

  • Sequence all DNA
  • More information
  • Higher complexity
  • Higher cost

Go Up


Amplicon

We only look at the corner pieces, these are easy to spot, and all puzzles have them. It will not tell us all the details (e.g. functional information) of the picture on the puzzle, but might be enough to tell us whether it is a landscape or a portrait or abstract art for instance (taxonomy).

  • Sequence only specific gene
  • No functional information
  • Less complex to analyse
  • Cheaper

Go Up


Amplicon Metagenomics

  • Targeted approach, e.g. 16S (bacteria), 18S (eukaryotes), ITS (fungi)
  • Amplify gene of specific taxonomic group, i.e. not host or environment

Go Up


  • Highly conserved gene: easy to target across all bacteria
  • With variable regions: distinguish between genus

V1, V2 etc are the variable regions

Go Up


  • Pros
    • Well-established
    • Inexpensive ($50-$100/sample)
  • Cons
    • V-region choice can bias results
    • Is based on a very well conserved gene, making it hard to resolve species and strains

Go Up


Shotgun metagenomics

Aims to sequence the “whole” metagenome

  • Pros:
    • Not biased by amplicon primer set
    • Not limited by conservation of the amplicon
    • Can also provide functional information
  • Cons:
    • Environmental contamination, including host
    • More expensive ($1000+/sample)
    • Complex data analysis
    • Requires high performance computing, high memory, high compute capacity

Go Up


Analysis pipelines

  • OTU: operational taxonomic units
  • ASV: amplicon sequence variants

Go Up


Pre-processing

  • There are a lot of ways to filter and trim your data
  • Trade-off between quality and amount of information retained

Go Up


OTU Clustering

Operational Taxonomic Unit (OTU) clustering is a method used to group similar sequences of DNA together into operational taxonomic units.

When clustering the dataset, OTUs will be generated by combining sequences that are similar to a set percentage level (traditionally 97%), with the most abundant sequence identified as the true sequence.

  • Cluster on 97% sequence similarity for genus-level differentiation
  • OTUs depend on the data set: no comparability of data sets

Go Up


ASV Denoising

Amplicon sequence variant (ASV) denoising is a method used to remove errors from amplicon sequencing data. The goal of ASV denoising is to identify true biological sequences and remove errors introduced during PCR amplification and sequencing.

ASV denoising is a newer method than OTU clustering and is based on a different algorithm based on a statistical model to estimate the error rate of each nucleotide position in the sequence, which is then used to correct errors in the data.

  • Discriminate sequence variations from sequencing errors
  • Based on an error model and the assumption that sequences containing errors are less likely observed repeatedly

Go Up


Chimera Removal

Chimera removal is a process used to identify and remove chimeric sequences from a dataset. Chimeras are sequences that are formed by two or more biological sequences incorrectly joined together. This often occurs during PCR reactions.

The goal of chimera removal is to identify and remove these sequences so that they do not interfere with downstream analysis.

  • During PCR multiple sequences can combine to form a hybrid
  • Must be removed from your data for better results

Go Up


Search marker database and taxonomy assignment

A marker database is a database of genes or proteins that are used to identify organisms in a sample. Taxonomy assignment is the process of assigning taxonomic labels to metagenomic sequences based on their similarity to sequences in reference databases.

This process is essential for understanding the composition of microbial communities in environmental samples.

  • Homology with reference databases
    • Amplicon
    • Shotgun: MetaPhlAn2 database
      • ~1M unique clade-specific marker genes
      • ~17,000 reference genomes (bacterial and archaeal, viral and eukaryotic)
  • Accuracy depends on quality and completeness of database used
  • Databases are inevitable incomplete

Go Up


Search functional database

Functional databases are databases that contain information about the functions of genes or proteins. These databases can be used to identify the functions of genes or proteins in a sample. Functional analysis is a main component of metagenomic analysis and is used to figure out what the organisms in a sample are actually doing.

One common approach is to query a specific set of known sequences that have an associated function, or can be related to a series of function.

  • Databases to identify gene families
    • UniRef50
    • UniRef90
  • Grouping in other functional categories
    • MetaCyc Reactions
    • KEGG Orthogroups (KOs)
    • Pfam domains
    • Level-4 enzyme commission (EC) categories
    • EggNOG (including COGs)
    • Gene Ontology (GO)
    • Informative GO
  • Pathway reconstruction

Go Up


Results: OTU table

Go Up


Results: Visualizations

Krona

  • Interactive exploration of sample taxonomy

Go Up


Phinch

  • BIOM file input
  • various visualizations
  • multi-sample data

Go Up