Microbial Variant Calling

How do we detect differences between a set of reads from a microorganism and a reference genome?

Variant calling is the process of identifying differences between two genome samples. Usually differences are limited to single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels).

Imagine that you have been asked to find the differences between a sample that has been sequenced and a known genome.

For example:

You have a new sample from a patient and you want to see if it has any differences from a well known reference genome  of the same species. 

Typically, you would have a couple of fastq read files sent back to you from the sequencing provider and either an annotated or non annotated reference genome.

In this tutorial, we will use the tool Snippy to find high confidence differences (indels or SNPs) between our known genome and our reads, ie. finds SNPs between a haploid reference genome and your NGS sequence reads.

Snippy pipeline consists of 3 steps:

Read mapping: to align the reads to the reference genome
- BWA MEM with a custom set of settings which are very suitable to aligning reads for microbial type data
Variant Calling: to decide (“call”) if any of the resulting discrepancies are real variants or technical artifacts that can be ignored
- Freebayes with a custom set of settings
Functional Effect Prediction: to describe what the predicted changes do in terms of the genes themselves.
- SnpEff with a custom set of settings

The Galaxy wrapper for Snippy has the ability to change some of the underlying tool settings in the advanced section but it is not recommended.

Go Up

Get the data

The data is a subset of a real dataset from a Staphylococcus aureus bacteria. We have a closed genome sequence and an annotation for our wildtype strain.

A whole-genome shotgun approach was used to produce a series of short sequence reads on an Illumina DNA sequencing instrument for the mutant strain.

The reads are paired-end
Each read is on average 150 bases
The reads would cover the original wildtype genome to a depth of 19x

The files we will be using are:

mutant_R1.fastq & mutant_R2.fastq - the read files in fastq format.
wildtype.fna - The sequence of the reference strain in fasta format.
wildtype.gbk - The reference strain with gene and other annotations in genbank format.
wildtype.gff - The reference strain with gene and other annotations in gff3 format.

NOTE! This data is available at Zenodo

NOTE! Please, use

Create a new history

Upload datasets

https://zenodo.org/record/582600/files/mutant_R1.fastq
https://zenodo.org/record/582600/files/mutant_R2.fastq
https://zenodo.org/record/582600/files/wildtype.fna
https://zenodo.org/record/582600/files/wildtype.gbk
https://zenodo.org/record/582600/files/wildtype.gff

Go Up

Find variants with Snippy

We will now run the Snippy tool on our reads, comparing it to the reference.

Snippy is a tool for rapid bacterial SNP calling and core genome alignments.
Snippy finds SNPs between a haploid reference genome and your NGS sequence reads. It will find both substitutions (snps) and insertions/deletions (indels).
If we give Snippy an annotated reference, it will silently run SnpEff which will figure out the effect of any changes on the genes and other features.
If we just give Snippy the reference sequence alone without the annotations, it will not run SnpEff.

We have an annotated reference and so will use it in this case.

Snippy tool with the following parameters:
- Reference File: the wildtype.gbk file (if the genbank file is not selectable, make sure to change its datatype to ‘genbank’)
- Single or Paired-end reads: Paired
- Select first set of reads: mutant_R1.fastq
- Select second set of reads: mutant_R2.fastq
- Select all outputs

Go Up

Examine Snippy output

Snippy has taken the reads and:

mapped them against the reference using BWA MEM,
looked through the resulting BAM file and found differences using some fancy Bayesian statistics (Freebayes),
filtered the differences for sensibility and finally checked what effect these differences will have on the predicted genes and other features in the genome.

It produces quite a bit of output, there can be up to 10 output files.

snps vcf file: The final annotated variants in VCF format
snps gff file: The variants in GFF3 format
snps table: A simple tab-separated summary of all the variants
snps summary: A summary of the SNPs called
log file: A log file with the commands run and their outputs
aligned fasta: A version of the reference but with - at position with depth=0 and N for 0 < depth < –mincov (does not have variants)
consensus fasta: A version of the reference genome with all variants instantiated
mapping depth: A table of the mapping depth
mapped reads bam: A BAM file containing all of the mapped reads
outdir: A tarball of the Snippy output directory for input into Snippy-core if required

Question

Which types of variants have been found?
What is the third variant called?
What is the product of the mutation?
What might be the result of such a mutation?

Go Up

View Snippy output in JBrowse

We could go through all of the variants like this and read them out of a text table, but this is onerous and doesn’t really give the context of the changes very well.

In order to have a visualisation of the SNPs and the other relevant data, in Galaxy we can use JBrowse

JBrowse Tool with the following parameters
- Reference genome to display: Use a genome from history
- Select the reference genome: wildtype.fna
- Genetic Code: 11: The Bacterial, Archaeal and Plant Plastid Code
- In Track Group: sequence reads
  - Insert Track Group
  - Track Category: Gene Annotations
    - Track Type: BAM Pileups
    - BAM Track Data: bam file from Snippy
    - Autogenerate SNP Track: Yes
    - Track Visibility: On for new users
- In Track Group: variants
  - Insert Track Group
  - Track Category: Variants
    - Track Type: GFF/GFF3/BED Features
    - GFF/GFF3/BED Track Data snps gff file from Snippy
    - Track Visibility: On for new users
- In Track Group:
  - Insert Track Group
  - Track Category:
    - Track Type: GFF/GFF3/BED Features
    - GFF/GFF3/BED Track Data: wildtype.gff
    - JBrowse Track Type [Advanced]: Canvas Features
    - Click on JBrowse Styling Options [Advanced]
      - JBrowse style.label: product
      - JBrowse style.description product
    - Track Visibility: On for new users

Go Up

Inspecting the SNPs using JBrowse

Display all the tracks and practice maneuvering around
- Click on the tick boxes on the left to display the tracks
- Zoom out by clicking on the minus button to see sequence reads and their coverage (the grey graph)
- Zoom in by clicking on the plus button to see
  - probable real variants (a whole column of SNPs)
  - probable errors (single one here and there)
Look at the stop SNP
- Type 47299 in the coordinates box
- Click on Go to see the position of the SNP discussed above

Question

What is the correct codon at this position?
What is the mutation found here?

Go Up

Conclusion

By running a tool such as Snippy on your read files and reference genome, we can find where the biologically important changes between genomes of different strains occur and perhaps what they mean to the phenotype

We used a tool called Snippy to call variants between our reads and our reference genome.
As our reference genome had annotations, we could see what effect the changes have on the features as annotated in the reference and therefore make inferences on the possible changes to the phenotype.
We used the JBrowse genome browser to visualise what these changes look like

Go Up

Hands-On

Now, it’s your turn!

Go Up

Training Course in Galaxy for Metagenomic Analysis on Targeted and Whole Genome Sequencing Data

Microbial Variant Calling

Get the data

Find variants with Snippy

Examine Snippy output

View Snippy output in JBrowse

Inspecting the SNPs using JBrowse

Conclusion

Hands-On