Microbial Variant Calling

How do we detect differences between a set of reads from a microorganism and a reference genome?


Variant calling is the process of identifying differences between two genome samples. Usually differences are limited to single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels).

Imagine that you have been asked to find the differences between a sample that has been sequenced and a known genome.

For example:

You have a new sample from a patient and you want to see if it has any differences from a well known reference genome  of the same species. 

Typically, you would have a couple of fastq read files sent back to you from the sequencing provider and either an annotated or non annotated reference genome.

In this tutorial, we will use the tool Snippy to find high confidence differences (indels or SNPs) between our known genome and our reads, ie. finds SNPs between a haploid reference genome and your NGS sequence reads.

Snippy pipeline consists of 3 steps:

  1. Read mapping: to align the reads to the reference genome
    • BWA MEM with a custom set of settings which are very suitable to aligning reads for microbial type data
  2. Variant Calling: to decide (“call”) if any of the resulting discrepancies are real variants or technical artifacts that can be ignored
    • Freebayes with a custom set of settings
  3. Functional Effect Prediction: to describe what the predicted changes do in terms of the genes themselves.
    • SnpEff with a custom set of settings

The Galaxy wrapper for Snippy has the ability to change some of the underlying tool settings in the advanced section but it is not recommended.

Go Up


Get the data

The data is a subset of a real dataset from a Staphylococcus aureus bacteria. We have a closed genome sequence and an annotation for our wildtype strain.

A whole-genome shotgun approach was used to produce a series of short sequence reads on an Illumina DNA sequencing instrument for the mutant strain.

  • The reads are paired-end
  • Each read is on average 150 bases
  • The reads would cover the original wildtype genome to a depth of 19x

The files we will be using are:

  • mutant_R1.fastq & mutant_R2.fastq - the read files in fastq format.
  • wildtype.fna - The sequence of the reference strain in fasta format.
  • wildtype.gbk - The reference strain with gene and other annotations in genbank format.
  • wildtype.gff - The reference strain with gene and other annotations in gff3 format.

NOTE! This data is available at Zenodo

NOTE! Please, use

  1. Create a new history
  2. Upload datasets
    https://zenodo.org/record/582600/files/mutant_R1.fastq
    https://zenodo.org/record/582600/files/mutant_R2.fastq
    https://zenodo.org/record/582600/files/wildtype.fna
    https://zenodo.org/record/582600/files/wildtype.gbk
    https://zenodo.org/record/582600/files/wildtype.gff
    

Go Up


Find variants with Snippy

We will now run the Snippy tool on our reads, comparing it to the reference.

  • Snippy is a tool for rapid bacterial SNP calling and core genome alignments.
  • Snippy finds SNPs between a haploid reference genome and your NGS sequence reads. It will find both substitutions (snps) and insertions/deletions (indels).
  • If we give Snippy an annotated reference, it will silently run SnpEff which will figure out the effect of any changes on the genes and other features.
  • If we just give Snippy the reference sequence alone without the annotations, it will not run SnpEff.

We have an annotated reference and so will use it in this case.

  1. Snippy tool with the following parameters:
    • Reference File: the wildtype.gbk file (if the genbank file is not selectable, make sure to change its datatype to ‘genbank’)
    • Single or Paired-end reads: Paired
    • Select first set of reads: mutant_R1.fastq
    • Select second set of reads: mutant_R2.fastq
    • Select all outputs

Go Up


Examine Snippy output

Snippy has taken the reads and:

  1. mapped them against the reference using BWA MEM,
  2. looked through the resulting BAM file and found differences using some fancy Bayesian statistics (Freebayes),
  3. filtered the differences for sensibility and finally checked what effect these differences will have on the predicted genes and other features in the genome.

It produces quite a bit of output, there can be up to 10 output files.

  • snps vcf file: The final annotated variants in VCF format
  • snps gff file: The variants in GFF3 format
  • snps table: A simple tab-separated summary of all the variants
  • snps summary: A summary of the SNPs called
  • log file: A log file with the commands run and their outputs
  • aligned fasta: A version of the reference but with - at position with depth=0 and N for 0 < depth < –mincov (does not have variants)
  • consensus fasta: A version of the reference genome with all variants instantiated
  • mapping depth: A table of the mapping depth
  • mapped reads bam: A BAM file containing all of the mapped reads
  • outdir: A tarball of the Snippy output directory for input into Snippy-core if required

Question

  1. Which types of variants have been found?
  2. What is the third variant called?
  3. What is the product of the mutation?
  4. What might be the result of such a mutation?

Go Up


View Snippy output in JBrowse

We could go through all of the variants like this and read them out of a text table, but this is onerous and doesn’t really give the context of the changes very well.

In order to have a visualisation of the SNPs and the other relevant data, in Galaxy we can use JBrowse

  1. JBrowse Tool with the following parameters
    • Reference genome to display: Use a genome from history
    • Select the reference genome: wildtype.fna
    • Genetic Code: 11: The Bacterial, Archaeal and Plant Plastid Code
    • In Track Group: sequence reads
      • Insert Track Group
      • Track Category: Gene Annotations
        • Track Type: BAM Pileups
        • BAM Track Data: bam file from Snippy
        • Autogenerate SNP Track: Yes
        • Track Visibility: On for new users
    • In Track Group: variants
      • Insert Track Group
      • Track Category: Variants
        • Track Type: GFF/GFF3/BED Features
        • GFF/GFF3/BED Track Data snps gff file from Snippy
        • Track Visibility: On for new users
    • In Track Group:
      • Insert Track Group
      • Track Category:
        • Track Type: GFF/GFF3/BED Features
        • GFF/GFF3/BED Track Data: wildtype.gff
        • JBrowse Track Type [Advanced]: Canvas Features
        • Click on JBrowse Styling Options [Advanced]
          • JBrowse style.label: product
          • JBrowse style.description product
        • Track Visibility: On for new users

Go Up


Inspecting the SNPs using JBrowse

  1. Display all the tracks and practice maneuvering around
    • Click on the tick boxes on the left to display the tracks
    • Zoom out by clicking on the minus button to see sequence reads and their coverage (the grey graph)
    • Zoom in by clicking on the plus button to see
      • probable real variants (a whole column of SNPs)
      • probable errors (single one here and there)

  2. Look at the stop SNP
    • Type 47299 in the coordinates box
    • Click on Go to see the position of the SNP discussed above

Question

  1. What is the correct codon at this position?
  2. What is the mutation found here?

Go Up


Conclusion

By running a tool such as Snippy on your read files and reference genome, we can find where the biologically important changes between genomes of different strains occur and perhaps what they mean to the phenotype

  • We used a tool called Snippy to call variants between our reads and our reference genome.
  • As our reference genome had annotations, we could see what effect the changes have on the features as annotated in the reference and therefore make inferences on the possible changes to the phenotype.
  • We used the JBrowse genome browser to visualise what these changes look like

Go Up


Hands-On

Now, it’s your turn!

Go Up