Identification of AMR genes in a bacterial genome

Identification and visualization of antimicrobial resistance genes in a bacterial genome, including their genomic location and association with plasmids.

Galaxy and data preparation

In this tutorial, we will demonstrate how to identify antimicrobial resistance (AMR) genes in a bacterial genome using an assembled genome as input.
We will start from a bacterial genome assembly generated in the previous tutorial.

Before starting the analysis, prepare your Galaxy workspace as follows:

Create a new Galaxy history and give it a meaningful name.
Import the Shovill contigs dataset into the new history by dragging and dropping it from a previous history (see here for instructions on managing and copying datasets between histories).
Import the raw reads filtered by fastp.

Identification of AMR genes

To identify antimicrobial resistance (AMR) genes in assembled contigs, tools such as ABRicate and StarAMR can be used.
In this tutorial, we focus on StarAMR, while ABRicate will be covered in a dedicated follow-up tutorial.

StarAMR scans bacterial genome contigs against the ResFinder, PointFinder, and PlasmidFinder databases and generates a comprehensive summary report of the detected antimicrobial resistance genes.

StarAMR Tool with the following parameters:
- genomes: Contig file

StarAMR produces 8 different output files. Six of them are described below.

summary.tsv

A summary of all detected AMR genes/mutations in each genome, one genome per line:

Isolate ID: The id of the isolate/genome file(s) passed to StarAMR
Quality Module: The isolate/genome file(s) pass/fail result(s) for the quality metrics
Genotype: The AMR genotype of the isolate.
Predicted Phenotype: The predicted AMR phenotype (drug resistances) for the isolate.
CGE Predicted Phenotype: The CGE-predicted AMR phenotype (drug resistances) for the isolate (CGE = Center for Genomic Epidemiology)
Plasmid: Plasmid types that were found for the isolate.
Scheme: The MLST scheme used
MLST stands for Multilocus Sequence Typing. It is a technique that characterizes isolates of microbial species using the DNA sequences of internal fragments of multiple housekeeping genes. Here, StarAMR uses mlst to scan the contig files against traditional PubMLST typing schemes. The correspondence between each scheme and the bacterial genus and species is accessible in the map.
Sequence Type: The sequence type that’s assigned when combining all allele types
Genome Length: The isolate/genome file(s) genome length(s)
N50 value: The isolate/genome file(s) N50 value(s)
Number of Contigs Greater Than Or Equal To 300 bp: The number of contigs greater than or equal to 300 base pairs in the isolate/genome file(s)
Quality Module Feedback: The isolate/genome file(s) detailed feedback for the quality metrics
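
The three assembly metrics in this list (Genome Length, N50 value, and the number of contigs of at least 300 bp) can be illustrated with a short sketch. It uses the usual definition of N50, and StarAMR's own implementation may differ in detail. The function name `assembly_metrics` and the contig lengths are hypothetical, chosen only for illustration.

```python
def assembly_metrics(contig_lengths, min_length=300):
    """Return (genome length, N50, number of contigs >= min_length)."""
    lengths = sorted(contig_lengths, reverse=True)
    genome_length = sum(lengths)

    # N50: walk through the contigs from longest to shortest and stop at the
    # contig where the cumulative length first reaches half of the assembly.
    n50 = 0
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total * 2 >= genome_length:
            n50 = length
            break

    n_long = sum(1 for length in lengths if length >= min_length)
    return genome_length, n50, n_long


# Hypothetical contig lengths in base pairs (not taken from the tutorial data).
example_lengths = [400_000, 300_000, 150_000, 100_000, 49_800, 200]
print(assembly_metrics(example_lengths))  # (1000000, 300000, 5)
```

A larger N50 for the same genome length indicates that the assembly is made of fewer, longer contigs.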

detailed_summary.tsv

A detailed summary of all detected AMR genes/mutations/plasmids/sequence type in each genome, one gene per line:

Isolate ID: The id of the isolate/genome file(s) passed to staramr.
Data: The particular gene detected from ResFinder, PlasmidFinder, PointFinder, or the sequence type.
Data Type: The type of gene (Resistance or Plasmid), or MLST.
Predicted Phenotype: The predicted AMR phenotype (drug resistances) found in ResFinder/PointFinder. Plasmids will be left blank by default.
CGE Predicted Phenotype: The CGE-predicted AMR phenotype (drug resistances) found in ResFinder/PointFinder. Plasmids will be left blank by default.
%Identity: The % identity of the top BLAST HSP to the gene.
%Overlap: The % overlap of the top BLAST HSP to the gene (calculated as hsp length/total length * 100).
HSP Length/Total Length: The top BLAST HSP length over the gene total length (nucleotides).
Contig: The contig id containing this gene.
Start: The start of the gene (will be greater than End if on minus strand).
End: The end of the gene.
Accession: The accession of the gene from either ResFinder or PlasmidFinder database.
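
Two of these columns can be derived from others, which the following sketch makes explicit: %Overlap follows from HSP Length/Total Length, and the strand follows from the order of Start and End. It assumes the HSP Length/Total Length value is written as two integers separated by a slash. The function names and example values are hypothetical.

```python
def percent_overlap(hsp_over_total):
    """Convert an 'HSP Length/Total Length' value such as '1500/2000' to a percentage."""
    hsp_length, total_length = (int(part) for part in hsp_over_total.split("/"))
    return hsp_length / total_length * 100


def strand(start, end):
    """Infer the strand from the coordinates: Start > End means minus strand."""
    return "-" if start > end else "+"


# Hypothetical values, for illustration only.
print(percent_overlap("1500/2000"))  # 75.0
print(strand(5200, 4100))            # -
```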

resfinder.tsv

A tabular file of each AMR gene and additional BLAST information from the ResFinder database, one gene per line:

Isolate ID: The id of the isolate/genome file(s) passed to staramr.
Gene: The particular AMR gene detected.
Predicted Phenotype: The predicted AMR phenotype (drug resistances) for this gene.
CGE Predicted Phenotype: The CGE-predicted AMR phenotype (drug resistances) for this gene.
%Identity: The % identity of the top BLAST HSP to the AMR gene.
%Overlap: The % overlap of the top BLAST HSP to the AMR gene (calculated as hsp length/total length * 100).
HSP Length/Total Length: The top BLAST HSP length over the AMR gene total length (nucleotides).
Contig: The contig id containing this AMR gene.
Start: The start of the AMR gene (will be greater than End if on minus strand).
End: The end of the AMR gene.
Accession: The accession of the AMR gene in the ResFinder database.
Sequence: The AMR Gene sequence
CGE Notes: Any CGE notes associated with the prediction

plasmidfinder.tsv

A tabular file of each AMR plasmid type and additional BLAST information from the PlasmidFinder database, one plasmid type per line:

Isolate ID: The id of the isolate/genome file(s) passed to staramr.
Plasmid: The particular plasmid type detected.
%Identity: The % identity of the top BLAST HSP to the plasmid type.
%Overlap: The % overlap of the top BLAST HSP to the plasmid type (calculated as hsp length/total length * 100).
HSP Length/Total Length: The top BLAST HSP length over the plasmid type total length (nucleotides).
Contig: The contig id containing this plasmid type.
Start: The start of the plasmid type (will be greater than End if on minus strand).
End: The end of the plasmid type.
Accession: The accession of the plasmid type in the PlasmidFinder database.

settings.txt

The command-line, database versions, and other settings used to run staramr.

results.xlsx

An Excel spreadsheet containing the previous files as separate worksheets.

Getting more information about ARGs via the CARD database

To get more information about the antibiotic resistance genes (ARGs), we can check the CARD database (Comprehensive Antibiotic Resistance Database).

CARD is helpful for looking up each detected resistance gene and for checking whether it is plausible to find that gene in a specific bacterium.

Question

To what family does mecA belong?
Do you expect to find this gene in this MRSA strain and why?

Visualization of the ARGs and plasmid genes in their genomic context

We would like to look at the ARGs and plasmid genes in their genomic context. To do that, we will use JBrowse with several types of information:

Assembly as the reference
ARGs location
Contigs annotation (genes, etc)
Coverage of the contigs from the raw reads

Extraction of the ARG and plasmid genes location

The first step is to extract the location of the ARGs and plasmid genes on the contigs. This information is available in the detailed_summary.tsv output of StarAMR. The genes and their locations are on the lines with a decimal value in column 6 (%Identity) or column 7 (%Overlap). To keep only these lines, we select lines containing a decimal value (###.##) followed by a tab character, the column separator in Galaxy tabular files. As a result, any line without an identity or overlap value is filtered out.

Select lines that match an expression with the following parameters:
- Select lines from: detailed_summary.tsv staramr output
- that: Matching
- the pattern: [0-9]+\.[0-9]+\t
Click on Run Tool
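
The effect of this filter can be reproduced outside Galaxy as a rough sketch. It is not the code of the Select tool. The decimal point is written as `\.` so that it matches only a literal dot. The rows, gene names, and accession below are hypothetical and only mimic the column layout of detailed_summary.tsv.

```python
import re

# Decimal value followed by a tab; the dot is escaped to match a literal ".".
PATTERN = re.compile(r"[0-9]+\.[0-9]+\t")


def keep_gene_lines(lines):
    """Keep only the lines that contain a decimal value followed by a tab."""
    return [line for line in lines if PATTERN.search(line)]


# Hypothetical rows: a header, a gene hit, and an MLST row without identity/overlap.
rows = [
    "Isolate ID\tData\tData Type\tPredicted Phenotype\tCGE Predicted Phenotype"
    "\t%Identity\t%Overlap\tHSP Length/Total Length\tContig\tStart\tEnd\tAccession",
    "isolate\tgeneA\tResistance\tdrugX\tdrugX\t100.00\t100.00\t800/800"
    "\tcontig1\t1200\t2000\tACC0001",
    "isolate\tST0\tMLST\t\t\t\t\t\t\t\t\t",
]

for line in keep_gene_lines(rows):
    print(line.split("\t")[1])  # geneA
```

With an unescaped dot, the expression would also match any run of three or more digits followed by a tab, such as a Start coordinate, so escaping it keeps the filter tied to the identity and overlap columns.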

This table cannot be used directly in JBrowse. It first needs to be transformed into a standard format: GFF3, a file format used for describing genes and other features of DNA, RNA and protein sequences (see the previous tutorial for details).

Create a GFF file

Table to GFF3
- Table: output of the above Select lines tool step
- Record ID column or value: 9
- Start column or value: 10
- End column or value: 11
- Type column or value: 3
- Score column or value: 6
- Source column or value: 3
- Insert Qualifiers
  - Name: name
  - Qualifier value column or raw text: 2
- Insert Qualifiers
  - Name: phenotype
  - Qualifier value column or raw text: 4
- Insert Qualifiers
  - Name: accession
  - Qualifier value column or raw text: 12
Click on Run Tool
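
To make the column mapping concrete, here is a minimal sketch of how one filtered row could be turned into a GFF3 line using the same columns as above. It is not the code of the Table to GFF3 tool. Ordering the coordinates and deriving the strand from Start > End are assumptions of this sketch, and the example row is hypothetical.

```python
def escape(value):
    """Percent-encode the characters that are reserved in GFF3 attributes."""
    for char in "%;=&,\t\n\r":  # "%" first, to avoid double-encoding
        value = value.replace(char, f"%{ord(char):02X}")
    return value


def row_to_gff3(row):
    """Convert one detailed_summary.tsv row into a nine-column GFF3 line."""
    fields = row.rstrip("\n").split("\t")
    data, data_type, phenotype = fields[1], fields[2], fields[3]  # columns 2, 3, 4
    identity, contig = fields[5], fields[8]                       # columns 6, 9
    start, end = int(fields[9]), int(fields[10])                  # columns 10, 11
    accession = fields[11]                                        # column 12

    # GFF3 requires start <= end; the orientation goes in the strand column.
    strand = "-" if start > end else "+"
    low, high = sorted((start, end))

    qualifiers = (("name", data), ("phenotype", phenotype), ("accession", accession))
    attributes = ";".join(f"{key}={escape(value)}" for key, value in qualifiers if value)

    # seqid, source, type, start, end, score, strand, phase, attributes
    return "\t".join(
        [contig, data_type, data_type, str(low), str(high), identity, strand, ".", attributes]
    )


# Hypothetical row on the minus strand (Start greater than End).
example_row = (
    "isolate\tgeneA\tResistance\tdrugX, drugY\tdrugX\t99.85\t100.00\t800/800"
    "\tcontig1\t2000\t1201\tACC0001"
)
print(row_to_gff3(example_row))
```

Empty qualifier values, such as the phenotype of a plasmid row, are skipped rather than written as empty attributes.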

Question

How many genes have been found per contig?

Annotation of the contigs

In addition to antimicrobial resistance genes (ARGs) and plasmid-associated genes, it is useful to obtain functional information for all other genes present on the contigs.

For this purpose, we will use the genome annotation results generated in the previous tutorial.
Specifically, drag and drop the GFF file produced by Bakta from the previous history into the current history, as it contains the structural and functional annotation of the contigs.

Mapping the raw reads on the contigs

To estimate the coverage of contigs and annotated genes, the original sequencing reads are mapped back to the assembled contigs using Bowtie2.

Use the paired-end reads filtered with fastp, imported by dragging and dropping them from the assembly tutorial history into the current history.

Bowtie2 with the following parameters:
- Is this single or paired library: Paired-end
  - FASTQ Paired Dataset: Paired Reads
- Will you select a reference genome from your history or use a built-in index?: Use a genome from the history and build index
  - Select reference genome: Contig file
- Save the bowtie2 mapping statistics to the history: Yes
MultiQC with the following parameters:
- Which tool was used generate logs?: Bowtie2
- Output of Bowtie 2: Mapping Stats
- Report Title: Mapping Report
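
The mapping statistics saved by Bowtie2 are plain text, and the figure summarized by MultiQC appears on the final line of that text. As a rough sketch, that value can be extracted as follows. The function name is hypothetical, and the sample text is abbreviated and made up for illustration; it does not come from the tutorial data.

```python
import re


def overall_alignment_rate(stats_text):
    """Return the overall alignment rate (in percent) from Bowtie2 mapping stats."""
    match = re.search(r"([0-9]+(?:\.[0-9]+)?)% overall alignment rate", stats_text)
    return float(match.group(1)) if match else None


# Abbreviated, made-up example of a Bowtie2 mapping statistics file.
example_stats = (
    "1000 reads; of these:\n"
    "  1000 (100.00%) were paired; of these:\n"
    "95.50% overall alignment rate\n"
)
print(overall_alignment_rate(example_stats))  # 95.5
```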

Question

What is the alignment rate?
What does that mean for the assembly?

Visualisation of the ARGs

We can now visualize the contigs, the mapping coverage, and the genes, using JBrowse with several information tracks.

JBrowse with the following parameters:
- Reference genome to display: Use a genome from history
  - Select the reference genome: genome.fa
  - Genetic Code: 11. The Bacterial, Archaeal and Plant Plastid Code
  - In Track Group:
    - Insert Track Group
      - Track Category: Bakta
      - In Annotation Track:
        - Insert Annotation Track
          - Track Type: GFF/GFF3/BED Features
          - GFF/GFF3/BED Track Data: annotation_and_sequences (output of Bakta)
          - JBrowse Track Type [Advanced]: Neat Canvas Features
          - Track Visibility: On for new users
    - Insert Track Group
      - Track Category: ARGs and plasmid genes
      - In Annotation Track:
        - Insert Annotation Track
          - Track Type: GFF/GFF3/BED Features
          - GFF/GFF3/BED Track Data: Output of Table to GFF3
          - JBrowse Track Type [Advanced]: Neat Canvas Features
          - Track Visibility: On for new users
    - Insert Track Group
      - Track Category: Coverage
      - In Annotation Track:
        - Insert Annotation Track
          - Track Type: BAM Pileups
          - BAM Track Data: Bowtie2’s output
          - Autogenerate SNP Track: Yes
View the output of JBrowse

In the JBrowse output, you can view the mapped reads and the detected genes against the reference genome. With the search tools, you can easily find genes of interest.

Using the Bowtie2 mapping output, low coverage regions can be detected. The autogenerated SNP track can also give a clear view of where the data was less reliable or where variations were located.
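
Low coverage regions can also be located programmatically. The sketch below assumes a list of per-base read depths for one contig is already available, for example exported from the BAM file with a tool such as samtools depth. The function name, the threshold, and the depth values are hypothetical.

```python
def low_coverage_regions(depths, threshold):
    """Return 1-based inclusive (start, end) intervals where depth < threshold."""
    regions = []
    region_start = None
    for position, depth in enumerate(depths, start=1):
        if depth < threshold:
            if region_start is None:
                region_start = position  # a low coverage run begins here
        elif region_start is not None:
            regions.append((region_start, position - 1))  # the run ended just before
            region_start = None
    if region_start is not None:
        regions.append((region_start, len(depths)))  # run extends to the contig end
    return regions


# Hypothetical per-base depths along a very short contig.
depths = [30, 28, 4, 3, 5, 27, 31, 2, 1]
print(low_coverage_regions(depths, threshold=10))  # [(3, 5), (8, 9)]
```

Genes that overlap such intervals deserve a closer look in JBrowse before their presence is trusted.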

Question

What is the name for the gene found by Bakta corresponding to the rep16 gene found by StarAMR?
What are the NCBI Protein id and UniRef id for aac(6’)-aph(2’’)?

Conclusion

In this tutorial, contigs were scanned for AMR and plasmid genes. The genes were then visualized in their genomic context after contig annotation.

Galaxy Training for Pathogen Genomics, AMR Detection & Virus ID
