Pathogen detection with ABRicate

Identification of pathogens from sequencing data and tracking their presence across multiple samples.

Pathogen detection is a core activity in clinical microbiology and public health surveillance, aimed at identifying and characterizing disease-causing bacteria from isolates or clinical samples to support diagnosis, surveillance, and outbreak investigations.

Despite its importance, pathogen detection presents several challenges:

Genetic variability across species, strains, and lineages can complicate accurate identification.
Closely related non-pathogenic strains may be difficult to distinguish from true pathogens.
Result interpretation requires clinical and epidemiological context, as detection alone does not always imply pathogenicity.

After detecting a potential pathogen, results are interpreted by:

comparing genomic features against reference databases,
assessing virulence factors and antimicrobial resistance genes,
integrating genomic findings with clinical and epidemiological data.

This integrated approach helps distinguish true pathogens from background organisms and supports informed public health and clinical decision-making.

Go Up

Preparing data

In this tutorial, we use datasets from a comprehensive study that analyzed the genomic composition of Vibrio cholerae isolates, focusing on virulence genes, pathogenicity islands, and antimicrobial resistance genes.

The dataset includes 10 whole-genome sequences of V. cholerae isolates.
Isolates were collected during three cholera outbreaks that occurred in Uganda between 2014 and 2016.
Sequencing data are publicly available in the NCBI Sequence Read Archive under accession SRP136117.

For the purposes of this tutorial and due to time constraints, the analysis starts directly from assembled contigs, skipping the initial quality control and genome assembly steps.

Create a new history and rename it

Upload datasets

 https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871245.fasta
 https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871246.fasta
 https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871247.fasta
 https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871248.fasta
 https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871249.fasta
 https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871250.fasta
 https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871251.fasta
 https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871252.fasta
 https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871253.fasta
 https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871254.fasta

Create a collection and remane it as contigs

Go Up

AMR Genes detection

In order to search AMR genes among our samples’ contigs, we run ABRicate and choose the NCBI Bacterial Antimicrobial Resistance Gene Database (AMRFinderPlus) from the advanced options of the tool.

ABRicate checks if there is an AMR found or not, if found then in which contig it is, its location on the contig, what the name of the exact product is, what substance it provides resistance against and a lot of other information regarding the found AMR.

ABRicate Tool with the following parameters:
- Input file (Fasta, Genbank or EMBL file): contigs collection
- Advanced options:
- Database to use - default is ‘resfinder’: NCBI Bacterial Antimicrobial Resistance Reference Gene Database

The outputs of ABRicate is a tabular file with different columns:

FILE: The filename this hit came from
SEQUENCE: The sequence in the filename
START: Start coordinate in the sequence
END: End coordinate
STRAND: Strand
GENE: AMR gene
COVERAGE: What proportion of the gene is in our sequence
COVERAGE_MA: A visual representation
GAPS: Was there any gaps in the alignment - possible pseudogene?
%COVERAGE: Proportion of gene covered
%IDENTITY: Proportion of exact nucleotide matches
DATABASE: The database this sequence comes from
ACCESSION: The genomic source of the sequence
PRODUCT: Gene product (if available)
RESISTANCE: Putative antibiotic resistance phenotype, ;-separated

TRY!

Search ACCESSION value on NCBI webserver

Go Up

Combine ABRicate results into a simple matrix of gene presence/absence

ABRicate can combine results into a simple matrix of gene presence/absence.

An absent gene is denoted by .
a present gene is represented by its %COVERAGE. This can be individual abricate reports, or a combined one.

ABRicate Summary Tool with the following parameters:
- Abricate output reports to summarize: Output Report from ABRicate

Go Up

Virulence Factor identification

In this step we return back to the main goal of the tutorial where we want to identify the pathogens: identify if the bacteria found in our samples are pathogenic bacteria or not.

DEFINITIONS

Bacterial Pathogen: A bacterial pathogen is usually defined as any bacterium that has the capacity to cause disease. Its ability to cause disease is called pathogenicity.
Virulence: Virulence provides a quantitative measure of the pathogenicity or the likelihood of causing a disease.
Virulence Factors: Virulence factors refer to the properties (i.e., gene products) that enable a microorganism to establish itself on or within a host of a particular species and enhance its potential to cause disease. Virulence factors include bacterial toxins, cell surface proteins that mediate bacterial attachment, cell surface carbohydrates and proteins that protect a bacterium, and hydrolytic enzymes that may contribute to the pathogenicity of the bacterium.

To identify VFs, we use again ABRicate but this time with the VFDB from the advanced options of the tool.

ABRicate Tool with the following parameters:
- Input file (Fasta, Genbank or EMBL file): contigs collection
- Advanced options:
- Database to use - default is ‘resfinder’: VFBD
Rename output collection as VFs
ABRicate Summary Tool with the following parameters:
- Abricate output reports to summarize: Output Report from ABRicate

Go Up

Pathogen Tracking among all samples

In this last section, we would like to show how to aggregate results and use the results to help tracking pathogenes among samples by:

Drawing a presence-absence heatmap of the identified VF genes within all samples to visualize in which samples these genes can be found.

In this way we can have an overview of all samples and the genes, but also how samples are related to each other i.e. which common pathogenic genes they share.

Go Up

Formatting ABRicate Output

To execute the analysis in this section, it is essential to handle and format the ABRicate output. This processing is aimed at extracting a curated list comprising Accession IDs of identified genes associated with Virulence Factors, along with their corresponding sample IDs.

Cut Tool with the following parameters:
- Cut columns: c13
- From: VFs collection output of ABRicate
Rename the outputs VFs accessions
Cut Tool with the following parameters:
- Cut columns: c1
- From: VFs collection output of ABRicate
Unique occurrences of each record Tool with the following parameters:
- File to scan for unique values: Output collection from previous Cut
Rename the outputs SampleIDs
Concatenate multiple datasets (tail-to-head by specifying how) Tool with the following parameters:
- What type of data do you wish to concatenate?: Collection
- Input first collection: SampleIDs collection
- Input second collection: VFs accessions collection
Rename the outputs VFs accessions with SampleID

Go Up

Heatmap

A heatmap is one of the visualization techniques that can give you a complete overview of all the samples together and whether or not a certain value exists. In this tutorial, we use the heatmap to visualize all samples aside and check which common bacteria pathogen genes are found in samples and which is only found in one of them.

We use Heatmap w ggplot tool along with other tabular manipulating tools to prepare the tabular files.

Combine VFs accessions for samples into a table and get 0 or 1 for absence / presence

Collapse Collection Tool with the following parameters:
- Collection of files to collapse into single dataset: VFs accessions collection
- Prepend File name: No
Add line to file Tool with the following parameters:
- input file: output from Collapse Collection
- text to add: All_VFs
Multi-Join Tool with the following parameters:
- File to join: output from Add line to file
- add additional file: VFs accessions with SampleID
- Common key column: 1
- Column with values to preserve: 1
- Add header line to the output file: Yes
- Input files contain a header line (as first line): Yes
- Ignore duplicated keys: Yes
Replace Tool with the following parameters:
- File to process: output of Multi-Join
- In Find and Replace:
  - Insert Find and Replace
    - Find pattern: dataset_(.*?)_
    - Replace with: Sample_
    - Find-Pattern is a regular expression: Yes
    - Replace all occurences of the pattern: Yes
    - Ignore first line: No
    - Find and Replace text in: entire line
  - Insert Find and Replace
    - Find pattern: (\S+)
    - Replace with: Acc_$1
    - Find-Pattern is a regular expression: Yes
    - Case-Insensitive search: Yes
    - Replace all occurences of the pattern: Yes
    - Ignore first line: Yes
    - Find and Replace text in: entire line
Replace Tool with the following parameters:
- File to process: output of Replace tool
- In Find and Replace:
  - Insert Find and Replace
    - Find pattern: Acc_0
    - Replace with: 0
    - Find-Pattern is a regular expression: No
    - Case-Insensitive search: Yes
    - Replace all occurences of the pattern: Yes
    - Ignore first line: Yes
    - Find and Replace text in: entire line
Replace Tool with the following parameters:
- File to process: output of Replace
- In Find and Replace:
  - Insert Find and Replace
  - Find pattern: Acc_\S*
  - Replace with: 1
  - Find-Pattern is a regular expression: Yes
  - Replace all occurences of the pattern: Yes
  - Ignore first line: Yes
  - Case-Insensitive search: No
  - Find and Replace text in: entire line
Advanced Cut Tool with the following parameters:
- File to cut: output of Multi-Join tool
- Operation: Keep
- Delimited by: Tab
- Cut by: fields
- List of Fields: 1
Advanced Cut Tool with the following parameters:
- File to cut: output of the last Replace tool
- Operation: Discard
- Delimited by: Tab
- Cut by: fields
- List of Fields: 1,2
Paste Tool with the following parameters:
- Paste: output of Advanced Cut (Tool N. 7)
- and: output of Advanced Cut (Tool N. 8)
- Delimit by: Tab

Draw heatmap

Transpose Tool with the following parameters:
- Input tabular dataset: output of Paste
Replace Tool with the following parameters:
- File to process: output of Transpose tool
- In Find and Replace:
  - Insert Find and Replace
    - Find pattern: Acc_Sample_
    - Replace with: ``
    - Find-Pattern is a regular expression: No
    - Case-Insensitive search: Yes
    - Replace all occurences of the pattern: Yes
    - Ignore first line: No
    - Find and Replace text in: entire line
Heatmap w ggplot Tool with the following parameters:
- Select table: output of Replace tool
- Select input dataset options: Dataset with header and row names
  - Select column, for row names: Column: 1
  - Sample names orientation: horizontal
- Plot title: All Samples Common VFs Heatmap
- In Output Options:
  - width of output: 15.0
  - height of output: 17.0

Go Up

Multilocus Sequence Typing

MLST tool is used to scan the pubMLST database against PubMLST typing schemes. It’s one of the analyses that you can perform on your dataset to determine the allele IDs, you can also detect novel alleles.

This step is not essential to identify pathogens and track them in the remainder of this tutorial, however we wanted to show some of the analysis that one can use Galaxy in to understand more about the dataset, as well as identifying the strain that might be a pathogen or not.

MLST Tool with the following parameters:
- input_files: contigs collection
- Specify advanced parameters: Yes, see full parameter list
  - Output novel alleles: Yes
  - Automatically set MLST scheme: Automatic MLST scheme detection

The output file of the MLST tool is a tab-separated output file which contains:

the filename
the closest PubMLST scheme name
the ST (sequence type)
the allele IDs

Go Up

Hands-on

Re-Run with MRSA assembled genome

Galaxy Training for Pathogen Genomics, AMR Detection & Virus ID

Pathogen detection with ABRicate

Preparing data

AMR Genes detection

Combine ABRicate results into a simple matrix of gene presence/absence

Virulence Factor identification

Pathogen Tracking among all samples

Formatting ABRicate Output

Heatmap

Combine VFs accessions for samples into a table and get 0 or 1 for absence / presence

Draw heatmap

Multilocus Sequence Typing

Hands-on