Pathogen detection

How to identify pathogens using sequencing data? How to track the found pathogens through all your samples datasets?

Metagenomic analysis plays a pivotal role in unraveling the complex microbial communities present in various biological samples. One crucial aspect of metagenomics is the detection of pathogens within these communities, which holds significant implications for public health, clinical diagnostics, epidemiology, and environmental monitoring.

Detecting pathogens within metagenomic datasets is of paramount importance for several reasons:

Early Disease Prevention: Timely identification of disease-causing microorganisms can help prevent outbreaks and mitigate their impact on public health.
Public Health Surveillance: Understanding the presence and prevalence of pathogens in a given population aids in public health monitoring and response.
Clinical Diagnostics: In clinical settings, identifying pathogens in patient samples is essential for accurate diagnosis and treatment.

Despite its significance, pathogen detection in metagenomics presents several challenges:

Complex Sample Mixtures: Biological samples often contain a multitude of microorganisms, making it challenging to isolate pathogen DNA.
Genetic Variability: Pathogens can exhibit significant genetic diversity, including strain variations and sequence variations.
Low Abundance: In some samples, pathogenic organisms may be present in low abundance, making their detection more difficult.

After detecting potential pathogens, it’s crucial to interpret the results accurately. This involves distinguishing between commensal and pathogenic strains, often achieved through comparative genomics and reference databases.

The applications of pathogen detection in metagenomic analysis are diverse and include:

Clinical Diagnosis: Identifying disease-causing pathogens in patient samples.
Epidemiology: Tracking outbreaks and understanding disease transmission.
Environmental Monitoring: Detecting pathogens in environmental samples.
Food Safety: Ensuring the safety of food products by identifying contaminating pathogens.

Preparing data

In this tutorial, we will be using datasets sourced from a comprehensive study that employed bioinformatics approaches to analyze whole-genome sequence data obtained from Vibrio cholerae isolates.

This study aimed to provide a complete array of virulence genes, pathogenicity islands, and antimicrobial resistance genes within these isolates.

The research holds particular significance as it sheds light on the genomic makeup of V. cholerae strains associated with cholera outbreaks in Uganda.

The study’s dataset comprises a total of 10 whole-genome sequences of V. cholerae isolates.
These isolates were collected during three separate cholera outbreaks that occurred in Uganda between 2014 and 2016.
The dataset can be readily accessed through the NCBI’s Sequence Read Archive under the accession number SRP136117.
For the purpose of our tutorial and due to time constraints, we will begin our analysis directly from contigs, bypassing the initial steps of quality control and assembly.

:NOTE! Please, use https://usegalaxy.eu/

Create a new history

Upload datasets

https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871245.fasta
https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871246.fasta
https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871247.fasta
https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871248.fasta
https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871249.fasta
https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871250.fasta
https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871251.fasta
https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871252.fasta
https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871253.fasta
https://github.com/vappiah/bioinfodata/raw/main/genome-assemblies/vibrio-cholerae/raw/batch1/SRR6871254.fasta

Create a collection and remane it as contigs

AMR Genes detection

We will explore two distinct approaches for AMR gene detection, each based on specialized tools:

ABRicate: Given a FASTA contig file or a genbank file, ABRicate will perform a mass screening of contigs and identify the presence of antibiotic resistance genes and virulence genes. The user can choose which database to search from a list of available AMR databases: NCBI, CARD, ARG-ANNOT, Resfinder, MEGARES, EcOH, PlasmidFinder, Ecoli_VF and VFDB
StarAMR: scans bacterial genome contigs against both the ResFinder, PlasmidFinder and PointFinder databases (used by their respective webservices) and compiles a summary report of detected antimicrobial resistance genes.

AMR Genes Detection with ABRicate

In order to search AMR genes among our samples’ contigs, we run ABRicate and choose the NCBI Bacterial Antimicrobial Resistance Gene Database (AMRFinderPlus) from the advanced options of the tool.

ABRicate checks if there is an AMR found or not, if found then in which contig it is, its location on the contig, what the name of the exact product is, what substance it provides resistance against and a lot of other information regarding the found AMR.

ABRicate Tool with the following parameters:
- Input file (Fasta, Genbank or EMBL file): contigs collection
- Advanced options:
- Database to use - default is ‘resfinder’: NCBI Bacterial Antimicrobial Resistance Reference Gene Database

The outputs of ABRicate is a tabular file with different columns:

FILE: The filename this hit came from
SEQUENCE: The sequence in the filename
START: Start coordinate in the sequence
END: End coordinate
STRAND: AMR gene
GENE: AMR gene
COVERAGE: What proportion of the gene is in our sequence
COVERAGE_MA: A visual represenation
GAPS: Was there any gaps in the alignment - possible pseudogene?
%COVERAGE: Proportion of gene covered
%IDENTITY: Proportion of exact nucleotide matches
DATABASE: The database this sequence comes from
ACCESSION: The genomic source of the sequence
PRODUCT: Gene product (if available)
RESISTANCE: Putative antibiotic resistance phenotype, ;-separated

TRY! Search ACCESSION value on NCBI webserver

Combine ABRicate results into a simple matrix of gene presence/absence

ABRicate can combine results into a simple matrix of gene presence/absence.

An absent gene is denoted by .
a present gene is represented by its %COVERAGE. This can be individual abricate reports, or a combined one.

ABRicate Summary Tool with the following parameters:
- Abricate output reports to summarize: Output Report from ABRicate

AMR Genes Detection with StarAMR

StarAMR scans bacterial genome contigs against both the ResFinder, PointFinder (Zankari et al. 2017), and PlasmidFinder (used by the ResFinder webservice) and compiles a summary report of detected antimicrobial resistance genes.

StarAMR Tool with the following parameters:
- genomes: contigs collection

There are 5 different output files produced by StarAMR

summary.tsv: A summary of all detected AMR genes/mutations in each genome, one genome per line.
resfinder.tsv: A tabular file of each AMR gene and additional BLAST information from the ResFinder database, one gene per line.
pointfinder.tsv: A tabular file of each AMR point mutation and additional BLAST information from the PointFinder database, one gene per line.
settings.txt: The command-line, database versions, and other settings used to run staramr.
results.xlsx: An Excel spreadsheet containing the previous 4 files as separate worksheets.

The summary file is most important and provides all the resistance genes found.

CARD database

To get more information about these antibiotic resistant genes, you can check the CARD database (Comprehensive Antibiotic Resistance Database)

Virulence Factor identification

In this step we return back to the main goal of the tutorial where we want to identify the pathogens: identify if the bacteria found in our samples are pathogenic bacteria or not.

DEFINITIONS

Bacterial Pathogen: A bacterial pathogen is usually defined as any bacterium that has the capacity to cause disease. Its ability to cause disease is called pathogenicity.
Virulence: Virulence provides a quantitative measure of the pathogenicity or the likelihood of causing a disease.
Virulence Factors: Virulence factors refer to the properties (i.e., gene products) that enable a microorganism to establish itself on or within a host of a particular species and enhance its potential to cause disease. Virulence factors include bacterial toxins, cell surface proteins that mediate bacterial attachment, cell surface carbohydrates and proteins that protect a bacterium, and hydrolytic enzymes that may contribute to the pathogenicity of the bacterium.

To identify VFs, we use again ABRicate but this time with the VFDB from the advanced options of the tool.

ABRicate Tool with the following parameters:
- Input file (Fasta, Genbank or EMBL file): contigs collection
- Advanced options:
- Database to use - default is ‘resfinder’: VFBD
Rename output collection as VFs
ABRicate Summary Tool with the following parameters:
- Abricate output reports to summarize: Output Report from ABRicate

Pathogen Tracking among all samples

In this last section, we would like to show how to aggregate results and use the results to help tracking pathogenes among samples by:

Drawing a presence-absence heatmap of the identified VF genes within all samples to visualize in which samples these genes can be found.
Drawing a phylogenetic tree for each pathogenic gene detected, where we will relate the contigs of the samples together where this gene is found.

With these two types of visualizations we can have an overview of all samples and the genes, but also how samples are related to each other i.e. which common pathogenic genes they share.

Given the time of the sampling and the location one can easily identify using these graphs, where and when the contamination has occurred among the different samples.

Formatting ABRicate Output

To execute the analysis in this section, it is essential to handle and format the ABRicate output. This processing is aimed at extracting a curated list comprising Accession IDs of identified genes associated with Virulence Factors, along with their corresponding sample IDs.

Cut Tool with the following parameters:
- Cut columns: c13
- From: *VFs collection output of ABRicate
Rename the outputs VFs accessions
Cut Tool with the following parameters:
- Cut columns: c1
- From: *VFs collection output of ABRicate
Unique occurrences of each record Tool with the following parameters:
- File to scan for unique values: Output collection from previous Cut
Rename the outputs SampleIDs
Concatenate multiple datasets (tail-to-head by specifying how Tool with the following parameters:
- What type of data do you wish to concatenate?: Collection
- Input first collection: SampleIDs collection
- Input second collection: VFs accessions collection
Rename the outputs VFs accessions with SampleID

Heatmap

A heatmap is one of the visualization techniques that can give you a complete overview of all the samples together and whether or not a certain value exists. In this tutorial, we use the heatmap to visualize all samples aside and check which common bacteria pathogen genes are found in samples and which is only found in one of them.

We use Heatmap w ggplot tool along with other tabular manipulating tools to prepare the tabular files.

1 Combine VFs accessions for samples into a table and get 0 or 1 for absence / presence

Collapse Collection Tool with the following parameters:
- Collection of files to collapse into single dataset: VFs accessions collection
- Prepend File name: No
Add line to file Tool with the following parameters:
- input file: output from Collapse Collection
- text to add: All_VFs
Multi-Join Tool with the following parameters:
- File to join: output from Add line to file
- add additional file: VFs accessions with SampleID
- Common key column: 1
- *Column with values to preserve: 1
- Add header line to the output file: Yes
- Input files contain a header line (as first line): Yes
- Ignore duplicated keys: Yes
Replace Tool with the following parameters:
- File to process: output of Multi-Join
- In Find and Replace:
  - Insert Find and Replace
    - Find pattern: dataset_(.*?)_
    - *Replace with”: Sample_
    - Find-Pattern is a regular expression: Yes
    - Replace all occurences of the pattern: Yes
    - Ignore first line: No
    - Find and Replace text in: entire line
  - Insert Find and Replace
    - Find pattern: (\S+)
    - Replace with: Acc_$1
    - Find-Pattern is a regular expression: Yes
    - Case-Insensitive search: Yes
    - Replace all occurences of the pattern: Yes
    - Ignore first line: Yes
    - Find and Replace text in: entire line
Replace Tool with the following parameters:
- File to process: output of Replace tool
- In Find and Replace:
  - Insert Find and Replace
    - Find pattern: Acc_0
    - Replace with: 0
    - Find-Pattern is a regular expression: No
    - Case-Insensitive search*: Yes
    - Replace all occurences of the pattern: Yes
    - Ignore first line: Yes
    - Find and Replace text in: entire line
Replace Tool with the following parameters:
- File to process: output of Replace
- In Find and Replace:
  - Insert Find and Replace
  - Find pattern: Acc_\S*
  - Replace with: 1
  - Find-Pattern is a regular expression: Yes
  - Replace all occurences of the pattern: Yes
  - Ignore first line: Yes
  - Case-Insensitive search: No
  - Find and Replace text in: entire line
Advanced Cut Tool with the following parameters:
- File to cut: output of Multi-Join tool
- Operation*:Keep
- Delimited by:Tab
- Cut by:fields
- List of Fields:1
Advanced Cut Tool with the following parameters:
- File to cut: output of the last Replace tool
- Operation:Discard
- Delimited by:Tab
- Cut by:fields
- List of Fields:1,2
Paste Tool with the following parameters:
- Paste: output of Advanced Cut (Tool N. 7)
- and: output of Advanced Cut (Tool N. 8)
- Delimit by:Tab

2 Draw heatmap

Transpose Tool with the following parameters:
- “Input tabular dataset”: output of Paste
Heatmap w ggplot Tool with the following parameters:
- Select table: output of Transpose tool
- Select input dataset options: Dataset with header and row names
  - Select column, for row names: Column: 1
  - Sample names orientation: horizontal
- Plot title: All Samples Common VFs Heatmap
- In Output Options:
  - width of output: 15.0
  - height of output: 17.0

Phylogenetic Tree building

Phylogenetic trees can be used to track the evolution of the pathogen between the samples. Therefore, the VFs are used as a marker gene for the pathogen, similar to 16S marker genes for species profiling.

We use the VFs since we know they are associated to the pathogenicity of the sample.

By observing the created trees one can identify groups of related pathogens. If additional meta data of the samples would be available one could further identify groups that are associated to specific traits such as increase pathogenicity or faster transmission.

Consequently, the tree could be used for phylogenetic placement of unknwon samples.

For the phylogenetic trees, for each bacteria pathogen gene found in the samples we use ClustalW for Multiple Sequence Alignment (MSA) needed before constructing a phylogenetic tree, for the tree itself we use FASTTREE and Newick Display to visualize it.

1 Extract the sequences of the VFs

To get the sequence to align, we need to extract the sequences of the VFs in the contigs:

Collapse Collection Tool with the following parameters:
- Collection of files to collapse into single dataset: Contigs
- *Prepend File name: Yes
Collapse Collection Tool with the following parameters:
- Collection of files to collapse into single dataset: VFs
- Keep one header line: Yes
- *Prepend File name: Yes
Split by group Tool with the following parameters:
- File to split: output of 2nd Collapse Collection tool
- on column: Column: 13
- Include header in splits?: Yes
Remove beginning Tool with the following parameters:
- from: output of Split by group
Filter sequences by ID Tool with the following parameters:
- Sequence file to be filtered: output of 1st Collapse Collection tool
- Filter using the ID list from: tabular file
  - Tabular file containing sequence identifiers: output of Split by group tool
  - Column(s) containing sequence identifiers: Column: 2
- Output positive matches, negative matches, or both?: Just positive matches (ID on list), as a single file

2 Phylogenetic Tree building

We can now run multiple sequence alignment, build the trees for each VF and display them.

ClustalW Tool with the following parameters:
- FASTA file: output of Filter sequences by ID tool
- Data type: DNA nucleotide sequences
- Output alignment format: FASTA format
- Output complete alignment (or specify part to output): Complete alignment
FASTTREE Tool with the following parameters:
- Aligned sequences file (FASTA or Phylip format): fasta
  - FASTA file: output of ClustalW tool
  - Set starting tree(s): No starting trees
- Protein or nucleotide alignment: Nucleotide
Newick Display Tool with the following parameters:
- *Newick file: output of FASTTREE
- Branch support*: Hide branch support
- Branch length*: Hide branch length
- Image width: 2000

NB It takes a lot of time!!!

Multilocus Sequence Typing

MLST tool is used to scan the pubMLST database against PubMLST typing schemes. It’s one of the analyses that you can perform on your dataset to determine the allele IDs, you can also detect novel alleles.

This step is not essential to identify pathogens and track them in the remainder of this tutorial, however we wanted to show some of the analysis that one can use Galaxy in to understand more about the dataset, as well as identifying the strain that might be a pathogen or not.

MLST Tool with the following parameters:
- input_files: contigs collection
- Specify advanced parameters: Yes, see full parameter list
  - Output novel alleles: Yes
  - Automatically set MLST scheme: Automatic MLST scheme detection

The output file of the MLST tool is a tab-separated output file which contains:

the filename
the closest PubMLST scheme name
the ST (sequence type)
the allele IDs

Host read filtering

In this tutorial, we worked with isolated genomes.

To identify and track foodborne pathogens using metagenomic sequencing, different samples of potentially contaminated food (at different time points or different locations) are prepared, DNA is extracted and sequenced. The generated sequencing data then need to be processed using bioinformatics tools.

Generally, we are not interested in the food (host) sequences, rather only those originating from the pathogen itself.

It is an important to get rid of all host sequences and to only retain sequences that might include a pathogen, both in order to speed up further steps and to avoid host sequences compromising the analysis.

In this part of tutorial we will use samples from chicken meat spiked with Salmonella

Assign reads to taxa using Kraken2 and Kalamari, a database of completed assemblies for metagenomics-related tasks used widely in contamination and host filtering
Filter host assigned reads based on Kraken2 assignments
- Manipulate Kraken2 classification to extract the sequence ids of all hosts sequences identified with Kraken2
- Filter the FASTQ files to get 1 ouput with the host-assigned sequences and 1 output without the host-assigned reads

Preparing Data

NOTE! Please, use

Create a new history

Import the data

https://zenodo.org/record/7593928/files/Barcode10_Spike2.fastq.gz
https://zenodo.org/record/7593928/files/Barcode11_Spike2b.fastq.gz

Rename datasets to Barcode10 and Barcode11 respectively
Create a collection named Reads from the two imported datasets

Read taxonomic classification for host filtering

Kraken2 Tool with the following parameters:
- Single or paired reads: Single
  - Input sequences: Reads
- Print scientific names instead of just taxids: Yes
- In Create Report:
  - Print a report with aggregrate counts/clade to file: Yes
  - Format report output like Kraken 1’s kraken-mpa-report: No
  - Report counts for ALL taxa, even if counts are zero: Yes
  - Report minimizer data: Yes
- Select a Kraken2 database: kalamari
Download and Inspect the report of Kraken2 collection

Question

What is the species of the host?
How many sequences of this host was found?

Visualisation using Krona

Krakentools: Convert kraken report file Tool with the following parameters:
- Kraken report file: Report collection of Kraken2
Krona pie chart Tool with the following parameters:
- Type of input data: Tabular
- Input file: output of Krakentools

Filtering

Krakentools: Extract Kraken Reads By ID Tool with the following parameters:
- Single or paired reads?: Single
- Results: Kraken2 with Kalamari database Results outputs of Kraken2 tool
- Report: Kraken2 with Kalamri database Report outputs of Kraken2 tool
- Taxonomix ID(s) to match:9031 9606 9913
  
  We specify here the taxonomic ID of the hosts so we can filter reads assigned to these hosts. Kraken2 uses taxonomic IDs from NCBI, the IDs for a specific taxa can be found at ncbi. To be generic, we remove here (if the contaminated food comes from and may include other animals, you can change the values):
  - Human (9606)
  - Chicken (9031)
  - Beef (9913)
  - Pig (9823)
- Invert output: Yes
- Output as FASTQ: Yes
- Include parents: Yes
- Include children: Yes

Hands-On

Now, it’s your turn!