Hands On

From raw reads to insights: A hands-on guide to bacterial genomics

You have mastered the theory and practiced with curated demo datasets.
Today, we step out of the tutorial zone.

This session is about real data.

Real-world sequencing is messy, unpredictable, and rarely perfect.
You are no longer here to just follow commands —
you are here to practice the decision-making process required to transform raw, noisy reads into a scientifically robust genome assembly.


Biological questions

In this hands-on exercise we will analyze a small collection of Serratia marcescens isolates.

The main questions we aim to answer are:

  1. Are all sequencing datasets of sufficient quality?
  2. Do all samples belong to the same bacterial species?
  3. Are some datasets contaminated?
  4. How similar are the isolates at the genomic level?
  5. Do they share the same antimicrobial resistance genes?
  6. What are the evolutionary relationships between the isolates?

Go Up


Analysis Workflow

Raw reads
│
▼
Quality Control (Falco → MultiQC → fastp)
│
▼
Contamination detection (Kraken2 → Bracken → Krona)
│
▼
Genome assembly (Shovill)
│
▼
Assembly assessment (QUAST → MultiQC)
│
▼
Genome annotation (Bakta)
│
▼
Functional profiling (StarAMR, ABRicate)
│
▼
Comparative genomics (Roary pangenome)
│
├── Accessory genome tree
├── Core genome phylogeny
└── SNP phylogeny (Snippy)
│
▼
Species confirmation (FastANI)
|
▼
Population structure (PopPUNK)

Go Up


Galaxy and Data Preparation

To eliminate local installation bottlenecks, we will use the public Galaxy instances:

We will work with a dataset pre-loaded into a shared Galaxy history.

  1. Log in to your preferred Galaxy instance.
  2. In the right-hand panel, click the Histories icon.
  3. Select “Histories Shared with me”.
  4. Locate the shared history named “RAW READS”.
  5. Click Import.

  6. Create a new history (click the + icon).
  7. Rename it (e.g., Serratia Analysis).
  8. Copy or move the Serratia marcescens paired-end collection into your new history.
  9. Download Reference Genome from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_030291735.1/ and rename ir as Reference

At this stage, you are ready to begin the analysis pipeline.

Go Up


Quality Check (QC)

Before trimming or assembling anything, always inspect your data.

See the dedicated tutorial section: Quality Control

  1. Falco Tool with the following parameters
    • Raw read data from your current history: Raw Reads
  2. MultiQC on Falco RawData

  3. fastp with the following parameters:
    • Single-end or paired reads: Paired Collection
    • Select paired collection(s): Raw Reads
    • In Filter Options:
      • In Quality filtering Options:
        • Qualified quality phred: 20
      • In Length filtering Options:
        • Length required: 30
    • In Read Modification Options
      • In Per read cuitting by quality options:
        • Cut by quality in front (5’): Yes
        • Cut by quality in tail (3’): Yes
        • Cutting window size: 4
        • Cutting mean quality: 20
    • In Output Options:
      • Output JSON report: Yes
  4. Rename fastp output as Filtered Reads

  5. MultiQC on the JSON output generated by fastp

Go Up


QC Checklist

Before moving to next step, confirm:

  • ☐ Per-base quality ≥ Q30 after trimming
  • ☐ Adapters removed
  • ☐ Read loss acceptable (<40%)
  • ☐ Median read length >100 bp
  • ☐ GC profile coherent and unimodal
  • ☐ No abnormal duplication patterns

If all criteria are met → proceed to next step.
Otherwise → adjust trimming parameters and repeat QC.

Go Up


Contamination Check

When sequencing a genome, contamination from foreign organisms may be inadvertently mixed with the genomic material of the species of interest.

For this reason, when working with bacterial isolates, it is crucial to:

  • Verify that the expected species or strain is present
  • Detect potential contaminant organisms
  • Quantify the proportion of non-target reads

Confirming species identity and detecting contamination are essential to ensure the accuracy and reliability of the analysis, and to prevent misleading results that could compromise downstream interpretation (assembly quality, annotation, AMR detection, phylogenetic analysis).

For detailed instructions, interpretation guidelines, and tool configuration, please refer to the dedicated tutorial section: Checking expected species and contamination in bacterial isolate

  1. Kraken2 Tool with the following parameters:
    • Single or paired reads: Paired Collection
      • Collection of paired reads: Input paired collection
    • Confidence: 0.1
    • Minimum Base Quality: 10
    • In Create Report:
      • Print a report with aggregrate counts/clade to file: Yes
    • Select a Kraken2 database: Standard-Full (2022)
  2. Bracken Tool with the following parameters:
    • Kraken report file: Report output of Kraken
    • Select a kmer distribution: Standard-Full (same as for Kraken)
    • Level: Species
    • Produce Kraken-Style Bracken report: yes

  3. Krakentools: Convert kraken report file Tool with the following parameters:
    • Kraken report file: Kraken style report collection of Bracken
  4. Krona pie chart Tool with the following parameters:
    • Type of input data: Tabular
    • Input file: output of Krakentools

Go Up


Contamination Checklist

Before proceeding to assembly, confirm:

  • ☐ The dominant species matches the expected organism
  • ☐ Relative abundance of the target species is high (>90–95%)
  • ☐ No unexpected high-abundance taxa are present
  • ☐ Unclassified reads are within reasonable limits

If criteria are met → proceed to assembly.
Otherwise → investigate contamination before continuing or remove dirty dataset

Go Up


Interpretation and Action

From the taxonomic profiling results, samples 044 and 046 show clear evidence of contamination and do not meet the criteria defined in the contamination checklist.

These two datasets must therefore be excluded from downstream analysis.

Remove 044 and 046 from the collection.

We will proceed with de novo assembly using the remaining four clean datasets.

Go Up


Optional Exercise

For educational purposes, we will intentionally create three scenarios:

  • Group A → removes contaminated datasets (044, 046)
  • Group B → keeps the contaminated datasets
  • Group C → Remove only 044

The objective is to evaluate how contamination propagates through the pipeline and impacts assembly quality and downstream interpretation.

Go Up


Assembly

After quality control and contamination filtering, we are ready to reconstruct the genome.

De novo assembly aims to rebuild the bacterial genome from short sequencing reads without relying on a reference sequence.

For a detailed explanation please refer to the dedicated tutorial section: Genome Assembly of a bacterial genome

  1. Shovill with the following parameters:
    • Input reads type, collection or single library: Paired Collection
    • Paired collection: Group A/B/C Paired-end output
    • In Advanced options:
      • Estimated genome size: 5160776 (from NCBI)
      • Extra SPAdes options: --isolate
  2. Rename output as CONTIGS

  3. Click on the title of your file to see the row of small icons for saving, linking etc:

  4. Click on the visualise icon and then select the Multiple Sequence Alignment tool. You should see something like this:

  5. Quast Tool with parameters
    • Assembly mode?: Individual Assembly
      • Use customized names for the input files?: No, use dataset names
        • Contigs/scaffolds file: contigs output Shovill
      • Reads options: Illumina paired-end reads in paired collection
        • FASTQ/FASTA files: Group A/B/C Paired-end output
    • Output files: HTML report, PDF report, Tabular reports
  6. MultiQC tool with following parameters
    • In “Results”:
      • Insert Results
        • Which tool was used generate logs?: QUAST
        • In “QUAST Output:
          • QUAST Output: QUAST Report on Shovill
    • Report title: Assembly Report

Go Up


Reference-based Assembly Evaluation with QUAST

When a suitable reference genome is available, QUAST can also be run in reference-based mode.

In this case, the assembled contigs are aligned to the reference genome, allowing additional metrics to be calculated.

This analysis provides information such as:

  • genome coverage
  • misassemblies
  • structural differences
  • alignment statistics

Go Up


Circos Plot

QUAST can also generate a Circos plot showing how the assembled contigs align to the reference genome.

In this plot:

  • the outer circle represents the reference genome
  • the colored segments represent assembled contigs
  • the connections show how contigs map along the reference

This visualization helps identify:

  • gaps in the assembly
  • large rearrangements
  • potential misassemblies

Go Up


Assembly Checklist

After running Shovill, evaluate your assembly using the following key metrics (e.g. from QUAST OT MultiQC output).

  1. Continuity
    • ☐ N50 > 100 kb (excellent if >200 kb)
    • ☐ No extreme fragmentation (N50 < 20 kb is a red flag)
  2. Fragmentation
    • ☐ Number of contigs < 150
    • ☐ No thousands of short contigs
  3. Completeness (Genome Size)
    • Expected genome size for Serratia marcescens based on NCBI: ~5.0–5.5 Mb
    • ☐ Total assembly length within ±10% of expected size
    • ☐ No abnormal inflation of genome size
  4. GC Content
    • Expected GC content for Serratia marcescens based on NCBI: ~59–60%
    • ☐ GC% consistent with species expectation
    • ☐ No strong GC deviation

If all criteria are satisfied → assembly is technically sound.

If one or more major criteria fail →
reconsider contamination, read quality, or sequencing depth.

Go Up


Functional Annotation

After verifying that the assembly meets quality standards, we proceed to genome annotation.

Genome annotation aims to:

  • Identify coding sequences (CDS)
  • Predict rRNA and tRNA genes
  • Detect functional elements
  • Assign putative biological functions

For detailed explanation, please refer to the dedicated tutorial: Bacterial Genome Annotation

  1. Bakta with the following parameters:
    • In Input/Output options:
      • The bakta database: latest one
      • The amrfinderplus database: latest one
      • Select genome in fasta format: CONTIGS collection
    • In Optional annotation:
      • Keep original contig header (–keep-contig-headers): Yes
    • In Selection of the output files:
      • Output files selection:
        • Annotation file in TSV
        • Annotation and sequence in GFF3
        • Feature nucleotide sequences as FASTA
        • Summary as TXT
        • Plot of the annotation result as SVG
  2. MultiQC tool with following parameters
    • In “Results”:
      • Insert Results
        • Which tool was used generate logs?: Bakta
        • In Bakta Output:
          • Bakta Output: Bakta Summary
    • Report title: Annotation Report

Go Up


Annotation Checklist

After completion, verify:

  • ☐ Total number of CDS is biologically plausible (~4,500–5,500 for Serratia marcescens)
  • ☐ rRNA operons detected
  • ☐ tRNA genes present
  • ☐ No abnormally low gene count
  • ☐ No excessive hypothetical proteins (>50–60% may indicate fragmented assembly)

Large discrepancies in gene count or missing core features (e.g., rRNA genes) may indicate:

  • fragmented assembly
  • contamination
  • incomplete genome reconstruction

If annotation metrics are coherent → proceed to downstream analysis.

Go Up


Further Structural Annotation

Beyond core genome annotation, additional mobile genetic elements can be explored. These analyses are interpretative and must be evaluated critically.

  1. Plasmid detection is exploratory.

    • The presence of plasmids is possible but not mandatory.
    • Absence of plasmid hits does not invalidate the assembly.
    • Strong plasmid signals must be interpreted in the context of contamination and assembly quality.
    • Fragmented plasmid hits may reflect assembly breaks rather than true absence.
  2. Integron detection is exploratory.

    • Integrons are common in clinical Gram-negative isolates such as Serratia marcescens.
    • Their presence may correlate with AMR gene cassettes.
    • Absence does not invalidate the genome.
    • Fragmented detection may reflect assembly limitations or repetitive regions.
  3. Insertion Sequences (IS) detection is exploratory.

    • IS elements are common in bacterial genomes.
    • Multiple transposase annotations are expected.
    • Extremely high counts may indicate assembly fragmentation or contamination.
    • Apparent absence may reflect annotation sensitivity or database limitations.

In our dataset:

  • No plasmids were detected.
  • No integrons were identified.
  • Multiple Insertion Sequences (IS) were annotated.

Interpretation:

  • The absence of plasmids and integrons is biologically plausible and does not compromise the assembly.
  • The presence of IS elements is expected in bacterial genomes and consistent with normal genomic architecture.
  • No abnormal inflation of mobile genetic elements was observed.

Go Up


Visualisation of the annotation

We can now launch JBrowse with different information track.

  1. JBrowse with the following parameters:
    • Reference genome to display: Use a genome from history
      • Select the reference genome: contigs
      • Genetic Code: 11. The Bacterial, Archaeal and Plant Plastid Code
      • In Track Group:
        • Insert Track Group
          • Track Category: Bakta
          • In Annotation Track:
            • Insert Annotation Track
              • Track Type: GFF/GFF3/BED Features
              • GFF/GFF3/BED Track Data: annotation_and_sequences (output of Bakta)
              • JBrowse Track Type [Advanced]: Neat Canvas Features
              • Track Visibility: On for new users
        • Insert Track Group
          • Track Category: Plasmid sequences
          • In Annotation Track:
            • Insert Annotation Track
            • Track Type: GFF/GFF3/BED Features
            • GFF/GFF3/BED Track Data: PlasmidFinder GFF
            • JBrowse Track Type [Advanced]: Neat Canvas Features
            • Track Visibility: On for new users
        • Insert Track Group
          • Track Category: IS elements
          • In Annotation Track:
            • Insert Annotation Track
              • Track Type: GFF/GFF3/BED Features
              • GFF/GFF3/BED Track Data: GFF output of ISEScan
              • JBrowse Track Type [Advanced]: Neat Canvas Features
              • Track Visibility: On for new users

        If integrons are found as IntegronFinder

        • Insert Track Group
          • Track Category: Integrons
          • In Annotation Track:
            • Insert Annotation Track
              • Track Type: GFF/GFF3/BED Features
              • GFF/GFF3/BED Track Data: IntegronFinder GFF
              • JBrowse Track Type [Advanced]: Neat Canvas Features
              • Track Visibility: On for new users

Go Up


AMR Detection & Virulence Factor Identification

After genome annotation, additional analyses can be performed to explore clinically relevant genomic features.

In this section we will investigate:

  • Antimicrobial Resistance (AMR) genes
  • Virulence factors

AMR detection allows us to identify genes that may confer resistance to antibiotics, while virulence factor analysis helps reveal genetic determinants involved in pathogenicity, such as toxins, adhesion systems, secretion systems, and iron acquisition mechanisms.

These analyses provide insight into the potential clinical relevance and pathogenic profile of the organism.

For detailed explanations of the methodologies, databases, and interpretation of results, please refer to the dedicated tutorials:

Go Up


AMR Detection with StarAMR

  1. StarAMR Tool with the following parameters:

    • Genomes: CONTIGS collection
    • StarAMR database: Latest release
    • Enable scanning for point mutations using the PointFinder database:
      • Disable PointFinder (enable it only if the target bacterial species is supported by the database)
    • Selection of the output files:
      • Add Excel spreadsheet
  2. Run the tool and inspect the resulting reports.

The Excel output provides a convenient summary of the StarAMR results.
Download the file and inspect it.

Note: The MLST analysis did not return any results. This is expected because Serratia marcescens is not present in the MLST database used by StarAMR. This limitation will be discussed later in the tutorial.

Go Up


AMR Detection with ABRicate

  1. ABRicate Tool with the following parameters:
    • Input file (Fasta, Genbank or EMBL file): contigs collection
    • Advanced options:
    • Database to use - default is ‘resfinder’: NCBI Bacterial Antimicrobial Resistance Reference Gene Database
  2. Rename ABRicate output as AMR

  3. ABRicate Summary Tool with the following parameters:
    • Abricate output reports to summarize: AMR Output Report from ABRicate
  4. Rename ABRicate Summary output as AMR summary

Go Up


Preparing the Heatmap Input Matrix

The abricate summary output reports the percentage identity of each detected gene for every genome analyzed.

#FILE NUM_FOUND aac(6’)-Ic aac(6’)_Serra blaSRT blaSRT-2 oqxB30 oqxB9 tet(41)
041_SM 3 . 100.00 100.00 . . 98.89 .
042_SM 3 100.00 . . 100.00 . 98.89 .
043_SM 3 100.00 . . 100.00 . 98.89 .
045_SM 4 . 100.00 100.00 . 98.92 . 99.49

In this table:

  • numerical values represent percent identity of the match
  • . indicates absence of the gene

For heatmap visualization we do not need identity values, but only whether a gene is present or absent in each genome.

Therefore, the table must be converted into a binary matrix:

  • 1 → gene present
  • 0 → gene absent

Example of the required format (from Pathogen Identidification Tutorial :

key NP_231094 NP_231099 NP_231100 NP_231101 NP_231102 NP_232418 NP_232419
SRR6871245 1 1 1 1 1 1 1
SRR6871246 1 1 1 1 1 1 1
SRR6871247 1 1 1 1 1 1 1
SRR6871248 1 1 1 0 1 1 1
SRR6871249 1 0 0 0 0 1 1
SRR6871250 1 1 1 1 1 1 1

To generate this matrix:

  1. Keep the sample ID column as the first column (key).
  2. Remove the NUM_FOUND column.
  3. Convert each gene column using the following rule:
    • value .0
    • numeric value (e.g. 98.89, 100.00) → 1

The resulting binary matrix can then be used to generate a gene presence/absence heatmap.

The resulting binary matrix can then be used to generate a gene presence/absence heatmap.

There are two possible ways to generate this matrix:

  • Using a spreadsheet editor
    Download the abricate summary table, open it in a spreadsheet application (e.g. Excel), perform the conversion, save the modified table, and upload it back to Galaxy.

  • Using Galaxy tools
    The same transformation can be performed directly in Galaxy using tools for text and tabular file manipulation.

  1. Advanced Cut Tool with the following parameters:
    • File to cut: AMR Summary
    • Operation: Discard
    • Delimited by: Tab
    • Cut by: fields
    • List of Fields: 2
  2. Replace Tool with the following parameters:
    • File to process: output of Advanced Cut
    • In Find and Replace:
      • Insert Find and Replace
      • Find pattern: #FILE
      • Replace with: key
      • Find-Pattern is a regular expression: No
      • Replace all occurences of the pattern: No
      • Ignore first line: No
      • Case-Insensitive search: No
      • Find whole-words: Yes
      • Find and Replace text in: entire line
    • In Find and Replace:
      • Insert Find and Replace
      • Find pattern: \d+\.\d{2}
      • Replace with: 1
      • Find-Pattern is a regular expression: Yes
      • Replace all occurences of the pattern: Yes
      • Ignore first line: No
      • Case-Insensitive search: No
      • Find whole-words: Yes
      • Find and Replace text in: entire line
  3. Replace Tool with the following parameters:
    • File to process: output of Replace
    • In Find and Replace:
      • Insert Find and Replace
      • Find pattern: .
      • Replace with: 0
      • Find-Pattern is a regular expression: No
      • Replace all occurences of the pattern: Yes
      • Ignore first line: No
      • Case-Insensitive search: No
      • Find whole-words: No
      • Find and Replace text in: entire line

And, finally

  1. heatmap2 Tool wit the followinf parameters:
    • Input: output of last Replace
    • Data transformation: Plot data as ot is
    • Compute z-scores prior to clustering: Do not compute
    • Scale data on the plot (after clustering): Do not scale
    • Enable data clustering: No
    • Labeling columns and rows: Label my columns ad row
    • Type of colormap to use: Gradient with 2 colors
    • Output format: PNG

Go Up


Virulence Factor Identification with ABRicate

  1. ABRicate Tool with the following parameters:
    • Input file (Fasta, Genbank or EMBL file): contigs collection
    • Advanced options:
    • Database to use - default is ‘resfinder’: VFDB
  2. Rename ABRicate output as VFs

  3. ABRicate Summary Tool with the following parameters:
    • Abricate output reports to summarize: VFs Output Report from ABRicate
  4. Rename ABRicate Summary output as VFs summary

  5. Prepare the Heatmap Input Matrix

  6. Draw the Heatmap

Go Up


Multilocus Sequence Typing (MLST)

  1. MLST Tool with the following parameters:
    • input_files: CONTIGS collection
    • Specify advanced parameters: Yes, see full parameter list
      • Output novel alleles: Yes
      • Automatically set MLST scheme: Automatic MLST scheme detection

In our dataset, the MLST tool does not return any results.
The analysis only works for species that have an available MLST scheme in the PubMLST database.

In this case, Serratia marcescens is not included among the typing schemes supported by the MLST database bundled with the tool.
Therefore, no allele matches or sequence types (ST) can be assigned.

This can be verified by running the MLST List tool, which displays the list of species for which MLST schemes are available in the database.

Note
MLST schemes for Serratia marcescens do exist in PubMLST, but they are not included in the version of the database distributed with the MLST tool used here.
In a command-line environment it is possible to manually add missing schemes by downloading the allele sequences and profile tables from the PubMLST REST API and updating the local MLST database.
However, this type of database customization is typically outside the scope of standard Galaxy workflows and requires direct access to the underlying system..

The MLST results obtained using the updated database have been uploaded to the shared Galaxy history so that you can inspect and compare them with the other analyses performed in this tutorial.

Remember that the output file of the MLST tool is a tab-separated file containing:

  • the filename
  • the closest PubMLST scheme name
  • the ST (Sequence Type)
  • the allele IDs
Column 1 Column 2 Column 3 Column 4 Column 5 Column 6 Column 7 Column 8 Column 9 Column 10
041_SM serratia 595 adk(142) fumC(173) gyrB(133) icd(145) mdh(131) recA(128)  
042_SM serratia 1406 adk(174) fumC(315) gyrB(225) icd(256) mdh(398) recA(220)  
043_SM serratia 1406 adk(174) fumC(315) gyrB(225) icd(256) mdh(398) recA(220)  
044_SM ecloacae 116 dnaA(9) fusA(4) gyrB(14) leuS(6) pyrG(11) rplB(4) rpoB(6)
045_SM serratia - adk(1) fumC(265) gyrB(3) icd(3) mdh(2) recA(3)  
046_SM serratia - adk(95) fumC(117) gyrB(108) icd(269) mdh(205) recA(101)  

Go Up


Pangenomic Analysis

Pangenome analysis allows us to compare multiple genomes of the same species in order to identify:

  • core genes shared by all isolates
  • accessory genes present only in a subset of genomes
  • strain-specific genes

For a detailed explanation of pangenome concepts and Roary parameters, refer to the dedicated tutorial: Pangenome Analysis Tutorial

Roary Tool with the following parameters: - Individual gff files or a dataset collection: Collection - Dataset collection to submit to Roary: GFF collection from Bakta - In Additional Outputs - Select Accessory binary genes in newick format

This option produces a phylogenetic tree based on the presence/absence patterns of accessory genes.

Roary generates a summary table describing the gene distribution across the analyzed genomes:

Category Definition Genes
Core genes (99% ≤ strains ≤ 100%) 3648
Soft core genes (95% ≤ strains < 99%) 0
Shell genes (15% ≤ strains < 95%) 2793
Cloud genes (0% ≤ strains < 15%) 0
Total genes (0% ≤ strains ≤ 100%) 6441
  • Core genome (3648 genes)
    Represents the conserved genetic backbone shared by all analyzed isolates.

  • Shell genes (2793 genes)
    These genes are present in some but not all strains and represent the accessory genome, often associated with adaptation, mobile elements, or strain-specific traits.

  • Soft core and cloud genes
    In this dataset, no genes fall into these intermediate categories.

The total number of genes across the pangenome is 6441

The core genome defines the species, while the accessory genome explains the diversity between strains.

Go Up


Accessory Genome Tree

If the option Accessory binary genes in newick format was selected, Roary generates a tree based on the presence/absence patterns of accessory genes.

This tree can be visualized using the Newick Display tool.

  1. Newick Display
    • Input tree file: Accessory binary genes (newick) produced by Roary

This tree represents similarity in gene content between genomes.

Genomes that cluster together in this tree share a similar set of accessory genes, which may include:

  • antimicrobial resistance genes
  • virulence factors
  • insertion sequences
  • plasmid-associated genes
  • other mobile genetic elements

However, this tree does not represent the evolutionary history of the strains, because accessory genes can be exchanged through horizontal gene transfer.

Go Up


Interpreting the Newick Tree

The accessory genome tree generated by Roary is stored in Newick format.

(042_SM:0.001928085,043_SM:0.001044593, (041_SM:0.000000005,045_SM:0.791384807)1.000:1.018569918);

  • Names represent the genomes (e.g. 041_SM, 045_SM)
  • Numbers after : represent branch length
  • Numbers after ) represent branch support

Interpretation:

  • 041_SM and 045_SM form a cluster (sister taxa)
  • 1.000 indicates 100% support for this grouping
  • branch lengths represent distance between genomes

In this case, the tree indicates that:

  • 041_SM and 045_SM share a very similar accessory genome
  • 043_SM and 042_SM are more distant isolates

Remember:

  • Branch support → confidence in the grouping
  • Branch length → degree of genomic difference

Go Up


Interpreting Branch Length Values

Consider this part of the tree:

(041_SM:0.000000005,045_SM:0.791384807)1.000:1.018569918

Here we have three branch length values:

Genome / Node Branch length
041_SM 0.000000005
045_SM 0.791384807
internal node 1.018569918

Branch lengths are distance values, not percentages.

  • 0.000000005 → extremely small distance
  • 0.791 → moderate distance
  • 1.018 → the largest distance in the tree

So in relative terms:

  • 041_SM is almost identical to its ancestor node
  • 045_SM is more divergent
  • the branch 1.018 indicates the largest separation between clusters

In an accessory genome tree, branch length reflects differences in gene content, not evolutionary time.

Therefore:

  • larger values → genomes differ more in accessory genes
  • smaller values → genomes share more similar gene sets

Branch length values are meaningful only when compared within the same tree, not across different analyses.

Go Up


Visualizing the Tree with iTOL

The tree generated by Roary is stored in Newick/NHX format.
To visualize and explore it interactively, we can use the online tool iTOL (Interactive Tree Of Life).

  1. Download the file Accessory binary genes (newick) produced by Roary from your Galaxy history.

  2. Open the iTOL website: https://itol.embl.de

  3. Click Upload Tree.

  4. Upload the downloaded .nwk / .nhx file.

  5. Once uploaded, iTOL automatically renders the tree.

You can then:

  • zoom and navigate the tree
  • change the layout (rectangular, circular, radial)
  • adjust branch lengths and labels
  • collapse or expand clades
  • export the figure as SVG, PDF, or PNG.

iTOL is particularly useful for exploring larger trees and preparing publication-quality figures.

Go Up


Core Genome Phylogeny

To infer the evolutionary relationships between the isolates, we must instead use the core genome alignment produced by Roary.

Roary generates the file:

core_gene_alignment.aln

This multiple sequence alignment can be used to build a phylogenetic tree using tools such as:

  • RAxML
  • IQ-TREE
  • FastTree

These tools reconstruct a tree based on sequence evolution of the conserved core genes, providing a more accurate representation of the phylogenetic relationships between strains.


RAxML Output Files

Running RAxML on the core genome alignment produces several output files:

Output Description
Info Summary of the analysis parameters and run statistics (model used, number of sites, likelihood scores).
Log Detailed log of the optimization process performed during the ML search.
Parsimony Tree Starting tree generated using the maximum parsimony method. This tree is used as an initial guess for the ML search.
Result One of the trees obtained during the ML optimization process.
Best-scoring ML Tree The final phylogenetic tree with the highest likelihood score. This is the tree typically used for interpretation and visualization.

In practice:

  • the Best-scoring ML Tree is the main result of the analysis.
  • this file can be visualized with tools such as Newick Display

This tree represents the phylogenetic relationships between the isolates based on the core genome alignment.

Go Up


Interpreting the Core Genome Tree

The Best-scoring ML Tree generated by RAxML represents the phylogenetic relationships between the isolates based on the alignment of core genes.

Because core genes are conserved and typically inherited vertically, this tree approximates the evolutionary history of the strains.

When interpreting a phylogenetic tree, focus on three main elements.

  1. Tree topology
    • The topology describes how genomes cluster together.
    • Genomes that share a recent common ancestor appear in the same clade.
  2. Branch length
    • Branch lengths represent sequence divergence between genomes.
    • Short branches indicate closely related strains, while longer branches suggest greater evolutionary divergence.
  3. Branch support
    • Some trees include support values (e.g. bootstrap values) that indicate how strongly the data support a given split in the tree.
    • Higher values indicate more reliable groupings.

  • 042_SM and 043_SM form a cluster, indicating that these two isolates are the most closely related in terms of core genome sequence.
  • 041_SM is related to this cluster but shows a slightly larger divergence.
  • 045_SM has the longest branch, suggesting it is the most divergent isolate in this dataset.

This tree therefore describes the evolutionary relationships between the isolates based on their conserved genome backbone.

Go Up


IQ-TREE Output Files

Running IQ-TREE on the core genome alignment generates several output files.

Output Description
BIONJ Tree Initial tree built using the BIONJ distance method. This tree serves as a starting point for the maximum likelihood optimization.
MaxLikelihood Tree The maximum likelihood phylogenetic tree inferred from the alignment. This is the tree used for interpretation and visualization.
MaxLikelihood Distance Matrix Pairwise genetic distance matrix between genomes derived from the alignment.
Occurrence Frequencies in Bootstrap Trees Table summarizing how frequently each split appears across bootstrap replicates, used to calculate branch support.
Report and Final Tree Detailed report describing the analysis, parameters used, likelihood scores, and summary of the final tree inference.

In practice, the most important file for visualization is: MaxLikelihood Tree. This file can be opened withNewick Display and represents the phylogenetic relationships between the isolates based on the core genome alignment.

Go Up


FastTree Output

When running FastTree on the core genome alignment, with Nucleotide option, the tool produces a single output:

Output Description
Tree The inferred phylogenetic tree in Newick format.

Unlike tools such as IQ-TREE or RAxML, which generate multiple intermediate files and reports, FastTree outputs only the final tree.

FastTree is designed to infer trees very quickly, even for large datasets.

Advantages:

  • very fast
  • low memory usage
  • simple output

However, it is generally considered less rigorous than full maximum likelihood methods such as IQ-TREE or RAxML.

For exploratory analyses or large datasets, FastTree is often sufficient, while IQ-TREE or RAxML are preferred for more robust phylogenetic inference.

Go Up


Key Concept

Two complementary views of genomic diversity are obtained:

Tree type Based on Biological meaning
Core genome tree Sequence alignment of conserved genes Evolutionary relationships
Accessory genome tree Presence/absence of accessory genes Functional genomic similarity

Go Up


SNP-Based Phylogeny

Another common approach to infer phylogenetic relationships between bacterial isolates is SNP-based phylogeny.

Instead of comparing full gene sequences, this method focuses only on single nucleotide polymorphisms (SNPs) across the genomes.

Because SNPs capture very small genetic differences, this approach provides very high resolution and is widely used in:

  • outbreak investigations
  • microbial epidemiology
  • fine-scale strain comparison

The general workflow for SNP-based phylogeny includes several steps:

  1. Reference genome selection
    A suitable reference genome is chosen for the species under study.

  2. Read mapping
    Sequencing reads are aligned to the reference genome.

  3. Variant calling
    SNPs are identified by comparing each isolate to the reference genome.

  4. SNP filtering
    Low-quality or unreliable variants are removed.

  5. Core SNP alignment
    SNP positions shared across all isolates are extracted to create an alignment.

  6. Phylogenetic inference
    A phylogenetic tree is reconstructed using tools such as:

    • IQ-TREE
    • RAxML
    • FastTree

Go Up


Output

The resulting tree represents the phylogenetic relationships between isolates based on nucleotide differences across the genome.

Because SNPs capture very small genetic differences, SNP-based phylogenies provide high-resolution comparisons between strains.

Go Up


SNP-Based Phylogeny with Snippy

Running each step separately can be complex.
For bacterial genomes, the pipeline can be simplified using Snippy.

Snippy is a tool specifically designed for rapid SNP detection in bacterial genomes.
It automatically performs several steps of the workflow:

  • read mapping to a reference genome
  • SNP calling
  • variant filtering
  • consensus sequence generation

This greatly simplifies the analysis and reduces the number of tools required.

The output generated by Snippy can then be used to build a core SNP alignment, which serves as input for phylogenetic inference with tools such as IQ-TREE or RAxML.

Because of its simplicity and reliability, Snippy is widely used in bacterial genomics and outbreak analysis.

  1. Download Reference Genome from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_030291735.1/
  2. Upload to History
  3. Snippy tool with the following parameters:
    • Use a Genome Reference from history and build index
      • Use the following dataset as the reference sequence:reference file
    • Input Type: Paired Reads in collection
      • Select a paired collection: DATA
    • Select all outputs
  4. snippy-core tool with the following parameters:
    • Snippy input zipped dirs: zip dir from Snippy
    • Use the following dataset as the reference sequence: reference file
  5. Raxml or IQ-Tree with Core Alignment Fasta from snippy-core

Go Up


Snippy Output

Snippy has taken the reads and:

  1. mapped them against the reference using BWA MEM,

  2. looked through the resulting BAM file and found differences using some fancy Bayesian statistics (Freebayes),

  3. filtered the differences for sensibility and finally checked what effect these differences will have on the predicted genes and other features in the genome.

It produces quite a bit of output, there can be up to 10 output files.

  • snps vcf file: The final annotated variants in VCF format

  • snps gff file: The variants in GFF3 format

  • snps table: A simple tab-separated summary of all the variants

  • snps summary: A summary of the SNPs called

  • log file: A log file with the commands run and their outputs

  • aligned fasta: A version of the reference but with - at position with depth=0 and N for 0 < depth < –mincov (does not have variants)

  • consensus fasta: A version of the reference genome with all variants instantiated

  • mapping depth: A table of the mapping depth

  • mapped reads bam: A BAM file containing all mapped reads

  • outdir: A tarball of the Snippy output directory for input into snippy-core

Go Up


snippy-core Output

ID LENGTH ALIGNED UNALIGNED VARIANT HET MASKED LOWCOV
041_SM 516077 64372 72476 7318 183 171105 9019675
042_SM 516077 64416 61972 5011 176 117710 018436
043_SM 516077 64417 25172 5218 176 023790 017517
045_SM 516077 64515 21962 8970 175 989862 015725
Reference 516077 65160 77600 0000 0 0 0

Go Up


Tree

From the topology we observe:

  • 042_SM and 043_SM cluster together, indicating that these two isolates are the most closely related.
  • 041_SM is related to this cluster but shows a slightly larger divergence.
  • 045_SM and the reference genome form a separate branch.

Branch lengths represent the number of nucleotide differences detected across the genome.

Very small values (e.g. 0.00028) indicate nearly identical genomes, while larger values indicate greater genetic divergence.

In this tree, the branch lengths associated with 045_SM and the reference are much larger, suggesting that this isolate is much more divergent from the other samples.

Extremely long branches in SNP phylogenies may indicate a distant lineage, an unsuitable reference genome, or potential issues in the assembly or mapping.

Go Up


Average Nucleotide Identity (ANI)

Average Nucleotide Identity (ANI) is a metric used to measure the genomic similarity between two microbial genomes.

It calculates the average nucleotide identity of homologous genomic fragments shared between the genomes.

ANI is widely used for species identification in prokaryotes and has largely replaced traditional DNA–DNA hybridization methods.

A commonly accepted rule is:

ANI value Interpretation
≥ 95–96 % genomes belong to the same species
90–95 % closely related species
< 90 % different species

Go Up


Computing ANI with FastANI

ANI can be computed using FastANI, a fast alignment-based method designed for large-scale microbial genome comparisons.

FastANI works by:

  1. fragmenting the query genome into short sequences
  2. mapping these fragments to the reference genome
  3. calculating the average nucleotide identity across the matched regions.

The tool reports:

  • ANI value
  • number of fragments successfully mapped
  • total number of query fragments

Go Up


ANI Interpretation

ANI values between the isolates and the reference genome range between 95.1% and 95.4%.

Since the commonly accepted species threshold is 95–96% ANI, all isolates are consistent with belonging to Serratia marcescens.

Although isolate 045_SM appears more divergent in the phylogenetic trees, the ANI analysis confirms that it still falls within the same species boundary.


Key Concept

Phylogenetic distance does not necessarily imply species divergence.

An isolate may appear phylogenetically distant, yet still belong to the same species, as confirmed by ANI analysis.

Go Up


Population Structure Analysis

While phylogenetic trees describe evolutionary relationships between genomes, population genomics approaches can identify genomic clusters or lineages within a species.

One tool commonly used for this purpose is PopPUNK (POPulation Partitioning Using Nucleotide Kmers).

PopPUNK is typically applied to large genomic datasets (tens to thousands of genomes).

Taxon Cluster
CONTIGS_042_SM 1
CONTIGS_043_SM 1
CONTIGS_041_SM 2
CONTIGS_045_SM 3

Go Up


Final interpretation of the dataset

From the analyses performed in this tutorial we can draw several conclusions.

  1. Two samples (044 and 046) showed clear evidence of contamination and were excluded from downstream analyses.

  2. The remaining four genomes produced high-quality assemblies consistent with the expected genome size and GC content of Serratia marcescens.

  3. AMR analysis revealed a small set of resistance genes shared across the isolates, with minor variation between strains.

  4. Pangenome analysis identified a core genome of 3648 genes and a substantial accessory genome contributing to strain diversity.

  5. Core genome phylogeny shows that:
    • isolates 042_SM and 043_SM are the closest pair
    • isolate 045_SM is the most divergent
  6. ANI values between 95.1–95.4% confirm that all isolates belong to the same species, despite their phylogenetic divergence.