Hands On

From raw reads to insights: A hands-on guide to bacterial genomics

You have mastered the theory and practiced with curated demo datasets.
Today, we step out of the tutorial zone.

This session is about real data.

Real-world sequencing is messy, unpredictable, and rarely perfect.
You are no longer here just to follow commands.
You are here to practice the decision-making needed to turn raw, noisy reads into a scientifically robust genome assembly.

Biological questions

In this hands-on exercise we will analyze a small collection of Serratia marcescens isolates.

The main questions we aim to answer are:

Are all sequencing datasets of sufficient quality?
Do all samples belong to the same bacterial species?
Are some datasets contaminated?
How similar are the isolates at the genomic level?
Do they share the same antimicrobial resistance genes?
What are the evolutionary relationships between the isolates?

#FILE	NUM_FOUND	aac(6')-Ic	aac(6')_Serra	blaSRT	blaSRT-2	oqxB30	oqxB9	tet(41)
041_SM	3	.	100.00	100.00	.	.	98.89	.
042_SM	3	100.00	.	.	100.00	.	98.89	.
043_SM	3	100.00	.	.	100.00	.	98.89	.
045_SM	4	.	100.00	100.00	.	98.92	.	99.49
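The summary table above marks a gene that was not detected with `.` and a detected gene with a numeric value. A heatmap usually needs a plain 0/1 matrix instead, so the following is a minimal sketch of that conversion, assuming the table is available as tab-separated text. The function name `to_presence_absence` and the variable `SUMMARY_TSV` are illustrative choices, not part of ABRicate or Galaxy.

```python
# Copy of the ABRicate summary table (tab-separated).
# "." means the gene was not detected; any other value means it was found.
SUMMARY_TSV = (
    "#FILE\tNUM_FOUND\taac(6')-Ic\taac(6')_Serra\tblaSRT\tblaSRT-2\toqxB30\toqxB9\ttet(41)\n"
    "041_SM\t3\t.\t100.00\t100.00\t.\t.\t98.89\t.\n"
    "042_SM\t3\t100.00\t.\t.\t100.00\t.\t98.89\t.\n"
    "043_SM\t3\t100.00\t.\t.\t100.00\t.\t98.89\t.\n"
    "045_SM\t4\t.\t100.00\t100.00\t.\t98.92\t.\t99.49\n"
)


def to_presence_absence(summary_tsv):
    """Return (gene_names, {sample: [0 or 1, ...]}) from a summary table."""
    lines = [line for line in summary_tsv.splitlines() if line.strip()]
    header = lines[0].split("\t")
    genes = header[2:]  # skip the #FILE and NUM_FOUND columns
    matrix = {}
    for line in lines[1:]:
        fields = line.split("\t")
        sample, values = fields[0], fields[2:]
        matrix[sample] = [0 if value == "." else 1 for value in values]
    return genes, matrix


genes, matrix = to_presence_absence(SUMMARY_TSV)

# Print the matrix in the same layout as a heatmap input table.
print("key", *genes, sep="\t")
for sample, flags in matrix.items():
    print(sample, *flags, sep="\t")
```

As a sanity check, the number of 1s in each row should equal that sample's `NUM_FOUND` value.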

key	NP_231094	NP_231099	NP_231100	NP_231101	NP_231102	NP_232418	NP_232419
SRR6871245	1	1	1	1	1	1	1
SRR6871246	1	1	1	1	1	1	1
SRR6871247	1	1	1	1	1	1	1
SRR6871248	1	1	1	0	1	1	1
SRR6871249	1	0	0	0	0	1	1
SRR6871250	1	1	1	1	1	1	1

Sample	Scheme	ST	Locus 1	Locus 2	Locus 3	Locus 4	Locus 5	Locus 6	Locus 7
041_SM	serratia	595	adk(142)	fumC(173)	gyrB(133)	icd(145)	mdh(131)	recA(128)
042_SM	serratia	1406	adk(174)	fumC(315)	gyrB(225)	icd(256)	mdh(398)	recA(220)
043_SM	serratia	1406	adk(174)	fumC(315)	gyrB(225)	icd(256)	mdh(398)	recA(220)
044_SM	ecloacae	116	dnaA(9)	fusA(4)	gyrB(14)	leuS(6)	pyrG(11)	rplB(4)	rpoB(6)
045_SM	serratia	-	adk(1)	fumC(265)	gyrB(3)	icd(3)	mdh(2)	recA(3)
046_SM	serratia	-	adk(95)	fumC(117)	gyrB(108)	icd(269)	mdh(205)	recA(101)
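The MLST table can be read in three passes: which samples were typed with an unexpected scheme, which samples have no assigned sequence type (`-`), and which samples share a sequence type. The sketch below does this for the first three columns of the table. The helper `summarize_mlst` and its `expected_scheme` parameter are assumptions made for illustration and are not part of the MLST tool.

```python
from collections import defaultdict

# Sample, scheme and sequence type (ST) copied from the MLST table.
# "-" means the allele combination did not match a known ST.
MLST_ROWS = [
    ("041_SM", "serratia", "595"),
    ("042_SM", "serratia", "1406"),
    ("043_SM", "serratia", "1406"),
    ("044_SM", "ecloacae", "116"),
    ("045_SM", "serratia", "-"),
    ("046_SM", "serratia", "-"),
]


def summarize_mlst(rows, expected_scheme="serratia"):
    """Group samples by ST; report untyped and off-scheme samples separately."""
    by_st = defaultdict(list)
    unassigned = []
    off_scheme = []
    for sample, scheme, st in rows:
        if scheme != expected_scheme:
            off_scheme.append(sample)  # possibly a different species
        elif st == "-":
            unassigned.append(sample)  # alleles found, but no known ST
        else:
            by_st[st].append(sample)
    return dict(by_st), unassigned, off_scheme


by_st, unassigned, off_scheme = summarize_mlst(MLST_ROWS)
print("Shared sequence types:", by_st)
print("No ST assigned:", unassigned)
print("Unexpected scheme:", off_scheme)
```

Samples without an ST are deliberately not grouped together: 045_SM and 046_SM both show `-`, but their allele numbers differ at every locus, so they are not the same type.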

Category	Definition	Genes
Core genes	(99% ≤ strains ≤ 100%)	3648
Soft core genes	(95% ≤ strains < 99%)	0
Shell genes	(15% ≤ strains < 95%)	2793
Cloud genes	(0% ≤ strains < 15%)	0
Total genes	(0% ≤ strains ≤ 100%)	6441
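The category definitions in the table are thresholds on the percentage of strains that carry a gene. The following sketch applies those thresholds to a single gene, assuming you know how many strains carry it. The function name `pangenome_category` is hypothetical, and the strain counts in the example are illustrative rather than taken from this dataset.

```python
def pangenome_category(n_with_gene, n_strains):
    """Classify a gene by the share of strains carrying it.

    Thresholds follow the table: core >= 99%, soft core >= 95%,
    shell >= 15%, cloud below 15%. Integer arithmetic avoids
    floating-point rounding at the boundaries.
    """
    if not 0 < n_with_gene <= n_strains:
        raise ValueError("n_with_gene must be between 1 and n_strains")
    if 100 * n_with_gene >= 99 * n_strains:
        return "core"
    if 100 * n_with_gene >= 95 * n_strains:
        return "soft core"
    if 100 * n_with_gene >= 15 * n_strains:
        return "shell"
    return "cloud"


# Illustrative example: every possible count in a four-genome collection.
small_collection = [pangenome_category(k, 4) for k in range(1, 5)]
print(small_collection)
```

With only four genomes, a gene is present in 25%, 50%, 75%, or 100% of strains, so it can only be shell or core. That may be why a small collection reports zero soft core and zero cloud genes, as in the table.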

Output	Description
Info	Summary of the analysis parameters and run statistics (model used, number of sites, likelihood scores).
Log	Detailed log of the optimization process performed during the ML search.
Parsimony Tree	Starting tree generated using the maximum parsimony method. This tree is used as an initial guess for the ML search.
Result	One of the trees obtained during the ML optimization process.
Best-scoring ML Tree	The final phylogenetic tree with the highest likelihood score. This is the tree typically used for interpretation and visualization.

Output	Description
BIONJ Tree	Initial tree built using the BIONJ distance method. This tree serves as a starting point for the maximum likelihood optimization.
MaxLikelihood Tree	The maximum likelihood phylogenetic tree inferred from the alignment. This is the tree used for interpretation and visualization.
MaxLikelihood Distance Matrix	Pairwise genetic distance matrix between genomes derived from the alignment.
Occurrence Frequencies in Bootstrap Trees	Table summarizing how frequently each split appears across bootstrap replicates, used to calculate branch support.
Report and Final Tree	Detailed report describing the analysis, parameters used, likelihood scores, and summary of the final tree inference.

Tree type	Based on	Biological meaning
Core genome tree	Sequence alignment of conserved genes	Evolutionary relationships
Accessory genome tree	Presence/absence of accessory genes	Functional genomic similarity
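To make the second row of the table concrete, the sketch below compares isolates purely by which accessory genes they carry. It uses the Jaccard distance, which is one common choice for presence/absence data and is not necessarily the method used by the pangenome tool in this workflow. The isolate names and gene profiles are hypothetical.

```python
from itertools import combinations

# Hypothetical accessory-gene profiles (1 = present, 0 = absent).
# These are made-up values for illustration, not results from the dataset.
PROFILES = {
    "isolate_A": [1, 1, 0, 1, 0],
    "isolate_B": [1, 1, 0, 1, 0],
    "isolate_C": [0, 1, 1, 0, 1],
}


def jaccard_distance(a, b):
    """Return 1 - (genes shared by both) / (genes present in either)."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if union == 0 else 1 - shared / union


distances = {
    (first, second): jaccard_distance(PROFILES[first], PROFILES[second])
    for first, second in combinations(PROFILES, 2)
}

for (first, second), distance in distances.items():
    print(f"{first}\t{second}\t{distance:.2f}")
```

Two isolates with identical gene content have a distance of 0 even if their shared genes differ in sequence, which is why an accessory genome tree reflects functional similarity rather than evolutionary history.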

ID	LENGTH	ALIGNED	UNALIGNED	VARIANT	HET	MASKED	LOWCOV
041_SM	5160776	4372724	767318	183171	1059	0	19675
042_SM	5160776	4416619	725011	176117	710	0	18436
043_SM	5160776	4417251	725218	176023	790	0	17517
045_SM	5160776	4515219	628970	175989	862	0	15725
Reference	5160776	5160776	0	0	0	0	0

ANI value	Interpretation
≥ 95–96 %	genomes belong to the same species
90–95 %	closely related species
< 90 %	different species
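The interpretation table can be written as a small lookup. The sketch below assumes a species cutoff of 95%, the lower end of the 95–96% range given in the table. The function name `interpret_ani` and its parameters are illustrative and are not part of FastANI.

```python
def interpret_ani(ani_percent, species_cutoff=95.0, related_cutoff=90.0):
    """Map an ANI value (in percent) to the categories in the table.

    species_cutoff is an assumption: the table gives a 95-96% range,
    so stricter analyses may prefer 96.0.
    """
    if not 0.0 <= ani_percent <= 100.0:
        raise ValueError("ANI must be a percentage between 0 and 100")
    if ani_percent >= species_cutoff:
        return "same species"
    if ani_percent >= related_cutoff:
        return "closely related species"
    return "different species"


# Illustrative values, not results from the dataset.
for value in (98.7, 95.4, 92.0, 85.0):
    print(f"{value:.1f}%\t{interpret_ani(value)}")
```

Values that fall between 95% and 96% sit in the grey zone of the table; calling `interpret_ani(95.4, species_cutoff=96.0)` shows how the stricter cutoff changes the call.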

Taxon	Cluster
CONTIGS_042_SM	1
CONTIGS_043_SM	1
CONTIGS_041_SM	2
CONTIGS_045_SM	3

Genome / Node	Branch length
041_SM	0.000000005
045_SM	0.791384807
internal node	1.018569918
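Branch lengths like those in the table are stored in the Newick file after each colon. The sketch below extracts the terminal branch length of each genome from a Newick string using a regular expression. The Newick string, the isolate names, and the helper `leaf_branch_lengths` are hypothetical, and the pattern assumes internal nodes carry no labels or support values.

```python
import re

# Hypothetical Newick string for illustration only; names and lengths are made up.
NEWICK = "((isolate_A:0.000000005,isolate_B:0.79):1.02,isolate_C:0.35);"

# A leaf is a name (no brackets, commas, colons or semicolons)
# followed by ":" and a number.
LEAF_PATTERN = re.compile(r"([^\s(),:;]+):([0-9.eE+-]+)")


def leaf_branch_lengths(newick):
    """Return {leaf_name: terminal_branch_length} for unlabeled-internal-node trees."""
    return {name: float(length) for name, length in LEAF_PATTERN.findall(newick)}


lengths = leaf_branch_lengths(NEWICK)

# List genomes from the shortest to the longest terminal branch.
for name, length in sorted(lengths.items(), key=lambda item: item[1]):
    print(f"{name}\t{length:.9f}")
```

A terminal branch length close to zero, such as the 0.000000005 shown for 041_SM in the table, means almost no change separates that genome from the node it attaches to. A larger value indicates more divergence along that branch.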

Galaxy Training for Pathogen Genomics, AMR Detection & Virus ID

Hands On

Biological questions

Analysis Workflow

Galaxy and Data Preparation

Quality Check (QC)

QC Checklist

Contamination Check

Contamination Checklist

Interpretation and Action

Optional Exercise

Assembly

Reference-based Assembly Evaluation with QUAST

Circos Plot

Assembly Checklist

Functional Annotation

Annotation Checklist

Further Structural Annotation

Visualisation of the annotation

AMR Detection & Virulence Factor Identification

AMR Detection with StarAMR

AMR Detection with ABRicate

Preparing the Heatmap Input Matrix

Virulence Factor Identification with ABRicate

Multilocus Sequence Typing (MLST)

Pangenomic Analysis

Accessory Genome Tree

Interpreting the Newick Tree

Interpreting Branch Length Values

Visualizing the Tree with iTOL

Core Genome Phylogeny

RAxML Output Files

Interpreting the Core Genome Tree

IQ-TREE Output Files

FastTree Output

Key Concept

SNP-Based Phylogeny

Output

SNP-Based Phylogeny with Snippy

Snippy Output

snippy-core Output

Tree

Average Nucleotide Identity (ANI)

Computing ANI with FastANI

ANI Interpretation

Key Concept

Population Structure Analysis

Final interpretation of the dataset
