Pangenome Analysis

If we analyse DNA from several bacterial strains, we may want to know which genes they have in common and which are unique to some strains.


Pangenomic analysis studies the diversity and evolution of genes and genomes within a group of related organisms. It goes beyond traditional single-genome analysis by examining the collective genomic content of a population, species, or even a larger taxonomic group.

  • The core genome is the group of genes shared by every genome in the tested set. Gene sequences are similar but not necessarily identical. Core genome SNPs are SNPs found in core genes, i.e. sites at which the nucleotide varies between strains. We can use these SNPs to infer relationships between the strains. Some authors divide the core genome into:

    • hard core: families of homologous genes with at least one copy in every genome (100% of genomes)

    • soft core: families present in a proportion of genomes above a certain threshold (e.g. 90%)

  • The accessory genome is the group of genes that are not in all the strains.

    • Shell: the part of the pangenome shared by the majority of the genomes. There is no universally accepted threshold for the shell genome; some authors consider a gene family part of the shell if it is shared by more than 50% of the genomes in the pangenome.

    • Cloud: the gene families shared by a minimal subset of the genomes in the pangenome, including singletons (genes present in only one genome). It is also known as the peripheral genome. Gene families in this category are often related to ecological adaptation.

  • The pan genome is the sum of the core and accessory genes. That is, a combination of all the genes that are found in the clade of interest.
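The definitions above map directly onto set operations. The following is a minimal sketch, assuming each strain is represented simply as a set of gene family names; the strain labels and gene assignments are made up for illustration and are not taken from the tutorial data.

```python
# Toy gene content per strain (hypothetical assignments, for illustration only).
genomes = {
    "strain_A": {"dnaA", "gyrB", "mecA", "spa"},
    "strain_B": {"dnaA", "gyrB", "spa"},
    "strain_C": {"dnaA", "gyrB", "blaZ"},
}

gene_sets = list(genomes.values())

# Core genome: gene families present in every strain.
core = set.intersection(*gene_sets)

# Pan genome: every gene family seen in at least one strain.
pan = set.union(*gene_sets)

# Accessory genome: everything in the pan genome that is not core.
accessory = pan - core

print("core:     ", sorted(core))
print("accessory:", sorted(accessory))
print("pan:      ", sorted(pan))
```

Real pipelines first have to decide which genes belong to the same family (by clustering similar sequences); the set arithmetic only applies once that grouping exists.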

In a study of the pangenomes of Staphylococcus aureus strains, some of them isolated from the International Space Station, the thresholds used for segmenting the pangenome were as follows:

  • Cloud: presence in <10% of the genomes

  • Shell: presence in 10% to 95% of the genomes

  • Core: presence in >95% of the genomes
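These thresholds can be expressed as a small classification rule. The sketch below is only an illustration of the cut-offs quoted above (<10%, 10% to 95%, >95%); the function name `classify_gene_family` is hypothetical, and other studies use different thresholds.

```python
def classify_gene_family(n_present: int, n_genomes: int) -> str:
    """Classify a gene family as 'cloud', 'shell' or 'core'.

    n_present: number of genomes in which the family is found.
    n_genomes: total number of genomes in the pangenome.
    Integer arithmetic is used so the 10% and 95% boundaries are exact.
    """
    if n_genomes <= 0 or not 0 < n_present <= n_genomes:
        raise ValueError("need 0 < n_present <= n_genomes")
    if n_present * 100 < 10 * n_genomes:      # presence in <10% of genomes
        return "cloud"
    if n_present * 100 <= 95 * n_genomes:     # presence in 10% to 95%
        return "shell"
    return "core"                             # presence in >95%


# Example with 100 genomes.
for count in (5, 10, 95, 96):
    print(count, classify_gene_family(count, 100))
```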

Several stand-alone and server-based suites have recently become available for pangenome analysis; see "A Review of Pangenome Tools and Recent Studies" for an overview.

We decided to use Roary, a high-speed, stand-alone pan genome pipeline that takes annotated assemblies in GFF3 format (as produced by Prokka) and calculates the pan genome.



Preparing Data

Roary takes one annotated genome per sample in GFF3 format as input (e.g. the output from Prokka), with the strict requirement that all samples belong to the same species.

For this tutorial, we will use sequences from different strains of Staphylococcus aureus:



Download datasets

  1. Click on link

  2. Send to

  3. Complete Record

  4. File

  5. Format: FASTA

  6. Create File

  7. Rename file with strain label



Upload datasets on Galaxy

  1. Create a new history and rename it
  2. Upload data
  3. Create a collection named strains



Warning: We will not be running the entire workflow during this tutorial: it can be quite time-consuming and resource-intensive. I have prepared the output of the workflow in advance.

Annotate Genomes

  1. Prokka Tool with the following parameters (leave everything else unchanged):
    • contigs to annotate: strains
    Warning: The execution of this tool can take approximately 10 minutes. To avoid unnecessary waiting time and server overload during the course, please retrieve the precomputed outputs from the shared history.
  2. MultiQC tool with the following parameters:
    • In “Results”:
      • Insert Results
        • Which tool was used generate logs?: Prokka
        • In “Prokka Output”:
          • Prokka Output: Prokka.txt
    • Report title: Genome Annotation Report



Pangenomic Analysis with Roary

Roary is widely used in microbiology and bacterial genomics research to study the genomic diversity of bacterial populations, track the evolution of pathogens, and understand the genetic basis of phenotypic traits.

Starting with a set of annotated prokaryotic genomes in a standard format such as GFF, Roary:

  • converts coding sequences into protein sequences

  • clusters these protein sequences by several methods

  • further refines the clusters into orthologous genes

  • determines, for each sample, whether each gene is present or absent: produces the file gene_presence_absence.csv

  • uses this gene presence/absence information to build a tree with FastTree: produces the file accessory_binary_genes.fa.newick

  • calculates the overall number of genes that are shared and unique: produces the file summary_statistics.txt

  • aligns the core genes (if the corresponding option is selected) for downstream analyses
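To make the presence/absence step concrete, here is a toy sketch of how a set of gene clusters can be turned into a presence/absence matrix. It is not Roary's implementation: the cluster names, isolate names and locus tags are invented, and the clustering itself is assumed to have been done already.

```python
# Hypothetical clusters: gene family -> list of (isolate, locus_tag) members.
clusters = {
    "group_1": [("iso1", "a_001"), ("iso2", "b_001"), ("iso3", "c_001")],
    "group_2": [("iso1", "a_002"), ("iso3", "c_002")],
    "group_3": [("iso2", "b_002"), ("iso2", "b_003")],  # two copies in one isolate
}
isolates = ["iso1", "iso2", "iso3"]


def presence_absence(clusters, isolates):
    """Return {family: [0/1 per isolate]} in the order given by `isolates`."""
    matrix = {}
    for family, members in clusters.items():
        present_in = {isolate for isolate, _locus in members}
        matrix[family] = [1 if iso in present_in else 0 for iso in isolates]
    return matrix


matrix = presence_absence(clusters, isolates)
for family, row in matrix.items():
    # The row sum is the number of isolates carrying the family.
    print(family, row, "n_isolates =", sum(row))
```

A family with several copies in the same isolate still counts once for that isolate, which is why the matrix is binary.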

Roary Tool with the following parameters:

  • Individual gff files or a dataset collection: Collection
  • Dataset collection to submit to Roary: GFF collection from Prokka
  • In Additional Outputs
    • Select Accessory binary genes in newick format
Warning: The execution of this tool can take approximately 40 minutes. To avoid unnecessary waiting time and server overload during the course, please retrieve the precomputed outputs from the shared history.

Roary produces four main outputs:

  1. Summary statistics
    • Table reporting:
      • Core genome size (genes shared by all isolates)
      • Pan genome size (total number of genes)
    • Provides a global view of genome diversity across samples.

  2. Core gene alignment (FASTA)
    • Concatenated multiple alignment of core genes
    • Represents the conserved genomic backbone
    • Used as input for phylogenetic reconstruction
      (e.g. RAxML, IQ-TREE)
  3. Gene presence/absence matrix
    • Binary matrix describing gene content variability
    • For each gene:
      • annotated gene name (Column 3)
      • number of isolates in which the gene is present (Column 4)
    • Captures the accessory genome
      (plasmids, genomic islands, AMR genes)

  4. Accessory Binary Genes tree (Newick)
    • Tree based on presence/absence patterns of accessory genes
    • Represents similarity in gene content, not evolutionary distance
    • Highlights:
      • shared mobile genetic elements
      • functional convergence (e.g. AMR, adaptation)
    • ⚠️ Not a phylogenetic tree in the evolutionary sense
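The gene presence/absence table can also be inspected programmatically. The sketch below is a minimal example, assuming the header names `Gene` and `No. isolates` as they appear in Roary's CSV output (check the header of your own file); the embedded rows are invented and most of the real columns are left out for brevity.

```python
import csv
import io

# Tiny stand-in for gene_presence_absence.csv (made-up rows, trimmed columns).
EXAMPLE_CSV = '''"Gene","Non-unique Gene name","Annotation","No. isolates","iso1","iso2","iso3"
"dnaA","","Chromosomal replication initiator protein","3","a_001","b_001","c_001"
"mecA","","Penicillin-binding protein 2a","1","a_002","",""
"group_7","","hypothetical protein","2","","b_005","c_009"
'''


def summarise(handle, n_isolates):
    """Split gene names into core (present in all isolates) and accessory."""
    core, accessory = [], []
    for row in csv.DictReader(handle):
        target = core if int(row["No. isolates"]) == n_isolates else accessory
        target.append(row["Gene"])
    return core, accessory


core, accessory = summarise(io.StringIO(EXAMPLE_CSV), n_isolates=3)
print("core genes:     ", core)
print("accessory genes:", accessory)
```

To run this on a file downloaded from the Galaxy history, replace `io.StringIO(EXAMPLE_CSV)` with `open("gene_presence_absence.csv", newline="")` and set `n_isolates` to the number of strains in the collection.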



Infer phylogeny using core gene SNPs

Roary has produced an alignment of the core genes. We can use this alignment to infer a phylogenetic tree of the isolates.

  1. Phylogenetic reconstruction with RaXML Tool with the following parameters:
    • Source file with aligned sequences: Core gene alignment from Roary
    • Model Type: Nucleotide
    • Substitution Model: GTRGAMMA
  2. Inspect the six output files

  3. Click on Result.

  4. Under the file, click on the Visualize icon (a graph)

  5. Choose Phylogenetic Tree Visualization
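The phylogenetic signal in the core gene alignment comes from its variable sites. As a rough illustration of what a core genome SNP is, the sketch below lists the alignment columns in which more than one nucleotide is observed. The alignment is a made-up toy example, the helper names are hypothetical, and gaps or ambiguous bases are simply ignored; this is not how RAxML processes the alignment internally.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence} (sequences upper-cased)."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def variable_sites(records):
    """Return 0-based alignment positions with more than one of A, C, G, T."""
    seqs = list(records.values())
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for pos, column in enumerate(zip(*seqs)):
        bases = {base for base in column if base in "ACGT"}  # skip gaps / N
        if len(bases) > 1:
            sites.append(pos)
    return sites


# Toy alignment (hypothetical isolates and sequences).
ALIGNMENT = """>iso1
ATGCCGTA
>iso2
ATGACGTA
>iso3
ATG-CGTT
"""

records = read_fasta(ALIGNMENT)
snps = variable_sites(records)
print("variable positions:", snps)
```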



Visualize with Newick Display

  1. Newick Display Tool with the following parameters:
    • Newick file: Result from RaXML
    • Branch support: Display branch support
    • Branch length: Display branch length
    • Image width: 2000

  2. Newick Display Tool with the following parameters:
    • Newick file: Accessory Binary Genes tree from Roary
    • Branch support: Display branch support
    • Branch length: Display branch length
    • Image width: 2000



Visualize with Phandango

  1. Download Result from Phylogenetic reconstruction with RaXML
  2. Rename as raxml.tree
  3. Download Gene presence/absence from Roary
  4. Rename as gene_presence_absence.csv
  5. Go to http://phandango.net/
  6. Drag and drop the two files onto the landing page.
    • view the tree of samples and their core and pan genomes
    • each blue coloured column is a gene: genes are present or absent in each isolate
    • the core genes are shared by all isolates
