Genome Assembly of a bacterial genome (MRSA)

Bacterial genome assembly from Illumina MiSeq reads, covering data quality assessment, assembly generation, and assembly quality metrics.


Introduction to Assembly Genome

Sequencing produces short, unordered reads that represent only small fragments of the genome. Genome assembly addresses this limitation by reconstructing longer genomic sequences, enabling meaningful biological interpretation and downstream analyses such as genome annotation, resistance gene detection, and comparative genomics.

Genome assembly consists of aligning and reconstructing these fragments to form a continuous sequence (that of the chromosomes) or a set of contiguous sequences (called contigs or scaffolds)

  • Contig: a contiguous sequence in an assembly. A contig does not contain long stretches of unknown sequences (aka assembly gaps). The contig is usually generated using the long-reads data.
  • Scaffold: a sequence consists of one or multiple contigs connected by assembly gaps of typically inexact sizes. A scaffold is also called a supercontig, though this terminology is rarely used nowadays. Usually, scaffolds are generated using the Hi-C data Assembly: a set of contigs or scaffolds.

Assembly Overview

  • Increases informational content
    Assembled sequences (contigs/scaffolds) enable the analysis of complete genes, regulatory regions, and genomic context beyond single reads.

  • Enables reference-free analysis
    In metagenomics and for many bacteria and viruses, suitable reference genomes may be unavailable or too distant.

  • Supports discovery of novel organisms and variants
    Assembly allows reconstruction of genomes or genomic fragments not present in reference databases.

  • Improves functional annotation
    Resistance genes, virulence factors, and viral proteins are more reliably identified on longer, continuous sequences.

  • Enables structural and comparative analyses
    Including the study of plasmids, mobile genetic elements, genomic rearrangements, and strain-level comparisons.

Assembly vs Align

Assembly tools

  • Genome
    Velvet, Velvet Optimizer, Spades, Abyss,
    MIRA, Newbler, SGA, AllPaths, Ray, SOAPdenovo, …

  • Meta-genome
    Meta Velvet, MetaSpades, SGA, custom scripts + above

  • Transcriptome
    Trinity, Oases, Trans-abyss

And many, many others…

Go Up


In this training we will build an assembly of a bacterial genome, from data produced in “Complete Genome Sequences of Eight Methicillin-Resistant Staphylococcus aureus Strains Isolated from Patients in Japan” Hikichi et al. 2019

Methicillin-resistant Staphylococcus aureus (MRSA) is a major pathogen causing nosocomial 
infections, and the clinical manifestations of MRSA range from asymptomatic colonization of the nasal mucosa to soft tissue infection to fulminant invasive disease. 

NOTE! Paired-end reads were generated using a MiSeq reagent kit (v3-600) on the MiSeq platform (Illumina).

Go Up


Galaxy and data preparation

  1. Create a new history and rename it
  2. Import data
    https://zenodo.org/record/10669812/files/DRR187559_1.fastqsanger.bz2
    https://zenodo.org/record/10669812/files/DRR187559_2.fastqsanger.bz2
    
  3. Rename the datasets to remove .fastqsanger.bz2 and keep only the sequence run ID (DRR187559_1 and DRR187559_2)
  4. Create a paired collection named Raw reads
    • Click on Select Items at the top of the history panel
    • Check all the datasets in your history you would like to include
    • Click n of N selected and choose Advanced Build List
    • You are in the collection building wizard. Choose List of Paired Datasets and click Next button at the right bottom corner.
    • Check and configure auto-pairing. Commonly matepairs have suffix _1 and _2 or _R1 and _R2. Click on Next at the bottom.
    • Edit the List Identifier as required.
    • Enter a name for your collection
    • Click Build to build your collection
    • Click on the checkmark icon at the top of your history again
  5. Tag the collection #unfiltered
  6. View the renamed files in the collection

Go Up


Quality Control

Before performing any genome assembly, it is essential to assess the quality of the input reads. Key questions to consider include:

  • What is the sequencing coverage of the genome?

  • What is the overall quality of the reads?

  • Is additional sequencing required?

  • Are the data suitable for the intended downstream analyses?

In the previous tutorial, we used FastQC for read quality assessment.
In this lesson, we will use Falco, an efficiency-optimized reimplementation of FastQC, designed to provide the same quality metrics with improved performance.

  1. Falco Tool with the following parameters
    • Raw read data from your current history: Raw Reads
  2. Inspect the generated HTML file

Go Up


Quality improvement

Depending on the analysis it could be possible that a certain quality or length is needed. In this case we are going to trim the data using fastp (or Trim-Galore etc.)

  1. fastp with the following parameters:
    • Single-end or paired reads: Paired Collection
    • Select paired collection(s): Raw Reads
    • In Filter Options:
      • In Length filtering Options:
        • Length required: 30
    • In Read Modification Options
      • In Per read cuitting by quality options:
        • Cut by quality in front (5’): Yes
        • Cut by quality in tail (3’): Yes
        • Cutting window size: 4
        • Cutting mean quality: 20
    • In Output Options:
      • Output JSON report: Yes
  2. Edit the tags of the fastp FASTQ outputs to
    • Remove the #unfiltered tag
    • Add a new tag #filtered

fastp generates also a report, similar to Falco, useful to compare the impact of the trimming and filtering.

Question

  1. How did the average read length change before and after filtering?
  2. Did trimming improve the mean quality scores?
  3. Did trimming affect the GC content?
  4. Is this data ok to assemble?

Exercises

  1. Run MultiQC on the JSON output generated by fastp.

Go Up


Checking expected species and contamination

Before proceeding with genome assembly, it is important to check if the expected species or strains can be identified in the data or if there is any contamination.

As this step is not specific to genome assembly, it is covered in a dedicated tutorial.

Go Up


Assembly

Now that the quality of the reads is determined and the data is filtered and/or trimmed, we can try to assemble the reads together to build longer sequences.

There are many tools that create assembly for short-read data, e.g. SPAdes, Abyss or Velvet. In this tutorial, we use Shovill. Shovill is a SPAdes-based genome assembler, improved to work faster and only for smaller (bacterial) genomes.

Warning: Shovill is for isolate data only, primarily small haploid organisms. It will NOT work on metagenomes or larger genomes
  1. Shovill with the following parameters:
    • Input reads type, collection or single library: Paired Collection
    • Paired collection: fastp Paired-end output
    • In Advanced options:
      • Estimated genome size: 2914567
        Estimated genome size (sourced from the cited article for this case, though typically retrieved from or NCBI) is optional. If unknown, Shovill estimates it during pre-processing. Nonetheless, providing it manually is recommended for more robust results with ‘heavy’ samples.
  2. View the FASTA file with contigs.
Warning: The execution of this tool can take approximately 10 minutes. To avoid unnecessary waiting time and server overload during the course, please retrieve the precomputed outputs from the shared history.

Question

  1. How big is the first contig?
  2. What is the coverage of your biggest (first) contig?

Go Up


Assembly Strategies: Individual vs. Co-assembly

  • Individual Assembly: Reads from each sample are assembled independently. This approach is ideal for preserving strain-specific variations and is computationally efficient. However, it may fail to reconstruct low-abundance sequences that lack sufficient coverage within a single library.

  • Co-assembly: Reads from multiple samples are pooled and assembled together into a single set of contigs. This increases the overall sequencing depth, facilitating the recovery of rare taxa. The main drawbacks include high computational requirements and the potential for generating chimeric contigs due to microdiversity between samples.

Which one to choose?

Situation Recommended Choice Reason
Pure bacterial isolates Individual Assembly Better for preserving specific strain identity.
Metagenomics (same site/patient) Co-Assembly Increases coverage to recover rare species.
Metagenomics (very different sites) Individual Assembly Avoids creating chimeras from unrelated species.
Limited computing resources (RAM) Individual Assembly Requires significantly less memory per run.

Go Up


Quality control of assembly

  1. Quast Tool with parameters
    • Assembly mode?: Co-Assembly
      • Use customized names for the input files?: No, use dataset names
        • Contigs/scaffolds file: contigs output Shovill
      • Reads options: Illumina paired-end reads in paired collection
        • FASTQ/FASTA files: reads from fastp output
          (To make the job quicker, you can select Disabled here. The raw reads will then not been mapped to the assembly to compute metrics, like the coverage.)
    • Output files: HTML report, PDF report, Tabular reports
  2. Inspect the HTML reports

Go Up


Assembly statistics

The following metrics are the most critical indicators of a high-quality bacterial assembly:

  1. N50 - The N50 value indicates that 50% of the total assembly length is contained in contigs of this size or larger.

    • Significance: It measures the continuity of the assembly.

    • Quality Benchmark: For a bacterial isolate, an N50 > 100–200 kb is generally considered excellent. A low N50 (e.g., < 20 kb) suggests a highly fragmented assembly.

  2. Number of Contigs (# contigs) - This represents the total number of fragments the genome has been broken into.

    • Significance: It measures fragmentation.

    • Quality Benchmark: Ideally, this should be below 100-150 for a single bacterial genome. Thousands of contigs often indicate issues with repeats or insufficient sequencing depth.

  3. Total Length - The cumulative length of all contigs should be close to the expected genome size (e.g., ~2.9 Mb for MRSA).

    • Significance: It measures completeness.

    • Quality Benchmark: The length should be within ±5-10% of the reference genome. Significant deviations may suggest contamination (if too high) or missing genomic regions (if too low).

  4. GC Content (%) - The percentage of Guanine and Cytosine bases in the assembly.

    • Significance: It serves as a check for taxonomic identity and contamination.

    • Quality Benchmark: It must be consistent with the target species (e.g., ~32.89% for MRSA).

See Quast Manual for more details.

Question

  1. How many contigs is there?
  2. What is the total length of all contigs?
  3. What is you GC content?

Go Up


Assembly Report

  1. MultiQC tool with following parameters
    • In “Results”:
    • Insert Results
      • Which tool was used generate logs?: QUAST
      • *In “QUAST Output”:
        • QUAST Output: QUAST Report on Shovill
  • Report title: Assembly Report

Go Up


Homework

You are required to repeat the Assembly + QUAST workflow using SPAdes (with --isolate optiopn) and compare the results with your previous Shovill assembly.

Go Up


Conclusions

In this tutorial, we prepared short reads, assembled them, and inspect the produced assembly for its quality. The assembly, even if uncomplete, is reasonable good to be used in downstream analysis:

Go Up