Bacterial Genome Annotation

Annotation and exploration of a draft bacterial genome, including gene prediction, identification of genomic components, evaluation of annotation results, and visualization of annotated features.


After sequencing and assembly, a genome can be annotated. Annotation is an essential step in describing the genome.

Genome annotation consists of describing the structure and function of the components of a genome by predicting, analyzing, and interpreting them, in order to extract their biological significance and understand the biological processes in which they participate.

Among other things, it identifies the locations of genes and all the coding regions in a genome (structural annotation) and determines what those genes do (functional annotation).

Galaxy and data preparation

To illustrate how a bacterial genome is annotated, we use the assembly generated in the previous tutorial.

Before starting the analysis, prepare your Galaxy workspace as follows:

  1. Create a new Galaxy history and give it a meaningful name.

  2. Import the Shovill contigs dataset into the new history by dragging and dropping it from a previous history.



Contig annotation

Several tools can be used to annotate the contigs, including Prokka and Bakta.

Annotation with Bakta

Bakta is a tool for rapid and standardized annotation of bacterial genomes and plasmids from both isolates and metagenome-assembled genomes (MAGs).

  1. Bakta with the following parameters:
    • In Input/Output options:
      • The bakta database: latest one
      • The amrfinderplus database: latest one
      • Select genome in fasta format: Contig file
    • In Optional annotation:
      • Keep original contig header (--keep-contig-headers): Yes
    • In Selection of the output files:
      • Output files selection:
        • Annotation file in TSV
        • Annotation and sequence in GFF3
        • Feature nucleotide sequences as FASTA
        • Summary as TXT
        • Plot of the annotation result as SVG
  2. NOTE: Do not click Run.
Warning: Since the annotation process can take several hours, we will instead use precomputed annotation results by dragging and dropping them from the Datasets history.

Bakta can generate many outputs. Here we selected:

  • annotation_and_sequences in GFF3

    A GFF is a tab delimited file with 9 fields per line:

    1. seqid: The name of the sequence where the feature is located.

    2. source: The algorithm or procedure that generated the feature. This is typically the name of a software or database.

    3. type: The feature type name, like “gene” or “exon”. In a well structured GFF file, all the children features always follow their parents in a single block (so all exons of a transcript are put after their parent “transcript” feature line and before any other parent transcript line). In GFF3, all features and their relationships should be compatible with the standards released by the Sequence Ontology Project.

    4. start: Genomic start of the feature, with a 1-base offset. This is in contrast with other 0-offset half-open sequence formats, like BED.

    5. end: Genomic end of the feature, with a 1-base offset. This is the same end coordinate as it is in 0-offset half-open sequence formats, like BED.

    6. score: Numeric value that generally indicates the confidence of the source in the annotated feature. A value of “.” (a dot) is used to define a null value.

    7. strand: Single character that indicates the strand of the feature. This can be “+” (positive, or 5’->3’), “-” (negative, or 3’->5’), “.” (undetermined), or “?” for features with relevant but unknown strands.

    8. phase: Phase of CDS features; it is one of 0, 1, or 2 for CDS features and “.” for everything else. The phase is the number of bases to skip from the start of the feature to reach the first base of the next complete codon.

    9. attributes: A list of tag-value pairs separated by a semicolon with additional information about the feature.

  • annotation summary as a simple, human-readable TSV

    This is a table with 9 columns (Sequence Id, Type, Start, Stop, Strand, Locus Tag, Gene, Product, DbXrefs).

  • Plot of the annotation as a circular genome plot

    1. The first ring represents the GC content per sliding window over the entire sequence(s), with green representing GC content above average and red representing GC content below average. The second ring represents the GC skew in orange and blue.
    2. All features are plotted on two rings representing the forward and reverse strands, from outer to inner, with CDS in grey (the other colors are hard to distinguish).
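To make the 9 GFF fields concrete, the following minimal Python sketch parses GFF3 feature lines, converts the 1-based coordinates into a BED-style 0-based half-open interval, and counts features per type, which is one way to approach the question below. The names GffFeature, parse_gff3_line, to_bed_interval and count_feature_types, as well as the example lines, are made up for illustration and are not part of Bakta or Galaxy.

```python
from collections import Counter
from typing import NamedTuple, Optional


class GffFeature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: Optional[float]
    strand: str
    phase: Optional[int]
    attributes: dict


def parse_gff3_line(line: str) -> Optional[GffFeature]:
    """Parse one feature line; return None for comments, blanks and non-feature lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        return None
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = value
    return GffFeature(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        attributes,
    )


def to_bed_interval(feature: GffFeature):
    """Shift the start by one; the end coordinate is numerically unchanged."""
    return feature.seqid, feature.start - 1, feature.end


def count_feature_types(lines) -> Counter:
    counts = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # the sequence section follows; no feature lines after it
        feature = parse_gff3_line(line)
        if feature is not None:
            counts[feature.type] += 1
    return counts


# Made-up example lines, not real Bakta output.
EXAMPLE_LINES = [
    "##gff-version 3",
    "contig00001\tExample\tregion\t1\t5000\t.\t+\t.\tID=contig00001",
    "contig00001\tExample\tCDS\t101\t400\t.\t+\t0\tID=EX_00001;Name=hypothetical protein",
    "contig00001\tExample\tCDS\t450\t900\t.\t-\t0\tID=EX_00002",
    "contig00001\tExample\ttRNA\t1200\t1275\t.\t+\t.\tID=EX_00003",
    "##FASTA",
    ">contig00001",
    "ACGT",
]

print(count_feature_types(EXAMPLE_LINES))
```

On a real history, the same function could be applied to the lines of the downloaded GFF3 dataset instead of EXAMPLE_LINES.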

Question

  • How many features have been identified?



Annotation with Prokka

Prokka is a wrapper: it collects together several pieces of software (from various authors), and so avoids “re-inventing the wheel”.

Prokka finds and annotates features (both protein coding regions and RNA genes, i.e. tRNA, rRNA) present on a sequence, using a two-step process for the annotation of protein coding regions:

  1. protein coding regions on the genome are identified using Prodigal;

  2. the function of the encoded protein is predicted by similarity to proteins in one of many protein or protein domain databases.

Prokka annotates bacterial, archaeal and viral genomes quickly, generating standard output files in GenBank, EMBL and GFF formats.

  1. Prokka Tool with the following parameters (leave everything else unchanged):
    • contigs to annotate: Contig file

Once Prokka has finished, examine each of its output files.

Extension Description
.gff This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
.gbk This is a standard GenBank file derived from the master .gff. If the input to Prokka was a multi-FASTA, then this will be a multi-GenBank, with one record for each sequence.
.fna Nucleotide FASTA file of the input contig sequences.
.faa Protein FASTA file of the translated CDS sequences.
.ffn Nucleotide FASTA file of all the predicted transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA).
.sqn An ASN.1 format “Sequin” file for submission to GenBank. It needs to be edited to set the correct taxonomy, authors, related publication, etc.
.fsa Nucleotide FASTA file of the input contig sequences, used by “tbl2asn” to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines.
.tbl Feature Table file, used by “tbl2asn” to create the .sqn file.
.err Unacceptable annotations - the NCBI discrepancy report.
.log Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled.
.txt Statistics relating to the annotated features found.
.tsv Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product
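As a quick way to examine the .tsv output, the sketch below counts features per type and the CDS annotated only as "hypothetical protein". It relies only on the ftype and product columns listed above; the function name summarize_prokka_tsv and the example rows are invented for illustration, and the exact product wording in a real Prokka run is an assumption.

```python
import csv
import io
from collections import Counter


def summarize_prokka_tsv(handle):
    """Return (features per type, number of CDS with no predicted function)."""
    reader = csv.DictReader(handle, delimiter="\t")
    per_type = Counter()
    hypothetical = 0
    for row in reader:
        per_type[row["ftype"]] += 1
        # Assumption: unannotated CDS carry exactly this product string.
        if row["ftype"] == "CDS" and row["product"] == "hypothetical protein":
            hypothetical += 1
    return per_type, hypothetical


# Made-up rows using the column names given in the table above.
EXAMPLE_TSV = (
    "locus_tag\tftype\tlen_bp\tgene\tEC_number\tCOG\tproduct\n"
    "EX_00001\tCDS\t300\t\t\t\thypothetical protein\n"
    "EX_00002\tCDS\t1350\tdnaA\t\t\tChromosomal replication initiator protein DnaA\n"
    "EX_00003\ttRNA\t76\t\t\t\ttRNA-Ala\n"
)

per_type, hypothetical = summarize_prokka_tsv(io.StringIO(EXAMPLE_TSV))
print(per_type, hypothetical)
```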



Further structural annotation

Bakta already provides a lot of information, especially regarding CDSs and RNAs, but some structural annotation might be missing (e.g. plasmids) or might be interesting to identify independently.

Plasmids

In assembled bacterial genomes, plasmids often appear as separate contigs, distinct from the chromosomal sequence.
For this reason, dedicated tools are commonly used to identify plasmid-derived contigs and distinguish plasmid-associated genes from chromosomal genes, a distinction that is particularly important for antimicrobial resistance (AMR) analyses.

To identify plasmids in our contigs, we use PlasmidFinder, a tool for the identification and typing of plasmid sequences in whole-genome sequencing data. It uses the PlasmidFinder database, which contains hundreds of sequences, to predict the plasmids present in the data.

  1. PlasmidFinder with the following parameters:
    • In Input parameters:
      • Choose a fasta or fastq file: Contig file
      • PlasmidFinder database: most recent one
  2. Click on Run Tool

PlasmidFinder generates several outputs:

  • raw_results.txt: A text file containing the result table and alignments

  • results.tsv: A tabular file with the following columns:

    1. Database

    2. Plasmid: Plasmid against which the input genome has been aligned.

    3. Identity: Percent identity in the alignment between the best matching plasmid in the database and the corresponding sequence in the input genome (also called the high-scoring segment pair (HSP)). A perfect alignment is 100%, but must also cover the entire length of the plasmid in the database.

    4. Query/Template Length: Query length is the length of the best matching plasmid in the database, while HSP length is the length of the alignment between the best matching plasmid and the corresponding sequence in the genome.

    5. Contig: Name of contig the plasmid is found in.

    6. Position in contig: Starting position of the found gene in the contig.

    7. Note: Notes about the plasmid

    8. Accession number: Reference GenBank accession number according to NCBI for the plasmid in the database.

  • plasmid.fasta: A fasta file containing the best matching sequences from the query genome

  • hit_in_genome.fasta: A fasta file containing the best matching plasmid genes from the database
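Because a hit is only a perfect match when the identity is 100% and the alignment covers the whole database plasmid, it can help to compute the coverage explicitly. The sketch below assumes the Query/Template Length column holds two integers separated by a slash (for example "682 / 682"); that format, the function names, and the default thresholds are assumptions chosen for illustration, not values taken from this tutorial.

```python
def alignment_coverage(length_field: str) -> float:
    """Return the fraction of the database plasmid covered by the alignment.

    Assumption: the field holds two integers separated by '/'. The order of
    the two values is not relied on: ignoring gaps, the alignment cannot be
    longer than the plasmid it matches, so the smaller value is treated as
    the alignment length.
    """
    first, second = (int(part) for part in length_field.split("/"))
    return min(first, second) / max(first, second)


def is_full_length_match(identity: float, length_field: str) -> bool:
    """True only for 100% identity over the whole database plasmid."""
    return identity == 100.0 and alignment_coverage(length_field) == 1.0


def passes_thresholds(identity: float, length_field: str,
                      min_identity: float = 95.0,
                      min_coverage: float = 0.6) -> bool:
    """Apply illustrative (hypothetical) identity and coverage cut-offs."""
    return (identity >= min_identity
            and alignment_coverage(length_field) >= min_coverage)


print(is_full_length_match(100.0, "682 / 682"))
print(passes_thresholds(100.0, "341 / 682"))
```

A hit with 100% identity but low coverage, like the second call, only matches part of the database plasmid and deserves a closer look before it is reported.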

Question

  1. How many plasmid sequences have been found?

  2. Where are they located?

  3. Are these sequences all associated with Staphylococcus aureus? (Look up the accession numbers on NCBI.)

  4. What can we conclude about contig00019?



Integrons

Integrons are genetic mechanisms that allow bacteria to adapt and evolve rapidly through the stockpiling and expression of new genes. An integron is minimally composed of:

  • a gene encoding a site-specific recombinase (intI)

  • a proximal recombination site (attI), which is recognized by the integrase and at which gene cassettes may be inserted

  • a promoter (Pc) which directs transcription of cassette-encoded genes

To detect integrons, we will use IntegronFinder.

  1. IntegronFinder with the following parameters:
    • Replicon file: Contig file
    • Thorough local detection: Yes
    • Search also for promoter and attI sites?: Yes
    • Remove log file: Yes

IntegronFinder generates 2 outputs:

  1. A summary giving, for each input sequence, the number of identified CALIN elements, In0 elements, and complete integrons.

  2. An integron annotation file in tabular format
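The three categories in the summary can be read as a small decision rule. The sketch below encodes the definitions as they are commonly described for IntegronFinder: a complete integron has an integrase and at least one attC site, an In0 element has an integrase but no attC site, and a CALIN is a cluster of attC sites lacking an integrase. The function name and the calin_min_attc parameter (minimum number of attC sites to report a CALIN) are hypothetical and only illustrate the logic; they are not IntegronFinder's own code.

```python
def classify_integron(has_integrase: bool, n_attc: int,
                      calin_min_attc: int = 2) -> str:
    """Classify an element from its integrase and attC content.

    calin_min_attc is an illustrative threshold, stated here as an assumption.
    """
    if n_attc < 0:
        raise ValueError("n_attc must be non-negative")
    if has_integrase:
        # Integrase with cassettes: complete integron; without: In0.
        return "complete" if n_attc >= 1 else "In0"
    # No integrase: only a cluster of attC sites can be reported.
    return "CALIN" if n_attc >= calin_min_attc else "none"


for integrase, attc in [(True, 3), (True, 0), (False, 4), (False, 1)]:
    print(integrase, attc, classify_integron(integrase, attc))
```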

Question

  • How many integron elements have been found?



IS (Insertion Sequence) elements

An insertion sequence (IS) element is a short DNA sequence that acts as a simple transposable element. IS elements are the smallest but most abundant autonomous transposable elements in bacterial genomes. They only code for proteins involved in transposition, and they play a key role in bacterial genome organization and evolution.

To detect IS elements, we will use ISEScan.

  1. ISEScan with the following parameters:
    • Genome fasta input: Contig file

ISEScan generates several files:

  • A summary as a table

  • The results as a table

  • The results as a GFF file

  • Several FASTA files:

    • IS nucleotide sequences
    • ORF nucleotide sequences
    • ORF amino acid sequences
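The questions below can also be approached programmatically from the results table. The sketch assumes the table is tab-separated with a header containing the columns seqID, family, isBegin and isEnd; these column names, the function name, and the example rows are assumptions made for illustration and should be checked against the actual ISEScan output.

```python
import csv
import io
from collections import Counter, defaultdict


def summarize_is_table(handle):
    """Return (IS elements per family, list of (begin, end) per contig)."""
    reader = csv.DictReader(handle, delimiter="\t")
    per_family = Counter()
    locations = defaultdict(list)
    for row in reader:
        per_family[row["family"]] += 1
        locations[row["seqID"]].append((int(row["isBegin"]), int(row["isEnd"])))
    return per_family, dict(locations)


# Made-up rows; real output contains more columns, which are ignored here.
EXAMPLE_TABLE = (
    "seqID\tfamily\tisBegin\tisEnd\n"
    "contig00001\tIS3\t1500\t2800\n"
    "contig00001\tIS6\t9000\t9800\n"
    "contig00019\tIS6\t120\t910\n"
)

families, where = summarize_is_table(io.StringIO(EXAMPLE_TABLE))
print(sum(families.values()), families, where)
```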

Question

  1. How many IS elements have been detected?

  2. Where are they located?

  3. What are the different IS families?



Visualisation of the annotation

We would like to look at the annotation in JBrowse, combining several sources of information:

  1. Annotations identified by Bakta

  2. Plasmid sequences identified by PlasmidFinder

  3. Integrons identified by IntegronFinder

  4. IS elements identified by ISEScan

JBrowse needs the annotations to be in GFF format. Both Bakta and ISEScan generated GFF files. For PlasmidFinder and IntegronFinder, we need to reformat the outputs.

Transform PlasmidFinder to GFF

PlasmidFinder generated the results.tsv file with all the needed information. To transform it to GFF, we need to:

  1. Split the 6th column on .. to put the start and end into 2 separate columns

  2. Remove from column 5 everything after the contig name

  3. Remove the 1st line

  4. Transform to GFF3

These steps are performed with the following tools:

  1. Replace Text in a specific column with the following parameters:
    • File to process: results.tsv output of PlasmidFinder
    • In Replacement:
      • In “1: Replacement”
        • in column: Column: 6

        • Find pattern: (.*)\.\.(.*)

        • Replace with: \1\t\2

        This will split the content of the 6th column on .. and put it into column 6 and column 7. The original columns 7 and 8 then become columns 8 and 9.

      • Insert Replacement
      • In “2: Replacement”
        • in column: Column: 5

        • Find pattern: (.*)( len.*)

        • Replace with: \1

        This will remove from column 5 everything after the contig name.

  2. Select last lines from a dataset (tail) with the following parameters:
    • Text file: output of Replace Text above

    • Operation: Keep everything from this line on

    • Number of lines: 2

  3. Table to GFF3 with the following parameters:
    • Table: output of the above Select last tool step

    • Record ID column or value: 5
    • Start column or value: 6
    • End column or value: 7
    • Type column or value: 2
    • Score column or value: 3
    • Source column or value: 1

    • Insert Qualifiers
      • Name: name
      • Qualifier value column or raw text: 8
    • Insert Qualifiers
      • Name: accession
      • Qualifier value column or raw text: 9
  4. Rename the output to PlasmidFinder GFF
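For readers who want to see the same transformation outside Galaxy, the sketch below mirrors the steps above in Python: it splits the position on "..", keeps only the contig name, skips the header line, and writes one GFF3 line per hit using the same column mapping as the Table to GFF3 step. It assumes 8 tab-separated columns in the order listed earlier for results.tsv; the function names and the example row are invented for illustration, and attribute values are not percent-encoded as a strict GFF3 writer would do.

```python
import re


def plasmidfinder_row_to_gff3(row: str) -> str:
    """Convert one results.tsv data row into a GFF3 feature line."""
    (database, plasmid, identity, _length,
     contig, position, note, accession) = row.rstrip("\n").split("\t")
    # Split "start..end" into its two coordinates.
    match = re.fullmatch(r"(\d+)\.\.(\d+)", position.strip())
    if match is None:
        raise ValueError(f"unexpected position format: {position!r}")
    start, end = match.groups()
    # Keep only the contig name, dropping anything after the first space.
    seqid = contig.split()[0]
    attributes = f"name={note};accession={accession}"
    return "\t".join(
        [seqid, database, plasmid, start, end, identity, ".", ".", attributes]
    )


def plasmidfinder_to_gff3(lines):
    """Skip the header line and convert every remaining non-empty row."""
    rows = [line for line in lines if line.strip()]
    output = ["##gff-version 3"]
    output.extend(plasmidfinder_row_to_gff3(row) for row in rows[1:])
    return output


# Made-up example row, not a real PlasmidFinder result.
EXAMPLE_ROW = "\t".join([
    "example_db", "rep_example", "100.0", "945 / 945",
    "contig00019 len=2473 cov=10.0", "656..1600", "example note", "ACC0001",
])

for gff_line in plasmidfinder_to_gff3(["header line\n", EXAMPLE_ROW + "\n"]):
    print(gff_line)
```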



Transform IntegronFinder output to GFF (if integrons found)

The IntegronFinder tabular output can be transformed to GFF by:

  1. Replacing NA values in column 7 with 0

  2. Removing the first two lines

  3. Transforming to GFF3

These steps are performed with the following tools:

  1. Replace Text in a specific column with the following parameters:
    • File to process: integron annotation output of IntegronFinder
    • In Replacement:
      • In “1: Replacement”
        • in column: Column: 7

        • Find pattern: NA

        • Replace with: 0

  2. Select last lines from a dataset (tail) with the following parameters:
    • Text file: output of Replace Text above

    • Operation: Keep everything from this line on

    • Number of lines: 3

  3. Table to GFF3 with the following parameters:
    • Table: output of the above Select last tool step

    • Record ID column or value: 2
    • Start column or value: 4
    • End column or value: 5
    • Type column or value: 11
    • Score column or value: 7
    • Source column or value: IntegronFinder

    • Insert Qualifiers
      • Name: name
      • Qualifier value column or raw text: 3
    • Insert Qualifiers
      • Name: annotation
      • Qualifier value column or raw text: 9
  4. Rename the output to IntegronFinder GFF
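The same logic can be sketched in Python using only the column positions given in the steps above (record ID in column 2, start and end in columns 4 and 5, score in column 7, type in column 11, qualifiers from columns 3 and 9). The function name and the example lines are hypothetical, and the layout of the first two lines is an assumption; they are simply skipped, as in the Galaxy steps.

```python
def integron_table_to_gff3(lines, source="IntegronFinder"):
    """Convert the tabular integron annotation into GFF3 lines."""
    output = ["##gff-version 3"]
    for line in list(lines)[2:]:  # skip the first two lines
        if not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # Column 7 (index 6) holds the score; NA becomes 0.
        score = "0" if cols[6] == "NA" else cols[6]
        attributes = f"name={cols[2]};annotation={cols[8]}"
        output.append("\t".join([
            cols[1],   # record ID (column 2)
            source,
            cols[10],  # type (column 11)
            cols[3],   # start (column 4)
            cols[4],   # end (column 5)
            score,
            ".",       # strand is not set, as in the Galaxy step
            ".",       # phase
            attributes,
        ]))
    return output


# Made-up example with 11 columns, not real IntegronFinder output.
EXAMPLE_LINES = [
    "# first line to skip\n",
    "second line to skip\n",
    "\t".join(["integron_01", "contig00001", "elem_1", "100", "400", "1",
               "NA", "protein", "intI", "model", "complete"]) + "\n",
]

for gff_line in integron_table_to_gff3(EXAMPLE_LINES):
    print(gff_line)
```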



Visualize the Genome

We can now launch JBrowse with several information tracks.

  1. JBrowse with the following parameters:
    • Reference genome to display: Use a genome from history
      • Select the reference genome: genome.fa
      • Genetic Code: 11. The Bacterial, Archaeal and Plant Plastid Code
      • In Track Group:
        • Insert Track Group
          • Track Category: Bakta
          • In Annotation Track:
            • Insert Annotation Track
              • Track Type: GFF/GFF3/BED Features
              • GFF/GFF3/BED Track Data: annotation_and_sequences (output of Bakta)
              • JBrowse Track Type [Advanced]: Neat Canvas Features
              • Track Visibility: On for new users
        • Insert Track Group
          • Track Category: Plasmid sequences
          • In Annotation Track:
            • Insert Annotation Track
              • Track Type: GFF/GFF3/BED Features
              • GFF/GFF3/BED Track Data: PlasmidFinder GFF
              • JBrowse Track Type [Advanced]: Neat Canvas Features
              • Track Visibility: On for new users
        • Insert Track Group
          • Track Category: IS elements
          • In Annotation Track:
            • Insert Annotation Track
              • Track Type: GFF/GFF3/BED Features
              • GFF/GFF3/BED Track Data: GFF output of ISEScan
              • JBrowse Track Type [Advanced]: Neat Canvas Features
              • Track Visibility: On for new users

        If integrons were found by IntegronFinder:

        • Insert Track Group
          • Track Category: Integrons
          • In Annotation Track:
            • Insert Annotation Track
              • Track Type: GFF/GFF3/BED Features
              • GFF/GFF3/BED Track Data: IntegronFinder GFF
              • JBrowse Track Type [Advanced]: Neat Canvas Features
              • Track Visibility: On for new users
  2. View the output of JBrowse

In the JBrowse output, you can view the genes, IS elements, plasmid sequences, etc. on the contigs. With the search tools, you can easily find genes of interest. JBrowse can handle many inputs, which makes it useful for comparing annotations from different tools.


Question

  1. Have all sequences identified by PlasmidFinder on contig00019 been identified by Bakta?

  2. Have all sequences identified by ISEScan on contig00019 been identified by Bakta?
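Besides inspecting the tracks visually, these questions come down to checking whether intervals on the same contig overlap. A minimal sketch of that check, using 1-based inclusive coordinates as in GFF, is shown here; the function names, the tuple layout (contig, start, end, label), and the example coordinates are made up for illustration.

```python
def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two 1-based inclusive intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def features_overlapping(query, features):
    """Return the features on the same contig that overlap the query.

    query and each feature are (contig, start, end, label) tuples.
    """
    contig, start, end, _label = query
    return [
        feature for feature in features
        if feature[0] == contig and overlaps(start, end, feature[1], feature[2])
    ]


# Made-up coordinates, not results from this tutorial's data.
plasmid_hit = ("contig00019", 656, 1600, "example plasmid hit")
bakta_features = [
    ("contig00019", 600, 1650, "example CDS A"),
    ("contig00019", 1700, 2300, "example CDS B"),
    ("contig00001", 656, 1600, "example CDS C"),
]

print(features_overlapping(plasmid_hit, bakta_features))
```

A hit with no overlapping annotated feature would be a candidate for a sequence found by one tool but not by Bakta.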



Conclusion

In this tutorial, contigs were annotated with different tools and then visualized.
