Genome annotation with Prokka

How can we annotate a bacterial genome? How can we visualize annotated genomic features?


Genome annotation is the comprehensive process of deciphering the genetic blueprint of an organism, identifying its various elements, and assigning meaning to those elements. It involves two primary components:

  1. Structural Annotation: Positions of genomic features along the genome, ie. determining the physical organization of the genome
    • Gene Prediction: Identifying the locations and boundaries of genes within the DNA sequence.
    • Intron-Exon Boundaries: Defining the positions of introns and exons in genes.
    • Promoters and Regulatory Elements: Identifying regions responsible for gene regulation, such as promoters, enhancers, and transcription factor binding sites.
    • Non-Coding RNAs: Discovering non-coding RNA elements, including tRNAs, rRNAs, microRNAs, and long non-coding RNAs.
  2. Functional Annotation: Assigning functions to those features, ie. etermining the biological roles and functions of the annotated elements.
    • Protein Function: Assigning functions to proteins encoded by the genes.
    • Metabolic Pathways: Linking genes to specific metabolic pathways, providing insights into the organism’s biochemical capabilities.
    • Gene Ontology (GO) Terms: Categorizing genes based on their molecular functions, biological processes, and cellular components using the Gene Ontology database.
    • Homology and Orthology: Establishing relationships between genes in the annotated genome and those in other organisms, allowing for comparative genomics.

In this section we will use a software tool called Prokka to annotate a draft genome sequence.

  • Prokka is a wrapper: it collects together several pieces of software (from various authors), and so avoids “re-inventing the wheel”.
  • Prokka finds and annotates features (both protein coding regions and RNA genes, i.e. tRNA, rRNA) present on a sequence.
  • Prokka uses a two-step process for the annotation of protein coding regions:
    1. protein coding regions on the genome are identified using Prodigal;
    2. the function of the encoded protein is predicted by similarity to proteins in one of many protein or protein domain databases.
  • Prokka is a software tool that can be used to annotate bacterial, archaeal and viral genomes quickly, generating standard output files in GenBank, EMBL and gff formats.

Go Up


Importing the data into Galaxy

NOTE! Please, use:

Prokka requires assembled contigs.

  1. Create a new history
  2. Upload datasets
    https://zenodo.org/record/1156405/files/contigs.fasta
    

Go Up


Annotate the genome

  1. Prokka Tool with the following parameters (leave everything else unchanged):
    • contigs to annotate: contigs.fasta

Examine the output

Once Prokka has finished, examine each of its output files.

  • GFF: Annotation in GFF3 format, containing both sequence and annotations: all of the information about the features annotated
  • GBK: Standard GenBank file. As GFF Ffile contain all of the information about the features annotated (in different format.)
  • .txt: Statitstics relating to the features annotated.
  • .faa: The protein sequences of the genes annotated (FASTA).
  • .ffn: The nucleotide sequences of the genes annotated (FASTA).

Go Up


View annotated features in JBrowse

Now that we have annotated the draft genome sequence, we would like to view the sequence in the JBrowse genome viewer.

  1. JBrowse Tool with the following parameters
    • Reference genome to display: Use a genome from history
    • Select the reference genome: fna output of Prokka Tool
    • Genetic Code: 11: The Bacterial, Archaeal and Plant Plastid Code
    • In Track Group:
      • Insert Track Group
      • Track Category: gene annotations
        • Track Type: GFF/GFF3/BED Features
        • GFF/GFF3/BED Track Data: gff output of Prokka

Click on Insert Track Group

  1. Inspect the output
  2. Display all the tracks and practice maneuvering around
    • Click on the tick boxes on the left to display the tracks
    • Select contig 1 in the drop down box. You can only see one contig displayed at a time.
    • Zoom by clicking on the plus and minus buttons.
    • Right click on a gene/feature annotation (the bars on the annotation track), then select View Details to see more information.
      • gene name
      • product name
      • you can download the FASTA sequence by clicking on the disk icon

Go Up


MultiQC Report

  1. MultiQC tool with following parameters
    • In “Results”:
      • Insert Results
        • Which tool was used generate logs?: Prokka
        • *In “Prokka Output”:
          • Prokka Output: Prokka .txt
    • Report title: Genome Annotation Report

Go Up


Hands-On

Now, it’s your turn!

Go Up