Quality Control

Introduction to quality control of NGS raw data: key quality metrics, common issues, and strategies to improve dataset quality.

Sequencing introduces errors, such as incorrectly called nucleotides, because of the technical limitations of each sequencing platform. These errors can bias the analysis and lead to misinterpretation of the data. Adapter sequences may also be present when reads are longer than the sequenced fragments; trimming them may increase the number of reads that map.

Sequence quality control is therefore an essential first step in analysis.
It is necessary to understand, identify and exclude the types of error that could affect the interpretation of downstream analyses.
Catching errors early saves time later on.

Inspect a raw sequence file

Create a new history for this tutorial and give it a descriptive name.

Import the following file, a microbiome sample from a snake:

https://zenodo.org/record/3977236/files/female_oral2.fastq-4143.gz

Rename the imported dataset to Reads.
Inspect the FASTQ file by clicking on the eye icon of the dataset in the history.

Each read, representing a fragment of the library, is encoded by 4 lines:

1. Always begins with @, followed by information about the read
2. The actual nucleic acid sequence
3. Always begins with + and sometimes repeats the information from line 1
4. A string of characters representing the quality score of each base of the sequence; it must have the same number of characters as line 2
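The four-line layout described above can be checked programmatically. The following is a minimal sketch, not a replacement for a dedicated parser: the names `FastqRecord` and `read_fastq` are hypothetical, the two toy reads are made up for illustration, and the code assumes each record occupies exactly four lines (no wrapped sequences).

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read: the three informative lines of a four-line FASTQ record."""
    name: str       # line 1 without the leading "@"
    sequence: str   # line 2
    quality: str    # line 4


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an open text handle, checking the four-line layout."""
    lines = (line.rstrip("\r\n") for line in handle)
    for header in lines:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(lines, None)
        separator = next(lines, None)
        quality = next(lines, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError(f"truncated record: {header!r}")
        if not header.startswith("@"):
            raise ValueError(f"line 1 must start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"line 3 must start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"lines 2 and 4 differ in length: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Toy input (made up for illustration, not taken from the tutorial dataset).
toy_fastq = io.StringIO("@read1 example\nACGT\n+\nGGF(\n@read2\nTTA\n+read2\n++)\n")
for record in read_fastq(toy_fastq):
    print(record.name, len(record.sequence), record.quality)
```

For a gzip-compressed file such as the one imported above, the same function should work on a handle opened in text mode with `gzip.open(path, "rt")` from the Python standard library.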

So for example, the first sequence in our file is:

@M00970:337:000000000-BR5KF:1:1102:17745:1557 1:N:0:CGCAGAAC+ACAGAGTT
GTGCCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAAGAACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAGAAGCAAATGACGATTCAAGAAAGAAAAAAACACAGAATACTAACAATAAGTCATAAACATCATCAACATAAAAAAGGAAATACACTTACAACACATATCAATATCTAAAATAAATGATCAGCACACAACATGACGATTACCACACATGTGTACTACAAGTCAACTA
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGAFFGGFGGGGGGGGFGGGGGGGGGGGGGGFGGG+38+35*311*6,,31=******441+++0+0++0+*1*2++2++0*+*2*02*/***1*+++0+0++38++00++++++++++0+0+2++*+*+*+*+*****+0**+0**+***+)*.***1**//*)***)/)*)))*)))*),)0(((-((((-.(4(,,))).,(())))))).)))))))-))-(

This means that the read named M00970:337:000000000-BR5KF:1:1102:17745:1557 (line 1 without the leading @ and the trailing description) corresponds to the DNA sequence:

GTGCCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAAGAACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAGAAGCAAATGACGATTCAAGAAAGAAAAAAACACAGAATACTAACAATAAGTCATAAACATCATCAACATAAAAAAGGAAATACACTTACAACACATATCAATATCTAAAATAAATGATCAGCACACAACATGACGATTACCACACATGTGTACTACAAGTCAACTA

The bases of this sequence were called with the qualities encoded by the following string:

GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGAFFGGFGGGGGGGGFGGGGGGGGGGGGGGFGGG+38+35*311*6,,31=******441+++0+0++0+*1*2++2++0*+*2*02*/***1*+++0+0++38++00++++++++++0+0+2++*+*+*+*+*****+0**+0**+***+)*.***1**//*)***)/)*)))*)))*),)0(((-((((-.(4(,,))).,(())))))).)))))))-))-(
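The header line of this read also packs several fields into one string. The sketch below splits it, assuming the header follows the Illumina (CASAVA 1.8+) identifier convention `instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`. That convention is not stated in this tutorial, and the names `IlluminaHeader` and `parse_illumina_header` are hypothetical. Headers from other platforms or older pipelines will not match and raise an error.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read_number: int
    is_filtered: bool   # True when the filter flag is "Y"
    control_number: int
    index_sequence: str


def parse_illumina_header(header: str) -> IlluminaHeader:
    """Split a header line into its fields (assumed CASAVA 1.8+ layout)."""
    if header.startswith("@"):
        header = header[1:]
    location, _, description = header.partition(" ")
    location_fields = location.split(":")
    description_fields = description.split(":")
    if len(location_fields) != 7 or len(description_fields) != 4:
        raise ValueError(f"unexpected header layout: {header!r}")
    instrument, run, flowcell, lane, tile, x_pos, y_pos = location_fields
    read_number, filtered, control, index = description_fields
    return IlluminaHeader(
        instrument, int(run), flowcell, int(lane), int(tile),
        int(x_pos), int(y_pos), int(read_number), filtered == "Y",
        int(control), index,
    )


# Header of the first read in the tutorial file.
first_header = "@M00970:337:000000000-BR5KF:1:1102:17745:1557 1:N:0:CGCAGAAC+ACAGAGTT"
for field, value in parse_illumina_header(first_header)._asdict().items():
    print(f"{field}: {value}")
```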

The quality line of each read is a string of characters, one for each base of the nucleic acid sequence, that describes the probability that the base was called incorrectly. The scores are encoded using the ASCII character table.

Each nucleotide therefore has an associated ASCII character representing its Phred quality score. The Phred score Q is not itself a probability; it is derived from the probability P of an incorrect base call as Q = -10 log10(P), so Q = 10 means 1 error in 10 calls and Q = 30 means 1 error in 1000.
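To make the ASCII encoding concrete, the sketch below converts quality characters to Phred scores and error probabilities. It assumes the Phred+33 offset used by Sanger and Illumina 1.8+ files; older Illumina pipelines used a different offset and are not handled. The function names are hypothetical, and the example characters are picked from the quality string shown earlier.

```python
PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ encoding


def phred_score(char: str) -> int:
    """Phred quality score encoded by a single ASCII character."""
    return ord(char) - PHRED_OFFSET


def error_probability(score: float) -> float:
    """Probability of an incorrect base call, P = 10^(-Q/10)."""
    return 10 ** (-score / 10)


def decode_quality(quality: str) -> list:
    """Phred scores for every character of a FASTQ quality line."""
    return [phred_score(char) for char in quality]


# A few characters that occur in the quality string of the first read.
for char in "F=+(":
    score = phred_score(char)
    probability = error_probability(score)
    print(f"{char!r}: Q = {score}, P(error) = {probability:.5f}, "
          f"accuracy = {100 * (1 - probability):.3f}%")
```

The same three functions can be applied to any position of the read above, which is one way to work through the questions that follow.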

Question

Which ASCII character corresponds to the worst Phred score for Illumina 1.8+?
What is the Phred quality score of the 3rd nucleotide of the 1st sequence?
What is the accuracy of this 3rd nucleotide?

Galaxy Training for Pathogen Genomics, AMR Detection & Virus ID
