Explains how to set up input data, including organizing sample data and supported data formats. Provides guidance on configuring the Musta CLI and customizing variant calling and annotation. Discusses the use of configuration files.
The Musta framework provides a high degree of flexibility for configuring your somatic mutation analysis. It organizes cancer sample processing tools into three distinct analysis modules: detection, classification, and interpretation.
You can run each module independently or combine them for a comprehensive analysis workflow tailored to your specific research needs.
In Musta, input datasets are provided through a YAML file called samples.yml.
The samples.yml file serves as a configuration file in which you specify the input datasets for your analysis. The Musta framework processes these datasets based on the chosen analysis module.
Here is a template of the samples.yml file:
patient_A:
  normal_sample_name:
    - N1
  tumor_sample_name:
    - T1
  normal_bam:
    - path/to/bam/N1.bam
  tumor_bam:
    - path/to/bam/T1.bam
  vcf:
    - path/to/vcf/patient_A.vcf
  maf:
    - path/to/maf/patient_A.maf
patient_B:
  normal_sample_name:
    - N2
  tumor_sample_name:
    - T2
  normal_bam:
    - path/to/bam/N2.bam
  tumor_bam:
    - path/to/bam/T2.bam
  vcf:
    - path/to/vcf/patient_B.vcf
  maf:
    - path/to/maf/patient_B.maf
In this samples.yml file, we have included two patients, patient_A and patient_B, each with associated sample information. For each patient, you can specify the following details:
- normal_sample_name: the name of the matched normal sample
- tumor_sample_name: the name of the tumor sample
- normal_bam: the path to the BAM file of the normal sample
- tumor_bam: the path to the BAM file of the tumor sample
- vcf: the path to the patient’s VCF file
- maf: the path to the patient’s MAF file
You can add more patients or samples as needed to configure the samples.yml file for your specific somatic mutation analysis. This file provides Musta with the essential information required for each analysis module.
WARNING! | Pay Attention
Every BAM file must be indexed with the samtools command (see: http://www.htslib.org/doc/samtools-index.html):
samtools index /path/to/bam/your.bam
The resulting index file (your.bam.bai) must be stored in the same directory as the original BAM file.
Every VCF file must be compressed with bgzip and indexed with tabix:
bgzip -c /path/to/vcf/your.vcf > /path/to/vcf/your.vcf.gz && tabix -p vcf /path/to/vcf/your.vcf.gz
This command will create a compressed VCF file (/path/to/vcf/your.vcf.gz) and an associated index file (/path/to/vcf/your.vcf.gz.tbi).
In addition to the input datasets, Musta also relies on a set of supplementary files that are essential for the execution of various steps within its workflow. These files play a crucial role in ensuring that Musta can effectively perform its analysis and deliver accurate results.
The Detection and Classification modules in Musta require the reference sequence as a single FASTA file containing all contigs. This file must adhere to the FASTA standard. Additionally, you must provide a dictionary file with a .dict extension and an index file with a .fai extension. The dictionary and index files must share the same basename as the FASTA file, as Musta identifies them based on this naming convention.
To generate the necessary dictionary and index files for your reference FASTA sequence, you can follow these instructions provided by the Broad Institute:
1. Use the CreateSequenceDictionary tool to create a .dict file from your FASTA file. You only need to specify the input reference; the tool will automatically name the output file appropriately:
gatk-launch CreateSequenceDictionary -R ref.fasta
This produces a file named ref.fasta.dict, which describes the contents of your FASTA file.
2. Use the faidx command in Samtools to prepare the FASTA index file. This file contains byte offsets in the FASTA file for each contig, enabling precise access to reference bases at specific genomic coordinates:
samtools faidx ref.fasta
This produces a file named ref.fasta.fai. Each line in this file corresponds to a contig in the FASTA sequence and provides information about the contig’s size, location, basesPerLine, and bytesPerLine.
Following these steps will ensure that you have the required dictionary and index files for your reference FASTA sequence, which are essential for Musta’s Detection and Classification modules.
In any case, we recommend using one of the reference genome builds provided in the GATK Resource Bundle.
The Musta Detection module includes GATK Mutect as a Variant Caller, which requires a population VCF of germline sequencing data as a resource. This population VCF should contain allele frequencies of common and rare variants.
You can find these resource files in the Google Bucket, in folders organized by reference genome build.
The resource files have names similar to af-only-gnomad.vcf and af-only-gnomad.vcf.idx. These files are essential for running the Mutect variant caller in the Musta Detection module.
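As a sketch, the files can be copied locally with gsutil; the bucket folder below is a placeholder, so substitute the folder matching your reference genome build:
gsutil cp gs://path/to/bucket/af-only-gnomad.vcf /path/to/resources/hg38/      # placeholder bucket path
gsutil cp gs://path/to/bucket/af-only-gnomad.vcf.idx /path/to/resources/hg38/  # placeholder bucket path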
The Somatic Variant Calling Workflow based on GATK Mutect in Musta involves the estimation of the fraction of reads due to cross-sample contamination for each tumor sample. To calculate contamination, Musta requires an input VCF file containing variants and allele frequencies for comparison.
To access the required resource files for this process, you can find them in the Google Bucket, again in folders organized by reference genome build.
The resource files have names similar to small_exac_common.vcf and small_exac_common.vcf.idx.
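The same gsutil pattern applies here (again with a placeholder bucket path):
gsutil cp gs://path/to/bucket/small_exac_common.vcf* /path/to/resources/hg38/  # copies the VCF and its .idx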
In some scenarios, such as target sequencing, it may be necessary to focus the analysis on specific genomic regions. To do this, you must provide a compressed and indexed BED file that lists the regions you want to restrict the analysis to. This file allows Musta to concentrate its analysis efforts on the specified genomic regions, providing a more targeted and efficient analysis.
To compress and index a BED file, you can use the following command:
bgzip -c /path/to/your.bed > /path/to/your.bed.gz && tabix -p bed /path/to/your.bed.gz
In the absence of specific BED files, we recommend using the standard coding region, which can be downloaded from the UCSC Genome Browser.
Please note that using the standard coding region is a common practice when you don’t have specific BED files tailored to your analysis, as it covers the coding regions of genes and is often a reasonable default choice for many variant analysis workflows.
The Annotation and Detection modules, which include variant callers like Muse and Lofreq, require a dbSNP file. This file is a compressed and indexed VCF file that contains known germline variants.
You can obtain these dbSNP files from the Google Bucket, in folders organized by reference genome build. Please download the appropriate dbSNP file for your reference genome.
To compress and index your dbSNP file, run the usual command:
bgzip -c /path/to/vcf/your.vcf > /path/to/vcf/your.vcf.gz && tabix -p vcf /path/to/vcf/your.vcf.gz
We recommend organizing supplementary resources in the same location, separated by reference genome build, such as:
/path/to/resources/hg19
/path/to/resources/hg38
This structured organization helps ensure that the relevant resources are easily accessible and correctly associated with the reference genome build you are working with. It contributes to a smoother and more organized analysis workflow when using Musta for somatic variant detection, classification, and interpretation.
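A minimal sketch of creating this layout (paths are illustrative):
mkdir -p /path/to/resources/hg19 /path/to/resources/hg38   # one folder per reference genome build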
In the context of the Musta Classification Module, annotation sources refer to the databases and resources used to gather information about genetic variants. These sources provide functional annotations and additional information about the variants, such as their impact on genes, protein products, regulatory elements, and their prevalence in different populations.
Annotation sources typically include databases like dbSNP, gnomAD, ClinVar, COSMIC, and many others. Each of these databases collects, curates, and maintains data about genetic variations in the human genome. Variant annotators use these databases to assess whether a specific variant has been previously reported in the germline, its association with diseases, population frequencies, and other relevant information.
To significantly improve the speed and efficiency of the Classification Module, it is highly recommended to keep the necessary data sources on local storage.
Data sources are organized within specific folders that are provided as input arguments. While you can specify multiple data source folders, it’s essential to ensure that no two data sources share the same name to avoid conflicts.
Each primary data source folder contains sub-directories for individual data sources, with further sub-directories for specific references (e.g., hg19 or hg38). Within the reference-specific data source directory, you’ll find a crucial configuration file that contains information about the data source and instructions on how to match it to a variant. This configuration file is a mandatory component of the data source.
Here’s an example of the structure for a data source directory:
dataSourcesFolder/
    Data_Source_1/
        hg19/
            data_source_1.config
            data_source_1.data.file.one
            data_source_1.data.file.two
            data_source_1.data.file.three
            ...
        hg38/
            data_source_1.config
            data_source_1.data.file.one
            data_source_1.data.file.two
            data_source_1.data.file.three
            ...
    Data_Source_2/
        hg19/
            data_source_2.config
            data_source_2.data.file.one
            data_source_2.data.file.two
            data_source_2.data.file.three
            ...
        hg38/
            data_source_2.config
            data_source_2.data.file.one
            data_source_2.data.file.two
            data_source_2.data.file.three
            ...
    ...
GATK provides two pre-packaged data source sets, facilitating Funcotator’s use without extensive additional configuration. These data source packages correspond to germline and somatic use cases. For somatic analysis, you should begin with the somatic data sources.
You can access versioned gzip archives of data source files for download. These archives are available at the link: gs://broad-public-datasets/funcotator/
You can find the full documentation about Funcotator Data sources at the following link: Funcotator Information and Tutorial.
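As a sketch, the archives can be listed and fetched with gsutil; the archive name below is a placeholder, so pick the somatic package shown by the listing:
gsutil ls gs://broad-public-datasets/funcotator/
gsutil cp gs://broad-public-datasets/funcotator/funcotator_dataSources.somatic.tar.gz /path/to/data-sources/   # placeholder archive name
tar -xzf /path/to/data-sources/funcotator_dataSources.somatic.tar.gz -C /path/to/data-sources/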
VEP (Variant Effect Predictor) offers efficient variant annotation by utilizing various annotation sources. The most efficient method is to use a cache: a downloadable file containing comprehensive data specific to a species, including transcript models, regulatory features, and variant details. Leveraging a cache minimizes network connections and optimizes performance by allowing most data to be read from local disk. This strategy significantly enhances the speed and efficiency of variant annotation through VEP.
Please download the cache for version 106 from the following URL: https://ftp.ensembl.org/pub/grch37/release-106/variation/indexed_vep_cache/
You can find the full documentation about VEP Cache at the following link: https://grch37.ensembl.org/info/docs/tools/vep/script/vep_cache.html
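A minimal sketch of downloading and unpacking the cache into the layout recommended below; the archive name is assumed from Ensembl’s homo_sapiens_vep_&lt;release&gt;_&lt;assembly&gt; naming convention:
mkdir -p /path/to/data-sources/vep
curl -O https://ftp.ensembl.org/pub/grch37/release-106/variation/indexed_vep_cache/homo_sapiens_vep_106_GRCh37.tar.gz   # assumed archive name
tar -xzf homo_sapiens_vep_106_GRCh37.tar.gz -C /path/to/data-sources/vep   # unpacks to vep/homo_sapiens/106_GRCh37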
We recommend organizing supplementary resources in the same location: /path/to/data-sources
Funcotator
├── clinvar
│   ├── hg19
│   └── hg38
├── cosmic
│   ├── hg19
│   └── hg38
├── dbsnp
│   ├── hg19
│   └── hg38
├── gencode
│   ├── hg19
│   └── hg38
├── gnomAD_exome
│   ├── hg19
│   └── hg38
└── gnomAD_genome
    ├── hg19
    └── hg38
vep
└── homo_sapiens
    └── 106_GRCh37
To test Musta, users can download the demo dataset from the following URL: https://repolab.crs4.it/phd-core/test-data-somatic.
The “test-data-somatic” dataset is a subset of publicly available genomic data from the Genome Sequence Archive of the Beijing Institute of Genomics, accessible with the accession ID PRJCA000091.
Originally derived from a Hepatocellular Carcinoma (HCC) tumor, this dataset includes 23 tumor biopsies and a tumor-adjacent matched normal sample, all sequenced at an average depth of 74.4×.
For this subset, we have specifically selected three tumor samples and one matched normal sample. Moreover, the dataset has been downsampled to include only 10% of the reads from chromosome 22.
Here’s a quick start guide to get you started with this dataset:
git clone https://repolab.crs4.it/phd-core/test-data-somatic.git
cd test-data-somatic
sh decompress.sh
This script will extract the necessary data from the archives.
This demo dataset serves as a useful resource for testing and familiarizing yourself with Musta’s capabilities before applying it to your own somatic mutation analysis.