Explains how to set up input data, including organizing sample data and supported data formats. Provides guidance on configuring the Musta CLI and customizing variant calling and annotation. Discusses the use of configuration files.
The Musta framework provides a high degree of flexibility for configuring your somatic mutation analysis. It organizes cancer sample processing tools into three distinct analysis modules: detection, classification, and interpretation.
You can run each module independently or combine them for a comprehensive analysis workflow tailored to your specific research needs.
In Musta, input datasets are provided through a YAML file called samples.yml.
The samples.yml file serves as a configuration file in which you specify the input datasets for your analysis. The Musta framework processes these datasets based on the chosen analysis module.
Here is a template of the samples.yml file:
patient_A:
  normal_sample_name:
    - N1
  tumor_sample_name:
    - T1
  normal_bam:
    - path/to/bam/N1.bam
  tumor_bam:
    - path/to/bam/T1.bam
  vcf:
    - path/to/vcf/patient_A.vcf
  maf:
    - path/to/maf/patient_A.maf
patient_B:
  normal_sample_name:
    - N2
  tumor_sample_name:
    - T2
  normal_bam:
    - path/to/bam/N2.bam
  tumor_bam:
    - path/to/bam/T2.bam
  vcf:
    - path/to/vcf/patient_B.vcf
  maf:
    - path/to/maf/patient_B.maf
In this samples.yml file, we have included two patients, patient_A and patient_B, each with associated sample information. For each patient, you can specify the following details:
- normal_sample_name: the name of the matched normal sample
- tumor_sample_name: the name of the tumor sample
- normal_bam: the path to the BAM file of the normal sample
- tumor_bam: the path to the BAM file of the tumor sample
- vcf: the path to the patient’s VCF file
- maf: the path to the patient’s MAF file
You can add more patients or samples as needed to configure the samples.yml file for your specific somatic mutation analysis. This file provides Musta with the essential information required for each analysis module.
WARNING! | Pay Attention
Every BAM file must be indexed with the samtools command (see: http://www.htslib.org/doc/samtools-index.html):
samtools index /path/to/bam/your.bam
The resulting index file (your.bam.bai) must be stored in the same directory as the original BAM file.
Every VCF file must be compressed with bgzip and indexed with tabix:
bgzip -c /path/to/vcf/your.vcf > /path/to/vcf/your.vcf.gz && tabix -p vcf /path/to/vcf/your.vcf.gz
This command will create a compressed VCF file (/path/to/vcf/your.vcf.gz) and an associated index file (/path/to/vcf/your.vcf.gz.tbi).
In addition to the input datasets, Musta also relies on a set of supplementary files that are essential for the execution of various steps within its workflow. These files play a crucial role in ensuring that Musta can effectively perform its analysis and deliver accurate results.
The Detection and Classification modules in Musta require the reference sequence as a single FASTA file containing all contigs. This file must adhere to the FASTA standard. Additionally, you must provide a dictionary file with a .dict extension and an index file with a .fai extension. The dictionary and index files must share the same basename as the FASTA file, as Musta identifies them based on this naming convention.
To generate the necessary dictionary and index files for your reference FASTA sequence, you can follow these instructions provided by the Broad Institute:
1. Use the CreateSequenceDictionary tool to create a .dict file from your FASTA file. You only need to specify the input reference; the tool will automatically name the output file appropriately:
gatk-launch CreateSequenceDictionary -R ref.fasta
This produces a file named ref.fasta.dict, which describes the contents of your FASTA file.
2. Use the faidx command in Samtools to prepare the FASTA index file. This file contains byte offsets in the FASTA file for each contig, enabling precise access to reference bases at specific genomic coordinates:
samtools faidx ref.fasta
This produces a file named ref.fasta.fai. Each line in this file corresponds to a contig in the FASTA sequence and provides information about the contig’s size, location, basesPerLine, and bytesPerLine.
Following these steps will ensure that you have the required dictionary and index files for your reference FASTA sequence, which are essential for Musta’s Detection and Classification modules.
In any case, we recommend using one of the reference genome builds provided in the GATK Resource Bundle.
The Musta Detection module includes GATK Mutect as a Variant Caller, which requires a population VCF of germline sequencing data as a resource. This population VCF should contain allele frequencies of common and rare variants.
You can find these resource files in the Google Bucket, in folders organized by reference genome build.
The resource files have names similar to af-only-gnomad.vcf and af-only-gnomad.vcf.idx. These files are essential for running the Mutect variant caller in the Musta Detection module.
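As a sketch, the files can be copied locally with gsutil; the bucket folder below is a placeholder, so substitute the folder matching your reference genome build:
gsutil cp gs://path/to/bucket/af-only-gnomad.vcf /path/to/resources/hg38/      # placeholder bucket path
gsutil cp gs://path/to/bucket/af-only-gnomad.vcf.idx /path/to/resources/hg38/  # placeholder bucket path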
The Somatic Variant Calling Workflow based on GATK Mutect in Musta involves the estimation of the fraction of reads due to cross-sample contamination for each tumor sample. To calculate contamination, Musta requires an input VCF file containing variants and allele frequencies for comparison.
To access the required resource files for this process, you can find them in the Google Bucket, again in folders organized by reference genome build.
The resource files have names similar to small_exac_common.vcf and small_exac_common.vcf.idx.
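The same gsutil pattern applies here (again with a placeholder bucket path):
gsutil cp gs://path/to/bucket/small_exac_common.vcf* /path/to/resources/hg38/  # copies the VCF and its .idx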
In some scenarios, such as target sequencing, it may be necessary to focus the analysis on specific genomic regions. To do this, you must provide a compressed and indexed BED file that lists the regions you want to restrict the analysis to. This file allows Musta to concentrate its analysis efforts on the specified genomic regions, providing a more targeted and efficient analysis.
To compress and index a BED file, you can use the following command:
bgzip -c /path/to/your.bed > /path/to/your.bed.gz && tabix -p bed /path/to/your.bed.gz
In the absence of specific BED files, we recommend using the standard coding region, which can be downloaded from the UCSC Genome Browser.
Please note that using the standard coding region is a common practice when you don’t have specific BED files tailored to your analysis, as it covers the coding regions of genes and is often a reasonable default choice for many variant analysis workflows.
The Annotation and Detection modules, which include variant callers like Muse and Lofreq, require a dbSNP file. This file is a compressed and indexed VCF file that contains known germline variants.
You can obtain these dbSNP files from the Google Bucket, in folders organized by reference genome build. Please download the appropriate dbSNP file for your reference genome.
To compress and index your dbSNP file, run the usual command:
bgzip -c /path/to/vcf/your.vcf > /path/to/vcf/your.vcf.gz && tabix -p vcf /path/to/vcf/your.vcf.gz
We recommend organizing supplementary resources in the same location, separated by reference genome build, such as:
/path/to/resources/hg19
/path/to/resources/hg38
This structured organization helps ensure that the relevant resources are easily accessible and correctly associated with the reference genome build you are working with. It contributes to a smoother and more organized analysis workflow when using Musta for somatic variant detection, classification, and interpretation.
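A minimal sketch of creating this layout (paths are illustrative):
mkdir -p /path/to/resources/hg19 /path/to/resources/hg38   # one folder per reference genome build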
In the context of the Musta Classification Module, annotation sources refer to the databases and resources used to gather information about genetic variants. These sources provide functional annotations and additional information about the variants, such as their impact on genes, protein products, regulatory elements, and their prevalence in different populations.
Annotation sources typically include databases like dbSNP, gnomAD, ClinVar, COSMIC, and many others. Each of these databases collects, curates, and maintains data about genetic variations in the human genome. Variant annotators use these databases to assess whether a specific variant has been previously reported in the germline, its association with diseases, population frequencies, and other relevant information.
To significantly improve the speed and efficiency of the Classification Module, it is highly recommended to keep the necessary data sources on local storage.
Data sources are organized within specific folders that are provided as input arguments. While you can specify multiple data source folders, it’s essential to ensure that no two data sources share the same name to avoid conflicts.
Each primary data source folder contains sub-directories for individual data sources, with further sub-directories for specific references (e.g., hg19 or hg38). Within the reference-specific data source directory, you’ll find a crucial configuration file that contains information about the data source and instructions on how to match it to a variant. This configuration file is a mandatory component of the data source.
Here’s an example of the structure for a data source directory:
dataSourcesFolder/
    Data_Source_1/
        hg19/
            data_source_1.config
            data_source_1.data.file.one
            data_source_1.data.file.two
            data_source_1.data.file.three
            ...
        hg38/
            data_source_1.config
            data_source_1.data.file.one
            data_source_1.data.file.two
            data_source_1.data.file.three
            ...
    Data_Source_2/
        hg19/
            data_source_2.config
            data_source_2.data.file.one
            data_source_2.data.file.two
            data_source_2.data.file.three
            ...
        hg38/
            data_source_2.config
            data_source_2.data.file.one
            data_source_2.data.file.two
            data_source_2.data.file.three
            ...
    ...
GATK provides two pre-packaged data source sets, facilitating Funcotator’s use without extensive additional configuration. These data source packages correspond to germline and somatic use cases. For somatic analysis, you should begin with the somatic data sources.
You can access versioned gzip archives of data source files for download. These archives are available at the link: gs://broad-public-datasets/funcotator/
You can find the full documentation about Funcotator Data sources at the following link: Funcotator Information and Tutorial.
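As a sketch, the archives can be listed and fetched with gsutil; the archive name below is a placeholder, so pick the somatic package shown by the listing:
gsutil ls gs://broad-public-datasets/funcotator/
gsutil cp gs://broad-public-datasets/funcotator/funcotator_dataSources.somatic.tar.gz /path/to/data-sources/   # placeholder archive name
tar -xzf /path/to/data-sources/funcotator_dataSources.somatic.tar.gz -C /path/to/data-sources/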
VEP (Variant Effect Predictor) offers efficient variant annotation by utilizing various annotation sources. The most efficient method is to use a cache: a downloadable file containing comprehensive data specific to a species, including transcript models, regulatory features, and variant details. Leveraging a cache minimizes network connections and optimizes performance by allowing most data to be read from local disk. This strategy significantly enhances the speed and efficiency of variant annotation through VEP.
Please download the cache for version 106 from the following URL: https://ftp.ensembl.org/pub/grch37/release-106/variation/indexed_vep_cache/
You can find the full documentation about VEP Cache at the following link: https://grch37.ensembl.org/info/docs/tools/vep/script/vep_cache.html
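A minimal sketch of downloading and unpacking the cache into the layout recommended below; the archive name is assumed from Ensembl’s homo_sapiens_vep_&lt;release&gt;_&lt;assembly&gt; naming convention:
mkdir -p /path/to/data-sources/vep
curl -O https://ftp.ensembl.org/pub/grch37/release-106/variation/indexed_vep_cache/homo_sapiens_vep_106_GRCh37.tar.gz   # assumed archive name
tar -xzf homo_sapiens_vep_106_GRCh37.tar.gz -C /path/to/data-sources/vep   # unpacks to vep/homo_sapiens/106_GRCh37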
We recommend organizing supplementary resources in the same location: /path/to/data-sources
Funcotator
├── clinvar
│   ├── hg19
│   └── hg38
├── cosmic
│   ├── hg19
│   └── hg38
├── dbsnp
│   ├── hg19
│   └── hg38
├── gencode
│   ├── hg19
│   └── hg38
├── gnomAD_exome
│   ├── hg19
│   └── hg38
└── gnomAD_genome
    ├── hg19
    └── hg38
vep
└── homo_sapiens
    └── 106_GRCh37
To test Musta, users can download the demo dataset from the following URL: https://repolab.crs4.it/phd-core/test-data-somatic.
The “test-data-somatic” dataset is a subset of publicly available genomic data from the Genome Sequence Archive of the Beijing Institute of Genomics, accessible with the accession ID PRJCA000091.
Originally derived from a Hepatocellular Carcinoma (HCC) tumor, this dataset includes 23 tumor biopsies and a tumor-adjacent matched normal sample, all sequenced at an average depth of 74.4×.
For this subset, we have specifically selected three tumor samples and one matched normal sample. Moreover, the dataset has been downsampled to include only 10% of the reads from chromosome 22.
Here’s a quick start guide to get you started with this dataset:
git clone https://repolab.crs4.it/phd-core/test-data-somatic.git
cd test-data-somatic
sh decompress.sh
This script will extract the necessary data from the archives.
This demo dataset serves as a useful resource for testing and familiarizing yourself with Musta’s capabilities before applying it to your own somatic mutation analysis.