MUSTA | MUtation and Somatic Tumor Analysis

Interpretation

It covers driver gene identification, pathway analysis, mutational signatures, and somatic interactions. users can learn how to make sense of the detected mutations.


Once the variants are annotated, it’s pivotal to interpret the somatic mutation data. The downstream analysis is facilitated by incorporating the Maftools software package into the Musta pipeline.

Starting from the MAF files generated during the Classification step or provided by the user, Musta performs comprehensive analysis and interpretation of somatic variants. It offers a wide range of downstream analysis and visualization modules extensively used in cancer genomic studies. This integration enables researchers to perform several key analyses:

  • Variant Visualization: Various graphical representations to aid in detecting mutation patterns and recurring characteristics.
    • Summary Plots: Show variant counts per sample and their distribution.
    • Onco Plots: Reveal mutation distribution patterns across samples.
    • Lollipop Plots: Depict mutations on protein structures.
    • Transitions and Transversions Plots: Categorize mutations.
    • Rainfall Plots: Visualize mutation-rich areas.
    • Oncostrip: Zoom in on specific genes, aiding in exploring features like mutual exclusivity and gene interactions.”
  • Driver Gene Identification: Identifying genes that drive cancer progression.
  • Pathway Analysis: Exploring pathways that are frequently altered in cancer.
  • Mutational Signature Analysis: Analyzing patterns of mutations to understand underlying mutational processes.
  • Tumor heterogeneity: Assessing the presence of multiple clones and ongoing evolution through clustering variants by allele frequencies. Quantifying intra-tumor heterogeneity using the MATH score, with higher scores indicating poorer prognosis and survival.
  • Somatic Interaction: Identifying mutually exclusive or co-occurring gene sets to reveal new pathways and mechanisms of tumorigenesis through Fisher’s exact test on all gene combinations.

Basic Usage

The command structure for invoking the Interpretation module in Musta is as follows:

musta interpret [options]

Here is a breakdown of the available options for the musta interpret command:

General Options:

  • -h, --help: Show the help message and exit.

Required Options:

  • --workdir PATH (-w PATH): Specifies the destination folder for analysis. This is where the Snakemake pipeline, logs, and analysis outputs will be located.
  • --samples-file PATH (-s PATH): Points to a YAML file listing the datasets you wish to analyze. This file provides information about the samples to be processed. For further details, please refer to the dedicate section.

Optional Options:

  • --all-variants or -a: Includes all variants in the final dataset. By default, only variants marked as PASS are included. Use this option if you want to include all variants regardless of their PASS status.

Additional Options:

  • --force, -f: Forces the re-creation of all output files.
  • --dryrun, -d: Describes the workflow but does not execute it. This is useful for previewing the analysis steps before running them.

Quick Start

  1. Obtain the “test-data-somatic” Dataset: First, download the “test-data-somatic” dataset by following the instructions provided in the dedicated section of this user guide.

  2. Create an Input Samples YAML File: Prepare an input samples.yml file, specifying your sample names, normal and tumor sample information, and paths to the corresponding BAM, VCF, and MAF files. Note: For the Interpretation Module, only the path to the MAF file is required.
    sample_A25:
        normal_sample_name:
            - N1
        tumor_sample_name:
            - A25
        normal_bam:
            - /path/to/test-data-somatic/data/bam/N1.chr22.bam
        tumor_bam:
            - /path/to/test-data-somatic/data/bam/A25.chr22.bam
        vcf:
            - /path/to/test-data-somatic/data/vcf/sample_A25.musta.snvs.vcf.gz
        maf:
            - /path/to/test-data-somatic/data/maf/sample_A25.annotated.vep.maf
    
    sample_B33:
        normal_sample_name:
            - N1
        tumor_sample_name:
            - B33
        normal_bam:
            - /path/to/test-data-somatic/data/bam/N1.chr22.bam
        tumor_bam:
            - /path/to/test-data-somatic/data/bam/B33.chr22.bam
        vcf:
            - /path/to/test-data-somatic/data/vcf/sample_B33.musta.snvs.vcf.gz
        maf:
            - /path/to/test-data-somatic/data/maf/sample_B33.annotated.vep.maf
    
    
    sample_C2:
        normal_sample_name:
            - N1
        tumor_sample_name:
            - C2
        normal_bam:
            - /path/to/test-data-somatic/data/bam/N1.chr22.bam
        tumor_bam:
            - /path/to/test-data-somatic/data/bam/C2.chr22.bam
        vcf:
            - /path/to/test-data-somatic/data/vcf/sample_C2.musta.snvs.vcf.gz
        maf:
            - /path/to/test-data-somatic/data/maf/sample_C2.annotated.vep.maf
    
    

    Modify it according to your specific requirements. Note: The “test-data-somatic” dataset refers only to chromosome 22 and is not suitable for a significant test. Therefore, we have included an additional MAF file in the demo data, downloaded from the Cancer Genome Atlas Program (TCGA), containing a subset of 25 samples from the BRCA cohort (breast cancer). Consequently, the samples.yml file assumes this form:

    sample_BRCA:
        maf:
            - /path/to/test-data-somatic/data/maf/BRCA.subset_25_samples.maf
    
           
    
  3. Run Musta Interpret: To execute the “Musta Interpret” module, use the following command with the “test-data-somatic” dataset.

    musta interpret \
    --workdir /path/to/workdir \
    --samples-file /path/to/samples.yml 
    

    Ensure you replace the /path/to/ placeholders with the actual file paths on your system.

After initiating the execution of musta classify, you will see the Musta log unfolding on your screen:

YYYY-MM-DD hh-mm-ss|INFO    |main |musta v1.0.0 - End-to-end pipeline to detect, classify and interpret mutations in cancer
YYYY-MM-DD hh-mm-ss|INFO    |main |Somatic Mutations Interpretation:    1.  Identification of cancer driver genes     2.  Check for enrichment of known oncogenic pathways.    3.  Infer tumor clonality by clustering variant allele frequencies.    4.  Deconvolution of Mutational Signatures    
YYYY-MM-DD hh-mm-ss|INFO    |main |Reading configuration file
YYYY-MM-DD hh-mm-ss|INFO    |main |Setting paths
YYYY-MM-DD hh-mm-ss|INFO    |main |Deploying musta:v1.2.0.1 pipeline from https://github.com/solida-core/musta.git
YYYY-MM-DD hh-mm-ss|INFO    |main |Setting  Config file
YYYY-MM-DD hh-mm-ss|INFO    |main |Setting  Samples file
YYYY-MM-DD hh-mm-ss|INFO    |main |Running
Building DAG of jobs..</pre>

After a few minutes, you will see the last lines of the log, and the execution of Musta will conclude. You can locate the resulting analysis in the following destination folders:

  • Logs: <WORKDIR>/logs/interpretation/
  • Outputs: <WORKDIR>/outputs/interpretation/>
  • Reports: <WORKDIR>/outputs/classification//report.html
  • Results: <WORKDIR>/outputs/interpretation/results These folders will contain the relevant data and reports generated during the musta classify execution.

Exploring <WORKDIR> Folder

Let’s take a look at the <WORKDIR> folder

.
├── benchmarks
│   ├── ...
│   └── interpretation
├── logs
│   ├── ...
│   └── interpretation
├── musta
│   ├── config
│   ├── LICENSE
│   ├── README.md
│   ├── resources
│   └── workflow
└── outputs
    ├── YYYYMMDD-hhmmss.report.html
    ├── ...
    └── interpretation
  • benchmarks: This directory stores benchmark data and results, including performance metrics for individual step used in the interpretation module.
      benchmarks
      ├── ...
      └── interpretation
          ├── driver_gene_identification.txt
          ├── mutational_signature_analysis.txt
          ├── pathway_analysis.txt
          ├── tumor_heterogeneity.txt
          └── variant_visualization.txt
    
  • logs: The logs directory stores various log files to keep track of Musta’s activities.
      logs
      ├── ...
      ├── interpretation
      │   ├── driver_gene_identification.log
      │   ├── mutational_signature_analysis.log
      │   ├── pathway_analysis.log
      │   ├── tumor_heterogeneity.log
      │   └── variant_visualization.log
    
  • musta: The directory houses the core of the Musta tool, including the Snakemake pipeline, which can be accessed from the official Musta GitHub repository: https://github.com/solida-core/musta.

  • outputs: The outputs directory is designed to house the outcomes and data produced during the execution of Musta’s interpretation module. It consists of the following subdirectories:
    • YYMMDD-hhmmss.report.html: Auto-generated, independent, detailed HTML reports that include runtime statistics, provenance information, workflow topology, and results.
    • interpretation: The directory houses the outcomes and data derived from the execution of Musta’s interpretation module. Within this directory, users can explore a wealth of output files and data that are byproducts of the analysis. Specifically, you can find:
      • PNG Files: generated by each step.
      • TSV Files: generated by each step.
      interpretation/
      └── results
          ├── driver_gene_identification
          │   ├── plots
          │   └── tables
          ├── mutational_signature_analysis
          │   ├── plots
          │   └── tables
          ├── pathway_analysis
          │   ├── plots
          │   └── tables
          ├── tumor_heterogeneity
          │   ├── plots
          │   └── tables
          └── variant_visualization
              ├── plots
              └── tables
    

Usage Examples

Let’s explore different scenarios for running Musta with various configurations using the musta interpret command.

1. Run Musta Interpret:

By default, only variants marked as PASS are included.

   musta interpret \
   --workdir /path/to/workdir \
   --samples-file /path/to/samples.yml 

2. Includes all variants in the final dataset:

Use this option if you want to include all variants regardless of their PASS status.

   musta interpret \
   --workdir /path/to/workdir \
   --samples-file /path/to/samples.yml \
   --all-variants 

Results

  • Variant Visualization. See here
variant_visualization
├── plots
│   ├── large_oncoplot.png
│   ├── large_oncoplot.with_samplenames.png
│   ├── lollipop
│   │   ├── ASPM.lollipop.png
│   │   ├── CDH1.lollipop.png
│   │   ├── GATA3.lollipop.png
│   │   ├── GRM3.lollipop.png
│   │   ├── HUWE1.lollipop.png
│   │   ├── MUC16.lollipop.png
│   │   ├── PIK3CA.lollipop.png
│   │   ├── PLCE1.lollipop.png
│   │   ├── TP53.lollipop.png
│   │   └── TTN.lollipop.png
│   ├── mafbar.png
│   ├── oncoplot.png
│   ├── oncoplot.with_samplenames.png
│   ├── rainfall
│   │   ├── TCGA-3C-AAAU-01A-11D-A41F-09 .rainfall.png
│   │   ├── TCGA-3C-AALI-01A-11D-A41F-09 .rainfall.png
│   │   ├── TCGA-3C-AALJ-01A-31D-A41F-09 .rainfall.png
│   │   ├── TCGA-3C-AALK-01A-11D-A41F-09 .rainfall.png
│   │   ├── TCGA-4H-AAAK-01A-12D-A41F-09 .rainfall.png
│   │   ├── TCGA-5L-AAT0-01A-12D-A41F-09 .rainfall.png
│   │   ├── TCGA-5L-AAT1-01A-12D-A41F-09 .rainfall.png
│   │   ├── TCGA-5T-A9QA-01A-11D-A41F-09 .rainfall.png
│   │   ├── TCGA-A1-A0SD-01A-11D-A10Y-09 .rainfall.png
│   │   ├── TCGA-A1-A0SE-01A-11D-A099-09 .rainfall.png
│   │   ├── TCGA-A1-A0SF-01A-11D-A142-09 .rainfall.png
│   │   ├── TCGA-A1-A0SG-01A-11D-A142-09 .rainfall.png
│   │   ├── TCGA-A1-A0SH-01A-11D-A099-09 .rainfall.png
│   │   ├── TCGA-A1-A0SI-01A-11D-A142-09 .rainfall.png
│   │   ├── TCGA-A1-A0SJ-01A-11D-A099-09 .rainfall.png
│   │   ├── TCGA-A1-A0SK-01A-12D-A099-09 .rainfall.png
│   │   ├── TCGA-A1-A0SM-01A-11D-A099-09 .rainfall.png
│   │   ├── TCGA-A1-A0SN-01A-11D-A142-09 .rainfall.png
│   │   ├── TCGA-A1-A0SO-01A-22D-A099-09 .rainfall.png
│   │   ├── TCGA-A1-A0SP-01A-11D-A099-09 .rainfall.png
│   │   ├── TCGA-A1-A0SQ-01A-21D-A142-09 .rainfall.png
│   │   ├── TCGA-A2-A04N-01A-11D-A10Y-09 .rainfall.png
│   │   ├── TCGA-A2-A04P-01A-31D-A128-09 .rainfall.png
│   │   ├── TCGA-A2-A04Q-01A-21W-A050-09 .rainfall.png
│   │   └── TCGA-A2-A04R-01A-41D-A117-09 .rainfall.png
│   ├── summary.png
│   ├── summary.with_samplenames.png
│   ├── TGCA_compare.png
│   ├── TGCA_compare.primary_site.png
│   ├── titv.png
│   ├── titv.with_samplenames.png
│   ├── top10_VAF.png
│   └── top20_VAF.png
└── tables
    ├── gene_summary.tsv
    ├── mutations_count_matrix.tsv
    ├── overview.tsv
    ├── sample_summary.tsv
    ├── TGCA_median_mutation_burden.tsv
    ├── TGCA_mutation_burden_perSample.tsv
    ├── TGCA_pairwise_t_test.tsv
    ├── transitions_and_transversions_fraction_contribution.txt
    ├── transitions_and_transversions_fraction_counts.txt
    └── transitions_and_transversions_TiTv_fractions.txt

4 directories, 58 files
  • Driver Gene Identification. See here
driver_gene_identification
├── plots
│   ├── drug_interactions.barplot.png
│   ├── drug_interactions.piechart.png
│   ├── oncodrive.png
│   └── somatic_interactions.png
└── tables
    ├── oncodrive.tsv
    └── somatic_interactions.tsv

2 directories, 6 files
  • Mutational Signature Analysis. See here
mutational_signature_analysis
├── plots
│   ├── Apobecdiff.png
│   ├── cosine_similarities_heatmap.png
│   ├── cosmic_signatures.png
│   ├── plotCophenetic.png
│   └── signature_contributions.png
└── tables
    ├── APOBEC_scores_matrix.tsv
    ├── cosime_similarities_matrix.tsv
    ├── nmf_matrix.tsv
    └── signatures_matrix.tsv

2 directories, 9 files

  • Pathway Analysis. See here
pathway_analysis
├── plots
│   ├── oncogenic_pathways.png
│   └── pathways
│       ├── Cell_Cycle_pathway.png
│       ├── Hippo_pathway.png
│       ├── MYC_pathway.png
│       ├── NOTCH_pathway.png
│       ├── NRF2_pathway.png
│       ├── PI3K_pathway.png
│       ├── RTK-RAS_pathway.png
│       ├── TGF-Beta_pathway.png
│       ├── TP53_pathway.png
│       └── WNT_pathway.png
└── tables
    └── oncogenic_pathways.tsv

3 directories, 12 files
  • Tumor Heterogeneity. See here
tumor_heterogeneity
├── plots
│   └── clusters
│       ├── TCGA-3C-AAAU-01A-11D-A41F-09_clusters.png
│       ├── TCGA-3C-AALI-01A-11D-A41F-09_clusters.png
│       ├── TCGA-3C-AALJ-01A-31D-A41F-09_clusters.png
│       ├── TCGA-3C-AALK-01A-11D-A41F-09_clusters.png
│       ├── TCGA-4H-AAAK-01A-12D-A41F-09_clusters.png
│       ├── TCGA-5L-AAT0-01A-12D-A41F-09_clusters.png
│       ├── TCGA-5L-AAT1-01A-12D-A41F-09_clusters.png
│       ├── TCGA-5T-A9QA-01A-11D-A41F-09_clusters.png
│       ├── TCGA-A1-A0SD-01A-11D-A10Y-09_clusters.png
│       ├── TCGA-A1-A0SE-01A-11D-A099-09_clusters.png
│       ├── TCGA-A1-A0SF-01A-11D-A142-09_clusters.png
│       ├── TCGA-A1-A0SG-01A-11D-A142-09_clusters.png
│       ├── TCGA-A1-A0SH-01A-11D-A099-09_clusters.png
│       ├── TCGA-A1-A0SI-01A-11D-A142-09_clusters.png
│       ├── TCGA-A1-A0SJ-01A-11D-A099-09_clusters.png
│       ├── TCGA-A1-A0SK-01A-12D-A099-09_clusters.png
│       ├── TCGA-A1-A0SM-01A-11D-A099-09_clusters.png
│       ├── TCGA-A1-A0SN-01A-11D-A142-09_clusters.png
│       ├── TCGA-A1-A0SO-01A-22D-A099-09_clusters.png
│       ├── TCGA-A1-A0SP-01A-11D-A099-09_clusters.png
│       ├── TCGA-A1-A0SQ-01A-21D-A142-09_clusters.png
│       ├── TCGA-A2-A04N-01A-11D-A10Y-09_clusters.png
│       ├── TCGA-A2-A04P-01A-31D-A128-09_clusters.png
│       ├── TCGA-A2-A04Q-01A-21W-A050-09_clusters.png
│       └── TCGA-A2-A04R-01A-41D-A117-09_clusters.png
└── tables
    ├── successful.tsv
    ├── TCGA-3C-AAAU-01A-11D-A41F-09_clusterData.tsv
    ├── TCGA-3C-AAAU-01A-11D-A41F-09_clusterMeans.tsv
    ├── TCGA-3C-AALI-01A-11D-A41F-09_clusterData.tsv
    ├── TCGA-3C-AALI-01A-11D-A41F-09_clusterMeans.tsv
    ├── TCGA-3C-AALJ-01A-31D-A41F-09_clusterData.tsv
    ├── TCGA-3C-AALJ-01A-31D-A41F-09_clusterMeans.tsv
    ├── TCGA-3C-AALK-01A-11D-A41F-09_clusterData.tsv
    ├── TCGA-3C-AALK-01A-11D-A41F-09_clusterMeans.tsv
    ├── TCGA-4H-AAAK-01A-12D-A41F-09_clusterData.tsv
    ├── TCGA-4H-AAAK-01A-12D-A41F-09_clusterMeans.tsv
    ├── TCGA-5L-AAT0-01A-12D-A41F-09_clusterData.tsv
    ├── TCGA-5L-AAT0-01A-12D-A41F-09_clusterMeans.tsv
    ├── TCGA-5L-AAT1-01A-12D-A41F-09_clusterData.tsv
    ├── TCGA-5L-AAT1-01A-12D-A41F-09_clusterMeans.tsv
    ├── TCGA-5T-A9QA-01A-11D-A41F-09_clusterData.tsv
    ├── TCGA-5T-A9QA-01A-11D-A41F-09_clusterMeans.tsv
    ├── TCGA-A1-A0SD-01A-11D-A10Y-09_clusterData.tsv
    ├── TCGA-A1-A0SD-01A-11D-A10Y-09_clusterMeans.tsv
    ├── TCGA-A1-A0SE-01A-11D-A099-09_clusterData.tsv
    ├── TCGA-A1-A0SE-01A-11D-A099-09_clusterMeans.tsv
    ├── TCGA-A1-A0SF-01A-11D-A142-09_clusterData.tsv
    ├── TCGA-A1-A0SF-01A-11D-A142-09_clusterMeans.tsv
    ├── TCGA-A1-A0SG-01A-11D-A142-09_clusterData.tsv
    ├── TCGA-A1-A0SG-01A-11D-A142-09_clusterMeans.tsv
    ├── TCGA-A1-A0SH-01A-11D-A099-09_clusterData.tsv
    ├── TCGA-A1-A0SH-01A-11D-A099-09_clusterMeans.tsv
    ├── TCGA-A1-A0SI-01A-11D-A142-09_clusterData.tsv
    ├── TCGA-A1-A0SI-01A-11D-A142-09_clusterMeans.tsv
    ├── TCGA-A1-A0SJ-01A-11D-A099-09_clusterData.tsv
    ├── TCGA-A1-A0SJ-01A-11D-A099-09_clusterMeans.tsv
    ├── TCGA-A1-A0SK-01A-12D-A099-09_clusterData.tsv
    ├── TCGA-A1-A0SK-01A-12D-A099-09_clusterMeans.tsv
    ├── TCGA-A1-A0SM-01A-11D-A099-09_clusterData.tsv
    ├── TCGA-A1-A0SM-01A-11D-A099-09_clusterMeans.tsv
    ├── TCGA-A1-A0SN-01A-11D-A142-09_clusterData.tsv
    ├── TCGA-A1-A0SN-01A-11D-A142-09_clusterMeans.tsv
    ├── TCGA-A1-A0SO-01A-22D-A099-09_clusterData.tsv
    ├── TCGA-A1-A0SO-01A-22D-A099-09_clusterMeans.tsv
    ├── TCGA-A1-A0SP-01A-11D-A099-09_clusterData.tsv
    ├── TCGA-A1-A0SP-01A-11D-A099-09_clusterMeans.tsv
    ├── TCGA-A1-A0SQ-01A-21D-A142-09_clusterData.tsv
    ├── TCGA-A1-A0SQ-01A-21D-A142-09_clusterMeans.tsv
    ├── TCGA-A2-A04N-01A-11D-A10Y-09_clusterData.tsv
    ├── TCGA-A2-A04N-01A-11D-A10Y-09_clusterMeans.tsv
    ├── TCGA-A2-A04P-01A-31D-A128-09_clusterData.tsv
    ├── TCGA-A2-A04P-01A-31D-A128-09_clusterMeans.tsv
    ├── TCGA-A2-A04Q-01A-21W-A050-09_clusterData.tsv
    ├── TCGA-A2-A04Q-01A-21W-A050-09_clusterMeans.tsv
    ├── TCGA-A2-A04R-01A-41D-A117-09_clusterData.tsv
    └── TCGA-A2-A04R-01A-41D-A117-09_clusterMeans.tsv

3 directories, 76 files