Add MechPredict plugin #772

ainefairbrother · 2025-02-05T14:14:45Z

Description

This PR adds the MechPredict plugin, which annotates missense variants with one of three predicted gene-level mechanisms:

Dominant-negative (DN)
Gain-of-function (GOF)
Loss-of-function (LOF)

MechPredict does this by reading in gene-level probabilities predicted by an external model and assigning the most likely mechanism based on empirically-derived cut-offs described in the related manuscript. For example, if gene A has the following probability values: DN = 0.2, GOF = 0.3, LOF = 0.9, then the returned interpretation would be "gene_predicted_as_associated_with_loss_of_function_mechanism".
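The thresholding idea can be sketched in shell as below. This is only an illustration, not the plugin's code: the cut-off value 0.5 is a placeholder rather than a value from the manuscript, `classify_gene` is a hypothetical helper name, and the rule used (report every mechanism that meets its cut-off) is the one settled on later in this review.

```shell
#!/bin/sh
# Hypothetical cut-offs: placeholders only, NOT the values from the manuscript.
PDN_CUTOFF=0.5
PGOF_CUTOFF=0.5
PLOF_CUTOFF=0.5

# classify_gene PDN PGOF PLOF
# Prints every mechanism whose probability meets its cut-off (comma-separated),
# or a fallback label when none does.
classify_gene() {
    awk -v pdn="$1" -v pgof="$2" -v plof="$3" \
        -v tdn="$PDN_CUTOFF" -v tgof="$PGOF_CUTOFF" -v tlof="$PLOF_CUTOFF" '
    BEGIN {
        prefix = "gene_predicted_as_associated_with_"
        out = ""
        # "+ 0" forces a numeric comparison
        if (pdn + 0 >= tdn + 0)   { out = out prefix "dominant_negative_mechanism, " }
        if (pgof + 0 >= tgof + 0) { out = out prefix "gain_of_function_mechanism, " }
        if (plof + 0 >= tlof + 0) { out = out prefix "loss_of_function_mechanism, " }
        sub(/, $/, "", out)       # drop the trailing separator
        if (out == "") { out = "no_conclusive_mechanism_predicted" }
        print out
    }'
}

# Gene A from the example above
classify_gene 0.2 0.3 0.9
# prints gene_predicted_as_associated_with_loss_of_function_mechanism
```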

Notes

New VEP fields added by plugin
- MechPredict_pDN: Numeric
- MechPredict_pGOF: Numeric
- MechPredict_pLOF: Numeric
- MechPredict_interpretation: Character
The plugin only annotates transcript-variant pairs with missense_variant as the consequence. This is because the methods used by the authors to generate the predictions were optimised to assess missense mutations, the most common protein-altering mutations.
The plugin reads in MechPredict_input.tsv which can be generated using instructions in the module's header.
There is a known exception found during testing:
- The 'test with 50 missense variants - should annotate all' test will annotate 49 variants only. I believe this is to do with VEP's most severe consequence functionality - if a variant-transcript pair has >1 consequence, VEP will assign the more severe one.
- As such, in the case below, start_lost is assigned over missense, and so missense is removed as a consequence and is thus not annotated by MechPredict.
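
One way to find the record that is left unannotated is to list the output lines that lack a prediction string. The sketch below assumes the output VCF is uncompressed and that every annotated record contains `_mechanism`, the same marker the checks in the Testing section grep for; `unannotated_variants` is a hypothetical helper name.

```shell
#!/bin/sh
# unannotated_variants VCF
# Prints CHROM, POS, ID, REF and ALT for each record that carries no
# MechPredict prediction string. Records labelled
# "no_conclusive_mechanism_predicted" also match "_mechanism", so they
# count as annotated here.
unannotated_variants() {
    grep -v '^#' "$1" | grep -v '_mechanism' | cut -f 1-5
}

# Usage (path taken from the missense test in this PR):
#   unannotated_variants /hps/software/users/ensembl/variation/fairbrot/MechPredict/MechPredict_test_missense_out.vcf
```

Checking the CSQ field of any record this prints should show whether start_lost has replaced missense_variant for that transcript.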

Testing

Test with 50 missense variants - should annotate all

# run vep with MechPredict
./vep --input_file /hps/software/users/ensembl/variation/fairbrot/data/test-data/clinvar_20210102_missense_50.vcf.gz \
--output_file /hps/software/users/ensembl/variation/fairbrot/MechPredict/MechPredict_test_missense_out.vcf \
--format vcf \
--vcf \
--dir_plugins /hps/software/users/ensembl/variation/fairbrot/VEP_plugins \
--plugin MechPredict,file=/nfs/production/flicek/ensembl/variation/data/MechPredict/MechPredict_input.tsv \
--offline \
--cache \
--cache_version 113 \
--dir_cache /nfs/production/flicek/ensembl/variation/data/VEP/tabixconverted \
--assembly GRCh38 \
--fasta /nfs/production/flicek/ensembl/variation/data/Homo_sapiens.GRCh38.dna.toplevel.fa.gz

# check output - are the MechPredict fields included?
cat /hps/software/users/ensembl/variation/fairbrot/MechPredict/MechPredict_test_missense_out.vcf | \
    grep -v "^#" | \
    grep "_mechanism" | 
    wc -l

Test with 50 intron variants - should annotate none

# run vep with MechPredict
./vep --input_file /hps/software/users/ensembl/variation/fairbrot/data/test-data/clinvar_20210102_intron_50.vcf.gz \
--output_file /hps/software/users/ensembl/variation/fairbrot/MechPredict/MechPredict_test_intron_out.vcf \
--format vcf \
--vcf \
--dir_plugins /hps/software/users/ensembl/variation/fairbrot/VEP_plugins \
--plugin MechPredict,file=/nfs/production/flicek/ensembl/variation/data/MechPredict/MechPredict_input.tsv \
--offline \
--cache \
--cache_version 113 \
--dir_cache /nfs/production/flicek/ensembl/variation/data/VEP/tabixconverted \
--assembly GRCh38 \
--fasta /nfs/production/flicek/ensembl/variation/data/Homo_sapiens.GRCh38.dna.toplevel.fa.gz

# check output - are the MechPredict fields included?
cat /hps/software/users/ensembl/variation/fairbrot/MechPredict/MechPredict_test_intron_out.vcf | \
    grep -v "^#" | \
    grep "_mechanism" | 
    wc -l

sarahhunt

Congratulations on your first plugin @ainefairbrother !

I spotted a couple of typos and places where we can make the information we are supplying clearer.
There are also optimisations we can make by changing data structures; let me know if it's useful to talk about these.

sarahhunt · 2025-02-14T12:26:36Z

MechPredict.pm

+  - `MechPredict_pDN`: Probability of a **dominant-negative (DN) mechanism**  
+  - `MechPredict_pGOF`: Probability of a **gain-of-function (GOF) mechanism**  
+  - `MechPredict_pLOF`: Probability of a **loss-of-function (LOF) mechanism**  
+  - `MechPredict_interpretation`: Statement of the most likely mechanism based on empirically-derived cutoffs from Badonyi et al., 2024. 


'interpretation' suggests a more involved process than what we are doing here. (As a general rule, humans interpret and software annotates, filters, flags, classifies, etc). 'MechPredict_prediction' would be a better name.

Thanks, changed to MechPredict_prediction.

sarahhunt · 2025-02-14T12:27:23Z

MechPredict.pm

+  - `MechPredict_interpretation`: Statement of the most likely mechanism based on empirically-derived cutoffs from Badonyi et al., 2024. 
+
+Usage:
+1. The raw data from the Badonyi et al., 2024 manuscript can be pre-processed using the folllwoing command: 


folllwoing -> following

Point 1 would be clearer as 'Download the results from Badonyi et al., 2024', as the processing instructions start in point 2.

Agreed, both fixed!

sarahhunt · 2025-02-14T13:14:01Z

MechPredict.pm

+        $params{$key} = $value if defined $key and defined $value;
+    }
+
+    my $file = $params{file} || die "Error: No data file supplied for the plugin.\n";


It would be helpful to say which plugin here as many may have been used

Good point, fixed.

sarahhunt · 2025-02-14T13:19:13Z

MechPredict.pm

+    cut --complement -f4 plof_svm_poly_2023-07-28.tsv | awk '{print $1 " " $2 "\t" $0}' | sort > plof_mod.tsv && \
+    join -t $'\t' -1 1 -2 1 pdn_mod.tsv pgof_mod.tsv | join -t $'\t' -1 1 -2 1 - plof_mod.tsv | cut --complement -f1,5,6,8,9 | sed '1i gene uniprot_id pDN pGOF pLOF' > MechPredict_input.tsv && \
+    rm pdn_mod.tsv pgof_mod.tsv plof_mod.tsv && \
+    awk 'BEGIN {print "gene\tuniprot_id\tmechanism\tprobability"} NR>1 {print $1, $2, "DN", $3; print $1, $2, "GOF", $4; print $1, $2, "LOF", $5;}' OFS='\t' MechPredict_input.tsv > MechPredict_input_pivot.tsv && \


This should be marginally quicker with a gene per line, which would involve less data manipulation too

Yes, agreed and fixed. This also simplifies the look-up later on.

sarahhunt · 2025-02-14T13:22:39Z

MechPredict.pm

+    }
+    close $fh;
+
+    # Debugging


The debug statements can be removed for production

sarahhunt · 2025-02-14T13:56:41Z

MechPredict.pm

+    my $data = $self->{data}{$gene_name};
+
+    my ($pdn, $pgof, $plof) = (undef, undef, undef);
+    foreach my $entry (@$data) {


It would be a lot quicker to use a look-up on a nested hash than to loop through an array. We could then drop lines 235-246, and from line 257 onwards the code would check against:

$data->{gene}->{mechanism} = probability

Agreed. Array lookup is fast enough here due to small data size, but better practice to take advantage of the faster hash access

Fixed this by:

Changing data input to wide instead of long
gene uniprot_id pDN pGOF pLOF

Storing data in hashref per gene

$data{$gene} = { uniprot_id => $uniprot_id, pDN => $pDN, pGOF => $pGOF, pLOF => $pLOF };

Doing a hash key look-up to pull out data for gene

my ( $pdn, $pgof, $plof ) = @{$gene_data}{qw(pDN pGOF pLOF)};
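
The same one-row-per-gene look-up can be tried outside the plugin with an awk associative array. This is a rough sketch, assuming a tab-separated file with the header `gene uniprot_id pDN pGOF pLOF`; `lookup_gene` and the gene names used to exercise it are hypothetical.

```shell
#!/bin/sh
# lookup_gene TSV GENE
# Loads the wide-format table into arrays keyed by gene name, then prints
# pDN, pGOF and pLOF (tab-separated) for GENE. Exits non-zero if GENE is absent.
lookup_gene() {
    awk -F '\t' -v gene="$2" '
    NR > 1 { pdn[$1] = $3; pgof[$1] = $4; plof[$1] = $5 }   # skip header row
    END {
        if (gene in plof) {
            printf "%s\t%s\t%s\n", pdn[gene], pgof[gene], plof[gene]
        } else {
            exit 1
        }
    }' "$1"
}

# Usage:
#   lookup_gene MechPredict_input.tsv GENE_A
```

Keying by gene gives a direct look-up per gene, which mirrors the hashref approach described above rather than scanning a list of entries.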

sarahhunt · 2025-02-14T14:01:10Z

MechPredict.pm

+    #   - Two probabilities are high at the same time
+    #   - All probabilities are below their respective thresholds
+    #   - All probabilities are above their respective thresholds
+    # Instead, these end up categoried as "no_conclusive_mechanism_detected"


It would be interesting to know any mechanisms that pass the threshold, so we would report 2 if needed.
We could then have a simpler structure where each probability is compared to the threshold once.

Was this method following the recommendation of the authors @ainefairbrother or am I remembering wrong?

The thresholds represent the probability at which the models are able to correctly identify 50% of positive cases and ~80% of negative cases when compared to LOF (i.e. p_dn = DN vs. LOF, p_gof = GOF vs. LOF and p_lof = LOF vs. non-LOF). The authors acknowledge overlapping cases in the test sets used to train the models, but state that the models are "blind" to this - i.e. multiple molecular mechanisms weren't explicitly modelled.

As I understand it, the models aim to pick apart LOF from other mechanisms, for instance, if p_lof (LOF vs non-LOF) is low, either one or both of the other two would be elevated, meaning that it is possible for p_gof and p_dn to exist concurrently. So, on reflection, I think you're probably right here @sarahhunt and it would be best to report all that pass threshold.

Let me know what you think.

I have now implemented the returning of all possible mechanisms, as below:

# Compare values to thresholds and populate prediction
# Create prediction field
my $prediction = "";

# Check each value against its threshold and append to prediction
$prediction .= "gene_predicted_as_associated_with_dominant_negative_mechanism, " if $pdn >= $thresholds{pdn};
$prediction .= "gene_predicted_as_associated_with_gain_of_function_mechanism, " if $pgof >= $thresholds{pgof};
$prediction .= "gene_predicted_as_associated_with_loss_of_function_mechanism, " if $plof >= $thresholds{plof};

# Remove trailing comma and space, if present
$prediction =~ s/, $//;

# If no predictions met the threshold, assign a default message
$prediction = "no_conclusive_mechanism_predicted" if $prediction eq "";

sarahhunt · 2025-02-14T14:10:30Z

MechPredict.pm

+    my ( $self, $tva ) = @_;
+
+    # Get transcript ID
+    my $transcript = $tva->transcript;


This is not necessary as feature_types is set to Transcript.

Both of these variables, or just the my ( $self, $tva ) = @_; ?

The return {} unless $transcript on L187;

Ah OK, understood. Done.

jamie-m-a self-assigned this Feb 14, 2025

jamie-m-a requested review from sarahhunt and jamie-m-a February 14, 2025 09:08

sarahhunt requested changes Feb 14, 2025

jamie-m-a requested a review from sarahhunt February 26, 2025 09:39

ainefairbrother force-pushed the MechPredict-devel branch from a2b5c95 to 94d0cdc Compare February 26, 2025 15:48

ainefairbrother added 24 commits February 26, 2025 15:51

chore: init

c60d874

feat: add parameter handling and TSV file parsing

d37066a

feat: add subroutines

d83759d

feat: add core logic

53559e8

feat: add return values

aa9fc5d

feat: add interpretation output field

75ddcaf

feat: add interpretation output field

6612d42

fix: processing of parameters in sub run

a91d8f4

fix: processing of parameters in sub run

29fd239

fix: file reading, gene name lookup, output format

aec7649

docs: tidy code and add comments

22c1ae3

fix: add interpretation to header hash

56bf2e4

docs: add header fields to .pm file

b3ada0b

docs: add header fields to .pm file

270ee13

docs: tidy code and add comments

557dc97

docs: augment module header

9b9cb2e

fix: add underscores to interpretation field

f6b2c39

docs: small changes to comments

7efce8a

docs: small changes to comments

1a31e50

docs: simplify comments, minimise header info char length

987120a

fix: add if to access values cached if --offline

6a6926c

fix: small fixes to otput strings

02ba516

fix: docs, comments, small optimisations, improve data input structure

03a82ad

fix: use hashref lookup for grabbing gene data from input dat

0af9fd9

ainefairbrother added 2 commits February 26, 2025 15:52

fix: remove return {} unless

692ae90

fix: output all predicted mechanisms

f9c4c93

ainefairbrother force-pushed the MechPredict-devel branch from 94d0cdc to f9c4c93 Compare February 26, 2025 15:52

docs: minor changes to comments

0acb15f


