Skip to content

Add MechPredict plugin #772

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 27 commits into
base: main
Choose a base branch
from

Conversation

ainefairbrother
Copy link
Collaborator

@ainefairbrother ainefairbrother commented Feb 5, 2025

JIRA ticket: ENSVAR-6662

Description

This PR adds the MechPredict plugin, which annotates missense variants with one of predicted gene-level mechanisms:

  • Dominant-negative (DN)
  • Gain-of-function (GOF)
  • Loss-of-function (LOF)

MechPredict does this by reading in gene-level probabilities predicted by an external model and assigning the most likely mechanism based on empircally-derived cut-offs described in the related manuscript. For example, if gene A has the following probability values: DN = 0.2, GOF = 0.3, LOF = 0.9, then the returned interpretation would be "gene_predicted_as_associated_with_loss_of_function_mechanism".

Notes

  • New VEP fields added by plugin
    • MechPredict_pDN: Numeric
    • MechPredict_pGOF: Numeric
    • MechPredict_pLOF: Numeric
    • MechPredict_interpretation: Character
  • The plugin only annotates transcript-variant pairs with missense_variant as the consequence. This is because the methods used by the authors to generate the predictions was optimised to assess missense mutations, the most common protein-altering mutations.
  • The plugin reads in MechPredict_input.tsv which can be generated using instructions in the module's header.
  • There is a known exception found during testing:
    • The 'test with 50 missense variants - should annotate all' test will annotate 49 variants only. I believe this is to do with VEP's most severe consequence functionality - if a variant-transcript pair has >1 consequence, VEP will assign the more severe one.
    • As such, in the case below, start_lost is assigned over missense, and so missense is removed as a consequence and is thus not annotated by MechPredict.

Testing

Test with 50 missense variants - should annotate all

# run vep with MechPredict
./vep --input_file /hps/software/users/ensembl/variation/fairbrot/data/test-data/clinvar_20210102_missense_50.vcf.gz \
--output_file /hps/software/users/ensembl/variation/fairbrot/MechPredict/MechPredict_test_missense_out.vcf \
--format vcf \
--vcf \
--dir_plugins /hps/software/users/ensembl/variation/fairbrot/VEP_plugins \
--plugin MechPredict,file=/nfs/production/flicek/ensembl/variation/data/MechPredict/MechPredict_input.tsv \
--offline \
--cache \
--cache_version 113 \
--dir_cache /nfs/production/flicek/ensembl/variation/data/VEP/tabixconverted \
--assembly GRCh38 \
--fasta /nfs/production/flicek/ensembl/variation/data/Homo_sapiens.GRCh38.dna.toplevel.fa.gz

# check output - are the MechPredict fields included?
cat /hps/software/users/ensembl/variation/fairbrot/MechPredict/MechPredict_test_missense_out.vcf | \
    grep -v "^#" | \
    grep "_mechanism" | 
    wc -l

Test with 50 intron variants - should annotate none

# run vep with MechPredict
./vep --input_file /hps/software/users/ensembl/variation/fairbrot/data/test-data/clinvar_20210102_intron_50.vcf.gz \
--output_file /hps/software/users/ensembl/variation/fairbrot/MechPredict/MechPredict_test_intron_out.vcf \
--format vcf \
--vcf \
--dir_plugins /hps/software/users/ensembl/variation/fairbrot/VEP_plugins \
--plugin MechPredict,file=/nfs/production/flicek/ensembl/variation/data/MechPredict/MechPredict_input.tsv \
--offline \
--cache \
--cache_version 113 \
--dir_cache /nfs/production/flicek/ensembl/variation/data/VEP/tabixconverted \
--assembly GRCh38 \
--fasta /nfs/production/flicek/ensembl/variation/data/Homo_sapiens.GRCh38.dna.toplevel.fa.gz

# check output - are the MechPredict fields included?
cat /hps/software/users/ensembl/variation/fairbrot/MechPredict/MechPredict_test_intron_out.vcf | \
    grep -v "^#" | \
    grep "_mechanism" | 
    wc -l

@jamie-m-a jamie-m-a self-assigned this Feb 14, 2025
Copy link
Member

@sarahhunt sarahhunt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Congratulations on your first plugin @ainefairbrother !

I spotted a couple of typos and places where we can make the information we are supplying clearer.
There are also optimisations we can make by changing data structures; let me know if it's useful to talk about these.

MechPredict.pm Outdated
- `MechPredict_pDN`: Probability of a **dominant-negative (DN) mechanism**
- `MechPredict_pGOF`: Probability of a **gain-of-function (GOF) mechanism**
- `MechPredict_pLOF`: Probability of a **loss-of-function (LOF) mechanism**
- `MechPredict_interpretation`: Statement of the most likely mechanism based on empirically-derived cutoffs from Badonyi et al., 2024.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'interpretation' suggests a more involved process than what we are doing here. (As a general rule, humans interpret and software annotates, filters, flags, classifies, etc). 'MechPredict_prediction' would be a better name.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, changed to MechPredict_prediction.

MechPredict.pm Outdated
- `MechPredict_interpretation`: Statement of the most likely mechanism based on empirically-derived cutoffs from Badonyi et al., 2024.

Usage:
1. The raw data from the Badonyi et al., 2024 manuscript can be pre-processed using the folllwoing command:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

folllwoing -> folllowing

Point 1 would be clearer as 'Download the results the Badonyi et al., 2024' as the processing instructions start in point 2

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, both fixed!

MechPredict.pm Outdated
$params{$key} = $value if defined $key and defined $value;
}

my $file = $params{file} || die "Error: No data file supplied for the plugin.\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to say which plugin here as many may have been used

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, fixed.

MechPredict.pm Outdated
cut --complement -f4 plof_svm_poly_2023-07-28.tsv | awk '{print $1 " " $2 "\t" $0}' | sort > plof_mod.tsv && \
join -t $'\t' -1 1 -2 1 pdn_mod.tsv pgof_mod.tsv | join -t $'\t' -1 1 -2 1 - plof_mod.tsv | cut --complement -f1,5,6,8,9 | sed '1i gene uniprot_id pDN pGOF pLOF' > MechPredict_input.tsv && \
rm pdn_mod.tsv pgof_mod.tsv plof_mod.tsv && \
awk 'BEGIN {print "gene\tuniprot_id\tmechanism\tprobability"} NR>1 {print $1, $2, "DN", $3; print $1, $2, "GOF", $4; print $1, $2, "LOF", $5;}' OFS='\t' MechPredict_input.tsv > MechPredict_input_pivot.tsv && \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be marginally quicker with a gene per line, which would involve less data manipulation too

Copy link
Collaborator Author

@ainefairbrother ainefairbrother Feb 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agreed and fixed. This also simplifies the look-up later on.

MechPredict.pm Outdated
}
close $fh;

# Debugging
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The debug statements can be removed for production

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

MechPredict.pm Outdated
my $data = $self->{data}{$gene_name};

my ($pdn, $pgof, $plof) = (undef, undef, undef);
foreach my $entry (@$data) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a lot quicker to use a a look up on a nested hash than look through an array. So we lose lines 235-246 and 257 onwards checks against:

$data->{gene}->{mechanism} = probability

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Array lookup is fast enough here due to small data size, but better practice to take advantage of the faster hash access

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this by:

  • Changing data input to wide instead of long
    gene uniprot_id pDN pGOF pLOF

  • Storing data in hashref per gene

$data{$gene} = {
            uniprot_id => $uniprot_id,
            pDN        => $pDN,
            pGOF       => $pGOF,
            pLOF       => $pLOF
        };
  • Doing a hash key look-up to pull out data for gene
my ( $pdn, $pgof, $plof ) = @{$gene_data}{qw(pDN pGOF pLOF)};

MechPredict.pm Outdated
# - Two probabilities are high at the same time
# - All probabilities are below their respective thresholds
# - All probabilities are above their respective thresholds
# Instead, these end up categoried as "no_conclusive_mechanism_detected"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be interesting to know any mechanisms that pass the threshold, so we would report 2 if needed.
We could then have a a simpler structure where each probability is compared to the threshold once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this method following the recommendation of the authors @ainefairbrother or am I remembering wrong?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thresholds represent the probability at which the models are able to correctly identify 50% of positive cases and ~80% of negative cases when compared to LOF (i.e. p_dn = DN vs. LOF, p_gof = GOF vs. LOF and p_lof = LOF vs. non-LOF). The authors acknowledge overlapping cases in the test sets used to train the models, but state that the models are "blind" to this - i.e. multiple molecular mechanisms weren't explicitly modelled.

As I understand it, the models aim to pick apart LOF from other mechanisms, for instance, if p_lof (LOF vs non-LOF) is low, either one or both of the other two would be elevated, meaning that it is possible for p_gof and p_dn to exist concurrently. So, on reflection, I think you're probably right here @sarahhunt and it would be best to report all that pass threshold.

Let me know what you think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now implemented the returning of all possible mechanisms, as below:

# Compare values to thresholds and populate prediction
# Create prediction field
my $prediction = "";

# Check each value against its threshold and append to prediction
$prediction .= "gene_predicted_as_associated_with_dominant_negative_mechanism, " if $pdn  >= $thresholds{pdn};
$prediction .= "gene_predicted_as_associated_with_gain_of_function_mechanism, " if $pgof >= $thresholds{pgof};
$prediction .= "gene_predicted_as_associated_with_loss_of_function_mechanism, " if $plof >= $thresholds{plof};

# Remove trailing comma and space if
$prediction =~ s/, $//;

# If no predictions met the threshold, assign a default message
$prediction = "no_conclusive_mechanism_predicted" if $prediction eq "";

my ( $self, $tva ) = @_;

# Get transcript ID
my $transcript = $tva->transcript;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not be necessary as feature_types is set to Transcript.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of these variables, or just the my ( $self, $tva ) = @_; ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return {} unless $transcript on L187;

Copy link
Collaborator Author

@ainefairbrother ainefairbrother Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK, understood. Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants