Add MechPredict plugin #772
base: main
Congratulations on your first plugin @ainefairbrother !
I spotted a couple of typos and places where we can make the information we are supplying clearer.
There are also optimisations we can make by changing data structures; let me know if it's useful to talk about these.
MechPredict.pm
Outdated
- `MechPredict_pDN`: Probability of a **dominant-negative (DN) mechanism**
- `MechPredict_pGOF`: Probability of a **gain-of-function (GOF) mechanism**
- `MechPredict_pLOF`: Probability of a **loss-of-function (LOF) mechanism**
- `MechPredict_interpretation`: Statement of the most likely mechanism based on empirically-derived cutoffs from Badonyi et al., 2024.
'interpretation' suggests a more involved process than what we are doing here. (As a general rule, humans interpret and software annotates, filters, flags, classifies, etc). 'MechPredict_prediction' would be a better name.
Thanks, changed to MechPredict_prediction.
MechPredict.pm
Outdated
- `MechPredict_interpretation`: Statement of the most likely mechanism based on empirically-derived cutoffs from Badonyi et al., 2024.

Usage:
1. The raw data from the Badonyi et al., 2024 manuscript can be pre-processed using the folllwoing command:
folllwoing -> following
Point 1 would be clearer as 'Download the results from Badonyi et al., 2024', as the processing instructions start in point 2.
Agreed, both fixed!
MechPredict.pm
Outdated
$params{$key} = $value if defined $key and defined $value;
}

my $file = $params{file} || die "Error: No data file supplied for the plugin.\n";
It would be helpful to say which plugin here as many may have been used
Good point, fixed.
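For illustration, a minimal sketch of an error message that names the plugin. The helper names `parse_params` and `get_file_or_die` and the exact wording are hypothetical, not the code that was merged; the sketch only assumes options arrive as `key=value` strings, as in the hunk above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "key=value" strings into a hash of parameters.
sub parse_params {
    my @args = @_;
    my %params;
    for my $arg (@args) {
        my ( $key, $value ) = split /=/, $arg, 2;
        $params{$key} = $value if defined $key and defined $value;
    }
    return \%params;
}

# Hypothetical helper: name the plugin in the error so the source is
# obvious when several plugins are run together.
sub get_file_or_die {
    my ($params) = @_;
    return $params->{file}
        || die "Error: No data file supplied for the MechPredict plugin.\n";
}

my $params = parse_params('file=MechPredict_input.tsv');
print get_file_or_die($params), "\n";
```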
MechPredict.pm
Outdated
cut --complement -f4 plof_svm_poly_2023-07-28.tsv | awk '{print $1 " " $2 "\t" $0}' | sort > plof_mod.tsv && \
join -t $'\t' -1 1 -2 1 pdn_mod.tsv pgof_mod.tsv | join -t $'\t' -1 1 -2 1 - plof_mod.tsv | cut --complement -f1,5,6,8,9 | sed '1i gene uniprot_id pDN pGOF pLOF' > MechPredict_input.tsv && \
rm pdn_mod.tsv pgof_mod.tsv plof_mod.tsv && \
awk 'BEGIN {print "gene\tuniprot_id\tmechanism\tprobability"} NR>1 {print $1, $2, "DN", $3; print $1, $2, "GOF", $4; print $1, $2, "LOF", $5;}' OFS='\t' MechPredict_input.tsv > MechPredict_input_pivot.tsv && \
This should be marginally quicker with a gene per line, which would involve less data manipulation too
Yes, agreed and fixed. This also simplifies the look-up later on.
MechPredict.pm
Outdated
}
close $fh;

# Debugging
The debug statements can be removed for production
Done.
MechPredict.pm
Outdated
my $data = $self->{data}{$gene_name};

my ($pdn, $pgof, $plof) = (undef, undef, undef);
foreach my $entry (@$data) {
It would be a lot quicker to use a lookup on a nested hash than to loop through an array. So we lose lines 235-246, and 257 onwards checks against:
$data->{gene}->{mechanism} = probability
Agreed. Array lookup is fast enough here due to small data size, but better practice to take advantage of the faster hash access
Fixed this by:
- Changing data input to wide instead of long:
    gene uniprot_id pDN pGOF pLOF
- Storing data in a hashref per gene:
    $data{$gene} = {
        uniprot_id => $uniprot_id,
        pDN => $pDN,
        pGOF => $pGOF,
        pLOF => $pLOF
    };
- Doing a hash key look-up to pull out data for gene:
    my ( $pdn, $pgof, $plof ) = @{$gene_data}{qw(pDN pGOF pLOF)};
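The three steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the merged plugin code: the function name `load_mechpredict_data` is hypothetical, the input is assumed to be tab-separated with the header shown above, and the gene names, IDs and values are made up.

```perl
use strict;
use warnings;

# Read a wide, tab-separated table (one gene per line) into a hash keyed by gene.
# Assumed column order: gene, uniprot_id, pDN, pGOF, pLOF.
sub load_mechpredict_data {
    my ($fh) = @_;
    my %data;
    while ( my $line = <$fh> ) {
        chomp $line;
        # Skip the header row and blank lines
        next if $line =~ /^gene\t/ or $line =~ /^\s*$/;
        my ( $gene, $uniprot_id, $pDN, $pGOF, $pLOF ) = split /\t/, $line;
        $data{$gene} = {
            uniprot_id => $uniprot_id,
            pDN        => $pDN,
            pGOF       => $pGOF,
            pLOF       => $pLOF,
        };
    }
    return \%data;
}

# Illustrative rows only; gene names, IDs and probabilities are made up.
my $tsv = "gene\tuniprot_id\tpDN\tpGOF\tpLOF\n"
        . "GENE_A\tID_A\t0.2\t0.3\t0.9\n"
        . "GENE_B\tID_B\t0.7\t0.1\t0.2\n";
open my $fh, '<', \$tsv or die "Cannot open in-memory table: $!";
my $data = load_mechpredict_data($fh);
close $fh;

# Direct lookup by gene, then a hash slice for the three probabilities.
my $gene_data = $data->{GENE_A};
my ( $pdn, $pgof, $plof ) = @{$gene_data}{qw(pDN pGOF pLOF)};
print "GENE_A: pDN=$pdn pGOF=$pgof pLOF=$plof\n";
```

With one row per gene, the per-variant work is a single hash lookup rather than a scan over an array of (mechanism, probability) entries.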
MechPredict.pm
Outdated
# - Two probabilities are high at the same time
# - All probabilities are below their respective thresholds
# - All probabilities are above their respective thresholds
# Instead, these end up categoried as "no_conclusive_mechanism_detected"
It would be interesting to know any mechanisms that pass the threshold, so we would report 2 if needed.
We could then have a simpler structure where each probability is compared to the threshold once.
Was this method following the recommendation of the authors @ainefairbrother or am I remembering wrong?
The thresholds represent the probability at which the models are able to correctly identify 50% of positive cases and ~80% of negative cases when compared to LOF (i.e. p_dn = DN vs. LOF, p_gof = GOF vs. LOF and p_lof = LOF vs. non-LOF). The authors acknowledge overlapping cases in the test sets used to train the models, but state that the models are "blind" to this - i.e. multiple molecular mechanisms weren't explicitly modelled.
As I understand it, the models aim to pick apart LOF from other mechanisms, for instance, if p_lof (LOF vs non-LOF) is low, either one or both of the other two would be elevated, meaning that it is possible for p_gof and p_dn to exist concurrently. So, on reflection, I think you're probably right here @sarahhunt and it would be best to report all that pass threshold.
Let me know what you think.
I have now implemented the returning of all possible mechanisms, as below:
# Compare values to thresholds and populate prediction
# Create prediction field
my $prediction = "";
# Check each value against its threshold and append to prediction
$prediction .= "gene_predicted_as_associated_with_dominant_negative_mechanism, " if $pdn >= $thresholds{pdn};
$prediction .= "gene_predicted_as_associated_with_gain_of_function_mechanism, " if $pgof >= $thresholds{pgof};
$prediction .= "gene_predicted_as_associated_with_loss_of_function_mechanism, " if $plof >= $thresholds{plof};
# Remove trailing comma and space, if present
$prediction =~ s/, $//;
# If no predictions met the threshold, assign a default message
$prediction = "no_conclusive_mechanism_predicted" if $prediction eq "";
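The same logic can also be written as a loop over a small table of mechanisms, which avoids the trailing-comma clean-up. This is only a sketch of an alternative: the function name `predict_mechanisms` is hypothetical, and the cut-off values below are placeholders, not the empirically-derived thresholds from Badonyi et al., 2024.

```perl
use strict;
use warnings;

# Placeholder cut-offs for illustration only; NOT the published thresholds.
my %thresholds = ( pdn => 0.5, pgof => 0.5, plof => 0.5 );

# Each entry pairs a probability key with the label reported when it passes.
my @mechanisms = (
    [ pdn  => 'gene_predicted_as_associated_with_dominant_negative_mechanism' ],
    [ pgof => 'gene_predicted_as_associated_with_gain_of_function_mechanism' ],
    [ plof => 'gene_predicted_as_associated_with_loss_of_function_mechanism' ],
);

# Hypothetical helper: compare each probability to its threshold exactly once
# and report every mechanism that passes.
sub predict_mechanisms {
    my ( $probs, $cutoffs ) = @_;
    my @hits;
    for my $mechanism (@mechanisms) {
        my ( $key, $label ) = @$mechanism;
        my $p = $probs->{$key};
        next unless defined $p;    # tolerate missing values
        push @hits, $label if $p >= $cutoffs->{$key};
    }
    return @hits ? join( ', ', @hits ) : 'no_conclusive_mechanism_predicted';
}

# One mechanism passes
print predict_mechanisms( { pdn => 0.2, pgof => 0.3, plof => 0.9 }, \%thresholds ), "\n";
# Two mechanisms pass, so both are reported
print predict_mechanisms( { pdn => 0.8, pgof => 0.7, plof => 0.1 }, \%thresholds ), "\n";
# Nothing passes
print predict_mechanisms( { pdn => 0.1, pgof => 0.1, plof => 0.1 }, \%thresholds ), "\n";
```

Skipping undefined values also avoids "uninitialized value" warnings if a gene has a missing probability.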
my ( $self, $tva ) = @_;

# Get transcript ID
my $transcript = $tva->transcript;
This is not necessary as feature_types is set to Transcript.
Both of these variables, or just the my ( $self, $tva ) = @_; ?
The return {} unless $transcript on L187;
Ah OK, understood. Done.
a2b5c95 to 94d0cdc
94d0cdc to f9c4c93
JIRA ticket: ENSVAR-6662
Description
This PR adds the MechPredict plugin, which annotates missense variants with one of the predicted gene-level mechanisms: dominant-negative (DN), gain-of-function (GOF) or loss-of-function (LOF).
MechPredict does this by reading in gene-level probabilities predicted by an external model and assigning the most likely mechanism based on empirically-derived cut-offs described in the related manuscript. For example, if gene A has the following probability values: DN = 0.2, GOF = 0.3, LOF = 0.9, then the returned interpretation would be "gene_predicted_as_associated_with_loss_of_function_mechanism".
Notes
- MechPredict_pDN: Numeric
- MechPredict_pGOF: Numeric
- MechPredict_pLOF: Numeric
- MechPredict_interpretation: Character
- MechPredict_input.tsv which can be generated using instructions in the module's header.
Testing
Test with 50 missense variants - should annotate all
Test with 50 intron variants - should annotate none