Solve TODOs and verify DIALOGUE 1 #714
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #714      +/-   ##
==========================================
+ Coverage   63.17%   65.79%   +2.61%
==========================================
  Files          47       47
  Lines        6110     6127      +17
==========================================
+ Hits         3860     4031     +171
+ Misses       2250     2096     -154
```
Great job with these changes so far, and good that you read the R version too! It looks like you've got a solid understanding of what's going on in this fairly complex piece of code, so I'm happy to discuss any of the comments and your design decisions. Have you done any testing for reproducibility of the R scores with the normalization change?
Looking forward to your next commits and to dropping my own changes in!
```python
if normalize:
    return pseudobulks.to_numpy()
```
I know it might seem strange that with normalize=True the data isn't actually normalized, but this was done because of the comment above: the R implementation has normalize=True and does not scale, so we wanted this part to match when setting all the hyperparameters to be the same across the two versions. Did you check that quantile-based capping brings the result closer to the R implementation? If so, it's fine to flip the if-else logic into the correct orientation!
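If the capping does check out, a minimal sketch of the flipped orientation could look like this (assuming a hypothetical `_center_scale_cap` helper standing in for the centering/capping discussed here; this is not the actual pertpy code):

```python
# Sketch only: flip the branch so that normalize=True actually normalizes.
# _center_scale_cap is a hypothetical stand-in for the centering/scaling/
# capping logic discussed in this thread.
if normalize:
    return _center_scale_cap(pseudobulks.to_numpy())
return pseudobulks.to_numpy()
```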
They do normalize it by default in the R code, because the parameter center.flag is passed as T by default.
```python
# Create a DataFrame; keys become columns
aggr_df = pd.DataFrame(aggr)
# Transpose so that rows correspond to samples and columns to features
return aggr_df.T
```
I see that you've moved the transpose from where `ct_preprocess` is defined to here. However, this is going to cause a problem when `_pseudobulk_feature_space` isn't used, as when `agg_feature` is False. Currently, `agg_feature` is set to True by default and never exposed to the user, but this is because of the incomplete implementation of the TODO on the previous line 660 (https://github.com/scverse/pertpy/blob/main/pertpy/tools/_dialogue.py#L660), which I understand is very vague, but I'm happy to hop on a call to explain. Can you change this so that either `_pseudobulk_feature_space` and `_get_pseudobulks` are merged as is suggested in the original TODO, or revert it back to the original implementation so that the stub still exists?
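To illustrate the orientation concern, a hedged sketch of keeping both branches consistent regardless of `agg_feature` (the call signatures here are assumptions, not the actual pertpy code):

```python
# Hypothetical sketch: both branches should yield samples x features.
if agg_feature:
    ct_aggr = self._pseudobulk_feature_space(adata, groupby=sample_id)  # transposed inside
else:
    ct_aggr = self._get_pseudobulks(adata, groupby=sample_id).T  # transpose here so orientations match
```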
```python
agg_feature: Whether to aggregate pseudobulks with some embeddings or not.
normalize: Whether to mimic DIALOGUE behavior or not.
subset_common: If True, restrict output to common samples across cell types.
```
Awesome, this functionality has been missing for a while - in fact, dialoguepy fails with an obscure error without it and right now the user just has to know to do this beforehand, so this is great. Given that it's mandatory, it probably shouldn't be a parameter at all but instead just run by default with a loud warning.
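Something like this hypothetical helper could run unconditionally (the name and structure are assumptions, not the pertpy API):

```python
import warnings

import pandas as pd


def _subset_common_samples(ct_frames: dict[str, pd.DataFrame]) -> dict[str, pd.DataFrame]:
    """Hypothetical helper: keep only samples present in every cell type."""
    common = sorted(set.intersection(*(set(df.index) for df in ct_frames.values())))
    dropped = {ct: df.shape[0] - len(common) for ct, df in ct_frames.items()}
    if any(dropped.values()):
        warnings.warn(
            f"Restricting to {len(common)} samples shared across all cell types; "
            f"dropped per cell type: {dropped}",
            stacklevel=2,
        )
    return {ct: df.loc[common] for ct, df in ct_frames.items()}
```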
```python
# 4. Apply scaling/normalization to the aggregated data.
# We wrap the output back in a DataFrame to preserve the sample IDs.
ct_scaled = {
    ct: pd.DataFrame(self._scale_data(df, normalize=normalize), index=df.index, columns=df.columns)
```
Ultra minor nitpick, but `_scale_data` should simply preserve the ordering of the df index and columns within the function, so that you don't need the pd.DataFrame wrap here. This is also the generally expected standard when calling a numpy function on a pandas DataFrame.
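In other words, something along these lines, where the scaling body is just a placeholder (a sketch, not the actual `_scale_data` implementation):

```python
import pandas as pd


def _scale_data(df: pd.DataFrame, normalize: bool = True) -> pd.DataFrame:
    """Sketch: do the numpy work internally, but return a DataFrame that
    carries the caller's index and columns."""
    values = df.to_numpy(dtype=float)
    if normalize:  # placeholder for the real centering/scaling/capping
        values = values - values.mean(axis=0)
    return pd.DataFrame(values, index=df.index, columns=df.columns)
```

The call site would then reduce to `ct_scaled = {ct: self._scale_data(df, normalize=normalize) for ct, df in ct_aggr.items()}`.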
```python
# We wrap the output back in a DataFrame to preserve the sample IDs.
ct_scaled = {
    ct: pd.DataFrame(self._scale_data(df, normalize=normalize), index=df.index, columns=df.columns)
    for ct, df in ct_aggr.items()
```
Thank you for changing this to `for ct, df` instead of `for ct, ad`, which was super misleading before.
```python
# TODO: https://github.com/livnatje/DIALOGUE/blob/55da9be0a9bf2fcd360d9e11f63e30d041ec4318/R/DIALOGUE.main.R#L121-L131
ct_preprocess = {ct: self._scale_data(ad, normalize=normalize).T for ct, ad in ct_aggr.items()}
if subset_common:
```
If possible, I would try to move this section up to right after `ct_subs` is instantiated so that, in case there aren't the requisite samples, the code fails as quickly as possible. Furthermore, you won't have to reprocess all the various `ct_X` variables so that they match.
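For instance, a hedged sketch of a fail-fast check right after `ct_subs` is built (the `sample_id` column access is an assumption):

```python
# Hypothetical fail-fast check: verify sample overlap before heavy processing.
sample_sets = [set(adata.obs[sample_id]) for adata in ct_subs.values()]
common_samples = set.intersection(*sample_sets)
if len(common_samples) < 2:
    raise ValueError(
        f"Only {len(common_samples)} sample(s) are shared across all cell types; "
        "DIALOGUE needs samples present in every cell type."
    )
```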
```python
) -> pd.DataFrame:
    """Return Cell-averaged components from a passed feature space.

    TODO: consider merging with `get_pseudobulks`
```
Let's not remove these TODOs since they haven't been done and probably still should be done.
You are right, though the code looks very different now.
So, I also have a notebook that benchmarks the current implementation of DIALOGUE against the R version on their toy example. The results look good, but I have a lot of data files and other material that might need some adjustment, and I need to speak to @Zethson. My other PR has an operational version, very scrappy for now, but enough for the figures I think; I will get back to this in a week or so. Right now I have to focus on Pfizer. Thank you for all your comments Yuge :)
Description of changes
I have benchmarked the initial part of DIALOGUE (DIALOGUE1). I changed the `_pseudobulk_feature_space` function so that the user can choose the aggregation method (median or mean), and the output now has samples as rows, matching the R implementation. I also modified the `_scale_data` function to center, scale, and cap extreme values (with a cap of 0.01) in a way that mirrors the R functions center.matrix and cap.mat. In addition, I updated the `_load` function to optionally restrict the data to common samples across cell types. The output of `_load` is now a DataFrame that is converted back to a NumPy array before further processing.
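As a rough illustration of the centering-plus-capping behavior described above (a sketch with a hypothetical function name, not the code in this PR; cap=0.01 clips each column at its 1st and 99th percentiles):

```python
import numpy as np


def center_and_cap(x: np.ndarray, cap: float = 0.01) -> np.ndarray:
    """Illustrative analogue of R's center.matrix followed by cap.mat."""
    x = x - x.mean(axis=0)                  # column-wise centering
    lo = np.quantile(x, cap, axis=0)        # lower cap per column
    hi = np.quantile(x, 1.0 - cap, axis=0)  # upper cap per column
    return np.clip(x, lo, hi)               # clamp extreme values
```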
Technical details
The changes make the pseudobulk and normalization steps in Python produce results that match the R version. I added an optional parameter to subset to common samples and one to choose the averaging function. I also ensure that the data are converted to NumPy arrays before being passed to the penalized matrix decomposition functions.
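A small sketch of the selectable aggregation plus the NumPy handoff (the names here are illustrative assumptions, not the exact pertpy signatures):

```python
import pandas as pd


def pseudobulk(cells: pd.DataFrame, sample_labels: pd.Series, agg: str = "median") -> pd.DataFrame:
    """Illustrative pseudobulk: aggregate cells (rows) per sample with a chosen statistic."""
    if agg not in {"median", "mean"}:
        raise ValueError("agg must be 'median' or 'mean'")
    return cells.groupby(sample_labels).agg(agg)  # rows become samples, matching the R orientation


# Downstream, the penalized matrix decomposition receives plain arrays:
# X = pseudobulk(cells, labels, agg="mean").to_numpy()
```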
Additional context
These changes only affect the initial part of DIALOGUE (DIALOGUE1) and do not modify downstream analysis.