
Solve TODOs and verify DIALOGUE 1 #714

Open · wants to merge 6 commits into main

Conversation

@grpinto (Collaborator) commented Feb 21, 2025

Description of changes

I have benchmarked the initial part of DIALOGUE (DIALOGUE1). I changed the _pseudobulk_feature_space function so that the user can choose the aggregation method (median or mean) and the output now has samples as rows, matching the R implementation. I also modified the _scale_data function to center, scale, and cap extreme values (with a cap of 0.01) in a way that mirrors the R functions center.matrix and cap.mat. In addition, I updated the _load function to optionally restrict the data to common samples across cell types. The output of _load is now a dataframe that is converted back to a numpy array before further processing.

Technical details

The changes make the pseudobulk and normalization steps in Python produce results that match the R version. I added an optional parameter to subset to common samples and to choose the averaging function. I also ensure that the data are converted to numpy arrays before passing them to the penalized matrix decomposition functions.
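To illustrate the aggregation change described above, here is a minimal, hypothetical sketch of a sample-level pseudobulk step with a selectable aggregation method. The function name and `sample_col` parameter are simplified stand-ins for the actual `_pseudobulk_feature_space` implementation in pertpy, not its real signature; the key points from the PR are that the user can pick mean or median and that the output has samples as rows, matching the R orientation.

```python
import pandas as pd

def pseudobulk_feature_space(cells: pd.DataFrame, sample_col: str = "sample",
                             agg: str = "mean") -> pd.DataFrame:
    """Aggregate per-cell features into one row per sample.

    `agg` may be "mean" or "median"; rows of the result are samples,
    mirroring the orientation used by the R implementation.
    """
    if agg not in ("mean", "median"):
        raise ValueError("agg must be 'mean' or 'median'")
    # groupby over the sample label collapses cells to pseudobulk rows
    return cells.groupby(sample_col).agg(agg)
```

With a toy frame of four cells from two samples, `pseudobulk_feature_space(cells)` returns a 2-row table indexed by sample ID, and `agg="median"` switches the statistic without changing the orientation.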

Additional context

These changes only affect the initial part of DIALOGUE (DIALOGUE1) and do not modify downstream analysis.

@Zethson Zethson changed the title Validating almost the DIALOGUE1 Solve TODOs and verify DIALOGUE 1 Feb 22, 2025
@codecov-commenter commented Feb 22, 2025

Codecov Report

Attention: Patch coverage is 92.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 65.79%. Comparing base (6a97036) to head (c27ffed).

Files with missing lines Patch % Lines
pertpy/tools/_dialogue.py 92.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #714      +/-   ##
==========================================
+ Coverage   63.17%   65.79%   +2.61%     
==========================================
  Files          47       47              
  Lines        6110     6127      +17     
==========================================
+ Hits         3860     4031     +171     
+ Misses       2250     2096     -154     
Files with missing lines Coverage Δ
pertpy/tools/_dialogue.py 38.11% <92.00%> (+24.16%) ⬆️

... and 3 files with indirect coverage changes

@yugeji (Collaborator) left a comment:

Great job with these changes so far, and good that you read the R version too! It looks like you've got a solid understanding of what's going on in this fairly complex piece of code, so I'm happy to discuss any of the comments and your design decisions. Have you done any testing for reproducibility of the R scores with the normalization change?

Looking forward to your next commits and to dropping my own changes in!

if normalize:
    return pseudobulks.to_numpy()
Collaborator:

I know it might seem strange that if normalize=True it's not actually normalized, but this was done because of the above comment - the R implementation has normalize=True and does not scale, and so we wanted this part to match, when setting all the hyperparameters to be the same across the two versions. Did you check that quantile-based capping brings the result closer to the R implementation? If so, then it's fine to flip the if-else logic into the correct orientation!

Collaborator Author:

# Create a DataFrame; keys become columns
aggr_df = pd.DataFrame(aggr)
# Transpose so that rows correspond to samples and columns to features
return aggr_df.T
Collaborator:

I see that you've moved the transpose from where ct_preprocess is defined to here. However, this is going to cause a problem when _pseudobulk_feature_space isn't used, as when agg_feature is False. Currently, it's set to True by default and never exposed to the user, but this is because of the incomplete implementation of the TODO on the previous line 660: https://github.com/scverse/pertpy/blob/main/pertpy/tools/_dialogue.py#L660 which I understand is very vague, but I'm happy to hop on a call to explain. Can you change this so that either _pseudobulk_feature_space and _get_pseudobulks are merged as is suggested in the original TODO, or revert it back to the original implementation so that the stub still exists?

agg_feature: Whether to aggregate pseudobulks with some embeddings or not.
normalize: Whether to mimic DIALOGUE behavior or not.
subset_common: If True, restrict output to common samples across cell types.
Collaborator:

Awesome, this functionality has been missing for a while - in fact, dialoguepy fails with an obscure error without it and right now the user just has to know to do this beforehand, so this is great. Given that it's mandatory, it probably shouldn't be a parameter at all but instead just run by default with a loud warning.
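The reviewer's suggestion above — run the common-sample restriction unconditionally with a loud warning rather than behind a flag — could be sketched roughly as follows. This is a hypothetical helper, not pertpy's actual code; `ct_aggr` is assumed to be a dict of cell type to sample-by-feature DataFrame, as in the quoted snippets.

```python
import warnings
import pandas as pd

def subset_common_samples(ct_aggr: dict) -> dict:
    """Restrict every cell type's table to the samples shared by all
    cell types, warning loudly about anything dropped."""
    sample_sets = [set(df.index) for df in ct_aggr.values()]
    common = sorted(set.intersection(*sample_sets))
    n_dropped = sum(len(s) - len(common) for s in sample_sets)
    if n_dropped:
        warnings.warn(
            f"Keeping {len(common)} samples shared by all cell types; "
            f"dropping {n_dropped} cell-type/sample entries."
        )
    # .loc with the same ordered list gives every table identical row order
    return {ct: df.loc[common] for ct, df in ct_aggr.items()}
```

Running this by default would surface the mismatch early instead of letting downstream steps fail with an obscure error.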

# 4. Apply scaling/normalization to the aggregated data.
# We wrap the output back in a DataFrame to preserve the sample IDs.
ct_scaled = {
    ct: pd.DataFrame(self._scale_data(df, normalize=normalize), index=df.index, columns=df.columns)
Collaborator:

Ultra minor quip, but _scale_data should simply preserve the ordering of the df index and columns within the function, so that you don't need the pd.DataFrame wrap here. This is also the generally expected standard when it comes to calling a numpy function on a pandas DataFrame.
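The index-preserving version the reviewer is asking for might look like the sketch below. It is a simplified, hypothetical `scale_data` (the real `_scale_data` is a method with different semantics around `normalize`, per the earlier thread); the point is only that the function itself returns a DataFrame carrying the caller's index and columns, so the `pd.DataFrame(...)` wrap at the call site becomes unnecessary.

```python
import numpy as np
import pandas as pd

def scale_data(df: pd.DataFrame, normalize: bool = True, cap: float = 0.01) -> pd.DataFrame:
    """Center and scale columns, cap extremes at the `cap` quantiles,
    and return a DataFrame that preserves the input index/columns."""
    vals = df.to_numpy(dtype=float)
    if normalize:
        # column-wise centering and scaling (population std, ddof=0)
        vals = (vals - vals.mean(axis=0)) / vals.std(axis=0)
        # cap extreme values at the lower/upper quantiles per column
        lo = np.quantile(vals, cap, axis=0)
        hi = np.quantile(vals, 1.0 - cap, axis=0)
        vals = np.clip(vals, lo, hi)
    # hand back the caller's labels so no re-wrapping is needed
    return pd.DataFrame(vals, index=df.index, columns=df.columns)
```

Callers can then write `ct_scaled = {ct: scale_data(df, normalize=normalize) for ct, df in ct_aggr.items()}` directly.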

# We wrap the output back in a DataFrame to preserve the sample IDs.
ct_scaled = {
    ct: pd.DataFrame(self._scale_data(df, normalize=normalize), index=df.index, columns=df.columns)
    for ct, df in ct_aggr.items()
Collaborator:

Thank you for changing this to `for ct, df` instead of `for ct, ad`, which was super misleading before.


# TODO: https://github.com/livnatje/DIALOGUE/blob/55da9be0a9bf2fcd360d9e11f63e30d041ec4318/R/DIALOGUE.main.R#L121-L131
ct_preprocess = {ct: self._scale_data(ad, normalize=normalize).T for ct, ad in ct_aggr.items()}
if subset_common:
Collaborator:

If possible, I would try to move this section up to right after ct_subs is instantiated so that, in case there aren't the requisite samples, the code fails as quickly as possible. Furthermore, you won't have to reprocess all the various ct_X variables so that they match.

) -> pd.DataFrame:
    """Return Cell-averaged components from a passed feature space.

    TODO: consider merging with `get_pseudobulks`
Collaborator:

Let's not remove these TODOs since they haven't been done and probably still should be done.

Collaborator Author:

You're right, though the code looks very different now.

@grpinto (Collaborator Author) commented Mar 19, 2025

So, I also have a notebook that benchmarks the current implementation of DIALOGUE against the R version on their toy example, and the results look good. However, I have a lot of data files and other material that may need some adjustment, so I need to speak to @Zethson. My other PR has an operational version (very scrappy for now, but enough for the figures, I think). I will get back to this in a week or so; right now I have to focus on Pfizer.

Thank you for all your comments Yuge :)
