Solve TODOs and verify DIALOGUE 1 #714
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #714      +/-   ##
==========================================
+ Coverage   63.17%   65.79%   +2.61%
==========================================
  Files          47       47
  Lines        6110     6127      +17
==========================================
+ Hits         3860     4031     +171
+ Misses       2250     2096     -154
```
Great job with these changes so far, and good that you read the R version too! It looks like you've got a solid understanding of what's going on in this fairly complex piece of code, so I'm happy to discuss any of the comments and your design decisions. Have you done any testing for reproducibility of the R scores with the normalization change?
Looking forward to your next commits and to dropping my own changes in!
```python
if normalize:
    return pseudobulks.to_numpy()
```
I know it might seem strange that with normalize=True the data isn't actually normalized, but this was done because of the comment above: the R implementation has normalize=True and does not scale, so we wanted this part to match when setting all the hyperparameters to be the same across the two versions. Did you check that quantile-based capping brings the result closer to the R implementation? If so, it's fine to flip the if-else logic into the correct orientation!
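If the capping does check out, a minimal sketch of the flipped orientation could look like this (assuming a hypothetical `_center_scale_cap` helper standing in for the centering/capping discussed here; this is not the actual pertpy code):

```python
# Sketch only: flip the branch so that normalize=True actually normalizes.
# _center_scale_cap is a hypothetical stand-in for the centering/scaling/
# capping logic discussed in this thread.
if normalize:
    return _center_scale_cap(pseudobulks.to_numpy())
return pseudobulks.to_numpy()
```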
They do normalize it by default in the R code, because the parameter center.flag is passed as T by default.
```python
# Create a DataFrame; keys become columns
aggr_df = pd.DataFrame(aggr)
# Transpose so that rows correspond to samples and columns to features
return aggr_df.T
```
I see that you've moved the transpose from where `ct_preprocess` is defined to here. However, this is going to cause a problem when `_pseudobulk_feature_space` isn't used, as when `agg_feature` is False. Currently, `agg_feature` is set to True by default and never exposed to the user, but this is because of the incomplete implementation of the TODO on the previous line 660 (https://github.com/scverse/pertpy/blob/main/pertpy/tools/_dialogue.py#L660), which I understand is very vague, but I'm happy to hop on a call to explain. Can you change this so that either `_pseudobulk_feature_space` and `_get_pseudobulks` are merged as is suggested in the original TODO, or revert it back to the original implementation so that the stub still exists?
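To illustrate the orientation concern, a hedged sketch of keeping both branches consistent regardless of `agg_feature` (the call signatures here are assumptions, not the actual pertpy code):

```python
# Hypothetical sketch: both branches should yield samples x features.
if agg_feature:
    ct_aggr = self._pseudobulk_feature_space(adata, groupby=sample_id)  # transposed inside
else:
    ct_aggr = self._get_pseudobulks(adata, groupby=sample_id).T  # transpose here so orientations match
```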
```python
agg_feature: Whether to aggregate pseudobulks with some embeddings or not.
normalize: Whether to mimic DIALOGUE behavior or not.
subset_common: If True, restrict output to common samples across cell types.
```
Awesome, this functionality has been missing for a while - in fact, dialoguepy fails with an obscure error without it and right now the user just has to know to do this beforehand, so this is great. Given that it's mandatory, it probably shouldn't be a parameter at all but instead just run by default with a loud warning.
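Something like this hypothetical helper could run unconditionally (the name and structure are assumptions, not the pertpy API):

```python
import warnings

import pandas as pd


def _subset_common_samples(ct_frames: dict[str, pd.DataFrame]) -> dict[str, pd.DataFrame]:
    """Hypothetical helper: keep only samples present in every cell type."""
    common = sorted(set.intersection(*(set(df.index) for df in ct_frames.values())))
    dropped = {ct: df.shape[0] - len(common) for ct, df in ct_frames.items()}
    if any(dropped.values()):
        warnings.warn(
            f"Restricting to {len(common)} samples shared across all cell types; "
            f"dropped per cell type: {dropped}",
            stacklevel=2,
        )
    return {ct: df.loc[common] for ct, df in ct_frames.items()}
```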
```python
# 4. Apply scaling/normalization to the aggregated data.
# We wrap the output back in a DataFrame to preserve the sample IDs.
ct_scaled = {
    ct: pd.DataFrame(self._scale_data(df, normalize=normalize), index=df.index, columns=df.columns)
```
Ultra minor nitpick, but `_scale_data` should simply preserve the ordering of the df index and columns within the function, so that you don't need the pd.DataFrame wrap here. This is also the generally expected standard when calling a numpy function on a pandas DataFrame.
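In other words, something along these lines, where the scaling body is just a placeholder (a sketch, not the actual `_scale_data` implementation):

```python
import pandas as pd


def _scale_data(df: pd.DataFrame, normalize: bool = True) -> pd.DataFrame:
    """Sketch: do the numpy work internally, but return a DataFrame that
    carries the caller's index and columns."""
    values = df.to_numpy(dtype=float)
    if normalize:  # placeholder for the real centering/scaling/capping
        values = values - values.mean(axis=0)
    return pd.DataFrame(values, index=df.index, columns=df.columns)
```

The call site would then reduce to `ct_scaled = {ct: self._scale_data(df, normalize=normalize) for ct, df in ct_aggr.items()}`.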
```python
# We wrap the output back in a DataFrame to preserve the sample IDs.
ct_scaled = {
    ct: pd.DataFrame(self._scale_data(df, normalize=normalize), index=df.index, columns=df.columns)
    for ct, df in ct_aggr.items()
```
Thank you for changing this to `for ct, df` instead of `for ct, ad`, which was super misleading before.
```python
# TODO: https://github.com/livnatje/DIALOGUE/blob/55da9be0a9bf2fcd360d9e11f63e30d041ec4318/R/DIALOGUE.main.R#L121-L131
ct_preprocess = {ct: self._scale_data(ad, normalize=normalize).T for ct, ad in ct_aggr.items()}
if subset_common:
```
If possible, I would try to move this section up to right after `ct_subs` is instantiated so that, in case there aren't the requisite samples, the code fails as quickly as possible. Furthermore, you won't have to reprocess all the various `ct_X` variables so that they match.
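For instance, a hedged sketch of a fail-fast check right after `ct_subs` is built (the `sample_id` column access is an assumption):

```python
# Hypothetical fail-fast check: verify sample overlap before heavy processing.
sample_sets = [set(adata.obs[sample_id]) for adata in ct_subs.values()]
common_samples = set.intersection(*sample_sets)
if len(common_samples) < 2:
    raise ValueError(
        f"Only {len(common_samples)} sample(s) are shared across all cell types; "
        "DIALOGUE needs samples present in every cell type."
    )
```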
```python
) -> pd.DataFrame:
    """Return Cell-averaged components from a passed feature space.

    TODO: consider merging with `get_pseudobulks`
```
Let's not remove these TODOs since they haven't been done and probably still should be done.
You are right, though the code looks very different now.
So, I also have a notebook that benchmarks the current implementation of DIALOGUE against the R version on their toy example. The results look good, but I have a lot of data files and other material that might need some adjustment, and I need to speak to @Zethson. My other PR has an operational version, very scrappy for now, but enough for the figures I think; I will get back to this in a week or so. Right now I have to focus on Pfizer. Thank you for all your comments Yuge :)
Description of changes
I have benchmarked the initial part of DIALOGUE (DIALOGUE1). I changed the `_pseudobulk_feature_space` function so that the user can choose the aggregation method (median or mean), and the output now has samples as rows, matching the R implementation. I also modified the `_scale_data` function to center, scale, and cap extreme values (with a cap of 0.01) in a way that mirrors the R functions center.matrix and cap.mat. In addition, I updated the `_load` function to optionally restrict the data to common samples across cell types. The output of `_load` is now a DataFrame that is converted back to a NumPy array before further processing.
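As a rough illustration of the centering-plus-capping behavior described above (a sketch with a hypothetical function name, not the code in this PR; cap=0.01 clips each column at its 1st and 99th percentiles):

```python
import numpy as np


def center_and_cap(x: np.ndarray, cap: float = 0.01) -> np.ndarray:
    """Illustrative analogue of R's center.matrix followed by cap.mat."""
    x = x - x.mean(axis=0)                  # column-wise centering
    lo = np.quantile(x, cap, axis=0)        # lower cap per column
    hi = np.quantile(x, 1.0 - cap, axis=0)  # upper cap per column
    return np.clip(x, lo, hi)               # clamp extreme values
```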
Technical details
The changes make the pseudobulk and normalization steps in Python produce results that match the R version. I added an optional parameter to subset to common samples and one to choose the averaging function. I also ensure that the data are converted to NumPy arrays before being passed to the penalized matrix decomposition functions.
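A small sketch of the selectable aggregation plus the NumPy handoff (the names here are illustrative assumptions, not the exact pertpy signatures):

```python
import pandas as pd


def pseudobulk(cells: pd.DataFrame, sample_labels: pd.Series, agg: str = "median") -> pd.DataFrame:
    """Illustrative pseudobulk: aggregate cells (rows) per sample with a chosen statistic."""
    if agg not in {"median", "mean"}:
        raise ValueError("agg must be 'median' or 'mean'")
    return cells.groupby(sample_labels).agg(agg)  # rows become samples, matching the R orientation


# Downstream, the penalized matrix decomposition receives plain arrays:
# X = pseudobulk(cells, labels, agg="mean").to_numpy()
```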
Additional context
These changes only affect the initial part of DIALOGUE (DIALOGUE1) and do not modify downstream analysis.