Merged
8 changes: 7 additions & 1 deletion docs/PREPROCESSING.md
@@ -52,7 +52,7 @@ src

# Polymorphic structures: Generate SDFs

Use the segmentation data for polymorphic structures as input to the SDF generation step.

```
src
@@ -64,3 +64,9 @@ src
         ├── get_max_bounding_box.py <- Get bounds of the largest scaled mesh
         └── pc_sdfs.py <- Sample point clouds from scaled meshes
```
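The sign convention an SDF step like this produces can be illustrated with a shape whose signed distance has a closed form. This toy sphere example (not the actual `pc_sdfs.py` logic, which samples scaled meshes) shows the convention: negative inside the surface, zero on it, positive outside.

```python
import math

def sphere_sdf(point, radius=1.0):
    """Signed distance from `point` to a sphere centered at the origin:
    negative inside, zero on the surface, positive outside."""
    x, y, z = point
    return math.sqrt(x * x + y * y + z * z) - radius

# Probe a few points, conceptually like sampling point clouds from a mesh.
samples = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
sdf_values = [sphere_sdf(p) for p in samples]
print(sdf_values)  # [-1.0, 0.0, 1.0]
```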

The scale factors can be computed using the `get_max_bounding_box` script. Alternatively, pre-computed scale factors can be downloaded along with the rest of the preprocessed data. The following scale factors are available for download:

1. [WTC-11 hIPSc single cell image dataset v1 nucleolus (NPM1)](https://open.quiltdata.com/b/allencell/tree/aics/morphology_appropriate_representation_learning/preprocessed_data/npm1/scale_factor.npz)
2. [WTC-11 hIPSc single cell image dataset v1 nucleolus (NPM1) 64 resolution](https://open.quiltdata.com/b/allencell/tree/aics/morphology_appropriate_representation_learning/preprocessed_data/npm1_64_res/scale_factor.npz)
3. [WTC-11 hIPSc single cell image dataset v1 polymorphic structures](https://open.quiltdata.com/b/allencell/tree/aics/morphology_appropriate_representation_learning/preprocessed_data/other_polymorphic/scale_factor.npz)
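Each download is a NumPy `.npz` archive. A round-trip sketch of reading one follows; the key name `scale_factor` is an assumption, so inspect `data.files` on the downloaded archive to see the actual keys.

```python
import os
import tempfile

import numpy as np

# Stand-in for a downloaded scale_factor.npz; the real archives use the same
# .npz container. The key name "scale_factor" is assumed, not verified.
path = os.path.join(tempfile.mkdtemp(), "scale_factor.npz")
np.savez(path, scale_factor=np.array([0.39]))

data = np.load(path)
print(data.files)                      # keys stored in the archive
print(float(data["scale_factor"][0]))  # the stored scale factor
```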
27 changes: 24 additions & 3 deletions docs/USAGE.md
@@ -79,7 +79,7 @@ aws s3 cp --no-sign-request --recursive s3://allencell/aics/morphology_appropria
Training these models can take days. We've published our trained models so you don't have to. Skip to the [next section](#2-model-inference) if you'd like to just use our models.

1. Create a single cell manifest (e.g. csv, parquet) for each dataset with a column of final processed paths and a split column indicating the train/test/validation split.
Collaborator:

> Should we have info about which columns this manifest must have? Seems related to Yuichiro's question. What do you think?

Contributor Author:

> Right now, if you run our preprocessing, the column name for registered images is `registered_path`, the column name for point clouds is `pointcloud_path`, the column name for segmentations is `seg_path`, and the column name for SDFs is `sdf_path`. The name really can be anything; you just have to point to that column name in the config file. I'm not sure I want to fix this name to something specific.
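Concretely, a manifest following the default preprocessing column names mentioned in this thread might be written like this (all paths and cell IDs below are hypothetical):

```python
import csv
import os
import tempfile

# Hypothetical rows: the column names follow the preprocessing defaults
# (registered_path, pointcloud_path, seg_path, sdf_path); any names work as
# long as the datamodule config points at them. The split column drives the
# train/test/validation split.
rows = [
    {"CellId": "cell_0001", "registered_path": "/data/reg/cell_0001.tif",
     "pointcloud_path": "/data/pc/cell_0001.ply",
     "seg_path": "/data/seg/cell_0001.tif",
     "sdf_path": "/data/sdf/cell_0001.npz", "split": "train"},
    {"CellId": "cell_0002", "registered_path": "/data/reg/cell_0002.tif",
     "pointcloud_path": "/data/pc/cell_0002.ply",
     "seg_path": "/data/seg/cell_0002.tif",
     "sdf_path": "/data/sdf/cell_0002.npz", "split": "val"},
]

manifest_path = os.path.join(tempfile.mkdtemp(), "manifest.csv")
with open(manifest_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
```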

2. Update the final single cell dataset path (`SINGLE_CELL_DATASET_PATH`) and the column in the manifest for appropriate input modality (`SDF_COLUMN`/`SEG_COLUMN`/`POINTCLOUD_COLUMN`/`IMAGE_COLUMN`) in each [datamodule file](../configs/data/). e.g. for PCNA data these yaml files are located here -
2. Update the [datamodule config file](../configs/data/) with the path to this single cell manifest. For example, update the `path` key in the [pcna config](../configs/data/pcna/image.yaml) to be the path to the processed single cell manifest. Additionally, update the `image` and `cell_id` keys under `transforms/groups` to point to their corresponding column names in the single cell manifest. Similarly, update all other image and pointcloud datamodule files for the PCNA dataset here -

```
configs
@@ -199,10 +199,31 @@ python src/br/analysis/run_features_combine.py --feature_path_1 './outputs_npm1/'
| other_punctate | `python src/br/analysis/run_analysis.py --save_path "./outputs_other_punctate/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/other_punctate" --dataset_name "other_punctate" --run_name "Rotation_invariant_pointcloud_structurenorm" --sdf False --pacmap True` |
| pcna | `python src/br/analysis/run_analysis.py --save_path "./outputs_pcna/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/pcna" --dataset_name "pcna" --run_name "Rotation_invariant_pointcloud_jitter" --sdf False --pacmap False` |

3. To run drug perturbation analysis using the pre-computed features, run
## Steps to run analysis for the nucleolar drug perturbation dataset

1. To compute q-values for the mean average precision scores associated with perturbation retrieval using the pre-computed features, run

```
python src/br/analysis/run_drugdata_analysis.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb"
```
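Retrieval q-values of this kind combine an average precision score per perturbation with a multiple-testing correction. A minimal pure-Python sketch of both pieces (illustrative only, not the repository's exact implementation):

```python
def average_precision(relevant):
    """AP over a ranked retrieval list of 0/1 relevance labels."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)


def bh_qvalues(pvals):
    """Benjamini-Hochberg q-values for a list of p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(reversed(order)):
        rank = m - position
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals


print(average_precision([1, 0, 1, 0]))   # (1/1 + 2/3) / 2 = 0.8333...
print(bh_qvalues([0.01, 0.04, 0.03, 0.5]))
```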

2. To compute CellProfiler features, open the [project file](../src/br/analysis/cellprofiler/npm1_perturb_cellprofiler.cpproj) in CellProfiler and point it to the single cell images of nucleoli in the [npm1 perturbation dataset](https://open.quiltdata.com/b/allencell/tree/aics/NPM1_single_cell_drug_perturbations/). This generates a CSV named `MyExpt_Image.csv` containing mean, median, and standard deviation statistics per image across the computed features.

3. To compute classification scores for the number of pieces of nucleoli using the pre-computed features and cellprofiler features, run

```
python src/br/analysis/run_classification.py --save_path "./outputs_npm1/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1/" --dataset_name "npm1"
```
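The classification step fits models on the embeddings to predict the number of nucleolar pieces. As a toy stand-in for whatever classifiers the script actually trains, a nearest-centroid classifier captures the idea:

```python
def nearest_centroid_predict(train_X, train_y, test_X):
    """Classify each test embedding by its nearest class centroid
    (a toy stand-in for the classifiers run_classification.py trains)."""
    classes = sorted(set(train_y))
    centroids = {}
    for c in classes:
        pts = [x for x, y in zip(train_X, train_y) if y == c]
        centroids[c] = [sum(col) / len(pts) for col in zip(*pts)]

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(classes, key=lambda c: dist2(x, centroids[c])) for x in test_X]


# Two tiny clusters standing in for embeddings of 1-piece vs. 2-piece nucleoli.
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
y = [0, 0, 1, 1]
preds = nearest_centroid_predict(X, y, [[0.05, 0.05], [1.0, 0.9]])
print(preds)  # [0, 1]
```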

4. To run LDA analysis on the drug perturbation dataset, run

```
python src/br/analysis/run_drugdata_LDA.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb" --raw_path "./NPM1_single_cell_drug_perturbations/"
```

To run a baseline LDA on the DMSO subset of the drug perturbation dataset, run

```
python src/br/analysis/run_drugdata_LDA.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb" --raw_path "./NPM1_single_cell_drug_perturbations/" --baseline True
```
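The LDA step finds the direction that best separates the two groups (DMSO vs. drug, or the two DMSO plate subsets for the baseline). The script relies on scikit-learn's `LinearDiscriminantAnalysis` (`clf.coef_`); the underlying two-class Fisher direction w = Sw^-1 (m1 - m0) can be sketched in plain Python for 2-D embeddings:

```python
def fisher_lda_direction(X0, X1):
    """Two-class Fisher LDA direction w = Sw^-1 (m1 - m0) for 2-D points."""
    def mean(X):
        return [sum(c) / len(X) for c in zip(*X)]

    def scatter(X, m):
        s = [[0.0, 0.0], [0.0, 0.0]]
        for x in X:
            d = [x[0] - m[0], x[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        return s

    m0, m1 = mean(X0), mean(X1)
    s0, s1 = scatter(X0, m0), scatter(X1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]  # within-class scatter
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    dm = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * dm[0] + inv[0][1] * dm[1],
            inv[1][0] * dm[0] + inv[1][1] * dm[1]]


# Two synthetic groups separated along the first axis.
control = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
treated = [(3.0, 0.0), (4.0, 0.0), (3.0, 1.0), (4.0, 1.0)]
print(fisher_lda_direction(control, treated))  # [1.5, 0.0]
```

The direction can then be used exactly as the script uses `lda_direction`: projecting both groups onto it and plotting the separation.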

27 changes: 21 additions & 6 deletions src/br/analysis/run_drugdata_LDA.py
@@ -10,6 +10,7 @@
import matplotlib.pyplot as plt
import pandas as pd
from br.models.compute_features import get_embeddings
from br.analysis.analysis_utils import str2bool
from br.models.utils import get_all_configs_per_dataset
from skimage import measure
import seaborn as sns
@@ -162,6 +163,7 @@ def main(args):
"Brefeldin 5uM": "Brefeldin",
}
all_ret["condition"] = all_ret["condition"].replace(map_)
all_ret = all_ret.merge(raw_df[['CellId', 'plate_id']], on='CellId')
cols = [i for i in all_ret.columns if "mu" in i]

hits = [
@@ -174,7 +176,6 @@
"Roscovitine 10uM",
]

res = {}
scale_lows = [0.3, 0.3, 0.3, 0.3, 0.4, 0.25, 0.3, 0.3, 0.3, 0.3]
scale_highs = [0.3, 0.3, 0.3, 0.3, 0.4, 0.25, 0.3, 0.3, 0.3, 0.3]
scale_lows = [i * 0.1 for i in scale_lows]
@@ -194,12 +195,23 @@
merge_thresh[6] = 7
sns.set_context("talk")

if args.baseline:
all_ret = all_ret.loc[all_ret['condition'] == 'DMSO (control)']
hits = [[215, 214, 231, 213, 232, 230, 233, 216]] # random sample of plates
merge_thresh = [11]
scale_lows = [i*2 for i in scale_lows]
scale_highs = [i*3 for i in scale_highs]

for j, hit in enumerate(hits):
print("Analysis for", hit)
scale_low = scale_lows[j]
scale_high = scale_highs[j]
tmp1 = all_ret.loc[all_ret["condition"] == "DMSO (control)"]
tmp2 = all_ret.loc[all_ret["condition"] == hit]
if not args.baseline:
tmp1 = all_ret.loc[all_ret["condition"] == "DMSO (control)"]
tmp2 = all_ret.loc[all_ret["condition"] == hit]
else:
tmp1 = all_ret.loc[all_ret['plate_id'].isin(hit[:4])]
tmp2 = all_ret.loc[all_ret['plate_id'].isin(hit[4:])]
tmp1["class"] = 0
tmp2["class"] = 1
tmp = pd.concat([tmp1, tmp2], axis=0).reset_index(drop=True)
@@ -210,8 +222,6 @@
preds = clf.fit_transform(X, y)
lda_direction = clf.coef_[0]
lda_line = np.array([-lda_direction * scale_low, lda_direction * scale_high])
res[hit] = preds

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
colors = plt.cm.Set2(np.linspace(0, 1, 8))
# PCA Projection plot
@@ -393,6 +403,7 @@ def main(args):
"--dataset_name", type=str, required=True, help="Name of the dataset."
)
parser.add_argument("--raw_path", type=str, required=True, help="Path to raw data")
parser.add_argument("--baseline", type=str2bool, default=False, help="Perform LDA baseline only")
args = parser.parse_args()

# Validate that required paths are provided
@@ -405,5 +416,9 @@
"""
Example run:

For all drugs:
python src/br/analysis/run_drugdata_LDA.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb" --raw_path "./NPM1_single_cell_drug_perturbations/"

For baseline (DMSO subset 1 -> DMSO subset 2):
python src/br/analysis/run_drugdata_LDA.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb" --raw_path "./NPM1_single_cell_drug_perturbations/" --baseline True
"""