Merged
8 changes: 7 additions & 1 deletion docs/PREPROCESSING.md
@@ -52,7 +52,7 @@ src

# Polymorphic structures: Generate SDFs

Use the segmentation data for polymorphic structures as input to the SDF generation step.

```
src
@@ -64,3 +64,9 @@ src
         ├── get_max_bounding_box.py <- Get bounds of the largest scaled mesh
         └── pc_sdfs.py <- Sample point clouds from scaled meshes
```
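The sign convention an SDF step like this produces can be illustrated with a shape whose signed distance has a closed form. This toy sphere example (not the actual `pc_sdfs.py` logic, which samples scaled meshes) shows the convention: negative inside the surface, zero on it, positive outside.

```python
import math

def sphere_sdf(point, radius=1.0):
    """Signed distance from `point` to a sphere centered at the origin:
    negative inside, zero on the surface, positive outside."""
    x, y, z = point
    return math.sqrt(x * x + y * y + z * z) - radius

# Probe a few points, conceptually like sampling point clouds from a mesh.
samples = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
sdf_values = [sphere_sdf(p) for p in samples]
print(sdf_values)  # [-1.0, 0.0, 1.0]
```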

The scale factors can be computed using the `get_max_bounding_box` script. Alternatively, pre-computed scale factors can be downloaded along with the rest of the preprocessed data. The following scale factors are available for download:

1. [WTC-11 hIPSc single cell image dataset v1 nucleolus (NPM1)](https://open.quiltdata.com/b/allencell/tree/aics/morphology_appropriate_representation_learning/preprocessed_data/npm1/scale_factor.npz)
2. [WTC-11 hIPSc single cell image dataset v1 nucleolus (NPM1) 64 resolution](https://open.quiltdata.com/b/allencell/tree/aics/morphology_appropriate_representation_learning/preprocessed_data/npm1_64_res/scale_factor.npz)
3. [WTC-11 hIPSc single cell image dataset v1 polymorphic structures](https://open.quiltdata.com/b/allencell/tree/aics/morphology_appropriate_representation_learning/preprocessed_data/other_polymorphic/scale_factor.npz)
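Each download is a NumPy `.npz` archive. A round-trip sketch of reading one follows; the key name `scale_factor` is an assumption, so inspect `data.files` on the downloaded archive to see the actual keys.

```python
import os
import tempfile

import numpy as np

# Stand-in for a downloaded scale_factor.npz; the real archives use the same
# .npz container. The key name "scale_factor" is assumed, not verified.
path = os.path.join(tempfile.mkdtemp(), "scale_factor.npz")
np.savez(path, scale_factor=np.array([0.39]))

data = np.load(path)
print(data.files)                      # keys stored in the archive
print(float(data["scale_factor"][0]))  # the stored scale factor
```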
27 changes: 24 additions & 3 deletions docs/USAGE.md
@@ -79,7 +79,7 @@ aws s3 cp --no-sign-request --recursive s3://allencell/aics/morphology_appropria
Training these models can take days. We've published our trained models so you don't have to. Skip to the [next section](#2-model-inference) if you'd like to just use our models.

1. Create a single cell manifest (e.g. csv, parquet) for each dataset with a column of final processed paths and a split column indicating the train/test/validation split.
Collaborator:

> Should we have info about which columns this manifest must have? Seems related to Yuichiro's question. What do you think?

Contributor Author:

> Right now, if you run our preprocessing, the column name for registered images is `registered_path`, the column name for point clouds is `pointcloud_path`, the column name for segmentations is `seg_path`, and the column name for SDFs is `sdf_path`. The name really can be anything; you just have to point to that column name in the config file. I'm not sure I want to fix this name to something specific.
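Concretely, a manifest following the default preprocessing column names mentioned in this thread might be written like this (all paths and cell IDs below are hypothetical):

```python
import csv
import os
import tempfile

# Hypothetical rows: the column names follow the preprocessing defaults
# (registered_path, pointcloud_path, seg_path, sdf_path); any names work as
# long as the datamodule config points at them. The split column drives the
# train/test/validation split.
rows = [
    {"CellId": "cell_0001", "registered_path": "/data/reg/cell_0001.tif",
     "pointcloud_path": "/data/pc/cell_0001.ply",
     "seg_path": "/data/seg/cell_0001.tif",
     "sdf_path": "/data/sdf/cell_0001.npz", "split": "train"},
    {"CellId": "cell_0002", "registered_path": "/data/reg/cell_0002.tif",
     "pointcloud_path": "/data/pc/cell_0002.ply",
     "seg_path": "/data/seg/cell_0002.tif",
     "sdf_path": "/data/sdf/cell_0002.npz", "split": "val"},
]

manifest_path = os.path.join(tempfile.mkdtemp(), "manifest.csv")
with open(manifest_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
```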

2. Update the final single cell dataset path (`SINGLE_CELL_DATASET_PATH`) and the column in the manifest for appropriate input modality (`SDF_COLUMN`/`SEG_COLUMN`/`POINTCLOUD_COLUMN`/`IMAGE_COLUMN`) in each [datamodule file](../configs/data/). e.g. for PCNA data these yaml files are located here -
2. Update the [datamodule config file](../configs/data/) with the path to this single cell manifest. For example, update the `path` key in the [pcna config](../configs/data/pcna/image.yaml) to be the path to the processed single cell manifest. Additionally, update the `image` and `cell_id` keys under `transforms/groups` to point to their corresponding column names in the single cell manifest. Similarly, update all other image and pointcloud datamodule files for the PCNA dataset here -

```
configs
@@ -199,10 +199,31 @@ python src/br/analysis/run_features_combine.py --feature_path_1 './outputs_npm1/'
| other_punctate | `python src/br/analysis/run_analysis.py --save_path "./outputs_other_punctate/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/other_punctate" --dataset_name "other_punctate" --run_name "Rotation_invariant_pointcloud_structurenorm" --sdf False --pacmap True` |
| pcna | `python src/br/analysis/run_analysis.py --save_path "./outputs_pcna/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/pcna" --dataset_name "pcna" --run_name "Rotation_invariant_pointcloud_jitter" --sdf False --pacmap False` |

3. To run drug perturbation analysis using the pre-computed features, run
## Steps to run analysis for the nucleolar drug perturbation dataset

1. To compute q-values for the mean average precision scores associated with perturbation retrieval using the pre-computed features, run

```
python src/br/analysis/run_drugdata_analysis.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb"
```
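Retrieval q-values of this kind combine an average precision score per perturbation with a multiple-testing correction. A minimal pure-Python sketch of both pieces (illustrative only, not the repository's exact implementation):

```python
def average_precision(relevant):
    """AP over a ranked retrieval list of 0/1 relevance labels."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)


def bh_qvalues(pvals):
    """Benjamini-Hochberg q-values for a list of p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(reversed(order)):
        rank = m - position
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals


print(average_precision([1, 0, 1, 0]))   # (1/1 + 2/3) / 2 = 0.8333...
print(bh_qvalues([0.01, 0.04, 0.03, 0.5]))
```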

2. To compute CellProfiler features, open the [project file](../src/br/analysis/cellprofiler/npm1_perturb_cellprofiler.cpproj) in CellProfiler and point it to the single cell images of nucleoli in the [npm1 perturbation dataset](https://open.quiltdata.com/b/allencell/tree/aics/NPM1_single_cell_drug_perturbations/). This generates a CSV named `MyExpt_Image.csv` containing mean, median, and standard deviation statistics per image across the computed features.

3. To compute classification scores for the number of pieces of nucleoli using the pre-computed features and cellprofiler features, run

```
python src/br/analysis/run_classification.py --save_path "./outputs_npm1/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1/" --dataset_name "npm1"
```
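The classification step fits models on the embeddings to predict the number of nucleolar pieces. As a toy stand-in for whatever classifiers the script actually trains, a nearest-centroid classifier captures the idea:

```python
def nearest_centroid_predict(train_X, train_y, test_X):
    """Classify each test embedding by its nearest class centroid
    (a toy stand-in for the classifiers run_classification.py trains)."""
    classes = sorted(set(train_y))
    centroids = {}
    for c in classes:
        pts = [x for x, y in zip(train_X, train_y) if y == c]
        centroids[c] = [sum(col) / len(pts) for col in zip(*pts)]

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(classes, key=lambda c: dist2(x, centroids[c])) for x in test_X]


# Two tiny clusters standing in for embeddings of 1-piece vs. 2-piece nucleoli.
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
y = [0, 0, 1, 1]
preds = nearest_centroid_predict(X, y, [[0.05, 0.05], [1.0, 0.9]])
print(preds)  # [0, 1]
```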

4. To run LDA analysis on the drug perturbation dataset, run

```
python src/br/analysis/run_drugdata_LDA.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb" --raw_path "./NPM1_single_cell_drug_perturbations/"
```

To run a baseline LDA on the DMSO subset of the drug perturbation dataset, run

```
python src/br/analysis/run_drugdata_LDA.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb" --raw_path "./NPM1_single_cell_drug_perturbations/" --baseline True
```
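The LDA step finds the direction that best separates the two groups (DMSO vs. drug, or the two DMSO plate subsets for the baseline). The script relies on scikit-learn's `LinearDiscriminantAnalysis` (`clf.coef_`); the underlying two-class Fisher direction w = Sw^-1 (m1 - m0) can be sketched in plain Python for 2-D embeddings:

```python
def fisher_lda_direction(X0, X1):
    """Two-class Fisher LDA direction w = Sw^-1 (m1 - m0) for 2-D points."""
    def mean(X):
        return [sum(c) / len(X) for c in zip(*X)]

    def scatter(X, m):
        s = [[0.0, 0.0], [0.0, 0.0]]
        for x in X:
            d = [x[0] - m[0], x[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        return s

    m0, m1 = mean(X0), mean(X1)
    s0, s1 = scatter(X0, m0), scatter(X1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]  # within-class scatter
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    dm = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * dm[0] + inv[0][1] * dm[1],
            inv[1][0] * dm[0] + inv[1][1] * dm[1]]


# Two synthetic groups separated along the first axis.
control = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
treated = [(3.0, 0.0), (4.0, 0.0), (3.0, 1.0), (4.0, 1.0)]
print(fisher_lda_direction(control, treated))  # [1.5, 0.0]
```

The direction can then be used exactly as the script uses `lda_direction`: projecting both groups onto it and plotting the separation.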

27 changes: 21 additions & 6 deletions src/br/analysis/run_drugdata_LDA.py
@@ -10,6 +10,7 @@
import matplotlib.pyplot as plt
import pandas as pd
from br.models.compute_features import get_embeddings
from br.analysis.analysis_utils import str2bool
from br.models.utils import get_all_configs_per_dataset
from skimage import measure
import seaborn as sns
@@ -162,6 +163,7 @@ def main(args):
"Brefeldin 5uM": "Brefeldin",
}
all_ret["condition"] = all_ret["condition"].replace(map_)
all_ret = all_ret.merge(raw_df[['CellId', 'plate_id']], on='CellId')
cols = [i for i in all_ret.columns if "mu" in i]

hits = [
@@ -174,7 +176,6 @@
"Roscovitine 10uM",
]

res = {}
scale_lows = [0.3, 0.3, 0.3, 0.3, 0.4, 0.25, 0.3, 0.3, 0.3, 0.3]
scale_highs = [0.3, 0.3, 0.3, 0.3, 0.4, 0.25, 0.3, 0.3, 0.3, 0.3]
scale_lows = [i * 0.1 for i in scale_lows]
@@ -194,12 +195,23 @@
merge_thresh[6] = 7
sns.set_context("talk")

if args.baseline:
all_ret = all_ret.loc[all_ret['condition'] == 'DMSO (control)']
hits = [[215, 214, 231, 213, 232, 230, 233, 216]] # random sample of plates
merge_thresh = [11]
scale_lows = [i*2 for i in scale_lows]
scale_highs = [i*3 for i in scale_highs]

for j, hit in enumerate(hits):
print("Analysis for", hit)
scale_low = scale_lows[j]
scale_high = scale_highs[j]
tmp1 = all_ret.loc[all_ret["condition"] == "DMSO (control)"]
tmp2 = all_ret.loc[all_ret["condition"] == hit]
if not args.baseline:
tmp1 = all_ret.loc[all_ret["condition"] == "DMSO (control)"]
tmp2 = all_ret.loc[all_ret["condition"] == hit]
else:
tmp1 = all_ret.loc[all_ret['plate_id'].isin(hit[:4])]
tmp2 = all_ret.loc[all_ret['plate_id'].isin(hit[4:])]
tmp1["class"] = 0
tmp2["class"] = 1
tmp = pd.concat([tmp1, tmp2], axis=0).reset_index(drop=True)
@@ -210,8 +222,6 @@
preds = clf.fit_transform(X, y)
lda_direction = clf.coef_[0]
lda_line = np.array([-lda_direction * scale_low, lda_direction * scale_high])
res[hit] = preds

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
colors = plt.cm.Set2(np.linspace(0, 1, 8))
# PCA Projection plot
@@ -393,6 +403,7 @@ def main(args):
"--dataset_name", type=str, required=True, help="Name of the dataset."
)
parser.add_argument("--raw_path", type=str, required=True, help="Path to raw data")
parser.add_argument("--baseline", type=str2bool, default=False, help="Perform LDA baseline only")
args = parser.parse_args()

# Validate that required paths are provided
@@ -405,5 +416,9 @@
"""
Example run:

For all drugs:
python src/br/analysis/run_drugdata_LDA.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb" --raw_path "./NPM1_single_cell_drug_perturbations/"

For baseline (DMSO subset 1 -> DMSO subset 2):
python src/br/analysis/run_drugdata_LDA.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb" --raw_path "./NPM1_single_cell_drug_perturbations/" --baseline True
"""