diff --git a/docs/PREPROCESSING.md b/docs/PREPROCESSING.md
index 76de45b..5b50e0f 100644
--- a/docs/PREPROCESSING.md
+++ b/docs/PREPROCESSING.md
@@ -52,7 +52,7 @@ src
 
 # Polymorphic structures: Generate SDFs
 
-Use the segmentation data for polymorphic structures as input to the SDF generation step. 
+Use the segmentation data for polymorphic structures as input to the SDF generation step.
 
 ```
 src
@@ -64,3 +64,9 @@ src
          ├── get_max_bounding_box.py <- Get bounds of the largest scaled mesh
          └── pc_sdfs.py <- Sample point clouds from scaled meshes
 ```
+
+The scale factors can be computed using the `get_max_bounding_box` script. Alternatively, the pre-computed scale factors can be downloaded along with the rest of the preprocessed data. The following scale factors are available for download:
+
+1. [WTC-11 hiPSC single cell image dataset v1 nucleolus (NPM1)](https://open.quiltdata.com/b/allencell/tree/aics/morphology_appropriate_representation_learning/preprocessed_data/npm1/scale_factor.npz)
+2. [WTC-11 hiPSC single cell image dataset v1 nucleolus (NPM1) 64 resolution](https://open.quiltdata.com/b/allencell/tree/aics/morphology_appropriate_representation_learning/preprocessed_data/npm1_64_res/scale_factor.npz)
+3. [WTC-11 hiPSC single cell image dataset v1 polymorphic structures](https://open.quiltdata.com/b/allencell/tree/aics/morphology_appropriate_representation_learning/preprocessed_data/other_polymorphic/scale_factor.npz)
\ No newline at end of file
diff --git a/docs/USAGE.md b/docs/USAGE.md
index 66cb405..20ec5a6 100644
--- a/docs/USAGE.md
+++ b/docs/USAGE.md
@@ -79,7 +79,7 @@ aws s3 cp --no-sign-request --recursive s3://allencell/aics/morphology_appropria
 Training these models can take days. We've published our trained models so you don't have to. Skip to the [next section](#2-model-inference) if you'd like to just use our models.
 
 1. Create a single cell manifest (e.g. csv, parquet) for each dataset with a column corresponding to final processed paths, and create a split column corresponding to train/test/validation split.
-2. Update the final single cell dataset path (`SINGLE_CELL_DATASET_PATH`) and the column in the manifest for appropriate input modality (`SDF_COLUMN`/`SEG_COLUMN`/`POINTCLOUD_COLUMN`/`IMAGE_COLUMN`) in each [datamodule file](../configs/data/). e.g. for PCNA data these yaml files are located here -
+2. Update the [datamodule config file](../configs/data/) with the path to this single cell manifest. For example, update the `path` key in the [pcna config](../configs/data/pcna/image.yaml) to be the path to the processed single cell manifest. Additionally, update the `image` and `cell_id` keys under `transforms/groups` to point to their corresponding column names in the single cell manifest.
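+
+For illustration, a minimal manifest with the columns referenced above might look like the sketch below (the paths, file names, and values are placeholders; use whatever your preprocessing run produced and point the config keys at the matching columns):
+
+```
+import pandas as pd
+
+# Hypothetical example: one row per cell, with the processed image path,
+# the cell identifier, and the train/test/validation split assignment.
+manifest = pd.DataFrame(
+    {
+        "image": ["/data/processed/cell_0001.tif", "/data/processed/cell_0002.tif"],
+        "cell_id": ["cell_0001", "cell_0002"],
+        "split": ["train", "val"],
+    }
+)
+manifest.to_csv("single_cell_manifest.csv", index=False)
+```
+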
 Similarly, update all other image and pointcloud datamodule files for the PCNA dataset here -
 ```
 configs
@@ -199,10 +199,31 @@ python src/br/analysis/run_features_combine.py --feature_path_1 './outputs_npm1/
 | other_punctate | `python src/br/analysis/run_analysis.py --save_path "./outputs_other_punctate/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/other_punctate" --dataset_name "other_punctate" --run_name "Rotation_invariant_pointcloud_structurenorm" --sdf False --pacmap True` |
 | pcna | `python src/br/analysis/run_analysis.py --save_path "./outputs_pcna/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/pcna" --dataset_name "pcna" --run_name "Rotation_invariant_pointcloud_jitter" --sdf False --pacmap False` |
 
-3. To run drug perturbation analysis using the pre-computed features, run
+## Steps to run analysis for the nucleolar drug perturbation dataset
+
+1. To compute q-values for the mean average precision scores associated with perturbation retrieval using the pre-computed features, run
 
 ```
 python src/br/analysis/run_drugdata_analysis.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb"
 ```
 
-To compute cellprofiler features, open the [project file](../src/br/analysis/cellprofiler/npm1_perturb_cellprofiler.cpproj) using cellprofiler, and point to the single cell images of nucleoli in the [npm1 perturbation dataset](https://open.quiltdata.com/b/allencell/tree/aics/NPM1_single_cell_drug_perturbations/). This will generate a csv named `MyExpt_Image.csv` that contains mean, median, and stdev statistics per image across the different computed features.
+2. To compute CellProfiler features, open the [project file](../src/br/analysis/cellprofiler/npm1_perturb_cellprofiler.cpproj) using CellProfiler, and point it to the single cell images of nucleoli in the [npm1 perturbation dataset](https://open.quiltdata.com/b/allencell/tree/aics/NPM1_single_cell_drug_perturbations/). This will generate a CSV named `MyExpt_Image.csv` that contains mean, median, and stdev statistics per image across the different computed features.
+
+3. To compute classification scores for the number of pieces of nucleoli using the pre-computed features and CellProfiler features, run
+
+```
+python src/br/analysis/run_classification.py --save_path "./outputs_npm1/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1/" --dataset_name "npm1"
+```
+
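+Conceptually, this classification step amounts to predicting a per-cell label (the number of nucleolar pieces) from a table of features. A rough, hypothetical sketch of that idea (the file name, the `label` column, and the `mu_*` feature columns are placeholders, not the script's actual interface) is:
+
+```
+import pandas as pd
+from sklearn.linear_model import LogisticRegression
+from sklearn.model_selection import cross_val_score
+
+# Hypothetical table: one row per cell, feature columns prefixed with "mu_",
+# plus a "label" column holding the annotated number of nucleolar pieces.
+df = pd.read_csv("features_with_labels.csv")
+feature_cols = [c for c in df.columns if c.startswith("mu_")]
+
+scores = cross_val_score(LogisticRegression(max_iter=1000), df[feature_cols], df["label"], cv=5)
+print("Mean cross-validated accuracy:", scores.mean())
+```
+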
+4. To run LDA analysis on the drug perturbation dataset, run
+
+```
+python src/br/analysis/run_drugdata_LDA.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb" --raw_path "./NPM1_single_cell_drug_perturbations/"
+```
+
+To run a baseline LDA on the DMSO subset of the drug perturbation dataset, run
+
+```
+python src/br/analysis/run_drugdata_LDA.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb" --raw_path "./NPM1_single_cell_drug_perturbations/" --baseline True
+```
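+
+Under the hood, each comparison fits a two-class linear discriminant between control and perturbed cells in the embedding space and projects every cell onto the resulting axis. A simplified, self-contained sketch of that idea (toy random data stands in for the real embeddings):
+
+```
+import numpy as np
+from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
+
+# Toy stand-ins for control (class 0) and perturbed (class 1) embeddings.
+rng = np.random.default_rng(0)
+X = np.vstack([rng.normal(0.0, 1.0, size=(100, 16)), rng.normal(0.5, 1.0, size=(100, 16))])
+y = np.array([0] * 100 + [1] * 100)
+
+clf = LinearDiscriminantAnalysis()
+proj = clf.fit_transform(X, y)  # 1D projection of every cell onto the discriminant axis
+direction = clf.coef_[0]  # discriminant direction in embedding space
+print(proj.shape, direction.shape)
+```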
diff --git a/src/br/analysis/run_drugdata_LDA.py b/src/br/analysis/run_drugdata_LDA.py
index 3ce3aac..8e10a91 100644
--- a/src/br/analysis/run_drugdata_LDA.py
+++ b/src/br/analysis/run_drugdata_LDA.py
@@ -10,6 +10,7 @@
 import matplotlib.pyplot as plt
 import pandas as pd
 from br.models.compute_features import get_embeddings
+from br.analysis.analysis_utils import str2bool
 from br.models.utils import get_all_configs_per_dataset
 from skimage import measure
 import seaborn as sns
@@ -162,6 +163,7 @@ def main(args):
         "Brefeldin 5uM": "Brefeldin",
     }
     all_ret["condition"] = all_ret["condition"].replace(map_)
+    all_ret = all_ret.merge(raw_df[['CellId', 'plate_id']], on='CellId')
 
     cols = [i for i in all_ret.columns if "mu" in i]
     hits = [
@@ -174,7 +176,6 @@
         "Roscovitine 10uM",
     ]
 
-    res = {}
     scale_lows = [0.3, 0.3, 0.3, 0.3, 0.4, 0.25, 0.3, 0.3, 0.3, 0.3]
     scale_highs = [0.3, 0.3, 0.3, 0.3, 0.4, 0.25, 0.3, 0.3, 0.3, 0.3]
     scale_lows = [i * 0.1 for i in scale_lows]
@@ -194,12 +195,23 @@
     merge_thresh[6] = 7
 
     sns.set_context("talk")
+    if args.baseline:
+        all_ret = all_ret.loc[all_ret['condition'] == 'DMSO (control)']
+        hits = [[215, 214, 231, 213, 232, 230, 233, 216]]  # random sample of plates
+        merge_thresh = [11]
+        scale_lows = [i * 2 for i in scale_lows]
+        scale_highs = [i * 3 for i in scale_highs]
+
     for j, hit in enumerate(hits):
         print("Analysis for", hit)
         scale_low = scale_lows[j]
         scale_high = scale_highs[j]
-        tmp1 = all_ret.loc[all_ret["condition"] == "DMSO (control)"]
-        tmp2 = all_ret.loc[all_ret["condition"] == hit]
+        if not args.baseline:
+            tmp1 = all_ret.loc[all_ret["condition"] == "DMSO (control)"]
+            tmp2 = all_ret.loc[all_ret["condition"] == hit]
+        else:
+            tmp1 = all_ret.loc[all_ret['plate_id'].isin(hit[:4])]
+            tmp2 = all_ret.loc[all_ret['plate_id'].isin(hit[4:])]
         tmp1["class"] = 0
         tmp2["class"] = 1
         tmp = pd.concat([tmp1, tmp2], axis=0).reset_index(drop=True)
@@ -210,8 +222,6 @@
         preds = clf.fit_transform(X, y)
         lda_direction = clf.coef_[0]
         lda_line = np.array([-lda_direction * scale_low, lda_direction * scale_high])
-        res[hit] = preds
-
         fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
         colors = plt.cm.Set2(np.linspace(0, 1, 8))
         # PCA Projection plot
@@ -393,6 +403,7 @@
         "--dataset_name", type=str, required=True, help="Name of the dataset."
     )
     parser.add_argument("--raw_path", type=str, required=True, help="Path to raw data")
+    parser.add_argument("--baseline", type=str2bool, default=False, help="Perform LDA baseline only")
     args = parser.parse_args()
 
     # Validate that required paths are provided
@@ -405,5 +416,9 @@
 
     """
    Example run:
-    python src/br/analysis/run_drugdata_analysis.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb" --raw_path "./NPM1_single_cell_drug_perturbations/"
+    For all drugs:
+    python src/br/analysis/run_drugdata_LDA.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb" --raw_path "./NPM1_single_cell_drug_perturbations/"
+
+    For baseline (DMSO subset 1 -> DMSO subset 2):
+    python src/br/analysis/run_drugdata_LDA.py --save_path "./outputs_npm1_perturb/" --embeddings_path "./morphology_appropriate_representation_learning/model_embeddings/npm1_perturb/" --dataset_name "npm1_perturb" --raw_path "./NPM1_single_cell_drug_perturbations/" --baseline True
     """
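The new `--baseline` flag is parsed with `str2bool`, which is imported from `br.analysis.analysis_utils` but not shown in this diff. A converter of this kind conventionally looks something like the sketch below (an illustrative assumption, not the repository's actual implementation), so that `--baseline True` and `--baseline false` are both interpreted correctly on the command line:

```
import argparse

def str2bool(value):
    """Convert common string spellings of a boolean into a bool for argparse."""
    if isinstance(value, bool):
        return value
    if value.lower() in ("yes", "true", "t", "y", "1"):
        return True
    if value.lower() in ("no", "false", "f", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"Boolean value expected, got {value!r}")
```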