`precision_recall()` counts within group connections twice but not between group connections

When reading the source code for `cytominer_eval.operations.precision_recall()`, I noticed that the `similarity_melted_df` variable contains each replicate pair twice, e.g. A1 --> A2 and A2 --> A1.
This becomes a problem because only the first of the two sets of grouping columns, `replicate_group_cols` as built in lines 49-52, is subsequently used for grouping:

```
49 replicate_group_cols = [
50    "{x}{suf}".format(x=x, suf=pair_ids[list(pair_ids)[0]]["suffix"])  # [0] keeps only the first of two grouping columns
51    for x in replicate_groups
52 ]
```

In the next step, each group is passed to `calculate_precision_recall()`:

```
59   precision_recall_df_at_k = similarity_melted_df.groupby(
60        replicate_group_cols
61    ).apply(lambda x: calculate_precision_recall(x, k=k_))
62    precision_recall_df = precision_recall_df.append(precision_recall_df_at_k)
```

The effect is that all connections within a group are counted twice. Connections to samples outside the group, however, are counted only once, because `groupby` on the first set of columns keeps just one direction of each between-group pair.

Let me clarify this with an example. Consider 5 samples, the first 3 from group 'A' and the remaining 2 from group 'B', with greater within-group than between-group correlations in both groups:

![image](https://user-images.githubusercontent.com/9386875/136257908-8b8de522-ca2b-489d-88af-db5adcca5ccf.png)

Then what `calculate_precision_recall` will see is this:
![image](https://user-images.githubusercontent.com/9386875/136258074-b4aa609b-a4f4-4a92-971e-f9bdb0b80fd8.png)

For example, the `sample_pair_a` column has a row for `A1-->A2` and one for `A2-->A1`, but only one for `A1-->B1`. `B1-->A1` is missing from group A because of the way the melted data frame is generated and the grouping is performed. As a result, the similarity values for within-group connections appear in duplicate.

Accordingly the outcome for precision and recall at k=4 is the following:
![image](https://user-images.githubusercontent.com/9386875/136258807-854faf9b-1296-4bde-902f-b6204cb77345.png)
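Precision: for group A, all 4 of the closest connections are within-group (4/4), but for group B only 2 of the 4 are (2/4).
Recall: for group A, 4 of the 6 double-counted within-group connections are found (4/6), whereas for group B both of the 2 are found (2/2).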
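
To make the counting concrete, here is a minimal pandas sketch of the toy example above. It does not call `cytominer_eval`. The column names (`pair_a`, `group_a`, ...), the similarity values 0.9 and 0.1, and the helper `precision_recall_at_k` are my own assumptions for illustration. The helper is only a simplified stand-in for `calculate_precision_recall()`.

```python
import itertools

import pandas as pd

# Five samples: three in group A, two in group B.
groups = {"A1": "A", "A2": "A", "A3": "A", "B1": "B", "B2": "B"}

# Melted frame with BOTH directions of every pair (A1 -> A2 and A2 -> A1).
melted = pd.DataFrame(
    [
        {
            "pair_a": a,
            "pair_b": b,
            "group_a": groups[a],
            "group_b": groups[b],
            # Assumed toy similarities: 0.9 within a group, 0.1 between groups.
            "similarity": 0.9 if groups[a] == groups[b] else 0.1,
        }
        for a, b in itertools.permutations(groups, 2)
    ]
)
melted["is_replicate"] = melted["group_a"] == melted["group_b"]


def precision_recall_at_k(group_df, k):
    """Simplified stand-in: precision and recall among the top-k most similar rows."""
    top_k = group_df.sort_values("similarity", ascending=False).head(k)
    hits = int(top_k["is_replicate"].sum())
    total_replicate_rows = int(group_df["is_replicate"].sum())
    return hits / k, hits / total_replicate_rows


# Grouping on the first-side column only, as in the issue.
result = {
    name: precision_recall_at_k(group_df, k=4)
    for name, group_df in melted.groupby("group_a")
}
rows_per_group = melted.groupby("group_a")["is_replicate"].agg(["sum", "size"])
print(result)
print(rows_per_group)
```

Under these assumptions, group A ends up with 12 rows (6 within-group, 6 between-group) and group B with 8 rows (2 within-group, 6 between-group), which gives precision 4/4 and recall 4/6 for A, and precision 2/4 and recall 2/2 for B.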

In summary, the computed precision and recall values are not entirely correct, and the distortion is largest for small groups. Note also that with odd values of k, only one of the two rows of a symmetric pair can end up in the top k.

Admittedly, this is a bit mind-boggling. I recommend using a debugger if you want to trace all the steps in detail by yourself. 

Proposed solution: I suggest counting each pair only once when creating the melted data frame.
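
A rough sketch of one way to read this proposal, again on toy data rather than on the actual `cytominer_eval` data frames. All names here (`pairs_once`, `pairs_for_group`, the column names, the 0.9/0.1 similarities) are hypothetical. One assumption I am adding: once each pair is stored only once, a group has to be selected by looking at both sides of the pair, otherwise between-group pairs would be lost for one of the two groups.

```python
import itertools

import pandas as pd

groups = {"A1": "A", "A2": "A", "A3": "A", "B1": "B", "B2": "B"}

# combinations() yields each unordered pair exactly once (A1-A2 but not A2-A1).
pairs_once = pd.DataFrame(
    [
        {
            "pair_a": a,
            "pair_b": b,
            "group_a": groups[a],
            "group_b": groups[b],
            # Assumed toy similarities: 0.9 within a group, 0.1 between groups.
            "similarity": 0.9 if groups[a] == groups[b] else 0.1,
        }
        for a, b in itertools.combinations(groups, 2)
    ]
)


def pairs_for_group(df, group):
    """Select every pair that involves `group` on either side of the pair."""
    touches = (df["group_a"] == group) | (df["group_b"] == group)
    subset = df[touches].copy()
    subset["is_replicate"] = subset["group_a"] == subset["group_b"]
    return subset


# (within-group pairs, between-group pairs) seen by each group.
counts = {}
for name in ("A", "B"):
    subset = pairs_for_group(pairs_once, name)
    within = int(subset["is_replicate"].sum())
    counts[name] = (within, len(subset) - within)
print(counts)
```

With this selection, group A sees 3 within-group and 6 between-group pairs, and group B sees 1 within-group and 6 between-group pairs, so within-group and between-group connections are weighted the same way.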
