Stef/make negs filtered interactions by stephaniesamm · Pull Request #57 · BioGeMT/miRBench_paper

stephaniesamm · 2025-02-14T10:02:22Z

This PR should not be merged. It was opened only to facilitate reviewing the PR on the miRBind repo (BioGeMT/miRBind_2.0#5), where this miRBench_paper repo is a submodule. Changes were made to this repo (mainly post-process, make_negs, filter_interactions) to make negatives for miRNA:gene positive pairs that have been filtered for specific interaction types (canonical seed, noncanonical seed, and non-seed).

katarinagresova

Looked at filter_interactions part so far

katarinagresova · 2025-02-14T16:39:36Z

In the post-processing pipeline, we now have two similar modules: filter_interactions and filtering. I am just thinking about how to make the whole pipeline coherent and reusable.
Do you think it would be possible to have the add_seed_types part somewhere at the beginning, e.g. adding a new column to the output of HD?
filter_interactions could then be folded into filtering, where the user could specify columns and values to filter on.
But this is just an idea.
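
To make the "columns and values to filter on" idea concrete, here is a minimal sketch, not the pipeline's actual code. The function name `filter_rows` and the toy rows are hypothetical; the column names `Seed6mer` and `Seed6merBulgeOrMismatch` are taken from the make_negs diff in this PR, and the sketch assumes the seed columns were already added upstream.

```python
import pandas as pd


def filter_rows(df, criteria):
    """Keep rows where every listed column holds one of its allowed values.

    `criteria` maps a column name to a list of accepted values
    (hypothetical interface, for illustration only).
    """
    mask = pd.Series(True, index=df.index)
    for column, allowed in criteria.items():
        mask &= df[column].isin(allowed)
    return df[mask]


# Toy data standing in for pipeline output with seed columns already added.
df = pd.DataFrame({
    "gene": ["g1", "g2", "g3", "g4"],
    "Seed6mer": [1, 0, 0, 1],
    "Seed6merBulgeOrMismatch": [0, 1, 0, 1],
})

# The noncanonical seed condition from the diff, expressed as generic criteria.
noncanonical = filter_rows(df, {"Seed6mer": [0], "Seed6merBulgeOrMismatch": [1]})
```

With an interface like this, an interaction type becomes just a named set of criteria, so a single filtering module could serve both use cases.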

davidcechak · 2025-02-17T13:24:02Z

-    # Shuffle the negative pool and drop duplicates based on ClusterID
-    negative_pool = negative_pool.sample(frac=1, random_state=seed).drop_duplicates(subset=['gene_cluster_ID'], keep='first')
+    # Shuffle the negative gene pool and drop duplicates based on ClusterID
+    negative_gene_pool = negative_gene_pool.sample(frac=1, random_state=seed).drop_duplicates(subset=['gene_cluster_ID'], keep='first')


The negative_gene_pool is shuffled twice unnecessarily: once here and once in the loop below. This shuffle can be omitted.

The shuffling here, before drop_duplicates, is so that we don't always keep the same sequence for a given cluster when sampling from negative_gene_pool.
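
A small sketch of why the first shuffle matters, using toy data (the gene names and the helper `one_per_cluster` are made up for illustration). Without the shuffle, `drop_duplicates(keep='first')` always keeps the first row of each cluster in file order; with it, the retained representative depends on the seed.

```python
import pandas as pd

# Toy pool: three genes in cluster 1, two genes in cluster 2.
pool = pd.DataFrame({
    "gene": ["a1", "a2", "a3", "b1", "b2"],
    "gene_cluster_ID": [1, 1, 1, 2, 2],
})


def one_per_cluster(pool, seed):
    """Shuffle, then keep the first row seen for each cluster."""
    shuffled = pool.sample(frac=1, random_state=seed)
    return shuffled.drop_duplicates(subset=["gene_cluster_ID"], keep="first")


# Without shuffling, the representative is always the first row per cluster.
no_shuffle = pool.drop_duplicates(subset=["gene_cluster_ID"], keep="first")

# With shuffling, the representative varies from seed to seed.
picks = {tuple(sorted(one_per_cluster(pool, s)["gene"])) for s in range(20)}
```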

davidcechak · 2025-02-17T13:35:32Z

-    columns = ['gene', 'feature', 'test', 'chr', 'start', 'end', 'strand', 'gene_cluster_ID']
-    negatives_df  = negative_genes[columns].copy()
+        # Shuffle the negative gene pool with incrementing seed for each miRNA
+        negative_gene_pool = negative_gene_pool.sample(frac=1, random_state=seed + 1)


In each cycle, the same value, seed + 1, is used for 'random_state', because seed itself never changes.
If we want to increment the seed by 1 in each cycle, we need to assign the result back to the variable:
seed = seed + 1
negative_gene_pool = negative_gene_pool.sample(frac=1, random_state=seed)

You're right! Missed this, thanks!

The reason for this second shuffle is that, as I iterate over the miRNAs in unique_mirnas, negative_gene_pool should be in a different order each time. The loop over negative_gene_pool stops as soon as enough valid negatives have been found, so without reshuffling it would always be the same leading candidates that get validated and included as negatives, rather than a random selection.
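
For illustration, a minimal sketch of the difference between the two versions, on toy data. The miRNA names and pool contents are placeholders, and deriving the seed from the loop index via `enumerate` is just one way to get a distinct seed per iteration; reassigning `seed = seed + 1` inside the loop, as suggested above, gives the same sequence of seeds.

```python
import pandas as pd

pool = pd.DataFrame({"gene": [f"g{i}" for i in range(10)]})
seed = 42
mirnas = ["miR-A", "miR-B", "miR-C"]  # placeholder names

# Original behaviour: `seed + 1` is recomputed from an unchanged `seed`,
# so every miRNA sees the pool in the same order.
buggy_orders = [
    tuple(pool.sample(frac=1, random_state=seed + 1)["gene"])
    for _ in mirnas
]

# Fixed behaviour: each iteration gets its own seed (seed + 1, seed + 2, ...),
# which stays reproducible for a given starting seed.
fixed_orders = [
    tuple(pool.sample(frac=1, random_state=seed + offset)["gene"])
    for offset, _ in enumerate(mirnas, start=1)
]
```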

davidcechak · 2025-02-17T14:18:50Z

-    negatives_df['noncodingRNA'] = block['noncodingRNA'].values
-    negatives_df['noncodingRNA_name'] = block['noncodingRNA_name'].values
-    negatives_df['noncodingRNA_fam'] = block['noncodingRNA_fam'].values
+        # Iterate over each row of the negative gene pool


Unnecessary comment imho

davidcechak · 2025-02-17T14:19:14Z

+            negative_candidate['noncodingRNA_name'] = block[block['noncodingRNA'] == mirna]['noncodingRNA_name'].iloc[0] # Assumes that the name is the same for all occurrences of the miRNA
+            negative_candidate['noncodingRNA_fam'] = block[block['noncodingRNA'] == mirna]['noncodingRNA_fam'].iloc[0] # Assumes that the family is the same for all occurrences of the miRNA
+
+            # Compute seeds for the negative candidate


Unnecessary comment imho

davidcechak · 2025-02-17T14:19:34Z

+            elif interaction_type == 'noncanonicalseed':
+                negative_candidate = negative_candidate[(negative_candidate['Seed6mer'] == 0) & (negative_candidate['Seed6merBulgeOrMismatch'] == 1)]
+
+            # If negative candidate is empty


Unnecessary comment

davidcechak · 2025-02-17T14:19:43Z

+            # If negative candidate is empty
+            if negative_candidate.empty:
+                continue
+            # If negative candidate contains something


Unnecessary comment

katarinagresova · 2025-02-18T17:58:52Z

-    negatives_df['noncodingRNA_name'] = block['noncodingRNA_name'].values
-    negatives_df['noncodingRNA_fam'] = block['noncodingRNA_fam'].values
+        # Iterate over each row of the negative gene pool
+        for index, row in negative_gene_pool.iterrows():


I might be missing something, but why are we iterating over each row separately here? The following logic seems to be written for handling multiple samples, but only one row is being passed to it.

We iterate over each negative gene candidate and check whether it is valid (i.e. contains the interaction type in question), until enough valid candidates have been found or the end of negative_gene_pool (where all candidates for a particular miRNA are stored) is reached. Perhaps the confusion is because I keep treating the row as its own df?

The confusion is that I think we could process the whole block of negative gene candidates at once. Is there something that needs to be done to each row separately?
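
A rough sketch of what processing the whole block at once could look like, assuming the seed columns have already been computed for every candidate in the pool. The function name `select_valid_negatives` and the toy rows are hypothetical; only the noncanonical seed condition is shown because it is the only one visible in this diff.

```python
import pandas as pd


def select_valid_negatives(candidates, interaction_type, n_required):
    """Return up to `n_required` candidates matching the interaction type.

    Assumes seed columns are already present on `candidates`.
    """
    if interaction_type == "noncanonicalseed":
        # Condition copied from the diff in this PR.
        mask = (candidates["Seed6mer"] == 0) & (
            candidates["Seed6merBulgeOrMismatch"] == 1
        )
    else:
        # Other interaction types are not shown in this thread.
        raise ValueError(f"unsupported interaction type: {interaction_type}")
    # One boolean mask over the whole pool replaces the per-row loop;
    # head() plays the role of the early `break`.
    return candidates[mask].head(n_required)


candidates = pd.DataFrame({
    "gene": ["g1", "g2", "g3", "g4", "g5"],
    "Seed6mer": [1, 0, 0, 0, 0],
    "Seed6merBulgeOrMismatch": [1, 1, 0, 1, 1],
})
valid = select_valid_negatives(candidates, "noncanonicalseed", n_required=2)
```

One trade-off to keep in mind: this computes seeds for the entire pool up front, whereas the row-by-row loop can stop early, so which version is faster depends on how expensive the seed computation is relative to the loop overhead.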

katarinagresova · 2025-02-18T18:05:40Z

+  - r-base
+  - r-biostrings
+  - r-decipher
+  - pyBigWig


Is this new for this version of the pipeline? Or should it be propagated to the main pipeline as well?

It is not new, just clarified in a common place for the post-process pipeline, because the same info exists in the clustering/, conservation/ dirs etc. So I guess yes, it would be nice to propagate it to the main pipeline too. I'll branch from main, make the same changes and open a new PR to merge?

Yes, sounds good.

katarinagresova · 2025-02-27T14:54:57Z

+                    valid_negatives.to_csv(output_file, sep='\t', index=False, header=False, mode='a')                    
+                    # Exit the loop to move on to the next unique miRNA in the block
+                    break
+                # If there are not enough valid negatives    


When there are not enough negatives, do we still save them to the output? And do we not care about miRNA frequencies?

katarinagresova · 2025-02-27T14:55:41Z

+                    # Get block rows for which column noncodingRNA == mirna and save to file (positives)
+                    block_mirna = block[block['noncodingRNA'] == mirna].copy()
+                    block_mirna.to_csv(output_file, sep='\t', index=False, header=False, mode='a')
+
+                    # Slice the valid negatives df to the required frequency, process valid negatives, and save to file (negatives)
+                    valid_negatives = valid_negatives.iloc[:mirna_frequency].copy()
+                    valid_negatives = process_valid_negatives(valid_negatives, block.columns)
+                    valid_negatives.to_csv(output_file, sep='\t', index=False, header=False, mode='a')                    


Extracting process_valid_negatives is nice. However, this whole block of code is still repeated.
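
One possible way to remove the repetition, sketched under assumptions: the helper names `append_tsv` and `write_mirna_examples` are hypothetical, and since the body of process_valid_negatives is not shown in this thread, it is passed in as a callable and replaced by a simple stand-in in the demo.

```python
import os
import tempfile

import pandas as pd


def append_tsv(df, output_file):
    """Append rows to a tab-separated file without index or header."""
    df.to_csv(output_file, sep="\t", index=False, header=False, mode="a")


def write_mirna_examples(block, mirna, valid_negatives, n_negatives,
                         output_file, process):
    """Write the positives for `mirna`, then up to `n_negatives` negatives."""
    positives = block[block["noncodingRNA"] == mirna]
    append_tsv(positives, output_file)
    negatives = process(valid_negatives.iloc[:n_negatives].copy(), block.columns)
    append_tsv(negatives, output_file)


# Demo on toy data; the lambda is a stand-in for process_valid_negatives.
block = pd.DataFrame({
    "gene": ["g1", "g2", "g3"],
    "noncodingRNA": ["AAA", "AAA", "CCC"],
    "label": [1, 1, 1],
})
negs = pd.DataFrame({
    "gene": ["n1", "n2", "n3"],
    "noncodingRNA": ["AAA", "AAA", "AAA"],
    "label": [0, 0, 0],
})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.tsv")
    write_mirna_examples(block, "AAA", negs, 2, path,
                         lambda df, cols: df.reindex(columns=cols))
    with open(path) as fh:
        lines = fh.read().splitlines()
```

Both branches (enough negatives, not enough negatives) could then call the same helper and differ only in the value passed for `n_negatives`.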

stephaniesamm added 7 commits February 7, 2025 11:46

Added seed types and filtering interactions into postprocess pipeline

6a9c1eb

Removed print statements from clustering scripts

af58270

minor changes and dropping seed cols after filtering

06ac7d5

New make_negs script for specific interaction types

b67e075

Functional make_neg_sets producing 1:1 pos to neg examples. Updated R…

36b1166

…EADME

testing submodule with branch on mirbench

d064cdb

test success

71e0773

stephaniesamm requested review from davidcechak and katarinagresova February 14, 2025 10:02

updated workflow diagram

00e463a

katarinagresova reviewed Feb 14, 2025

stephaniesamm added 3 commits February 16, 2025 18:27

fixed off-by-one error. file was being sorted by mirna name instead o…

2330f0b

…f family, prior to generating negs

parallelised make_negs script with python multiprocessing, unsure abo…

48bad05

…ut each block writing to the same ofile

Revert "parallelised make_negs script with python multiprocessing, un…

d606fd5

…sure about each block writing to the same ofile" This reverts commit 48bad05.

davidcechak reviewed Feb 17, 2025

Comment thread code/filter_interactions/add_seed_types.py Outdated

Comment thread code/filter_interactions/filter_interactions.py Outdated

davidcechak reviewed Feb 17, 2025

katarinagresova reviewed Feb 18, 2025

stephaniesamm added 6 commits February 19, 2025 10:36

fixed incrementing seed

db7b256

added interaction type as argument

ab7907b

Renamed get seeds function to add seeds for clarity

444946c

Removed unnecessary imports

c6df493

Added make_reproducibility_seed function

1238605

Extracted common code for processing valid negs into a function

35e07c6

katarinagresova reviewed Feb 27, 2025


3 participants