Stef/make negs filtered interactions#57
katarinagresova left a comment
Looked at filter_interactions part so far
In the post-processing pipeline, we now have two similar modules - filter_interactions and filtering. I am just thinking about how to make the whole pipeline coherent and reusable.
Do you think it would be possible to have the add_seed_types part somewhere at the beginning, like adding a new column to the output of HD?
And filter_interactions would be added to filtering - the user could specify columns and values to filter on.
But this is just an idea.
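One way to read the suggestion above is a single filtering step driven by a user-supplied mapping from column names to allowed values. The sketch below is a plain-Python stand-in that treats each interaction as a dict; `filter_rows` and its `criteria` argument are hypothetical names, not part of the existing filtering module, and the seed columns are borrowed from the diff in this PR on the assumption that they have already been added upstream.

```python
def filter_rows(rows, criteria):
    """Keep rows whose value in every listed column is one of the allowed values.

    rows: list of dicts, one dict per interaction.
    criteria: dict mapping a column name to a collection of allowed values.
    """
    return [
        row for row in rows
        if all(row.get(column) in allowed for column, allowed in criteria.items())
    ]


# Hypothetical rows; the seed-type columns are assumed to exist already.
interactions = [
    {"gene": "A", "Seed6mer": 1, "Seed6merBulgeOrMismatch": 1},
    {"gene": "B", "Seed6mer": 0, "Seed6merBulgeOrMismatch": 1},
    {"gene": "C", "Seed6mer": 0, "Seed6merBulgeOrMismatch": 0},
]

# Same condition the diff applies for interaction_type == 'noncanonicalseed'.
noncanonical = filter_rows(
    interactions, {"Seed6mer": {0}, "Seed6merBulgeOrMismatch": {1}}
)
```

With this shape, each interaction type becomes a different `criteria` mapping rather than a separate branch in the code.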
…f family, prior to generating negs
…ut each block writing to the same ofile
…sure about each block writing to the same ofile" This reverts commit 48bad05.
| # Shuffle the negative pool and drop duplicates based on ClusterID | ||
| negative_pool = negative_pool.sample(frac=1, random_state=seed).drop_duplicates(subset=['gene_cluster_ID'], keep='first') | ||
| # Shuffle the negative gene pool and drop duplicates based on ClusterID | ||
| negative_gene_pool = negative_gene_pool.sample(frac=1, random_state=seed).drop_duplicates(subset=['gene_cluster_ID'], keep='first') |
The negative_gene_pool is shuffled twice unnecessarily - once here, once in the loop below. This shuffle can be omitted.
The shuffling here, before dropping duplicates, is so that we don't always keep the same sequence for a given cluster to sample from in negative_gene_pool.
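The effect described above can be sketched without pandas: shuffling first means that "keep the first row per cluster" picks a seed-dependent representative instead of always the row that happens to come first in the input. This is a minimal stand-in for the `sample(...).drop_duplicates(...)` line in the diff; `one_per_cluster` and the toy rows are hypothetical.

```python
import random


def one_per_cluster(rows, seed, key="gene_cluster_ID"):
    """Shuffle rows with a seeded RNG, then keep the first row seen per cluster."""
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    seen, kept = set(), []
    for row in shuffled:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept


# Toy pool: g1 and g2 share a cluster, g3 is alone in its cluster.
pool = [
    {"gene": "g1", "gene_cluster_ID": 1},
    {"gene": "g2", "gene_cluster_ID": 1},
    {"gene": "g3", "gene_cluster_ID": 2},
]
representatives = one_per_cluster(pool, seed=42)
```

The result always has one row per cluster and is reproducible for a given seed, while different seeds can select either g1 or g2 for cluster 1.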
| columns = ['gene', 'feature', 'test', 'chr', 'start', 'end', 'strand', 'gene_cluster_ID'] | ||
| negatives_df = negative_genes[columns].copy() | ||
| # Shuffle the negative gene pool with incrementing seed for each miRNA | ||
| negative_gene_pool = negative_gene_pool.sample(frac=1, random_state=seed + 1) |
In each cycle, the same value is used for 'random_state', the value seed + 1.
If we want to incrementally add 1 to the seed in each cycle, we need to assign it to the variable.
seed = seed + 1
negative_gene_pool = negative_gene_pool.sample(frac=1, random_state=seed)
You're right! Missed this, thanks!
The reason for this second shuffle is that, as I iterate over the miRNAs in unique_mirnas, the negative_gene_pool should be in a different order each time. Since the loop over negative_gene_pool stops once enough valid negatives have been found, this way it is not always the same candidates at the top of the pool that get validated and included as negatives; the selection is random.
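The difference between the two versions in this thread can be sketched with the standard `random` module as a stand-in for `DataFrame.sample(frac=1, random_state=...)`. To isolate the effect of the random state, the sketch shuffles the same starting pool each time, whereas the PR code reassigns the pool inside the loop; `shuffled`, the pool and the miRNA labels are hypothetical.

```python
import random


def shuffled(pool, seed):
    """Return a seeded shuffle of pool as a new list."""
    return random.Random(seed).sample(pool, len(pool))


pool = list(range(20))
mirnas = ["mirna_a", "mirna_b", "mirna_c"]  # hypothetical labels

# Original version: seed is never reassigned, so seed + 1 is the same
# random state in every iteration and yields the same permutation.
seed = 42
same_state = [shuffled(pool, seed + 1) for _ in mirnas]

# Suggested fix: reassign seed so each miRNA gets its own random state.
seed = 42
distinct = []
for _ in mirnas:
    seed = seed + 1
    distinct.append(shuffled(pool, seed))
```

Here the three entries of `same_state` are identical, while `distinct` holds three differently ordered copies of the pool and `seed` ends at 45.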
| negatives_df['noncodingRNA'] = block['noncodingRNA'].values | ||
| negatives_df['noncodingRNA_name'] = block['noncodingRNA_name'].values | ||
| negatives_df['noncodingRNA_fam'] = block['noncodingRNA_fam'].values | ||
| # Iterate over each row of the negative gene pool |
| negative_candidate['noncodingRNA_name'] = block[block['noncodingRNA'] == mirna]['noncodingRNA_name'].iloc[0] # Assumes that the name is the same for all occurrences of the miRNA | ||
| negative_candidate['noncodingRNA_fam'] = block[block['noncodingRNA'] == mirna]['noncodingRNA_fam'].iloc[0] # Assumes that the family is the same for all occurrences of the miRNA | ||
| # Compute seeds for the negative candidate |
| elif interaction_type == 'noncanonicalseed': | ||
| negative_candidate = negative_candidate[(negative_candidate['Seed6mer'] == 0) & (negative_candidate['Seed6merBulgeOrMismatch'] == 1)] | ||
| # If negative candidate is empty | ||
| if negative_candidate.empty: | ||
| continue | ||
| # If negative candidate contains something |
| negatives_df['noncodingRNA_name'] = block['noncodingRNA_name'].values | ||
| negatives_df['noncodingRNA_fam'] = block['noncodingRNA_fam'].values | ||
| # Iterate over each row of the negative gene pool | ||
| for index, row in negative_gene_pool.iterrows(): |
I might be missing something, but why are we iterating over each row separately here? The following logic seems to be written for handling multiple samples, but only one row is being passed to it.
We iterate over each negative gene candidate and check if it is valid (i.e. contains the interaction type in question), until enough valid candidates have been found or the end of the negative_gene_pool (where all candidates for a particular miRNA are stored) is reached. Perhaps the confusion is because I keep treating the row as its own df?
The confusion is that I think we could process the whole block of negative gene candidates at once. Is there something that needs to be done to each row separately?
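The two approaches in this thread can be compared on a toy pool. Filtering the whole pool in one pass and then taking the first n rows returns the same candidates as a row-by-row loop that stops early, provided that the validity of a row does not depend on any other row (an assumption here). `is_valid` is a hypothetical stand-in for the seed-type check, and the function names are illustrative only.

```python
def is_valid(row):
    """Hypothetical check mirroring the 'noncanonicalseed' condition in the diff."""
    return row["Seed6mer"] == 0 and row["Seed6merBulgeOrMismatch"] == 1


def first_valid_rowwise(pool, n):
    """Row by row: test one candidate at a time and stop once n are found."""
    found = []
    for row in pool:
        if len(found) >= n:
            break
        if is_valid(row):
            found.append(row)
    return found


def first_valid_batch(pool, n):
    """Whole pool at once: filter every candidate, then slice to the first n."""
    return [row for row in pool if is_valid(row)][:n]


pool = [
    {"gene": "g1", "Seed6mer": 1, "Seed6merBulgeOrMismatch": 1},
    {"gene": "g2", "Seed6mer": 0, "Seed6merBulgeOrMismatch": 1},
    {"gene": "g3", "Seed6mer": 0, "Seed6merBulgeOrMismatch": 0},
    {"gene": "g4", "Seed6mer": 0, "Seed6merBulgeOrMismatch": 1},
    {"gene": "g5", "Seed6mer": 0, "Seed6merBulgeOrMismatch": 1},
]
top_two = first_valid_batch(pool, 2)
```

The trade-off is that the batch version evaluates every candidate, including those after the n-th valid one, so the early-stopping loop may still be preferable if computing seeds per candidate is expensive.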
| - r-base | ||
| - r-biostrings | ||
| - r-decipher | ||
| - pyBigWig |
Is this new for this version of the pipeline? Or should it be propagated to the main pipeline as well?
It is not new, just clarified in a common place for the post-process pipeline, because the same info already exists in the clustering/, conservation/ dirs etc. So I guess yes, it would be nice to propagate it to the main pipeline too. I'll branch from main, make the same changes and open a new PR to merge?
| valid_negatives.to_csv(output_file, sep='\t', index=False, header=False, mode='a') | ||
| # Exit the loop to move on to the next unique miRNA in the block | ||
| break | ||
| # If there are not enough valid negatives |
When there are not enough negatives, we still save them to the output? And we do not care about miRNA frequencies?
| # Get block rows for which column noncodingRNA == mirna and save to file (positives) | ||
| block_mirna = block[block['noncodingRNA'] == mirna].copy() | ||
| block_mirna.to_csv(output_file, sep='\t', index=False, header=False, mode='a') | ||
| # Slice the valid negatives df to the required frequency, process valid negatives, and save to file (negatives) | ||
| valid_negatives = valid_negatives.iloc[:mirna_frequency].copy() | ||
| valid_negatives = process_valid_negatives(valid_negatives, block.columns) | ||
| valid_negatives.to_csv(output_file, sep='\t', index=False, header=False, mode='a') |
Extracting process_valid_negatives is nice. However, this whole block of code is repeated.
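One way to remove the repetition noted above is to move the "write positives, then slice, process and write negatives" steps into a single helper that both branches call. The sketch below uses the standard `csv` module and rows as lists instead of DataFrames; `write_positives_and_negatives` is a hypothetical name, and its `process` argument stands in for process_valid_negatives.

```python
import csv
import io


def write_positives_and_negatives(out, positives, valid_negatives, frequency, process):
    """Append positives, then at most `frequency` processed negatives, as TSV rows.

    Returns the number of negatives written so the caller can detect a shortfall.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(positives)
    negatives = process(valid_negatives[:frequency])
    writer.writerows(negatives)
    return len(negatives)


def add_label(rows):
    """Stand-in for process_valid_negatives: append a 0 label to each row."""
    return [row + [0] for row in rows]


# Hypothetical data: two positives for one miRNA and three negative candidates.
positives = [["mirna_a", "gene1", 1], ["mirna_a", "gene2", 1]]
candidates = [["mirna_a", "gene7"], ["mirna_a", "gene8"], ["mirna_a", "gene9"]]

buffer = io.StringIO()
written = write_positives_and_negatives(
    buffer, positives, candidates, len(positives), add_label
)
```

Returning the count of negatives written also gives the caller a place to decide what to do when fewer negatives than positives were found.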
This PR should not be merged. It was opened only to facilitate reviewing the PR on miRBind repo (BioGeMT/miRBind_2.0#5), where this miRBench_paper repo is a submodule. Changes were made to this repo (mainly post-process, make_negs, filter_interactions) to make negatives for miRNA:gene positive pairs that have been filtered for specific interaction types (canonical seed, noncanonical seed, and non seed).