
Stef/make negs filtered interactions#57

Open
stephaniesamm wants to merge 17 commits into main from stef/make_negs_filtered_interactions

Conversation

@stephaniesamm
Contributor

This PR should not be merged. It was opened only to facilitate reviewing the PR on miRBind repo (BioGeMT/miRBind_2.0#5), where this miRBench_paper repo is a submodule. Changes were made to this repo (mainly post-process, make_negs, filter_interactions) to make negatives for miRNA:gene positive pairs that have been filtered for specific interaction types (canonical seed, noncanonical seed, and non seed).

Member

@katarinagresova katarinagresova left a comment


I've looked at the filter_interactions part so far.

Comment thread code/clustering/gene_fasta.py
Member


In the post-processing pipeline we now have two similar modules, filter_interactions and filtering. I am just thinking about how to make the whole pipeline coherent and reusable.
Do you think it would be possible to have the add_seed_types part somewhere at the beginning, e.g. as a step that adds a new column to the output of HD?
And filter_interactions could then be folded into filtering, where the user could specify the columns and values to filter on.
But this is just an idea.
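As a rough sketch of that idea (filter_rows, criteria, and the toy column values are hypothetical names for illustration, not part of this repo):

```python
import pandas as pd

def filter_rows(df: pd.DataFrame, criteria: dict) -> pd.DataFrame:
    """Keep only rows whose value in each given column is among the allowed values."""
    mask = pd.Series(True, index=df.index)
    for column, allowed in criteria.items():
        mask &= df[column].isin(allowed)
    return df[mask]

interactions = pd.DataFrame({
    'seed_type': ['canonicalseed', 'noncanonicalseed', 'nonseed'],
    'feature': ['3UTR', '3UTR', 'CDS'],
})
filtered = filter_rows(interactions,
                       {'seed_type': ['canonicalseed', 'noncanonicalseed'],
                        'feature': ['3UTR']})
```

The user would then only need to pass a column-to-values mapping instead of the pipeline hard-coding each interaction type.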

Comment thread code/filter_interactions/add_seed_types.py Outdated
…sure about each block writing to the same ofile"

This reverts commit 48bad05.
Comment thread code/filter_interactions/add_seed_types.py Outdated
Comment thread code/filter_interactions/filter_interactions.py Outdated
# Shuffle the negative pool and drop duplicates based on ClusterID
negative_pool = negative_pool.sample(frac=1, random_state=seed).drop_duplicates(subset=['gene_cluster_ID'], keep='first')
# Shuffle the negative gene pool and drop duplicates based on ClusterID
negative_gene_pool = negative_gene_pool.sample(frac=1, random_state=seed).drop_duplicates(subset=['gene_cluster_ID'], keep='first')

The negative_gene_pool is shuffled twice unnecessarily: once here, and once in the loop below. This shuffle can be omitted.

Contributor Author


The shuffling here, before drop_duplicates, is so that we don't always keep the same sequence for the same cluster to sample from in negative_gene_pool.
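A minimal illustration of why the shuffle matters before drop_duplicates (toy data, not from the pipeline):

```python
import pandas as pd

pool = pd.DataFrame({
    'gene_cluster_ID': ['c1', 'c1', 'c1', 'c2'],
    'gene': ['g1a', 'g1b', 'g1c', 'g2'],
})

# drop_duplicates(keep='first') alone would always keep g1a for cluster c1.
# Shuffling first makes the kept representative depend on the seed, so a
# different sequence per cluster can be retained on different runs.
seed = 7
picked = (pool.sample(frac=1, random_state=seed)
              .drop_duplicates(subset=['gene_cluster_ID'], keep='first'))
```

Either way, exactly one row per cluster survives; only which row it is changes with the seed.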

Comment thread code/make_neg_sets/make_neg_sets.py Outdated
columns = ['gene', 'feature', 'test', 'chr', 'start', 'end', 'strand', 'gene_cluster_ID']
negatives_df = negative_genes[columns].copy()
# Shuffle the negative gene pool with incrementing seed for each miRNA
negative_gene_pool = negative_gene_pool.sample(frac=1, random_state=seed + 1)

In each cycle, the same value is used for random_state: the value seed + 1.
If we want to incrementally add 1 to the seed in each cycle, we need to assign it back to the variable:
seed = seed + 1
negative_gene_pool = negative_gene_pool.sample(frac=1, random_state=seed)
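A toy demonstration of the difference (the pool and the loops are illustrative only):

```python
import pandas as pd

pool = pd.DataFrame({'gene': list('abcdef')})
seed = 42

# Buggy pattern: random_state=seed + 1 evaluates to the same value (43)
# on every iteration, so every shuffle yields the identical order.
orders_buggy = [tuple(pool.sample(frac=1, random_state=seed + 1)['gene'])
                for _ in range(3)]

# Fixed pattern: reassign the seed so each iteration shuffles differently.
orders_fixed = []
for _ in range(3):
    seed = seed + 1
    orders_fixed.append(tuple(pool.sample(frac=1, random_state=seed)['gene']))
```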

Contributor Author


You're right! Missed this, thanks!

The reason for this second shuffling is that, as I iterate over the miRNAs in unique_mirnas, the negative_gene_pool is in a different order each time. Since the loop stops once enough valid negatives have been found, this ensures it is not always the same candidates occurring first that get validated and included as negatives; the selection is random instead.

negatives_df['noncodingRNA'] = block['noncodingRNA'].values
negatives_df['noncodingRNA_name'] = block['noncodingRNA_name'].values
negatives_df['noncodingRNA_fam'] = block['noncodingRNA_fam'].values
# Iterate over each row of the negative gene pool

Unnecessary comment imho

negative_candidate['noncodingRNA_name'] = block[block['noncodingRNA'] == mirna]['noncodingRNA_name'].iloc[0] # Assumes that the name is the same for all occurrences of the miRNA
negative_candidate['noncodingRNA_fam'] = block[block['noncodingRNA'] == mirna]['noncodingRNA_fam'].iloc[0] # Assumes that the family is the same for all occurrences of the miRNA

# Compute seeds for the negative candidate

Unnecessary comment imho

elif interaction_type == 'noncanonicalseed':
negative_candidate = negative_candidate[(negative_candidate['Seed6mer'] == 0) & (negative_candidate['Seed6merBulgeOrMismatch'] == 1)]

# If negative candidate is empty

Unnecessary comment

# If negative candidate is empty
if negative_candidate.empty:
continue
# If negative candidate contains something

Unnecessary comment

Comment thread code/post_process/postprocess_2_make_negatives.sh
Comment thread code/make_neg_sets/make_neg_sets.py Outdated
negatives_df['noncodingRNA_name'] = block['noncodingRNA_name'].values
negatives_df['noncodingRNA_fam'] = block['noncodingRNA_fam'].values
# Iterate over each row of the negative gene pool
for index, row in negative_gene_pool.iterrows():
Member


I might be missing something, but why are we iterating over each row separately here? The following logic seems to be for handling multiple samples, but only one row is being passed to it.

Contributor Author


We iterate over each negative gene candidate and check whether it is valid (i.e. contains the interaction type in question), until enough valid candidates have been found or the end of the negative_gene_pool (where all candidates for a particular miRNA are stored) is reached. Perhaps the confusion is because I keep treating the row as its own df?

Member


The confusion is that I think we could process the whole block of negative gene candidates at once. Is there something that needs to be done to each row separately?
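A sketch of what processing the whole block at once could look like, assuming the seed columns are already computed for every candidate in the pool (pick_valid_negatives and n_needed are hypothetical names):

```python
import pandas as pd

def pick_valid_negatives(pool: pd.DataFrame, interaction_type: str,
                         n_needed: int) -> pd.DataFrame:
    """Filter the whole candidate pool with one vectorized mask
    instead of checking each row in a Python loop."""
    if interaction_type == 'noncanonicalseed':
        valid = pool[(pool['Seed6mer'] == 0)
                     & (pool['Seed6merBulgeOrMismatch'] == 1)]
    else:
        valid = pool
    # Keep only as many candidates as required for this miRNA.
    return valid.head(n_needed)

pool = pd.DataFrame({
    'gene': ['g1', 'g2', 'g3', 'g4'],
    'Seed6mer': [0, 1, 0, 0],
    'Seed6merBulgeOrMismatch': [1, 1, 0, 1],
})
negs = pick_valid_negatives(pool, 'noncanonicalseed', 2)
```

If the seed columns can only be computed per candidate (e.g. each requires the miRNA sequence), they could still be computed column-wise for the whole pool first, which would make the row loop unnecessary.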

Comment thread code/make_neg_sets/make_neg_sets.py
Comment on lines +27 to +30
- r-base
- r-biostrings
- r-decipher
- pyBigWig
Member


Is this new for this version of the pipeline? Or should it be propagated to the main pipeline as well?

Contributor Author


It is not new, just clarified in a common space for the post-process pipeline, because the same info exists in the clustering/, conservation/ dirs etc. So I guess yes, it would be nice to propagate it to the main pipeline too. Shall I branch from main, make the same changes, and open a new PR to merge?

Member


Yes, sounds good.

valid_negatives.to_csv(output_file, sep='\t', index=False, header=False, mode='a')
# Exit the loop to move on to the next unique miRNA in the block
break
# If there are not enough valid negatives
Member


When there are not enough negatives, do we still save them to the output? And do we not care about the miRNA frequencies?

Comment on lines +139 to +146
# Get block rows for which column noncodingRNA == mirna and save to file (positives)
block_mirna = block[block['noncodingRNA'] == mirna].copy()
block_mirna.to_csv(output_file, sep='\t', index=False, header=False, mode='a')

# Slice the valid negatives df to the required frequency, process valid negatives, and save to file (negatives)
valid_negatives = valid_negatives.iloc[:mirna_frequency].copy()
valid_negatives = process_valid_negatives(valid_negatives, block.columns)
valid_negatives.to_csv(output_file, sep='\t', index=False, header=False, mode='a')
Member


Extracting process_valid_negatives is nice. However, this whole block of code is repeated.
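One possible refactor, as a sketch (save_mirna_records is a hypothetical name; it assumes the same frames and the process_valid_negatives helper shown in the snippet above):

```python
import pandas as pd

def save_mirna_records(block, mirna, valid_negatives, mirna_frequency,
                       output_file, process_valid_negatives):
    """Append the positives for one miRNA and its processed negatives
    to output_file (tab-separated, no header)."""
    # Positives: block rows for which noncodingRNA == mirna.
    block_mirna = block[block['noncodingRNA'] == mirna].copy()
    block_mirna.to_csv(output_file, sep='\t', index=False, header=False, mode='a')
    # Negatives: slice to the required frequency, process, and append.
    negs = valid_negatives.iloc[:mirna_frequency].copy()
    negs = process_valid_negatives(negs, block.columns)
    negs.to_csv(output_file, sep='\t', index=False, header=False, mode='a')
```

Both the "enough negatives" and "not enough negatives" branches could then call this one helper instead of repeating the save logic.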


3 participants