Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/gcf tree #75

Merged
merged 16 commits into from
Nov 6, 2023
Merged

Feature/gcf tree #75

merged 16 commits into from
Nov 6, 2023

Conversation

nlouwen
Copy link
Collaborator

@nlouwen nlouwen commented Nov 3, 2023

Contains GCF alignment and tree building, as well as highlights of compared region (i.e. protocluster/core).

  • small adjustment to load_dataset_folder to accommodate mibig and reference loading.

@nlouwen nlouwen requested a review from adraismawur November 3, 2023 09:26
Copy link
Collaborator

@adraismawur adraismawur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good so far, just some minor things.

Potential refactor idea in the future is to split figuring out the comparable region from calculating the scores, so that there is not a large amount of data being returned in the yield. the workflow is a bit complicated to do this but still.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, but to optimize performance can you change this to send the numbers directly instead of the entire pair object? See inline comments.

And to keep the order consistent everywhere, let's do:

(
  pair_a_id,
  pair_b_id,
  dist,
  jaccard,
  adjacency,
  dss,
  [lcs stuff], 
  [start and stops after extend],
  [reverse],
  [alignment mode]
)

distance, jaccard, adjacency, dss = results[idx]
# for idx, pair in enumerate(task_batch):
# distance, jaccard, adjacency, dss = results[idx]
for pair, distance, jaccard, adjacency, dss in results:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In another branch I discovered this can be simplified:

for a, b, c, d in some_tuple:
    yield (a, b, c, d)

is equivalent to

yield from some_tuple

@@ -251,13 +269,13 @@ def calculate_scores_pair(

for pair in pairs:
if pair.region_a.parent_gbk == pair.region_b.parent_gbk:
results.append((0.0, 1.0, 1.0, 1.0))
results.append((pair, 0.0, 1.0, 1.0, 1.0))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

results.append(0,0, 1.0, 1.0, 1.0, [lcs stuff], [comparable distance starts and stops], [reverse], [alignment mode])

@@ -96,7 +96,7 @@ def get_edge(
if edge is None:
return None

return cast(tuple[int, int, float, float, float, float, str], edge)
return cast(tuple[int, int, float, float, float, float, str], edge[0:7])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you're here, can you refactor this a bit to select columns in the select_statement on 86 instead of getting everything and returning a slice of the result. e.g.

distance_table = DB.metadata.tables["distance"]
select_columns = (
  distance_table.c.region_a_id,
  distance_table.c.region_b_id, # note the last comma; this is a tuple
)

select_statement = (
   distance_table
   .select(columns)

something like that. not essential if you want to skip it.


Returns:
Written file containing protein domain alignment
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a sneaky TODO: refactor to this docstring? just to indicate that this is not our code 😎

@nlouwen nlouwen requested a review from adraismawur November 3, 2023 13:52
@adraismawur adraismawur merged commit 69e2784 into dev Nov 6, 2023
7 checks passed
@nlouwen nlouwen deleted the feature/gcf-tree branch November 6, 2023 11:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants