Add ability to detect "families"#551

Open
freemansw1 wants to merge 23 commits into tobac-project:RC_v1.7.0 from freemansw1:family_id

@freemansw1
Member

I've talked about this before; this is (finally) the code for family detection (family tracking still to come). This code links features together in space.

For the reviewers (I'm leaning toward two reviewers given that this is a new concept for tobac), please check the following:

  • Documentation makes sense
  • Example works and makes sense
  • This works for your data

Note: this is targeted at a new branch, RC_v1.7.0.

  • Have you followed our guidelines in CONTRIBUTING.md?
  • Have you self-reviewed your code and corrected any misspellings?
  • Have you written documentation that is easy to understand?
  • Have you written descriptive commit messages?
  • Have you added NumPy docstrings for newly added functions?
  • Have you formatted your code using black?
  • If you have introduced a new functionality, have you added adequate unit tests?
  • Have all tests passed in your local clone?
  • If you have introduced a new functionality, have you added an example notebook?
  • Have you kept your pull request small and limited so that it is easy to review?
  • Have the newest changes from this branch been merged?

@w-k-jones and/or @JuliaKukulies are you up for reviewing? I know this is for v1.7.0, but I'd like to keep this moving so I can implement family tracking soon.

@freemansw1 freemansw1 added this to the v1.7.0 milestone Jan 19, 2026
@codecov

codecov bot commented Jan 19, 2026

Codecov Report

❌ Patch coverage is 90.86957% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.75%. Comparing base (fcb7cd3) to head (9f12f17).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| tobac/merge_split/families/feature_family_id.py | 88.88% | 10 Missing ⚠️ |
| tobac/utils/internal/label_functions.py | 93.33% | 6 Missing ⚠️ |
| tobac/utils/datetime.py | 75.00% | 3 Missing ⚠️ |
| tobac/utils/general.py | 91.30% | 2 Missing ⚠️ |
Additional details and impacted files
@@              Coverage Diff              @@
##           RC_v1.7.0     #551      +/-   ##
=============================================
+ Coverage      64.84%   65.75%   +0.91%     
=============================================
  Files             27       31       +4     
  Lines           3985     4135     +150     
=============================================
+ Hits            2584     2719     +135     
- Misses          1401     1416      +15     
| Flag | Coverage Δ |
|---|---|
| unittests | 65.75% <90.86%> (+0.91%) ⬆️ |

@github-actions

github-actions bot commented Jan 19, 2026

Linting results by Pylint:

Your code has been rated at 8.36/10 (previous run: 8.36/10, +0.00)
The linting score is an indicator that reflects how well your code version follows Pylint’s coding standards and quality metrics with respect to the RC_v1.7.0 branch.
A decrease usually indicates your new code does not fully meet style guidelines or has potential errors.

@freemansw1
Member Author

Blocked by #553. Also need to add a test and handling for the case where there are no families.

@w-k-jones
Member

@freemansw1 I'll be happy to review this after #554 is merged

@w-k-jones w-k-jones added the enhancement Addition of new features, or improved functionality of existing features label Feb 11, 2026
@w-k-jones w-k-jones self-requested a review February 11, 2026 10:54
@w-k-jones
Member

Some overall thoughts before I do an in-depth review:

First off, really nice addition! On your particular points, the documentation is nice, but there is nothing in the user guide section (the same is true for merge/split actually), and the docstring for identify_feature_families_from_data seems to copy that from identify_feature_families_from_segmentation rather than describe what it actually does. The example is nice, and I'll have a go at running it with other data.

General thoughts:

  1. We should decide on a fixed term for a collection of features at the same time step. I have used "cluster" before but am happy to switch to "family" as that avoids confusion with other clustering methods. We should change the feature_family_id column to be named family to keep the same pattern with feature, cell, track etc.
  2. Does this belong as part of merge/split, or its own module? I'm undecided, but leaning towards merge/split being for combining tracks over time, whereas this is for combining features on individual time steps.
  3. I recommend moving identify_feature_families from utils.general to the same module as the other family functions to avoid unnecessary coupling (unless identify_feature_families is used in other modules)
  4. It would be nice to add a function to merge/split cells based on detected families (maybe in a future PR)

I also noticed that the PBC labelling currently only uses connectivity=1 for connecting labels across borders, but I can fix that as part of the feature detection refactoring.

@freemansw1
Member Author

bug to investigate/test: grid output increments rather than carrying through 0 values in label with multiple times

@freemansw1
Member Author

> bug to investigate/test: grid output increments rather than carrying through 0 values in label with multiple times

This has been resolved with the latest commit.

)

else:
# TODO: integrate 3D stats - mostly around center coordinates - need the functions in tobac proper
Member

Came across an issue when testing this with model data, do you want to implement this for the initial release?

# TODO: deal with dim order for 3D
if is_3D:
    if "vdim" not in rows_at_time:
        raise NotImplementedError(
Member

We could support this most easily by just reducing the mask using "any" along the vertical dimension
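
A minimal sketch of that reduction, assuming the mask is a boolean NumPy array and that the position of the vertical dimension is known. The name `vdim_axis` and the toy shape are illustrative assumptions, not taken from the PR:

```python
import numpy as np

# Toy 3D boolean mask ordered (z, y, x); the shape is illustrative only.
mask_3d = np.zeros((3, 4, 5), dtype=bool)
mask_3d[0, 1, 2] = True  # feature present only at the lowest level
mask_3d[2, 3, 4] = True  # feature present only at the highest level

vdim_axis = 0  # assumed position of the vertical dimension

# A column is True in the 2D mask if any level in that column is True.
mask_2d = mask_3d.any(axis=vdim_axis)

print(mask_2d.shape)
print(int(mask_2d.sum()))
```

With this approach the existing 2D labelling path could be reused unchanged, at the cost of merging regions that overlap horizontally but are separated vertically.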

)

rows_at_time["vdim_adj"] = np.clip(
    int(rows_at_time["vdim"] + 0.5).astype(int), a_min=0, a_max=v_max - 1
Member

Raises TypeError: cannot convert the series to <class 'int'>, fix:

Suggested change
- int(rows_at_time["vdim"] + 0.5).astype(int), a_min=0, a_max=v_max - 1
+ (rows_at_time["vdim"] + 0.5).astype(int), a_min=0, a_max=v_max - 1
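
A small standalone reproduction of the error and the fix, using a toy DataFrame. The values and `v_max = 5` are made up for illustration, and `Series.clip` is used here in place of `np.clip`, which should give the same result for this case:

```python
import pandas as pd

rows_at_time = pd.DataFrame({"vdim": [0.2, 1.5, 3.7, 9.9]})
v_max = 5  # assumed number of vertical levels

# int() cannot convert a multi-element Series, which is the reported TypeError.
try:
    int(rows_at_time["vdim"] + 0.5)
except TypeError as err:
    print(f"TypeError: {err}")

# Element-wise version: add 0.5, truncate to int, then clip to valid indices.
vdim_adj = (rows_at_time["vdim"] + 0.5).astype(int).clip(lower=0, upper=v_max - 1)
print(vdim_adj.tolist())  # [0, 2, 4, 4]
```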

Comment on lines +152 to +155
if target == "minimum":
    mask = in_arr < threshold
elif target == "maximum":
    mask = in_arr > threshold
Member

This again raises the question of whether the threshold should be inclusive. I lean towards yes, but then we would have to change the threshold value in identify_feature_families_from_segmentation.
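
To make the difference concrete, here is a toy comparison for the "maximum" case. The array and threshold are illustrative values only:

```python
import numpy as np

in_arr = np.array([0.5, 1.0, 1.5])
threshold = 1.0

# Exclusive comparison, as in the lines under review (target == "maximum").
mask_exclusive = in_arr > threshold
# Inclusive alternative: a value exactly at the threshold is kept.
mask_inclusive = in_arr >= threshold

print(mask_exclusive.tolist())  # [False, False, True]
print(mask_inclusive.tolist())  # [False, True, True]
```

The two masks differ only for values exactly equal to the threshold, which matters most for integer-valued inputs such as segmentation labels.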

def identify_feature_families_from_data(
    feature_df: pd.DataFrame,
    in_data: xr.DataArray,
    threshold: float,
Member

Threshold should be optional if target=="bool"
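
One way this could look, sketched with a hypothetical helper `build_mask` (not a function in the PR). It assumes that `target="bool"` means the input is already a mask, and it keeps the exclusive comparisons from the current code:

```python
from typing import Optional

import numpy as np


def build_mask(
    in_arr: np.ndarray,
    target: str = "maximum",
    threshold: Optional[float] = None,
) -> np.ndarray:
    """Hypothetical helper: threshold is only required for min/max targets."""
    if target == "bool":
        # Assumption: the data is already a mask, so no threshold is needed.
        return in_arr.astype(bool)
    if threshold is None:
        raise ValueError(f"threshold is required when target is {target!r}")
    if target == "minimum":
        return in_arr < threshold
    if target == "maximum":
        return in_arr > threshold
    raise ValueError(f"unknown target: {target!r}")


print(build_mask(np.array([0, 1, 2]), target="bool").tolist())
print(build_mask(np.array([0.5, 1.5]), target="maximum", threshold=1.0).tolist())
```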

Comment on lines +229 to +233
points_list = (
    rows_at_time["hdim_1_adj"].values,
    rows_at_time["hdim_2_adj"].values,
    rows_at_time["vdim_adj"].values,
)
Member

This leads to an error if the order of dimensions is different (e.g. z, y, x). Need to use the vdim axis to insert vdim in the correct position for indexing
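
A rough sketch of that idea with plain NumPy arrays, assuming the axis of the vertical dimension is available as an integer. The name `vdim_axis` and the toy data are assumptions for illustration:

```python
import numpy as np

# Toy 3D field ordered (z, y, x), so the vertical axis is axis 0 here.
data = np.arange(2 * 3 * 4).reshape(2, 3, 4)
vdim_axis = 0  # assumed to be known from the input data

hdim_1 = np.array([0, 2])  # y indices
hdim_2 = np.array([1, 3])  # x indices
vdim = np.array([1, 0])    # z indices

# Start from the horizontal indices, then insert vdim at its axis position.
points = [hdim_1, hdim_2]
points.insert(vdim_axis, vdim)

values = data[tuple(points)]
print(values.tolist())  # [13, 11]
```

The same construction gives `(hdim_1, vdim, hdim_2)` for `vdim_axis = 1` and `(hdim_1, hdim_2, vdim)` for `vdim_axis = 2`, so the indexing no longer depends on a fixed dimension order.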

@w-k-jones
Member

I have been testing this out with some 3D model data and hacking fixes to 3D bugs along the way. Two main thoughts:

  1. Do we want to support 3D data in the initial release? If not, we should immediately raise an informative NotImplementedError when checking for 3D data
  2. It would be nice if the family_stats_df was formatted in the same manner as the output from feature detection, including in particular the frame column required for tracking
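
For the first point, an early guard might look something like the sketch below. `check_is_2d` is a hypothetical name, and the check assumes the input is laid out as (time, hdim_1, hdim_2), which may not match how the PR detects 3D data:

```python
import numpy as np


def check_is_2d(in_data: np.ndarray) -> None:
    """Hypothetical guard: reject inputs with more than time + 2 spatial dims."""
    # Assumption: 2D input has exactly three dimensions (time, hdim_1, hdim_2).
    if in_data.ndim > 3:
        raise NotImplementedError(
            "Family detection currently supports 2D (time, y, x) data only; "
            f"got an array with {in_data.ndim} dimensions."
        )


check_is_2d(np.zeros((2, 3, 4)))  # passes silently

try:
    check_is_2d(np.zeros((2, 5, 3, 4)))
except NotImplementedError as err:
    print(err)
```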

@w-k-jones
Member

Found an error which occurs if you try to detect families with an input dataframe that already has a feature_family_id column:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[91], line 1
----> 1 combined_features, families = identify_feature_families_from_data(
      2     combined_features, cli_2d, 0.00001, #return_grid=True,
      3 )

File ~/python/tobac/tobac/merge_split/families/feature_family_id.py:275, in identify_feature_families_from_data(feature_df, in_data, threshold, return_grid, family_column_name, time_padding, PBC_flag, target, unlinked_family_id)
    273 out_df = feature_df.merge(family_df, on="feature", how="inner")
    274 if unlinked_family_id is not None:
--> 275     out_df.loc[out_df["feature_family_id"] == 0, "feature_family_id"] = -1
    276 else:
    277     out_df = out_df[
    278         np.logical_and(
    279             out_df["feature_family_id"] != 0, out_df["feature_family_id"] != -1
    280         )
    281     ]

File ~/.conda/envs/easy-2026/lib/python3.14/site-packages/pandas/core/frame.py:4113, in DataFrame.__getitem__(self, key)
   4109 
   4110         if is_single_key:
   4111             if self.columns.nlevels > 1:
   4112                 return self._getitem_multilevel(key)
-> 4113             indexer = self.columns.get_loc(key)
   4114             if is_integer(indexer):
   4115                 indexer = [indexer]
   4116         else:

File ~/.conda/envs/easy-2026/lib/python3.14/site-packages/pandas/core/indexes/base.py:3819, in Index.get_loc(self, key)
   3814     if isinstance(casted_key, slice) or (
   3815         isinstance(casted_key, abc.Iterable)
   3816         and any(isinstance(x, slice) for x in casted_key)
   3817     ):
   3818         raise InvalidIndexError(key)
-> 3819     raise KeyError(key) from err
   3820 except TypeError:
   3821     # If we have a listlike key, _check_indexing_error will raise
   3822     #  InvalidIndexError. Otherwise we fall through and re-raise
   3823     #  the TypeError.
   3824     self._check_indexing_error(key)

KeyError: 'feature_family_id'
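
A likely cause, reproduced here with toy DataFrames: when both sides of a pandas merge share a non-key column, pandas renames the two copies with `_x`/`_y` suffixes, so the plain column name no longer exists afterwards. The data and the drop-before-merge workaround are illustrative assumptions, not the PR's code:

```python
import pandas as pd

# Features that already carry a family column from an earlier run.
feature_df = pd.DataFrame({"feature": [1, 2, 3], "feature_family_id": [1, 1, 2]})
# Newly computed family assignments (values are illustrative).
family_df = pd.DataFrame({"feature": [1, 2, 3], "feature_family_id": [5, 5, 0]})

# Both frames share a non-key column, so merge adds _x/_y suffixes.
clashing = feature_df.merge(family_df, on="feature", how="inner")
print(list(clashing.columns))
# ['feature', 'feature_family_id_x', 'feature_family_id_y']

# One possible workaround: drop the stale column before merging.
cleaned = feature_df.drop(columns=["feature_family_id"], errors="ignore")
out_df = cleaned.merge(family_df, on="feature", how="inner")
print(list(out_df.columns))  # ['feature', 'feature_family_id']
```

Using `errors="ignore"` keeps the drop safe for inputs that do not yet have the column.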
