Collaborator

@veni-vidi-vici-dormivi veni-vidi-vici-dormivi commented May 28, 2025

Here is a method that searches for the values of one key that are shared across all values of another key. The specific use case is: "Which scenarios and members are available for both variables tas and hfds?"

The usage would be to find all available scenarios and members for both variables and then search the resulting FileContainer for intersecting values along scenario_member. In this case search_key = variable and intersect_key = scenario_member. I chose this approach because I found it relatively straightforward, more so than implementing it in FileFinder.

I am not very happy with the names and my explanation in the docstring, but at least the example should make it quite clear.
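
A minimal pandas sketch of the idea described above, not the code in this PR. The frame stands in for the df of a FileContainer (assumed layout: one column per key), and the function name intersect_along is hypothetical. It keeps only the rows whose intersect_key value occurs for every value of search_key.

```python
import pandas as pd


def intersect_along(df: pd.DataFrame, search_key: str, intersect_key: str) -> pd.DataFrame:
    """Keep rows whose intersect_key value exists for every search_key value."""
    # one set of intersect_key values per search_key value
    groups = [set(values) for _, values in df.groupby(search_key)[intersect_key]]
    # values that are common to all groups
    common = set.intersection(*groups)
    return df[df[intersect_key].isin(common)]


# stand-in for FileContainer.df (hypothetical data)
df = pd.DataFrame(
    {
        "variable": ["tas", "tas", "hfds", "hfds"],
        "scenario_member": ["historical_r1", "ssp585_r1", "historical_r1", "ssp126_r1"],
    }
)

result = intersect_along(df, search_key="variable", intersect_key="scenario_member")
print(result)
```

Only the two historical_r1 rows remain, one per variable, because that is the only scenario_member present for both tas and hfds.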

codecov bot commented May 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Flag Coverage Δ
unittests 99.71% <100.00%> (+<0.01%) ⬆️

Files with missing lines Coverage Δ
filefisher/_filefinder.py 100.00% <100.00%> (ø)

@mathause
Collaborator

Would it make sense to require two FileContainer objects and add an intersect method? Or an intersect function. I think on would be intersect_key, but I may be missing the function of search_key.

import filefisher

test_paths = [
    "historical/tas",
    "historical/hfds",
    "ssp585/hfds",
]

ff = filefisher.FileFinder("{scen}", "{variable}", test_paths=test_paths)

fc_tas = ff.find_files(variable="tas")
fc_hfds = ff.find_files(variable="hfds")


fc_tas.intersect(fc_hfds, on="variable")

where the on key must have a single unique value in each of the containers. I am also not sure whether this should return one or two elements.

The implementation could be along the lines of

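import pandas as pd


def intersect(df_l: pd.DataFrame, df_r: pd.DataFrame, on: str) -> pd.DataFrame:
    # both frames need the same keys (columns)
    assert df_l.columns.equals(df_r.columns)

    # `on` must have exactly one value in each frame
    assert df_l[on].nunique() == 1
    assert df_r[on].nunique() == 1

    # rows are compared on all keys except `on`
    columns = df_l.columns.drop(on)

    mi_l = pd.MultiIndex.from_frame(df_l[columns])
    mi_r = pd.MultiIndex.from_frame(df_r[columns])

    # key combinations present in both frames
    sel = mi_l.intersection(mi_r)

    # boolean masks via isin; get_locs expects one indexer per level,
    # not a MultiIndex of row tuples
    left = df_l[mi_l.isin(sel)]
    right = df_r[mi_r.isin(sel)]

    return pd.concat([left, right])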


intersect(fc_tas.df, fc_hfds.df, on="variable")

@veni-vidi-vici-dormivi
Collaborator Author

Right. That is also nice. My application was one where I did
fc = ff.find_files(variable = ["tas", "hfds"])
and I wanted to find the intersection of a specified key for the two variables. So my approach works on a single file container.

Yours is easier to understand. I am thinking about possible advantages of my implementation... With mine, only the entries of intersect_key are checked and the other keys could be anything, but I'm not sure if that is an advantage. Working on one FileContainer seemed more intuitive to me because if, say, I were looking for ten variables at once, it would be more cumbersome to make individual containers and would also take longer.
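
The difference between the two semantics only shows up once a third key is present. The sketch below uses plain pandas with a made-up grid key and a hypothetical helper named select; it is not filefisher API. Matching on scenario_member alone keeps both historical rows, while matching on all keys except variable drops them because the grids differ.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "variable": ["tas", "hfds", "hfds"],
        "scenario_member": ["historical_r1", "historical_r1", "ssp585_r1"],
        "grid": ["gn", "gr", "gr"],  # made-up third key
    }
)


def select(df: pd.DataFrame, group_key: str, keys: list) -> pd.DataFrame:
    """Keep rows whose combination of `keys` occurs for every value of `group_key`."""
    # one set of key combinations per group
    combos = [set(map(tuple, g[keys].to_numpy())) for _, g in df.groupby(group_key)]
    common = set.intersection(*combos)
    # row-wise membership test against the common combinations
    mask = [tuple(row) in common for row in df[keys].to_numpy()]
    return df[mask]


# only scenario_member has to match (single-container approach)
only_key = select(df, "variable", ["scenario_member"])

# all keys except variable have to match (two-container approach)
all_keys = select(df, "variable", ["scenario_member", "grid"])

print(len(only_key), len(all_keys))
```

Which behaviour is preferable depends on whether differing values of the remaining keys should count as a mismatch.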

@mathause
Collaborator

mathause commented Oct 1, 2025

Yet another idea would be to combine this as a groupby method and intersect function (or align or ?).

filefisher.align(fc.groupby(on="variable"), except="variable")

but then we have to pass the key twice. Maybe groupby_and_align (or groupby_and_intersect). I think I am looking for alternatives because I found your naming not intuitive. In any case, functionality to look for all models/ensemble members that provide a given list of variables would be very welcome...
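
One possible shape for such a combined helper, sketched on plain DataFrames. The name groupby_and_intersect is taken from the suggestion above and is hypothetical, not an existing filefisher function. It takes the key only once, which also sidesteps the fact that except is a reserved word in Python and could not be used as a keyword argument name.

```python
import functools

import pandas as pd


def groupby_and_intersect(df: pd.DataFrame, on: str) -> pd.DataFrame:
    """Keep rows whose remaining keys occur for every value of `on`."""
    other = df.columns.drop(on)
    # one MultiIndex of the remaining keys per value of `on`
    indexes = [pd.MultiIndex.from_frame(g[other]) for _, g in df.groupby(on)]
    # key combinations shared by all groups
    common = functools.reduce(lambda a, b: a.intersection(b), indexes)
    return df[pd.MultiIndex.from_frame(df[other]).isin(common)]


# stand-in for FileContainer.df (hypothetical data)
df = pd.DataFrame(
    {
        "variable": ["tas", "tas", "hfds", "hfds", "pr"],
        "scen": ["historical", "ssp585", "historical", "ssp585", "historical"],
    }
)

aligned = groupby_and_intersect(df, on="variable")
print(aligned)
```

Because pr is only available for historical, the ssp585 rows of tas and hfds are dropped. The approach scales to any number of variables without building one container per variable.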
