Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggest the correct name when no key matches in the dataset #9943

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

Illviljan
Copy link
Contributor

@Illviljan Illviljan commented Jan 12, 2025

I found the error when I make a typo on the dataset keys not so helpful. The truncated list of variables hides all the ones that I wanted to see. Instead, add a fuzzy matching function that does the typical "Did you mean X".

  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
possibilities = ("blech", "gray_r", 1, None, (2, 56))
# possibilities = tuple(f"long_name_{i}" for i in range(0, 1000))
t = np.arange(80)
var = xr.Variable(dims=("time",), data=np.arange(t.size))
data_vars = {v: ("time", var) for v in possibilities}
ds = xr.Dataset(data_vars, coords={"T": t})
ds["bluch"]
# KeyError: "No variable named 'bluch'. Did you mean one of ('blech',)?"

Further reading:
python/cpython#16850
matplotlib/matplotlib#28115
https://en.wikipedia.org/wiki/Levenshtein_distance

@@ -1611,6 +1611,11 @@ def __getitem__(
return self._construct_dataarray(key)
except KeyError as e:
message = f"No variable named {key!r}. Variables on the dataset include {shorten_list_repr(list(self.variables.keys()), max_items=10)}"

best_guess = utils.did_you_mean(key, self.variables.keys())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing idea. I would print the best guess first, and then any others so that's it's easy to see

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe should just remove the "Variables on the dataset include ..." ? They try to do the same thing I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you could sort the whole list by similarity and then print that (truncated as above)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it prioritizes best_guess. If best_guess is empty you could be working in the wrong dataset, so it's still nice to get some kind of clue which dataset you're using.

xarray/core/utils.py Outdated Show resolved Hide resolved
@max-sixty
Copy link
Collaborator

Big +1 on this; I'd also enjoy this as a user.

Is there any concern that some processes might be running ds[foo] in a hot loop, and this would cause a performance regression?

@headtr1ck
Copy link
Collaborator

Big +1 on this; I'd also enjoy this as a user.

Is there any concern that some processes might be running ds[foo] in a hot loop, and this would cause a performance regression?

We could add an LRU cache over did_you_mean if this is an issue

@max-sixty
Copy link
Collaborator

We could add an LRU cache over did_you_mean if this is an issue

Though I'm thinking that someone could query whether different keys exist; i.e. for x in very_long_list:; try:; ds[x]

Overall I say let's go ahead and we can reassess if we hear reports of slowdowns. Folks can use in / __contains__ for a fast path...

@Illviljan
Copy link
Contributor Author

Though I'm thinking that someone could query whether different keys exist; i.e. for x in very_long_list:; try:; ds[x]

I thought about this case as well, my initial idea was to just use ds.get(x) then. But turns out .get is one of the few ones we inherit from collections.Mapping and it's pretty much try:; ds[x]. We can probably do that one better as well if it becomes a problem.

@Illviljan Illviljan added the plan to merge Final call for comments label Jan 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants