@aqm577 aqm577 commented Sep 20, 2025

  • I agree to follow the project's code of conduct.
  • I added an entry to CHANGES.md if knowledge of this change could be valuable to users.

Greetings!

I was looking into adding methods to create multi-dimensional rasters in this library when I stumbled upon the section about object lifetimes in the user-oriented documentation for GDAL's multidimensional raster data model. It states that it's perfectly legal to keep using e.g. groups and mdarrays even after closing the dataset they belong to. For this reason it surprised me that Group, Dimension and MDArray in this library all keep a reference to the dataset they were retrieved from.

When I tried to remove these references I discovered that MDArray instances keep a C pointer to a dataset, so I guess this is one reason to keep the references around. The only place I found it used, however, was in the get_statistics method, and since the alternative of passing the dataset to this method as an argument doesn't sound that bad, I'm wondering if this was done because we hold the reference anyway.

In this PR I removed the references and added some tests that indicate that it's OK to do so. There may be other reasons for keeping these references around, or the pointer in the mdarray instances is the reason, and in that case I will close this PR. But if they were added on intuition without really being necessary, I think it is worth considering removing them for the improvement in usability (IMHO). It will of course be a break in backwards compatibility.

Some additional notes:

  • I removed the GroupOrDimension and GroupOrArray enums since they are no longer used.
  • As noted above, I added a dataset argument to MDArray.get_statistics since the mdarray instance no longer keeps a pointer to the dataset around.
  • Since an MDArray no longer keeps a reference to either a group or a dimension, I replaced the from_c_mdarray_and_group and from_c_mdarray_and_dimension methods with a single from_c_mdarray method.
  • I removed the _dataset argument from Group.from_c_group.
  • I removed the _parent argument from Dimension.from_c_dimension.

Please let me know what you think! I realize this is quite a bold PR for a first-time contributor 🙂 If you think this is a good idea I will make sure to update CHANGES.md.
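To make the usability argument concrete, here is a minimal sketch of the two shapes being compared. All types below (`Dataset`, `BorrowingGroup`, `OwnedGroup`) and their methods are hypothetical stand-ins written for illustration; they are not the gdal crate's actual definitions, and they say nothing about whether dropping the reference is sound for real GDAL handles.

```rust
// Hypothetical stand-ins for illustration only; NOT the gdal crate's types.
struct Dataset {
    name: String,
}

// Shape with a lifetime: the group borrows the dataset it came from.
struct BorrowingGroup<'a> {
    _dataset: &'a Dataset,
    name: String,
}

// Shape without a lifetime: the group does not reference the dataset.
struct OwnedGroup {
    name: String,
}

impl Dataset {
    fn open(name: &str) -> Dataset {
        Dataset { name: name.to_string() }
    }

    fn root_group_borrowing(&self) -> BorrowingGroup<'_> {
        BorrowingGroup { _dataset: self, name: format!("{}:/", self.name) }
    }

    fn root_group_owned(&self) -> OwnedGroup {
        OwnedGroup { name: format!("{}:/", self.name) }
    }
}

// Compiles: the local dataset is dropped on return, the group lives on.
fn open_root_owned(path: &str) -> OwnedGroup {
    let dataset = Dataset::open(path);
    dataset.root_group_owned()
}

// The equivalent with the borrowing shape is rejected by the compiler,
// because the returned group would reference the local `dataset`:
//
// fn open_root_borrowing(path: &str) -> BorrowingGroup<'static> {
//     let dataset = Dataset::open(path);
//     dataset.root_group_borrowing()
// }

// With the borrowing shape, the caller has to keep the dataset alive.
fn root_name_borrowing(dataset: &Dataset) -> String {
    dataset.root_group_borrowing().name
}
```

In this toy model the lifetime-free group can be returned from a helper or stored in a struct without threading a `'a` parameter through the calling code, which is the kind of simplification the PR is after.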

@lnicola
Member

lnicola commented Oct 2, 2025

Thank you. I'm a bit on the fence about this. We generally try to match the C++ API (where indeed, you don't have to pass a reference to the dataset when computing the statistics, unlike in the C one).

Are you using this API? I assume dropping the lifetime makes the user code a lot nicer?

@aqm577
Author

aqm577 commented Oct 3, 2025

Thank you for taking a look at this! I wasn't aware that it isn't necessary to pass the dataset to the C++ API and if it's a design decision to model that API I understand your hesitation.

Regarding my usage: I would like to use this library to create new multi-dimensional datasets in a project, but this functionality is not yet implemented if I'm not mistaken. It was when I started to look into adding this functionality that I discovered that the references weren't strictly necessary. Before making a PR to add methods that needed to consider the resulting lifetime annotations, I wanted to raise this issue. I haven't found any technical problems with implementing the new methods with the references intact.

With this said, I still think it would be a nicer API to get rid of the references and the lifetime annotations from the above-mentioned structs. In my mind they impose a restriction and a cognitive load, and I'm a strong proponent of keeping APIs as simple as possible. But I fully respect if you consider backwards compatibility and/or compatibility with the C++ API more important 🙂

@ChristianBeilschmidt
Contributor

From https://gdal.org/en/stable/user/multidim_raster_data_model.html#objects-lifetime:

> the GDALGroup instance returned by GDALDataset::GetRootGroup() can be used after the dataset has been closed.

So they cannot be used if you have to pass them a dataset in order to work. Hence, this solution also doesn't cut it, yet.

> A potential point of attention is that, when creating / editing a dataset, all those objects keep alive the underlying file descriptors, so changes are only guaranteed to be serialized when all objects related to a dataset have been released.

So it seems to keep some kind of reference.

Maybe we could use an Arc<…> instead of the raw pointer. But my thoughts don't go further than that, yet.
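As a rough illustration of the `Arc<…>` idea, the sketch below shares one handle between a dataset and the groups derived from it, so the handle is only released when the last owner is dropped. `DatasetHandle`, `Dataset`, `Group` and the `closed` flag are hypothetical names invented for this example. The `Drop` impl merely sets a flag where a real binding would release the native handle, and whether this pattern is appropriate for actual GDAL handles is an open question here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Hypothetical stand-in for the native dataset handle.
struct DatasetHandle {
    // Shared flag so callers can observe when the handle is released.
    closed: Arc<AtomicBool>,
}

impl Drop for DatasetHandle {
    fn drop(&mut self) {
        // A real binding would release the native resource here;
        // this sketch only records that the release happened.
        self.closed.store(true, Ordering::SeqCst);
    }
}

struct Dataset {
    handle: Arc<DatasetHandle>,
}

// No lifetime parameter: the group co-owns the handle instead of borrowing.
struct Group {
    _handle: Arc<DatasetHandle>,
    name: String,
}

impl Dataset {
    fn open(closed: Arc<AtomicBool>) -> Dataset {
        Dataset { handle: Arc::new(DatasetHandle { closed }) }
    }

    fn root_group(&self) -> Group {
        Group { _handle: Arc::clone(&self.handle), name: "/".to_string() }
    }
}

// Drops the dataset first, then the group, reporting the flag after each.
fn closed_after_each_drop() -> (bool, bool) {
    let closed = Arc::new(AtomicBool::new(false));
    let dataset = Dataset::open(Arc::clone(&closed));
    let group = dataset.root_group();

    drop(dataset);
    let after_dataset = closed.load(Ordering::SeqCst);

    drop(group);
    let after_group = closed.load(Ordering::SeqCst);

    (after_dataset, after_group)
}
```

In this model the handle stays open after the dataset is dropped and is released only once the group is gone too, which mirrors the documented behaviour that related objects keep the underlying file descriptors alive.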

@lnicola
Member

lnicola commented Oct 6, 2025

> So they cannot be used if you have to pass them a dataset in order to work.

Sorry, would you mind expanding on this? I don't really understand it.

> So it seems to keep some kind of reference. Maybe we could use an Arc<…> instead of the raw pointer.

That's a bit like what I'm doing in #677. GDAL can do reference counting, but it's not thread-safe (atomic). And in addition, I think calling GDALClose on the original dataset might close it immediately, even if it doesn't free the memory, which doesn't sound like something we'd want.

But the phrasing in the MD docs is different ("can be used after the dataset has been closed").

@ChristianBeilschmidt
Contributor

> > So they cannot be used if you have to pass them a dataset in order to work.

> Sorry, would you mind expanding on this? I don't really understand it.

Well, you cannot use it after the dataset is closed if you require the dataset to be a parameter of your function(s) of that struct.
Moreover, the API is not that ergonomic this way either.
But I like the idea of fewer lifetimes.

@lnicola
Member

lnicola commented Oct 6, 2025

Yeah, I think the point is that you don't close it, but keep it around and still avoid the lifetimes on the MD types.

@aqm577
Author

aqm577 commented Oct 6, 2025

> Yeah, I think the point is that you don't close it, but keep it around and still avoid the lifetimes on the MD types.

This is how I imagined it. If it's only get_statistics that needs the dataset I think it's acceptable to require that the user keep it around in that case, like they would if they were using the C API. All other code could throw it away and use the MD types without lifetime annotations referencing the dataset. I have no sense of how common calls to get_statistics are however.

I tried to understand how the C++ API handles the dataset by browsing the GDAL repository, but unfortunately my C++ is a little too rusty (no pun intended 😉). From what I could gather, get_statistics calculates and writes the statistics if they are missing, and from using GDAL I believe they can be written to a side-car file, so I'm guessing this is the reason the original dataset is needed.
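A minimal sketch of that calling convention, under the assumption (only a guess, as noted above) that the dataset is needed so computed statistics can be persisted. Everything here is hypothetical: `Dataset`, `MDArray`, `Statistics` and the `cached` field are toy types, and the in-memory cache merely stands in for whatever GDAL actually writes.

```rust
use std::cell::RefCell;

// Hypothetical toy types; NOT the gdal crate's definitions.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Statistics {
    min: f64,
    max: f64,
    mean: f64,
}

struct Dataset {
    // Stands in for persisted statistics (e.g. a side-car file).
    cached: RefCell<Option<Statistics>>,
}

impl Dataset {
    fn new() -> Dataset {
        Dataset { cached: RefCell::new(None) }
    }
}

// No lifetime parameter and no stored dataset pointer.
struct MDArray {
    values: Vec<f64>,
}

impl MDArray {
    // Most methods work without any dataset at hand.
    fn len(&self) -> usize {
        self.values.len()
    }

    // Only this method asks the caller to supply the dataset.
    fn get_statistics(&self, dataset: &Dataset) -> Option<Statistics> {
        // Reuse previously stored statistics if present.
        if let Some(stats) = *dataset.cached.borrow() {
            return Some(stats);
        }
        if self.values.is_empty() {
            return None;
        }
        let min = self.values.iter().copied().fold(f64::INFINITY, f64::min);
        let max = self.values.iter().copied().fold(f64::NEG_INFINITY, f64::max);
        let mean = self.values.iter().sum::<f64>() / self.values.len() as f64;
        let stats = Statistics { min, max, mean };
        // Store the result on the dataset, mimicking a write-back.
        *dataset.cached.borrow_mut() = Some(stats);
        Some(stats)
    }
}
```

Code that never calls `get_statistics` can let the dataset go out of scope and keep using the array, while code that does call it keeps the dataset around for that one call, much as it would with the C API.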

I also thought about a reference-counted pointer but even if it is the right choice I can't motivate it as easily as this PR. If we disregard get_statistics for a moment, this PR only removes a constraint that hopefully isn't needed.
