suggestion for reading subcrates #244

LauLauThom · 2025-12-02T15:53:47Z

Hi guys,

live from the deNBI hackathon here, I have been playing with reading an entity referencing a subcrate i.e traversing the graph from the top crate to a subcrate, as suggested in the 1.2 spec.

I am proposing a simple approach here, with a new Subcrate class extending the Dataset class.
I defined this class in the main rocrate.py file, in models.py it would cause circular dependencies.

This would allow things like

main_crate = ROCrate(test_data_dir / "crate_with_subcrate")
subcrate = main_crate.get("subcrate")
subfile = subcrate.get("subfile.txt")
# or 
subfile = subcrate["hasPart"][0]

(see added tests too)

at this point I am mostly interested to know if you think that could be a viable approach before going further.

The implementation is such that the subcrate is only loaded when accessing some of its attribute, to avoid potentially loading large amount of metadata, as one purpose of the subcrate is also to reduce the amount of information in the main crate.

elichad · 2025-12-03T09:21:49Z

I had a quick look and I quite like this implementation at first glance.

Some extra suggestions:

log a message when a subcrate is parsed, to make clear where metadata is being retrieved from
could be useful to have an ROCrate.subcrates property, analogous to ROCrate.data_entities - a list of subcrate entities that is built while parsing the main crate. That list can then be used for iteration even if you don't know the subcrate ids (you could even do this recursively if you were willing to load all the crates into memory)

LauLauThom · 2025-12-03T14:30:44Z

Thanks for the quick feedback !

I have pushed a few more commits to allow things like subfile = main_crate.get("subcrate/subfile.txt") and also implemented the ROCrate.subcrates property as suggested.
I also tested with another nested level i.e another subcrate in the subcrate 😅, and that works too.

For the logging, it seems it's not used in the codebase yet, should I just use the logging package, initializing a default logger instance like logger = logging.getLogger(__name__) ?

Now I am wondering if a Subcrate should be a valid RO-Crate object too.
I think it should not be a problem, only a bit verbose to implement all required methods (see Subcrate.get_entities() for an example)

Happy to discuss this in the drop-in call tomorrow 😉

EDIT : I also commited a .pre-commit-config.yaml file to help enforcing flake8 syntax, could be removed before merging of course

simleo · 2025-12-04T09:49:01Z

I went through the code and did some testing, which exposed problems:

With parse_subcrate left to False all seems well:

>>> from rocrate.rocrate import ROCrate
>>> crate = ROCrate("test/test-data/crate_with_subcrate")
>>> d = crate.get("subcrate/")
>>> d
<subcrate/ Dataset>
>>> d.get("conformsTo")
'https://w3id.org/ro/crate/'
>>> crate.write("/tmp/crate")

With parse_subcrate set to True:

>>> from rocrate.rocrate import ROCrate
>>> crate = ROCrate("test/test-data/crate_with_subcrate", parse_subcrate=True)
>>> d = crate.get("subcrate/")
>>> d
<subcrate/ Dataset>
>>> d.get("conformsTo")  # this fails
>>> crate.write("/tmp/crate")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/simleo/git/ro-crate-py/rocrate/rocrate.py", line 567, in write
    writable_entity.write(base_path)
  File "/home/simleo/git/ro-crate-py/rocrate/model/metadata.py", line 97, in write
    super()._write_from_stream(write_path)
  File "/home/simleo/git/ro-crate-py/rocrate/model/file.py", line 63, in _write_from_stream
    for _, chunk in self.stream():
  File "/home/simleo/git/ro-crate-py/rocrate/model/metadata.py", line 90, in stream
    yield self.id, str.encode(json.dumps(content, indent=4, sort_keys=True), encoding='utf-8')
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
          ^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 202, in encode
    chunks = list(chunks)
             ^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 432, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.12/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 326, in _iterencode_list
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 326, in _iterencode_list
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 439, in _iterencode
    o = _default(o)
        ^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type File is not JSON serializable

I think this has something to do with the hacking of ._jsonld in the Subcrate class. More generally, I'm concerned with the mixing of interfaces (some Dataset and some ROCrate behavior) that happens in the Subcrate class. The role of Subcrate is not clear cut, it sounds very weird that a Subcrate object has a subcrate attribute:

>>> from rocrate.rocrate import ROCrate
>>> crate = ROCrate("test/test-data/crate_with_subcrate", parse_subcrate=True)
>>> subcrate = crate.get("subcrate/")
>>> subcrate.subcrate
>>> subcrate.get("conformsTo")
>>> subcrate.subcrate
<rocrate.rocrate.ROCrate object at 0x7b6edf975160>
>>> subcrate.subcrate.subcrate_entities
[<subsubcrate/ Dataset>]
>>> subcrate.get("conformsTo")
>>>

The new structure looks very confusing. I did not have the time to work on it yet, but I have the feeling that subcrate support could / should be done without using a special Subcrate class: a subcrate should be of type ROCrate, like the main one, while the subcrate attribute might be added directly to the Dataset class.

The main thing I would like to point out about the current solution is that it hacks critical sections of the code, so getting it right is harder than it looks.

also fix flake8 with precommit

also prevents directly accessing items listed in subcrate under hasPart e.g subcrate.get("subfile.txt")

LauLauThom · 2025-12-04T16:50:57Z

Thanks for the feedback Simone, I added a couple changes.

First I made sure any attribute on the original dataset entity (such as conformsTo) is conserved.
I also renamed the Subcrate.subcrate attribute to Subcrate._crate to avoid confusion, and added a get_crate getter for this attribute.

I also changed a bit the behaviour, such that the Subcrate entity behaves more like a Dataset, removing for instance the Subcrate.get_entities.

The crate.write does not raise an exception anymore, but the content of subcrate is not written. I will have a look at it at some point.

LauLauThom marked this pull request as draft December 3, 2025 08:46

LauLauThom added 16 commits December 4, 2025 17:13

first attempt

91b3d6e

move the Subcrate to rocrate main class

8203025

use get_norm_value instead of get + as_list

25b56d3

add support for get with rocrate

c21e0e0

add simple tests

0dbdb15

handle url for subcrates

ce2a024

add flag parse_subcrate

fbe0eee

fix missing flag parse_rocrate

bd02ee2

add subcrate_entities property

bf6f276

support get of subcrate entity from top crate

fd1fd53

also fix flake8 with precommit

load_subcrate as hidden function

9916641

add get_entities to subcrate

22b34f2

add Subcrate to test_data_entities

bf7e0ea

fix issue with nested crates

9e9a4cf

keep conformsTo in Subcrate

21556c8

use getter for inner crate

d98b22f

also prevents directly accessing items listed in subcrate under hasPart e.g subcrate.get("subfile.txt")

LauLauThom force-pushed the subcrate branch from 0d91be2 to d98b22f Compare December 4, 2025 16:16

LauLauThom added 2 commits December 4, 2025 17:41

remove get_entities from Subcrate

629bcfe

remove subcrate.get_entities from tests

771247d

implement crate writing

1368208

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

suggestion for reading subcrates #244

suggestion for reading subcrates #244

Uh oh!

LauLauThom commented Dec 2, 2025

Uh oh!

elichad commented Dec 3, 2025

Uh oh!

LauLauThom commented Dec 3, 2025 •

edited

Loading

Uh oh!

simleo commented Dec 4, 2025

Uh oh!

LauLauThom commented Dec 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

suggestion for reading subcrates #244

Are you sure you want to change the base?

suggestion for reading subcrates #244

Uh oh!

Conversation

LauLauThom commented Dec 2, 2025

Uh oh!

elichad commented Dec 3, 2025

Uh oh!

LauLauThom commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

simleo commented Dec 4, 2025

Uh oh!

LauLauThom commented Dec 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

LauLauThom commented Dec 3, 2025 •

edited

Loading