Skip to content

Conversation

@LauLauThom
Copy link
Contributor

Hi guys,

live from the deNBI hackathon here, I have been playing with reading an entity referencing a subcrate i.e traversing the graph from the top crate to a subcrate, as suggested in the 1.2 spec.

I am proposing a simple approach here, with a new Subcrate class extending the Dataset class.
I defined this class in the main rocrate.py file, in models.py it would cause circular dependencies.

This would allow things like

main_crate = ROCrate(test_data_dir / "crate_with_subcrate")
subcrate = main_crate.get("subcrate")
subfile = subcrate.get("subfile.txt")
# or 
subfile = subcrate["hasPart"][0]

(see added tests too)

at this point I am mostly interested to know if you think that could be a viable approach before going further.

The implementation is such that the subcrate is only loaded when accessing some of its attribute, to avoid potentially loading large amount of metadata, as one purpose of the subcrate is also to reduce the amount of information in the main crate.

@LauLauThom LauLauThom marked this pull request as draft December 3, 2025 08:46
@elichad
Copy link
Contributor

elichad commented Dec 3, 2025

I had a quick look and I quite like this implementation at first glance.

Some extra suggestions:

  • log a message when a subcrate is parsed, to make clear where metadata is being retrieved from
  • could be useful to have an ROCrate.subcrates property, analogous to ROCrate.data_entities - a list of subcrate entities that is built while parsing the main crate. That list can then be used for iteration even if you don't know the subcrate ids (you could even do this recursively if you were willing to load all the crates into memory)

@LauLauThom
Copy link
Contributor Author

LauLauThom commented Dec 3, 2025

Thanks for the quick feedback !

I have pushed a few more commits to allow things like subfile = main_crate.get("subcrate/subfile.txt") and also implemented the ROCrate.subcrates property as suggested.
I also tested with another nested level i.e another subcrate in the subcrate 😅, and that works too.

For the logging, it seems it's not used in the codebase yet, should I just use the logging package, initializing a default logger instance like logger = logging.getLogger(__name__) ?

Now I am wondering if a Subcrate should be a valid RO-Crate object too.
I think it should not be a problem, only a bit verbose to implement all required methods (see Subcrate.get_entities() for an example)

Happy to discuss this in the drop-in call tomorrow 😉

EDIT : I also commited a .pre-commit-config.yaml file to help enforcing flake8 syntax, could be removed before merging of course

@simleo
Copy link
Collaborator

simleo commented Dec 4, 2025

I went through the code and did some testing, which exposed problems:

With parse_subcrate left to False all seems well:

>>> from rocrate.rocrate import ROCrate
>>> crate = ROCrate("test/test-data/crate_with_subcrate")
>>> d = crate.get("subcrate/")
>>> d
<subcrate/ Dataset>
>>> d.get("conformsTo")
'https://w3id.org/ro/crate/'
>>> crate.write("/tmp/crate")

With parse_subcrate set to True:

>>> from rocrate.rocrate import ROCrate
>>> crate = ROCrate("test/test-data/crate_with_subcrate", parse_subcrate=True)
>>> d = crate.get("subcrate/")
>>> d
<subcrate/ Dataset>
>>> d.get("conformsTo")  # this fails
>>> crate.write("/tmp/crate")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/simleo/git/ro-crate-py/rocrate/rocrate.py", line 567, in write
    writable_entity.write(base_path)
  File "/home/simleo/git/ro-crate-py/rocrate/model/metadata.py", line 97, in write
    super()._write_from_stream(write_path)
  File "/home/simleo/git/ro-crate-py/rocrate/model/file.py", line 63, in _write_from_stream
    for _, chunk in self.stream():
  File "/home/simleo/git/ro-crate-py/rocrate/model/metadata.py", line 90, in stream
    yield self.id, str.encode(json.dumps(content, indent=4, sort_keys=True), encoding='utf-8')
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
          ^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 202, in encode
    chunks = list(chunks)
             ^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 432, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/usr/lib/python3.12/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 326, in _iterencode_list
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 326, in _iterencode_list
    yield from chunks
  File "/usr/lib/python3.12/json/encoder.py", line 439, in _iterencode
    o = _default(o)
        ^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type File is not JSON serializable

I think this has something to do with the hacking of ._jsonld in the Subcrate class. More generally, I'm concerned with the mixing of interfaces (some Dataset and some ROCrate behavior) that happens in the Subcrate class. The role of Subcrate is not clear cut, it sounds very weird that a Subcrate object has a subcrate attribute:

>>> from rocrate.rocrate import ROCrate
>>> crate = ROCrate("test/test-data/crate_with_subcrate", parse_subcrate=True)
>>> subcrate = crate.get("subcrate/")
>>> subcrate.subcrate
>>> subcrate.get("conformsTo")
>>> subcrate.subcrate
<rocrate.rocrate.ROCrate object at 0x7b6edf975160>
>>> subcrate.subcrate.subcrate_entities
[<subsubcrate/ Dataset>]
>>> subcrate.get("conformsTo")
>>> 

The new structure looks very confusing. I did not have the time to work on it yet, but I have the feeling that subcrate support could / should be done without using a special Subcrate class: a subcrate should be of type ROCrate, like the main one, while the subcrate attribute might be added directly to the Dataset class.

The main thing I would like to point out about the current solution is that it hacks critical sections of the code, so getting it right is harder than it looks.

@LauLauThom
Copy link
Contributor Author

Thanks for the feedback Simone, I added a couple changes.

First I made sure any attribute on the original dataset entity (such as conformsTo) is conserved.
I also renamed the Subcrate.subcrate attribute to Subcrate._crate to avoid confusion, and added a get_crate getter for this attribute.

I also changed a bit the behaviour, such that the Subcrate entity behaves more like a Dataset, removing for instance the Subcrate.get_entities.

The crate.write does not raise an exception anymore, but the content of subcrate is not written. I will have a look at it at some point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants