Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ntuple] Split primary and auxiliary models in RNTupleJoinProcessor::CreateJoin #17936

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

enirolf
Copy link
Contributor

@enirolf enirolf commented Mar 10, 2025

To reflect the interface changes introduced in #17964, instead of a single list of models (one per primary/auxiliary ntuple), a distinction is made between the models of the primary and auxiliary ntuples.

Same as before, internally they are combined into one singular model used by the processor. This model has to have the schemas of auxiliary RNTuples as untyped record fields, where the name of the record field is the name of the RNTuple.

A new addition is the added support for partial model specification in the list of auxiliary models. This means that users can pass a null pointer in the position of the corresponding ntuple, and then only for that ntuple, the processor will infer the model from the on-disk descriptor.

Helps move forward #17132.

@enirolf enirolf self-assigned this Mar 10, 2025
@enirolf enirolf requested a review from jblomer as a code owner March 10, 2025 12:49
Copy link

github-actions bot commented Mar 10, 2025

Test Results

    19 files      19 suites   5d 10h 30m 59s ⏱️
 2 726 tests  2 724 ✅ 0 💤 2 ❌
49 935 runs  49 932 ✅ 0 💤 3 ❌

For more details on these failures, see this check.

Results for commit 1965191.

♻️ This comment has been updated with latest results.

Comment on lines 256 to 265
std::vector<RNTupleOpenSpec> auxNTuples{{fNTupleNames[1], fFileNames[1]}, {fNTupleNames[2], fFileNames[2]}};

auto joinModel = RNTupleProcessor::CreateJoinModel(std::move(primaryModel), std::move(auxModels), auxNTuples);

auto i = joinModel->GetDefaultEntry().GetPtr<int>("i");
auto x = joinModel->GetDefaultEntry().GetPtr<float>("x");
auto y = joinModel->GetDefaultEntry().GetPtr<std::vector<float>>(fNTupleNames[1] + ".y");
auto z = joinModel->GetDefaultEntry().GetPtr<float>(fNTupleNames[2] + ".z");

auto proc = RNTupleProcessor::CreateJoin({fNTupleNames[0], fFileNames[0]}, auxNTuples, {"i"}, std::move(joinModel));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface boils down to:

auto primaryModel = RNTupleModel::Create();
std::vector<std::unique_ptr<RNTupleModel>> auxModels;
... fill auxModels ...
std::vector<RNTupleOpenSpec> auxNTuples;
... fill auxNTuples ...
auto joinModel = RNTupleProcessor::CreateJoinModel(std::move(primaryModel), std::move(auxModels), auxNTuples);
auto proc = RNTupleProcessor::CreateJoin({fNTupleNames[0], fFileNames[0]}, auxNTuples, {"i"}, std::move(joinModel));

Feels a bit redundant and/or disjointed (pun not really intended). In particular in the last 2 lines auxNTuples is repeated and (I assume) must be in the exact same state (ie. be the same) during both calls. What happens if you pass 2 completely different vector to both? Couldn't the CreateJoin extract the auxNTuples from the result of CreateJoinModel (in which case the risk of wrong call would be eliminated).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are valid points. I wanted to move away from the current interface where the model of the primary model is bunched together with the ones of the aux ntuples, to reflect also the splitting of the RNTupleOpenSpecs themselves. So the alternative approach I thought of is to have the same pattern (i.e., primaryModel, {auxModels...}) and then only internally call CreateModel. I didn't go in this direction because it would the number of arguments to CreateJoin even longer, but it does remove this redundancy and risk of user error, so maybe I should go with that after all..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Philippe that if we can reduce the redundancy on the user side we should go for it, even if it means making the function signature longer (maybe we should pass an Options struct to it?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this particular case using an Options struct will be just as if not more cumbersome, but I will change the signature to take primaryModel, {auxModels...} then 👍.

@enirolf enirolf marked this pull request as draft March 12, 2025 15:11
@enirolf enirolf force-pushed the ntuple-processor-join-model branch 2 times, most recently from cc09ee6 to 1c01b4e Compare March 12, 2025 16:02
@enirolf enirolf changed the title [ntuple] Require single model in RNTupleJoinProcessor::Create [ntuple] Split primary and auxiliary models in RNTupleJoinProcessor::CreateJoin Mar 13, 2025
@enirolf enirolf force-pushed the ntuple-processor-join-model branch from 1c01b4e to af05f1e Compare March 13, 2025 07:47
@enirolf enirolf marked this pull request as ready for review March 13, 2025 10:16
@enirolf enirolf requested a review from pcanal March 13, 2025 10:17
@enirolf enirolf force-pushed the ntuple-processor-join-model branch 2 times, most recently from 435ae83 to be24fe8 Compare March 18, 2025 09:18
@enirolf enirolf requested a review from hahnjo March 18, 2025 09:19
Copy link
Member

@hahnjo hahnjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One final comment when looking at the code again; we probably want to test a mixture of bare and non-bare models...

enirolf added 6 commits March 18, 2025 15:40
To reflect the interface changes introduced in rev d5b4931, instead of a
single list of models (one per primary/auxiliary ntuple), a distinction
is made between the models of the primary and auxiliary ntuples. Same as
before, internally they are combined into one singular model used by the
processor.
@enirolf enirolf force-pushed the ntuple-processor-join-model branch from be24fe8 to 1965191 Compare March 18, 2025 14:56
Copy link
Member

@hahnjo hahnjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM! (should have a second approval)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants