[ntuple] Add partitioning to RNTupleJoinTable #17919

Open
wants to merge 4 commits into
base: master

enirolf
Contributor

@enirolf enirolf commented Mar 7, 2025

With this addition, the RNTupleJoinTable can be split up into several mappings from join values to entry numbers, according to some (numeric) partition key. Using a partition key is optional; a default partition key is used when none has been specified by the user.

The RNTupleJoinTable now has an inner class, REntryMapping, which in practice now represents what the RNTupleJoinTable represented previously, i.e., a mapping from join field values to entry numbers:

entryMapping
|- {x_0, y_0} -> 42
|- {x_1, y_1} -> 99
|- {x_2, y_2} -> 12
|- ...

The join table itself now instead provides a mapping from partition keys to (a collection of) these entry mappings:

joinTable
|- 0
   |- entryMapping_0
      |- {x_0, y_0} -> 42
      |- ...
|- 4
   |- entryMapping_0
      |- {x_0, y_0} -> 99
      |- ...
   |- entryMapping_1
      |- ...
|- ...
|- kDefaultPartitionKey
   |- entryMapping_0
      |- {x_0, y_0} -> 12
      |- ...
   |- ...
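
The two-level layout above can be sketched with standard containers. This is only an illustration under stated assumptions: `ToyJoinTable`, `ToyEntryMapping`, the type aliases, and the value chosen for `kDefaultPartitionKey` are hypothetical and are not the declarations from this PR; an ordered `std::map` is used for the inner mapping so that no custom hash is needed.

```cpp
#include <cstdint>
#include <map>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; not the actual ROOT declarations.
using JoinValue_t = std::uint64_t;
using PartitionKey_t = std::uint64_t;
using EntryIndex_t = std::uint64_t;

// One entry mapping: a combination of join values -> the entry numbers holding it.
using ToyEntryMapping = std::map<std::vector<JoinValue_t>, std::vector<EntryIndex_t>>;

class ToyJoinTable {
public:
   // Assumed sentinel value for illustration only.
   static constexpr PartitionKey_t kDefaultPartitionKey = static_cast<PartitionKey_t>(-1);

   // Register an already-built mapping under a partition key.
   // A partition may hold several mappings.
   void Add(ToyEntryMapping mapping, PartitionKey_t key = kDefaultPartitionKey)
   {
      fPartitions[key].push_back(std::move(mapping));
   }

   // Collect the entry numbers matching `values` from every mapping of one partition.
   // Returns an empty vector when the partition or the join values are unknown.
   std::vector<EntryIndex_t>
   GetEntryIndexes(const std::vector<JoinValue_t> &values, PartitionKey_t key = kDefaultPartitionKey) const
   {
      std::vector<EntryIndex_t> result;
      auto partition = fPartitions.find(key);
      if (partition == fPartitions.end())
         return result;
      for (const auto &mapping : partition->second) {
         auto match = mapping.find(values);
         if (match != mapping.end())
            result.insert(result.end(), match->second.begin(), match->second.end());
      }
      return result;
   }

private:
   // Partition key -> collection of entry mappings.
   std::unordered_map<PartitionKey_t, std::vector<ToyEntryMapping>> fPartitions;
};
```

In this sketch, adding two mappings without a key places both in the default partition, which corresponds to the "no partitions are used" case.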

The reason one partition can contain multiple entry mappings is twofold:

  1. This is needed when only the default partition key is used (in other words, when no partitions are used).
  2. Less relevant for now, but we could foresee cases where the partition keys are based on some (meta-data) attribute that is shared across more than one page source.

The most immediate use case of the partitioning approach added in this PR is that the RNTupleJoinTable itself is no longer restricted to one page source (this restriction now applies to REntryMapping instead). This is useful for the integration into the RNTupleProcessor, where we want to be able to create joins of chains of RNTuples and therefore have to deal with more than one page source (see also #17132).

As a side-effect, the state management of the join table and the notion of lazy building has changed. There is no single isBuilt state anymore, and the Add function eagerly builds the mapping for the provided page source and adds it to the join table. As such, the responsibility of deciding whether to eagerly or lazily build the join table is moved to the application using the join table (i.e. by strategically calling Add).
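
As a rough illustration of what "strategically calling Add" could look like on the application side, the sketch below defers building until a partition is first queried. Everything in it is hypothetical: `ToyLazyJoin`, `Register`, `Lookup`, and the `ToySource` callable (a stand-in for a page source that yields a mapping when read) are not part of the join table's interface, and a single join value is used instead of a combination.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified types for illustration only.
using JoinValue_t = std::uint64_t;
using PartitionKey_t = std::uint64_t;
using EntryIndex_t = std::uint64_t;
using ToyEntryMapping = std::map<JoinValue_t, std::vector<EntryIndex_t>>;
// Stand-in for a page source: calling it "reads" the join field and returns the mapping.
using ToySource = std::function<ToyEntryMapping()>;

class ToyLazyJoin {
public:
   // Lazy side: only remember the source, build nothing yet.
   void Register(PartitionKey_t key, ToySource source) { fPending[key].push_back(std::move(source)); }

   // Builds the requested partition on first use, then performs the lookup.
   std::vector<EntryIndex_t> Lookup(JoinValue_t value, PartitionKey_t key)
   {
      BuildPartition(key);
      std::vector<EntryIndex_t> result;
      auto built = fBuilt.find(key);
      if (built == fBuilt.end())
         return result;
      for (const auto &mapping : built->second) {
         auto match = mapping.find(value);
         if (match != mapping.end())
            result.insert(result.end(), match->second.begin(), match->second.end());
      }
      return result;
   }

private:
   // Eager side: plays the role of `Add`, which builds a mapping as soon as it is called.
   void BuildPartition(PartitionKey_t key)
   {
      auto pending = fPending.find(key);
      if (pending == fPending.end())
         return;
      for (const auto &source : pending->second)
         fBuilt[key].push_back(source());
      fPending.erase(pending);
   }

   std::unordered_map<PartitionKey_t, std::vector<ToySource>> fPending;
   std::unordered_map<PartitionKey_t, std::vector<ToyEntryMapping>> fBuilt;
};
```

An application that prefers eager building would simply skip the pending list and build every mapping up front.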

@enirolf enirolf self-assigned this Mar 7, 2025
@enirolf enirolf force-pushed the ntuple-join-table-partitions branch 2 times, most recently from 6834e5a to 7ebbde6 Compare March 7, 2025 16:41

github-actions bot commented Mar 7, 2025

Test Results

    20 files      20 suites   4d 23h 29m 30s ⏱️
 2 726 tests  2 725 ✅ 0 💤 1 ❌
52 591 runs  52 590 ✅ 0 💤 1 ❌

For more details on these failures, see this check.

Results for commit 3a3fa99.

♻️ This comment has been updated with latest results.

@enirolf enirolf force-pushed the ntuple-join-table-partitions branch from 7ebbde6 to 4cc1604 Compare March 10, 2025 14:05
Member

@vepadulano vepadulano left a comment

Nice improvement! Some minor comments for now

Comment on lines 59 to 62
RCombinedJoinFieldValue(const std::vector<NTupleJoinFieldValue_t> &fieldValues)
{
fFieldValues.reserve(fieldValues.size());
fFieldValues = fieldValues;
}
Member

Shouldn't this be equivalent?

Suggested change
RCombinedJoinFieldValue(const std::vector<NTupleJoinFieldValue_t> &fieldValues)
{
fFieldValues.reserve(fieldValues.size());
fFieldValues = fieldValues;
}
RCombinedJoinFieldValue(const std::vector<NTupleJoinFieldValue_t> &fieldValues): fFieldValues(fieldValues)
{
}
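
A small stand-alone check of the equivalence in question, using two hypothetical structs (`ReserveThenAssign` and `InitList`, with `JoinValue_t` as an assumed element type) rather than the class from the PR. Both constructors end up with the same contents; the member initializer list copy-constructs the vector directly, without first default-constructing it and reserving separately.

```cpp
#include <cstdint>
#include <vector>

using JoinValue_t = std::uint64_t; // assumed element type, for illustration only

// Mirrors the original constructor body: reserve, then copy-assign.
struct ReserveThenAssign {
   std::vector<JoinValue_t> fFieldValues;
   explicit ReserveThenAssign(const std::vector<JoinValue_t> &fieldValues)
   {
      fFieldValues.reserve(fieldValues.size());
      fFieldValues = fieldValues;
   }
};

// Mirrors the suggested change: copy-construct in the member initializer list.
struct InitList {
   std::vector<JoinValue_t> fFieldValues;
   explicit InitList(const std::vector<JoinValue_t> &fieldValues) : fFieldValues(fieldValues) {}
};

// True when both construction styles yield the same stored values.
bool SameContents(const std::vector<JoinValue_t> &values)
{
   return ReserveThenAssign(values).fFieldValues == InitList(values).fFieldValues;
}
```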

Comment on lines 56 to 57
class RCombinedJoinFieldValue {
public:
Member

Maybe struct better represents the class layout in this case?

/// https://www.boost.org/doc/libs/1_55_0/doc/html/hash/reference.html#boost.hash_combine).
struct RCombinedJoinFieldValueHash {
inline std::size_t operator()(const RCombinedJoinFieldValue &joinFieldValue) const
/// The mapping itself. Maps field values (or combinations thereof in case the join key is a composed of multiple
Member

Suggested change
/// The mapping itself. Maps field values (or combinations thereof in case the join key is a composed of multiple
/// The mapping itself. Maps field values (or combinations thereof in case the join key is composed of multiple

switch (*run) {
// For run == 0, no entry indexes are expected because they were added using the default partition key
case 0: EXPECT_EQ(entryIdxs.size(), 0ul); break;
// For run == 0, we expect exactly one entry index
Member

Suggested change
// For run == 0, we expect exactly one entry index
// For run == 1, we expect exactly one entry index

@enirolf enirolf force-pushed the ntuple-join-table-partitions branch from 4cc1604 to 8e2228d Compare March 10, 2025 14:58
@enirolf enirolf marked this pull request as ready for review March 10, 2025 15:47
@enirolf enirolf requested a review from jblomer as a code owner March 10, 2025 15:47
@enirolf enirolf requested a review from vepadulano March 10, 2025 15:48
Contributor

@silverweed silverweed left a comment

A couple of minor comments

@enirolf enirolf force-pushed the ntuple-join-table-partitions branch from 8e2228d to cdcfb92 Compare March 11, 2025 14:06
@enirolf enirolf requested a review from silverweed March 11, 2025 14:06
Member

@hahnjo hahnjo left a comment

I left many comments. The answer to some of them may well be that it's motivated by future changes. In that case please feel free to simply resolve the thread(s).

Comment on lines 228 to 233
/// \return The entry numbers that correspond to `valuePtrs`. When there are no corresponding entries, an empty
/// vector is returned.
std::vector<ROOT::NTupleSize_t> GetEntryIndexes(const std::vector<void *> &valuePtrs) const
{
return GetEntryIndexes(valuePtrs, fPartitionKeys);
}
Member

Is this useful in case of multiple partitions, since the return value loses the information about which mapping the indexes are relative to? Maybe what we want instead is GetEntryIndexes(const std::vector<void *> &valuePtrs, PartitionKey_t partitionKey = kDefaultPartitionKey)?

Contributor Author

The issue I see with that interface is that it may give false negatives when there are mappings in partitions other than the default. I think it's okay to lose information here; if one needs this information, they should instead use GetEntryIndexes(const std::vector<void *> &valuePtrs, PartitionKey_t partitionKey) to be able to track the partitions.

Member

Yes, I now realize that the information is equally lost for GetEntryIndexes(const std::vector<void *> &valuePtrs, const std::vector<PartitionKey_t> &partitionKeys). I'm still not sure how useful this is (at the moment) since a mapping spans an entire page source, but I guess this will change in the future.

@enirolf enirolf force-pushed the ntuple-join-table-partitions branch from cdcfb92 to 130714e Compare March 18, 2025 14:21
@@ -35,57 +35,98 @@ namespace Internal {
// clang-format on
class RNTupleJoinTable {
public:
using NTupleJoinValue_t = std::uint64_t;
using JoinValue_T = std::uint64_t;
Member

Suggested change
using JoinValue_T = std::uint64_t;
using JoinValue_t = std::uint64_t;

(that's what the commit message says...)

@enirolf enirolf force-pushed the ntuple-join-table-partitions branch 2 times, most recently from 1e7f395 to 3a3fa99 Compare March 19, 2025 10:26
@enirolf enirolf requested a review from hahnjo March 19, 2025 13:23
Member

@hahnjo hahnjo left a comment

LGTM in principle, thanks for making all the changes. I have one final question about the semantics of GetPartitionedEntryIndexes, and what it should return in case there is no index in a partition.

Comment on lines 152 to 154
auto entriesForPartition = GetEntryIndexes(valuePtrs, partitionKey);
entryIdxs[partitionKey].insert(entryIdxs[partitionKey].end(), entriesForPartition.begin(),
entriesForPartition.end());
Member

Should this check

Suggested change
auto entriesForPartition = GetEntryIndexes(valuePtrs, partitionKey);
entryIdxs[partitionKey].insert(entryIdxs[partitionKey].end(), entriesForPartition.begin(),
entriesForPartition.end());
auto entriesForPartition = GetEntryIndexes(valuePtrs, partitionKey);
if (!entriesForPartition.empty())
entryIdxs[partitionKey].insert(entryIdxs[partitionKey].end(), entriesForPartition.begin(),
entriesForPartition.end());

? The question is what do we want to return if there is no index in a particular partition: an empty vector or omit the key from the map. The other overload seems to do the latter, and that's also what I would semantically expect. (we should probably add a test for this overload, if I'm not mistaken)

Contributor Author

Oh yeah, thanks for catching this! I think for now it's indeed fine to use the latter behavior and let time and future developments tell whether this was the right intuition. I've expanded the tests as well.
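
The difference between the two behaviors discussed in this thread comes from `operator[]` on the result map, which creates a key on access. The sketch below is a minimal illustration with hypothetical names (`CollectUnguarded`, `CollectGuarded`, `ToyLookups`); the input stands in for the per-partition lookup results and is not how the PR obtains them.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified types for illustration only.
using PartitionKey_t = std::uint64_t;
using EntryIndex_t = std::uint64_t;
using PartitionedIndexes = std::unordered_map<PartitionKey_t, std::vector<EntryIndex_t>>;
// Stand-in for "result of looking up the join values in each partition".
using ToyLookups = std::vector<std::pair<PartitionKey_t, std::vector<EntryIndex_t>>>;

// Without the check: every queried partition ends up as a key, possibly with an empty vector.
PartitionedIndexes CollectUnguarded(const ToyLookups &lookups)
{
   PartitionedIndexes out;
   for (const auto &lookup : lookups) {
      auto &dest = out[lookup.first]; // operator[] inserts the key if it is missing
      dest.insert(dest.end(), lookup.second.begin(), lookup.second.end());
   }
   return out;
}

// With the check: partitions without matches are omitted from the result.
PartitionedIndexes CollectGuarded(const ToyLookups &lookups)
{
   PartitionedIndexes out;
   for (const auto &lookup : lookups) {
      if (lookup.second.empty())
         continue;
      auto &dest = out[lookup.first];
      dest.insert(dest.end(), lookup.second.begin(), lookup.second.end());
   }
   return out;
}
```

With the guarded variant, a caller can test for matches in a partition with `count(key)` alone, instead of also having to check for an empty vector.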

enirolf added 4 commits March 19, 2025 15:59
With this addition, the join table can be split up into several
mappings from join values to entry numbers, according to some (numeric)
partition key. It has several use cases, but the immediate one is that
with this approach, the join table is not restricted to one page source
anymore. This is useful for the integration into the `RNTupleProcessor`,
where we want to be able to create joins of chains of RNTuples and have
to deal with more than one page source.

As a side-effect, the state management of the join table and the notion
of lazy building has changed. There is no single `isBuilt` state
anymore, and the `Add` function eagerly builds the mapping for the
provided page source and adds it to the join table. As such, the
responsibility of deciding whether to eagerly or lazily build the join
table is moved to the application using the join table (i.e. by
strategically calling `Add`).
Since it is declared within `RNTupleJoinTable`, it is clear that this
type belongs to RNTuple from the context.
@enirolf enirolf force-pushed the ntuple-join-table-partitions branch from 3a3fa99 to 14daddb Compare March 19, 2025 15:01
4 participants