-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ntuple] Add partioning to RNTupleJoinTable
#17919
base: master
Are you sure you want to change the base?
Conversation
6834e5a
to
7ebbde6
Compare
Test Results 20 files 20 suites 5d 7h 5m 26s ⏱️ Results for commit 130714e. ♻️ This comment has been updated with latest results. |
7ebbde6
to
4cc1604
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice improvement! Some minor comments for now
RCombinedJoinFieldValue(const std::vector<NTupleJoinFieldValue_t> &fieldValues) | ||
{ | ||
fFieldValues.reserve(fieldValues.size()); | ||
fFieldValues = fieldValues; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be equivalent?
RCombinedJoinFieldValue(const std::vector<NTupleJoinFieldValue_t> &fieldValues) | |
{ | |
fFieldValues.reserve(fieldValues.size()); | |
fFieldValues = fieldValues; | |
} | |
RCombinedJoinFieldValue(const std::vector<NTupleJoinFieldValue_t> &fieldValues): fFieldValues(fieldValues) | |
{ | |
} |
class RCombinedJoinFieldValue { | ||
public: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe struct
better represents the class layout in this case?
/// https://www.boost.org/doc/libs/1_55_0/doc/html/hash/reference.html#boost.hash_combine). | ||
struct RCombinedJoinFieldValueHash { | ||
inline std::size_t operator()(const RCombinedJoinFieldValue &joinFieldValue) const | ||
/// The mapping itself. Maps field values (or combinations thereof in case the join key is a composed of multiple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// The mapping itself. Maps field values (or combinations thereof in case the join key is a composed of multiple | |
/// The mapping itself. Maps field values (or combinations thereof in case the join key is composed of multiple |
switch (*run) { | ||
// For run == 0, no entry indexes are expected because they were added using the default partition key | ||
case 0: EXPECT_EQ(entryIdxs.size(), 0ul); break; | ||
// For run == 0, we expect exactly one entry index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// For run == 0, we expect exactly one entry index | |
// For run == 1, we expect exactly one entry index |
4cc1604
to
8e2228d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of minor comments
////////////////////////////////////////////////////////////////////////// | ||
/// Container for the combined hashes of join field values. | ||
struct RCombinedJoinFieldValue { | ||
public: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This public
is superfluous
/// \throws RException If the join table has not been built, and can therefore not be used yet. | ||
void EnsureBuilt() const; | ||
/// \brief Get all entry mappings in the join table. | ||
const std::vector<REntryMapping *> GetAllMappings() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you're returning a value, it's better to return it non-const (same thing for the next function)
8e2228d
to
cdcfb92
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left many comments. The answer to some of them may well be that it's motivated by future changes. In that case please feel free to simply resolve the thread(s).
/// \return The entry numbers that correspond to `valuePtrs`. When there are no corresponding entries, an empty | ||
/// vector is returned. | ||
std::vector<ROOT::NTupleSize_t> GetEntryIndexes(const std::vector<void *> &valuePtrs) const | ||
{ | ||
return GetEntryIndexes(valuePtrs, fPartitionKeys); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this useful in case of multiple partitions since the return value loses the information which mapping the indexes are relative to? Maybe what we want instead is GetEntryIndexes(const std::vector<void *> &valuePtrs, PartitionKey_t partitionKey = kDefaultPartitionKey)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue I see with that interface is that it may give false negatives when there are mappings in partitions other than the default. I think it's okay to lose information here, if one needs this information they should instead use (GetEntryIndexes(const std::vector<void *> &valuePtrs, PartitionKey_t partitionKey)
to be able to track the partitions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I now realize that the information is equally lost for GetEntryIndexes(const std::vector<void *> &valuePtrs, const std::vector<PartitionKey_t> &partitionKeys)
. I'm still not sure how useful this is (at the moment) since a mapping spans an entire page source, but I guess this will change in the future.
cdcfb92
to
130714e
Compare
@@ -35,57 +35,98 @@ namespace Internal { | |||
// clang-format on | |||
class RNTupleJoinTable { | |||
public: | |||
using NTupleJoinValue_t = std::uint64_t; | |||
using JoinValue_T = std::uint64_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using JoinValue_T = std::uint64_t; | |
using JoinValue_t = std::uint64_t; |
(that's what the commit message says...)
/// \return The entry numbers that correspond to `valuePtrs`. When there are no corresponding entries, an empty | ||
/// vector is returned. | ||
std::vector<ROOT::NTupleSize_t> GetEntryIndexes(const std::vector<void *> &valuePtrs) const | ||
{ | ||
return GetEntryIndexes(valuePtrs, fPartitionKeys); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I now realize that the information is equally lost for GetEntryIndexes(const std::vector<void *> &valuePtrs, const std::vector<PartitionKey_t> &partitionKeys)
. I'm still not sure how useful this is (at the moment) since a mapping spans an entire page source, but I guess this will change in the future.
/// partition. | ||
/// | ||
/// \return A reference to the updated join table. | ||
RNTupleJoinTable &Add(RPageSource &pageSource, PartitionKey_t partitionKey = kDefaultPartitionKey); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two more options:
4. (variation of 2.) continue to have a single RNTupleJoinTable::Build
, but pass it all page sources at once. This still requires the RNTupleJoinTable
to figure out which entry mapping a page source belonged to - not sure I see how to do this.
5. (variation of 3.) have RNTupleJoinTable::Add
also build the mapping, but lazily construct the entire join table in RNTupleJoinProcessor
. This may also simplify the code a bit since a join table is always built, but it's still deferred from the user perspective.
I currently favor 5. but that's because I thought of it last, so I probably haven't discovered its shortcomings yet 😃
With this addition, the join table can be split up into several mappings from join values to entry numbers, according to some (numeric) partition key. It has several use cases, but the immediate one is that with this approach, the join table is not restricted to one page source anymore. This is useful for the integration into the `RNTupleProcessor`, where we want to be able to create joins of chains of RNTuples and have to deal with more than one page source. As a side-effect, the state management of the join table and the notion of lazy building has changed. There is no single `isBuilt` state anymore, and the `Add` function eagerly builds the mapping for the provided page source and adds it to the join table. As such, the responsibility of deciding whether to eagerly or lazily build the join table is moved to the application using the join table (i.e. by strategically calling `Add`).
Since it is declared within `RNTupleJoinTable`, it is clear that this type belongs to RNTuple from the context.
130714e
to
1e7f395
Compare
With this addition, the
RNTupleJoinTable
can be split up into several mappings from join values to entry numbers, according to some (numeric) partition key. Using a partition key is optional; a default partition key is used when none has been specified by the user.The
RNTupleJoinTable
now has an inner class,REntryMapping
, which in practice now represents what theRNTupleJoinTable
represented previously, i.e., a mapping from join field values to entry numbers:The join table itself now instead provides a mapping from partition keys to (a collection of) these entry mappings:
The reason one partition can contain multiple entry mappings is twofold:
The most immediate use case of the partitioning approach added in this PR, is that this way, the
RNTupleJoinTable
itself is not restricted to one page source anymore (this restriction is now inREntryMapping
instead). This is useful for the integration into theRNTupleProcessor
, where we want to be able to create joins of chains of RNTuples and have to deal with more than one page source (see also #17132).As a side-effect, the state management of the join table and the notion of lazy building has changed. There is no single
isBuilt
state anymore, and theAdd
function eagerly builds the mapping for the provided page source and adds it to the join table. As such, the responsibility of deciding whether to eagerly or lazily build the join table is moved to the application using the join table (i.e. by strategically callingAdd
).