-
Notifications
You must be signed in to change notification settings - Fork 220
Add Varint
type for variable-length integer encoding
#1229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
} | ||
} | ||
|
||
public static ByteBuffer putVarInt(ByteBuffer b, long v) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how I feel about maintaining this code. Can we use a library for this?
For context: this is a per-requisite for #1189. It's quite isolated. |
tools/varint/build.gradle.kts
Outdated
testFixturesApi(libs.assertj.core) | ||
} | ||
|
||
description = "Provides variable length integer encoding" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to create a new module for just one class? I'm not against having multiple modules, but I'd like to understand the benefits.
My concerns are:
- Increased Complexity – If we continue this approach, we'll end up with significantly more modules to maintain, which could add overhead.
- Impact on Downstream Users – More modules mean more dependencies, potentially making it harder for downstream users who will have to manage a lot of Polaris jars, other than just a few of them.
Would love to hear the rationale behind this approach!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal here is to have one dependency that does not have unnecessary other dependencies, kryo for example comes with a bunch of other dependencies (so "even more jars").
The opposite, not sure if you're proposing that, is to have monoliths. The disadvantage is what we currently have in the code base - a bunch of unrelated things that depend on each other.
I don't see an impact on downstream users because:
a) versions are consistently managed via the bom
b) dependencies are transitive, not manual - unless you're using the ant-ique build tool, you're fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern here is more about the maintenance overhead rather than the impact on downstream users. Of course, if we write perfect bug-free code downstream users are not impacted. But if we can avoid a new module, and indeed even new code to maintain, I think we should explore that option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal here is to have one dependency that does not have unnecessary other dependencies, kryo for example comes with a bunch of other dependencies (so "even more jars").
The chance of a downstream project depends only on this module is rare. Otherwise, I'd recommend to contribute the code to the other Apache projects like Apache Common(https://commons.apache.org/). In a lot of real use cases, a downstream project will likely depend on modules like polaris-core
. Plus, given that this module only depends on libs.guava
, can we put it in polaris-core
which has guava already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note that this one is used by a bunch of modules in #1189. This piece is "tailor made" for those use cases. Kryo's serialization is different, and brings full-blown object mapping + alternative logging - quite too much.
polaris-core
already has way too may things - depends on Iceberg and public API modules - things on which persistence work should really not depend on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eric-maynard : by "appropriate" do you mean somewhere under /persistence
? That would work for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If only one module needs this, I think this code can probably live in that module for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see... but as discussed in the persistence community meeting, I thought people wanted the NoSQL code to arrive in smaller, easier to review chunks 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small PRs does not mean many small modules
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I won't object to folding Varint
into one of the modules that come later. Perhaps we could have the module skeleton in this PR and add other classes later... Does it work for you, @snazy ?
tools/varint/src/main/java/org/apache/polaris/tools/varint/VarInt.java
Outdated
Show resolved
Hide resolved
tools/varint/src/main/java/org/apache/polaris/tools/varint/VarInt.java
Outdated
Show resolved
Hide resolved
469fc70
to
164fe7a
Compare
Rebased+force pushed - too many conflicts for a merge. |
* specific language governing permissions and limitations | ||
* under the License. | ||
*/ | ||
package org.apache.polaris.tools.varint; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm mainly concerned about having these in a new module and in the o.a.polaris.tools namespace. If this is for enabling the connector for Mongo shouldn't it live in that namespace? Otherwise we have to consider the general utility to the entire project and the burden we take on by exposing these public classes and apis? If we did use an already existing library or placed it in a nessie/mongo specific package that wouldn't be required
|
||
@ParameterizedTest | ||
@MethodSource | ||
public void varInt(long value, byte[] binary) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like we aren't really flexing the code base here. I would assume we would want to be doing a series of ser/de where we serialize and deserialize random positive integers and prove that the same value goes in and out. Because of the lightness of the operations here we should be able to fuzz hundreds of thousands of values in a ms duration test right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we go with random, I believe it is important to have a fixed seed for repeatable tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mildly disagree about not flexing the code. I do not think it is necessary to use many test values, but we probably should use values of all possible lengths (current tests appear to miss some lengths, indeed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's reasonable, For example I know some other projects that just test every possible bit length int and then +-1 IE
For i in MAXLENGTH * 8
test 2^i
test 2^i + 1
test 2^i - 1
Moved the module and updated the package name and added some more test cases. |
// 49 bits -> 7 x 7 bits | ||
arguments(0x1ffffffffffffL, new byte[] {-1, -1, -1, -1, -1, -1, 127}), | ||
// 56 bits -> 8 x 7 bits | ||
arguments(0xffffffffffffffL, new byte[] {-1, -1, -1, -1, -1, -1, -1, 127}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be worth having a test with zeros (and small numbers) in the middle 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0 doesn't make sense - that terminates the varint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant zeros in the hex representation :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates, @snazy . Test coverage LGTM given the encoding/decoding logic.
@@ -33,6 +33,7 @@ dependencies { | |||
api(project(":polaris-immutables")) | |||
api(project(":polaris-misc-types")) | |||
api(project(":polaris-version")) | |||
api(project(":polaris-persistence-varint")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still seems outside the scope of one particular persistence implementation that we expect will need these types
edit: Basically the same as @RussellSpitzer's comment here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking something like
:polaris-persistenance-nosql
or a sub module within that but basically an isolated part of the nosql layer. Just so it's absolutely clear what code is only for that adapter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:polaris-persistenance-nosql
looks very broad... I hope we're not putting all of NoSQL-related code in the same module :)
Just so it's absolutely clear what code is only for that adapter
What could be a disadvantage of reusing this module outside the noSQL Persistence impl?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically my thought is that Varint serialization is not a core competency of the Polaris project, it's just a utility that is specific to the current approach's NoSql implementation. I think it's ok to get in if it's just something that the Mongo impl wants to use in a separate codebase, but if this is something we wanted to use generally I would strongly lean towards finding an existing library rather than rolling our own.
I think we would have a different discussion if this was a analytics engine, or something that obviously needed it's own varint impl, but i'm not sure why a Catalog needs it's own varint impl. That said, I'm willing to include one if it's isolated from the rest of the code base.
As for modules, I don't think there really should be an issue with keeping the whole Mongo impl in it's own module. I think at least everything should be in the same Java package unless there is a proven need to expose a public api to the rest of the project. That's just my gut feeling here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. By that logic it might be best to keep Varint
inside the (future) module that actually needs it (not separate jar artifacts). @snazy WDYT?
If that works for everybody, and since this PR was reviewed, how about we merge it with renaming the module to :polaris-persistenance-nosql-varint
and when the bulk of NoSQL persistence comes we'll fold it into the module that needs it?
It's a separate module to separate things.