Add `Varint` type for variable-length integer encoding #1229

snazy · 2025-03-20T19:38:43Z

It's a separate module to keep it isolated from unrelated code.

eric-maynard · 2025-03-20T19:44:09Z

tools/varint/src/main/java/org/apache/polaris/tools/varint/VarInt.java

+    }
+  }
+
+  public static ByteBuffer putVarInt(ByteBuffer b, long v) {


I don't know how I feel about maintaining this code. Can we use a library for this?

snazy · 2025-03-20T20:01:06Z

For context: this is a prerequisite for #1189. It's quite isolated.

flyrain · 2025-03-20T21:24:30Z

tools/varint/build.gradle.kts

+  testFixturesApi(libs.assertj.core)
+}
+
+description = "Provides variable length integer encoding"


Do we need to create a new module for just one class? I'm not against having multiple modules, but I'd like to understand the benefits.

My concerns are:

Increased Complexity – If we continue this approach, we'll end up with significantly more modules to maintain, which could add overhead.

Impact on Downstream Users – More modules mean more dependencies, potentially making it harder for downstream users, who will have to manage a lot of Polaris jars rather than just a few of them.

Would love to hear the rationale behind this approach!

The goal here is to have one dependency that does not have unnecessary other dependencies, kryo for example comes with a bunch of other dependencies (so "even more jars").

The opposite, not sure if you're proposing that, is to have monoliths. The disadvantage is what we currently have in the code base - a bunch of unrelated things that depend on each other.

I don't see an impact on downstream users because:
a) versions are consistently managed via the bom
b) dependencies are transitive, not manual - unless you're using the ant-ique build tool, you're fine.

My concern here is more about the maintenance overhead than the impact on downstream users. Of course, if we write perfect bug-free code, downstream users are not impacted. But if we can avoid a new module, and indeed even new code to maintain, I think we should explore that option.

> The goal here is to have one dependency that does not have unnecessary other dependencies, kryo for example comes with a bunch of other dependencies (so "even more jars").

It is unlikely that a downstream project would depend only on this module. If that were the goal, I'd recommend contributing the code to another Apache project such as Apache Commons (https://commons.apache.org/). In a lot of real use cases, a downstream project will likely depend on modules like polaris-core. Plus, given that this module only depends on libs.guava, can we put it in polaris-core, which has guava already?

Please note that this one is used by a bunch of modules in #1189. This piece is "tailor made" for those use cases. Kryo's serialization is different, and brings full-blown object mapping + alternative logging, which is far too much.

polaris-core already has way too many things: it depends on Iceberg and public API modules, things on which persistence work should really not depend.

@eric-maynard : by "appropriate" do you mean somewhere under /persistence? That would work for me.

If only one module needs this, I think this code can probably live in that module for now

I see... but as discussed in the persistence community meeting, I thought people wanted the NoSQL code to arrive in smaller, easier to review chunks 🤔

Small PRs do not mean many small modules.

I won't object to folding Varint into one of the modules that come later. Perhaps we could have the module skeleton in this PR and add other classes later... Does it work for you, @snazy ?

snazy · 2025-03-22T09:54:58Z

Rebased+force pushed - too many conflicts for a merge.

RussellSpitzer · 2025-03-27T21:46:55Z

tools/varint/src/main/java/org/apache/polaris/tools/varint/VarInt.java

+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.polaris.tools.varint;


I'm mainly concerned about having these in a new module and in the o.a.polaris.tools namespace. If this is for enabling the connector for Mongo, shouldn't it live in that namespace? Otherwise we have to consider the general utility to the entire project and the burden we take on by exposing these public classes and APIs. If we used an already existing library or placed it in a nessie/mongo-specific package, that wouldn't be required.

RussellSpitzer · 2025-03-27T21:54:13Z

tools/varint/src/test/java/org/apache/polaris/tools/varint/TestVarInt.java

+
+  @ParameterizedTest
+  @MethodSource
+  public void varInt(long value, byte[] binary) {


This feels like we aren't really flexing the code base here. I would assume we would want to be doing a series of ser/de round trips where we serialize and deserialize random positive integers and prove that the same value goes in and out. Because the operations here are so lightweight, we should be able to fuzz hundreds of thousands of values in a test that takes milliseconds, right?

If we go with random, I believe it is important to have a fixed seed for repeatable tests.

I mildly disagree about not flexing the code. I do not think it is necessary to use many test values, but we probably should use values of all possible lengths (current tests appear to miss some lengths, indeed).

That's reasonable. For example, I know some other projects that just test every possible bit length and then ±1, i.e.:
for i in 0 .. MAXLENGTH * 8
  test 2^i
  test 2^i + 1
  test 2^i - 1
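
To make the two testing ideas in this thread concrete, here is a minimal sketch of a round-trip check that covers every power-of-two boundary (2^i and 2^i ± 1) plus a fixed-seed random sample. It is not the test code from this PR. The class `VarIntRoundTripSketch` and its `encode`/`decode` helpers are hypothetical stand-ins that assume a 7-bits-per-byte encoding of non-negative `long` values, which is what the test vectors in this PR suggest; the real `VarInt` API may differ.

```java
import java.util.Arrays;
import java.util.Random;

final class VarIntRoundTripSketch {
  // Hypothetical encoder (assumption, not the PR's implementation):
  // 7 payload bits per byte, low-order group first, 0x80 marks "more bytes follow".
  static byte[] encode(long v) {
    if (v < 0) {
      throw new IllegalArgumentException("negative values are out of scope for this sketch");
    }
    byte[] tmp = new byte[9]; // 63 bits / 7 bits per byte = at most 9 bytes
    int n = 0;
    while (v > 0x7f) {
      tmp[n++] = (byte) ((v & 0x7f) | 0x80);
      v >>>= 7;
    }
    tmp[n++] = (byte) v;
    return Arrays.copyOf(tmp, n);
  }

  // Hypothetical decoder matching the encoder above.
  static long decode(byte[] bytes) {
    long v = 0;
    int shift = 0;
    for (byte b : bytes) {
      v |= (long) (b & 0x7f) << shift;
      if ((b & 0x80) == 0) {
        return v;
      }
      shift += 7;
    }
    throw new IllegalArgumentException("truncated varint");
  }

  static void roundTrip(long v) {
    long back = decode(encode(v));
    if (back != v) {
      throw new AssertionError("round trip failed for " + v + ", got " + back);
    }
  }

  // Boundary values: 2^i - 1, 2^i and 2^i + 1 for every bit position of a non-negative long.
  static int checkBoundaries() {
    int checked = 0;
    for (int i = 0; i < 63; i++) {
      long p = 1L << i;
      for (long v : new long[] {p - 1, p, p + 1}) {
        roundTrip(v);
        checked++;
      }
    }
    return checked;
  }

  // Random values from a fixed seed, so a failure is reproducible.
  // The random right shift spreads the samples across all encoded lengths.
  static int checkRandom(long seed, int count) {
    Random random = new Random(seed);
    for (int k = 0; k < count; k++) {
      roundTrip((random.nextLong() & Long.MAX_VALUE) >>> random.nextInt(63));
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println("boundary values checked: " + checkBoundaries());
    System.out.println("random values checked: " + checkRandom(42L, 100_000));
  }
}
```

In a JUnit setup the same values could feed a `@ParameterizedTest` through a `@MethodSource`, as the existing test does; the plain `main` method here only keeps the sketch self-contained.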

snazy · 2025-03-28T09:31:37Z

Moved the module and updated the package name and added some more test cases.

dimas-b · 2025-03-29T02:20:21Z

...nents/persistence/varint/src/test/java/org/apache/polaris/persistence/varint/TestVarInt.java

+        // 49 bits -> 7 x 7 bits
+        arguments(0x1ffffffffffffL, new byte[] {-1, -1, -1, -1, -1, -1, 127}),
+        // 56 bits -> 8 x 7 bits
+        arguments(0xffffffffffffffL, new byte[] {-1, -1, -1, -1, -1, -1, -1, 127}),


might be worth having a test with zeros (and small numbers) in the middle 🤔

0 doesn't make sense - that terminates the varint.

I meant zeros in the hex representation :)
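
For readers following this thread, the byte patterns in the test cases above are consistent with an encoding that stores 7 payload bits per byte, low-order group first, with the high bit set on every byte except the last. The following is a minimal sketch under that assumption, not the `VarInt` class from this PR; `VarIntSketch` and its `encode` method are hypothetical names. It reproduces the two vectors from the diff and shows a value with zeros in the middle of its hex representation.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class VarIntSketch {
  // Hypothetical encoder (assumption): 7 payload bits per byte, low-order group first,
  // 0x80 set on every byte except the last one.
  static byte[] encode(long v) {
    if (v < 0) {
      throw new IllegalArgumentException("negative values are out of scope for this sketch");
    }
    ByteBuffer b = ByteBuffer.allocate(9); // 63 bits / 7 bits per byte = at most 9 bytes
    while (v > 0x7f) {
      b.put((byte) ((v & 0x7f) | 0x80)); // continuation bit: more bytes follow
      v >>>= 7;
    }
    b.put((byte) v); // final byte, high bit clear
    return Arrays.copyOf(b.array(), b.position());
  }

  public static void main(String[] args) {
    // The two vectors quoted in the diff above.
    System.out.println(Arrays.toString(encode(0x1ffffffffffffL)));
    // prints [-1, -1, -1, -1, -1, -1, 127]
    System.out.println(Arrays.toString(encode(0xffffffffffffffL)));
    // prints [-1, -1, -1, -1, -1, -1, -1, 127]

    // A value with zeros in the middle of its hex representation:
    // the all-zero 7-bit groups still carry the continuation bit (0x80 = -128).
    System.out.println(Arrays.toString(encode(0x1000000001L)));
    // prints [-127, -128, -128, -128, -128, 2]
  }
}
```

Under this assumed scheme, a byte whose high bit is clear ends the value, so all-zero groups in the middle are emitted as `0x80` rather than `0x00`. That is why such inputs are a useful extra test case.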

dimas-b

Thanks for the updates, @snazy . Test coverage LGTM given the encoding/decoding logic.

eric-maynard · 2025-04-01T16:50:15Z

bom/build.gradle.kts

@@ -33,6 +33,7 @@ dependencies {
    api(project(":polaris-immutables"))
    api(project(":polaris-misc-types"))
    api(project(":polaris-version"))
+    api(project(":polaris-persistence-varint"))


This still seems outside the scope of one particular persistence implementation that we expect will need these types

edit: Basically the same as @RussellSpitzer's comment here

I was thinking something like

:polaris-persistenance-nosql

or a sub module within that but basically an isolated part of the nosql layer. Just so it's absolutely clear what code is only for that adapter

:polaris-persistenance-nosql looks very broad... I hope we're not putting all of NoSQL-related code in the same module :)

> Just so it's absolutely clear what code is only for that adapter

What could be a disadvantage of reusing this module outside the noSQL Persistence impl?

Basically my thought is that Varint serialization is not a core competency of the Polaris project; it's just a utility that is specific to the current approach's NoSQL implementation. I think it's ok to get in if it's just something that the Mongo impl wants to use in a separate codebase, but if this is something we wanted to use generally I would strongly lean towards finding an existing library rather than rolling our own.

I think we would have a different discussion if this was an analytics engine, or something that obviously needed its own varint impl, but I'm not sure why a Catalog needs its own varint impl. That said, I'm willing to include one if it's isolated from the rest of the code base.

As for modules, I don't think there really should be an issue with keeping the whole Mongo impl in its own module. I think at least everything should be in the same Java package unless there is a proven need to expose a public API to the rest of the project. That's just my gut feeling here.

Fair enough. By that logic it might be best to keep Varint inside the (future) module that actually needs it (not separate jar artifacts). @snazy WDYT?

If that works for everybody, and since this PR was reviewed, how about we merge it after renaming the module to :polaris-persistenance-nosql-varint, and when the bulk of NoSQL persistence arrives we fold it into the module that needs it?

snazy requested review from adutra, ashvina and dennishuo as code owners March 20, 2025 19:38

github-project-automation bot added this to Basic Kanban Board Mar 20, 2025

snazy requested review from dimas-b, eric-maynard, jackye1995, jbonofre, vvcephei, collado-mike, RussellSpitzer, takidau, MonkeyCanCode, flyrain and ebyhr as code owners March 20, 2025 19:38

github-project-automation bot moved this to PRs In Progress in Basic Kanban Board Mar 20, 2025

eric-maynard reviewed Mar 20, 2025

flyrain reviewed Mar 20, 2025

dimas-b reviewed Mar 20, 2025

snazy force-pushed the add-varint branch 2 times, most recently from 469fc70 to 164fe7a Compare March 22, 2025 09:49

dimas-b previously approved these changes Mar 24, 2025

github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Mar 24, 2025

snazy force-pushed the add-varint branch from 164fe7a to aaa6f06 Compare March 26, 2025 07:31

RussellSpitzer reviewed Mar 27, 2025

dimas-b reviewed Mar 29, 2025

snazy dismissed dimas-b’s stale review via 10b552f March 30, 2025 05:38

dimas-b previously approved these changes Mar 31, 2025

eric-maynard reviewed Apr 1, 2025

Add Varint type for variable-length integer encoding

9dbe761

snazy dismissed dimas-b’s stale review via 9dbe761 April 7, 2025 15:11

snazy force-pushed the add-varint branch from 52967bf to 9dbe761 Compare April 7, 2025 15:11

dimas-b approved these changes Apr 15, 2025

snazy merged commit e5e173f into apache:main Apr 15, 2025
5 checks passed

github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Apr 15, 2025

snazy deleted the add-varint branch April 15, 2025 15:14
