Skip to content

Add Varint type for variable-length integer encoding #1229

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 15, 2025

Conversation

snazy
Copy link
Member

@snazy snazy commented Mar 20, 2025

It's a separate module to separate things.

}
}

public static ByteBuffer putVarInt(ByteBuffer b, long v) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how I feel about maintaining this code. Can we use a library for this?

@snazy
Copy link
Member Author

snazy commented Mar 20, 2025

For context: this is a per-requisite for #1189. It's quite isolated.

testFixturesApi(libs.assertj.core)
}

description = "Provides variable length integer encoding"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to create a new module for just one class? I'm not against having multiple modules, but I'd like to understand the benefits.

My concerns are:

  1. Increased Complexity – If we continue this approach, we'll end up with significantly more modules to maintain, which could add overhead.
  2. Impact on Downstream Users – More modules mean more dependencies, potentially making it harder for downstream users who will have to manage a lot of Polaris jars, other than just a few of them.

Would love to hear the rationale behind this approach!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal here is to have one dependency that does not have unnecessary other dependencies, kryo for example comes with a bunch of other dependencies (so "even more jars").

The opposite, not sure if you're proposing that, is to have monoliths. The disadvantage is what we currently have in the code base - a bunch of unrelated things that depend on each other.

I don't see an impact on downstream users because:
a) versions are consistently managed via the bom
b) dependencies are transitive, not manual - unless you're using the ant-ique build tool, you're fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern here is more about the maintenance overhead rather than the impact on downstream users. Of course, if we write perfect bug-free code downstream users are not impacted. But if we can avoid a new module, and indeed even new code to maintain, I think we should explore that option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal here is to have one dependency that does not have unnecessary other dependencies, kryo for example comes with a bunch of other dependencies (so "even more jars").

The chance of a downstream project depends only on this module is rare. Otherwise, I'd recommend to contribute the code to the other Apache projects like Apache Common(https://commons.apache.org/). In a lot of real use cases, a downstream project will likely depend on modules like polaris-core. Plus, given that this module only depends on libs.guava, can we put it in polaris-core which has guava already?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that this one is used by a bunch of modules in #1189. This piece is "tailor made" for those use cases. Kryo's serialization is different, and brings full-blown object mapping + alternative logging - quite too much.

polaris-core already has way too may things - depends on Iceberg and public API modules - things on which persistence work should really not depend on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eric-maynard : by "appropriate" do you mean somewhere under /persistence? That would work for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If only one module needs this, I think this code can probably live in that module for now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see... but as discussed in the persistence community meeting, I thought people wanted the NoSQL code to arrive in smaller, easier to review chunks 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small PRs does not mean many small modules

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't object to folding Varint into one of the modules that come later. Perhaps we could have the module skeleton in this PR and add other classes later... Does it work for you, @snazy ?

@snazy snazy force-pushed the add-varint branch 2 times, most recently from 469fc70 to 164fe7a Compare March 22, 2025 09:49
@snazy
Copy link
Member Author

snazy commented Mar 22, 2025

Rebased+force pushed - too many conflicts for a merge.

dimas-b
dimas-b previously approved these changes Mar 24, 2025
@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Mar 24, 2025
* specific language governing permissions and limitations
* under the License.
*/
package org.apache.polaris.tools.varint;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mainly concerned about having these in a new module and in the o.a.polaris.tools namespace. If this is for enabling the connector for Mongo shouldn't it live in that namespace? Otherwise we have to consider the general utility to the entire project and the burden we take on by exposing these public classes and apis? If we did use an already existing library or placed it in a nessie/mongo specific package that wouldn't be required


@ParameterizedTest
@MethodSource
public void varInt(long value, byte[] binary) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like we aren't really flexing the code base here. I would assume we would want to be doing a series of ser/de where we serialize and deserialize random positive integers and prove that the same value goes in and out. Because of the lightness of the operations here we should be able to fuzz hundreds of thousands of values in a ms duration test right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go with random, I believe it is important to have a fixed seed for repeatable tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mildly disagree about not flexing the code. I do not think it is necessary to use many test values, but we probably should use values of all possible lengths (current tests appear to miss some lengths, indeed).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's reasonable, For example I know some other projects that just test every possible bit length int and then +-1 IE
For i in MAXLENGTH * 8
test 2^i
test 2^i + 1
test 2^i - 1

@snazy
Copy link
Member Author

snazy commented Mar 28, 2025

Moved the module and updated the package name and added some more test cases.

// 49 bits -> 7 x 7 bits
arguments(0x1ffffffffffffL, new byte[] {-1, -1, -1, -1, -1, -1, 127}),
// 56 bits -> 8 x 7 bits
arguments(0xffffffffffffffL, new byte[] {-1, -1, -1, -1, -1, -1, -1, 127}),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be worth having a test with zeros (and small numbers) in the middle 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 doesn't make sense - that terminates the varint.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant zeros in the hex representation :)

dimas-b
dimas-b previously approved these changes Mar 31, 2025
Copy link
Contributor

@dimas-b dimas-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, @snazy . Test coverage LGTM given the encoding/decoding logic.

@@ -33,6 +33,7 @@ dependencies {
api(project(":polaris-immutables"))
api(project(":polaris-misc-types"))
api(project(":polaris-version"))
api(project(":polaris-persistence-varint"))
Copy link
Contributor

@eric-maynard eric-maynard Apr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still seems outside the scope of one particular persistence implementation that we expect will need these types

edit: Basically the same as @RussellSpitzer's comment here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking something like

:polaris-persistenance-nosql

or a sub module within that but basically an isolated part of the nosql layer. Just so it's absolutely clear what code is only for that adapter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:polaris-persistenance-nosql looks very broad... I hope we're not putting all of NoSQL-related code in the same module :)

Just so it's absolutely clear what code is only for that adapter

What could be a disadvantage of reusing this module outside the noSQL Persistence impl?

Copy link
Member

@RussellSpitzer RussellSpitzer Apr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically my thought is that Varint serialization is not a core competency of the Polaris project, it's just a utility that is specific to the current approach's NoSql implementation. I think it's ok to get in if it's just something that the Mongo impl wants to use in a separate codebase, but if this is something we wanted to use generally I would strongly lean towards finding an existing library rather than rolling our own.

I think we would have a different discussion if this was a analytics engine, or something that obviously needed it's own varint impl, but i'm not sure why a Catalog needs it's own varint impl. That said, I'm willing to include one if it's isolated from the rest of the code base.

As for modules, I don't think there really should be an issue with keeping the whole Mongo impl in it's own module. I think at least everything should be in the same Java package unless there is a proven need to expose a public api to the rest of the project. That's just my gut feeling here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. By that logic it might be best to keep Varint inside the (future) module that actually needs it (not separate jar artifacts). @snazy WDYT?

If that works for everybody, and since this PR was reviewed, how about we merge it with renaming the module to :polaris-persistenance-nosql-varint and when the bulk of NoSQL persistence comes we'll fold it into the module that needs it?

@snazy snazy merged commit e5e173f into apache:main Apr 15, 2025
5 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Apr 15, 2025
@snazy snazy deleted the add-varint branch April 15, 2025 15:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants