Conversation

@silvanocerza

This is an attempt to make the library work with DBs larger than 2 GB.

As discussed in #154, I started out by creating a private Buffer interface that defines all the ByteBuffer methods used by the library, using long instead of int where necessary.

I also implemented it as SingleBuffer, which just wraps a single ByteBuffer and dispatches most method calls to it.

I obviously had to make some changes to use Buffer and long where necessary.
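
To give an idea of the shape, here's a rough sketch (only a couple of methods shown; the real interface mirrors whichever ByteBuffer methods the library actually calls):

import java.nio.ByteBuffer;

// Sketch only: long offsets in the interface, narrowing casts in the
// single-buffer implementation.
interface Buffer {
    long capacity();
    byte get(long index);
    int getInt(long index);
    Buffer duplicate();
}

final class SingleBuffer implements Buffer {
    private final ByteBuffer buffer;

    SingleBuffer(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    @Override
    public long capacity() {
        return buffer.capacity();
    }

    @Override
    public byte get(long index) {
        // A single ByteBuffer can never exceed Integer.MAX_VALUE bytes,
        // so the narrowing cast is safe here.
        return buffer.get((int) index);
    }

    @Override
    public int getInt(long index) {
        return buffer.getInt((int) index);
    }

    @Override
    public Buffer duplicate() {
        return new SingleBuffer(buffer.duplicate());
    }
}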

These are the benchmarks before and after the change. There seems to be a small impact on performance; I would consider it negligible, but I'd love your feedback @oschwald. If this is good for you, I'll keep going with this approach.

Before:

$ java -cp target/classes:sample Benchmark "src/test/resources/maxmind-db/test-data/GeoLite2-City-Test.mmdb"
No caching
Warming up
Requests per second: 12238560
Requests per second: 15192594
Requests per second: 18715945

Benchmarking
Requests per second: 19939930
Requests per second: 20182262
Requests per second: 18237705
Requests per second: 19340692
Requests per second: 20495198

With caching
Warming up
Requests per second: 20486957
Requests per second: 16962125
Requests per second: 20492853

Benchmarking
Requests per second: 20080976
Requests per second: 19449435
Requests per second: 19674518
Requests per second: 17438679
Requests per second: 18250728

After:

$ java -cp target/classes:sample Benchmark "src/test/resources/maxmind-db/test-data/GeoLite2-City-Test.mmdb"
No caching
Warming up
Requests per second: 9042490
Requests per second: 15097511
Requests per second: 18496268

Benchmarking
Requests per second: 18985389
Requests per second: 17247624
Requests per second: 18502386
Requests per second: 19458596
Requests per second: 19564717

With caching
Warming up
Requests per second: 18938661
Requests per second: 16036499
Requests per second: 19542414

Benchmarking
Requests per second: 17596791
Requests per second: 18068941
Requests per second: 19250173
Requests per second: 19177891
Requests per second: 19458849

@oschwald
Member

This looks like a good start! I had a number of minor comments. In terms of performance, I believe the bounds checks on the methods should be unnecessary given their usage, and eliminating them should help reduce the overhead a bit.

@silvanocerza
Author

Made the requested changes; this is the new benchmark.

$ java -cp target/classes:sample Benchmark "src/test/resources/maxmind-db/test-data/GeoLite2-City-Test.mmdb"
No caching
Warming up
Requests per second: 9419547
Requests per second: 15061866
Requests per second: 18671361

Benchmarking
Requests per second: 19330939
Requests per second: 17348774
Requests per second: 18305868
Requests per second: 19884570
Requests per second: 19941156

With caching
Warming up
Requests per second: 19980485
Requests per second: 16225425
Requests per second: 19684910

Benchmarking
Requests per second: 19650596
Requests per second: 16992233
Requests per second: 18420135
Requests per second: 19801441
Requests per second: 19802013

I'll keep working on the MultiBuffer implementation.

@silvanocerza
Author

MultiBuffer implemented.

@oschwald should I update the script in https://github.com/maxmind/MaxMind-DB to generate a big DB too? I guess it would be useful for tests and benchmarks.

@oschwald
Member

> MultiBuffer implemented.

Thanks! I'll try to take a look soon.

> @oschwald should I update the script in https://github.com/maxmind/MaxMind-DB to generate a big DB too? I guess it would be useful for tests and benchmarks.

I don't think we would want a large database in that repo. There are quite a few other projects pulling that in, and a large database would impact them. In terms of testing, it might make sense to focus on good unit-test coverage of MultiBuffer and all of its methods directly. One way to do this more easily without slowing down the whole test suite would be to add a constructor where you can set the chunk size and set it to a small number for the tests. Potentially we could also add a package-private constructor for the reader that allowed setting the chunk size and then parameterize the existing reader tests so that they cover both the SingleBuffer case and the MultiBuffer case.
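
Something along these lines, as a rough sketch (the MultiBuffer(long, int) constructor and put() here are hypothetical, just to show the idea):

@Test
public void testGetLongAcrossChunkBoundary() {
    // Hypothetical test-only constructor: MultiBuffer(long capacity, int chunkSize).
    MultiBuffer buffer = new MultiBuffer(16, 4);
    for (int i = 0; i < 16; i++) {
        buffer.put((byte) i); // put() assumed here for illustration
    }
    // A long read starting at offset 2 spans two 4-byte chunks,
    // so this exercises the chunk-boundary logic with tiny buffers.
    buffer.position(2);
    assertEquals(0x0203040506070809L, buffer.getLong());
}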

@silvanocerza
Author

> I don't think we would want a large database in that repo. There are quite a few other projects pulling that in, and a large database would impact them.

Ah ok, I thought it was only used for your libraries.

> In terms of testing, it might make sense to focus on good unit-test coverage of MultiBuffer and all of its methods directly. One way to do this more easily without slowing down the whole test suite would be to add a constructor where you can set the chunk size and set it to a small number for the tests. Potentially we could also add a package-private constructor for the reader that allowed setting the chunk size and then parameterize the existing reader tests so that they cover both the SingleBuffer case and the MultiBuffer case.

Sounds good, I can cover both buffers like that. 👍

@oschwald
Member

I've only had a chance to do a cursory review, but I noticed a few things.

throw new IllegalArgumentException("File channel has no data");
}

MultiBuffer buf = new MultiBuffer(size);
Member

Won't this allocate a bunch of ByteBuffers that we will immediately replace with the mmap-backed ones? I think this problem exists in several other places as well, e.g., duplicate.

Author

Yeah, I added the private constructor that works with buffers after this and forgot to change it.

Member
@oschwald oschwald Oct 2, 2025

Isn't this still an issue? I'd expect something like this:

int fullChunks = (int) (size / DEFAULT_CHUNK_SIZE);
int remainder = (int) (size % DEFAULT_CHUNK_SIZE);
int totalChunks = fullChunks + (remainder > 0 ? 1 : 0);

ByteBuffer[] buffers = new ByteBuffer[totalChunks];
long remaining = size;

for (int i = 0; i < totalChunks; i++) {
    long chunkPos = (long) i * DEFAULT_CHUNK_SIZE;
    long chunkSize = Math.min(DEFAULT_CHUNK_SIZE, remaining);
    buffers[i] = channel.map(
        FileChannel.MapMode.READ_ONLY,
        chunkPos,
        chunkSize
    );
    remaining -= chunkSize;
}
return new MultiBuffer(buffers, DEFAULT_CHUNK_SIZE);

I thought I saw this fixed last time, but either it was lost in the rebase or I overlooked it.

Author

Think I missed it completely, fixed it now.

  throw new NullPointerException("Unable to use a NULL InputStream");
  }
- final int chunkSize = Integer.MAX_VALUE;
+ final int chunkSize = Integer.MAX_VALUE / 2;
Member

What was the motivation behind this change?

Author

Mostly because of the changes in aafbae6. I made the same change in MultiBuffer too.

I was getting allocation errors trying to allocate byte[Integer.MAX_VALUE]; as far as I understand, it's because when allocating an array some memory is reserved for object metadata.

I noticed that Integer.MAX_VALUE - 8 does the trick, at least on my machine, but I didn't know whether every platform would be fine with that, so I went with half of max int to be safe.
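
For illustration, this is the behaviour I was seeing (typical on HotSpot; exact limits vary by JVM and heap settings):

// Typically throws OutOfMemoryError ("Requested array size exceeds VM limit"):
byte[] tooBig = new byte[Integer.MAX_VALUE];

// Typically succeeds given enough heap; the VM reserves a few header words
// per array object, so the practical limit sits just under max int:
byte[] nearMax = new byte[Integer.MAX_VALUE - 8];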

Member

Whatever threshold we use, we should probably use it for the decision to use a single buffer on lines 23 and 35 as well. Presumably the allocation there would have the same issue. From what I can tell, Integer.MAX_VALUE - 8 should be safe. We should probably just define this as a constant in the class.

Member

Actually, you should just set DEFAULT_CHUNK_SIZE to this and then use that here.
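
I.e., something like this (sketch):

// Largest array length that is safe to allocate on typical JVMs; used both
// as the MultiBuffer chunk size and as the single-buffer cutoff.
static final int DEFAULT_CHUNK_SIZE = Integer.MAX_VALUE - 8;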

Comment on lines 241 to 260
@Test
public void testWrapValidChunks() {
    ByteBuffer[] chunks = new ByteBuffer[] {
        ByteBuffer.allocateDirect(MultiBuffer.DEFAULT_CHUNK_SIZE),
        ByteBuffer.allocateDirect(500)
    };

    MultiBuffer buffer = MultiBuffer.wrap(chunks);
    assertEquals(MultiBuffer.DEFAULT_CHUNK_SIZE + 500, buffer.capacity());
}

@Test
public void testWrapInvalidChunkSize() {
    ByteBuffer[] chunks = new ByteBuffer[] {
        ByteBuffer.allocateDirect(500),
        ByteBuffer.allocateDirect(MultiBuffer.DEFAULT_CHUNK_SIZE)
    };

    assertThrows(IllegalArgumentException.class, () -> MultiBuffer.wrap(chunks));
}
Author

I guess these tests might be causing the failure? Quite strange, as the chunk size is not max int.

A possible solution I see is to move the chunk size check from wrap to the constructor and test that using a small chunk size. At that point wrap is just a one-liner. Sounds good?
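
Roughly like this (a sketch; the field names are hypothetical):

MultiBuffer(ByteBuffer[] chunks, int chunkSize) {
    // Every chunk except the last must be exactly chunkSize long,
    // otherwise the chunk-offset arithmetic breaks.
    for (int i = 0; i < chunks.length - 1; i++) {
        if (chunks[i].capacity() != chunkSize) {
            throw new IllegalArgumentException(
                "All chunks except the last must have capacity " + chunkSize);
        }
    }
    this.chunks = chunks;
    this.chunkSize = chunkSize;
}

static MultiBuffer wrap(ByteBuffer[] chunks) {
    return new MultiBuffer(chunks, DEFAULT_CHUNK_SIZE);
}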

Member

I think all the tests allocating buffers of MultiBuffer.DEFAULT_CHUNK_SIZE will need to be adjusted as we are likely hitting the MaxDirectMemorySize limit set on the JVM. This also includes testDecodeStringTooLarge below, I believe.

Your approach for wrap makes sense.

@oschwald
Member

Sorry, I have been pretty busy, but here is some preliminary feedback.

@silvanocerza
Author

I think I fixed everything you pointed out. The current test failures, though, have me quite stumped.
I see they're failing because the heap has run out of memory, but they're failing in ReaderTest before MultiBufferTest even runs. 🤔

I tried bumping the Surefire JVM heap size, but there are some conflicts with master, so I'm not sure whether that will solve the failures or not. I'm not even sure it's a good way to solve the failure, to be fair. 😅
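
For reference, this is the kind of bump I mean, via the same surefire.argLine property used in the reproduction command further down:

$ mvn clean test -Dsurefire.argLine="-Xms512m -Xmx2g"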

@oschwald
Member

The suggested test change may help with the CI, although I haven't gone through all the tests closely.

}

- int readNode(ByteBuffer buffer, int nodeNumber, int index)
+ int readNode(Buffer buffer, long nodeNumber, int index)
Member

I think we should return long from this function. We will also need to update it to replace the static decodeInteger with a decodeLong. The issue is that for 32-bit nodes, we could overflow.
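
To sketch what I mean (signature assumed, not necessarily the actual decodeLong):

// Accumulating a 32-bit record into an int flips negative once the value
// reaches 2^31; widening each byte into a long avoids the overflow.
static long decodeLong(Buffer buffer, long offset, int size) {
    long value = 0;
    for (int i = 0; i < size; i++) {
        value = (value << 8) | (buffer.get(offset + i) & 0xFF);
    }
    return value;
}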

@silvanocerza
Author

silvanocerza commented Oct 2, 2025

I rebased to increase the Surefire JVM heap size, but that doesn't seem to work. However, I managed to reproduce the issue locally with this:

$ docker run --rm -m 5g --memory-swap 5g \
  -v "$PWD":/ws -w /ws maven:3.9-eclipse-temurin-21 \
  bash -lc 'export MAVEN_OPTS="-Xms512m -Xmx1g"; mvn -B -e clean test \
    -Dsurefire.argLine="-Xms512m -Xmx1024m -XX:MaxMetaspaceSize=192m -XX:MaxDirectMemorySize=256m -XX:+ExitOnOutOfMemoryError" \
    -Dsurefire.forkCount=1'

The issue seems to be here when running TestReader.testBrokenSearchTreePointerStream(); for some reason it exceeds the heap size. Using ByteBuffer.allocateDirect() doesn't help either; only considerably lowering MultiBuffer.DEFAULT_CHUNK_SIZE works.

Though then it fails in MultiBufferTest.testDecodeStringTooLarge().

I think I'll go with a similar approach to the one we used for other tests and create methods and a constructor to set the chunk size, and test those.

@silvanocerza silvanocerza marked this pull request as ready for review October 2, 2025 10:04
@silvanocerza silvanocerza requested a review from oschwald October 2, 2025 10:05
@silvanocerza
Author

I managed to make the tests pass. I added some protected methods, and the public ones are simple wrappers, like we'd done for other methods.
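
The pattern is roughly this (a sketch with made-up names; the real changes live in the BufferHolder constructors, and the real stream path builds chunks incrementally instead of reading everything into one array):

// Public entry point keeps its old signature...
public static Buffer fromStream(InputStream stream) throws IOException {
    return fromStream(stream, DEFAULT_CHUNK_SIZE);
}

// ...and delegates to a protected variant that tests can call with a
// tiny chunk size to exercise MultiBuffer paths without huge buffers.
protected static Buffer fromStream(InputStream stream, int chunkSize)
        throws IOException {
    byte[] bytes = stream.readAllBytes();
    if (bytes.length <= chunkSize) {
        return new SingleBuffer(ByteBuffer.wrap(bytes));
    }
    int totalChunks = (bytes.length + chunkSize - 1) / chunkSize;
    ByteBuffer[] chunks = new ByteBuffer[totalChunks];
    for (int i = 0; i < totalChunks; i++) {
        int from = i * chunkSize;
        int to = Math.min(from + chunkSize, bytes.length);
        chunks[i] = ByteBuffer.wrap(Arrays.copyOfRange(bytes, from, to)); // java.util.Arrays
    }
    return new MultiBuffer(chunks, chunkSize);
}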

I removed the MultiBuffer.wrap() method as it was redundant at this point given the changes in BufferHolder constructors to use a custom chunk size when building it from a stream.

All in all I'm quite satisfied with the current state of the PR.

  @Test
  public void testNoIpV4SearchTreeStream() throws IOException {
-     this.testReader = new Reader(getStream("MaxMind-DB-no-ipv4-search-tree.mmdb"));
+     this.testReader = new Reader(getStream("MaxMind-DB-no-ipv4-search-tree.mmdb"), 2048);
Member

It would be nice to parameterize the tests (and others) so that we are testing both SingleBuffer and MultiBuffer.

Also, we might get better test coverage of edge cases if we used a lower value for the MultiBuffer cases.
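
E.g., something like this (a sketch using JUnit 5 parameterized tests; the 0 sentinel meaning "use the default SingleBuffer path" is made up for illustration):

// import org.junit.jupiter.params.ParameterizedTest;
// import org.junit.jupiter.params.provider.ValueSource;

@ParameterizedTest
@ValueSource(ints = {0, 16})
public void testNoIpV4SearchTreeStream(int chunkSize) throws IOException {
    this.testReader = chunkSize == 0
        ? new Reader(getStream("MaxMind-DB-no-ipv4-search-tree.mmdb"))
        : new Reader(getStream("MaxMind-DB-no-ipv4-search-tree.mmdb"), chunkSize);
    // ...existing assertions unchanged
}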

@silvanocerza silvanocerza requested a review from oschwald October 8, 2025 09:24