Add support for files larger than 2GB #289
Conversation
This looks like a good start! I had a number of minor comments. In terms of performance, I believe the bounds checks on the methods should be unnecessary given their usage and eliminating them should help reduce the overhead a bit.
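For illustration, this is the kind of per-access check meant here; a minimal hypothetical sketch, not the PR's actual code:

// Hypothetical guarded accessor on a chunked buffer. If every caller
// derives its offsets from already-validated database metadata, the range
// check below never fires and is a candidate for removal.
public byte get(long index) {
    if (index < 0 || index >= this.capacity) {
        throw new IndexOutOfBoundsException("index: " + index);
    }
    ByteBuffer chunk = this.chunks[(int) (index / this.chunkSize)];
    return chunk.get((int) (index % this.chunkSize));
}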
Made the requested changes; this is the new benchmark.
I'll keep working on the …
@oschwald should I update the script in https://github.com/maxmind/MaxMind-DB to generate a big DB too? I guess it will be useful for tests and benchmarks.
Thanks! I'll try to take a look soon.
I don't think we would want a large database in that repo. There are quite a few other projects pulling that in, and a large database would impact them. In terms of testing, it might make sense to focus on good unit-test coverage of …
Ah ok, I thought it was only used for your libraries.
Sounds good, I can cover both buffers like that. 👍
I've only had a chance to do a cursory review, but I noticed a few things.
    throw new IllegalArgumentException("File channel has no data");
}

MultiBuffer buf = new MultiBuffer(size);
Won't this allocate a bunch of ByteBuffers that we will immediately replace with the mmap-backed ones? I think this problem exists in several other places as well, e.g., duplicate().
Yeah, I added the private constructor that works with buffers after this and forgot to change this call.
Isn't this still an issue? I'd expect something like this:
// Number of full DEFAULT_CHUNK_SIZE chunks, plus one partial chunk for
// any remainder.
int fullChunks = (int) (size / DEFAULT_CHUNK_SIZE);
int remainder = (int) (size % DEFAULT_CHUNK_SIZE);
int totalChunks = fullChunks + (remainder > 0 ? 1 : 0);

// Map each region of the file read-only instead of allocating buffers.
ByteBuffer[] buffers = new ByteBuffer[totalChunks];
long remaining = size;
for (int i = 0; i < totalChunks; i++) {
    long chunkPos = (long) i * DEFAULT_CHUNK_SIZE;
    long chunkSize = Math.min(DEFAULT_CHUNK_SIZE, remaining);
    buffers[i] = channel.map(
        FileChannel.MapMode.READ_ONLY,
        chunkPos,
        chunkSize
    );
    remaining -= chunkSize;
}
return new MultiBuffer(buffers, DEFAULT_CHUNK_SIZE);
I thought I saw this fixed last time, but either it was lost in the rebase or I overlooked it.
Think I missed it completely, fixed it now.
      throw new NullPointerException("Unable to use a NULL InputStream");
  }
- final int chunkSize = Integer.MAX_VALUE;
+ final int chunkSize = Integer.MAX_VALUE / 2;
What was the motivation behind this change?
Mostly because of the changes in aafbae6. I made the same change in MultiBuffer too. I was getting allocation errors trying to allocate byte[Integer.MAX_VALUE]; as far as I understood, when allocating an array some memory is reserved for various metadata. I noticed that Integer.MAX_VALUE - 8 does the trick, at least on my machine, though I don't know whether every platform would have been fine, so I went with half of max int to be safe.
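For illustration only, since the exact limit varies by JVM and heap settings:

// On HotSpot, an allocation request very close to Integer.MAX_VALUE can
// fail with "Requested array size exceeds VM limit" because the object
// header occupies a few words of the array's addressable size.
byte[] a = new byte[Integer.MAX_VALUE];     // may throw OutOfMemoryError
byte[] b = new byte[Integer.MAX_VALUE - 8]; // within the VM limit, but still
                                            // needs ~2 GB of contiguous heap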
Whatever threshold we use, we should probably use it for the decision to use a single buffer on lines 23 and 35 as well. Presumably the allocation there would have the same issue. From what I can tell, Integer.MAX_VALUE - 8 should be safe. We should probably just define this as a constant in the class.
Actually, you should just set DEFAULT_CHUNK_SIZE to this and then use that here.
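A sketch of that suggestion (the placement is an assumption; the JDK uses the same 8-byte headroom for constants like ArrayList's internal MAX_ARRAY_SIZE):

// Shared cap for any single allocation or mapping.
static final int DEFAULT_CHUNK_SIZE = Integer.MAX_VALUE - 8;

// ...and at the call site discussed above:
final int chunkSize = DEFAULT_CHUNK_SIZE;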
@Test
public void testWrapValidChunks() {
    ByteBuffer[] chunks = new ByteBuffer[] {
        ByteBuffer.allocateDirect(MultiBuffer.DEFAULT_CHUNK_SIZE),
        ByteBuffer.allocateDirect(500)
    };

    MultiBuffer buffer = MultiBuffer.wrap(chunks);
    assertEquals(MultiBuffer.DEFAULT_CHUNK_SIZE + 500, buffer.capacity());
}

@Test
public void testWrapInvalidChunkSize() {
    ByteBuffer[] chunks = new ByteBuffer[] {
        ByteBuffer.allocateDirect(500),
        ByteBuffer.allocateDirect(MultiBuffer.DEFAULT_CHUNK_SIZE)
    };

    assertThrows(IllegalArgumentException.class, () -> MultiBuffer.wrap(chunks));
}
I guess these tests might be causing the failure? Quite strange, as the chunk size is not max int. One possible solution is to move the chunk-size check from wrap to the constructor and test it there using a small chunk size. At that point wrap is just a one-liner. Sounds good?
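Something along these lines, as a sketch (the names and the exact message are assumptions):

public static MultiBuffer wrap(ByteBuffer[] chunks) {
    return new MultiBuffer(chunks, DEFAULT_CHUNK_SIZE);
}

MultiBuffer(ByteBuffer[] chunks, int chunkSize) {
    // Every chunk except the last must have exactly chunkSize capacity;
    // otherwise offset arithmetic across chunk boundaries breaks.
    for (int i = 0; i < chunks.length - 1; i++) {
        if (chunks[i].capacity() != chunkSize) {
            throw new IllegalArgumentException(
                "All chunks but the last must have capacity " + chunkSize);
        }
    }
    this.chunks = chunks;
    this.chunkSize = chunkSize;
}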
I think all the tests allocating buffers of MultiBuffer.DEFAULT_CHUNK_SIZE will need to be adjusted, as we are likely hitting the MaxDirectMemorySize limit set on the JVM. This also includes testDecodeStringTooLarge below, I believe.

Your approach for wrap makes sense.
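With a constructor that accepts the chunk size, as in the sketch above, the failing tests could allocate only small direct buffers, e.g. (hypothetical sizes):

@Test
public void testWrapInvalidChunkSize() {
    ByteBuffer[] chunks = new ByteBuffer[] {
        ByteBuffer.allocateDirect(500),  // too small for a non-final chunk
        ByteBuffer.allocateDirect(1024)
    };
    assertThrows(IllegalArgumentException.class,
        () -> new MultiBuffer(chunks, 1024));
}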
Sorry, I have been pretty busy, but here is some preliminary feedback.
I think I fixed everything you pointed out. The current test failures, though, have me quite stumped. I tried bumping the Surefire JVM heap size, but there are some conflicts with …
The suggested test change may help with the CI, although I haven't gone through all the tests closely.
}

- int readNode(ByteBuffer buffer, int nodeNumber, int index)
+ int readNode(Buffer buffer, long nodeNumber, int index)
I think we should return long from this function. We will also need to update it to replace the static decodeInteger with a decodeLong. The issue is that for 32-bit nodes, we could overflow.
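A minimal sketch of the concern, assuming a Buffer.get(long) accessor: a 32-bit record with its high bit set turns negative when decoded into an int, but decodes correctly into a long.

static long decodeLong(Buffer buffer, long offset, int size) {
    long value = 0;
    for (int i = 0; i < size; i++) {
        // Mask each byte to keep the accumulating value non-negative.
        value = (value << 8) | (buffer.get(offset + i) & 0xFF);
    }
    return value;
}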
This reverts commit d6d7acd.
I rebased to increase the Surefire JVM heap size but that doesn't seem to work, though I managed to reproduce the issue locally with this:

$ docker run --rm -m 5g --memory-swap 5g \
    -v "$PWD":/ws -w /ws maven:3.9-eclipse-temurin-21 \
    bash -lc 'export MAVEN_OPTS="-Xms512m -Xmx1g"; mvn -B -e clean test \
      -Dsurefire.argLine="-Xms512m -Xmx1024m -XX:MaxMetaspaceSize=192m -XX:MaxDirectMemorySize=256m -XX:+ExitOnOutOfMemoryError" \
      -Dsurefire.forkCount=1'

The issue seems to be here when running …, though then it fails in … I think I'll go with a similar approach to the one we used for other tests and create methods and a constructor to set the chunk size, and test those.
I managed to make the tests pass: I added some protected methods, and the public ones are simple wrappers, like we'd done for other methods. I removed the … All in all, I'm quite satisfied with the current state of the PR.
  @Test
  public void testNoIpV4SearchTreeStream() throws IOException {
-     this.testReader = new Reader(getStream("MaxMind-DB-no-ipv4-search-tree.mmdb"));
+     this.testReader = new Reader(getStream("MaxMind-DB-no-ipv4-search-tree.mmdb"), 2048);
It would be nice to parameterize the tests (and others) so that we are testing both SingleBuffer and MultiBuffer.

Also, we might get better test coverage of edge cases if we used a lower value for the MultiBuffer cases.
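For example, with JUnit 5 parameterized tests (a sketch; the chunk-size values are assumptions, while the two-argument Reader constructor comes from the diff above):

@ParameterizedTest
@ValueSource(ints = {16_384, 2048})
public void testNoIpV4SearchTreeStream(int chunkSize) throws IOException {
    // A chunk size larger than the test file exercises the single-chunk
    // path; the smaller value forces chunk-boundary reads.
    this.testReader = new Reader(
        getStream("MaxMind-DB-no-ipv4-search-tree.mmdb"), chunkSize);
}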
This is an attempt to make the library work with DBs larger than 2GB.

As discussed in #154, I started out creating a private Buffer interface that defines all the ByteBuffer methods used by the library, though using long instead of int where necessary.

I also implemented it with SingleBuffer, which just wraps a single ByteBuffer and dispatches most method calls to it.

I obviously had to make some changes to use Buffer and long where necessary.
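For reference, a rough sketch of the shape such an interface might take (the method names mirror ByteBuffer and are assumptions, not the PR's exact API):

interface Buffer {
    long capacity();
    long position();
    Buffer position(long newPosition);
    byte get();             // relative read, advances the position
    byte get(long index);   // absolute read
    Buffer get(byte[] dst); // bulk read
    int getInt();
    double getDouble();
    Buffer duplicate();     // independent position over the same data
}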
These are the benchmarks before and after the change. There seems to be a small impact on performance; I would consider it negligible, but I'd love your feedback @oschwald. If this is good for you, I'll keep going with this approach.
Before:
After: