dungba88
Contributor

Description

Lazily write the FST padding byte, so that in case the FST is empty (no accepted nodes) nothing will be written. This is important for off-heap writing, as we don't want to add that extra byte when the FST would be thrown away. Found while working on #12980

@dungba88 dungba88 changed the title lazily write the FST padding byte Lazily write the FST padding byte Dec 26, 2023
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 10, 2024
@mikemccand
Member

Egads, thank you bot! This one had already fallen past the event horizon of my email box. I'll try to review soon.

  // pad: ensure no node gets address 0 which is reserved to mean
- // the stop state w/ no arcs
- dataOutput.writeByte((byte) 0);
+ // the stop state w/ no arcs. the actual byte will be written lazily
Member

Hmm could we instead add a boolean paddingBytePending or so? Set it to true, here, then when the byte is lazily written, set it to false, write the byte, and increment numBytesWritten at that point? Otherwise it's sort of weird to increment numBytesWritten when we didn't actually write the byte yet?

Contributor Author

numBytesWritten is used to determine the addresses of the nodes that are yet to be written, so if we don't increment it here, it would mess up those addresses.
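
To illustrate the point about addresses, here is a minimal sketch under simplifying assumptions. It is not Lucene's `FSTCompiler`: the class `NodeAddressSketch`, the method `assignAddresses`, and the rule "a node's address is the offset of its first byte" are all hypothetical stand-ins. It only shows that if the running counter does not start at 1, the first node lands on address 0, which the code comment above reserves for the stop state.

```java
/** Hypothetical model of address assignment; Lucene's real node encoding differs. */
class NodeAddressSketch {
  // Address 0 is reserved to mean "stop state with no arcs".
  static final long STOP_STATE_ADDRESS = 0;

  /**
   * Assigns each node an address equal to the running byte count before the node.
   * startOffset plays the role of the initial numBytesWritten (1 when the pad is counted).
   */
  static long[] assignAddresses(int[] nodeSizes, long startOffset) {
    long[] addresses = new long[nodeSizes.length];
    long numBytesWritten = startOffset;
    for (int i = 0; i < nodeSizes.length; i++) {
      addresses[i] = numBytesWritten;
      numBytesWritten += nodeSizes[i];
    }
    return addresses;
  }

  public static void main(String[] args) {
    int[] sizes = {2, 1};
    long[] counted = assignAddresses(sizes, 1);   // pad byte counted up front
    long[] uncounted = assignAddresses(sizes, 0); // pad byte not counted
    System.out.println("first address with pad counted: " + counted[0]);
    System.out.println("collides with stop state without it: "
        + (uncounted[0] == STOP_STATE_ADDRESS));
  }
}
```

In this model the counter has to include the pad byte from the start, whether or not the byte has physically been written, which is the constraint described in the reply above.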

freezeTail(0);
if (root.numArcs == 0) {
  if (fst.metadata.emptyOutput == null) {
    return null;
Member

Maybe add a comment that this means a completely empty FST? Accepts nothing?

}

private void writePaddingByte() throws IOException {
  assert numBytesWritten == 1;
Member

Maybe add a comment explaining what this padding byte even is for? I myself cannot remember :)

Member

@mikemccand mikemccand left a comment

Thanks @dungba88 -- I left a few small comments.

How serious is this, during off-heap writing? Are we using off-heap writing anywhere in Lucene yet (e.g. block tree)?

@dungba88
Contributor Author

How serious is this, during off-heap writing? Are we using off-heap writing anywhere in Lucene yet (e.g. block tree)?

This will break off-heap writing, as we would be writing a byte that doesn't belong to any FST, and it would mess up subsequent reads (if multiple FSTs are written to the same file).
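
A small sketch of that failure mode, under loudly simplified assumptions: the "file" is a byte array, each FST is an opaque blob preceded by one pad byte, and the reader knows each blob's length and assumes an empty FST occupies no bytes. The class `SharedFileSketch` and its methods are hypothetical and are not Lucene APIs.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical model of several FST-like blobs written back to back into one file. */
class SharedFileSketch {
  /** Writes blobs in order. With eagerPad, even an empty blob emits its pad byte. */
  static byte[] writeAll(byte[][] blobs, boolean eagerPad) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte[] blob : blobs) {
      if (blob.length > 0 || eagerPad) {
        out.write(0); // pad byte
      }
      out.write(blob, 0, blob.length);
    }
    return out.toByteArray();
  }

  /** Reads blobs back, assuming an empty blob left nothing in the file. */
  static byte[][] readAll(byte[] file, int[] lengths) {
    byte[][] result = new byte[lengths.length][];
    int pos = 0;
    for (int i = 0; i < lengths.length; i++) {
      if (lengths[i] == 0) {
        result[i] = new byte[0];
        continue;
      }
      pos++; // skip this blob's pad byte
      result[i] = Arrays.copyOfRange(file, pos, pos + lengths[i]);
      pos += lengths[i];
    }
    return result;
  }

  public static void main(String[] args) {
    byte[][] blobs = {{1, 2}, {}, {3, 4}};
    int[] lengths = {2, 0, 2};
    byte[][] lazy = readAll(writeAll(blobs, false), lengths);
    byte[][] eager = readAll(writeAll(blobs, true), lengths);
    System.out.println("lazy pad, third blob:  " + Arrays.toString(lazy[2]));
    System.out.println("eager pad, third blob: " + Arrays.toString(eager[2]));
  }
}
```

With the eager pad, the stray byte from the empty blob shifts every later read by one position, so the third blob no longer comes back intact in this model.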

We are not using off-heap, but I have 2 other PRs to start doing that:

@mikemccand
Member

How serious is this, during off-heap writing? Are we using off-heap writing anywhere in Lucene yet (e.g. block tree)?

This will break off-heap writing, as we would be writing a byte that doesn't belong to any FST, and it would mess up subsequent reads (if multiple FSTs are written to the same file).

Oh no, I see. Hmm, well could we increment numBytesWritten but use a separate boolean indicating that the lazy byte hasn't been written yet? It's spooky to have numBytesWritten = 1 mean the byte may or may not have been written.
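
One way to read this suggestion, as a rough sketch only: the counter is incremented up front so addresses stay correct, while a separate flag records whether the byte has physically reached the output. The class `PaddingFlagSketch` and its accessors are hypothetical; the field names `numBytesWritten` and `paddingBytePending` are borrowed from this thread, and the merged Lucene code may look different.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch, not Lucene's FSTCompiler: a byte counter plus an explicit pending flag. */
class PaddingFlagSketch {
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();
  // The pad byte is counted immediately so that no node can get address 0.
  private long numBytesWritten = 1;
  // True until the counted pad byte has actually been written to `out`.
  private boolean paddingBytePending = true;

  /** Writes node bytes, flushing the deferred pad byte first if it is still pending. */
  void writeNodeBytes(byte[] bytes) {
    if (paddingBytePending) {
      out.write(0);
      paddingBytePending = false;
    }
    out.write(bytes, 0, bytes.length);
    numBytesWritten += bytes.length;
  }

  long numBytesWritten() { return numBytesWritten; }
  boolean isPaddingBytePending() { return paddingBytePending; }
  int physicalSize() { return out.size(); }

  public static void main(String[] args) {
    PaddingFlagSketch sketch = new PaddingFlagSketch();
    System.out.println("before any node: pending=" + sketch.isPaddingBytePending()
        + ", physical bytes=" + sketch.physicalSize());
    sketch.writeNodeBytes(new byte[] {5, 6});
    System.out.println("after one node: pending=" + sketch.isPaddingBytePending()
        + ", physical bytes=" + sketch.physicalSize());
  }
}
```

The flag removes the ambiguity: `numBytesWritten == 1` alone no longer has to stand in for "the pad byte has not been written".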

@mikemccand
Member

We are not using off-heap, but I have 2 other PRs to start doing that:

OK, phew, thanks. Those will be next up :)

@github-actions github-actions bot removed the Stale label Jan 11, 2024
@dungba88
Contributor Author

Thanks @mikemccand for reviewing, I've addressed those in the latest revision.

Member

@mikemccand mikemccand left a comment

Thanks, it looks great @dungba88, I'll merge!

@mikemccand mikemccand merged commit 701619d into apache:main Jan 11, 2024
mikemccand pushed a commit that referenced this pull request Jan 11, 2024
* lazily write the FST padding byte

* Also write the pad byte when there is emptyOutput

* add comment

* Add more comments
@mikemccand mikemccand added this to the 9.10.0 milestone Jan 11, 2024
slow-J pushed a commit to slow-J/lucene that referenced this pull request Jan 16, 2024