Skip to content

8358680: AOT cache creation fails: no strings should have been added #25816

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

iklam
Copy link
Member

@iklam iklam commented Jun 16, 2025

Background: when writing the string table in the AOT cache, we do this:

  1. Find out the number of strings in the interned string table
  2. Allocate Java object arrays that are large enough to store these strings. These arrays are used by StringTable::lookup_shared() in the production run.
  3. Enter safepoint
  4. Copy the strings into the arrays

This bug happened because:

  • Step 1 is not thread safe, so it may be reading a stale version of _items_count
  • JIT compiler threads may create more interned strings after step 1

This PR attempts to fix both issues.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8358680: AOT cache creation fails: no strings should have been added (Bug - P4)

Contributors

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25816/head:pull/25816
$ git checkout pull/25816

Update a local copy of the PR:
$ git checkout pull/25816
$ git pull https://git.openjdk.org/jdk.git pull/25816/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 25816

View PR using the GUI difftool:
$ git pr show -t 25816

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25816.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Jun 16, 2025

👋 Welcome back iklam! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jun 16, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 16, 2025
@openjdk
Copy link

openjdk bot commented Jun 16, 2025

@iklam The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge
Copy link

mlbridge bot commented Jun 16, 2025

Webrevs

@iklam
Copy link
Member Author

iklam commented Jun 16, 2025

@shipilev I copied the part of your JDK-8352042: [leyden] Parallel precompilation changes from the Leyden repo that implements CompileBroker::wait_for_no_active_tasks().

@iklam
Copy link
Member Author

iklam commented Jun 16, 2025

/contributor add @shipilev

@openjdk
Copy link

openjdk bot commented Jun 16, 2025

@iklam
Contributor Aleksey Shipilev <[email protected]> successfully added.

@vnkozlov
Copy link
Contributor

Can something else add string to table between you get number and safe point? How you avoid that?
I assume you can't get number at safepoint because you need to allocate object array not at safepoint. Right?

@iklam
Copy link
Member Author

iklam commented Jun 16, 2025

Can something else add string to table between you get number and safe point? How you avoid that? I assume you can't get number at safepoint because you need to allocate object array not at safepoint. Right?

Not that I am aware of. In testing, I added precond(THREAD->is_Java_thread() into StringTable::intern() and it never failed. So I assume that this function can only be called by real Java threads or compiler threads.

@iklam
Copy link
Member Author

iklam commented Jun 16, 2025

Can something else add string to table between you get number and safe point? How you avoid that? I assume you can't get number at safepoint because you need to allocate object array not at safepoint. Right?

Not that I am aware of. In testing, I added precond(THREAD->is_Java_thread() into StringTable::intern() and it never failed. So I assume that this function can only be called by real Java threads or compiler threads.

Here are all the subclasses of JavaThread:

class AttachListenerThread : public JavaThread {
class CompilerThread : public JavaThread {
class DeoptimizeObjectsALotThread : public JavaThread {
class JvmtiAgentThread : public JavaThread {
class MonitorDeflationThread : public JavaThread {
class NotificationThread : public JavaThread {
class ServiceThread : public JavaThread {
class StringDedupThread : public JavaThread {
class TrainingReplayThread : public JavaThread {

StringDedupThread is disabled when dumping heap objects. And the other threads probably will either be inactive, or won't be doing anything that would result in string interning.

I guess someone could use jcmd to connect to the JVM at the wrong time and that might upset the archive assembly process. But that would create far worse problems than the bug that we are trying to fix here.

@shipilev
Copy link
Member

shipilev commented Jun 16, 2025

@shipilev I copied the part of your JDK-8352042: [leyden] Parallel precompilation changes from the Leyden repo that implements CompileBroker::wait_for_no_active_tasks().

Well, I am rewriting that part in openjdk/leyden#79. I can give you a hunk on top of this PR that would do the same thing: use the CompileTaskWait_lock instead.

@shipilev
Copy link
Member

Actually, I need #25409 in mainline first :)

@iklam
Copy link
Member Author

iklam commented Jun 16, 2025

Actually, I need #25409 in mainline first :)

I can wait for that for JDK 26.

If I want to fix for JDK 25, will the current fix be good enough?

@shipilev
Copy link
Member

Would be nice to avoid divergence between JDK 25, JDK 26 and Leyden/premain.

Anyhow, I now think this fix in incomplete. In premain, we use this wait_for_no_active_tasks, because we know all the compiler tasks were queued, and we just need to run them down. But here, we still have a race: compiler threads may finish current batch of compilations, wait_for_no_active_tasks would return, and then we can start compiling again.

I think we need to figure our when we dump the shared table. Maybe even shutdown the compiler right before going into CDS dump? See how CompileBroker::set_should_block and VM_Exit::wait_for_threads_in_native_to_block do it. It would still be pretty awkward for a fix.

@iklam
Copy link
Member Author

iklam commented Jun 17, 2025

Would be nice to avoid divergence between JDK 25, JDK 26 and Leyden/premain.

Anyhow, I now think this fix in incomplete. In premain, we use this wait_for_no_active_tasks, because we know all the compiler tasks were queued, and we just need to run them down. But here, we still have a race: compiler threads may finish current batch of compilations, wait_for_no_active_tasks would return, and then we can start compiling again.

No Java code is executing at this point (we are in the only thread that can run Java code). Is there still a possibility for new compile tasks to be added?

I think we need to figure our when we dump the shared table. Maybe even shutdown the compiler right before going into CDS dump? See how CompileBroker::set_should_block and VM_Exit::wait_for_threads_in_native_to_block do it. It would still be pretty awkward for a fix.

In Leyden, we run the AOT compiler after the CDS dumping has finished, so if we shut down the compiler we would have to restart it.

Copy link
Contributor

@coleenp coleenp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple of comments. Overall, the approach seems good.


// This flag will be cleared after intern table dumping has completed, so we can run the
// compiler again (for future AOT method compilation, etc).
DEBUG_ONLY(Atomic::release_store(&_disable_interning_during_cds_dump, 1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think atomics work with bool or is this a refcount ?


void CompileTask::wait_for_no_active_tasks() {
MonitorLocker locker(CompileTaskAlloc_lock);
while (_active_tasks > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this have to have an Atomic::load() to make it re-read in the loop? Even though it's after we reacquire the lock.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not if _active_tasks is only written whilst the same lock is held.

@@ -346,6 +348,10 @@ bool StringTable::has_work() {
return Atomic::load_acquire(&_has_work);
}

size_t StringTable::items_count() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a convention to make accessor functions that use acquire semantics to be named items_count_acquire().

@shipilev
Copy link
Member

I still dislike hooking up to compiler infrastructure to figure out if something is adding interned strings. I really, really dislike the divergence we would introduce with JDK 25 -> JDK 26 once a variant of JDK-8357473 lands in mainline. I cannot yet think of better solution though, let me think about it some more. At very least we need to get the sequencing of patches right...

@vnkozlov
Copy link
Contributor

I thought your comment was for #25841

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot [email protected] rfr Pull request is ready for review
Development

Successfully merging this pull request may close these issues.

5 participants