Skip to content

Conversation

@andrew-m-leonard
Copy link
Contributor

@andrew-m-leonard andrew-m-leonard commented Dec 5, 2025

Fixes #6773

Concurrent tests with CLOUD_PROVIDER=azure, with 10 "static" nodes available:

00:00:13.906  Specified LABEL: ci.role.test&&hw.arch.x86&&sw.os.linux&&!sw.tool.glibc.2_12
[Pipeline] stage
[Pipeline] { (Queue)
Did you forget the `def` keyword? WorkflowScript seems to be setting a field named dynamicAgents (to a value of type ArrayList) which could lead to memory leaks or other issues.
[Pipeline] echo
00:00:13.923  dynamicAgents: [azure, fyre, EBC]
[Pipeline] echo
00:00:13.925  Added dynamic agent, nodeLabel: 'hw.arch.x86&&sw.os.linux&&!sw.tool.glibc.2_12&&(ci.role.test||ci.agent.dynamic)', dynamicLabel: 'azure'
[Pipeline] echo
00:00:13.926  Provisioning LABEL: hw.arch.x86&&sw.os.linux&&!sw.tool.glibc.2_12&&(ci.role.test||ci.agent.dynamic)
[Pipeline] nodesByLabel
00:00:13.934  Found a total of 10 nodes with the 'hw.arch.x86&&sw.os.linux&&!sw.tool.glibc.2_12&&(ci.role.test||ci.agent.dynamic)' label
[Pipeline] node
00:00:28.937  Still waiting to schedule task
00:00:28.937  Waiting for next available executor on ‘[hw.arch.x86&&sw.os.linux&&!sw.tool.glibc.2_12&&(ci.role.test||ci.agent.dynamic)](https://ci.adoptium.net/label/hw.arch.x86&&sw.os.linux&&!sw.tool.glibc.2_12&&(ci.role.test%7C%7Cci.agent.dynamic)/)’
00:03:26.402  Running on [test-linux-x64-d0f330](https://ci.adoptium.net/computer/test%2Dlinux%2Dx64%2Dd0f330/) in /home/adoptopenjdk/workspace/Test_openjdk21_hs_extended.openjdk_x86-64_linux_testList_0

@andrew-m-leonard andrew-m-leonard self-assigned this Dec 5, 2025
@karianna
Copy link
Contributor

karianna commented Dec 5, 2025

It's starting to feel like we need a state machine here...

Copy link
Contributor

@smlambert smlambert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sophia-guo
Copy link
Contributor

sophia-guo commented Dec 8, 2025

If static nodes are available they should be used first. Is this still expected behaviour for now? Or it doesn't matter to use dynamic ones or static ones?

@smlambert
Copy link
Contributor

If static nodes are available they should be sued first. Is this still expected behaviour for now? Or it doesn't matter to use dynamic ones or static ones?

This PR fixes the behaviour. Somehow over the past year, that functionality got 'broken', to only use dynamic agents if CLOUD_PROVIDER is set. Andrew has restored the behaviour to use idle static nodes first, then spin up dynamic agents if all static nodes idle.

@andrew-m-leonard
Copy link
Contributor Author

andrew-m-leonard commented Dec 8, 2025

If static nodes are available they should be sued first. Is this still expected behaviour for now? Or it doesn't matter to use dynamic ones or static ones?

If BOTH a static node and a dynamic node(ie.it hasn't been released after doing work..) is available, then it can use either I think that's reasonable.

@sophia-guo
Copy link
Contributor

Instead of changing the order of if clause if (!areNodesWithLabelOnline(LABEL)) {} does one line of change LABEL += '&&ci.agent.dynamic' to LABEL += '&&(ci.role.test||ci.agent.dynamic)' also work, which re-check the statics nodes status in case they are available after the initial check time-out?

@andrew-m-leonard
Copy link
Contributor Author

If static nodes are available they should be sued first. Is this still expected behaviour for now? Or it doesn't matter to use dynamic ones or static ones?

This PR fixes the behaviour. Somehow over the past year, that functionality got 'broken', to only use dynamic agents if CLOUD_PROVIDER is set. Andrew has restored the behaviour to use idle static nodes first, then spin up dynamic agents if all static nodes idle.

Yes, if a Static node is "available", then it will get used first, although if both a "Static" & a "Dynamic" are available, ie.a previous provisioned Dynamic has just finished some work and is "idle" and not yet been returned to the pool, then either could be used...
That is very subtly different to what the previous code was trying to do, in that it was trying to query "idle" Static... and only add Dynamic label if it could not find any, but I believe that was a bit flawed, it is better to add (ci.role.test || ci.agent.dynamic) and let Jenkins make the choice... as if both types are available, we should let it choose.

@smlambert
Copy link
Contributor

as if both types are available, we should let it choose.

Agree, and especially as we may configure via the Jenkins plugin how long a particular dynamic agent stays alive once it is spun up...

@andrew-m-leonard
Copy link
Contributor Author

Instead of changing the order of if clause if (!areNodesWithLabelOnline(LABEL)) {} does one line of change LABEL += '&&ci.agent.dynamic' to LABEL += '&&(ci.role.test||ci.agent.dynamic)' also work, which re-check the statics nodes status in case they are available after the initial check time-out?

The problem previously was if (!areNodesWithLabelOnline(LABEL)) is not checking "available"... it is only checking if "online".
As you highlight also there is a race condition when trying to check "available" (using APIs or...), as by the time you get to the node(LABEL) the available one may not be available any longer...!

So i've changed the logic to basically:

  • Construct the correct required LABEL based on static (ci.role.test) or dynamic (ci.agent.dynamic) requirement
  • Then do a "online" check in case ALL nodes are "Down/Disconnected"
  • Then do the node(LABEL) provision

@andrew-m-leonard
Copy link
Contributor Author

Instead of changing the order of if clause if (!areNodesWithLabelOnline(LABEL)) {} does one line of change LABEL += '&&ci.agent.dynamic' to LABEL += '&&(ci.role.test||ci.agent.dynamic)' also work, which re-check the statics nodes status in case they are available after the initial check time-out?

The other problem with that was also if no Static's are online, then the Dynamic logic was ignored and a Timeout abort occured

@andrew-m-leonard andrew-m-leonard changed the title Dynamic wait Correct Dynamic Queue logic for CLOUD_PROVIDER and Static node combinations Dec 8, 2025
@andrew-m-leonard
Copy link
Contributor Author

Also tested on a local Jenkins with no Cloud dynamic agents, and if CLOUD_PROVIDER is specified for any reason it will correctly just use the Statics (previously it would hang)

@smlambert
Copy link
Contributor

re: #6776 (comment)

@andrew-m-leonard - if there is not already a clear comment in that piece of code, it might be good to add your clarifying comment from above to guide others looking at that code in the future

@andrew-m-leonard
Copy link
Contributor Author

re: #6776 (comment)

@andrew-m-leonard - if there is not already a clear comment in that piece of code, it might be good to add your clarifying comment from above to guide others looking at that code in the future

yep, i'll beef up the comments I added, thanks

@sophia-guo
Copy link
Contributor

In this case I feel like a static will be used first but meanwhile the dynamic one will still be spun up since node(Label) has 'ci.agent.dynamic ' ?

Yes, if a Static node is "available", then it will get used first. ( No previous provisioned Dynamic available)

@andrew-m-leonard
Copy link
Contributor Author

In this case I feel like a static will be used first but meanwhile the dynamic one will still be spun up since node(Label) has 'ci.agent.dynamic ' ?

Yes, if a Static node is "available", then it will get used first. ( No previous provisioned Dynamic available)

Not sure that matters, but I don't think Jenkins provisioning will trigger a Cloud plugin when it is scheduling it to a Static...

My test above didn't do that either, there were 10 Static nodes available when I triggered the sanity.openjdk, all 3 went on the Static nodes, no Dynamics were spun up. I then triggered the extended.openjdk which requested 10 nodes, the first 7 went on the Static nodes, the remaining 3 then waited 2mins or so for the Dynamic agents to spin up. So Jenkins did a great job!

@JasonFengJ9 JasonFengJ9 merged commit 7f0bf63 into adoptium:master Dec 8, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Using CLOUD_PROVIDER ignores existing Static nodes

5 participants