Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spread checks over time with with max slop time #848

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

spaceisntsyntax
Copy link

Helps avoid dogpiling services that are checked using the same mechanism by adding per-child and per-test random additional sleep time between checks.

Helps avoid dogpiling services that are checked using the same mechanism by adding per-child and per-test random additional sleep time between checks.
@krig krig added the feature label Oct 24, 2016
@krig
Copy link
Contributor

krig commented Oct 24, 2016

I don't feel qualified to review this myself, but I don't know who might be considered maintainer for ldirectord now...

@mcnewton, you have been adding features recently - any opinion?

@lge
Copy link
Member

lge commented Oct 24, 2016

I don't think this patch as is has even been running.
Last time I checked, perl did not have random(), but rand().
And even that rand() probably behaves differently from what you think you are doing there.
Feel free to prove me wrong :-)

Also, did you really see bad effects in real life,
or are you trying to solve something that does not really exist,
purely based on some theoretical assumptions?

If you have seen bad behaviour in real life,
have you any metric before and after?
Which metric?

@mcnewton
Copy link
Contributor

@krig thanks for the include - not really adding features as much as scratching a few itches :).

I haven't tested, but agree with @lge - I'm not convinced it will work as stands.

And also would be interested in what situations this is useful for. I have never seen service checks being a problem for the realservers, and we run checks much quicker than the default. Real traffic usually drown the service checks out entirely in our case.

Copy link
Contributor

@mcnewton mcnewton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments inline.


Defines the maximum number of second to be randomly added to checkinterval.

When fork=no this option is not used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no code for this that I can see?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, ignore that - it's in run_child() rather than ld_main() so it's fine.

If set in the virtual server section then the global value is overridden.

This option causes checks to spread out in time, in order to avoid
overloading services probed using the same test.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Abstracting checks from services might be nicer, but that would be much more difficult to do.

@@ -2740,7 +2767,7 @@ sub run_child
_check_real($v, $r);
}
$0 = "ldirectord $virtual_id";
sleep $checkinterval;
sleep $checkinterval + (random($$) * $checkintervalspread);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted, should be rand(), and as $$ is the current PID you're multiplying checkintervalspread by what is often likely to be a very large integer, so this becomes something like sleep(a few hours). Which doesn't seem right?

Surely should just be

sleep $checkinterval + rand($checkintervalspread);

?

@spaceisntsyntax
Copy link
Author

You're right that it's not in use, but yes, it is meant to address a
real-world problem.

There are several external services that, for various client-side reason,
are backed by the same server-side resource, but which the client must
address on a variety of ports. Because we have to test the health for each
listening port separately, even though the real servers are identical, the
real servers end up under a dog-pile of checks all at the same time every
few seconds. If the checks were spread out, even by a few hundred
milliseconds, the effect would not be harmful.

Thanks.

Mike Rylander
| President
| Equinox Software, Inc. / Open Your Library
| phone: 1-877-OPEN-ILS (673-6457)
| email: [email protected]
| web: http://www.esilibrary.com

On Mon, Oct 24, 2016 at 4:02 PM, Lars Ellenberg [email protected]
wrote:

I don't think this patch as is has even been running.
Last time I checked, perl did not have random(), but rand().
And even that rand() probably behaves differently from what you think you
are doing there.
Feel free to prove me wrong :-)

Also, did you really see bad effects in real life,
or are you trying to solve something that does not really exist,
purely based on some theoretical assumptions?

If you have seen bad behaviour in real life,
have you any metric before and after?
Which metric?


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#848 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AP_H2QEnnzHVngR7Xc0luga1dNQBAEWCks5q3Q7PgaJpZM4KB69s
.

@knet-jenkins
Copy link

knet-jenkins bot commented Jun 12, 2023

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents-pipeline/job/PR-848/1/input

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants