Spread checks over time with with max slop time #848

spaceisntsyntax · 2016-09-20T17:49:38Z

Helps avoid dogpiling services that are checked using the same mechanism by adding per-child and per-test random additional sleep time between checks.

krig · 2016-10-24T15:51:40Z

I don't feel qualified to review this myself, but I don't know who might be considered maintainer for ldirectord now...

@mcnewton, you have been adding features recently - any opinion?

lge · 2016-10-24T20:02:19Z

I don't think this patch as is has even been running.
Last time I checked, perl did not have random(), but rand().
And even that rand() probably behaves differently from what you think you are doing there.
Feel free to prove me wrong :-)

Also, did you really see bad effects in real life,
or are you trying to solve something that does not really exist,
purely based on some theoretical assumptions?

If you have seen bad behaviour in real life,
have you any metric before and after?
Which metric?

mcnewton · 2016-10-24T23:05:14Z

@krig thanks for the include - not really adding features as much as scratching a few itches :).

I haven't tested, but agree with @lge - I'm not convinced it will work as stands.

And also would be interested in what situations this is useful for. I have never seen service checks being a problem for the realservers, and we run checks much quicker than the default. Real traffic usually drown the service checks out entirely in our case.

mcnewton

A few comments inline.

mcnewton · 2016-10-24T23:06:10Z

ldirectord/ldirectord.in

+
+Defines the maximum number of second to be randomly added to checkinterval.
+
+When fork=no this option is not used.


There is no code for this that I can see?

Sorry, ignore that - it's in run_child() rather than ld_main() so it's fine.

mcnewton · 2016-10-24T23:07:56Z

ldirectord/ldirectord.in

+If set in the virtual server section then the global value is overridden.
+
+This option causes checks to spread out in time, in order to avoid
+overloading services probed using the same test.


Abstracting checks from services might be nicer, but that would be much more difficult to do.

mcnewton · 2016-10-24T23:09:59Z

ldirectord/ldirectord.in

@@ -2740,7 +2767,7 @@ sub run_child
 			_check_real($v, $r);
 		}
 		$0 = "ldirectord $virtual_id";
-		sleep $checkinterval;
+		sleep $checkinterval + (random($$) * $checkintervalspread);


As noted, should be rand(), and as $$ is the current PID you're multiplying checkintervalspread by what is often likely to be a very large integer, so this becomes something like sleep(a few hours). Which doesn't seem right?

Surely should just be

sleep $checkinterval + rand($checkintervalspread);

?

spaceisntsyntax · 2016-10-25T18:22:29Z

You're right that it's not in use, but yes, it is meant to address a
real-world problem.

There are several external services that, for various client-side reason,
are backed by the same server-side resource, but which the client must
address on a variety of ports. Because we have to test the health for each
listening port separately, even though the real servers are identical, the
real servers end up under a dog-pile of checks all at the same time every
few seconds. If the checks were spread out, even by a few hundred
milliseconds, the effect would not be harmful.

Thanks.

On Mon, Oct 24, 2016 at 4:02 PM, Lars Ellenberg [email protected]
wrote:

I don't think this patch as is has even been running.
Last time I checked, perl did not have random(), but rand().
And even that rand() probably behaves differently from what you think you
are doing there.
Feel free to prove me wrong :-)

Also, did you really see bad effects in real life,
or are you trying to solve something that does not really exist,
purely based on some theoretical assumptions?

If you have seen bad behaviour in real life,
have you any metric before and after?
Which metric?

—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#848 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AP_H2QEnnzHVngR7Xc0luga1dNQBAEWCks5q3Q7PgaJpZM4KB69s
.

knet-jenkins · 2023-06-12T09:05:48Z

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents-pipeline/job/PR-848/1/input

Spread checks over time with with max slop time

d63b9fc

Helps avoid dogpiling services that are checked using the same mechanism by adding per-child and per-test random additional sleep time between checks.

krig added the feature label Oct 24, 2016

mcnewton reviewed Oct 24, 2016

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spread checks over time with with max slop time #848

Spread checks over time with with max slop time #848

spaceisntsyntax commented Sep 20, 2016

krig commented Oct 24, 2016

lge commented Oct 24, 2016

mcnewton commented Oct 24, 2016

mcnewton left a comment

mcnewton Oct 24, 2016

mcnewton Oct 24, 2016

mcnewton Oct 24, 2016

mcnewton Oct 24, 2016

spaceisntsyntax commented Oct 25, 2016

knet-jenkins bot commented Jun 12, 2023


		Defines the maximum number of second to be randomly added to checkinterval.

		When fork=no this option is not used.

Spread checks over time with with max slop time #848

Are you sure you want to change the base?

Spread checks over time with with max slop time #848

Conversation

spaceisntsyntax commented Sep 20, 2016

krig commented Oct 24, 2016

lge commented Oct 24, 2016

mcnewton commented Oct 24, 2016

mcnewton left a comment

Choose a reason for hiding this comment

mcnewton Oct 24, 2016

Choose a reason for hiding this comment

mcnewton Oct 24, 2016

Choose a reason for hiding this comment

mcnewton Oct 24, 2016

Choose a reason for hiding this comment

mcnewton Oct 24, 2016

Choose a reason for hiding this comment

spaceisntsyntax commented Oct 25, 2016

knet-jenkins bot commented Jun 12, 2023