Skip to content

Conversation

MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Sep 10, 2025

Remove sentinel logcontext in Clock utilities (looping_call, looping_call_now, call_later)

Part of #18905

Lints for ensuring we use Clock.call_later instead of reactor.callLater, etc are coming in #18944

Testing strategy

  1. Configure Synapse to log at the DEBUG level
  2. Start Synapse: poetry run synapse_homeserver --config-path homeserver.yaml
  3. Wait 10 seconds for the database profiling loop to execute
  4. Notice the logcontext being used for the Total database time log line

Before (sentinel):

2025-09-10 16:36:58,651 - synapse.storage.TIME - 707 - DEBUG - sentinel - Total database time: 0.646% {room_forgetter_stream_pos(2): 0.131%, reap_monthly_active_users(1): 0.083%, get_device_change_last_converted_pos(1): 0.078%}

After (looping_call):

2025-09-10 16:36:58,651 - synapse.storage.TIME - 707 - DEBUG - looping_call - Total database time: 0.646% {room_forgetter_stream_pos(2): 0.131%, reap_monthly_active_users(1): 0.083%, get_device_change_last_converted_pos(1): 0.078%}

Dev notes

SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.util.test_logcontext.LoggingContextTestCase

Other related fixes matrix-org/synapse#4209

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

@MadLittleMods MadLittleMods changed the title Add logcontext to Clock utilities (looping_call, looping_call_now, call_later) Remove sentinel logcontext in Clock utilities (looping_call, looping_call_now, call_later) Sep 10, 2025
# test is done once d2 finishes
return d2

@logcontext_clean
Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added @logcontext_clean everywhere as it makes sense to fail when we see logcontext_error used in the logcontext tests.

Interestingly, I do still see some logcontext_error (ex. 2025-09-10 20:20:17-0500 [-] synapse.logging.context - 90 - WARNING - one - Re-starting finished log context one) logs when running these run_in_background test logs but the tests still aren't failing 🤔

These logs also happen on develop (even develop from days ago before any logcontext changes) so this is a long-standing problem.

Something to solve in an another PR ⏩

with context.PreserveLoggingContext():
callback(*args, **kwargs)

with context.PreserveLoggingContext():
Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PreserveLoggingContext() doesn't seem necessary but please double-check. We aren't calling anything at this point. Only scheduling something for the reactor to run later

Unless we think that a reactor implementation might call immediately when delay is 0 (reactor.callLater(0, func)). Although, I'm assuming this has the same semantics as setTimeout(code, 0) in JavaScript where it should be run as soon as possible but not right away (put it on the queue).

Please cross-check with whether the test_call_later is good enough to stress this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems fair to me. I can't rule out some weird non-obvious interaction but if that was the case, there should've been a comment to that extent.

Comment on lines 202 to 206
assert context.current_context() is context.SENTINEL_CONTEXT, (
"Expected `call_later` callback from the reactor to start with the sentinel logcontext "
f"but saw {context.current_context()}. In other words, another task shouldn't have "
"leaked their logcontext to us."
)
Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we ok with breaking the world when we run into this kind of problem?

It would be much nicer to catch this and get a report (what we're doing now) than logging in the background.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, this has already caught the problem in our email pusher tests where we were leaking logcontexts -> 9116e74

Comment on lines +98 to +101
self.assertTrue(
callback_finished,
"Callback never finished which means the test probably didn't wait long enough",
)
Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this to ensure we're actually getting through everything as expected.

Previously, we didn't run through the callback completely and missed the competing assertion for example.

d: Deferred = Deferred()
self.email_attempts.append((d, args, kwargs))
return d
return make_deferred_yieldable(d)
Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We were leaking the logcontext into the reactor

This took some debug diving to find the culprit but turns out we already had the same fix in the HTTP pusher tests since a267c2e (matrix-org/synapse#4204)


Previously, this was causing some tests to fail:

SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.push.test_email.EmailPusherTests.test_invite_to_empty_room_sends_email

_trial_temp/test.log

2025-09-11 01:01:23-0500 [-] synapse.util - 308 - ERROR - emailpush.process-2 - Looping call died
	Traceback (most recent call last):
	  File "/virtualenvs/matrix-synapse-xCtC9ulO-py3.13/lib/python3.13/site-packages/twisted/internet/defer.py", line 216, in maybeDeferred
	    result = f(*args, **kwargs)
	  File "synapse/synapse/util/__init__.py", line 202, in wrapped_f
	    assert context.current_context() is context.SENTINEL_CONTEXT, (
	           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	AssertionError: Expected `looping_call` callback from the reactor to start with the sentinel logcontext but saw emailpush.process-2. In other words, another task shouldn't have leaked their logcontext to us.

We caught this because we added that new sentinel logcontext assertion (discussed in another thread above)

@MadLittleMods MadLittleMods marked this pull request as ready for review September 11, 2025 06:15
@MadLittleMods MadLittleMods requested a review from a team as a code owner September 11, 2025 06:15
See #18907 (comment)

Prior art here uses `one`  here but I find it difficult to write about in a comment. For example, "in one context" is kinda generic sounding when I'm talking about the specific context.
See #18907 (comment)

> We should wrap this in a `try: ... finally: callback_finished = True` so we can see any potential underlying error that occurs in the `competing_callback()`
Comment on lines 155 to 157
Note that the function will be called with generic `looping_call` logcontext, so
if it is anything other than a trivial task, you probably want to wrap it in
`run_as_background_process` to give it more specific label and track metrics.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not your problem to deal with, but I've often wondered why we don't just make looping_call take a name and do this wrapping itself. I can't think of a generic looping_call ctx ever being the desirable outcome.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A possible future 👍

with context.PreserveLoggingContext():
callback(*args, **kwargs)

with context.PreserveLoggingContext():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems fair to me. I can't rule out some weird non-obvious interaction but if that was the case, there should've been a comment to that extent.

@MadLittleMods MadLittleMods merged commit e7d98d3 into develop Sep 22, 2025
76 of 78 checks passed
@MadLittleMods MadLittleMods deleted the madlittlemods/looping-call-logcontext branch September 22, 2025 19:51
@MadLittleMods
Copy link
Contributor Author

Thanks for the review @reivilibre 🦔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants