Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add options for max diffs before restarting process #202

Open
Mr0grog opened this issue Mar 20, 2025 · 2 comments
Open

Add options for max diffs before restarting process #202

Mr0grog opened this issue Mar 20, 2025 · 2 comments
Labels
enhancement New feature or request server Specific to the diffing server, rather than diff algorithms

Comments

@Mr0grog
Copy link
Member

Mr0grog commented Mar 20, 2025

Folks at Internet Archive would like to see an option to limit the number of requests/diffs that can be handled by a single child process of the diff server (once we hit that limit, we should kill that process and start a new one). The idea is to have something similar to Gunicorn’s max_requests and maybe max_requests_jitter.

The server currently runs all diffs via a ProcessPoolExecutor:

def get_diff_executor(self, reset=False):
if self.application.terminating:
raise RuntimeError('Diff executor is being shut down.')
executor = self.settings.get('diff_executor')
if reset or not executor:
if executor:
try:
# NOTE: we don't need await this; we just want to make sure
# the old executor gets cleaned up.
shutdown_executor_in_loop(executor)
except Exception:
pass
executor = concurrent.futures.ProcessPoolExecutor(
DIFFER_PARALLELISM,
initializer=initialize_diff_worker)
self.settings['diff_executor'] = executor
return executor

ProcessPoolExecutor has a max_tasks_per_child option that basically does this, so we might be able to just lean on that. Doing so doesn’t give us a way to do jitter, but that might be fine.

Most of our config options come in via environment variables, so we should probably use env vars for this, too.

@Mr0grog Mr0grog added enhancement New feature or request server Specific to the diffing server, rather than diff algorithms labels Mar 20, 2025
@Mr0grog
Copy link
Member Author

Mr0grog commented Mar 20, 2025

Ah, now that I’ve created this, I see that max_tasks_per_child is new in Python 3.11, which we do not yet support because of #165. So we’ll have to do this more manually by keeping track of requests and restarting the pool when we hit the max. Or maybe subclassing ProcessPoolExecutor. Need to look into what that would be like.

@Mr0grog
Copy link
Member Author

Mr0grog commented Mar 20, 2025

A quick perusal and comparison of ProcessPoolExecutor implementations in 3.10 and 3.11 suggests that subclassing might not be a great approach, as I suspected. In 3.10, the executor does not do a great job checking on the number of running processes and restarting them, since it does not expect a worker process to randomly exit.

I think we can either:

  • Be really simplistic and just restart the pool every N diffs (or maybe N × {worker_count}).
  • Implement our own custom worker pool. Probably not a great idea: even though our use case is narrower and simpler than all the things ProcessPoolExecutor has to support, I suspect we will painfully rediscover a lot of complexities that are currently already being solved for us.

@Mr0grog Mr0grog moved this from Inbox to Prioritized in Web Monitoring Mar 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request server Specific to the diffing server, rather than diff algorithms
Projects
Status: Prioritized
Development

No branches or pull requests

1 participant