Add options for max diffs before restarting process #202

Mr0grog · 2025-03-20T05:59:03Z

Folks at Internet Archive would like to see an option to limit the number of requests/diffs that can be handled by a single child process of the diff server (once we hit that limit, we should kill that process and start a new one). The idea is to have something similar to Gunicorn’s max_requests and maybe max_requests_jitter.

The server currently runs all diffs via a ProcessPoolExecutor:

web-monitoring-diff/web_monitoring_diff/server/server.py

Lines 542 to 560 in 1753d43

    
    def get_diff_executor(self, reset=False):
        if self.application.terminating:
            raise RuntimeError('Diff executor is being shut down.')
        executor = self.settings.get('diff_executor')
        if reset or not executor:
            if executor:
                try:
                    # NOTE: we don't need await this; we just want to make sure
                    # the old executor gets cleaned up.
                    shutdown_executor_in_loop(executor)
                except Exception:
                    pass
            executor = concurrent.futures.ProcessPoolExecutor(
                DIFFER_PARALLELISM,
                initializer=initialize_diff_worker)
            self.settings['diff_executor'] = executor
        return executor

ProcessPoolExecutor has a max_tasks_per_child option that basically does this, so we might be able to just lean on that. Doing so doesn’t give us a way to do jitter, but that might be fine.

Most of our config options come in via environment variables, so we should probably use env vars for this, too.
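
As a rough sketch of what the env var handling could look like: the variable name `DIFFER_MAX_TASKS_PER_CHILD` and both helper functions below are made up for illustration and are not part of the current server. The only real API assumed is the `max_tasks_per_child` keyword of `ProcessPoolExecutor`, which exists only on Python 3.11 and later, so the sketch leaves it out on older versions.

```python
import os
import sys


def read_max_tasks_per_child(environ=os.environ):
    """Return the configured per-worker task limit, or None for no limit.

    DIFFER_MAX_TASKS_PER_CHILD is a hypothetical variable name.
    """
    raw = environ.get('DIFFER_MAX_TASKS_PER_CHILD', '').strip()
    if not raw:
        return None
    value = int(raw)
    if value <= 0:
        raise ValueError(
            'DIFFER_MAX_TASKS_PER_CHILD must be a positive integer')
    return value


def executor_options(environ=os.environ, version=sys.version_info):
    """Build extra keyword arguments for ProcessPoolExecutor."""
    options = {}
    limit = read_max_tasks_per_child(environ)
    # max_tasks_per_child is only accepted on Python 3.11 and later.
    if limit is not None and version >= (3, 11):
        options['max_tasks_per_child'] = limit
    return options
```

The result could then be splatted into the existing call, e.g. `ProcessPoolExecutor(DIFFER_PARALLELISM, initializer=initialize_diff_worker, **executor_options())`. One thing to keep in mind: when `max_tasks_per_child` is set, the executor is not compatible with the `fork` start method and defaults to `spawn` if no `mp_context` is given, which may affect worker startup cost.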


Mr0grog · 2025-03-20T06:02:02Z

Ah, now that I’ve created this, I see that max_tasks_per_child is new in Python 3.11, which we do not yet support because of #165. So we’ll have to do this more manually by keeping track of requests and restarting the pool when we hit the max. Or maybe subclassing ProcessPoolExecutor. Need to look into what that would be like.

Mr0grog · 2025-03-20T06:36:30Z

A quick comparison of the ProcessPoolExecutor implementations in 3.10 and 3.11 suggests, as I suspected, that subclassing is not a great approach. In 3.10, the executor does little to check how many worker processes are still running or to restart them, since it does not expect a worker process to exit on its own.

I think we can either:

- Be really simplistic and just restart the pool every N diffs (or maybe N × {worker_count}).
- Implement our own custom worker pool. Probably not a great idea: even though our use case is narrower and simpler than all the things ProcessPoolExecutor has to support, I suspect we will painfully rediscover a lot of complexities that are currently already being solved for us.
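
To make the first option concrete, here is a minimal sketch of the bookkeeping it would need. The `PoolRestartTracker` class and its parameter names are hypothetical, and the jitter is modeled loosely on the idea behind Gunicorn's max_requests_jitter (a random amount added to the base limit) rather than copied from it.

```python
import random


class PoolRestartTracker:
    """Counts finished diffs and signals when the pool should be replaced.

    Hypothetical helper: it only does the counting and leaves the
    actual executor shutdown and replacement to the caller.
    """

    def __init__(self, max_diffs, worker_count=1, jitter=0, rng=random):
        self.max_diffs = max_diffs
        self.worker_count = worker_count
        self.jitter = jitter
        self._rng = rng
        self._count = 0
        self._limit = self._next_limit()

    def _next_limit(self):
        # Restart every N x worker_count diffs, plus a random offset in
        # [0, jitter] so that multiple servers don't restart in lockstep.
        base = self.max_diffs * self.worker_count
        return base + self._rng.randint(0, self.jitter)

    def record_diff(self):
        """Count one diff; return True when the pool should be replaced."""
        self._count += 1
        if self._count >= self._limit:
            self._count = 0
            self._limit = self._next_limit()
            return True
        return False
```

A request handler could call `tracker.record_diff()` after each diff and, when it returns True, call `self.get_diff_executor(reset=True)` to swap in a fresh pool. Whether diffs still running on the old executor finish cleanly depends on how `shutdown_executor_in_loop` behaves, which this sketch does not address.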

Mr0grog added the enhancement (New feature or request) and server (Specific to the diffing server, rather than diff algorithms) labels Mar 20, 2025

Mr0grog added this to Web Monitoring Mar 20, 2025

github-project-automation bot moved this to Inbox in Web Monitoring Mar 20, 2025

Mr0grog moved this from Inbox to Prioritized in Web Monitoring Mar 20, 2025
