Skip to content

Conversation

@Electroid
Copy link

Closes #5229
Closes #4337

Summary

Users have reported that commands sometimes hang. The root cause is because the timeout handler only kills the direct child process, but grandchildren processes are kept alive.

Note: While this appears similar to recently merged #5258, those changes did not address the same issue that was present in the unified_exec codepath.

Reproduction

If you run the command below, it will hang. Instead, it should yield back to Codex.

$ echo "Run this bash command: sleep 30 & echo \"Background PID: \$!\"; exit 0" | codex --enable unified_exec 

Solution

While portable_pty already calls setsid() internally (creating a process group where pgid == pid), it only kills the direct child. This PR adds process group killing:

  1. Track the child PID when spawning
  2. On timeout/drop, call killpg(pgid, SIGKILL) to kill the entire process group
  3. Then call killer.kill() as backup

Copy link
Contributor

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 76 to +116
impl Drop for ExecCommandSession {
fn drop(&mut self) {
#[cfg(unix)]
if let Ok(mut pid_guard) = self.pid.lock() {
if let Some(pid) = pid_guard.take() {
if let Err(e) = kill_child_process_group(pid) {
trace!("Failed to kill process group for pid {}: {}", pid, e);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Guard against killing reused PIDs after child exit

The session destructor kills a process group whenever pid is still set (kill_child_process_group(pid)), but the PID is only cleared in the wait thread after child.wait() returns. Because wait() reaps the process and frees its PID before the mutex is updated (lines ~233‑247), there is a brief window where the child has fully exited and its PID may already be reused by an unrelated process while pid_guard still contains the stale value. If the session is dropped during that window, getpgid/killpg will target the new process group and may SIGKILL an unrelated process tree. Consider checking exit_status before killing or clearing pid before calling wait() so that drop never acts on a PID that might already be recycled.

Useful? React with 👍 / 👎.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a look.

@Electroid Electroid marked this pull request as draft November 12, 2025 23:08
@etraut-openai
Copy link
Collaborator

Thanks for the contribution. Before I ask someone on the codex team to review your PR, could you please fix the lint issues? You can ignore the failed test. That's a flaky test that someone on the team is already working on fixing.

@etraut-openai etraut-openai added the needs-response Additional information is requested label Nov 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-response Additional information is requested

Projects

None yet

2 participants