Skip to content

Statement timeout during cargo build leads to crash #413

Open
@jscheid

Description

@jscheid

Hi, thanks for all your hard work on plrust. So this is an interesting one. I've been hitting an "abnormal server process exit" and tracked it down to a statement timeout setting:

I had recently configured a default statement_timeout of 45000 (ms) and when I recompiled one of my plrust functions, it started downloading and recompiling external crates which took longer than that. Exactly after 45 seconds the build failed -- presumably because it received some signal from Postgres, which is expected of course -- but shouldn't it exit more cleanly, rather than taking down the whole worker process all worker processes?

Logs
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
[...]
PQconsumeInput() ERROR:  
   0: `cargo build` failed

Location:
   /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/convert/mod.rs:716

`cargo build` stderr:
       Updating crates.io index
    Downloading crates ...
     Downloaded thiserror-impl v1.0.62
     Downloaded thiserror v1.0.62
     Downloaded syn v2.0.71
      Compiling thiserror v1.0.62
      Compiling syn v2.0.71
      Compiling getrandom v0.2.15
      Compiling uuid v1.10.0
      Compiling bindgen v0.68.1
      Compiling serde_derive v1.0.204
      Compiling thiserror-impl v1.0.62
      Compiling enum-map-derive v0.17.0
      Compiling enum-map v2.7.3
      Compiling pest v2.7.11
      Compiling semver-parser v0.10.2
      Compiling serde v1.0.204
      Compiling semver v0.11.0
      Compiling rustc_version v0.3.3
      Compiling atomic-traits v0.3.0
      Compiling toml_datetime v0.6.6
      Compiling serde_spanned v0.6.6
      Compiling serde_json v1.0.120
      Compiling toml_edit v0.22.15
      Compiling serde_cbor v0.11.2
      Compiling toml v0.8.14
      Compiling cargo_toml v0.16.3
      Compiling pgrx-pg-config v0.11.0
      Compiling pgrx-pg-sys v0.11.0
      Compiling pgrx v0.11.0
      Compiling plrust-trusted-pgrx v1.2.7


Source Code:
   // SNIP
   // (I have the full error message here and can make it available if needed,
   // but I don't think it matters. The function doesn't declare any additional dependencies.)

	  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
	                                ⋮ 2 frames hidden ⋮
	   3: plrust::user_crate::build::FnBuild::build_internal::hcdeaf7e52dfecfae
	      at <unknown source file>:<unknown line>
	   4: plrust::plrust::compile_function::h90a18482c5d32d1b
	      at <unknown source file>:<unknown line>
	   5: pgrx_pg_sys::submodules::panic::run_guarded::hbd55bed297e7b964
	      at <unknown source file>:<unknown line>
	   6: plrust_validator_wrapper<unknown>
	      at <unknown source file>:<unknown line>
	   7: FunctionCall1Coll<unknown>
	      at <unknown source file>:<unknown line>
	   8: OidFunctionCall1Coll<unknown>
	      at <unknown source file>:<unknown line>
	   9: ProcedureCreate<unknown>
	      at <unknown source file>:<unknown line>
	  10: CreateFunction<unknown>
	      at <unknown source file>:<unknown line>
	  11: standard_ProcessUtility<unknown>
	      at <unknown source file>:<unknown line>
	  12: pgrx_pg_sys::include::pg14::standard_ProcessUtility::he0425f28cedd16c0
	      at <unknown source file>:<unknown line>
	  13: pgrx_pg_sys::submodules::panic::run_guarded::h2575082d6cd79498
	      at <unknown source file>:<unknown line>
	  14: plrust::hooks::plrust_process_utility_hook::h95d150ca0faf3908
	      at <unknown source file>:<unknown line>
	  15: PortalRun<unknown>
	      at <unknown source file>:<unknown line>
	  16: PostgresMain<unknown>
	      at <unknown source file>:<unknown line>
	  17: PostmasterMain<unknown>
	      at <unknown source file>:<unknown line>
	  18: main<unknown>
	      at <unknown source file>:<unknown line>
	  19: __libc_start_main<unknown>
	      at <unknown source file>:<unknown line>
	
	Run with COLORBT_SHOW_HIDDEN=1 environment variable to disable frame filtering.
	Run with RUST_BACKTRACE=full to include source snippets.
2024-07-16 12:40:58.612 GMT [217] DETAIL:  
	   0: pgrx_pg_sys::submodules::panic::register_pg_guard_panic_hook::{{closure}}
	   1: std::panicking::rust_panic_with_hook
	   2: std::panicking::begin_panic_handler::{{closure}}
	   3: std::sys_common::backtrace::__rust_end_short_backtrace
	   4: rust_begin_unwind
	   5: core::panicking::panic_fmt
	   6: pgrx_pg_sys::submodules::panic::run_guarded
	   7: plrust_validator_wrapper
	   8: FunctionCall1Coll
	   9: OidFunctionCall1Coll
	  10: ProcedureCreate
	  11: CreateFunction
	  12: <unknown>
	  13: standard_ProcessUtility
	  14: pgrx_pg_sys::include::pg14::standard_ProcessUtility
	  15: pgrx_pg_sys::submodules::panic::run_guarded
	  16: plrust::hooks::plrust_process_utility_hook
	  17: <unknown>
	  18: <unknown>
	  19: PortalRun
	  20: <unknown>
	  21: PostgresMain
	  22: <unknown>
	  23: PostmasterMain
	  24: main
	  25: __libc_start_main
	  26: <unknown>

server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

psql# SELECT version();
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                               version                                                               │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PostgreSQL 14.12 (Debian 14.12-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Also, I'm not sure if I understand the error message fully, is it just the one worker process exiting or is this the equivalent of a full postmaster crash? The mention about possibly corrupted shared memory makes me think it's the latter. Unfortunately I can't test this as reliably anymore because the dependencies are now already downloaded and built (because I tested successfully with the statement timeout removed.)

(I realized that I can test this easily by removing our prewarming code, and I can confirm that all workers are crashing.)

Now, the most pressing question I'm facing is how worried I should be about the unclean exit. Could you shed some light on this? I can't think of a foolproof way to prevent this scenario from ever occurring in production so it would be good to know what the impact might be in that case.

Now, the most pressing question I'm facing is whether this is to do with the specific way plrust is set up in our dev env or whether the same could happen in our RDS production environment. I'm going to be able to test this more later today, but perhaps you can shed any light on this?


This incident also brought to mind a whole other question: you'll notice in the logs that plrust is downloading and building external crates, this is of course the reason I ran into the timeout in the first place. However, we're priming the workdir (along these lines) and so this wouldn't be reflecting initial setup but rather it must be an implicit (automatic) upgrade of existing dependencies, if I'm not missing anything. (The plrust function that the logs are for isn't specifying any additional crates by the way.)

The fact that it's downloading a brand-new version of thiserror only released a few days ago (long after I've built the Docker image in question and even longer after plrust 1.27 was released) also speaks to this theory.

If this is caused by automatic upgrades, and not due to a misconfiguration (of the plrust installation) in our dev env, I can see a range of potential problems, from supply chain security, to server instability due to bugs in newer crate versions, to fluctuating duration of migrations, and potentially subtle differences in plrust function behavior depending on when exactly they were compiled.

Would this (automatic upgrading of crate versions) have something to do with the way our Dockerfile is installing plrust in the dev env, or would this also be expected behavior on RDS, say? Is there any configuration we could apply to pin crate versions so that they can only be upgraded manually?

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions