Set RUST_BACKTRACE=1 for production services: create a crate and use it #5360

Open
@sunshowers

While debugging an instance of #2416, I found the following in gc08's log at /pool/ext/8a199f12-4f5c-483a-8aca-f97856658a35/crypt/debug/oxz_nexus_65a11c18-7f59-41ac-b9e7-680627f996e7/oxide-nexus:default.log.1711665000:

thread 'tokio-runtime-worker' panicked at nexus/db-queries/src/db/sec_store.rs:65:60:
called `Result::unwrap()` on an `Err` value: InternalError { internal_message: "database error (kind = Unknown): result is ambiguous: error=rpc error: code = Unavailable desc = error reading from server: read tcp [fd00:1122:3344:109::3]:56722->[fd00:1122:3344:105::3]:32221: read: connection reset by peer [exhausted]\n" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Mar 28 22:01:58 Stopping because all processes in service exited. ]
[ Mar 28 22:01:58 Executing stop method (:kill). ]

In this case the issue is pretty clear, but I'm wondering whether we've considered setting RUST_BACKTRACE=1 in our production environment. Backtraces can definitely aid debugging, though maybe it isn't a big deal because the core file can show what's going on. (But see #5359.)

According to https://stackoverflow.com/questions/29421727/how-much-overhead-does-rust-backtrace-1-have, there seems to be some performance cost, so we'd have to measure it carefully.
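As a starting point for that measurement, here is a rough sketch that times raw backtrace captures using `std::backtrace::Backtrace::force_capture`. The function names `time_backtrace_captures` and `mean_capture_cost` are hypothetical, not anything that exists in the repo. It only measures the capture itself from a shallow stack, not the cost of printing a backtrace, so real numbers from inside a service would likely differ.

```rust
use std::backtrace::Backtrace;
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: times `iterations` forced backtrace captures and
/// returns the total elapsed time.
pub fn time_backtrace_captures(iterations: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        // `force_capture` captures regardless of RUST_BACKTRACE and
        // RUST_LIB_BACKTRACE, so the measurement does not depend on the
        // environment the benchmark runs in.
        // `black_box` keeps the optimizer from discarding the capture.
        black_box(Backtrace::force_capture());
    }
    start.elapsed()
}

/// Hypothetical helper: mean cost of a single capture.
/// Panics if `iterations` is zero.
pub fn mean_capture_cost(iterations: u32) -> Duration {
    assert!(iterations > 0, "need at least one iteration");
    time_backtrace_captures(iterations) / iterations
}
```

Note that with RUST_BACKTRACE=1 this cost should only be paid when a panic occurs or when a library explicitly captures a backtrace (for example, when constructing an error), so the interesting question is how often those paths run in practice.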

Wonder if @hawkw has thoughts here.

### Tasks
- [ ] Create a small crate to set RUST_BACKTRACE=1 if it isn't set already (and maybe RUST_LIB_BACKTRACE as well)
- [ ] Use the crate in nexus
- [ ] Use the crate in sled-agent
- [ ] Use it in wicketd
- [ ] Use it elsewhere (add tasks for other services that would benefit)
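A minimal sketch of what the small crate from the first task could look like. The names `set_env_default` and `enable_backtrace_by_default` are hypothetical, and this is only one possible shape for it. The key property is that an explicit operator setting (including `RUST_BACKTRACE=0`) is never overridden.

```rust
use std::env;

/// Hypothetical helper: sets `name` to `value` only if `name` is not already
/// present in the environment. Returns `true` if this call set the variable.
///
/// A variable that is present but empty counts as "already set", so whatever
/// the operator configured is left alone.
#[allow(unused_unsafe)]
pub fn set_env_default(name: &str, value: &str) -> bool {
    if env::var_os(name).is_some() {
        return false;
    }
    // `env::set_var` is an `unsafe fn` in the 2024 edition because mutating
    // the environment can race with other threads reading it. The `unsafe`
    // block is redundant (but harmless) on earlier editions.
    // Assumption: callers invoke this at the very start of `main`, before
    // spawning threads or starting an async runtime.
    unsafe { env::set_var(name, value) };
    true
}

/// Hypothetical entry point: enables panic backtraces unless the operator
/// has already chosen a value for RUST_BACKTRACE.
pub fn enable_backtrace_by_default() -> bool {
    set_env_default("RUST_BACKTRACE", "1")
}
```

Each service would call `enable_backtrace_by_default()` as the first statement in `main`, before any panic can occur. RUST_LIB_BACKTRACE could be handled with the same `set_env_default` helper once we decide what its default should be.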
