
[Bug]: Orchestrator memory leak #2029

@agiping

Description

Sandbox ID or Build ID

No response

Environment

E2B version: Community release 2026.09

  • Deployment: Self-hosted on bare-metal (1 API node + 5 sandbox nodes)
  • OS: Linux (kernel 5.x)
  • Workload: ~100-700 concurrent sandboxes, continuous creation and destruction

Timestamp of the issue

2026-03-01 23:27 UTC

Frequency

One-time occurrence

Expected behavior

The orchestrator's memory usage remains stable: mmap regions, NBD devices, and on-disk sandbox state are released when a sandbox is destroyed.

Actual behavior

  • Orchestrator's memory grows monotonically
  • After 2 days: 37.2 GB RSS, approaching OOM kill threshold (40 GB limit)
  • Only 23 running VMs on one of the sandbox nodes, but 382 NBD devices occupied and 394 mmap'd rootfs regions
  • 3,983 threads (goroutine leak?)
  • Sandbox directories and rootfs CoW files remain on disk after VM destruction

On one of the sandbox nodes, the diagnostics are as follows:

smaps_rollup (/proc/<orchestrator_pid>/smaps_rollup)

Rss: 39017532 kB (~37.2 GB)
Pss_Anon: 6588208 kB (~6.3 GB) ← Go heap + goroutine stacks
Pss_File: 32427604 kB (~30.9 GB) ← leaked mmap'd rootfs files
Private_Clean: 32790556 kB (~31.3 GB)
AnonHugePages: 2910208 kB (~2.8 GB)

30.9 GB (83%) is file-backed mmap from rootfs/snapshot files that were never unmapped.

NBD device leak

382 NBD devices show a kernel PID, but only 23 VMs are running

$ ls /sys/block/nbd*/pid | wc -l
382

Only 23 child firecracker processes

$ pgrep -P <orchestrator_pid> | wc -l
23

Memory maps

/proc/<orchestrator_pid>/maps shows hundreds of mmap entries for sandbox rootfs paths like:

7521b2e00000-752c520d0000 rw-s ... /orchestrator/sandbox/rootfs/ids7ddu137jqgfb3ayfe-...cow
7536f1200000-754191660000 rw-s ... /orchestrator/sandbox/rootfs/ikwi2ibau1z3a2q80u5ow-...cow
...

These correspond to sandboxes that were destroyed long ago but whose mmap regions were never released.

Sandbox directory residue

Hundreds of sandbox directories remain on disk after VM destruction

$ ls /orchestrator/sandbox/ | wc -l
396 # but only 23 VMs running

Issue reproduction

Reproduction

  1. Deploy orchestrator 2026.09 on a single API node + 5 sandbox nodes
  2. Continuously create and destroy sandboxes (100 - 700 concurrent, sustained over hours)
  3. Monitor nomad alloc status xxxx-id-of-orchestrator-alloc
  4. Compare ls /sys/block/nbd*/pid | wc -l vs actual running VMs (pgrep firecracker | wc -l)

Additional context

The orchestrator process leaks memory continuously, growing from ~200 MB to 37+ GB RSS within 2 days on a single-node deployment running community release 2026.09. 83% of the leaked memory is file-backed mmap regions from sandbox rootfs/snapshot files that are never unmapped after sandbox destruction.

Metadata

Labels: bug (Something isn't working)