GitHub issue: alibaba/OpenSandbox#442
Issue summary:
- Support non-expiring sandboxes
- Let callers manage cleanup explicitly
- Keep existing TTL-based behavior for current users
- Work across Docker and Kubernetes runtimes where supported
Current implementation does not support this. TTL is a hard requirement in:
- API request/response models
- Docker runtime scheduling and restore logic
- Kubernetes workload creation and renew flows
This document captures the recommended refactor direction before implementation starts.
Introduce a manual cleanup mode without adding a new top-level mode field for now.
Chosen semantic:
timeoutpresent: sandbox uses TTL behaviortimeoutomitted ornull: sandbox uses manual cleanup behavior
Non-goals for this refactor:
- Do not support magic values like
timeout=0ortimeout=-1 - Do not redesign the lifecycle API beyond what is required for manual cleanup
- Do not overload
renew_expirationto switch a sandbox from manual mode back to TTL mode
This refactor is compatible through a controlled upgrade path, not through strict protocol backward compatibility.
Important compatibility fact:
- Once manual cleanup is enabled in an environment, lifecycle responses may contain
expiresAt=null - Lifecycle responses may also serialize other nullable fields explicitly as
nullinstead of omitting them - Older SDKs that assume
expiresAtis always a timestamp may fail when they callcreate,get, orlist - Older schema-generated clients may also fail if they assume fields such as
metadata,status.reason,status.message, orstatus.lastTransitionAtare always omitted or always non-null - Existing TTL-based callers are unaffected as long as they do not encounter manual-cleanup sandboxes
Recommended rollout order:
- Upgrade all SDKs/clients that read lifecycle API responses
- Upgrade the server
- Only then start creating sandboxes with
timeoutomitted ornull
Operational rule:
- Do not create manual-cleanup sandboxes in a shared environment until all readers of the lifecycle API have been upgraded
This should be called out explicitly in release notes and upgrade documentation.
Compared with adding expirationMode, using timeout: Optional[int] is the smallest compatible change that still maps cleanly to the feature request.
Advantages:
- Smaller API and SDK surface change
- Easier migration from the current TTL-only model
- Preserves current behavior for existing clients that already send
timeout
Tradeoffs:
- Mode becomes implicit rather than explicit
timeout == nullcan mean either deliberate manual mode or missing input- Future expansion beyond
ttl/manualmay require a second API refactor
For the current scope, these tradeoffs are acceptable.
TTL is currently mandatory.
Relevant files:
server/opensandbox_server/api/schema.pyspecs/sandbox-lifecycle.yml
Current constraints:
CreateSandboxRequest.timeoutis required and bounded to60-86400CreateSandboxResponse.expiresAtis requiredSandbox.expiresAtis requiredRenewSandboxExpirationRequest.expiresAtis required and assumes the sandbox already has TTL semantics
Relevant file:
server/opensandbox_server/services/docker.py
Current behavior:
- Creation always computes
expires_at = created_at + timeout - Creation always schedules expiration via in-process timer
- Existing sandboxes are restored from the expiration label on server startup
- Sandbox read/list responses always expose
expiresAt renew_expiration()only supports extending TTL
Relevant files:
server/opensandbox_server/services/k8s/kubernetes_service.pyserver/opensandbox_server/services/k8s/batchsandbox_provider.pyserver/opensandbox_server/services/k8s/agent_sandbox_provider.py
Current behavior:
- Creation always computes
expires_at = created_at + timeout - BatchSandbox writes
spec.expireTime - agent-sandbox writes
spec.shutdownTime renew_expiration()patches those fields- Sandbox read/list responses expose
expiresAt
CreateSandboxRequest.timeout should become optional.
Rules:
timeoutomitted ornullmeans manual cleanup modetimeoutpresent means TTL mode- If present,
timeoutmust still satisfy60 <= timeout <= 86400 timeout=0andtimeout<0remain invalid
Suggested request examples:
TTL mode:
{
"image": { "uri": "python:3.11" },
"timeout": 3600,
"resourceLimits": {},
"entrypoint": ["sleep", "infinity"]
}Manual cleanup mode:
{
"image": { "uri": "python:3.11" },
"resourceLimits": {},
"entrypoint": ["sleep", "infinity"]
}expiresAt should become nullable in:
CreateSandboxResponseSandbox
Rules:
- TTL sandbox:
expiresAtcontains an RFC 3339 timestamp - Manual sandbox:
expiresAtisnull
Do not use renew_expiration as a mode switch.
Recommended behavior:
- TTL sandbox: renew works as it does today
- Manual sandbox: renew fails clearly
Recommended response:
409 Conflictpreferred400 Bad Requestacceptable if existing error handling makes that much simpler
Recommended error message:
"Sandbox <id> does not have automatic expiration enabled."
Files to update:
server/opensandbox_server/api/schema.pyspecs/sandbox-lifecycle.yml
Required changes:
- Make
CreateSandboxRequest.timeoutoptional - Make
CreateSandboxResponse.expiresAtoptional - Make
Sandbox.expiresAtoptional - Update field descriptions to document manual cleanup behavior
- Update request/response examples in the OpenAPI spec
Recommended validation rule:
- No custom mode field
- Validation only enforces bounds when
timeoutis notNone
File to update:
server/opensandbox_server/services/docker.py
For manual sandboxes:
- No expiration timestamp is computed
- No expiration label is written
- A dedicated runtime marker should be written (for example
opensandbox.io/manual-cleanup=true) - No expiration timer is scheduled
- Sandbox survives server restart without restoration warnings
- Read/list responses return
expiresAt=None
Current logic:
_prepare_creation_context()always returns a concreteexpires_at
Target logic:
- Return
expires_at: Optional[datetime] Nonewhenrequest.timeout is None
Current logic:
- Expiration label is assumed to exist
Target logic:
- Only write
SANDBOX_EXPIRES_AT_LABELwhenexpires_at is not None - Write a dedicated manual-cleanup label/annotation when
expires_at is None
Current logic:
_provision_sandbox()always schedules expiration
Target logic:
- Only call
_schedule_expiration()whenexpires_at is not None
Current logic:
_container_to_sandbox()falls back to a concreteexpires_at
Target logic:
- Manual sandbox should produce
expiresAt=None - Avoid fallback behavior that fabricates an expiration timestamp from
created_at
Current logic:
_restore_existing_sandboxes()warns when a sandbox is missing the expiration label
Target logic:
- Missing expiration label should only be treated as valid when the manual-cleanup marker is present
- Continue warning on sandboxes that have neither an expiration label nor a manual-cleanup marker
- Only restore timers for TTL sandboxes that actually carry expiration metadata
Current logic:
renew_expiration()assumes every sandbox has TTL enabled
Target logic:
- Reject renewal if the manual-cleanup marker is present
- Continue treating "missing expiration metadata without manual marker" as malformed state rather than silently converting it to manual mode
Files to update:
server/opensandbox_server/services/k8s/kubernetes_service.pyserver/opensandbox_server/services/k8s/workload_provider.pyserver/opensandbox_server/services/k8s/batchsandbox_provider.pyserver/opensandbox_server/services/k8s/agent_sandbox_provider.py
Kubernetes support depends on the underlying CRDs.
Open question:
- Can BatchSandbox omit
spec.expireTime? - Can agent-sandbox omit
spec.shutdownTime?
This must be confirmed before claiming end-to-end support.
Add a provider capability check:
supports_manual_cleanup() -> bool
Persist the chosen mode on workload metadata as well:
- TTL sandbox: keep expiration field populated
- Manual sandbox: omit expiration field and write a provider-neutral marker (label or annotation)
Rationale:
- Docker can support manual cleanup immediately
- Kubernetes providers may differ based on CRD semantics
- The server should fail clearly when the selected provider cannot represent a non-expiring sandbox
In KubernetesSandboxService.create_sandbox():
- Compute
expires_at: Optional[datetime] - If
request.timeout is Noneand provider does not support manual cleanup, fail early with a clear message
Suggested message:
"Manual cleanup mode is not supported by the current Kubernetes workload provider."
If supported by the CRD:
- Make
expires_atoptional in provider interfaces - Omit
spec.expireTimewhenexpires_at is None get_expiration()should returnNonewhen the field is absentupdate_expiration()should reject manual sandboxes instead of silently enabling TTL
If not supported by the CRD:
- Return
Falsefromsupports_manual_cleanup() - Keep current
expireTimebehavior unchanged
If supported by the CRD:
- Make
expires_atoptional in provider interfaces - Omit
spec.shutdownTimewhenexpires_at is None get_expiration()should returnNonewhen the field is absentupdate_expiration()should reject manual sandboxes
If not supported by the CRD:
- Return
Falsefromsupports_manual_cleanup() - Keep current
shutdownTimebehavior unchanged
Files likely affected:
server/opensandbox_server/services/sandbox_service.pyserver/opensandbox_server/services/k8s/workload_provider.py
Required updates:
- Any method signature currently assuming
expires_at: datetimeshould be reviewed - Provider creation/update/get-expiration flows should allow
Optional[datetime]where needed - Abstract service docs should describe manual cleanup semantics
Recommended failure cases:
Case:
- User omits
timeout - Provider cannot represent non-expiring sandbox
Response:
- HTTP 400
Message:
"Manual cleanup mode is not supported by the current runtime/provider."
Response:
- HTTP 409 preferred
Message:
"Sandbox <id> does not have automatic expiration enabled."
Keep current behavior:
- Reject
timeout=0 - Reject negative values
- Reject values above max bound
This refactor should preserve backward compatibility for current users.
Expected compatibility behavior:
- Existing clients sending
timeoutcontinue to work unchanged - Existing responses for TTL sandboxes remain unchanged
- New manual-cleanup behavior is opt-in via omission of
timeout
Compatibility caveat:
- Any generated SDKs may need regeneration because
timeoutandexpiresAttypes change from required to optional - Generated SDKs should also tolerate explicit
nullvalues in optional lifecycle fields, not only missing fields - Cross-SDK request shapes do not need to be byte-for-byte identical if language constraints differ. In particular, the
C# SDK may use an explicit
ManualCleanupflag instead oftimeout=nullso it can keep "unset means use default TTL" distinct from "explicitly request manual cleanup".
Files likely affected:
server/tests/test_schema.py- route tests covering create/get/list/renew
Add coverage for:
- Create request without
timeout - Create request with valid
timeout - Reject
timeout=0 - Create response with
expiresAt=null - Sandbox model with
expiresAt=null
File likely affected:
server/tests/test_docker_service.py
Add coverage for:
- Manual sandbox creation does not schedule expiration
- Manual sandbox creation does not write expiration label
- Manual sandbox get/list returns
expiresAt=None - Server restart restore path ignores manual sandboxes without warning
- Renew expiration on manual sandbox fails clearly
- TTL sandbox behavior remains unchanged
Files likely affected:
server/tests/k8s/test_kubernetes_service.pyserver/tests/k8s/test_batchsandbox_provider.pyserver/tests/k8s/test_agent_sandbox_provider.py
Add coverage for:
- Manual mode rejected when provider capability is false
- Manual mode omits expiration fields when provider capability is true
- Manual mode writes the runtime marker when provider capability is true
get_expiration()returnsNonewhen expiration field is absent- Renew expiration fails for manual sandboxes
- TTL sandbox behavior remains unchanged
Follow-up checks:
- Regenerate or validate OpenAPI docs if needed
- Verify generated SDKs handle optional
timeoutand nullableexpiresAt
- Update schema models in
server/opensandbox_server/api/schema.py - Update OpenAPI spec in
specs/sandbox-lifecycle.yml - Refactor Docker runtime to support
expires_at: Optional[datetime] - Add Kubernetes provider capability plumbing
- Implement Kubernetes manual mode only where confirmed supported
- Add and update tests
- Regenerate SDK/spec artifacts if required by repo workflow
These should be resolved early in the branch:
- Does BatchSandbox allow
spec.expireTimeto be omitted? - Does agent-sandbox allow
spec.shutdownTimeto be omitted? - Should renew-on-manual return
400or409? - Should list/get expose any explicit hint that a sandbox is manual, or is
expiresAt=nullsufficient?
Recommended implementation default for questions 1 and 2 until confirmed:
- Return
Falsefromsupports_manual_cleanup()for both Kubernetes providers - Enable Kubernetes manual mode only after CRD behavior is verified by tests or upstream documentation
Recommended answer for question 4:
expiresAt=nullis sufficient for the first iteration
The smallest practical refactor is:
- Make
timeoutoptional - Treat missing
timeoutas manual cleanup mode - Make
expiresAtnullable - Support manual mode in Docker immediately
- Gate Kubernetes support behind provider capability and CRD validation
- Keep
renew_expiration()TTL-only
This preserves current behavior while creating a clear path to non-expiring sandboxes with limited API churn.