Durable Runs
A Durable Run survives container crashes. Ninetrix combines Docker restart policies with API-based checkpoint recovery so the agent resumes from the last complete turn automatically.
The runner injects `AGENTFILE_API_URL` and `AGENTFILE_RUNNER_TOKEN` so the agent can reach the checkpoint API. Run `ninetrix dev` to start the local stack; checkpointing is automatic via the telemetry API.
How it works
- Crash — the agent container exits unexpectedly (OOM, SIGKILL, network failure, unhandled exception)
- Docker restarts the container — `--restart=on-failure:3` keeps the same `AGENTFILE_THREAD_ID` env var across restarts
- State is loaded — on boot the agent calls `GET /v1/runners/threads/{thread_id}/latest` to fetch the last checkpoint from the API
- History is repaired — `_repair_history()` detects any incomplete turn left by the crash (an assistant message with `tool_calls` but no matching `tool_result`) and rolls back to the last clean turn boundary
- Agent resumes — execution continues from the repaired checkpoint; no duplicate work, no data loss
Default behaviour
`execution.durability` defaults to `true`. You do not need to add anything — just ensure telemetry is wired up:
```yaml
agents:
  researcher:
    metadata:
      role: Web researcher
      goal: Research topics and produce reports
    runtime:
      provider: anthropic
      model: claude-sonnet-4-6
    # durability is true by default — no extra config needed
```
Opt out
Set `execution.durability: false` to run without the restart policy. Useful for one-shot batch jobs where a mid-crash retry would be incorrect:
```yaml
agents:
  batch-processor:
    metadata:
      role: Batch processor
      goal: Process a fixed dataset exactly once
    runtime:
      provider: anthropic
      model: claude-sonnet-4-6
    execution:
      durability: false  # container exits on failure, no auto-restart
```
Warm pool (ninetrix up)
`ninetrix up` assigns a stable `AGENTFILE_THREAD_ID` to each agent and persists it in the pool state file (`~/.agentfile/pools/<swarm>.json`). Restarted containers reuse the same thread ID, so the checkpoint chain is continuous across restarts for the lifetime of the pool.
History repair detail
Each turn records a `turn_start_history_len` anchor in the checkpoint before any tool calls execute. If the agent crashes after the LLM emits `tool_calls` but before the corresponding `tool_result` is written, the history is in an inconsistent state. On resume, `_repair_history()` detects this and truncates history back to the anchor, discarding the orphaned `tool_calls` message. The turn is re-executed cleanly.
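The anchor-and-truncate logic can be sketched like this. The message dicts with `role`/`tool_calls` keys are illustrative; only the detect-orphan-then-roll-back behaviour comes from the text above:

```python
def repair_history(history: list[dict], turn_start_history_len: int) -> list[dict]:
    """Roll back an interrupted turn.

    If the last message is an assistant message carrying tool_calls, the
    matching tool_result was never written: the crash happened mid-turn.
    Truncate back to the anchor recorded at turn start so the whole turn
    re-executes cleanly.
    """
    if history and history[-1].get("role") == "assistant" and history[-1].get("tool_calls"):
        return history[:turn_start_history_len]  # discard the orphaned tool_calls message
    return history  # last turn completed; nothing to repair
```

Because the anchor is written before any tool executes, truncating to it always lands on a clean turn boundary.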
Restart limit
The Docker restart policy is `on-failure:3`. After three consecutive failures the container stops. Check `ninetrix logs` and `ninetrix trace` to diagnose.
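The `on-failure:3` policy and the stable thread ID map directly onto Docker's standard container options. A sketch of the keyword arguments a runner might pass to docker-py's `containers.run` (the wiring is an assumption; the policy values and env var are from this page):

```python
def container_config(thread_id: str, durable: bool = True) -> dict:
    """Keyword arguments for docker.from_env().containers.run(image, **cfg).

    With durability on, Docker retries up to three consecutive failures while
    AGENTFILE_THREAD_ID stays fixed, so each restart resumes the same thread.
    """
    cfg: dict = {
        "environment": {"AGENTFILE_THREAD_ID": thread_id},
        "detach": True,
    }
    if durable:
        # Equivalent to `docker run --restart=on-failure:3`
        cfg["restart_policy"] = {"Name": "on-failure", "MaximumRetryCount": 3}
    return cfg
```

With `durable=False` the restart policy is simply omitted, matching the `execution.durability: false` behaviour: the container exits on failure and is not restarted.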
```shell
# Inspect what happened before the crash
ninetrix logs --agent researcher
ninetrix trace --thread-id <thread-id>
```