
Durable Runs

A Durable Run survives container crashes. Ninetrix combines Docker restart policies with API-based checkpoint recovery so the agent resumes from the last complete turn automatically.

How to enable
Set AGENTFILE_API_URL and AGENTFILE_RUNNER_TOKEN so the agent can reach the checkpoint API. Run ninetrix dev to start the local stack — checkpointing is automatic via the telemetry API.
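For example, wiring up the two variables before starting the stack might look like this. The URL and token values below are placeholders, not defaults shipped by Ninetrix:

```shell
# Placeholder values — substitute your checkpoint API endpoint and runner token
export AGENTFILE_API_URL="http://localhost:8080"
export AGENTFILE_RUNNER_TOKEN="<your-runner-token>"

# Start the local stack; checkpointing happens automatically via the telemetry API
ninetrix dev
```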

How it works

  1. Crash — agent container exits unexpectedly (OOM, SIGKILL, network failure, unhandled exception)
  2. Docker restarts the container — --restart=on-failure:3 keeps the same AGENTFILE_THREAD_ID env var across restarts
  3. State is loaded — on boot the agent calls GET /v1/runners/threads/{thread_id}/latest to fetch the last checkpoint from the API
  4. History is repaired — _repair_history() detects any incomplete turn left by the crash (an assistant message with tool_calls but no matching tool_result) and rolls back to the last clean turn boundary
  5. Agent resumes — execution continues from the repaired checkpoint; no duplicate work, no data loss

Default behaviour

execution.durability defaults to true. You do not need to add anything — just ensure telemetry is wired up:

agentfile.yaml
agents:
  researcher:
    metadata:
      role: Web researcher
      goal: Research topics and produce reports
    runtime:
      provider: anthropic
      model: claude-sonnet-4-6
    # durability is true by default — no extra config needed

Opt out

Set execution.durability: false to run without the restart policy. Useful for one-shot batch jobs where a mid-crash retry would be incorrect:

agentfile.yaml
agents:
  batch-processor:
    metadata:
      role: Batch processor
      goal: Process a fixed dataset exactly once
    runtime:
      provider: anthropic
      model: claude-sonnet-4-6
    execution:
      durability: false    # container exits on failure, no auto-restart

Warm pool (ninetrix up)

ninetrix up assigns a stable AGENTFILE_THREAD_ID to each agent and persists it in the pool state file (~/.agentfile/pools/<swarm>.json). Restarted containers reuse the same thread ID, so the checkpoint chain is continuous across restarts for the lifetime of the pool.

History repair detail

Each turn records a turn_start_history_len anchor in the checkpoint before any tool calls execute. If the agent crashes after the LLM emits tool_calls but before the corresponding tool_result is written, the history is in an inconsistent state. On resume, _repair_history() detects this and truncates history back to the anchor, discarding the orphaned tool_calls message. The turn is re-executed cleanly.
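A minimal sketch of the repair check, assuming an OpenAI-style message schema (`role`, `tool_calls`, `tool_call_id`); Ninetrix's internal representation may differ:

```python
def repair_history(history: list[dict], turn_start_history_len: int) -> list[dict]:
    """Roll back to the turn-start anchor if the last turn is incomplete."""
    tail = history[turn_start_history_len:]
    # Tool calls issued in this turn...
    pending = {
        call["id"]
        for msg in tail
        if msg.get("role") == "assistant"
        for call in msg.get("tool_calls", [])
    }
    # ...and the results that actually made it into history before the crash.
    resolved = {msg["tool_call_id"] for msg in tail if msg.get("role") == "tool"}
    if pending - resolved:
        # Orphaned tool_calls: discard the partial turn entirely.
        return history[:turn_start_history_len]
    return history
```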

Idempotent tools recommended
If a tool call completed externally before the crash (e.g. a file was written, an email was sent), history repair will re-execute it. Design tools to be idempotent where possible, or use HITL approval gates for irreversible actions.

Restart limit

The Docker restart policy is on-failure:3. After three consecutive failures the container stops. Check ninetrix logs and ninetrix trace to diagnose.

Bash
# Inspect what happened before the crash
ninetrix logs --agent researcher
ninetrix trace --thread-id <thread-id>
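It can also help to ask Docker directly how many automatic restarts have already happened. This uses the standard Docker CLI; the container name is whatever Ninetrix assigned:

```shell
# Show the current restart count for a container (resets when it is recreated)
docker inspect -f '{{.RestartCount}}' <container-name>
```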