`agent.yaml` schema

Every declarative runtime lives or dies by its spec. agent.yaml is that spec.

This page will be auto-generated from the Zod schema in slice 8. Until then it is a hand-curated subset of the frozen fields. The complete per-field table (with types, defaults, since markers, and examples) lands when slice 8 wires the scripts/docs-schema-extract.ts generator into CI.

Required fields (v1.0)

Pulled from AgentSpec in packages/core/src/types/session.ts:

Field	Type	Required?	Description
`name`	`string`	yes	Agent identity. Must be unique per tenant. Appears in metrics + logs.
`model`	`string`	yes	Model id as the provider understands it (`claude-sonnet-4-5`, `openai/gpt-4o-mini`, etc.).
`systemPrompt`	`string` (multi-line)	yes	Baseline instructions. Loaded once per session.
`temperature`	`number`	no	Provider sampling temperature. Defaults to the provider's own default (often `0.7`).
`maxTokens`	`number`	no	Per-turn cap. Defaults to the provider's ceiling.
`subagentDepthCap`	`number`	no	Max depth for sub-agent spawning. Defaults to `3`.
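
The required/optional split in the table can be sketched as a plain TypeScript check. This is a minimal illustration, not the real Zod schema: the names AgentSpecSketch, checkAgentSpec, and withDefaults are hypothetical, and the actual AgentSpec in packages/core/src/types/session.ts may differ.

```typescript
// Hypothetical mirror of the six fields in the table above.
interface AgentSpecSketch {
  name: string;
  model: string;
  systemPrompt: string;
  temperature?: number;
  maxTokens?: number;
  subagentDepthCap?: number;
}

// Returns a list of human-readable problems; an empty list means the
// object satisfies the required/optional split documented above.
function checkAgentSpec(raw: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const key of ["name", "model", "systemPrompt"]) {
    if (typeof raw[key] !== "string") {
      problems.push(`${key}: required string is missing`);
    }
  }
  for (const key of ["temperature", "maxTokens", "subagentDepthCap"]) {
    const value = raw[key];
    if (value !== undefined && typeof value !== "number") {
      problems.push(`${key}: must be a number when present`);
    }
  }
  return problems;
}

// Only subagentDepthCap has a documented runtime default (3). temperature
// and maxTokens stay undefined so the provider's own defaults apply.
function withDefaults(spec: AgentSpecSketch): AgentSpecSketch {
  return { ...spec, subagentDepthCap: spec.subagentDepthCap ?? 3 };
}
```

A spec that omits systemPrompt yields exactly one problem entry, and a spec that omits subagentDepthCap comes back from withDefaults with the value 3.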

Example

name: concierge
model: claude-sonnet-4-5
temperature: 0
maxTokens: 2048
subagentDepthCap: 2
systemPrompt: |
  You are a Slack concierge for an engineering team. Answer questions
  about the local repository and refuse to take external actions.
skills:
  - skills/concierge.md
tools:
  defaults:
    - Read
    - Glob
    - Grep

Full examples for every template live under templates/ in the repo.

`tools.rateLimit` — per-tool throttle

Added 0.6.x (Enterprise Production Plan §3 Item #7).

The provider-level limiter caps how fast Declaragent talks to the LLM. That's not enough for a tool — especially Bash, which can shell out to curl and hammer a downstream API at whatever rate the LLM's tool-use loop can generate. tools.rateLimit adds a per-tool token bucket that fires before each invocation of tool.execute(). When the bucket is empty, the engine sleeps cooperatively until a token frees; the LLM turn stays alive and the next tool_use block in the same response simply waits its turn.

name: concierge
model: claude-sonnet-4-5
systemPrompt: |
  You are a careful concierge. Prefer Read/Grep over Bash.
tools:
  defaults:
    - Read
    - Grep
    - Bash
  rateLimit:
    # Cap shell execution to 1 call/sec, no burst. A 10-call burst
    # through this tool will take ~9 seconds end-to-end.
    Bash:
      rps: 1
    # MCP tool names are namespaced `mcp__<server>__<tool>`.
    mcp__github__list_issues:
      rps: 5
      burst: 10

Fields.

Field	Type	Required?	Description
`rps`	`number` (`> 0`)	yes	Steady-state rate in calls per second. Fractional values accepted (`0.5` = one call every two seconds).
`burst`	`number` (`> 0`)	no	Max calls absorbed without throttling. Defaults to `rps`. Set to `2 * rps` for the classic token-bucket "1-second absorb" pattern.
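
To make the rps / burst arithmetic concrete, here is a minimal token-bucket sketch. It is an assumption about how such a gate could work, not the engine's actual implementation; TokenBucketSketch and RateLimitEntry are hypothetical names, and time is passed in explicitly so the example stays deterministic.

```typescript
// Hypothetical shape of one tools.rateLimit entry.
interface RateLimitEntry {
  rps: number;
  burst?: number;
}

class TokenBucketSketch {
  private readonly rps: number;
  private readonly capacity: number;
  private tokens: number;
  private lastMs: number;

  constructor(entry: RateLimitEntry, nowMs: number) {
    this.rps = entry.rps;
    this.capacity = entry.burst ?? entry.rps; // documented default: burst = rps
    this.tokens = this.capacity;
    this.lastMs = nowMs;
  }

  // Books one token and returns how long (in ms) the caller should sleep
  // before running. The balance may go negative: each queued caller
  // waits behind the ones that reserved before it.
  reserve(nowMs: number): number {
    const elapsedMs = Math.max(0, nowMs - this.lastMs);
    this.tokens = Math.min(this.capacity, this.tokens + (elapsedMs * this.rps) / 1000);
    this.lastMs = Math.max(this.lastMs, nowMs);
    this.tokens -= 1;
    return this.tokens >= 0 ? 0 : Math.ceil((-this.tokens * 1000) / this.rps);
  }
}

// Ten Bash calls requested at the same instant with `rps: 1`, no burst.
const bash = new TokenBucketSketch({ rps: 1 }, 0);
const waits: number[] = [];
for (let i = 0; i < 10; i++) waits.push(bash.reserve(0));
console.log(waits[0], waits[9]); // prints: 0 9000
```

The first call runs immediately and the tenth waits 9000 ms, which matches the "~9 seconds end-to-end" comment in the YAML example. With rps: 5 and burst: 10, the first ten calls pass without waiting and the eleventh waits 200 ms.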

Observability.

Every stall increments declaragent_tool_rate_limit_waits_total{agent,tool} and adds its observed wait to declaragent_tool_rate_limit_wait_ms{agent,tool} on the /metrics endpoint.
Waits that exceed 1 s emit a rate_limited audit record on the hash chain (tenant id, session id, tool name, configured rps/burst, observed wait). Short burst-absorption waits are silent to avoid chain bloat.

Scope.

Tools not listed under rateLimit bypass the gate entirely — no overhead for unconfigured tools.
Per-user / per-tenant ceilings are not yet supported; see Enterprise Production Plan #5 multi-tenant quota work.
Dynamic rate adjustment based on 429 response headers is not supported; set a conservative rps up front.

`rpc.auth` — OIDC / OAuth2 envelope authentication

Added 0.7.x (Enterprise Production Plan §3 Item #4).

Every inter-agent RPC envelope carries an optional auth block. Before 0.7.x, the up daemon ignored the block and trusted every envelope (the legacy internal/hmac path). Flipping the switch:

rpc:
  auth:
    enabled: true      # default false — legacy trust path stays on

...tells the up daemon to walk rpc-peers.yaml, build an AuthVerifyRegistry from every peer with an auth: block, and pass it into createAgentInboxAdapter({ authRegistry }) at boot. From that point on, envelopes from a registered peer must carry a matching OIDC / OAuth2 bearer token; an envelope with a bad, missing, or expired token is routed to the DLQ under kind=rejected, reason=auth-rejected.

Defaults.

rpc.auth.enabled defaults to false. Enabling it for every fleet on day one would break every operator who has not yet staged a peer auth: block in rpc-peers.yaml. The transition is: stage the peer configs, flip enabled: true per agent, monitor auth_check audit records, then decommission the legacy path on the peer side.
Peers without an auth: block in rpc-peers.yaml still follow the legacy path regardless of this toggle — auth is opt-in per peer.
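
The two defaults above combine into a small decision table. The sketch below captures it under stated assumptions: PeerEntrySketch and decidePeerAuth are hypothetical names, and the real AuthVerifyRegistry lookup is not shown in this page.

```typescript
type PeerAuthDecision = "legacy-trust" | "verify-token";

// Hypothetical, trimmed view of one rpc-peers.yaml entry.
interface PeerEntrySketch {
  agent: string;
  auth?: { provider: "oidc" | "oauth2-client" };
}

// Verification applies only when BOTH conditions hold:
//   1. rpc.auth.enabled is true in agent.yaml, and
//   2. the peer carries an auth: block in rpc-peers.yaml.
function decidePeerAuth(rpcAuthEnabled: boolean, peer: PeerEntrySketch): PeerAuthDecision {
  if (!rpcAuthEnabled) return "legacy-trust";
  if (peer.auth === undefined) return "legacy-trust";
  return "verify-token";
}
```

For a peer with an auth: block, flipping rpc.auth.enabled from false to true changes the decision from legacy-trust to verify-token; a peer without the block stays on legacy-trust either way.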

Peer config (in rpc-peers.yaml, not agent.yaml):

version: 1
peers:
  - agent: agent://peer-b
    transports:
      - kind: kafka
        brokers: [kafka-1.acme.example:9092]
        topics:
          requests: agents.peer-b.requests
    auth:
      provider: oidc                    # or 'oauth2-client'
      issuer: https://dex.acme.example
      audience: peer-a
      jwksUri: https://dex.acme.example/keys   # optional — derived from issuer discovery
      scopes: [agents:call]             # optional — strict AND check on verify
  - agent: agent://peer-c
    transports:
      - kind: nats
        servers: [nats://nats.acme.example:4222]
        subjects:
          requests: agents.peer-c.requests
    auth:
      provider: oauth2-client
      tokenEndpoint: https://idp.acme.example/oauth2/token
      clientId: decl-peer-a
      clientSecretRef: secret://platform/decl-peer-a-client-secret
      jwksUri: https://idp.acme.example/keys
      issuer: https://idp.acme.example
      audience: peer-c
      scopes: [agents:call]

clientSecretRef resolves through the Phase-6 secret resolver — secret:// / env: / file: / vault: / aws-sm: / gcp-sm: all work.
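
The scopes comment in the peer config describes a strict AND check. A minimal sketch of that check follows; hasAllScopes and scopesFromClaim are hypothetical helpers, and reading scopes from a space-delimited scope claim is an assumption based on common OAuth2 practice, not something this page confirms about the verifier.

```typescript
// Strict AND: every configured scope must be present on the token.
function hasAllScopes(required: string[], granted: string[]): boolean {
  const have = new Set(granted);
  return required.every((scope) => have.has(scope));
}

// Assumption: the token exposes scopes as one space-delimited string.
function scopesFromClaim(scopeClaim: string): string[] {
  return scopeClaim.split(/\s+/).filter((scope) => scope.length > 0);
}
```

With scopes: [agents:call] configured, a token granting "agents:call agents:admin" passes, while a token granting only "agents:read" fails.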

Observability.

Every envelope emits an auth_check audit record with the decision (accept / reject), provider name (oidc / oauth2-client), resolved subject, and the rejection reason when applicable.
Rejected envelopes also land in rejected_events under kind=rejected, reason=auth-rejected, so declaragent dlq list --kind rejected surfaces the specific peer + reason.

`controlPlane.auth` — OIDC / OAuth2 on the HTTP listener

Added 0.7.x (Enterprise Production Plan §3 Item #5, Slice 2 of docs/CONTROL_PLANE_PLAN.md).

The control-plane listener (/metrics, /status, /events, /dlq, /audit, /logs) binds to 127.0.0.1:9464 by default. Enable this block and remote callers (a reverse-proxied declaragent fleet ps, a Prometheus federation) must present a valid Bearer token:

controlPlane:
  auth:
    enabled: true                           # default false (back-compat)
    provider: oidc                          # or 'oauth2-client'
    issuer: "https://dex.acme.example"
    audience: "declaragent-control-plane"
    jwksUri: "https://dex.acme.example/keys"    # optional — derived from issuer discovery
    scopes: ["control:read"]                # strict AND — every scope required
    # allowLoopback: false                  # optional — default true; flip for zero-trust

Defaults.

controlPlane.auth.enabled defaults to false. Legacy single-host fleets keep working unchanged.
allowLoopback defaults to true: requests whose Host header is 127.0.0.1 / localhost bypass the verifier so same-host curls and declaragent ps work without a token. Zero-trust deployments set allowLoopback: false to require a token on every request.
When enabled: true, the up daemon implicitly sets the listener's allowRemote to true; otherwise non-loopback requests would hit the pre-auth 403 before the middleware ever runs. The bind address stays 127.0.0.1; a reverse proxy is expected to front the port.
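
The interaction between enabled and allowLoopback can be sketched as a single predicate. This is an illustration only: requiresToken and isLoopbackHost are hypothetical names, and stripping a trailing port from the Host header is an assumption about how the comparison is done.

```typescript
// Hypothetical, trimmed view of the controlPlane.auth block.
interface ControlPlaneAuthSketch {
  enabled: boolean;
  allowLoopback?: boolean;
}

function isLoopbackHost(hostHeader: string | undefined): boolean {
  if (hostHeader === undefined) return false;
  // Assumption: "127.0.0.1:9464" and "127.0.0.1" are treated alike.
  const host = hostHeader.replace(/:\d+$/, "").toLowerCase();
  return host === "127.0.0.1" || host === "localhost";
}

function requiresToken(cfg: ControlPlaneAuthSketch, hostHeader: string | undefined): boolean {
  if (!cfg.enabled) return false; // default: auth off, back-compat
  const allowLoopback = cfg.allowLoopback ?? true; // documented default
  return !(allowLoopback && isLoopbackHost(hostHeader));
}
```

With enabled: true and defaults otherwise, a request for Host 127.0.0.1:9464 needs no token while one for a proxied hostname does; setting allowLoopback: false makes both require a token.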

OAuth2 Client-Credentials uses the same shape as rpc-peers.yaml#auth for that flow:

controlPlane:
  auth:
    enabled: true
    provider: oauth2-client
    tokenEndpoint: "https://idp.acme.example/oauth/token"
    clientId: "declaragent-control-plane"
    clientSecretRef: "env:CP_CLIENT_SECRET"  # also accepts file:/secret: refs
    audience: "declaragent-control-plane"
    issuer: "https://idp.acme.example/"
    jwksUri: "https://idp.acme.example/.well-known/jwks.json"
    scopes: ["control:read"]

Rejection reasons. Failed requests return 401 with a typed body { "error": "...", "reason": "..." }. The reason vocabulary (missing-token, malformed-token, bad-signature, expired, wrong-issuer, wrong-audience, insufficient-scope, idp-unreachable, provider-failed) is stable so remote CLIs can branch without parsing free-form error text.
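
Because the reason vocabulary is stable, a remote client can model it as a closed union. The sketch below shows one way to do that; parseReason and suggestAction are hypothetical, and the mapping from reason to action is only an example client policy, not runtime behaviour.

```typescript
const REJECTION_REASONS = [
  "missing-token",
  "malformed-token",
  "bad-signature",
  "expired",
  "wrong-issuer",
  "wrong-audience",
  "insufficient-scope",
  "idp-unreachable",
  "provider-failed",
] as const;

type RejectionReason = (typeof REJECTION_REASONS)[number];

// Extracts a known reason from a 401 body, or undefined if the body is
// not JSON or carries a reason outside the documented vocabulary.
function parseReason(bodyText: string): RejectionReason | undefined {
  let parsed: unknown;
  try {
    parsed = JSON.parse(bodyText);
  } catch {
    return undefined;
  }
  if (typeof parsed !== "object" || parsed === null) return undefined;
  const reason = (parsed as { reason?: unknown }).reason;
  return REJECTION_REASONS.find((known) => known === reason);
}

type ClientAction = "send-token" | "refresh-token" | "retry-later" | "fix-config";

// Example policy only: how a CLI might branch on the typed reason.
function suggestAction(reason: RejectionReason): ClientAction {
  switch (reason) {
    case "missing-token":
      return "send-token";
    case "expired":
      return "refresh-token";
    case "idp-unreachable":
    case "provider-failed":
      return "retry-later";
    case "malformed-token":
    case "bad-signature":
    case "wrong-issuer":
    case "wrong-audience":
    case "insufficient-scope":
      return "fix-config";
  }
}
```

The switch has no default branch on purpose: if a new reason were added to the union, the compiler would flag the missing case.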

See Authenticated control plane for the full request flow + rejection-reason table.

`mcp.supervised` — MCP auto-recovery opt-in

Added 0.7.x (Enterprise Production Plan §3 Item #8).

MCP stdio servers crash. The runtime's supervisor wraps each server in a ping health check + exponential backoff + circuit breaker + tool-catalog re-registration loop, so a crashed server respawns transparently from the engine's perspective. Routing defaults:

mcp:
  supervised: all      # default — every MCP server wrapped in a supervisor
# mcp.supervised: none          # opt out globally (raw client, no recovery)
# mcp.supervised: [github, jira] # allow-list specific servers

Defaults.

mcp.supervised defaults to 'all'. Supervision is observational when nothing crashes — one ping per server every 10 s — and only activates the respawn + circuit paths on failure. Defaulting on means every operator gets auto-recovery out of the box.
'none' bypasses the supervisor completely. Useful when debugging a flaky server where the supervisor's ping probe itself is suspect, or when a server's initialize is too slow for the supervisor's handshake timeout.
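
The three accepted forms of mcp.supervised ('all', 'none', or a list of server ids) reduce to a per-server yes/no. A minimal sketch, with isSupervised as a hypothetical helper name:

```typescript
type SupervisedSetting = "all" | "none" | string[];

// An omitted setting behaves like 'all', matching the documented default.
function isSupervised(setting: SupervisedSetting | undefined, serverId: string): boolean {
  const effective = setting ?? "all";
  if (effective === "all") return true;
  if (effective === "none") return false;
  return effective.includes(serverId); // allow-list form
}
```

With supervised: [github, jira], the github server is wrapped in a supervisor and any server not named in the list runs as a raw client.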

Observability.

mcp_server_restarts_total{server_id, reason} counter — one sample per respawn attempt, labelled with the trigger reason (initial, transport-closed, ping-failed, init-failed, probe).
mcp_server_circuit_state{server_id} gauge (0=closed, 1=half-open, 2=open). A PagerDuty alert on mcp_server_circuit_state > 1 for 5m catches a genuinely stuck server.
mcp_server_circuit_open_total{server_id} counter — increments once per closed | half-open → open transition. Prefer this over the gauge for alertmanager rules: a simple increase(mcp_server_circuit_open_total[5m]) > 0 expression fires exactly when a server flips to open, no label comparison required.

Behaviour.

Backoff schedule: 1s, 2s, 4s, 8s, 16s, 32s, 60s cap.
Circuit opens after 5 consecutive backoff-exhausted respawn sequences. Half-opens after 30 s for a probe respawn.
Tool-call race protection: a call in flight when the server crashes resolves with a typed EMCPCRASHED error instead of hanging — the tool adapter translates this into a normal ToolEvent error so engine callers see a clean rejection.
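
The backoff schedule above is a doubling sequence with a 60 s ceiling. The sketch below reproduces it; backoffDelayMs and CIRCUIT_GAUGE are hypothetical names, and the supervisor's real code may compute the delays differently while producing the same schedule.

```typescript
// attempt is 0-based: attempt 0 is the first respawn after a crash.
function backoffDelayMs(attempt: number): number {
  return Math.min(1000 * 2 ** attempt, 60_000);
}

// Gauge encoding documented for mcp_server_circuit_state.
const CIRCUIT_GAUGE = { closed: 0, "half-open": 1, open: 2 } as const;

const schedule = [0, 1, 2, 3, 4, 5, 6, 7].map((attempt) => backoffDelayMs(attempt));
console.log(schedule.join(","));
// prints: 1000,2000,4000,8000,16000,32000,60000,60000
```

The seventh delay would be 64 s by doubling alone, so the cap is what turns it into the documented 60 s.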

Recipe — isolate a flaky MCP server for debugging

When one MCP server is crash-looping and you want to understand why without taking the rest of the fleet offline, use the list form of mcp.supervised to exclude the suspect server while the healthy ones stay wrapped in the supervisor.

# agent.yaml — healthy servers keep auto-recovery; `flaky-server` runs raw
mcp:
  servers:
    github: { command: 'npx', args: ['@modelcontextprotocol/server-github'] }
    jira:   { command: 'npx', args: ['@modelcontextprotocol/server-jira'] }
    flaky-server:
      command: 'node'
      args: ['./mcp-servers/flaky-server.js']

  # Allow-list: only these servers get supervised. `flaky-server` is
  # absent, so it runs as a raw client — you see every crash surface as
  # an `EMCPCRASHED` error in the engine log with no respawn churn
  # hiding the root cause.
  supervised: [github, jira]

Why this helps.

You can reproduce the crash in place. The supervisor's default 1s → 2s → 4s backoff respawns the server behind your back; raw-mode leaves the transport closed so the first crash is the last crash.
The other servers keep auto-recovering. A flaky Jira instance today doesn't need to block you from investigating a flaky internal tool you shipped yesterday.
Metrics still tell a story. The mcp_server_restarts_total and mcp_server_circuit_open_total counters continue to emit for the supervised servers, so your existing alerts stay useful while you debug the excluded one.

Once the fix lands, add the server back to the list (or revert to supervised: all) and redeploy; no code change is required.

`audit.export` — SIEM forwarding

Added 0.6.x (Enterprise Production Plan §3 Item #10).

Declaragent keeps a hash-chained audit log in per-host SQLite. Compliance teams need that stream in their SIEM. Add an audit.export block to your agent.yaml and the up daemon starts an in-process loop that batches new rows every 10 s and POSTs them to one of:

Splunk — HEC (HTTP Event Collector) at <hec-url>/services/collector/event.
Elastic — POST <base>/<index>/_bulk with apiKey / basic / bearer auth.
Datadog — Logs v2 intake (https://http-intake.logs.<site>/api/v2/logs).

Secrets should live in environment variables — set token: env:SPLUNK_HEC_TOKEN and the exporter resolves the value at boot. Missing / empty env vars fail loud so a typo can't silently bypass export.
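
The fail-loud rule for env: references can be sketched as follows. resolveEnvRef is a hypothetical helper; the environment is passed in as a plain object so the example runs anywhere, and treating a value without the env: prefix as a literal is an assumption.

```typescript
function resolveEnvRef(value: string, env: Record<string, string | undefined>): string {
  const prefix = "env:";
  if (!value.startsWith(prefix)) return value; // assumption: literal value
  const name = value.slice(prefix.length);
  const resolved = env[name];
  if (resolved === undefined || resolved === "") {
    // Fail loud: a typo in the variable name must not disable export.
    throw new Error(`audit.export: environment variable ${name} is missing or empty`);
  }
  return resolved;
}
```

A helper like this would apply only to secret-bearing fields such as token or apiKey. The Datadog tags value in the example further down (env:prod,team:platform) is an ordinary tag string that happens to start with the same prefix.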

Splunk

audit:
  export:
    kind: splunk
    hecUrl: https://splunk.acme.example:8088
    token: env:SPLUNK_HEC_TOKEN
    index: declaragent-audit     # optional — Splunk-side default when omitted
    source: declaragent.audit    # optional
    sourcetype: _json            # optional
    batchSize: 500               # optional — rows per push, default 500
    intervalMs: 10000            # optional — tick cadence, default 10_000

Elastic

audit:
  export:
    kind: elastic
    baseUrl: https://es.acme.example:9200
    index: declaragent-audit
    auth:
      kind: apiKey               # or "basic" | "bearer"
      apiKey: env:ELASTIC_API_KEY

For basic: { kind: basic, username: env:ELASTIC_USER, password: env:ELASTIC_PASSWORD }. For bearer: { kind: bearer, token: env:ELASTIC_TOKEN }.

Datadog

audit:
  export:
    kind: datadog
    apiKey: env:DD_API_KEY
    site: datadoghq.com          # or us3.datadoghq.com / us5.datadoghq.com / datadoghq.eu / ap1.datadoghq.com / ddog-gov.com
    service: declaragent
    tags: env:prod,team:platform # optional DD tag string

Operational behaviour

Cursor. The loop writes the highest acked seq to the audit_export_cursor table in the same SQLite file. A restart picks up exactly where the previous run stopped — no gaps, no duplicates under happy-path conditions. Non-happy paths give at-least-once delivery (if the vendor acks but Declaragent crashes before the cursor advance, those rows re-push on boot; the seq is stable so downstream dedup is cheap).
Failure handling. Retryable errors (network / 5xx / 429) re-queue on the next tick. Non-retryable errors (401 / 403 / 400) pause the loop immediately. Five consecutive retryable failures also pause it. The paused state emits declaragent_audit_export_paused{exporter,vendor}=1 on the /metrics endpoint. The loop resumes when the up daemon restarts.
Observability.
- declaragent_audit_export_acked_total{exporter,vendor} — rows ack'd.
- declaragent_audit_export_failures_total{exporter,vendor,retryable} — failed pushes.
- declaragent_audit_export_last_seq{exporter,vendor} — highest acked seq.
- declaragent_audit_export_paused{exporter,vendor} — 1 when halted.
Redaction. Tokens and vendor-path suffixes (/services/collector/event, /api/v2/logs, /<index>/_bulk) are never written to logs or Prometheus labels — all exporter error strings go through a redaction pass first.
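
The failure-handling rules above amount to a small state machine. The sketch below is one possible reading, not the exporter's code: PushOutcome, isRetryable, and ExportLoopSketch are hypothetical names, and two details are assumptions the text does not spell out (a successful push resets the consecutive-failure counter, and 4xx statuses other than 429 are treated as non-retryable).

```typescript
type PushOutcome =
  | { kind: "ok" }
  | { kind: "network-error" }
  | { kind: "http-error"; status: number };

// Retryable: network errors, 5xx, and 429. Everything else (for example
// 400 / 401 / 403) is treated as non-retryable in this sketch.
function isRetryable(outcome: PushOutcome): boolean {
  if (outcome.kind === "ok") return false;
  if (outcome.kind === "network-error") return true;
  return outcome.status === 429 || outcome.status >= 500;
}

class ExportLoopSketch {
  paused = false;
  private consecutiveRetryable = 0;

  // Call once per tick with the result of the push attempt.
  record(outcome: PushOutcome): void {
    if (this.paused) return; // stays paused until the daemon restarts
    if (outcome.kind === "ok") {
      this.consecutiveRetryable = 0; // assumption: success resets the streak
      return;
    }
    if (!isRetryable(outcome)) {
      this.paused = true; // non-retryable: pause immediately
      return;
    }
    this.consecutiveRetryable += 1;
    if (this.consecutiveRetryable >= 5) this.paused = true;
  }
}
```

A single 401 pauses the loop at once, whereas four 503 responses followed by a success leave it running with a fresh counter.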

Scope

In. Forward-only stream of every new audit row. Standard tool_call, channel_event, secret_access, auth_check, rate_limited, erasure tombstones — everything the hash chain holds.
Out. Historical backfill, generic syslog/CEF, custom vendor codecs. Operators who need backfill can dump the SQLite file manually and replay via the vendor's own bulk-load tooling.

Fields documented post-slice 8

The generated reference will additionally document:

skills[] — filesystem paths to per-skill markdown prompts.
tools.defaults[] / tools.allow[] / tools.deny[] — engine tool allowlists.
channels[] — inbound/outbound platform adapters (Slack, Discord, Telegram, WhatsApp).
event_sources[] — Kafka, SQS, AMQP, MQTT, NATS, webhook.
plugins[] — extension ids referenced from the plugin manifest.
quotas — daily token budget, per-minute rate limits, dailyTokenUSD.
tenants — per-tenant override block (when --multi-tenant).
secrets — ${secret:...} references bound at launch.

Each of these is implemented in Phase 1–6 code and will be frozen in slice 8 via a @since 1.0.0 JSDoc tag.

Related

CLI → declaragent init — scaffolds this file for you.
Cookbook → concierge — walks an example end-to-end.
Troubleshooting → error-codes — every EINVAL you can hit at load time.
