agent.yaml schema
Every declarative runtime lives or dies by its spec. agent.yaml is that spec.
This page will be auto-generated from the Zod schema in slice 8. For now it is a
hand-curated subset of the frozen fields. The complete per-field table
(with types, defaults, since markers, and examples) lands when slice 8
wires the scripts/docs-schema-extract.ts generator into CI.
Required fields (v1.0)
Pulled from AgentSpec in packages/core/src/types/session.ts:
| Field | Type | Required? | Description |
|---|---|---|---|
| `name` | string | yes | Agent identity. Must be unique per tenant. Appears in metrics + logs. |
| `model` | string | yes | Model id as the provider understands it (claude-sonnet-4-5, openai/gpt-4o-mini, etc.). |
| `systemPrompt` | string (multi-line) | yes | Baseline instructions. Loaded once per session. |
| `temperature` | number | no | Provider sampling temperature. Defaults to the provider's own default (often 0.7). |
| `maxTokens` | number | no | Per-turn cap. Defaults to the provider's ceiling. |
| `subagentDepthCap` | number | no | Max depth for sub-agent spawning. Defaults to 3. |
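As a rough illustration of how the table above maps to a type, the sketch below mirrors the six fields in TypeScript and checks a parsed document by hand. `AgentSpecSketch`, `validateAgentSpec`, and `withDefaults` are hypothetical names, not the real `AgentSpec` or the Zod schema; only the field names, the required flags, and the `subagentDepthCap` default of 3 come from the table.

```typescript
// Hypothetical mirror of the field table; the real AgentSpec lives in
// packages/core/src/types/session.ts and may differ.
interface AgentSpecSketch {
  name: string;
  model: string;
  systemPrompt: string;
  temperature?: number;
  maxTokens?: number;
  subagentDepthCap?: number;
}

const REQUIRED_STRINGS = ["name", "model", "systemPrompt"];
const OPTIONAL_NUMBERS = ["temperature", "maxTokens", "subagentDepthCap"];

// Returns a list of problems; an empty list means the document passes
// this deliberately minimal check.
function validateAgentSpec(raw: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const key of REQUIRED_STRINGS) {
    const value = raw[key];
    if (typeof value !== "string" || value.length === 0) {
      problems.push(key + ": required non-empty string");
    }
  }
  for (const key of OPTIONAL_NUMBERS) {
    const value = raw[key];
    if (value !== undefined && (typeof value !== "number" || Number.isNaN(value))) {
      problems.push(key + ": must be a number when present");
    }
  }
  return problems;
}

// Applies the one default the table states as a concrete value.
// temperature and maxTokens are left unset so the provider decides.
function withDefaults(spec: AgentSpecSketch): AgentSpecSketch {
  const cap = spec.subagentDepthCap === undefined ? 3 : spec.subagentDepthCap;
  return { ...spec, subagentDepthCap: cap };
}
```

A real loader would report these problems through the load-time error path rather than returning strings; the point here is only which fields are mandatory.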
Example
name: concierge
model: claude-sonnet-4-5
temperature: 0
maxTokens: 2048
subagentDepthCap: 2
systemPrompt: |
  You are a Slack concierge for an engineering team. Answer questions
  about the local repository and refuse to take external actions.
skills:
  - skills/concierge.md
tools:
  defaults:
    - Read
    - Glob
    - Grep
Full examples for every template live under templates/ in the repo.
tools.rateLimit — per-tool throttle
Added 0.6.x (Enterprise Production Plan §3 Item #7).
The provider-level limiter caps how fast Declaragent talks to the LLM.
That's not enough for a tool — especially Bash, which can shell out to
curl and hammer a downstream API at whatever rate the LLM's tool-use
loop can generate. tools.rateLimit adds a per-tool token bucket that
fires before each invocation of tool.execute(). When the bucket is
empty, the engine sleeps cooperatively until a token frees; the LLM turn
stays alive and the next tool_use block in the same response simply
waits its turn.
name: concierge
model: claude-sonnet-4-5
systemPrompt: |
  You are a careful concierge. Prefer Read/Grep over Bash.
tools:
  defaults:
    - Read
    - Grep
    - Bash
  rateLimit:
    # Cap shell execution to 1 call/sec, no burst. A 10-call burst
    # through this tool will take ~9 seconds end-to-end.
    Bash:
      rps: 1
    # MCP tool names are namespaced `mcp__<server>__<tool>`.
    mcp__github__list_issues:
      rps: 5
      burst: 10
Fields.
| Field | Type | Required? | Description |
|---|---|---|---|
| `rps` | number (> 0) | yes | Steady-state rate in calls per second. Fractional values accepted (0.5 = one call every two seconds). |
| `burst` | number (> 0) | no | Max calls absorbed without throttling. Defaults to `rps`. Set to `2 * rps` for the classic token-bucket "1-second absorb" pattern. |
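The token-bucket behaviour described in this section can be sketched as follows. This is an illustration under stated assumptions, not Declaragent's implementation: `TokenBucketSketch` and `reserve` are hypothetical names, the clock is injected so the arithmetic is easy to follow, and the minimum capacity of one token is an assumption made so fractional `rps` values still admit a first call.

```typescript
// Illustrative token bucket with an injected clock (milliseconds).
// Hypothetical sketch, not Declaragent's implementation.
class TokenBucketSketch {
  private readonly rps: number;
  private readonly capacity: number;
  private tokens: number;
  private lastMs: number;

  constructor(rps: number, burst: number | undefined, nowMs: number) {
    if (!(rps > 0)) throw new Error("rps must be > 0");
    const effectiveBurst = burst === undefined ? rps : burst; // documented default: burst = rps
    if (!(effectiveBurst > 0)) throw new Error("burst must be > 0");
    this.rps = rps;
    // Assumption: capacity is at least one whole token so a fractional
    // rps (e.g. 0.5) can still let the first call through immediately.
    this.capacity = Math.max(1, effectiveBurst);
    this.tokens = this.capacity;
    this.lastMs = nowMs;
  }

  // Reserves one token and returns how long the caller should sleep (ms)
  // before running the tool. 0 means the call may proceed immediately.
  reserve(nowMs: number): number {
    const elapsedSec = Math.max(0, nowMs - this.lastMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.rps);
    this.lastMs = Math.max(this.lastMs, nowMs);
    this.tokens -= 1; // may go negative: the deficit is the queue of waiting calls
    return this.tokens >= 0 ? 0 : Math.ceil((-this.tokens * 1000) / this.rps);
  }
}

// Ten Bash calls arriving together with `rps: 1` and no explicit burst.
const bash = new TokenBucketSketch(1, undefined, 0);
const waits: number[] = [];
for (let i = 0; i < 10; i++) waits.push(bash.reserve(0));
// waits[0] is 0 and waits[9] is 9000, matching the "~9 seconds" note
// in the example config.
```

Returning a wait time instead of rejecting the call mirrors the cooperative sleep described above: the caller pauses and then proceeds, so the LLM turn is never failed by the limiter.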
Observability.
- Every stall increments `declaragent_tool_rate_limit_waits_total{agent,tool}` and adds its observed wait to `declaragent_tool_rate_limit_wait_ms{agent,tool}` on the `/metrics` endpoint.
- Waits that exceed 1 s emit a `rate_limited` audit record on the hash chain (tenant id, session id, tool name, configured rps/burst, observed wait). Short burst-absorption waits are silent to avoid chain bloat.
Scope.
- Tools not listed under `rateLimit` bypass the gate entirely, so unconfigured tools pay no overhead.
- Per-user / per-tenant ceilings are not yet supported; see Enterprise Production Plan #5 multi-tenant quota work.
- Dynamic rate adjustment based on 429 response headers is not supported; set a conservative `rps` up front.
rpc.auth — OIDC / OAuth2 envelope authentication
Added 0.7.x (Enterprise Production Plan §3 Item #4).
Every inter-agent RPC envelope carries an optional auth block. Before 0.7.x,
`up` ignored the block and trusted every envelope (the legacy
internal/hmac path). Flipping the switch:
rpc:
  auth:
    enabled: true # default false; legacy trust path stays on
...tells `up` to walk `rpc-peers.yaml`, build an `AuthVerifyRegistry`
from every peer with an `auth:` block, and pass it into
`createAgentInboxAdapter({ authRegistry })` at boot. From that point on,
envelopes from a registered peer must carry a matching OIDC / OAuth2
bearer token; a bad / missing / expired token routes to the DLQ under
`kind=rejected`, `reason=auth-rejected`.
Defaults.
- `rpc.auth.enabled` defaults to `false`. Flipping every fleet on day one would break every operator who hasn't yet staged a peer `auth:` block in `rpc-peers.yaml`. The transition is: stage the peer configs, flip `enabled: true` per agent, monitor `auth_check` audit records, then decommission the legacy path on the peer side.
- Peers without an `auth:` block in `rpc-peers.yaml` still follow the legacy path regardless of this toggle; auth is opt-in per peer.
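The two rules above (a global toggle plus per-peer opt-in) reduce to a small decision, sketched here. `PeerSketch`, `buildAuthRegistry`, and `envelopePath` are hypothetical names for illustration; the real `AuthVerifyRegistry` is not shown in this page and may look quite different.

```typescript
// Hypothetical shapes; only `agent` and the presence of `auth` matter here.
interface PeerAuthSketch {
  provider: "oidc" | "oauth2-client";
  issuer: string;
  audience: string;
}
interface PeerSketch {
  agent: string;
  auth?: PeerAuthSketch;
}

// Collects every peer that declares an auth block, keyed by agent URI.
function buildAuthRegistry(peers: PeerSketch[]): Map<string, PeerAuthSketch> {
  const registry = new Map<string, PeerAuthSketch>();
  for (const peer of peers) {
    if (peer.auth !== undefined) registry.set(peer.agent, peer.auth);
  }
  return registry;
}

// "legacy" = the envelope is trusted as before;
// "verify" = a valid bearer token is required.
function envelopePath(
  enabled: boolean,
  registry: Map<string, PeerAuthSketch>,
  fromAgent: string
): "legacy" | "verify" {
  if (!enabled) return "legacy"; // rpc.auth.enabled defaults to false
  return registry.has(fromAgent) ? "verify" : "legacy";
}
```

With this shape, staging a peer's `auth:` block has no effect until the agent-level toggle is flipped, which is what makes the staged rollout described above safe.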
Peer config (in rpc-peers.yaml, not agent.yaml):
version: 1
peers:
  - agent: agent://peer-b
    transports:
      - kind: kafka
        brokers: [kafka-1.acme.example:9092]
        topics:
          requests: agents.peer-b.requests
    auth:
      provider: oidc # or 'oauth2-client'
      issuer: https://dex.acme.example
      audience: peer-a
      jwksUri: https://dex.acme.example/keys # optional; derived from issuer discovery
      scopes: [agents:call] # optional; strict AND check on verify
  - agent: agent://peer-c
    transports:
      - kind: nats
        servers: [nats://nats.acme.example:4222]
        subjects:
          requests: agents.peer-c.requests
    auth:
      provider: oauth2-client
      tokenEndpoint: https://idp.acme.example/oauth2/token
      clientId: decl-peer-a
      clientSecretRef: secret://platform/decl-peer-a-client-secret
      jwksUri: https://idp.acme.example/keys
      issuer: https://idp.acme.example
      audience: peer-c
      scopes: [agents:call]
`clientSecretRef` resolves through the Phase-6 secret resolver:
`secret://`, `env:`, `file:`, `vault:`, `aws-sm:`, and `gcp-sm:` all
work.
Observability.
- Every envelope emits an `auth_check` audit record with the decision (accept/reject), provider name (`oidc` / `oauth2-client`), resolved subject, and the rejection reason when applicable.
- Rejected envelopes also land in `rejected_events` under `kind=auth-rejected` so `declaragent dlq list --kind rejected` surfaces the specific peer + reason.
controlPlane.auth — OIDC / OAuth2 on the HTTP listener
Added 0.7.x (Enterprise Production Plan §3 Item #5, Slice 2 of docs/CONTROL_PLANE_PLAN.md).
The control-plane listener (/metrics, /status, /events, /dlq,
/audit, /logs) binds to 127.0.0.1:9464 by default. Enable this
block and remote callers (a reverse-proxied declaragent fleet ps, a
Prometheus federation) must present a valid Bearer token:
controlPlane:
  auth:
    enabled: true # default false (back-compat)
    provider: oidc # or 'oauth2-client'
    issuer: "https://dex.acme.example"
    audience: "declaragent-control-plane"
    jwksUri: "https://dex.acme.example/keys" # optional; derived from issuer discovery
    scopes: ["control:read"] # strict AND; every scope required
    # allowLoopback: false # optional; default true; flip for zero-trust
Defaults.
- `controlPlane.auth.enabled` defaults to `false`. Legacy single-host fleets keep working unchanged.
- `allowLoopback` defaults to `true`: requests whose `Host` header is `127.0.0.1` / `localhost` bypass the verifier so same-host curls and `declaragent ps` work without a token. Zero-trust deployments set `false` to require a token on every request.
- When `enabled: true`, the `up` daemon implicitly flips the listener's `allowRemote: true`; otherwise non-loopback requests would hit the pre-auth 403 before the middleware ever runs. The bind address stays `127.0.0.1`; a reverse proxy is expected to front the port.
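The defaults above interact in a way that is easy to misread, so here is a minimal sketch of the per-request decision. `gateDecision` and `isLoopbackHost` are hypothetical names, and stripping a `:port` suffix from the `Host` header is an assumption; the page only says the header is compared against `127.0.0.1` / `localhost`.

```typescript
interface ControlPlaneAuthSketch {
  enabled?: boolean;
  allowLoopback?: boolean;
}

// Assumption: the Host header may carry a port, which is ignored here.
function isLoopbackHost(hostHeader: string): boolean {
  const host = hostHeader.replace(/:\d+$/, "").toLowerCase();
  return host === "127.0.0.1" || host === "localhost";
}

// "no-auth"       = the auth middleware is not in play (enabled is false)
// "bypass"        = loopback request skips the verifier
// "require-token" = a valid Bearer token must be presented
function gateDecision(
  cfg: ControlPlaneAuthSketch,
  hostHeader: string
): "no-auth" | "bypass" | "require-token" {
  if (cfg.enabled !== true) return "no-auth"; // default false
  const allowLoopback = cfg.allowLoopback !== false; // default true
  if (allowLoopback && isLoopbackHost(hostHeader)) return "bypass";
  return "require-token";
}
```

Under these assumptions `gateDecision({ enabled: true }, "localhost:9464")` yields `"bypass"`, while the same request with `allowLoopback: false` yields `"require-token"`.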
OAuth2 Client-Credentials uses the same shape as
rpc-peers.yaml#auth for that flow:
controlPlane:
  auth:
    enabled: true
    provider: oauth2-client
    tokenEndpoint: "https://idp.acme.example/oauth/token"
    clientId: "declaragent-control-plane"
    clientSecretRef: "env:CP_CLIENT_SECRET" # also accepts file:/secret: refs
    audience: "declaragent-control-plane"
    issuer: "https://idp.acme.example/"
    jwksUri: "https://idp.acme.example/.well-known/jwks.json"
    scopes: ["control:read"]
Rejection reasons. Failed requests return 401 with a typed body
{ "error": "...", "reason": "..." }. The reason vocabulary
(missing-token, malformed-token, bad-signature, expired,
wrong-issuer, wrong-audience, insufficient-scope,
idp-unreachable, provider-failed) is stable so remote CLIs can
branch without parsing free-form error text.
See Authenticated control plane for the full request flow + rejection-reason table.
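To make the claim-level reasons concrete, the sketch below maps already-decoded token claims to four entries of the reason vocabulary. It is a hedged illustration only: signature verification, token parsing, and IdP discovery (which produce the other reasons) are out of scope, the check order is an assumption, and `checkClaims`, `ClaimsSketch`, and `VerifierConfigSketch` are hypothetical names. It also assumes scopes arrive as a space-delimited `scope` claim.

```typescript
type ClaimReason = "expired" | "wrong-issuer" | "wrong-audience" | "insufficient-scope";

interface ClaimsSketch {
  iss: string;
  aud: string | string[];
  exp: number; // seconds since the epoch
  scope?: string; // assumption: space-delimited scope string
}

interface VerifierConfigSketch {
  issuer: string;
  audience: string;
  scopes?: string[];
}

// Returns null when the claims satisfy the config, otherwise the first
// failing reason. The order of checks is an assumption of this sketch.
function checkClaims(
  claims: ClaimsSketch,
  cfg: VerifierConfigSketch,
  nowSec: number
): ClaimReason | null {
  if (claims.exp <= nowSec) return "expired";
  if (claims.iss !== cfg.issuer) return "wrong-issuer";
  const audiences = Array.isArray(claims.aud) ? claims.aud : [claims.aud];
  if (audiences.indexOf(cfg.audience) === -1) return "wrong-audience";
  const granted = (claims.scope === undefined ? "" : claims.scope).split(" ");
  // Strict AND: every configured scope must be present on the token.
  for (const required of cfg.scopes === undefined ? [] : cfg.scopes) {
    if (granted.indexOf(required) === -1) return "insufficient-scope";
  }
  return null;
}
```

Because the issuer comparison here is an exact string match, a trailing slash in `issuer` would matter in this sketch; whether the real verifier normalises it is not stated in this page.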
mcp.supervised — MCP auto-recovery opt-in
Added 0.7.x (Enterprise Production Plan §3 Item #8).
MCP stdio servers crash. The runtime's supervisor wraps each server in a
ping health check + exponential backoff + circuit breaker + tool-catalog
re-registration loop, so a crashed server respawns transparently from the
engine's perspective. Routing defaults:
mcp:
  supervised: all # default; every MCP server wrapped in a supervisor
  # mcp.supervised: none # opt out globally (raw client, no recovery)
  # mcp.supervised: [github, jira] # allow-list specific servers
Defaults.
- `mcp.supervised` defaults to `'all'`. Supervision is observational when nothing crashes (one ping per server every 10 s) and only activates the respawn + circuit paths on failure. Defaulting on means every operator gets auto-recovery out of the box.
- `'none'` bypasses the supervisor completely. Useful when debugging a flaky server where the supervisor's ping probe itself is suspect, or when a server's `initialize` is too slow for the supervisor's handshake timeout.
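The three accepted forms of `mcp.supervised` (`all`, `none`, or a list of server ids) can be read as a single predicate. The sketch below is illustrative; `SupervisedSetting` and `isSupervised` are hypothetical names, not part of the runtime.

```typescript
type SupervisedSetting = "all" | "none" | string[];

// Decides whether a given MCP server id is wrapped in the supervisor.
function isSupervised(setting: SupervisedSetting | undefined, serverId: string): boolean {
  // Documented default: an omitted setting behaves like 'all'.
  const effective: SupervisedSetting = setting === undefined ? "all" : setting;
  if (effective === "all") return true;
  if (effective === "none") return false;
  return effective.indexOf(serverId) !== -1; // list form acts as an allow-list
}
```

For example, with `supervised: [github, jira]` a server named `flaky-server` is not supervised, while `github` and `jira` are.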
Observability.
- `mcp_server_restarts_total{server_id, reason}` counter: one sample per respawn attempt, labelled with the trigger reason (`initial`, `transport-closed`, `ping-failed`, `init-failed`, `probe`).
- `mcp_server_circuit_state{server_id}` gauge (`0=closed`, `1=half-open`, `2=open`). A PagerDuty alert on `mcp_server_circuit_state > 1 for 5m` catches a genuinely stuck server.
- `mcp_server_circuit_open_total{server_id}` counter: increments once per `closed | half-open → open` transition. Prefer this over the gauge for alertmanager rules: a simple `increase(mcp_server_circuit_open_total[5m]) > 0` expression fires exactly when a server flips to open, no label comparison required.
Behaviour.
- Backoff schedule: 1s, 2s, 4s, 8s, 16s, 32s, 60s cap.
- Circuit opens after 5 consecutive backoff-exhausted respawn sequences. Half-opens after 30 s for a probe respawn.
- Tool-call race protection: a call in flight when the server crashes resolves with a typed `EMCPCRASHED` error instead of hanging; the tool adapter translates this into a normal `ToolEvent` error so engine callers see a clean rejection.
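The backoff schedule and circuit thresholds listed above can be sketched as a delay function plus a small state machine. This is a minimal sketch under assumptions, not the supervisor's code: `backoffDelayMs` and `CircuitSketch` are hypothetical names, and reopening immediately after a failed half-open probe is an assumption (the page only names the `half-open → open` transition). The numeric states follow the gauge encoding (0=closed, 1=half-open, 2=open).

```typescript
const BACKOFF_BASE_MS = 1000;
const BACKOFF_CAP_MS = 60000;

// attempt 0 -> 1s, 1 -> 2s, ... doubling until the 60s cap.
function backoffDelayMs(attempt: number): number {
  return Math.min(BACKOFF_BASE_MS * Math.pow(2, attempt), BACKOFF_CAP_MS);
}

class CircuitSketch {
  state: number = 0; // 0=closed, 1=half-open, 2=open
  private exhausted: number = 0;
  private openedAtMs: number = 0;

  // Call when one full backoff sequence ends without a healthy server.
  onSequenceExhausted(nowMs: number): void {
    this.exhausted += 1;
    // Assumption: a failed half-open probe reopens the circuit at once.
    if (this.state === 1 || this.exhausted >= 5) {
      this.state = 2;
      this.openedAtMs = nowMs;
    }
  }

  // Call periodically; moves open -> half-open once 30 s have passed.
  tick(nowMs: number): void {
    if (this.state === 2 && nowMs - this.openedAtMs >= 30000) this.state = 1;
  }

  // Call when a respawn (or probe) yields a healthy server.
  onHealthy(): void {
    this.state = 0;
    this.exhausted = 0;
  }
}
```

Attempts 0 through 6 produce 1000, 2000, 4000, 8000, 16000, 32000, and 60000 ms, which is the documented schedule.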
Recipe — isolate a flaky MCP server for debugging
When one MCP server is crash-looping and you want to understand why
without taking the rest of the fleet offline, use the list form of
mcp.supervised to exclude the suspect server while the healthy
ones stay wrapped in the supervisor.
# agent.yaml: healthy servers keep auto-recovery; `flaky-server` runs raw
mcp:
  servers:
    github: { command: 'npx', args: ['@modelcontextprotocol/server-github'] }
    jira: { command: 'npx', args: ['@modelcontextprotocol/server-jira'] }
    flaky-server:
      command: 'node'
      args: ['./mcp-servers/flaky-server.js']
  # Allow-list: only these servers get supervised. `flaky-server` is
  # absent, so it runs as a raw client: you see every crash surface as
  # an `EMCPCRASHED` error in the engine log with no respawn churn
  # hiding the root cause.
  supervised: [github, jira]
Why this helps.
- You can reproduce the crash in place. The supervisor's default 1s → 2s → 4s backoff respawns the server behind your back; raw-mode leaves the transport closed so the first crash is the last crash.
- The other servers keep auto-recovering. A flaky Jira instance today doesn't need to block you from investigating a flaky internal tool you shipped yesterday.
- Metrics still tell a story. The `mcp_server_restarts_total` and `mcp_server_circuit_open_total` counters continue to emit for the supervised servers, so your existing alerts stay useful while you debug the excluded one.
Once the fix lands, add the server back to the list (or revert to
supervised: all) and redeploy; no code change is required.
audit.export — SIEM forwarding
Added 0.6.x (Enterprise Production Plan §3 Item #10).
Declaragent keeps a hash-chained audit log in per-host SQLite. Compliance
teams need that stream in their SIEM. Add an audit.export block to your
agent.yaml and the up daemon starts an in-process loop that batches
new rows every 10 s and POSTs them to one of:
- Splunk: HEC (HTTP Event Collector) at `<hec-url>/services/collector/event`.
- Elastic: `POST <base>/<index>/_bulk` with `apiKey` / `basic` / `bearer` auth.
- Datadog: Logs v2 intake (`https://http-intake.logs.<site>/api/v2/logs`).
Secrets should live in environment variables — set token: env:SPLUNK_HEC_TOKEN
and the exporter resolves the value at boot. Missing / empty env vars fail
loud so a typo can't silently bypass export.
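The "fail loud" rule for `env:` references can be sketched as below. `resolveEnvRef` is a hypothetical helper, not the exporter's resolver; the environment is passed in as a plain object so the sketch does not depend on a particular runtime, and passing non-`env:` values through unchanged is an assumption.

```typescript
// Resolves an `env:NAME` reference for a secret-bearing field.
// Assumption: values without the `env:` prefix are returned unchanged.
function resolveEnvRef(
  field: string,
  value: string,
  env: Record<string, string | undefined>
): string {
  const prefix = "env:";
  if (value.indexOf(prefix) !== 0) return value;
  const name = value.slice(prefix.length);
  const resolved = env[name];
  if (resolved === undefined || resolved === "") {
    // Missing and empty are treated alike so a typo cannot silently
    // produce an unauthenticated exporter.
    throw new Error(field + ": environment variable " + name + " is missing or empty");
  }
  return resolved;
}
```

A helper like this would only be applied to secret-bearing fields such as `token` or `apiKey`. Other strings can legitimately begin with `env:` (a Datadog tag string such as `env:prod,team:platform`, for instance) and are not references.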
Splunk
audit:
  export:
    kind: splunk
    hecUrl: https://splunk.acme.example:8088
    token: env:SPLUNK_HEC_TOKEN
    index: declaragent-audit # optional; Splunk-side default when omitted
    source: declaragent.audit # optional
    sourcetype: _json # optional
    batchSize: 500 # optional; rows per push, default 500
    intervalMs: 10000 # optional; tick cadence, default 10_000
Elastic
audit:
  export:
    kind: elastic
    baseUrl: https://es.acme.example:9200
    index: declaragent-audit
    auth:
      kind: apiKey # or "basic" | "bearer"
      apiKey: env:ELASTIC_API_KEY
For basic: { kind: basic, username: env:ELASTIC_USER, password: env:ELASTIC_PASSWORD }.
For bearer: { kind: bearer, token: env:ELASTIC_TOKEN }.
Datadog
audit:
  export:
    kind: datadog
    apiKey: env:DD_API_KEY
    site: datadoghq.com # or us3. / us5. / datadoghq.eu / ap1. / ddog-gov.com
    service: declaragent
    tags: env:prod,team:platform # optional DD tag string
Operational behaviour
- Cursor. The loop writes the highest acked `seq` to the `audit_export_cursor` table in the same SQLite file. A restart picks up exactly where the previous run stopped: no gaps, no duplicates under happy-path conditions. Non-happy paths give at-least-once delivery (if the vendor acks but Declaragent crashes before the cursor advance, those rows re-push on boot; the `seq` is stable so downstream dedup is cheap).
- Failure handling. Retryable errors (network / 5xx / 429) re-queue on the next tick. Non-retryable errors (401 / 403 / 400) pause the loop immediately. Five consecutive retryable failures also pause. The paused state emits `declaragent_audit_export_paused{exporter,vendor}=1` on the `/metrics` endpoint. Resume happens on `up` restart.
- Observability.
  - `declaragent_audit_export_acked_total{exporter,vendor}`: rows ack'd.
  - `declaragent_audit_export_failures_total{exporter,vendor,retryable}`: failed pushes.
  - `declaragent_audit_export_last_seq{exporter,vendor}`: highest acked seq.
  - `declaragent_audit_export_paused{exporter,vendor}`: 1 when halted.
- Redaction. Tokens and vendor-path suffixes (`/services/collector/event`, `/api/v2/logs`, `/<index>/_bulk`) are never written to logs or Prometheus labels; all exporter error strings go through a redaction pass first.
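The cursor and failure-handling rules above amount to a small state machine, sketched here. `ExportLoopSketch`, `PushResult`, and `isRetryable` are hypothetical names; treating every status outside network / 5xx / 429 as non-retryable is an assumption, since the page lists only 401 / 403 / 400 explicitly.

```typescript
type PushResult =
  | { ok: true; lastSeq: number }
  | { ok: false; status: number | "network" };

function isRetryable(status: number | "network"): boolean {
  if (status === "network") return true;
  return status === 429 || (status >= 500 && status <= 599);
}

class ExportLoopSketch {
  cursor: number = 0; // highest acked seq
  paused: boolean = false;
  private consecutiveRetryable: number = 0;

  // Feed the outcome of one push attempt (one tick) into the loop state.
  onTick(result: PushResult): void {
    if (this.paused) return; // stays paused until the daemon restarts
    if (result.ok) {
      // Advance the cursor only after the vendor acks; a crash between
      // the ack and this line is what yields at-least-once delivery.
      this.cursor = Math.max(this.cursor, result.lastSeq);
      this.consecutiveRetryable = 0;
      return;
    }
    if (!isRetryable(result.status)) {
      this.paused = true; // e.g. 401 / 403 / 400
      return;
    }
    this.consecutiveRetryable += 1;
    if (this.consecutiveRetryable >= 5) this.paused = true;
  }
}
```

Resetting the counter on every success is what makes the threshold "five consecutive" rather than five in total, so an endpoint that fails intermittently does not pause the exporter.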
Scope
- In. Forward-only stream of every new audit row. Standard `tool_call`, `channel_event`, `secret_access`, `auth_check`, `rate_limited`, erasure tombstones: everything the hash chain holds.
- Out. Historical backfill, generic syslog/CEF, custom vendor codecs. Operators who need backfill can dump the SQLite file manually and replay via the vendor's own bulk-load tooling.
Fields documented post-slice 8
The generated reference will additionally document:
- `skills[]`: filesystem paths to per-skill markdown prompts.
- `tools.defaults[]` / `tools.allow[]` / `tools.deny[]`: engine tool allowlists.
- `channels[]`: inbound/outbound platform adapters (Slack, Discord, Telegram, WhatsApp).
- `event_sources[]`: Kafka, SQS, AMQP, MQTT, NATS, webhook.
- `plugins[]`: extension ids referenced from the plugin manifest.
- `quotas`: daily token budget, per-minute rate limits, dailyTokenUSD.
- `tenants`: per-tenant override block (when `--multi-tenant`).
- `secrets`: `${secret:...}` references bound at launch.
Each of these is implemented in Phase 1–6 code and will be frozen in slice 8 via a @since 1.0.0 JSDoc tag.
Related
- CLI → `declaragent init`: scaffolds this file for you.
- Cookbook → `concierge`: walks an example end-to-end.
- Troubleshooting → error-codes: every `EINVAL` you can hit at load time.