Michael Suchacz 85792d08bc feat: add harness engineering layer for agent workflows (#24791)
This PR adds an opinionated harness-engineering layer for agent-driven
workflows: a small set of agent-readable docs, mechanical structure
checks, structured CI failure summaries, an architecture-lint umbrella,
and per-worktree dev-server isolation. The goal is to make local dev,
tests, and CI mechanically inspectable by agents without changing app
runtime behavior.

## What landed

**Agent docs and navigation**
- `.claude/docs/OBSERVABILITY.md`, `.claude/docs/DEV_ISOLATION.md`,
`.claude/docs/AGENT_FAILURES.md`: task-oriented guides for logs,
tracing, Prometheus, dev-server isolation, and a seeded failure catalog.
- `AGENTS.md`: added an `Agent navigation` block, then trimmed the file
from 375 to 229 lines by migrating duplicated detail into
`WORKFLOWS.md`, `GO.md`, `TESTING.md`, and `DATABASE.md`. The
user-managed custom-instructions block is preserved.
- `.agents/docs`: symlink mirror of `.claude/docs` for agent runtimes
that look under `.agents`.

**Mechanical checks**
- `scripts/check_agents_structure.sh`: validates `@...` references in
tracked `AGENTS.md` files and warns when the root `AGENTS.md` grows past
600 lines. Wired as `make lint/agents` and into `make lint`.
- `scripts/audit-agent-readiness.sh`: report-first audit of harness
readiness. Currently `10 ok, 0 warn, 0 fail`.
- `scripts/check_architecture.sh` / `make lint/architecture`: umbrella
architecture-lint target. Consolidates the existing
`check_enterprise_imports.sh` and `check_codersdk_imports.sh` so they
run exactly once via the umbrella. Slot is open for new high-confidence
rules.

**Structured CI failure summaries**
- `scripts/playwright-failure-summary.sh`: parses
`site/test-results/results.json` and writes Markdown to
`$GITHUB_STEP_SUMMARY` on failure. Wired into the `test-e2e` matrix job.
- `scripts/go-test-failure-summary.sh`: parses `go test -json`
line-delimited output the same way. Wired into `test-go-pg`,
`test-go-pg-17`, and `test-go-race-pg` by injecting `gotestsum
--jsonfile` in the workflow without touching `Makefile`. The JSON is
also uploaded as a CI artifact on failure.
- `site/e2e/playwright.config.ts`: enables `screenshot:
only-on-failure`, `trace: retain-on-failure`, JSON reporter, and HTML
reporter alongside existing reporters.
- `.github/workflows/ci.yaml`: failure artifact uploads for Playwright
now use `if: failure()` and predictable names
(`playwright-artifacts-<variant>-<sha>`).

**Per-worktree dev-server isolation** (`scripts/develop/main.go`)
- A deterministic 64-bit FNV-1a hash of the worktree path produces a
port offset in `[0, 1000)`: 50 buckets with a step of 20, so the API and
proxy ports never overlap across adjacent buckets.
- The offset is applied only to defaults; env vars (`CODER_DEV_PORT`,
`CODER_DEV_WEB_PORT`, `CODER_DEV_PROXY_PORT`,
`CODER_DEV_PROMETHEUS_PORT`) and CLI flags retain priority.
- Hardcoded ports `9090` (embedded Prometheus UI) and `12345` (Delve)
are unchanged by design.
- Startup banner shows each port's source: `default`, `offset`, or
`explicit`.
- Unit tests in `scripts/develop/main_test.go` cover determinism,
bounds, no-overlap across the four ports, and explicit-skip behavior.
- State (`.coderv2/`) was already worktree-isolated via `os.Getwd()`, so
no state-dir changes were needed.

## Validation

`make lint/agents`, `make lint/architecture`, `make lint/emdash`, `bash
scripts/audit-agent-readiness.sh` (10 ok, 0 warn, 0 fail), `shellcheck`
on all 5 new scripts, `go test ./scripts/develop/...`, and `js-yaml`
parse of `ci.yaml` all pass. Synthetic fixtures verify both
failure-summary scripts handle empty/missing input (silent exit 0),
ANSI-stripped output, and parent/subtest formatting.

## Known follow-ups (deferred)

- Frontend Storybook/Vitest failure summary: lowest-leverage slice of
the failure-summary work. Skipping until observed pain.
- Architecture lint currently only delegates to existing import checks;
new rules (`InTx` outer-store detection, swagger-annotation lint) plug
in as needed.
- With only 50 port-offset buckets, two worktree paths can occasionally
collide. The DEV_ISOLATION doc tells users to set the relevant env var
when this happens.

> Mux opened this PR on Mike's behalf.
2026-05-11 17:27:29 +02:00


Observability Guide for Agents

This guide maps the observability surfaces that already exist in local Coder development. Do not add new endpoints for agent debugging. Prefer the existing logs, tracing, Prometheus metrics, browser artifacts, and command output described here.

Start the app

Use ./scripts/develop.sh for local development. See Development Workflows and Guidelines for the full workflow. The script builds the dev orchestrator, starts the API server and frontend, waits for the API server to answer /healthz, creates the first user if needed, and prints a banner with the local URLs.

Useful defaults from scripts/develop/main.go are:

  • API server: http://localhost:3000.
  • Frontend dev server: http://localhost:8080.
  • Workspace proxy, when --use-proxy is set: http://localhost:3010.
  • Coder Prometheus metrics: http://localhost:2114/.
  • Embedded Prometheus UI, when --prometheus-server is set and Docker is available on Linux: http://localhost:9090.

Local logs

./scripts/develop.sh writes orchestrator and child process logs to the terminal. The orchestrator uses sloghuman, and each child process is logged under a named logger such as api, site, proxy, ext-provisioner, or prometheus.

HTTP request logging is implemented in coderd/httpmw/loggermw. Request log fields include user_agent, host, path, proto, remote_addr, start, status_code, latency_ms, route params, and selected safe query params. Responses with status codes of 500 or higher include the response body in the request log. Successful GET /api/v2 requests are skipped.

When investigating failures, keep the full terminal output from ./scripts/develop.sh. If you ran a command through Mux or another harness, record the command, exit code, and artifact path for the captured output.

Tracing

HTTP tracing lives in coderd/tracing. The middleware covers /api, /api/**, workspace app routes, and external auth callback routes. When an active trace span exists, responses include X-Trace-ID, X-Span-ID, and a W3C traceparent header.
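
When only the traceparent header is available, the trace and span IDs can be recovered from it. The sketch below parses the version `00` layout defined by the W3C Trace Context specification: `version-traceid-parentid-flags`. The type and function names are hypothetical. It is assumed, not confirmed by this guide, that the IDs in X-Trace-ID and X-Span-ID match the corresponding traceparent fields.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// traceParent holds the four fields of a version-00 traceparent header.
type traceParent struct {
	Version string
	TraceID string // 32 hex characters
	SpanID  string // 16 hex characters (called parent-id in the spec)
	Sampled bool   // lowest bit of the trace-flags field
}

// parseTraceParent validates field count, field lengths, and hex encoding.
// Stricter spec rules (lowercase only, no all-zero IDs) are omitted here.
func parseTraceParent(header string) (traceParent, error) {
	parts := strings.Split(strings.TrimSpace(header), "-")
	if len(parts) != 4 {
		return traceParent{}, errors.New("traceparent: expected 4 dash-separated fields")
	}
	wantLen := []int{2, 32, 16, 2}
	for i, p := range parts {
		if len(p) != wantLen[i] {
			return traceParent{}, fmt.Errorf("traceparent: field %d has length %d, want %d", i, len(p), wantLen[i])
		}
		if _, err := hex.DecodeString(p); err != nil {
			return traceParent{}, fmt.Errorf("traceparent: field %d is not hex: %w", i, err)
		}
	}
	flags, _ := hex.DecodeString(parts[3]) // already validated above
	return traceParent{
		Version: parts[0],
		TraceID: parts[1],
		SpanID:  parts[2],
		Sampled: flags[0]&0x01 == 1,
	}, nil
}

func main() {
	// Example value taken from the W3C Trace Context specification.
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("invalid header:", err)
		return
	}
	fmt.Println("trace:", tp.TraceID, "span:", tp.SpanID, "sampled:", tp.Sampled)
}
```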

Tracing export is controlled by existing server flags and environment variables, not by the develop orchestrator itself:

  • --trace or CODER_TRACE_ENABLE enables application tracing.
  • --trace-logs or CODER_TRACE_LOGS adds log events to traces.
  • --trace-honeycomb-api-key or CODER_TRACE_HONEYCOMB_API_KEY enables the Honeycomb exporter.
  • --trace-datadog or CODER_TRACE_DATADOG enables sending Go runtime traces to the local DataDog agent.

To pass server flags through the develop script, put them after --. For example, use ./scripts/develop.sh -- --trace when you already have an OTLP backend configured through the standard OpenTelemetry environment variables.

Prometheus metrics

./scripts/develop.sh enables Coder Prometheus metrics by default on 0.0.0.0:2114, served at http://localhost:2114/. The port is controlled by --prometheus-port or CODER_DEV_PROMETHEUS_PORT. Set it to 0 to disable metrics. The develop script passes these existing server flags when metrics are enabled: --prometheus-enable, --prometheus-address, --prometheus-collect-agent-stats, and --prometheus-collect-db-metrics.

If --prometheus-server or CODER_DEV_PROMETHEUS_SERVER is set, the develop script attempts to start a Docker container named coder-prometheus on Linux. The Prometheus UI listens on http://localhost:9090. If a previous container is reused, confirm the scrape target because it may point at an older metrics port.

Relevant metric implementations include:

  • coderd/httpmw/prometheus.go for HTTP request counters, concurrency gauges, websocket gauges, and latency histograms.
  • coderd/prometheusmetrics/ for active users, workspaces, agents, build info, experiments, insights, and agent stats collectors.
  • coderd/database/dbmetrics/ for database query and transaction metrics.
  • docs/admin/integrations/prometheus.md for the user-facing Prometheus integration guide and metric reference.

Correlating a failed action

Use this sequence when a browser or API action fails:

  1. Record the local clock time, browser action, URL, HTTP method, and response status from the browser network panel or test output.
  2. If the response includes X-Trace-ID and X-Span-ID, copy both values. Otherwise, copy the traceparent header if present.
  3. Search the ./scripts/develop.sh terminal output for the route, method, status code, response body, or timestamp. Match fields such as path, status_code, and latency_ms.
  4. Check http://localhost:2114/ for metrics that match the route or subsystem. Start with coderd_api_requests_processed_total, coderd_api_request_latencies_seconds, and database metrics under the coderd_db_ prefix.
  5. Attach the browser screenshot, trace, video, or command output artifact to the failure report when the harness produced one.
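
For step 4, the metrics page can be narrowed to the relevant series with a small filter. The Go sketch below keeps sample lines from Prometheus text exposition output whose names start with a given prefix. The function name is hypothetical, and the sample input is hand-written: its label names and values are illustrative, not captured from a real server.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// filterMetrics returns the sample lines whose metric name starts with one of
// the prefixes. Blank lines and comment lines (# HELP, # TYPE) are skipped.
func filterMetrics(exposition string, prefixes ...string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		for _, p := range prefixes {
			if strings.HasPrefix(line, p) {
				out = append(out, line)
				break
			}
		}
	}
	return out
}

func main() {
	// Illustrative sample only; labels and values are made up for this sketch.
	sample := `# HELP coderd_api_requests_processed_total Illustrative help text.
# TYPE coderd_api_requests_processed_total counter
coderd_api_requests_processed_total{code="500",method="POST"} 3
coderd_db_query_counts_total{query="ExampleQuery"} 12
go_goroutines 42
`
	for _, line := range filterMetrics(sample, "coderd_api_", "coderd_db_") {
		fmt.Println(line)
	}
}
```

Against a running dev server, the input would be the response body fetched from http://localhost:2114/.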

If an API request fails

  • Capture method, URL, status code, response body, and response headers.
  • Check the API log line for matching path, status_code, and latency_ms.
  • If the status is 500 or higher, include the logged response body.
  • Check coderd_api_requests_processed_total and coderd_api_request_latencies_seconds for the matching route.
  • If database work is involved, check coderd_db_query_counts_total, coderd_db_query_latencies_seconds, and transaction metrics.

If the frontend hangs

  • Confirm that the develop banner printed both the API and Web UI URLs.
  • Check the site logger output for Vite errors and dependency failures.
  • Use the browser network panel to separate frontend asset failures from API failures.
  • If API calls are pending or failing, follow the API request checklist above.
  • Capture browser console output and screenshots before retrying.

If a workspace provision fails

  • Capture the workspace build ID, template name, workspace name, user, and action that triggered the build.
  • Search logs for provisioner, workspace, build, and the workspace build ID.
  • Check whether ext-provisioner is running in the develop output.
  • Review metrics for API request failures, database latency, and agent stats if the failure reaches agent startup.
  • Preserve provisioner logs, template files, command output, and any browser artifacts from the failed flow.

Failure report checklist

Include these details in every observability failure report:

  • Absolute timestamp with timezone and the local command that was running.
  • Git branch, commit SHA, and whether generated files were fresh.
  • Browser action, API method, URL, route, status code, and response body.
  • X-Trace-ID, X-Span-ID, or traceparent when present.
  • Relevant log lines with nearby context.
  • Prometheus metrics checked and the observed values or absence of values.
  • Artifact paths for screenshots, traces, videos, logs, and command output.
  • Any cleanup performed before reproducing the failure again.