This PR adds an opinionated harness-engineering layer for agent-driven workflows: a small set of agent-readable docs, mechanical structure checks, structured CI failure summaries, an architecture-lint umbrella, and per-worktree dev-server isolation. The goal is to make local dev, tests, and CI mechanically inspectable by agents without changing app runtime behavior.

## What landed

**Agent docs and navigation**

- `.claude/docs/OBSERVABILITY.md`, `.claude/docs/DEV_ISOLATION.md`, `.claude/docs/AGENT_FAILURES.md`: task-oriented guides for logs, tracing, Prometheus, dev-server isolation, and a seeded failure catalog.
- `AGENTS.md`: added an `Agent navigation` block, then trimmed the file from 375 to 229 lines by migrating duplicated detail into `WORKFLOWS.md`, `GO.md`, `TESTING.md`, and `DATABASE.md`. The user-managed custom-instructions block is preserved.
- `.agents/docs`: symlink mirror of `.claude/docs` for agent runtimes that look under `.agents`.

**Mechanical checks**

- `scripts/check_agents_structure.sh`: validates `@...` references in tracked `AGENTS.md` files and warns when root grows past 600 lines. Wired as `make lint/agents` and into `make lint`.
- `scripts/audit-agent-readiness.sh`: report-first audit of harness readiness. Currently `10 ok, 0 warn, 0 fail`.
- `scripts/check_architecture.sh` / `make lint/architecture`: umbrella architecture-lint target. Consolidates the existing `check_enterprise_imports.sh` and `check_codersdk_imports.sh` so they run exactly once via the umbrella. Slot is open for new high-confidence rules.

**Structured CI failure summaries**

- `scripts/playwright-failure-summary.sh`: parses `site/test-results/results.json` and writes Markdown to `$GITHUB_STEP_SUMMARY` on failure. Wired into the `test-e2e` matrix job.
- `scripts/go-test-failure-summary.sh`: parses `go test -json` line-delimited output the same way. Wired into `test-go-pg`, `test-go-pg-17`, and `test-go-race-pg` by injecting `gotestsum --jsonfile` in the workflow without touching `Makefile`. JSON also uploaded as a CI artifact on failure.
- `site/e2e/playwright.config.ts`: enables `screenshot: only-on-failure`, `trace: retain-on-failure`, JSON reporter, and HTML reporter alongside existing reporters.
- `.github/workflows/ci.yaml`: failure artifact uploads for Playwright now use `if: failure()` and predictable names (`playwright-artifacts-<variant>-<sha>`).

**Per-worktree dev-server isolation** (`scripts/develop/main.go`)

- Deterministic FNV-64a hash of the worktree path produces a port offset in `[0, 1000)` (50 buckets, step 20 to avoid API/proxy overlap across adjacent buckets).
- Offset is applied only to defaults; both env vars (`CODER_DEV_PORT`, `CODER_DEV_WEB_PORT`, `CODER_DEV_PROXY_PORT`, `CODER_DEV_PROMETHEUS_PORT`) and CLI flags retain priority.
- Hardcoded ports `9090` (embedded Prometheus UI) and `12345` (Delve) are unchanged by design.
- Startup banner shows each port's source: `default`, `offset`, or `explicit`.
- Unit tests in `scripts/develop/main_test.go` cover determinism, bounds, no-overlap across the four ports, and explicit-skip behavior.
- State (`.coderv2/`) was already worktree-isolated via `os.Getwd()`, so no state-dir changes were needed.

## Validation

`make lint/agents`, `make lint/architecture`, `make lint/emdash`, `bash scripts/audit-agent-readiness.sh` (10 ok, 0 warn, 0 fail), `shellcheck` on all 5 new scripts, `go test ./scripts/develop/...`, and `js-yaml` parse of `ci.yaml` all pass. Synthetic fixtures verify both failure-summary scripts handle empty/missing input (silent exit 0), ANSI-stripped output, and parent/subtest formatting.

## Known follow-ups (deferred)

- Frontend Storybook/Vitest failure summary: lowest-leverage slice of the failure-summary work. Skipping until observed pain.
- Architecture lint currently only delegates to existing import checks; new rules (`InTx` outer-store detection, swagger-annotation lint) plug in as needed.
- 50 port-offset buckets means two worktree paths can occasionally collide. The DEV_ISOLATION doc tells users to set the relevant env var when this happens.

> Mux opened this PR on Mike's behalf.
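
For reviewers, the per-worktree port offset described under "Per-worktree dev-server isolation" can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/develop/main.go`: the names `portOffset`, `bucketCount`, and `bucketStep` are hypothetical, and the modulo-then-multiply mapping is an assumption that merely satisfies the stated properties (FNV-64a, 50 buckets, step 20, range `[0, 1000)`).

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	bucketCount = 50 // number of offset buckets, per the PR description
	bucketStep  = 20 // spacing between buckets so adjacent buckets do not overlap
)

// portOffset maps a worktree path to a deterministic offset in [0, 1000).
// Assumption: the bucket is chosen with a simple modulo of the FNV-64a hash.
func portOffset(worktreePath string) int {
	h := fnv.New64a()
	_, _ = h.Write([]byte(worktreePath)) // hash.Hash writes never return an error
	return int(h.Sum64()%bucketCount) * bucketStep
}

func main() {
	// Hypothetical worktree paths, used only to show the mapping.
	for _, p := range []string{"/home/dev/coder", "/home/dev/coder-feature"} {
		off := portOffset(p)
		fmt.Printf("%s -> offset %d (API port %d)\n", p, off, 3000+off)
	}
}
```

With only 50 buckets, two different paths can land on the same offset, which is the collision case listed in the follow-ups.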
# Observability Guide for Agents
This guide maps the observability surfaces that already exist in local Coder development. Do not add new endpoints for agent debugging. Prefer the existing logs, tracing, Prometheus metrics, browser artifacts, and command output described here.
## Start the app

Use `./scripts/develop.sh` for local development. See
Development Workflows and Guidelines for the full workflow.

The script builds the dev orchestrator, starts the API server and frontend,
waits for the API server to answer `/healthz`, creates the first user if
needed, and prints a banner with the local URLs.

Useful defaults from `scripts/develop/main.go` are:

- API server: `http://localhost:3000`.
- Frontend dev server: `http://localhost:8080`.
- Workspace proxy, when `--use-proxy` is set: `http://localhost:3010`.
- Coder Prometheus metrics: `http://localhost:2114/`.
- Embedded Prometheus UI, when `--prometheus-server` is set and Docker is
  available on Linux: `http://localhost:9090`.
## Local logs

`./scripts/develop.sh` writes orchestrator and child process logs to the
terminal. The orchestrator uses `sloghuman`, and each child process is logged
under a named logger such as `api`, `site`, `proxy`, `ext-provisioner`, or
`prometheus`.

HTTP request logging is implemented in `coderd/httpmw/loggermw`. Request log
fields include `user_agent`, `host`, `path`, `proto`, `remote_addr`, `start`,
`status_code`, `latency_ms`, route params, and selected safe query params.
Responses with status codes of 500 or higher include the response body in the
request log. Successful `GET /api/v2` requests are skipped.

When investigating failures, keep the full terminal output from
`./scripts/develop.sh`. If you ran a command through Mux or another harness,
record the command, exit code, and artifact path for the captured output.
## Tracing

HTTP tracing lives in `coderd/tracing`. The middleware covers `/api`,
`/api/**`, workspace app routes, and external auth callback routes. When an
active trace span exists, responses include `X-Trace-ID`, `X-Span-ID`, and a
W3C `traceparent` header.

Tracing export is controlled by existing server flags and environment variables, not by the develop orchestrator itself:

- `--trace` or `CODER_TRACE_ENABLE` enables application tracing.
- `--trace-logs` or `CODER_TRACE_LOGS` adds log events to traces.
- `--trace-honeycomb-api-key` or `CODER_TRACE_HONEYCOMB_API_KEY` enables the
  Honeycomb exporter.
- `--trace-datadog` or `CODER_TRACE_DATADOG` enables sending Go runtime traces
  to the local DataDog agent.

To pass server flags through the develop script, put them after `--`. For
example, use `./scripts/develop.sh -- --trace` when you already have an OTLP
backend configured through the standard OpenTelemetry environment variables.
## Prometheus metrics

`./scripts/develop.sh` enables Coder Prometheus metrics by default on
`0.0.0.0:2114`, served at `http://localhost:2114/`. The port is controlled by
`--prometheus-port` or `CODER_DEV_PROMETHEUS_PORT`. Set it to `0` to disable
metrics. The develop script passes these existing server flags when metrics are
enabled: `--prometheus-enable`, `--prometheus-address`,
`--prometheus-collect-agent-stats`, and `--prometheus-collect-db-metrics`.

If `--prometheus-server` or `CODER_DEV_PROMETHEUS_SERVER` is set, the develop
script attempts to start a Docker container named `coder-prometheus` on Linux.
The Prometheus UI listens on `http://localhost:9090`. If a previous container
is reused, confirm the scrape target because it may point at an older metrics
port.

Relevant metric implementations include:

- `coderd/httpmw/prometheus.go` for HTTP request counters, concurrency gauges,
  websocket gauges, and latency histograms.
- `coderd/prometheusmetrics/` for active users, workspaces, agents, build
  info, experiments, insights, and agent stats collectors.
- `coderd/database/dbmetrics/` for database query and transaction metrics.
- `docs/admin/integrations/prometheus.md` for the user-facing Prometheus
  integration guide and metric reference.
## Correlating a failed action

Use this sequence when a browser or API action fails:

1. Record the local clock time, browser action, URL, HTTP method, and response
   status from the browser network panel or test output.
2. If the response includes `X-Trace-ID` or `X-Span-ID`, copy both values. If
   not, copy the `traceparent` header if present.
3. Search the `./scripts/develop.sh` terminal output for the route, method,
   status code, response body, or timestamp. Match fields such as `path`,
   `status_code`, and `latency_ms`.
4. Check `http://localhost:2114/` for metrics that match the route or
   subsystem. Start with `coderd_api_requests_processed_total`,
   `coderd_api_request_latencies_seconds`, and database metrics under the
   `coderd_db_` prefix.
5. Attach the browser screenshot, trace, video, or command output artifact to
   the failure report when the harness produced one.
## If an API request fails

- Capture method, URL, status code, response body, and response headers.
- Check the API log line for matching `path`, `status_code`, and `latency_ms`.
- If the status is 500 or higher, include the logged response body.
- Check `coderd_api_requests_processed_total` and
  `coderd_api_request_latencies_seconds` for the matching route.
- If database work is involved, check `coderd_db_query_counts_total`,
  `coderd_db_query_latencies_seconds`, and transaction metrics.
## If the frontend hangs

- Confirm that the develop banner printed both the API and Web UI URLs.
- Check the `site` logger output for Vite errors and dependency failures.
- Use the browser network panel to separate frontend asset failures from API
  failures.
- If API calls are pending or failing, follow the API request checklist above.
- Capture browser console output and screenshots before retrying.
## If a workspace provision fails

- Capture the workspace build ID, template name, workspace name, user, and
  action that triggered the build.
- Search logs for `provisioner`, `workspace`, `build`, and the workspace build
  ID.
- Check whether `ext-provisioner` is running in the develop output.
- Review metrics for API request failures, database latency, and agent stats
  if the failure reaches agent startup.
- Preserve provisioner logs, template files, command output, and any browser
  artifacts from the failed flow.
## Failure report checklist

Include these details in every observability failure report:

- Absolute timestamp with timezone and the local command that was running.
- Git branch, commit SHA, and whether generated files were fresh.
- Browser action, API method, URL, route, status code, and response body.
- `X-Trace-ID`, `X-Span-ID`, or `traceparent` when present.
- Relevant log lines with nearby context.
- Prometheus metrics checked and the observed values or absence of values.
- Artifact paths for screenshots, traces, videos, logs, and command output.
- Any cleanup performed before reproducing the failure again.