mirror of
https://github.com/coder/coder.git
synced 2026-06-02 20:48:20 +00:00
85792d08bc
This PR adds an opinionated harness-engineering layer for agent-driven workflows: a small set of agent-readable docs, mechanical structure checks, structured CI failure summaries, an architecture-lint umbrella, and per-worktree dev-server isolation. The goal is to make local dev, tests, and CI mechanically inspectable by agents without changing app runtime behavior.

## What landed

**Agent docs and navigation**

- `.claude/docs/OBSERVABILITY.md`, `.claude/docs/DEV_ISOLATION.md`, `.claude/docs/AGENT_FAILURES.md`: task-oriented guides for logs, tracing, Prometheus, dev-server isolation, and a seeded failure catalog.
- `AGENTS.md`: added an `Agent navigation` block, then trimmed the file from 375 to 229 lines by migrating duplicated detail into `WORKFLOWS.md`, `GO.md`, `TESTING.md`, and `DATABASE.md`. The user-managed custom-instructions block is preserved.
- `.agents/docs`: symlink mirror of `.claude/docs` for agent runtimes that look under `.agents`.

**Mechanical checks**

- `scripts/check_agents_structure.sh`: validates `@...` references in tracked `AGENTS.md` files and warns when root grows past 600 lines. Wired as `make lint/agents` and into `make lint`.
- `scripts/audit-agent-readiness.sh`: report-first audit of harness readiness. Currently `10 ok, 0 warn, 0 fail`.
- `scripts/check_architecture.sh` / `make lint/architecture`: umbrella architecture-lint target. Consolidates the existing `check_enterprise_imports.sh` and `check_codersdk_imports.sh` so they run exactly once via the umbrella. Slot is open for new high-confidence rules.

**Structured CI failure summaries**

- `scripts/playwright-failure-summary.sh`: parses `site/test-results/results.json` and writes Markdown to `$GITHUB_STEP_SUMMARY` on failure. Wired into the `test-e2e` matrix job.
- `scripts/go-test-failure-summary.sh`: parses `go test -json` line-delimited output the same way. Wired into `test-go-pg`, `test-go-pg-17`, and `test-go-race-pg` by injecting `gotestsum --jsonfile` in the workflow without touching `Makefile`. JSON also uploaded as a CI artifact on failure.
- `site/e2e/playwright.config.ts`: enables `screenshot: only-on-failure`, `trace: retain-on-failure`, JSON reporter, and HTML reporter alongside existing reporters.
- `.github/workflows/ci.yaml`: failure artifact uploads for Playwright now use `if: failure()` and predictable names (`playwright-artifacts-<variant>-<sha>`).

**Per-worktree dev-server isolation** (`scripts/develop/main.go`)

- Deterministic FNV-64a hash of the worktree path produces a port offset in `[0, 1000)` (50 buckets, step 20 to avoid API/proxy overlap across adjacent buckets).
- Offset is applied only to defaults; both env vars (`CODER_DEV_PORT`, `CODER_DEV_WEB_PORT`, `CODER_DEV_PROXY_PORT`, `CODER_DEV_PROMETHEUS_PORT`) and CLI flags retain priority.
- Hardcoded ports `9090` (embedded Prometheus UI) and `12345` (Delve) are unchanged by design.
- Startup banner shows each port's source: `default`, `offset`, or `explicit`.
- Unit tests in `scripts/develop/main_test.go` cover determinism, bounds, no-overlap across the four ports, and explicit-skip behavior.
- State (`.coderv2/`) was already worktree-isolated via `os.Getwd()`, so no state-dir changes were needed.

## Validation

`make lint/agents`, `make lint/architecture`, `make lint/emdash`, `bash scripts/audit-agent-readiness.sh` (10 ok, 0 warn, 0 fail), `shellcheck` on all 5 new scripts, `go test ./scripts/develop/...`, and `js-yaml` parse of `ci.yaml` all pass. Synthetic fixtures verify both failure-summary scripts handle empty/missing input (silent exit 0), ANSI-stripped output, and parent/subtest formatting.

## Known follow-ups (deferred)

- Frontend Storybook/Vitest failure summary: lowest-leverage slice of the failure-summary work. Skipping until observed pain.
- Architecture lint currently only delegates to existing import checks; new rules (`InTx` outer-store detection, swagger-annotation lint) plug in as needed.
- 50 port-offset buckets means two worktree paths can occasionally collide. The DEV_ISOLATION doc tells users to set the relevant env var when this happens.

> Mux opened this PR on Mike's behalf.
149 lines
6.9 KiB
Markdown
# Observability Guide for Agents

This guide maps the observability surfaces that already exist in local
Coder development. Do not add new endpoints for agent debugging. Prefer the
existing logs, tracing, Prometheus metrics, browser artifacts, and command
output described here.

## Start the app

Use `./scripts/develop.sh` for local development. See
[Development Workflows and Guidelines](WORKFLOWS.md) for the full workflow.
The script builds the dev orchestrator, starts the API server and frontend,
waits for the API server to answer `/healthz`, creates the first user if
needed, and prints a banner with the local URLs.

Useful defaults from `scripts/develop/main.go` are:

- API server: `http://localhost:3000`.
- Frontend dev server: `http://localhost:8080`.
- Workspace proxy, when `--use-proxy` is set: `http://localhost:3010`.
- Coder Prometheus metrics: `http://localhost:2114/`.
- Embedded Prometheus UI, when `--prometheus-server` is set and Docker is
  available on Linux: `http://localhost:9090`.

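An agent that starts the app in the background may want the same readiness signal the script waits for. The following is a minimal sketch, assuming the default API URL listed above and assuming that a healthy server answers `/healthz` with HTTP 200. The function names `probe_healthz` and `wait_for_api` are hypothetical helpers, not part of the repository.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_healthz(base_url: str, timeout: float = 2.0) -> bool:
    """Return True when GET <base_url>/healthz answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_for_api(
    base_url: str = "http://localhost:3000",  # default from the list above
    attempts: int = 60,
    delay: float = 1.0,
    probe: Callable[[str], bool] = probe_healthz,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the health endpoint until it answers or attempts run out."""
    for attempt in range(attempts):
        if probe(base_url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Call `wait_for_api()` after launching `./scripts/develop.sh`. If the banner prints a different API URL, pass that URL instead of the default.
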
## Local logs

`./scripts/develop.sh` writes orchestrator and child process logs to the
terminal. The orchestrator uses `sloghuman`, and each child process is logged
under a named logger such as `api`, `site`, `proxy`, `ext-provisioner`, or
`prometheus`.

HTTP request logging is implemented in `coderd/httpmw/loggermw`. Request log
fields include `user_agent`, `host`, `path`, `proto`, `remote_addr`, `start`,
`status_code`, `latency_ms`, route params, and selected safe query params.
Responses with status codes of 500 or higher include the response body in the
request log. Successful `GET /api/v2` requests are skipped.

When investigating failures, keep the full terminal output from
`./scripts/develop.sh`. If you ran a command through Mux or another harness,
record the command, exit code, and artifact path for the captured output.

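Recording the command, exit code, and artifact path can be sketched as below. This is an illustration for commands that run to completion, not for the long-running develop script itself. The helper name `run_and_record` and the file naming scheme are assumptions made for this example.

```python
import json
import subprocess
import time
from pathlib import Path


def run_and_record(command: list[str], artifact_dir: Path) -> dict:
    """Run a command, save its combined output, and return a small record."""
    artifact_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S%z")
    log_path = artifact_dir / f"command-{stamp}.log"
    # Merge stderr into stdout so the log keeps the terminal ordering.
    result = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    log_path.write_text(result.stdout)
    record = {
        "command": command,
        "exit_code": result.returncode,
        "artifact_path": str(log_path),
    }
    # Keep the record next to the log so both travel with the failure report.
    (artifact_dir / f"command-{stamp}.json").write_text(json.dumps(record, indent=2))
    return record
```

The returned dictionary holds the three values the paragraph above asks for, and the JSON copy sits beside the captured output.
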
## Tracing

HTTP tracing lives in `coderd/tracing`. The middleware covers `/api`,
`/api/**`, workspace app routes, and external auth callback routes. When an
active trace span exists, responses include `X-Trace-ID`, `X-Span-ID`, and a
W3C `traceparent` header.

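Pulling these identifiers out of a response can be sketched as follows. The `traceparent` pattern follows the W3C Trace Context layout for version `00` (`version-traceid-parentid-flags`). The helper name `trace_context` is hypothetical, and the sketch does not assume that the `X-` header values match the `traceparent` fields.

```python
import re
from typing import Mapping

# W3C Trace Context, version 00: version-traceid-parentid-flags in lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def trace_context(headers: Mapping[str, str]) -> dict:
    """Collect trace identifiers from response headers, case-insensitively."""
    lowered = {k.lower(): v.strip() for k, v in headers.items()}
    found = {
        "trace_id": lowered.get("x-trace-id"),
        "span_id": lowered.get("x-span-id"),
        "traceparent": lowered.get("traceparent"),
    }
    match = TRACEPARENT_RE.match(found["traceparent"] or "")
    if match:
        # Fall back to the traceparent fields when the X- headers are absent.
        found["trace_id"] = found["trace_id"] or match.group("trace_id")
        found["span_id"] = found["span_id"] or match.group("span_id")
    return found
```

All three values come back as `None` when no span was active, which is itself worth noting in a failure report.
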
Tracing export is controlled by existing server flags and environment
variables, not by the develop orchestrator itself:

- `--trace` or `CODER_TRACE_ENABLE` enables application tracing.
- `--trace-logs` or `CODER_TRACE_LOGS` adds log events to traces.
- `--trace-honeycomb-api-key` or `CODER_TRACE_HONEYCOMB_API_KEY` enables the
  Honeycomb exporter.
- `--trace-datadog` or `CODER_TRACE_DATADOG` enables sending Go runtime
  traces to the local DataDog agent.

To pass server flags through the develop script, put them after `--`. For
example, use `./scripts/develop.sh -- --trace` when you already have an OTLP
backend configured through the standard OpenTelemetry environment variables.

## Prometheus metrics

`./scripts/develop.sh` enables Coder Prometheus metrics by default on
`0.0.0.0:2114`, served at `http://localhost:2114/`. The port is controlled by
`--prometheus-port` or `CODER_DEV_PROMETHEUS_PORT`. Set it to `0` to disable
metrics. The develop script passes these existing server flags when metrics are
enabled: `--prometheus-enable`, `--prometheus-address`,
`--prometheus-collect-agent-stats`, and `--prometheus-collect-db-metrics`.

If `--prometheus-server` or `CODER_DEV_PROMETHEUS_SERVER` is set, the develop
script attempts to start a Docker container named `coder-prometheus` on Linux.
The Prometheus UI listens on `http://localhost:9090`. If a previous container
is reused, confirm the scrape target because it may point at an older metrics
port.

Relevant metric implementations include:

- `coderd/httpmw/prometheus.go` for HTTP request counters, concurrency gauges,
  websocket gauges, and latency histograms.
- `coderd/prometheusmetrics/` for active users, workspaces, agents, build
  info, experiments, insights, and agent stats collectors.
- `coderd/database/dbmetrics/` for database query and transaction metrics.
- `docs/admin/integrations/prometheus.md` for the user-facing Prometheus
  integration guide and metric reference.

## Correlating a failed action

Use this sequence when a browser or API action fails:

1. Record the local clock time, browser action, URL, HTTP method, and response
   status from the browser network panel or test output.
2. If the response includes `X-Trace-ID` or `X-Span-ID`, copy both values. If
   not, copy the `traceparent` header if present.
3. Search the `./scripts/develop.sh` terminal output for the route, method,
   status code, response body, or timestamp. Match fields such as `path`,
   `status_code`, and `latency_ms`.
4. Check `http://localhost:2114/` for metrics that match the route or subsystem.
   Start with `coderd_api_requests_processed_total`,
   `coderd_api_request_latencies_seconds`, and database metrics under the
   `coderd_db_` prefix.
5. Attach the browser screenshot, trace, video, or command output artifact to
   the failure report when the harness produced one.

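Step 4 above can be sketched as a small filter over the Prometheus text exposition. The metric names and the `coderd_db_` prefix come from that step. The helper names `filter_metrics` and `fetch_metrics` are hypothetical, and the sketch assumes the endpoint serves the standard text format with `#` comment lines.

```python
import urllib.request

# Metric names and the database prefix named in step 4.
DEFAULT_PREFIXES = (
    "coderd_api_requests_processed_total",
    "coderd_api_request_latencies_seconds",
    "coderd_db_",
)


def filter_metrics(exposition: str, prefixes=DEFAULT_PREFIXES) -> list[str]:
    """Keep sample lines whose metric name starts with one of the prefixes."""
    kept = []
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        if line.startswith(tuple(prefixes)):
            kept.append(line)
    return kept


def fetch_metrics(url: str = "http://localhost:2114/", timeout: float = 5.0) -> str:
    """Download the text exposition from the dev metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Running `filter_metrics(fetch_metrics())` before and after reproducing the failure makes it easier to see which counters moved. An empty result is also worth recording.
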
## If an API request fails

- Capture method, URL, status code, response body, and response headers.
- Check the API log line for matching `path`, `status_code`, and `latency_ms`.
- If the status is 500 or higher, include the logged response body.
- Check `coderd_api_requests_processed_total` and
  `coderd_api_request_latencies_seconds` for the matching route.
- If database work is involved, check `coderd_db_query_counts_total`,
  `coderd_db_query_latencies_seconds`, and transaction metrics.

## If the frontend hangs

- Confirm that the develop banner printed both the API and Web UI URLs.
- Check the `site` logger output for Vite errors and dependency failures.
- Use the browser network panel to separate frontend asset failures from API
  failures.
- If API calls are pending or failing, follow the API request checklist above.
- Capture browser console output and screenshots before retrying.

## If a workspace provision fails

- Capture the workspace build ID, template name, workspace name, user, and
  action that triggered the build.
- Search logs for `provisioner`, `workspace`, `build`, and the workspace build
  ID.
- Check whether `ext-provisioner` is running in the develop output.
- Review metrics for API request failures, database latency, and agent stats if
  the failure reaches agent startup.
- Preserve provisioner logs, template files, command output, and any browser
  artifacts from the failed flow.

## Failure report checklist

Include these details in every observability failure report:

- Absolute timestamp with timezone and the local command that was running.
- Git branch, commit SHA, and whether generated files were fresh.
- Browser action, API method, URL, route, status code, and response body.
- `X-Trace-ID`, `X-Span-ID`, or `traceparent` when present.
- Relevant log lines with nearby context.
- Prometheus metrics checked and the observed values or absence of values.
- Artifact paths for screenshots, traces, videos, logs, and command output.
- Any cleanup performed before reproducing the failure again.
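
One way to keep reports consistent is to hold the checklist items in a single structure and render them the same way each time. The sketch below is only an illustration: `FailureReport` and its field names are hypothetical and do not correspond to any type in the repository.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FailureReport:
    """Hypothetical container for the checklist items; not a Coder type."""

    timestamp: str  # absolute time with timezone, for example ISO 8601
    command: str  # local command that was running
    branch: str
    commit_sha: str
    generated_files_fresh: bool
    action: str  # browser action or API call description
    method: str
    url: str
    route: str
    status_code: int
    response_body: str = ""
    trace_id: Optional[str] = None
    span_id: Optional[str] = None
    traceparent: Optional[str] = None
    log_lines: list[str] = field(default_factory=list)
    metrics: dict[str, str] = field(default_factory=dict)
    artifacts: list[str] = field(default_factory=list)
    cleanup: str = "none"

    def to_markdown(self) -> str:
        """Render the report as a Markdown bullet list."""
        trace = self.trace_id or self.traceparent or "not present"
        out = [
            f"- Time and command: {self.timestamp}, `{self.command}`",
            f"- Git: {self.branch} @ {self.commit_sha}, "
            f"generated files fresh: {self.generated_files_fresh}",
            f"- Request: {self.action}, {self.method} {self.url} "
            f"(route `{self.route}`), status {self.status_code}",
            f"- Response body: {self.response_body or 'empty'}",
            f"- Trace: {trace}, span: {self.span_id or 'not present'}",
        ]
        out += [f"- Log: `{line}`" for line in self.log_lines]
        # Record absent metrics explicitly; absence is also evidence.
        out += [f"- Metric {name}: {value}" for name, value in self.metrics.items()]
        out += [f"- Artifact: {path}" for path in self.artifacts]
        out.append(f"- Cleanup before retry: {self.cleanup}")
        return "\n".join(out)
```

Fields without defaults mirror the items that every report needs, so constructing a report without them fails early instead of producing a partial record.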