coder/.claude/docs/DEV_ISOLATION.md
Michael Suchacz 85792d08bc feat: add harness engineering layer for agent workflows (#24791)
This PR adds an opinionated harness-engineering layer for agent-driven
workflows: a small set of agent-readable docs, mechanical structure
checks, structured CI failure summaries, an architecture-lint umbrella,
and per-worktree dev-server isolation. The goal is to make local dev,
tests, and CI mechanically inspectable by agents without changing app
runtime behavior.

## What landed

**Agent docs and navigation**
- `.claude/docs/OBSERVABILITY.md`, `.claude/docs/DEV_ISOLATION.md`,
`.claude/docs/AGENT_FAILURES.md`: task-oriented guides for logs,
tracing, Prometheus, dev-server isolation, and a seeded failure catalog.
- `AGENTS.md`: added an `Agent navigation` block, then trimmed the file
from 375 to 229 lines by migrating duplicated detail into
`WORKFLOWS.md`, `GO.md`, `TESTING.md`, and `DATABASE.md`. The
user-managed custom-instructions block is preserved.
- `.agents/docs`: symlink mirror of `.claude/docs` for agent runtimes
that look under `.agents`.

**Mechanical checks**
- `scripts/check_agents_structure.sh`: validates `@...` references in
tracked `AGENTS.md` files and warns when root grows past 600 lines.
Wired as `make lint/agents` and into `make lint`.
- `scripts/audit-agent-readiness.sh`: report-first audit of harness
readiness. Currently `10 ok, 0 warn, 0 fail`.
- `scripts/check_architecture.sh` / `make lint/architecture`: umbrella
architecture-lint target. Consolidates the existing
`check_enterprise_imports.sh` and `check_codersdk_imports.sh` so they
run exactly once via the umbrella. Slot is open for new high-confidence
rules.

**Structured CI failure summaries**
- `scripts/playwright-failure-summary.sh`: parses
`site/test-results/results.json` and writes Markdown to
`$GITHUB_STEP_SUMMARY` on failure. Wired into the `test-e2e` matrix job.
- `scripts/go-test-failure-summary.sh`: parses `go test -json`
line-delimited output the same way. Wired into `test-go-pg`,
`test-go-pg-17`, and `test-go-race-pg` by injecting `gotestsum
--jsonfile` in the workflow without touching `Makefile`. JSON also
uploaded as a CI artifact on failure.
- `site/e2e/playwright.config.ts`: enables `screenshot:
only-on-failure`, `trace: retain-on-failure`, JSON reporter, and HTML
reporter alongside existing reporters.
- `.github/workflows/ci.yaml`: failure artifact uploads for Playwright
now use `if: failure()` and predictable names
(`playwright-artifacts-<variant>-<sha>`).

**Per-worktree dev-server isolation** (`scripts/develop/main.go`)
- Deterministic FNV-64a hash of the worktree path produces a port offset
in `[0, 1000)` (50 buckets, step 20 to avoid API/proxy overlap across
adjacent buckets).
- Offset is applied only to defaults; both env vars (`CODER_DEV_PORT`,
`CODER_DEV_WEB_PORT`, `CODER_DEV_PROXY_PORT`,
`CODER_DEV_PROMETHEUS_PORT`) and CLI flags retain priority.
- Hardcoded ports `9090` (embedded Prometheus UI) and `12345` (Delve)
are unchanged by design.
- Startup banner shows each port's source: `default`, `offset`, or
`explicit`.
- Unit tests in `scripts/develop/main_test.go` cover determinism,
bounds, no-overlap across the four ports, and explicit-skip behavior.
- State (`.coderv2/`) was already worktree-isolated via `os.Getwd()`, so
no state-dir changes were needed.

## Validation

`make lint/agents`, `make lint/architecture`, `make lint/emdash`, `bash
scripts/audit-agent-readiness.sh` (10 ok, 0 warn, 0 fail), `shellcheck`
on all 5 new scripts, `go test ./scripts/develop/...`, and `js-yaml`
parse of `ci.yaml` all pass. Synthetic fixtures verify both
failure-summary scripts handle empty/missing input (silent exit 0),
ANSI-stripped output, and parent/subtest formatting.

## Known follow-ups (deferred)

- Frontend Storybook/Vitest failure summary: lowest-leverage slice of
the failure-summary work. Skipping until observed pain.
- Architecture lint currently only delegates to existing import checks;
new rules (`InTx` outer-store detection, swagger-annotation lint) plug
in as needed.
- 50 port-offset buckets means two worktree paths can occasionally
collide. The DEV_ISOLATION doc tells users to set the relevant env var
when this happens.

> Mux opened this PR on Mike's behalf.
2026-05-11 17:27:29 +02:00


Development Isolation Guide for Agents

This guide documents the local resources that the existing harness uses. Use it to avoid collisions across worktrees and to clean up after failed runs. Do not add new readiness or debug endpoints for these workflows.

Default local ports

scripts/develop/main.go defines these base defaults:

| Resource | Base default | Override |
| --- | --- | --- |
| API server | 3000 | --port, CODER_DEV_PORT |
| Frontend dev server | 8080 | --web-port, CODER_DEV_WEB_PORT |
| Workspace proxy | 3010 | --proxy-port, CODER_DEV_PROXY_PORT |
| Coder Prometheus metrics | 2114 | --prometheus-port, CODER_DEV_PROMETHEUS_PORT |
| Embedded Prometheus UI | 9090 | Fixed in scripts/develop/main.go |
| Delve debugger | 12345 | Fixed when --debug is used |

By default, plain ./scripts/develop.sh uses the base defaults exactly: 3000 (API server), 8080 (frontend dev server), 3010 (workspace proxy), and 2114 (Coder Prometheus metrics). Set --port-offset or CODER_DEV_PORT_OFFSET=true to opt in to a deterministic per-worktree offset for these four ports.

When enabled, the develop script hashes the project root with FNV-64a, maps the hash into one of 50 buckets, multiplies the bucket index by 20, and adds the result (an offset between 0 and 980) to each unset base default. The same worktree path always gets the same effective ports. A flag or environment variable overrides only its own port; other unset ports still receive the opt-in offset.

The workspace proxy is only started when --use-proxy is set. The embedded Prometheus UI is only started when --prometheus-server or CODER_DEV_PROMETHEUS_SERVER is set, Docker is available, and the host is Linux. The Prometheus UI port 9090 and Delve port 12345 remain hardcoded.
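The hash-to-offset mapping described above can be sketched in a few lines of Go. This is a minimal illustration, not the code in scripts/develop/main.go: the helper name `portOffset`, the constant names, and the example path are hypothetical, and the real script may normalize the project root differently before hashing.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	offsetBuckets = 50 // number of hash buckets, per this guide
	offsetStep    = 20 // port gap between adjacent buckets, per this guide
)

// portOffset maps a project root path to a deterministic offset in [0, 980].
// Hypothetical helper: the actual function in scripts/develop/main.go may
// have a different name and may preprocess the path before hashing.
func portOffset(projectRoot string) int {
	h := fnv.New64a()
	_, _ = h.Write([]byte(projectRoot)) // hash.Hash.Write never returns an error
	return int(h.Sum64()%offsetBuckets) * offsetStep
}

func main() {
	root := "/home/dev/worktrees/feature-a" // hypothetical worktree path
	off := portOffset(root)
	fmt.Printf("offset=%d api=%d web=%d proxy=%d metrics=%d\n",
		off, 3000+off, 8080+off, 3010+off, 2114+off)
}
```

Because the step is 20 and the API and proxy base ports differ by 10, an offset API port in one bucket cannot land on an offset proxy port in another bucket.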

Other useful develop flags and environment variables

The develop script also supports these existing flags and environment variables:

| Purpose | Flag | Environment variable |
| --- | --- | --- |
| Per-worktree port offset | --port-offset | CODER_DEV_PORT_OFFSET |
| Access URL | --access-url | CODER_DEV_ACCESS_URL |
| Admin password | --password | CODER_DEV_ADMIN_PASSWORD |
| Starter template | --starter-template | CODER_DEV_STARTER_TEMPLATE |
| Roll back missing migrations | --db-rollback | CODER_DEV_DB_ROLLBACK |
| Reset the development database | --db-reset | CODER_DEV_DB_RESET |
| Accept changed migration tracking | --db-continue | CODER_DEV_DB_CONTINUE |

Extra coder server flags can be passed after --. For example, ./scripts/develop.sh -- --trace passes --trace to the API server.

Multi-worktree guidance

Each worktree gets its own .coderv2 directory because scripts/develop.sh sets the global config directory to <project-root>/.coderv2. This isolates built-in Postgres data, local session data, and Prometheus container storage on disk.

The configurable develop ports use the base defaults unless you opt in with --port-offset or CODER_DEV_PORT_OFFSET=true. Enable the offset when you run multiple worktrees in parallel and want most concurrent runs to work without manual port selection. When the offset is enabled, the startup banner prints the effective API, web, proxy, and Coder metrics ports with their offset status.
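The way an effective port is chosen can be sketched as a small decision function. This is an illustration under stated assumptions: `resolvePort` is a hypothetical helper, an explicit value of 0 stands for "neither flag nor environment variable set", and the labels `explicit`, `offset`, and `default` follow the port-source names mentioned in the commit message rather than a confirmed banner format.

```go
package main

import "fmt"

// resolvePort returns the effective port and a label describing its source.
// explicit == 0 means no flag or environment variable was given (an
// assumption of this sketch, not the script's real representation).
func resolvePort(base, explicit, offset int, offsetEnabled bool) (int, string) {
	switch {
	case explicit != 0:
		return explicit, "explicit" // an override affects only this port
	case offsetEnabled:
		return base + offset, "offset"
	default:
		return base, "default"
	}
}

func main() {
	offset := 240 // hypothetical: bucket 12 * step 20

	port, src := resolvePort(3000, 0, offset, true)
	fmt.Println(port, src) // prints "3240 offset"

	port, src = resolvePort(8080, 8180, offset, true)
	fmt.Println(port, src) // prints "8180 explicit"

	port, src = resolvePort(3010, 0, offset, false)
	fmt.Println(port, src) // prints "3010 default"
}
```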

Use overrides when you need fixed ports or when two worktree paths hash to the same offset. For example:

    CODER_DEV_PORT=3100 \
    CODER_DEV_WEB_PORT=8180 \
    CODER_DEV_PROXY_PORT=3110 \
    CODER_DEV_PROMETHEUS_PORT=2214 \
    ./scripts/develop.sh --use-proxy

The embedded Prometheus UI can run in only one worktree at a time, because its port is fixed at 9090 and its Docker container name is fixed to coder-prometheus. Delve is likewise fixed at 127.0.0.1:12345 when --debug is used.

Known collision risks

  • Two worktree paths can hash to the same opt-in offset. If preflight reports a busy effective port, set the relevant CODER_DEV_* environment variables or flags for one worktree.
  • The embedded Prometheus UI always uses port 9090.
  • The embedded Prometheus Docker container name is always coder-prometheus.
  • The Delve debugger always listens on 127.0.0.1:12345 when --debug is used.
  • The develop script only checks the proxy port when --use-proxy is set, so a stale process on the effective proxy port can go unnoticed until the proxy is enabled.
  • External databases configured through CODER_PG_CONNECTION_URL are shared if multiple worktrees point at the same database.
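To get a feel for how often the first risk above occurs, the following sketch estimates the chance that at least two of n worktrees share one of the 50 offset buckets. It assumes the hash spreads paths uniformly across buckets, which is an idealization; `collisionProbability` is a hypothetical helper written for this illustration.

```go
package main

import "fmt"

// collisionProbability returns the chance that at least two of n worktrees
// land in the same bucket, assuming each path is assigned a bucket
// uniformly and independently (the classic birthday calculation).
func collisionProbability(n, buckets int) float64 {
	if n > buckets {
		return 1 // pigeonhole: more worktrees than buckets
	}
	allDistinct := 1.0
	for i := 0; i < n; i++ {
		allDistinct *= float64(buckets-i) / float64(buckets)
	}
	return 1 - allDistinct
}

func main() {
	for _, n := range []int{2, 5, 10} {
		fmt.Printf("%2d worktrees: %.3f\n", n, collisionProbability(n, 50))
	}
}
```

Under that uniformity assumption, the chance is about 2% for two worktrees, roughly 19% for five, and roughly 62% for ten, so explicit overrides become the practical choice as the number of parallel worktrees grows.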

Readiness without new probes

Do not invent a new readiness probe. The develop script already waits for the API server to answer GET /healthz for up to 60 seconds, then logs "server is ready to accept connections". After setup completes, it prints a banner containing "Coder is now running in development mode", the effective port list, and the API and Web UI URLs.

For agent-driven runs, treat the banner as the ready signal for browser work. If the banner does not appear, inspect the preceding api, site, database recovery, and port conflict logs.
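An agent that captures the develop script's output could watch for the banner along these lines. This is a sketch, not part of the harness: `waitForBanner` and `sawBanner` are hypothetical helpers, and the match is a plain substring check on the banner text quoted above, so it adds no new probe or endpoint.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readyBanner is the banner text quoted in this guide.
const readyBanner = "Coder is now running in development mode"

// waitForBanner reads lines from r until the banner appears or the stream
// ends. It returns false if the stream ended without the banner.
func waitForBanner(r io.Reader) (bool, error) {
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if strings.Contains(sc.Text(), readyBanner) {
			return true, nil
		}
	}
	return false, sc.Err()
}

// sawBanner is a convenience wrapper for already-captured output.
func sawBanner(output string) bool {
	ok, _ := waitForBanner(strings.NewReader(output))
	return ok
}

func main() {
	// Hypothetical captured output; real log lines will look different.
	logs := "api: starting\nCoder is now running in development mode\n"
	fmt.Println(sawBanner(logs))              // prints "true"
	fmt.Println(sawBanner("api: starting\n")) // prints "false"
}
```

In a real run, `waitForBanner` would read from the process's stdout pipe. If it returns false, the preceding log lines are the place to look, as described above.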

Cleanup

Use the least destructive cleanup that fixes the problem:

  1. Stop ./scripts/develop.sh with Ctrl+C so child processes receive the orchestrator shutdown signal.
  2. If a child process remains, identify it with lsof -iTCP:<port> -sTCP:LISTEN or ps, then terminate only that stale process.
  3. To reset the built-in development database for the current worktree, rerun with ./scripts/develop.sh --db-reset or remove .coderv2/postgres after stopping the app.
  4. To clear local Coder session and generated state for the current worktree, remove the specific files under .coderv2 that are relevant to the failure.
  5. To clean the embedded Prometheus container, stop the develop script first, then remove the coder-prometheus container if it remains.
  6. To clean test databases, prefer the owning test harness cleanup. If tests were interrupted, inspect the local PostgreSQL instance used by the test suite before dropping any database.
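Before and after the cleanup steps above, it can help to confirm whether anything still answers on a port. The sketch below does this by dialing the loopback address. It complements `lsof` rather than replacing it, since it reports only that a listener exists and not which process owns it. The helper names are hypothetical, and this is not how the develop script's own preflight is implemented.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// portInUse reports whether something accepts TCP connections on
// 127.0.0.1:port. A refused or timed-out dial is treated as "free".
func portInUse(port int) bool {
	addr := fmt.Sprintf("127.0.0.1:%d", port)
	conn, err := net.DialTimeout("tcp", addr, 500*time.Millisecond)
	if err != nil {
		return false
	}
	_ = conn.Close()
	return true
}

// listenEphemeral opens a throwaway listener on an OS-chosen port so the
// check can be demonstrated without touching real dev ports.
func listenEphemeral() (port int, stop func(), err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, nil, err
	}
	return ln.Addr().(*net.TCPAddr).Port, func() { _ = ln.Close() }, nil
}

func main() {
	port, stop, err := listenEphemeral()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	fmt.Println("while listening:", portInUse(port))
	stop()
	fmt.Println("after close:", portInUse(port))

	// The configurable base defaults from this guide.
	for _, p := range []int{3000, 8080, 3010, 2114} {
		fmt.Printf("port %d in use: %v\n", p, portInUse(p))
	}
}
```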

For database migration mismatches, prefer the develop script's recovery flags before deleting state. Use --db-rollback when a migration disappeared from the current branch, --db-continue after you manually reconcile changed migration tracking, and --db-reset only when data loss is acceptable.