This PR adds an opinionated harness-engineering layer for agent-driven workflows: a small set of agent-readable docs, mechanical structure checks, structured CI failure summaries, an architecture-lint umbrella, and per-worktree dev-server isolation. The goal is to make local dev, tests, and CI mechanically inspectable by agents without changing app runtime behavior.

## What landed

**Agent docs and navigation**

- `.claude/docs/OBSERVABILITY.md`, `.claude/docs/DEV_ISOLATION.md`, `.claude/docs/AGENT_FAILURES.md`: task-oriented guides for logs, tracing, Prometheus, dev-server isolation, and a seeded failure catalog.
- `AGENTS.md`: added an `Agent navigation` block, then trimmed the file from 375 to 229 lines by migrating duplicated detail into `WORKFLOWS.md`, `GO.md`, `TESTING.md`, and `DATABASE.md`. The user-managed custom-instructions block is preserved.
- `.agents/docs`: symlink mirror of `.claude/docs` for agent runtimes that look under `.agents`.

**Mechanical checks**

- `scripts/check_agents_structure.sh`: validates `@...` references in tracked `AGENTS.md` files and warns when root grows past 600 lines. Wired as `make lint/agents` and into `make lint`.
- `scripts/audit-agent-readiness.sh`: report-first audit of harness readiness. Currently `10 ok, 0 warn, 0 fail`.
- `scripts/check_architecture.sh` / `make lint/architecture`: umbrella architecture-lint target. Consolidates the existing `check_enterprise_imports.sh` and `check_codersdk_imports.sh` so they run exactly once via the umbrella. Slot is open for new high-confidence rules.

**Structured CI failure summaries**

- `scripts/playwright-failure-summary.sh`: parses `site/test-results/results.json` and writes Markdown to `$GITHUB_STEP_SUMMARY` on failure. Wired into the `test-e2e` matrix job.
- `scripts/go-test-failure-summary.sh`: parses `go test -json` line-delimited output the same way. Wired into `test-go-pg`, `test-go-pg-17`, and `test-go-race-pg` by injecting `gotestsum --jsonfile` in the workflow without touching `Makefile`. JSON also uploaded as a CI artifact on failure.
- `site/e2e/playwright.config.ts`: enables `screenshot: only-on-failure`, `trace: retain-on-failure`, JSON reporter, and HTML reporter alongside existing reporters.
- `.github/workflows/ci.yaml`: failure artifact uploads for Playwright now use `if: failure()` and predictable names (`playwright-artifacts-<variant>-<sha>`).

**Per-worktree dev-server isolation** (`scripts/develop/main.go`)

- Deterministic FNV-64a hash of the worktree path produces a port offset in `[0, 1000)` (50 buckets, step 20 to avoid API/proxy overlap across adjacent buckets).
- Offset is applied only to defaults; both env vars (`CODER_DEV_PORT`, `CODER_DEV_WEB_PORT`, `CODER_DEV_PROXY_PORT`, `CODER_DEV_PROMETHEUS_PORT`) and CLI flags retain priority.
- Hardcoded ports `9090` (embedded Prometheus UI) and `12345` (Delve) are unchanged by design.
- Startup banner shows each port's source: `default`, `offset`, or `explicit`.
- Unit tests in `scripts/develop/main_test.go` cover determinism, bounds, no-overlap across the four ports, and explicit-skip behavior.
- State (`.coderv2/`) was already worktree-isolated via `os.Getwd()`, so no state-dir changes were needed.

## Validation

`make lint/agents`, `make lint/architecture`, `make lint/emdash`, `bash scripts/audit-agent-readiness.sh` (10 ok, 0 warn, 0 fail), `shellcheck` on all 5 new scripts, `go test ./scripts/develop/...`, and `js-yaml` parse of `ci.yaml` all pass. Synthetic fixtures verify both failure-summary scripts handle empty/missing input (silent exit 0), ANSI-stripped output, and parent/subtest formatting.

## Known follow-ups (deferred)

- Frontend Storybook/Vitest failure summary: lowest-leverage slice of the failure-summary work. Skipping until observed pain.
- Architecture lint currently only delegates to existing import checks; new rules (`InTx` outer-store detection, swagger-annotation lint) plug in as needed.
- 50 port-offset buckets means two worktree paths can occasionally collide. The DEV_ISOLATION doc tells users to set the relevant env var when this happens.

> Mux opened this PR on Mike's behalf.

# Development Isolation Guide for Agents

This guide documents the local resources that the existing harness uses. Use it
to avoid collisions across worktrees and to clean up after failed runs. Do not
add new readiness or debug endpoints for these workflows.

## Default local ports

`scripts/develop/main.go` defines these base defaults:

| Resource | Base default | Override |
|---|---|---|
| API server | `3000` | `--port`, `CODER_DEV_PORT` |
| Frontend dev server | `8080` | `--web-port`, `CODER_DEV_WEB_PORT` |
| Workspace proxy | `3010` | `--proxy-port`, `CODER_DEV_PROXY_PORT` |
| Coder Prometheus metrics | `2114` | `--prometheus-port`, `CODER_DEV_PROMETHEUS_PORT` |
| Embedded Prometheus UI | `9090` | Fixed in `scripts/develop/main.go` |
| Delve debugger | `12345` | Fixed when `--debug` is used |

By default, plain `./scripts/develop.sh` uses the base defaults exactly: `3000`
(API), `8080` (frontend), `3010` (workspace proxy), and `2114` (Coder Prometheus
metrics). Set `--port-offset` or `CODER_DEV_PORT_OFFSET=true` to opt in to a
deterministic per-worktree offset for the API, frontend, workspace proxy, and
Coder Prometheus metrics ports.

When enabled, the develop script hashes the project root with FNV-64a, maps the
hash into one of 50 buckets, multiplies the bucket index by 20, and adds the
result (an offset from 0 to 980) to each unset base default. The same worktree
path always gets the same effective ports. A flag or environment variable
overrides only that port; other unset ports still receive the opt-in offset. The
workspace proxy is only started when `--use-proxy` is set. The embedded
Prometheus UI is only started when `--prometheus-server` or
`CODER_DEV_PROMETHEUS_SERVER` is set, Docker is available, and the host is
Linux. The Prometheus UI port `9090` and Delve port `12345` remain hardcoded.
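
The offset rule above can be sketched in a few lines of Go. This is a minimal illustration, not the code in `scripts/develop/main.go`: the names `portOffset` and `effectivePort`, the use of a plain modulo to pick the bucket, and the example worktree path are all assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	offsetBuckets = 50 // number of distinct offsets
	offsetStep    = 20 // gap between adjacent buckets
)

// portOffset maps a worktree path to a deterministic offset in [0, 1000).
// Assumption: the bucket is chosen with a simple modulo of the FNV-64a hash.
func portOffset(projectRoot string) int {
	h := fnv.New64a()
	_, _ = h.Write([]byte(projectRoot)) // hash.Hash.Write never returns an error
	return int(h.Sum64()%offsetBuckets) * offsetStep
}

// effectivePort returns the explicit port when one was set (non-zero),
// otherwise the base default plus the worktree offset.
func effectivePort(base, explicit, offset int) int {
	if explicit != 0 {
		return explicit
	}
	return base + offset
}

func main() {
	off := portOffset("/home/dev/coder-feature-a") // hypothetical worktree path
	fmt.Println("offset:", off)
	fmt.Println("api:", effectivePort(3000, 0, off))
	fmt.Println("web:", effectivePort(8080, 8180, off)) // explicit override wins
	fmt.Println("proxy:", effectivePort(3010, 0, off))
	fmt.Println("metrics:", effectivePort(2114, 0, off))
}
```

The API and proxy base defaults are 10 apart, so a step of 20 keeps one worktree's proxy port from landing on a neighbouring bucket's API port.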

## Other useful develop flags and environment variables

The develop script also supports these existing flags and environment variables:

| Purpose | Flag | Environment variable |
|---|---|---|
| Per-worktree port offset | `--port-offset` | `CODER_DEV_PORT_OFFSET` |
| Access URL | `--access-url` | `CODER_DEV_ACCESS_URL` |
| Admin password | `--password` | `CODER_DEV_ADMIN_PASSWORD` |
| Starter template | `--starter-template` | `CODER_DEV_STARTER_TEMPLATE` |
| Roll back missing migrations | `--db-rollback` | `CODER_DEV_DB_ROLLBACK` |
| Reset the development database | `--db-reset` | `CODER_DEV_DB_RESET` |
| Accept changed migration tracking | `--db-continue` | `CODER_DEV_DB_CONTINUE` |

Extra `coder server` flags can be passed after `--`. For example,
`./scripts/develop.sh -- --trace` passes `--trace` to the API server.

## Multi-worktree guidance

Each worktree gets its own `.coderv2` directory because `scripts/develop.sh`
sets the global config directory to `<project-root>/.coderv2`. This isolates
built-in Postgres data, local session data, and Prometheus container storage on
disk.

The configurable develop ports use canonical defaults unless you opt in with
`--port-offset` or `CODER_DEV_PORT_OFFSET=true`. Enable the offset when running
multiple worktrees in parallel and you want most concurrent runs to avoid manual
port selection. When the offset is enabled, the startup banner prints the
effective API, web, proxy, and Coder metrics ports with their offset status.

Use overrides when you need fixed ports or when two worktree paths hash to the same offset. For example:

```sh
CODER_DEV_PORT=3100 \
CODER_DEV_WEB_PORT=8180 \
CODER_DEV_PROXY_PORT=3110 \
CODER_DEV_PROMETHEUS_PORT=2214 \
./scripts/develop.sh --use-proxy
```

If you need the embedded Prometheus UI in more than one worktree, run it in only
one worktree at a time. The UI port is fixed at `9090`, and the Docker container
name is fixed to `coder-prometheus`. Delve is fixed at `127.0.0.1:12345` when
`--debug` is used.

## Known collision risks

- Two worktree paths can hash to the same opt-in offset. If preflight reports a
  busy effective port, set the relevant `CODER_DEV_*` environment variables or
  flags for one worktree.
- The embedded Prometheus UI always uses port `9090`.
- The embedded Prometheus Docker container name is always `coder-prometheus`.
- The Delve debugger always listens on `127.0.0.1:12345` when `--debug` is used.
- The develop script only checks the proxy port when `--use-proxy` is set, so a
  stale process on the effective proxy port can go unnoticed until the proxy is
  enabled.
- External databases configured through `CODER_PG_CONNECTION_URL` are shared if
  multiple worktrees point at the same database.

## Readiness without new probes

Do not invent a new readiness probe. The develop script already waits for the
API server to answer `GET /healthz` for up to 60 seconds, then logs
`server is ready to accept connections`. After setup completes, it prints a
banner with `Coder is now running in development mode`, the effective port list,
and the API and Web UI URLs.

For agent-driven runs, treat the banner as the ready signal for browser work.
If the banner does not appear, inspect the preceding `api`, `site`, database
recovery, and port conflict logs.

## Cleanup

Use the least destructive cleanup that fixes the problem:

- Stop `./scripts/develop.sh` with `Ctrl+C` so child processes receive the
  orchestrator shutdown signal.
- If a child process remains, identify it with
  `lsof -iTCP:<port> -sTCP:LISTEN` or `ps`, then terminate only that stale
  process.
- To reset the built-in development database for the current worktree, rerun
  with `./scripts/develop.sh --db-reset` or remove `.coderv2/postgres` after
  stopping the app.
- To clear local Coder session and generated state for the current worktree,
  remove the specific files under `.coderv2` that are relevant to the failure.
- To clean the embedded Prometheus container, stop the develop script first,
  then remove the `coder-prometheus` container if it remains.
- To clean test databases, prefer the owning test harness cleanup. If tests were
  interrupted, inspect the local PostgreSQL instance used by the test suite
  before dropping any database.

For database migration mismatches, prefer the develop script's recovery flags
before deleting state. Use `--db-rollback` when a migration disappeared from the
current branch, `--db-continue` after you manually reconcile changed migration
tracking, and `--db-reset` only when data loss is acceptable.