Two changes to make the `flake-go` workflow produce better signal when
something hangs or flakes at low rates.
**Job timeout (20m → 25m).** The Go-level `-timeout 20m` baked into
`make test` (`Makefile:1428`) currently raced the runner's 20m
hard-kill, so a hanging test got SIGTERM'd by Actions instead of
SIGQUIT'd by Go, and we never got the goroutine dump. Bumping the
workflow job to 25m mirrors the layering already used by `test-go-pg` in
`ci.yaml:409` and gives Go's timeout the 5m head start it expects.
**Test count (25 → 35).** Catches lower-frequency flakes that 25
attempts miss too often. For a 5% per-run flake, detection probability
goes from ~72% at n=25 to ~83% at n=35; for 1–2% flakes the lift is
larger. The longest successful flake-go run to date was 11m49s at n=25,
so n=35 should peak around ~16–17m and stay well inside the new 25m
budget.
The Flake Check workflow runs `make test` through the `test-go-pg`
action, which invokes `gotestsum`, but the workflow never installs it.
The mise refactor (#25727) deleted the `setup-go` action that previously
installed `gotestsum` implicitly, and added explicit `mise install ...
go:gotest.tools/gotestsum` steps to every other Go test job. The flake
check's `Install Go mise tools` step only listed `whichtests`, so the
check fails with `gotestsum: command not found` whenever it selects
changed tests to run.
Add `go:gotest.tools/gotestsum` to the flake check's install step,
matching the other `test-go-pg` jobs in `ci.yaml` and
`nightly-gauntlet.yaml`.
Refs #25727🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `check-docs` job has been failing on every PR touching `docs/**`
since 2026-05-29. `umbrelladocs/action-linkspector` runs linkspector
under puppeteer, which expects an exact Chrome build (e.g.
`148.0.7778.97`) in `/home/runner/.cache/puppeteer`. When that build
isn't present on the hosted runner, linkspector crashes with `Could not
find Chrome` and reviewdog then fails parsing the empty rdjson output
with `proto: syntax error`.
The pinned `v1.4.1` of the action was installing linkspector `0.4.7`,
whose puppeteer requires `148.0.7778.97`; that build is no longer in the
runner cache. Upstream `v1.5.2` upgrades linkspector to `0.5.3` and adds
Chromium fallback logic, but on `ubuntu-22.04` x86_64 none of its new
code paths fire (the AppArmor branch is gated on `lsb_release -rs ==
"24.04"`, the system-Chromium branch on aarch64 or missing 24.04
sysctl), so the bump alone leaves the same Chrome error in place.
This PR:
- Bumps the action to `v1.5.2` (linkspector `0.5.3`).
- Sets `PUPPETEER_EXECUTABLE_PATH=/usr/bin/google-chrome` on the action
step. The hosted `ubuntu-22.04` image ships Google Chrome at that path.
`v1.5.2`'s `script.sh` short-circuits Chromium setup when this env is
set, so puppeteer skips the cache lookup and uses the runner binary
directly.
End-to-end verified by temporarily perturbing `docs/**` on this branch
so the workflow's `pull_request` trigger would fire:
https://github.com/coder/coder/actions/runs/26732938434. `check-docs`
ran linkspector against `docs/**` for ~2m30s and exited 0, with no
`Could not find Chrome` or reviewdog parse errors in the log. That
perturbation has been removed from the branch.
Refs UmbrellaDocs/action-linkspector#62,
UmbrellaDocs/action-linkspector#61
Fixes DOCS-174: the docs-preview workflow only fired on `pull_request:
opened`. Subsequent pushes left the preview comment stale.
## Changes
- Add `synchronize` and `reopened` to trigger types so subsequent pushes
retrigger the workflow.
- Add a workflow-level `concurrency` group keyed by PR number with
`cancel-in-progress: true` so rapid successive pushes don't race the
comment-upsert lookup.
- Replace always-create comment logic with an upsert: find the existing
comment containing `<!-- docs-preview -->` and PATCH it; fall through to
create only when none exists or the PATCH itself fails (comment was
deleted between find and update).
- Filter the upsert lookup to comments authored by `github-actions[bot]`
so a human comment containing the marker is never silently overwritten.
- Decouple the `gh api` lookup from the `head -n 1` pipe so API failures
(network, auth, rate-limit) propagate immediately instead of being
swallowed by `|| true`.
- Delete the stale preview comment when a `synchronize` push drops all
Markdown changes (e.g. a follow-up push that removes the file an earlier
push had previewed but still touches `docs/`). The previous preview
comment would otherwise point at a deleted page.
- Extract the marker and the comment-selector jq into a single
`DOCS_PREVIEW_MARKER` variable and a `list_docs_preview_comments` shell
function so the stale-cleanup and upsert branches share one source of
truth.
## Out of scope
Vercel ISR cache invalidation for feature branch previews requires a
coder.com change (the `algolia-docs-sync` endpoint only accepts `main`
and `release/*` refs). Tracked separately in DOCS-174 out-of-scope
notes.
Pulls that fully revert their `docs/` changes in a follow-up push won't
fire this workflow at all (GitHub's `paths` filter requires a path match
in the diff), so a stale preview comment can survive on that specific
edge. Removing the `paths` filter to handle it would run the workflow on
every PR push, which is disproportionate. Acknowledged in
[CRF-12](https://github.com/coder/coder/pull/25456#discussion_r3313738550).
<details>
<summary>Implementation notes</summary>
**Marker and selector deduplication**: The marker string and jq selector
previously appeared at three sites (comment body, stale-cleanup API
call, upsert API call). They're now consolidated into
`DOCS_PREVIEW_MARKER` plus a `list_docs_preview_comments` shell function
so a future marker change updates one place.
**Comment body construction**: The double-quoted multi-line string form
with escaped backticks (`` \` ``) for the inline-code spans is
shellcheck-clean. An earlier draft used `printf -v comment_body` with a
single-quoted format string containing backticks, which triggered
SC2016; the printf-three-pieces workaround that replaced it has since
been simplified to the direct double-quoted form.
**Upsert logic**: `gh api --paginate` fetches all PR comments, jq
filters to `github-actions[bot]`-authored comments containing the
marker, and the workflow PATCHes the first match. If the PATCH fails
(404 because the comment was deleted between find and update), the
script falls through to `gh pr comment` to create a new one. Self-heals
on the next push if both paths somehow fail.
**Stale-cleanup logic**: Same selector as upsert, but in the early-exit
branch when no Markdown files exist in this push. `DELETE` failures are
logged and execution continues (the next push will re-attempt or post a
fresh comment), so a transient API failure won't fail the CI job.
</details>
> Generated by Coder Agents on behalf of @nickvigilante
The final step of `.github/workflows/cherry-pick.yaml` comments on the
original PR with a link to the cherry-pick PR. If the original PR is
locked, `gh pr comment` fails and the whole job exits with status 1,
even though the backport branch and PR were created successfully.
See
https://github.com/coder/coder/actions/runs/26559681779/job/78239379200
for an example.
Make the comment non-fatal: log a warning and continue.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a `flake-go` workflow that hunts for ordering-dependent and racy Go
tests on pull requests. The workflow runs only on PRs (cancelling
earlier runs on new commits) and skips test execution when no Go test
files changed.
A single `flake_go` job uses
[coder/whichtests](https://github.com/coder/whichtests) with
`--coalesce` to compute the directly-modified `Test*` functions from the
PR diff and emit them as one target row. The same job then runs those
selected tests on a deliberately resource-constrained 4-vCPU runner with
4x parallelism oversubscription, `-count=25`, and `-shuffle=on` to
amplify contention and surface flakes.
Pinned at
[coder/whichtests@ec33bab](https://github.com/coder/whichtests/commit/ec33bab1ec04cd86beb7a61a069db4463dba63f5).
Reuses the `test-go-pg` composite (with its new `run-regex`,
`test-shuffle`, and `gotestsum-json-file` inputs) and the
`go-test-failure-report` composite, both introduced on the base branch
(#25670), so this workflow shares one implementation of the gotestsum +
failure-report path with the existing CI jobs.
`Makefile` adds `TEST_SHUFFLE` support and single-quotes `RUN` so
whichtests' regex survives shell parsing.
Stacked on top of #25670.
Demo @
https://github.com/coder/coder/actions/runs/26494322649/job/78018779381?pr=25667
Closes CODAGT-381
GitHub Actions does not reliably trigger the push-based CI workflow when
a new branch is created at a commit that already has a workflow run from
another branch (e.g. `main`). This meant cutting a release branch
produced no CI run on it, so `should_deploy.sh` never got to approve the
deploy from the release branch.
Adds the `create` event trigger to `ci.yaml` with a condition on the
`changes` job to only proceed for release branch creations. All other
jobs depend on `changes`, so non-release branch creations are a no-op.
> Generated with [Coder Agents](https://coder.com/agents) by @f0ssel
The Go test jobs in `ci.yaml` each had ~30 lines of inline shell that
wrapped `gotestsum` with a PATH shim to capture JSON, then ran
`gotestsummary` and `upload-artifact` to publish a failure report. Three
jobs carried three near-identical copies.
This change replaces the three inline blocks with a single composite
action at `.github/actions/go-test-failure-report/` that runs the same
`gotestsummary` invocation, writes the same markdown to
`GITHUB_STEP_SUMMARY`, and uploads the same NDJSON artifact. The PATH
shim is gone; gotestsum's native `GOTESTSUM_JSONFILE` env variable is
used instead, plumbed through the `test-go-pg` composite.
`test-go-pg` gains three optional inputs:
- `gotestsum-json-file` — explicit JSON file path (or `default` for
`${RUNNER_TEMP}/go-test.json`)
- `run-regex` — passed to `go test -run`
- `test-shuffle` — passed to `go test -shuffle`
All three have safe defaults so existing callers are unaffected.
No observable change in CI behavior: the three existing test-go-pg jobs
continue to emit the same JSON, render the same failure summary, and
upload the same artifact.
Stacked under #25667, which uses the new composite and inputs to power a
new flake-detector workflow.
Move docs linting into the required CI umbrella and reuse the existing
`changes` job so docs lint runs when docs or CI files change, plus on
`main` as a backstop.
This is motivated by the docs lint failures on #25601. That PR touched
`.claude/docs/TESTING.md`; the standalone `Docs CI` workflow picked it
up because `docs-ci.yaml` used broad `**.md` matching, but local `pnpm
lint-docs` and `make lint` did not catch the same file because they only
scanned `docs/**` plus root `*.md`. The first failed Docs CI run
reported markdownlint errors in `.claude/docs/TESTING.md` (`MD040` and
`MD031`), and the next run reported a markdown table formatter failure
in the same file.
That mismatch is why this PR exists: prevent unrelated PRs from being
surprised by stale `.claude/docs/**` lint drift only after they happen
to touch one of those files. The local docs scripts now include
`.claude/docs/**`, and the old standalone `Docs CI` workflow is removed
so we do not maintain separate path-filter logic outside the required CI
workflow.
> Generated by mux, but reviewed by a human
Splits the dogfood image into two artifacts:
- `ghcr.io/coder/oss-dogfood-base:<distro>-<base-sha>`: Ubuntu base with
apt packages, chrome, rustup, brew, gh, and the mise binary. The
base-sha is a cache key over `Dockerfile.base` and `files/`, so commits
that don't touch those inputs reuse the previous build.
- `codercom/oss-dogfood:<final-sha>-<distro>` and rolling tags
(`:22.04`, `:26.04`, `:latest`, `:<branch>`): produced by `mise oci
build` on top of the base, with one content-addressed OCI layer per mise
tool. The rolling tag scheme is unchanged, so the workspace template
doesn't need updating.
Single-tool version bumps now invalidate only that tool's OCI layer, so
workspaces re-pull just what changed instead of the entire 5-6 GB image
on every recreate.
Also:
- Drops the build-time `pnpm dlx playwright@1.47.0 install --with-deps
chromium` step (~400 MB) and the equivalent `playwright-driver.browsers`
install from `flake.nix`. `@playwright/mcp` (used by the claude-code and
codex MCP servers in `dogfood/coder/main.tf`) does NOT auto-install
browsers, so the existing `install-deps` `coder_script` now runs two
installs on workspace start: `pnpm exec playwright install chromium` for
the site's pinned `@playwright/test`, and `npx
--package=@playwright/mcp@latest playwright-core install --no-shell
chromium` so the MCP servers find their matching browser revision.
Browser revisions coexist under
`~/.cache/ms-playwright/chromium-<rev>/`, which lives on the home volume
so both downloads happen once per workspace recreate and persist across
restarts. Net effect: same MCP behavior as before, +~1-2 min on first
workspace start. Nix devshell users running site e2e tests locally now
need `pnpm exec playwright install` once (instead of getting browsers
via nixpkgs).
- Bumps the pinned mise binary to v2026.5.12 (matching main after
#25521) and adds top-level `min_version = "2026.5.12"` to `mise.toml` so
every consumer (devs, CI, the embedded mise inside the dogfood image,
mise oci builds) fails fast on an older mise.
- Adds bison, flex, libicu-dev, libreadline-dev, uuid-dev, and
zlib1g-dev to both Ubuntu base images for source-build use cases (e.g.,
building Postgres from source).
- Replaces skopeo with crane as the registry client `mise oci push`
shells out to: crane is added to `mise.toml`, the workflow drops its
`apt-get install skopeo` and forces `--tool crane`, and the local
wrapper image stops bundling skopeo. One source of truth for tool
versions, no apt drift, smaller wrapper image, and workspace users get a
registry client on PATH for free via mise oci's tool layers.
- Removes `nix.hash`/`mise.hash` and their Makefile rules. The registry
digest already captures every effective change since CI rebuilds when
any baked-in input moves; the per-file `filesha1()` entries in
`pull_triggers` are redundant.
Supersedes #25400 (the `mise.hash` pull trigger landed there in
`2b612abe7b`; this PR removes it as part of the broader simplification).
> [!NOTE]
> `mise oci build` is experimental and requires `MISE_EXPERIMENTAL=1`
(set at job level in the workflow). The local-only
`scripts/dogfood/mise-oci-wrapper.sh` builds a tiny
`coderdev/mise-oci-wrapper:<version>` Debian image with curl-installed
mise on first invocation (cached by version tag thereafter); we don't
reuse `jdxcode/mise:latest` because that tag lags upstream GitHub
releases by days and would defeat the `min_version` enforcement above.
> [!NOTE]
> `compute-base-sha.sh` and `compute-final-sha.sh` are cache keys, not
strict content addresses: the base Dockerfile still pulls dynamic
resources at build time (gh/buildx `releases/latest`, chrome
`stable_current_amd64.deb`, apt mirror state). Two runs with identical
checked-in files can produce slightly different bytes, which is
acceptable here because the cache-hit savings on irrelevant commits
outweigh that drift.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Signed-off-by: Thomas Kosiewski <tk@coder.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a `test_image` job that runs `make gen`, `make fmt`, `make lint`, and `make build` inside the
newly built image via `docker run`. This helps detect breaking changes before merge.
> [!NOTE]
> Generated with [Coder Agents](https://coder.com/agents)
## Problem
`Docs CI` fails on PRs that only touch binary assets under `docs/`.
Example: [#25314](https://github.com/coder/coder/pull/25314), which
swaps a single PNG and produces thousands of `MD010/no-hard-tabs`,
`MD049/emphasis-style`, and `MD018/no-missing-space-atx` errors at
columns like 16,285 of the image.
## Root cause
The single `tj-actions/changed-files` step was doing two jobs at once:
detecting which Markdown files changed (for `lint` and `fmt`), and
gating whether the workflow had anything to do at all. Its `files`
filter matched `docs/**` in addition to `**.md`, so any non-Markdown
file under `docs/` (PNG, GIF, JPG, MP4, SVG) ended up in
`all_changed_files` and was passed straight to `markdownlint-cli2`,
which opened the file and parsed the binary bytes as Markdown.
`markdownlint-cli2`'s own `ignores` setting is a discovery-time filter
and does not gate files passed explicitly on the command line, so the
filtering has to happen in the caller.
## Fix
Adopt a per-tool convention: each downstream tool gets its own
`changed-files` step scoped to the files that tool can process. For now
that is a single `changed-md` step matching `**.md`, consumed by `lint`
and `fmt`. A future tool (e.g. an image linter, video size check, or
link checker) can be added purely additively, by appending another
`changed-*` step and a step that consumes its output, without changing
the existing filters.
The workflow-level `on.push.paths` / `on.pull_request.paths` triggers
stay broad (`docs/**`, `**.md`) so the workflow still runs on
screenshot-only PRs; the per-tool filters decide which individual steps
execute. On a screenshot-only PR the existing `if:
steps.changed-md.outputs.any_changed == 'true'` guard skips `lint` and
`fmt` cleanly.
## Verification
- `actionlint .github/workflows/docs-ci.yaml` passes.
- Reproduced the original failure locally: `pnpm exec markdownlint-cli2
docs/images/install/install_from_deployment.png` produces the same flood
of violations seen in the failing CI run on #25314.
- First revision of this PR (workflow with `**.md`-only filter, single
`changed-files` step) was green on `Docs CI`; the current revision is
structurally equivalent for the existing tools and just renames the step
id and adds the per-tool comment.
<details>
<summary>Decision log</summary>
- Considered adding `ignores` to `.markdownlint-cli2.jsonc` to skip
non-Markdown files. Rejected: `markdownlint-cli2` treats `ignores` as a
discovery-time glob filter and still lints files passed explicitly on
the command line, so it would not have fixed the failure.
- Considered narrowing the existing single `changed-files` step's
`files` filter to `**.md` only. Rejected as the final shape: it solves
the immediate bug but conflates "which Markdown files changed" with
"should the workflow run at all", so adding a second tool with a
different file set later (e.g. an image linter) would require contorting
or duplicating that step.
- Chose the per-tool-filter shape so adding a future tool is additive:
one new `changed-*` step plus one new step that consumes its output,
with no edits to existing steps.
</details>
## Disclosure
Opened on behalf of @nickvigilante by Coder Agents.
Three workflows besides `deploy-docs.yaml`
([DOCS-124](https://linear.app/codercom/issue/DOCS-124),
[#25285](https://github.com/coder/coder/pull/25285)) self-reference in
their `paths:` triggers: `docker-base.yaml`, `docs-ci.yaml`,
`dogfood.yaml`. This was flagged during review of #25285
([DEREM-1](https://github.com/coder/coder/pull/25285#discussion_r3234975475))
as a bug class worth treating uniformly. This PR is the audit.
Each self-reference is either justified inline or removed:
* **`docker-base.yaml`** keeps the self-reference. It's PR-only and
gated by `push: ${{ github.event_name != 'pull_request' }}` on the
`depot/build-push-action`, so PRs build the base image without
publishing.
* **`docs-ci.yaml`** drops the self-reference. The `lint` and `fmt`
steps gate on `tj-actions/changed-files` matching `docs/**` or `**.md`,
so a workflow-only run no-ops. `actionlint` and `make lint/actions`
catch YAML problems before merge regardless.
* **`dogfood.yaml`** keeps the self-reference. PR runs build images
without pushing and run `terraform init` + `validate` only; pushes to
main retag rolling tags on `codercom/oss-dogfood`,
`oss-dogfood-vscode-coder`, and `oss-dogfood-nix`, plus `terraform
apply` against dev.coder.com which produces new `coderd_template`
versions with unchanged content. Idempotent and bounded.
Refs DOCS-121, DOCS-129.
<details>
<summary>Decision table</summary>
| Workflow | Self-ref location | Effect on workflow-only edit | Decision
|
|---|---|---|---|
| `deploy-docs.yaml` | push + workflow_dispatch | Destructive (DOCS-121)
| Removed in [#25285](https://github.com/coder/coder/pull/25285) |
| `docker-base.yaml` | PR-only | Build base image, never push | Keep
with inline comment |
| `docs-ci.yaml` | push + PR | Empty run; lint/fmt skipped by `if:` |
Remove (wasted runner minutes) |
| `dogfood.yaml` | push + PR | PR: build without push, terraform
validate. Main: retag rolling tags, terraform apply, new cosmetic
template versions | Keep with inline comment |
</details>
---
_Coder Agents on behalf of @nickvigilante._
Edits to `.github/workflows/deploy-docs.yaml` previously self-triggered
the workflow on push to `main` and `release/*` because the file was
listed in its own `paths:`. On 2026-05-12, this caused merge of #25049
to fire a production reindex with no `docs/**` changes, which entered
the empty-`paths_json` whole-branch path in the Algolia handler and
wiped the `docs` index (see DOCS-121).
This change removes `.github/workflows/deploy-docs.yaml` from `paths:`
so the workflow only runs against real docs content. Reindexes from a
workflow edit alone now require `workflow_dispatch`, which already
accepts a `ref` input and an `action` choice of `index` or `delete`. The
other safety net (a workflow-level `paths_json=[]` guard in
`algolia-and-isr`) is tracked separately in DOCS-122.
Refs DOCS-121, DOCS-122, DOCS-124.
---
_Coder Agents on behalf of @nickvigilante._
This PR replaces the hand-rolled `curl | tar | go install | cargo
install` chains in the dogfood Ubuntu 22.04 and 26.04 Dockerfiles with a
single `mise install` driven by a new repo-root `mise.toml`.
The previous Dockerfiles installed ~25 CLIs across three multi-stage
builds with versions hardcoded inline. Version bumps were scattered
across the Dockerfiles, the root `mise.toml` (added in #24618 but
otherwise unused at runtime), and CI's setup actions; build-time network
failures came from a dozen distinct endpoints; and `mise` itself sat in
the image with no manifest to install from.
The new flow:
- The repo's `mise.toml` is the single source of truth for image tool
versions. The Dockerfiles `COPY` it to `/etc/mise/config.toml` and run a
single `mise install` as the `coder` user.
- Tools are installed into `/opt/mise/data` rather than the default
`/home/coder/.local/share/mise`, so they live in the image (not on the
persistent home volume) and reach every workspace on recreate.
- Build context moves to the repo root so the Dockerfile can `COPY
mise.toml`; an allowlist `.dockerignore` keeps the transferred context
to ~24 kB.
- Optional `--secret id=github_token` plumbing through the Makefile and
`.github/workflows/dogfood.yaml` lifts aqua's GitHub API quota from
60/hr unauthenticated to 1000/hr with `secrets.GITHUB_TOKEN`.
- `MISE_TRUSTED_CONFIG_PATHS=/home/coder:/etc/mise` is set as an ENV so
users who clone the coder repo into their workspace home aren't prompted
to `mise trust`.
Net diff for the two Ubuntu Dockerfiles: -399 / +244 lines (~200 lines
shorter each). The `FROM rust-utils`, `FROM go`, and `FROM proto`
multi-stage builds are gone; so are the NVM/Node block, the bulk
binary-install block (golangci-lint, helm, kubectx, syft, cosign, bun),
the gh `.deb`/lazygit/doctl tarball installs, the gofmt
`update-alternatives` line, and the `yq`→`yq4` rename
(`scripts/lib.sh:267-275` already auto-detects either name).
Both images were built and smoke-tested with Apple's `container` CLI on
macOS — every migrated tool resolves to the expected pinned version
including outside the cloned coder repo (e.g. `gh` from `/home/coder`,
matching the workspace startup script in `dogfood/coder/main.tf`),
`sqlc` runs (proving `CGO_ENABLED=1` was honoured at install), `yq
--version` reports v4 for `scripts/lib.sh`'s detection, and `gofmt`
resolves via the mise shim.
Follow-ups (out of scope here):
- Commit a multi-platform `mise.lock` so `gh = "latest"` and the other
floating versions resolve deterministically across rebuilds and dev
machines.
- Migrate CI's `setup-go` / `setup-node` actions to consume `mise.toml`
so image and CI versions stop being able to drift.
---------
Signed-off-by: Thomas Kosiewski <tk@coder.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Partially reverts #25136 for non-darwin platforms.
In general we want to avoid pinning trust roots to embedded Certs, since that limits operational flexibility. If Azure changes CAs, operators should, at most, be able to update the OS trust store to keep Coder working correctly. Embedding roots means we need to upgrade the Coder binary.
Since Coder Server on macOS is not really supported for production use, embedding only in that case to ease development and testing is OK.
Folds the Algolia/ISR sync trigger and surgical-reindex path computation
into the existing `deploy-docs.yaml` workflow so a single `docs/**` push
fires every update path the docs site needs.
One preflight job feeds two parallel sibling jobs:
- **`changes`** (preflight): diffs `github.event.before` against
`github.sha` to compute `manifest_changed` and `paths_json` (a JSON
array of `{path, status}` objects derived from `git diff --name-status
-z`, capped at 50 entries). The mapping is `A → added`, `M/T →
modified`, `D → deleted`, `R<n> → renamed` (indexed by the new path).
Falls back to whole-branch (emits `paths_json: "[]"`) on
`workflow_dispatch`, the first push to a new branch, fetch failure,
manifest changes (route restructuring would orphan records), or >50
markdown files.
- **`algolia-and-isr`** (always, parallel with `vercel-rebuild`):
HMAC-signed POST to `coder.com/api/algolia-docs-sync` with the
`paths_json` array as part of the body. Refreshes the Algolia `docs`
slice for the `(corpus, ref)` pair and ISR-revalidates every navigable
route the handler touched. Markdown-only edits surface in seconds with
no full rebuild. The step summary line `Mode: \`surgical\` (N path(s))`
lets operators verify which path ran without scrolling through the curl
output.
- **`vercel-rebuild`** (parallel with `algolia-and-isr`, only when
`docs/manifest.json` changed): fires the existing Vercel deploy hook for
a full build. Manifest changes can register or remove routes that
Next.js's `getStaticPaths` only re-evaluates on a full build, so
ISR-per-existing-path is not enough.
Trigger expanded from "main + manifest.json" to "main and `release/*` +
any `docs/**`" so release-branch docs edits also flow through the same
pipeline. The Vercel rebuild path stays gated on manifest changes
regardless of branch.
The pure shell + curl + openssl + jq + awk pipeline is preserved
verbatim. Zero Algolia or Node dependencies in CI.
## Why one workflow instead of two
The original split (a standalone Algolia workflow + the existing
`deploy-docs.yaml`) would have run twice per manifest push, with two
parallel concurrency groups, two GitHub Actions step summaries, and two
ways to forget to add a secret. Folding into one file makes the trigger
story symmetrical: "docs change → all docs surfaces refresh," with the
rebuild path being a strict superset of the ISR path, and the surgical
path strictly cheaper than whole-branch when computable.
## Pre-merge testing
The companion handler PR (coder/coder.com#741) supports an
`ALGOLIA_DOCS_INDEX` env-var override, scoped to `docs_smoke` on the
Vercel preview deploy, so this workflow can be exercised end-to-end
against a disposable index without touching production records. The
smoke harness at `~/audit/smoke/run.sh` (workspace-only) signs and posts
the same body shape this workflow does, so it covers the same crypto
path. To exercise the workflow itself, push a docs-only commit to a
throwaway branch and watch the step summary; the `algolia-and-isr` job
will print the resolved mode.
## Prerequisites before this can do anything useful
1. `secrets.ALGOLIA_DOCS_SYNC_SECRET` must be added as an Actions secret
on this repo. The same value goes on `coder.com`'s Vercel env. The
workflow logs a clear error and aborts with no network call if the
secret is missing.
2. The handler at coder/coder.com#741 must be merged and deployed.
Without it, the POST will 404.
3. `secrets.DEPLOY_DOCS_VERCEL_WEBHOOK` is already in place from the
existing `deploy-docs.yaml`; this PR does not change its usage.
## Demo, validation, and design
- Front-end-only fixes (modal layout, scroll-shadow, rank-order
preservation): coder/coder.com#749 ships these against production today,
independent of this PR.
- Companion handler PR on `coder.com`: coder/coder.com#741. Includes the
surgical-mode plumbing this workflow's `paths_json` output drives.
- Full design lives in the workspace at
`~/plans/algolia-search-revamp.md`. Key sections:
- §6.0–6.2: why the indexer lives in `coder.com`, not here.
- §6.7: per-version add/remove mechanics.
- §6.8: ISR revalidate rationale and same-time refresh.
- §6.9: surgical per-page reindex (workflow + handler + planning rules).
---
This PR was generated by Coder Agents.
This follows up on
https://github.com/coder/coder/actions/runs/25684936801/job/75406131184?pr=25139
by replacing the large raw Go test JSON artifact with inline structured
summaries and a compact failures-only artifact.
## What changed
- Added `scripts/gotestsummary`, a streaming Go tool that reads
gotestsum JSON and renders failed tests as Markdown.
- Updated the three Go test jobs to publish per-test `<details>`
sections in the job summary.
- Removed upload of the raw `go-test.json` artifact.
- Added upload of `go-test-failures-*.ndjson` with compact failure
records for deeper inspection.
- Deleted the old bash and `jq` summary script.
## Why
- The previous raw artifact was about 35 MB compressed and 445 MB raw in
the linked run.
- Passing-test output made the artifact noisy and slow to inspect.
- The old summary truncated output to 600 characters.
- The new path keeps streaming, bounded output and writes structured
diagnostics for only final failed tests.
## Validation
- `gofmt -w scripts/gotestsummary`
- `gofmt -l scripts/gotestsummary`
- `go test ./scripts/gotestsummary/...`
- `go vet ./scripts/gotestsummary/...`
- `grep -rn 'go-test-failure-summary.sh' . || true`
- `grep -rn 'go-test-failure-summary.sh\|go-test.json\|go-test-json-'
.claude .agents docs AGENTS.md || true`
- `make lint/agents`
- `make lint/emdash`
- `make lint/markdown`
- `make lint/shellcheck`
- `git diff --check origin/main..HEAD`
> This PR was prepared by Mux working on Mike's behalf.
This PR adds an opinionated harness-engineering layer for agent-driven
workflows: a small set of agent-readable docs, mechanical structure
checks, structured CI failure summaries, an architecture-lint umbrella,
and per-worktree dev-server isolation. The goal is to make local dev,
tests, and CI mechanically inspectable by agents without changing app
runtime behavior.
## What landed
**Agent docs and navigation**
- `.claude/docs/OBSERVABILITY.md`, `.claude/docs/DEV_ISOLATION.md`,
`.claude/docs/AGENT_FAILURES.md`: task-oriented guides for logs,
tracing, Prometheus, dev-server isolation, and a seeded failure catalog.
- `AGENTS.md`: added an `Agent navigation` block, then trimmed the file
from 375 to 229 lines by migrating duplicated detail into
`WORKFLOWS.md`, `GO.md`, `TESTING.md`, and `DATABASE.md`. The
user-managed custom-instructions block is preserved.
- `.agents/docs`: symlink mirror of `.claude/docs` for agent runtimes
that look under `.agents`.
**Mechanical checks**
- `scripts/check_agents_structure.sh`: validates `@...` references in
tracked `AGENTS.md` files and warns when root grows past 600 lines.
Wired as `make lint/agents` and into `make lint`.
- `scripts/audit-agent-readiness.sh`: report-first audit of harness
readiness. Currently `10 ok, 0 warn, 0 fail`.
- `scripts/check_architecture.sh` / `make lint/architecture`: umbrella
architecture-lint target. Consolidates the existing
`check_enterprise_imports.sh` and `check_codersdk_imports.sh` so they
run exactly once via the umbrella. Slot is open for new high-confidence
rules.
**Structured CI failure summaries**
- `scripts/playwright-failure-summary.sh`: parses
`site/test-results/results.json` and writes Markdown to
`$GITHUB_STEP_SUMMARY` on failure. Wired into the `test-e2e` matrix job.
- `scripts/go-test-failure-summary.sh`: parses `go test -json`
line-delimited output the same way. Wired into `test-go-pg`,
`test-go-pg-17`, and `test-go-race-pg` by injecting `gotestsum
--jsonfile` in the workflow without touching `Makefile`. JSON also
uploaded as a CI artifact on failure.
- `site/e2e/playwright.config.ts`: enables `screenshot:
only-on-failure`, `trace: retain-on-failure`, JSON reporter, and HTML
reporter alongside existing reporters.
- `.github/workflows/ci.yaml`: failure artifact uploads for Playwright
now use `if: failure()` and predictable names
(`playwright-artifacts-<variant>-<sha>`).
**Per-worktree dev-server isolation** (`scripts/develop/main.go`)
- Deterministic FNV-64a hash of the worktree path produces a port offset
in `[0, 1000)` (50 buckets, step 20 to avoid API/proxy overlap across
adjacent buckets).
- Offset is applied only to defaults; both env vars (`CODER_DEV_PORT`,
`CODER_DEV_WEB_PORT`, `CODER_DEV_PROXY_PORT`,
`CODER_DEV_PROMETHEUS_PORT`) and CLI flags retain priority.
- Hardcoded ports `9090` (embedded Prometheus UI) and `12345` (Delve)
are unchanged by design.
- Startup banner shows each port's source: `default`, `offset`, or
`explicit`.
- Unit tests in `scripts/develop/main_test.go` cover determinism,
bounds, no-overlap across the four ports, and explicit-skip behavior.
- State (`.coderv2/`) was already worktree-isolated via `os.Getwd()`, so
no state-dir changes were needed.
## Validation
`make lint/agents`, `make lint/architecture`, `make lint/emdash`, `bash
scripts/audit-agent-readiness.sh` (10 ok, 0 warn, 0 fail), `shellcheck`
on all 5 new scripts, `go test ./scripts/develop/...`, and `js-yaml`
parse of `ci.yaml` all pass. Synthetic fixtures verify both
failure-summary scripts handle empty/missing input (silent exit 0),
ANSI-stripped output, and parent/subtest formatting.
## Known follow-ups (deferred)
- Frontend Storybook/Vitest failure summary: lowest-leverage slice of
the failure-summary work. Skipping until observed pain.
- Architecture lint currently only delegates to existing import checks;
new rules (`InTx` outer-store detection, swagger-annotation lint) plug
in as needed.
- 50 port-offset buckets means two worktree paths can occasionally
collide. The DEV_ISOLATION doc tells users to set the relevant env var
when this happens.
> Mux opened this PR on Mike's behalf.
Bumps the dogfood template to the refactored Claude Code and Codex
modules and removes the Coder Tasks integration.
Claude and Codex now use slim-window app buttons that launch each tool
in its own tmux session. This replaces the task-specific `develop.sh`
and `preview` apps that were only created for Coder Tasks workspaces.
The PR also wires the OpenAI dogfood secret through the deployment
template so Codex can fall back to template configured BYOK when AI
Gateway is disabled.
Tested with this template version:
[https://dev.coder.com/templates/coder/coder/versions/outstanding_hermann97](<https://dev.coder.com/templates/coder/coder/versions/outstanding_hermann97>)
Dependabot security update PRs should be backported with the workflow
added in #24025, but today they still rely on someone noticing and
adding the backport label manually.
This updates the dependabot workflow to add the existing backport label
automatically when a newly opened Dependabot PR looks like a security
fix, and it adjusts the Slack notification text so those PRs are called
out explicitly.
This PR merges code from `coder/aibridge` repository into `coder/coder`.
It was split into 4 PRs for easier review but stacked PRs will need to
be merged into this PR so all checks pass.
* https://github.com/coder/coder/pull/24190 -> raw code copy (this PR,
before merging PRs on top of it, it was just 1 commit:
https://github.com/coder/coder/commit/70d33f33200c7e77df910957595715f81f9bec24)
* https://github.com/coder/coder/pull/24570 -> update imports in
`coder/coder` to use copied code
* https://github.com/coder/coder/pull/24586 -> linter fixes and CI
integration (also added README.md)
* https://github.com/coder/coder/pull/24571 -> added exclude to
scripts/check_emdash.sh check
Original PR message (before PR squash):
Moves coder/aibridge code into coder/coder repository.
Omitted files:
- `go.mod`, `go.sum`, `.gitignore`, `.github/workflows/ci.yml,`
`Makefile`, `LICENSE`, `README.md` (modified README.md is added later)
- `.github`, `example`, `buildinfo,` `scripts` directories
Simple verification script (will list omitted files)
```
tmp=$(mktemp -d)
echo "$tmp"
git clone --depth=1 https://github.com/coder/aibridge "$tmp/aibridge"
git clone --depth=1 --branch pb/aibridge-code-move https://github.com/coder/coder "$tmp/coder"
diff -rq --exclude=.git "$tmp/aibridge" "$tmp/coder/aibridge"
# rm -rf "$tmp"
```
*Disclaimer: implemented by a Coder Agent using Claude Opus 4.*
---
Move `github.repository` from direct `${{ }}` interpolation in the
`run:`
block to an `env:` var, consistent with how `BRANCH` and `PR_NUMBER` are
already handled. This eliminates a `zizmor` template-injection finding.
Follows up on #24283.
*Disclaimer: implemented by a Coder Agent using Claude Opus 4.6*
---
Adds a lightweight workflow that posts a docs preview link as a PR
comment
whenever a pull request touches files under `docs/`. The preview is
served
by coder.com's branch-preview feature at `/docs/@<branch>`.
The branch name is URL-encoded so names with slashes (e.g.
`user/feature`) produce correct links like
`/docs/@user%2Ffeature` instead of broken paths.
The comment is created on open and updated in-place on subsequent pushes
using the `peter-evans/find-comment` + `create-or-update-comment`
pattern
already used by the `pr-deploy` workflow.
---
Depends on https://github.com/coder/coder.com/pull/708
The `community-label` job in the contrib workflow fails with:
```
Error: Cannot find module '@actions/github'
```
`require("@actions/github")` does not work inside the bundled
`actions/github-script` v8 dist — the package is compiled into
`dist/index.js` and is not resolvable by Node's module loader at
runtime.
`actions/github-script` v9 exposes `getOctokit` as an injected script
parameter, so the `require` call is no longer needed.
Failing run:
https://github.com/coder/coder/actions/runs/24566113706/job/71826621724
> 🤖 Generated by Coder Agents
Add a new template to dev.coder.com for developing the coder/vscode-coder
VS Code extension.
The Docker image is based on node:24-slim (pinned by digest) with git, gh
CLI, dbus, and sudo. Electron system libraries are installed at workspace
startup via playwright install-deps so they stay in sync with the project's
Electron version without Dockerfile changes.
The template includes IDE selection (VS Code Desktop, code-server, Cursor,
etc.), filebrowser, dotfiles, and Claude Code for AI tasks.
OSV-Scanner currently causes the scheduled security workflow to fail
whenever it reports findings, which also triggers the Slack failure
notification.
Treat exit code 1 as a successful scan with uploaded SARIF results, and
only fail the job when OSV-Scanner itself errors.
Restore the container vulnerability scan in the security workflow by
replacing the removed Trivy job with OSV-Scanner.
This keeps the existing image build, SARIF upload, artifact upload, and
Slack failure notification flow, while pinning OSV-Scanner to the latest
release and using the current `--output-file` flag.
## What
The `lint-actions` CI job only ran when `.github/workflows/ci.yaml` or
`.github/actions/**` changed. New workflow files like `backport.yaml`
and `cherry-pick.yaml` were never linted by zizmor, allowing several
findings to land undetected.
## Changes
**`.github/workflows/ci.yaml`** — Broaden the `ci` path filter from
`".github/workflows/ci.yaml"` to `".github/workflows/**"` so
`lint-actions` runs when any workflow file changes.
**`.github/workflows/backport.yaml`**:
- Move permissions from workflow-level to job-level (`detect` →
`contents: read`, `backport` → `contents: write` + `pull-requests:
write`) — fixes `excessive-permissions`
- Replace `${{ matrix.branch }}` in `run:` block with `$BRANCH` env var
— fixes `template-injection`
- Add `persist-credentials: false` to both checkouts — fixes
`artipacked`
**`.github/workflows/cherry-pick.yaml`** — Add `persist-credentials:
false` to checkout — fixes `artipacked`
**`.github/zizmor.yml`** — Ignore `dangerous-triggers` for
`backport.yaml` and `cherry-pick.yaml`. Both use `pull_request_target`
intentionally — they only run post-merge (`merged == true`) and don't
check out or execute untrusted PR code.
## What
Fixes the `TestSVGIconAttributes/texlive.svg` CI failure introduced by
#24312.
Two changes:
1. **Fix `texlive.svg` viewBox**: Changed from `0 0 1024 1024` to `0 0
256 256` (wrapping content in `<g transform="scale(0.25)">` to preserve
rendering). Also cleaned up non-standard attributes (`version`, `style`,
`preserveAspectRatio`) to match other icons.
2. **Add icon/theme paths to CI go filter**: Added `site/static/icon/**`
and `site/src/theme/**` to the `go` path filter in `ci.yaml` so Go tests
(`test-go-pg`, `test-go-pg-17`, `test-go-race-pg`) run when icons or
theme config change. This is why the failure wasn't caught on the PR —
only `site/` files were modified, so Go tests were skipped entirely.
Closes https://github.com/coder/internal/issues/1468
Supersedes #23343.
## Problem
`author_association` on `pull_request_target` events is unreliable:
- Returns `CONTRIBUTOR` instead of `MEMBER` when both apply
([actions/github-script#643](https://github.com/actions/github-script/issues/643)).
- Returns `NONE` for members with private org visibility
([community#18690](https://github.com/orgs/community/discussions/18690)).
This causes org members to incorrectly receive the `community` label.
## Approach
Replace the `author_association` check with an explicit
`orgs.checkMembershipForUser()` API call, which reliably detects both
public and private org members.
Uses a dedicated **GitHub App** via `actions/create-github-app-token`
instead of a PAT. The App only needs **Organization > Members: Read**
permission. Installation tokens are short-lived (1 hour) and
auto-rotated — no long-lived secrets to worry about.
### Setup required
A repo/org admin needs to:
1. Create a GitHub App with only **Organization > Members: Read**
permission.
2. Install it on the `coder` org.
3. Store the App ID as a repository variable: `ORG_MEMBERSHIP_APP_ID`.
4. Store the App's private key as a repository secret:
`ORG_MEMBERSHIP_APP_PRIVATE_KEY`.
> [!NOTE]
> Generated by Coder Agents
---------
Co-authored-by: Jakub Domeracki <jakub@coder.com>
After the cherry-pick workflow creates a backport PR, it now comments on
the original PR to notify the author with a link to the new PR.
If the cherry-pick had conflicts, the comment includes a warning.
## Changes
- Capture the URL output of `gh pr create` into `NEW_PR_URL`
- Add `gh pr comment` on the original PR with the link
- Append a conflict warning to the comment when applicable
> Generated by Coder Agents
Adds a GitHub Actions workflow that runs on PRs targeting `release/*`
branches to flag non-bug-fix cherry-picks.
## What it does
- Triggers on `pull_request_target` (opened, reopened, edited) for
`release/*` branches
- Checks if the PR title starts with `fix:` or `fix(scope):`
(conventional commit format)
- If not a bug fix, comments on the PR informing the author and emits a
warning (via `core.warning`), but does **not** fail the check
- Deduplicates comments on title edits by updating an existing comment
(identified by a hidden HTML marker) instead of creating a new one
> [!NOTE]
> Generated by Coder Agents