Commit Graph

1906 Commits

Author SHA1 Message Date
Spike Curtis b49344519b test: batch 06 of refactoring CLI tests not to use PTY (#25990)
Part of [coder/internal#1400](https://github.com/coder/internal/issues/1400)

Batch of refactored CLI tests to avoid creating PTYs.
2026-06-02 15:44:36 -04:00
Spike Curtis cbaea49c02 test: batch 04 of refactoring CLI tests not to use PTY (#25937)
Part of [coder/internal#1400](https://github.com/coder/internal/issues/1400)

Batch of refactored CLI tests to avoid creating PTYs.
2026-06-02 10:50:27 -04:00
Zach 170c33a475 feat: encrypt gitsshkeys.private_key at rest via dbcrypt (#25872)
Adds an optional dbcrypt wrapper around gitsshkeys.private_key. The
column is encrypted on insert and update through enterprise/dbcrypt when
external token encryption is configured, and decrypted on read.

A new private_key_key_id column references
dbcrypt_keys(active_key_digest) so revocation safety is enforced by the
existing foreign key. Rows with a NULL key_id stay plaintext and remain
readable. Existing plaintext rows can be backfilled by running `coder
server dbcrypt rotate`.

Generated with assistance from Coder Agents.
2026-06-02 08:36:01 -06:00
Spike Curtis 93b067f5f2 test: batch 03 of refactoring CLI tests not to use PTY (#25935)
Part of [coder/internal#1400](https://github.com/coder/internal/issues/1400)

Batch of refactored CLI tests to avoid creating PTYs.
2026-06-02 08:05:26 -04:00
Spike Curtis bfa6ce32a6 test: batch 02 of refactoring CLI tests not to use PTY (#25931)
Part of [coder/internal#1400](https://github.com/coder/internal/issues/1400)

Batch of refactored CLI tests to avoid creating PTYs.
2026-06-02 07:53:24 -04:00
Thomas Kosiewski f6a4ed309f ci: fix Windows runner PATH casing for mise, not in cli (#25972)
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 10:46:40 +00:00
Mathias Fredriksson ed4311b2cb ci: add Git usr/bin to PATH on Windows (#25939)
## Summary

Fixes all 9 Windows CI test failures caused by the mise CI refactor
(`fe257666d7`, PR #25727).

### Root cause

`jdx/mise-action` exports `Path` (Windows convention) via `GITHUB_ENV`.
Bash on Windows maintains its own `PATH`. When Go's `os.Environ()`
returns both, `cmd.exe` subprocesses non-deterministically pick the
MSYS-translated `PATH` (forward slashes), causing Windows executables
(`printf`, `powershell.exe`, `cmd.exe`) to be unresolvable.

These failures only appeared on `main` (where `-count=1` forces real
test execution) and were masked on PRs by Go test cache.

### Fixes applied

**CI (`setup-mise` action)**:
- Write both `Path` and `PATH` to `GITHUB_ENV` with Git usr/bin
prepended

**Code (`cli/root.go`)**:
- Add `appendAndDedupEnv` helper that deduplicates case-insensitive env
vars on Windows, preferring native Windows paths (backslashes) over MSYS
paths

**Code (`cli/configssh_windows.go`)**:
- Use absolute paths for `powershell.exe` and `cmd.exe` in the SSH
config `Match exec` escape function, avoiding PATH resolution entirely

**Tests**:
- Switch `--header-command` tests from `printf` to `echo` (cmd.exe
builtin) for reliable cross-platform execution
- Add env dedup in `Test_sshConfigMatchExecEscape` for subprocess PATH
consistency

Fixes coder/internal#1556, coder/internal#1558, coder/internal#1559

> 🤖 Generated by Coder agent, will be reviewed by @mafredri. 🏂🏻
2026-06-02 11:51:16 +10:00
Danny Kopping c8555e2163 fix: deprecate ai provider seeding env config (#25854)
Environment variables used to configure AI Gateway providers are now deprecated, and we need to reflect this as such.
2026-06-01 15:15:47 +02:00
Mathias Fredriksson 6ecf804896 test(cli): eliminate race in PausedDuringWaitForReady test (#25858)
The PausedDuringWaitForReady and WaitsForWorkingAppState tests flaked
because the quartz resetTrap was released immediately after catching
ticker.Reset (line 174), allowing client.TaskByID (line 175) to race
with the subsequent DB mutation (pauseTask / PatchAppStatus).

Fix: keep the resetTrap open across both poll iterations. On the first
poll, release the trap so the goroutine sees the initial state and
continues. On the second poll, hold the goroutine frozen at
ticker.Reset while mutating state. Then release; client.TaskByID
deterministically sees the mutated state. No race because the
goroutine cannot execute client.TaskByID while trapped.

Closes CODAGT-482
2026-06-01 13:58:57 +03:00
Spike Curtis 3a727a9087 test: batch 01 of refactoring CLI tests not to use PTY (#25871)
Part of https://github.com/coder/internal/issues/1400

Batch of refactored CLI tests to avoid creating PTYs.
2026-05-29 20:12:52 +00:00
Spike Curtis 8a47b7fa14 test: batch 00 of refactoring CLI tests not to use PTY (#25868)
Part of https://github.com/coder/internal/issues/1400

Batch of refactored CLI tests to avoid creating PTYs.
2026-05-29 15:33:45 -04:00
Mathias Fredriksson 2af037ce02 fix(cli): use quartz mock clock in PausedDuringWaitForReady test (#25811)
PausedDuringWaitForReady used the real clock, so the 5s poll in
waitForTaskIdle could race with an in-flight stop build. The SQL
view (tasks_with_status) returns "unknown" for stop builds with
job_status != "succeeded" because the build_status CASE has no
branch for (stop, pending) or (stop, running). On macOS CI, where
the provisioner is slower, the poll fires during this transient
window and hits the TaskStatusUnknown case instead of
TaskStatusPaused, failing with "task entered unknown state" rather
than the expected "was paused".

Convert to the same quartz mock clock pattern that PR #25648
applied to WaitsForWorkingAppState: inject a mock clock via
NewWithClock, trap ticker creation and reset, then advance time
deterministically so the poll fires after the stop build completes.

Closes CODAGT-482
2026-05-29 11:25:06 +03:00
Danny Kopping 5b10268827 feat: serve 503 sentinel for disabled providers (#25794)
_Disclosure: created with Coder Agents._

When providers are disabled, we should serve a sentinel error so the
requesting client (Claude Code, Coder Agents, etc) is informed. Coder
Agents can also conditionalize its display to show a helpful error
message.

---------

Signed-off-by: Danny Kopping <danny@coder.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 10:24:16 +02:00
Spike Curtis ee4126e913 test: refactor CLI create tests not to use PTY (#25807)
<!--

If you have used AI to produce some or all of this PR, please ensure you have read our [AI Contribution guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING) before submitting.

-->Part of https://github.com/coder/internal/issues/1400  
  
Refactors CLI tests of the `create` command as the first batch of tests refactored to take a PTY out of the loop.

One interesting difference I noticed between PTY and a direct pipe to standard in is that on the PTY we write `\r` to enter some input, but the kernel actually sends `\n` (or maybe `\r\n`) to the process, at least on Unix. (On windows we sent `\r\n` into the PTY).  This is reflected in the implementation of the `Writer` , otherwise mostly inspired by the PTYTest equivalents.
2026-05-28 17:50:37 -04:00
Danny Kopping 12520ee964 feat: add ai provider status and reload freshness metrics (#25770)
Add metrics for `aibridged` and `aibridgeproxyd`'s provider statuses. AI providers can be modified, and possibly misconfigured, at runtime. These metrics help operators understand the state of these provider definitions in case unexpected behaviour is observed.
2026-05-28 14:57:33 +02:00
Danny Kopping 2770bdc9d1 feat: route extra ai_provider_types through OpenAI and Anthropic providers (#25722)
_Disclosure:_ _produced_ _with_ _Claude_ _Opus_ _4\.7_

AI Gateway only supports Anthropic (+Bedrock), OpenAI, and Copilot providers at present. All other types (Vercel, Gemini, etc) will be mapped to OpenAI since they support OpenAI-compatible endpoints.
2026-05-27 16:16:05 +02:00
Max Schwenk ae492495ee fix(cli): show ready sync start dependencies (#25546)
## Problem

Follow-on to:

- https://github.com/coder/coder/pull/25089

`coder exp sync start` still printed a generic success message when the
unit was ready on the first status check. That hid whether the unit had
no dependencies or had dependencies that were already satisfied before
`sync start` ran.

Before:

```text
Success
```

## Solution
Print explicit startup output for both ready-at-first-check cases.

After, dependencies already satisfied:

```text
Unit "test-unit" started immediately, dependencies already satisfied: [dep-unit, dep-unit-2]
```

After, no dependencies:

```text
Unit "test-unit" started with no dependencies
```

The existing waiting path is unchanged and still reports the
dependencies while waiting and after waiting finishes.

Co-authored-by: Sas Swart <sas.swart.cdk@gmail.com>
2026-05-27 12:33:39 +02:00
Danny Kopping 79e007cf30 feat: hot-reload aibridged and aibridgeproxyd providers on DB changes (#25673)
Previously the in-process aibridge daemon and the enterprise aibridgeproxy daemon both snapshotted their provider routing once at boot. Any `ai_providers` or `ai_provider_keys` mutation required a restart for either to pick it up.

Add an `ai_providers_changed` pubsub channel that the CRUD handlers publish on after Create / Update / Delete. Both daemons subscribe:

- **aibridged** rebuilds its `[]aibridge.Provider` snapshot via `BuildProviders` and swaps it into the pool atomically. Inflight requests keep serving against the bridge they already acquired; new acquires build against the new snapshot. Per-provider construction errors stay scoped to the offending row.
- **aibridgeproxyd** rebuilds its routing snapshot from `GetAIProviders` and swaps the host→provider map atomically. The MITM listener picks up new providers without restart.

DB read for aibridgeproxyd uses the existing `AsAIProviderMetadataReader` subject for routing-only access.
2026-05-27 11:58:43 +02:00
Ethan e91bec8574 fix(cli): close aibridge daemon before WebSocket shutdown wait (#25719)
> [!WARNING]
> The investigation and solution in this PR were done with
[Mux](https://mux.coder.com/). I've reviewed the investigation
methodology, evidence and solution, and it all appears sound.

## Summary

PR #25570 (`refactor: move aibridged out of enterprise to AGPL`, merged
2026-05-22) added an in-memory aibridge DRPC server in
`coderd/aibridged.go` that does `api.WebsocketWaitGroup.Add(1)` and only
releases `Done()` when its client session is closed. PR #25575 then
flipped `CODER_AI_GATEWAY_ENABLED` to default to `true`, so every
`cli.Server()` invocation now spins up that goroutine.

In `cli/server.go`, the only call to `aibridgeDaemon.Close()` was a
`defer` scheduled at function return. During graceful shutdown the code
first calls `coderAPICloser.Close()`, which waits on
`api.WebsocketWaitGroup`. That wait sits for the full 10s timeout in
`coderd/coderd.go` (`websocket shutdown timed out after 10 seconds`),
then returns, then the function unwinds, and only then does the deferred
`aibridgeDaemon.Close()` fire and let the goroutine call `Done()`.

The 10s tax was previously latent (aibridged was enterprise-only and
opt-in). After the two May 22 PRs it hit every `cli.Server()` test. On
Linux/macOS CI it just makes the suite slower; on the Depot Windows
runner, the ramdisk reservation leaves only ~17 GiB of headroom and the
~10s shutdown tails of multiple concurrent package binaries overlap into
an OOM, presenting as `test-go-pg (windows-2022)` jobs that die silently
at the ~600s watchdog with an empty `steps` array.

See Slack:
https://codercom.slack.com/archives/C05AE94121Z/p1779807717764189

## Fix

Close `aibridgeDaemon` explicitly during graceful shutdown, **before**
`coderAPICloser.Close()` waits on the WebSocket wait group. This matches
the existing ordered-shutdown pattern used for `tunnel` and
`notificationsManager`. The deferred `aibridgeDaemon.Close()` is
retained as a safety net for early-return paths, and is safe to
double-call because `aibridged.Server.Close()` is already idempotent via
`shutdownOnce` in `coderd/aibridged/aibridged.go`.

## Regression test

`TestServer_AIGatewayShutdownOrdering` boots a real `coder server` with
`--ai-gateway-enabled=true`, cancels its context, and asserts graceful
shutdown finishes in under 8s. With the fix the test runs in ~0.1s;
without the fix it fails deterministically at ~10.0s. The flag is passed
explicitly so the test continues to guard the ordering even if the
deployment default is ever flipped back.

## Evidence this fixes the OOM

On Linux the patched `cli` test package drops from 114 s back to its
pre-regression 30 s wall time at the same single-process peak RSS (~7.6
GiB), and the `websocket shutdown timed out after 10 seconds` log line
disappears from every server-test run. Since the Windows OOM is the sum
of multiple concurrent 10 s shutdown tails overlapping past the runner's
~17 GiB headroom, removing those tails returns the concurrent-RSS budget
to its pre-regression level. The Windows OOM was intermittent (a handful
of hits across many runs since May 22), so a single green `test-go-pg
(windows-2022)` job on this PR is not by itself proof; confirmation will
come from watching Windows runs on `main` over the next several days and
seeing the ~600 s silent-kill fingerprint stop recurring.

Relates to ENG-2771
2026-05-27 17:33:14 +10:00
Michael Suchacz 8b1705eb65 feat: route chatd provider traffic through aibridge (#25629)
## Summary

Routes chatd model calls backed by concrete AI Provider rows through the
in-process aibridge transport by default, with deployment options to use
direct provider routing when AI Gateway is disabled or chat AI Gateway
routing is disabled.

- Splits model routing into common, direct provider, and AI Gateway
paths behind a single deployment-mode entry point.
- Builds chatd models through explicit request, route, and options data.
Active API key attribution is passed explicitly instead of being hidden
inside generic model construction.
- For AI Gateway BYOK routes, resolves the user's provider key in chatd,
forwards it through provider-specific auth headers, and sets
`X-Coder-AI-Governance-Token` to the `delegated` marker so aibridge
preserves those headers while still stripping Coder-specific metadata.
- Keeps central provider credentials and deployment fallback credentials
out of forwarded provider auth headers, so AI Gateway central policy
remains authoritative.
- Redacts delegated provider auth from default string formatting to
avoid accidental plaintext logging of user BYOK credentials.
- Covers selected chat models, advisor overrides, title and quickgen
paths, subagent overrides, computer use model selection, and an
integration-style chat turn through the aibridge transport path.
- Persists initiating API key IDs on chat and queued user messages,
including subagent child messages, and fails closed for AI
Gateway-routed model builds without an active key.
- Removes unused `api_key_id` indexes while keeping the persistence
columns and foreign keys.
- Keeps the deployment option available through config and env parsing,
but hides it from CLI help and generated docs.
- Stabilizes the subagent poll fallback test so background CreateChat
processing cannot win the state transition under slower CI environments.

## Tests

- `go test ./coderd/x/chatd -run
'TestAIGatewayProviderAuthForUser|TestAIGatewayProviderAuthRedactsFormatting|TestResolveModelRouteForConfigAIGatewayProviderAuth|TestAIGatewayModelForwardsProviderAuth|TestProcessChat_AIGatewayRoutingUsesDelegatedAPIKey|TestAwaitSubagentCompletion'
-count=1`
- `go test ./coderd/aibridged -run
'TestServeHTTP_DelegatedAPIKey|TestServeHTTP_StripCoderToken' -count=1`
- `git diff --check HEAD~1..HEAD`
- `make lint`

> Mux working on behalf of Mike.
2026-05-26 19:31:52 +00:00
Danny Kopping a56c88a0cc fix: run AI provider seed and build after newAPI so dbcrypt applies (#25699)
## Problem

Two related symptoms of the same architectural issue: the `dbcrypt`
wrapper is installed inside `enterprise/coderd.New`, so any access to
`options.Database` that happens before `newAPI` runs bypasses
encryption.

**Symptom 1 (reads):** Provider keys added via the admin UI are
encrypted at rest. `BuildProviders` was running *before* `newAPI`,
against the unwrapped store, so the ciphertext was read as-is and shoved
into the keypool as the upstream credential. Anthropic/OpenAI reject it,
and the interception log shows:

```
coderd.aibridged.pool: interception failed  ... error="all configured keys failed authentication"
  credential_kind=centralized  credential_hint=PaPb...4A==  credential_length=184
```

**Symptom 2 (writes):** `SeedAIProvidersFromEnv` was also running before
`newAPI`, against the unwrapped store, so env-derived keys
(`CODER_AIBRIDGE_OPENAI_KEY`, indexed `CODER_AIBRIDGE_PROVIDER_<N>_KEY`,
etc.) landed in `ai_provider_keys` as plaintext with `ApiKeyKeyID =
null` even when `CODER_EXTERNAL_TOKEN_ENCRYPTION_KEYS` was set.

## Fix

Move both `SeedAIProvidersFromEnv` and `BuildProviders` to after
`newAPI`, where `options.Database` is the dbcrypt-wrapped store. Writes
encrypt correctly; reads decrypt correctly.

The enterprise closure (`enterprise/cli/server.go`) runs *inside*
`newAPI` and calls `BuildProviders` for the aibridgeproxyd at that
point. Once the agpl seed moves to after `newAPI`, the proxy on first
boot would see no env-seeded providers. Add a matching seed call inside
the enterprise closure before its `BuildProviders` to cover that case.
Seeding is idempotent, so the agpl-side seed running again post-`newAPI`
is a no-op when the rows already exist.

## Known shortcomings

The clean version of this fix would just inherit `ctx` like every other
startup step and place these calls naturally. It can't, for two reasons
that are both about the surrounding handler architecture rather than
this change:

1. **`dbcrypt` wrapping is positioned inside `newAPI`, not around
`options.Database` at creation.** That's why both seed and build have to
wait until after `newAPI` in the first place. The principled fix is to
install the wrapper at the point the store is created (behind a hook the
enterprise build supplies), so every consumer sees a single
authoritative view and the ordering stops mattering. This would also
collapse the duplicated seed call back to a single site.

2. **The handler's shutdown sequence is not deferred.**
`coderAPICloser.Close()` and the other teardown steps run only if
control reaches the `select` at the bottom of the handler. An early
`return` from anywhere in Phase 1 (e.g. seed/build returning
`context.Canceled` when the user hits ctrl-c during startup) skips that
block and orphans all the goroutines `newAPI` spawned — tailnet workers,
gitsync, telemetry batcher, etc. `goleak` then catches them at package
teardown and `TestServer_TelemetryDisabled_FinalReport` fails. Moving
the shutdown into deferred closers (with a `sync.Once`-guarded close to
avoid double-close from the explicit Phase 2 call) is the principled
fix.

For this PR I took the smallest change that fixes the reported bugs: a
detached context (`context.WithoutCancel(ctx)` + a 30s timeout) at the
seed and build call sites in both the agpl and enterprise paths. It lets
the calls complete even if the user cancels during startup, after which
the handler reaches its shutdown select naturally and tears down through
Phase 2. Both shortcomings above are worth addressing separately.

## Test plan

- `make test RUN=TestServer_TelemetryDisabled_FinalReport` with `-race`;
passes locally with `-count=3`.
- Manually verified on a deployment with
`CODER_EXTERNAL_TOKEN_ENCRYPTION_KEYS` set and env-configured providers:
`ai_provider_keys.api_key_key_id` is populated, `api_key` is base64
ciphertext, and upstream auth succeeds.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 21:27:02 +02:00
Danny Kopping 282ab7de34 refactor: load AI providers from the database at startup (#25672)
Replace the env-based `BuildProviders` with a DB-backed loader. The database is now the single source of truth for runtime provider configuration; env config arrives via `SeedAIProvidersFromEnv` (run at boot) and `BuildProviders` reads it back as `aibridge.Provider` instances. `cli/server.go` and `enterprise/cli/server.go` both call the same path, so aibridged and aibridgeproxyd see the same provider set.

Per-provider `DumpDir` is replaced by a top-level `CODER_AI_GATEWAY_DUMP_DIR` base; each provider's effective dump path is `<base>/<provider name>`.
2026-05-26 15:57:01 +02:00
Ethan 4f1043a50a feat(scaletest): add chat scaletest command (#25553)
Adds `coder exp scaletest chat`, a harness for creating Coder Agents
chat load.
Start the mock LLM separately, prepare the scaletest workspaces you want
to target, then run the chat scaletest against the existing
`scaletest-*` fleet selected by the shared workspace targeting flags:

```sh
coder exp scaletest llm-mock --address 127.0.0.1:18080

coder exp scaletest chat --llm-mock-url http://127.0.0.1:18080/v1 --chats-per-workspace 10 --turns 1
coder exp scaletest chat --llm-mock-url http://127.0.0.1:18080/v1 --template docker --target-workspaces 0:10 --chats-per-workspace 1 --turns 10 --turn-start-delay 30s
```

This is the same pattern used by the `workspace-traffic` load generator.

Keeping the fake LLM as a separate process is intentional so it can be
scaled independently from the Coder deployment, which will likely be
necessary as we scale up and up.

This PR is the starting point: it provides the command, mock
provider/model bootstrap, existing workspace selection, chat streaming,
follow-up turns, metrics, and cleanup. Follow-up PRs will add multi-step
turns via tool calls. I'm still a bit iffy on the mechanism I have for
that. It'll likely involve having the runner send some magic strings
that the mock will recognise.


Relates to CODAGT-307
Relates to GRU-48
Relates to https://github.com/coder/scaletest/issues/124

Generated by Mux, but reviewed by a human
2026-05-26 14:19:36 +10:00
Mathias Fredriksson 7958ad6d04 fix(cli): use quartz clock in waitForTaskIdle for immediate first poll (#25648)
waitForTaskIdle used time.NewTicker(5s) which delays the first poll
by 5 seconds. Debugger tracing proved the failure mechanism: on slow
CI (Windows), the first poll at 5s sees "working" (idle patch has not
landed due to goroutine scheduling), needs poll #2 at 10s, but the
25s context expires before it fires.

Two changes:

1. Use r.clock.NewTicker (quartz) with time.Nanosecond initial
   interval and Reset(5s) for immediate first poll. Tests inject a
   mock clock via clitest.NewWithClock for deterministic control.
2. Rewrite WaitsForWorkingAppState test with quartz traps
   (NewTicker + TickerReset) for deterministic synchronization
   instead of racing goroutines. Fix PausedDuringWaitForReady
   sync point.

Closes DEVEX-381
2026-05-25 19:14:29 +03:00
Danielle Maywood 5deab9f721 test: wait for devcontainer readiness (#25567) 2026-05-22 13:55:21 +01:00
Danny Kopping ef6ee2af68 chore: tolerate empty providers at startup and log env seeds (#25605)
Since AI Gateway is now enabled by default, and if the AI Gateway Proxy is enabled too it's possible the server can start without any configured providers. This would previously block startup, which is unacceptable.

In an upstack PR we will handle reloading the providers at runtime, so the server needs to be able to start up even if it can't handle any proxy requests to AI Gateway.

This change was necessitated because if there are providers configured in the environment they need to be seeded _before_ the proxy starts.
2026-05-22 12:45:14 +02:00
Danny Kopping ddec110b0e refactor: move aibridged out of enterprise to AGPL (#25570)
In order to allow Coder Agents to use AI Gateway in OSS, we need to rehome the `aibridged`\-related code into the AGPL path.

The HTTP API is only registered under enterprise so will still require the AI Governance Add-on to be present in order to use it, whereas Coder Agents uses an in-memory pipe to the same handlers.
2026-05-22 09:11:37 +02:00
Danny Kopping c50b0e84b9 feat!: default CODER_AI_GATEWAY_ENABLED to true (#25575)
`CODER_AI_GATEWAY_ENABLED` / `CODER_AIBRIDGE_ENABLED` is now being defaulted to `true` now that it will be used by Coder Agents.

If you previously had this value disabled explicitly, that value will persist.
2026-05-22 08:57:36 +02:00
Danny Kopping 9341efec9f feat!: seed ai_providers from env on server startup (#24895)
_Disclaimer: implemented by a Coder Agent using Claude Opus 4.7_

Part of the implementation of [RFC: Common AI Provider Configs](https://www.notion.so/coderhq/RFC-Common-AI-Provider-Configs-34bd579be59280ed958feffb82024797) (AIGOV-201).

## Note

This change can cause a previously working installation to fail to start should a conflict exist between the providers configured in the environment & those now migrated to the database.

I'll raise a PR upstack to document this process and workarounds should a startup fail.

## What this PR does

Reconciles environment-derived AI provider configuration with the `ai_providers` table at server startup. The seed runs **before** the aibridged daemon is initialized, so the runtime always reads providers from the database; the legacy `CODER_AIBRIDGE_*` environment variables become a one-shot migration source.

### Behavior

- Concurrent server starts are serialized through a Postgres advisory lock (`LockIDAIProvidersEnvSeed`).
- Missing rows are inserted with an audit entry attributed to the system actor.
- Existing rows whose canonical hash matches the env-derived hash are left alone (the common no-op restart path).
- Existing rows whose canonical hash does **not** match cause server startup to fail with a descriptive error so the operator can explicitly resolve the conflict in either env or DB.
- Soft-deleted rows are NOT resurrected from env; an explicit operator deletion is sticky across restarts.
- Indexed providers whose name conflicts with a legacy env var fail startup with a clear remediation message.
- Unknown provider types (e.g. `copilot`, until the DB enum is widened) are skipped with a log entry rather than failing startup.

### Canonical hashing

The `canonicalAIProvider` shape captures exactly the fields that determine runtime behavior — `type`, `base_url`, and the Bedrock subset of settings (access key, access key secret, region, model, small fast model) — and is hashed with SHA-256. The hash is **computed on demand from the row + env**, never persisted, so the database does not need a new column for it. API keys live in the separate `ai_provider_keys` table and are intentionally excluded from the hash so operators can rotate keys via the API without forcing a server restart.

<details>
<summary>Decision log</summary>

- The hash is intentionally not persisted in the database. The RFC discussed this trade-off; computing on demand keeps the schema minimal and lets the canonical shape evolve without a migration.
- The lock uses an `iota` slot in `coderd/database/lock.go` rather than `GenLockID` so it's stable, easy to audit, and matches the convention used for every other startup lock.
- A bearer-token Anthropic provider whose env vars also set Bedrock metadata but no AWS credentials does NOT store the Bedrock fields. Without credentials the discriminated settings would misrepresent the row as Bedrock auth.
- We deliberately do NOT publish to the `ai_providers_changed` pubsub channel from the seed because the seed completes before any subscriber is started; the follow-up PR introduces that channel.

</details>
2026-05-22 08:37:27 +02:00
Zach ddc0e99c69 chore: remove coder_secret Terraform integration (#25512)
Removes the coder_secret Terraform integration: the data.coder_secret
consumption path through provisionerdserver → provisioner.proto →
provisioner/terraform, the dynamic-parameter secret-requirement
validation, and the workspace-update / resolve-autostart surfaces that
depended on it. This is being done due to a product/feature direction
change (see PLAT-243). User-secret CRUD (DB, REST, CLI, UI, telemetry, audit)
and the agent-manifest secret-injection path are untouched.

The provisionerd API is bumped from v1.17 to v1.18 rather than rolled
back: v1.17 shipped in v2.33.x, so user_secrets field numbers are
reserved and the changelog documents both versions.

Generated with assistance from Coder Agents.
2026-05-21 09:19:29 -06:00
Thomas Kosiewski 26a0805dcd fix(cli): isolate root HTTP transports (#25430)
The CLI root client shared `http.DefaultTransport` for normal API
requests and for the version-check build-info request. In parallel
tests, other clients can close idle connections on that process-global
transport, which can fail the Boundary license check before the AGPL 404
handling runs.

`TestBoundaryLicenseVerification/AGPLDeployment` configures a proxy that
returns `404` from `/api/v2/entitlements`, which `verifyLicense()` maps
to the expected AGPL deployment error. However, `clitest.SetupConfig()`
only writes the URL and session token to disk. It does not pass the
test's isolated `proxyClient.HTTPClient` into the CLI invocation, so
`coder boundary` builds a fresh client through `RootCmd.InitClient()`.
Before this change, that fresh client used `http.DefaultTransport`; if
another parallel test closed idle connections on the shared transport
while the entitlement request was in flight, Go returned `http:
CloseIdleConnections called` instead of the proxy's `404`. The command
then failed with `failed to get entitlements`, and the test never
reached the expected AGPL error path.

Clone the default transport for each CLI root HTTP client and for the
unwrapped build-info client, preserving the configured TLS settings when
present. Each CLI invocation now gets its own transport instance, so
cleanup from unrelated parallel tests cannot interrupt its entitlement
or build-info requests.

Closes https://github.com/coder/internal/issues/1538

<details>
<summary>Coder Agents notes</summary>

Generated by Coder Agents for Linear ENG-2705.

Local validation:

- `go test ./cli -run
'TestNewHTTPTransport|Test_ensureTLSConfig|Test_wrapTransportWithVersionCheck'
-count=1`
- `go test ./enterprise/cli -run
TestBoundaryLicenseVerification/AGPLDeployment -count=20 -parallel=16`
- `go test ./cli ./enterprise/cli`
- `make lint`
- `go test ./enterprise/cli -run '^TestBoundaryLicenseVerification$'
-count=50 -parallel=16`
- pre-commit hook during `git commit`

</details>
2026-05-21 16:51:34 +02:00
Paweł Banaszewski 46e93e6325 chore: add ai_gateway options that alias aibridge options (#25061)
Adds options matching new AI Gateway naming.
New options are added as alias for old options. Old options are still
working.
Old options have deprecated message.
No conflict detection was added.

Updated documentation so it mentions only new options. Added note about
old options still working.

> Various AI tools where used to create this PR
2026-05-21 11:14:11 +02:00
Spike Curtis 05e47b9c0f fix: filter out cross-talk on TestPortForward (#25503)
<!--

If you have used AI to produce some or all of this PR, please ensure you have read our [AI Contribution guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING) before submitting.

-->

Fixes https://github.com/coder/internal/issues/1539  
  
Protects from port cross-talk by adding a short random prefix to our socket communication and instructing the service on the workspace agent side of the test to ignore any connections that don't use the prefix.
2026-05-20 13:08:57 -04:00
Cian Johnston 85289464b6 fix(cli): remove unnecessary PTY from TestServerCreateAdminUser/Validates (#25444)
Fixes https://linear.app/codercom/issue/PLAT-224

The Validates subtest only checks that `Run()` returns a validation
error and never reads PTY output. We don't need it in this test, so
removing.

> 🤖 Generated by Coder Agents
2026-05-19 10:35:50 +01:00
Danielle Maywood 170a6e1fe9 feat: add chat sharing foundation (#25041) 2026-05-18 22:32:05 +01:00
Callum Styan 191dd230ae feat: add agentfake scaletest subcommand (#25072)
This PR builds on top of https://github.com/coder/coder/pull/25070 to
add a way of running the larger "fake agent" manager via the existing
CLI, pulling in the URL/credentials already set.

With this, we can run a pod per scaletest region to act as all the
workspaces in that region.


This is in a new subcommand `scaletest agentfake` currently.

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2026-05-15 14:36:54 -07:00
Danny Kopping 841b777ccd feat: add ai_providers table, queries, dbauthz, audit, RBAC (#24892) 2026-05-14 16:10:46 +02:00
Max Schwenk f3e90b334d fix(cli): show sync wait dependencies (#25089)
## Problem
`coder exp sync want` and `coder exp sync start` both printed generic
success messages, which hid the dependency units involved in startup
coordination.

Before, declaring dependencies with `sync want` printed:

```text
Success
```

Before, `sync start` printed while waiting, then finished with another
generic success message:

```text
Waiting for dependencies of unit 'test-unit' to be satisfied...
Success
```

## Solution
Print the dependency units in both cases, using wording that matches
where the command is in the lifecycle.

After, `sync want` prints the dependencies it declared for the unit:

```text
Unit "test-unit" declared dependencies: [dep-unit]
```

After, `sync start` enumerates the dependencies while it is waiting,
then prints the same dependencies after the unit starts executing:

```text
Unit "test-unit" is waiting for dependencies to be satisfied: [dep-unit, dep-unit-2]
Unit "test-unit" finished waiting for dependencies: [dep-unit, dep-unit-2]
```

The sync golden tests now cover the updated output, including multiple
dependencies for `sync start`.
2026-05-14 14:45:20 +02:00
Steven Masley 0f505aa4da chore: unhide flag to force unix filepaths in config-ssh (#25142)
Docs now include this flag. This flag is now also viewable in linux/mac
despite it effectively being a `no-op`.

Closes https://github.com/coder/coder/issues/24205
2026-05-13 14:59:33 -05:00
Michael Suchacz 38f586107d refactor: remove agents TUI (#25190) 2026-05-13 21:30:11 +02:00
Yevhenii Shcherbina b5e1ea33d8 feat: add AI budget policy and period deployment config (#25122)
Closes
https://linear.app/codercom/issue/AIGOV-283/add-deployment-config-for-ai-budget-policy-and-period

Adds `CODER_AI_BUDGET_POLICY` and `CODER_AI_BUDGET_PERIOD` deployment
options for AI Governance cost controls.
2026-05-12 10:48:36 -04:00
Steven Masley 19573e8aee feat!: patchTemplateMeta to use optional fields (#24984)
Closes https://github.com/coder/coder/issues/13112

**Breaking Change**: Removed status code `StatusNotModified` when no
diffs occur in a patch. Now the patch is always applied and a template
is always returned.
2026-05-11 12:43:52 -05:00
Zach 81e2be69e9 test: use typed atomics in test files (#25071)
Use typed atomics (atomic.Int64, atomic.Int32, etc.) in test files to prevent
mixing atomic and non-atomic access on the same value, guarantee 64-bit
alignment on 32-bit platforms, and provide a cleaner API.
2026-05-11 08:41:17 -06:00
Jeremy Ruppel a1dbd758bc feat: add template builder deployment config and telemetry types (#25082) 2026-05-11 09:48:55 -04:00
Thomas Kosiewski 4a6756a3e8 fix: isolate test HTTP clients (#25038) 2026-05-11 11:03:38 +02:00
Marcin Tojek febabfb8b2 feat: add request/response dump support to aibridgeproxyd (#24837)
Closes https://github.com/coder/coder/issues/24335
2026-05-11 10:59:26 +02:00
Nick Vigilante 369a191972 feat: add Quickstart template with language and IDE selection (#24904)
Add a new Quickstart starter template that lets users pick programming
languages, editors, and an optional Git repo to clone. The template uses
Docker under the hood but presents a developer-focused experience: pick
your tools, start coding.

## What's included

- **Languages parameter** (multi-select): Python, Node.js, Go, Rust,
Java, C/C++
- **IDEs parameter** (multi-select): VS Code (Browser), VS Code Desktop,
Cursor, JetBrains, Zed, Windsurf
- **Git repo parameter**: Optional URL to clone on workspace start
- **JetBrains filtering**: Maps selected languages to relevant IDE codes
(Python → PyCharm, Go → GoLand, etc.)
- **Docker precondition check**: Uses `data "external"` +
`terraform_data` precondition to surface a friendly error when Docker is
unavailable, before the Docker provider fails with a cryptic message
- **4 presets**: Web Development, Backend (Go), Data Science, Full Stack
- **Single install script**: All languages install in one `coder_script`
to avoid apt-get lock conflicts (agent scripts run in parallel via
`errgroup`)

<details><summary>Design decisions</summary>

- **Docker as invisible backend**: Docker is required on the Coder
server but never mentioned in the user-facing parameter UI. The
experience is entirely "pick languages, pick editors, start coding."
- **`coder_script` over startup_script**: Language installs use a
templated script file (`install-languages.sh.tftpl`) driven by the
languages parameter. A single script avoids dpkg lock contention since
`coder_script` resources execute concurrently.
- **`data "external"` for Docker check**: The external provider probes
Docker availability independently of the Docker provider. If Docker is
down, the `terraform_data` precondition fails with a human-readable
message before any `docker_*` resource is evaluated. This depends on the
Docker provider connecting lazily (at resource eval time, not at
provider init), which current behavior confirms.
- **JetBrains filtering by language**: Rather than showing all 9
JetBrains IDEs, the template computes relevant IDE codes from the
language selection (e.g. Python → PY, Go → GO) and passes them as
`default` to the JetBrains module.
- **Arch-aware Go install**: The install script detects `uname -m` to
download the correct Go binary for amd64 or arm64.

</details>

<details><summary>Screenshots and recordings from the UI</summary>
<p>
<img width="1851" height="1471" alt="Screenshot 2026-05-05 at 2 14
20 PM"
src="https://github.com/user-attachments/assets/d4c9cdc5-d311-43a5-9e2e-f90b0019eda7"
/>
<img width="1851" height="1471" alt="Screenshot 2026-05-05 at 2 15
06 PM"
src="https://github.com/user-attachments/assets/cf3023fe-b6db-4503-a6c4-eaa0ec0659f8"
/>


https://github.com/user-attachments/assets/7507fd7d-ddb5-457a-9f7d-cbf89b36eb20


</p>
</details> 

> [!NOTE]
> This PR was authored by Coder Agents.
2026-05-06 13:55:38 +00:00
Kayla はな f6233e622b fix(cli): use app slug instead of raw command in terminal URLs (#24827) 2026-05-05 19:43:08 -06:00
Ethan 4751416b29 fix!: persist structured chat errors (#24919)
**Breaking change for changelog:**

> `codersdk.Chat.last_error` now returns a structured `ChatError` object
(`{message, kind, provider, retryable, status_code, detail}`) instead of
a plain string. The chats API is experimental
(`/api/experimental/chats`), so this ships without a deprecation cycle;
consumers reading `chat.last_error` as a string must update to read
`chat.last_error.message`. SDK/generated TypeScript terminal error
payloads now use the single `ChatError` type; the live stream error
payload type is renamed from `ChatStreamError` to `ChatError`.

Persisted chat errors now carry the same provider-specific detail (kind,
provider, retryable, HTTP status, optional detail) as the live stream,
so refreshing a failed chat rehydrates with the full structured error
instead of a one-line headline.

Existing rows are migrated in place: legacy text errors are wrapped into
`{message, kind: "generic"}` so already-errored chats still render, and
rows with `last_error IS NULL` stay NULL. Internally, persisted fallback
decoding now reuses the existing `chaterror.KindGeneric` constant, with
no JSON value change.

Closes CODAGT-239
2026-05-05 12:56:06 +10:00
Thomas Kosiewski c3794d54ac fix: avoid PTY for ssh command mode (#24862) 2026-05-01 15:02:05 +02:00