Commit Graph

1891 Commits

Author SHA1 Message Date
Danny Kopping 2770bdc9d1 feat: route extra ai_provider_types through OpenAI and Anthropic providers (#25722)
_Disclosure:_ _produced_ _with_ _Claude_ _Opus_ _4\.7_

AI Gateway only supports Anthropic (+Bedrock), OpenAI, and Copilot providers at present. All other types (Vercel, Gemini, etc) will be mapped to OpenAI since they support OpenAI-compatible endpoints.
2026-05-27 16:16:05 +02:00
Max Schwenk ae492495ee fix(cli): show ready sync start dependencies (#25546)
## Problem

Follow-on to:

- https://github.com/coder/coder/pull/25089

`coder exp sync start` still printed a generic success message when the
unit was ready on the first status check. That hid whether the unit had
no dependencies or had dependencies that were already satisfied before
`sync start` ran.

Before:

```text
Success
```

## Solution
Print explicit startup output for both ready-at-first-check cases.

After, dependencies already satisfied:

```text
Unit "test-unit" started immediately, dependencies already satisfied: [dep-unit, dep-unit-2]
```

After, no dependencies:

```text
Unit "test-unit" started with no dependencies
```

The existing waiting path is unchanged and still reports the
dependencies while waiting and after waiting finishes.

Co-authored-by: Sas Swart <sas.swart.cdk@gmail.com>
2026-05-27 12:33:39 +02:00
Danny Kopping 79e007cf30 feat: hot-reload aibridged and aibridgeproxyd providers on DB changes (#25673)
Previously the in-process aibridge daemon and the enterprise aibridgeproxy daemon both snapshotted their provider routing once at boot. Any `ai_providers` or `ai_provider_keys` mutation required a restart for either to pick it up.

Add an `ai_providers_changed` pubsub channel that the CRUD handlers publish on after Create / Update / Delete. Both daemons subscribe:

- **aibridged** rebuilds its `[]aibridge.Provider` snapshot via `BuildProviders` and swaps it into the pool atomically. Inflight requests keep serving against the bridge they already acquired; new acquires build against the new snapshot. Per-provider construction errors stay scoped to the offending row.
- **aibridgeproxyd** rebuilds its routing snapshot from `GetAIProviders` and swaps the host→provider map atomically. The MITM listener picks up new providers without restart.

DB read for aibridgeproxyd uses the existing `AsAIProviderMetadataReader` subject for routing-only access.
2026-05-27 11:58:43 +02:00
Ethan e91bec8574 fix(cli): close aibridge daemon before WebSocket shutdown wait (#25719)
> [!WARNING]
> The investigation and solution in this PR were done with
[Mux](https://mux.coder.com/). I've reviewed the investigation
methodology, evidence and solution, and it all appears sound.

## Summary

PR #25570 (`refactor: move aibridged out of enterprise to AGPL`, merged
2026-05-22) added an in-memory aibridge DRPC server in
`coderd/aibridged.go` that does `api.WebsocketWaitGroup.Add(1)` and only
releases `Done()` when its client session is closed. PR #25575 then
flipped `CODER_AI_GATEWAY_ENABLED` to default to `true`, so every
`cli.Server()` invocation now spins up that goroutine.

In `cli/server.go`, the only call to `aibridgeDaemon.Close()` was a
`defer` scheduled at function return. During graceful shutdown the code
first calls `coderAPICloser.Close()`, which waits on
`api.WebsocketWaitGroup`. That wait sits for the full 10s timeout in
`coderd/coderd.go` (`websocket shutdown timed out after 10 seconds`),
then returns, then the function unwinds, and only then does the deferred
`aibridgeDaemon.Close()` fire and let the goroutine call `Done()`.

The 10s tax was previously latent (aibridged was enterprise-only and
opt-in). After the two May 22 PRs it hit every `cli.Server()` test. On
Linux/macOS CI it just makes the suite slower; on the Depot Windows
runner, the ramdisk reservation leaves only ~17 GiB of headroom and the
~10s shutdown tails of multiple concurrent package binaries overlap into
an OOM, presenting as `test-go-pg (windows-2022)` jobs that die silently
at the ~600s watchdog with an empty `steps` array.

See Slack:
https://codercom.slack.com/archives/C05AE94121Z/p1779807717764189

## Fix

Close `aibridgeDaemon` explicitly during graceful shutdown, **before**
`coderAPICloser.Close()` waits on the WebSocket wait group. This matches
the existing ordered-shutdown pattern used for `tunnel` and
`notificationsManager`. The deferred `aibridgeDaemon.Close()` is
retained as a safety net for early-return paths, and is safe to
double-call because `aibridged.Server.Close()` is already idempotent via
`shutdownOnce` in `coderd/aibridged/aibridged.go`.

## Regression test

`TestServer_AIGatewayShutdownOrdering` boots a real `coder server` with
`--ai-gateway-enabled=true`, cancels its context, and asserts graceful
shutdown finishes in under 8s. With the fix the test runs in ~0.1s;
without the fix it fails deterministically at ~10.0s. The flag is passed
explicitly so the test continues to guard the ordering even if the
deployment default is ever flipped back.

## Evidence this fixes the OOM

On Linux the patched `cli` test package drops from 114 s back to its
pre-regression 30 s wall time at the same single-process peak RSS (~7.6
GiB), and the `websocket shutdown timed out after 10 seconds` log line
disappears from every server-test run. Since the Windows OOM is the sum
of multiple concurrent 10 s shutdown tails overlapping past the runner's
~17 GiB headroom, removing those tails returns the concurrent-RSS budget
to its pre-regression level. The Windows OOM was intermittent (a handful
of hits across many runs since May 22), so a single green `test-go-pg
(windows-2022)` job on this PR is not by itself proof; confirmation will
come from watching Windows runs on `main` over the next several days and
seeing the ~600 s silent-kill fingerprint stop recurring.

Relates to ENG-2771
2026-05-27 17:33:14 +10:00
Michael Suchacz 8b1705eb65 feat: route chatd provider traffic through aibridge (#25629)
## Summary

Routes chatd model calls backed by concrete AI Provider rows through the
in-process aibridge transport by default, with deployment options to use
direct provider routing when AI Gateway is disabled or chat AI Gateway
routing is disabled.

- Splits model routing into common, direct provider, and AI Gateway
paths behind a single deployment-mode entry point.
- Builds chatd models through explicit request, route, and options data.
Active API key attribution is passed explicitly instead of being hidden
inside generic model construction.
- For AI Gateway BYOK routes, resolves the user's provider key in chatd,
forwards it through provider-specific auth headers, and sets
`X-Coder-AI-Governance-Token` to the `delegated` marker so aibridge
preserves those headers while still stripping Coder-specific metadata.
- Keeps central provider credentials and deployment fallback credentials
out of forwarded provider auth headers, so AI Gateway central policy
remains authoritative.
- Redacts delegated provider auth from default string formatting to
avoid accidental plaintext logging of user BYOK credentials.
- Covers selected chat models, advisor overrides, title and quickgen
paths, subagent overrides, computer use model selection, and an
integration-style chat turn through the aibridge transport path.
- Persists initiating API key IDs on chat and queued user messages,
including subagent child messages, and fails closed for AI
Gateway-routed model builds without an active key.
- Removes unused `api_key_id` indexes while keeping the persistence
columns and foreign keys.
- Keeps the deployment option available through config and env parsing,
but hides it from CLI help and generated docs.
- Stabilizes the subagent poll fallback test so background CreateChat
processing cannot win the state transition under slower CI environments.

## Tests

- `go test ./coderd/x/chatd -run
'TestAIGatewayProviderAuthForUser|TestAIGatewayProviderAuthRedactsFormatting|TestResolveModelRouteForConfigAIGatewayProviderAuth|TestAIGatewayModelForwardsProviderAuth|TestProcessChat_AIGatewayRoutingUsesDelegatedAPIKey|TestAwaitSubagentCompletion'
-count=1`
- `go test ./coderd/aibridged -run
'TestServeHTTP_DelegatedAPIKey|TestServeHTTP_StripCoderToken' -count=1`
- `git diff --check HEAD~1..HEAD`
- `make lint`

> Mux working on behalf of Mike.
2026-05-26 19:31:52 +00:00
Danny Kopping a56c88a0cc fix: run AI provider seed and build after newAPI so dbcrypt applies (#25699)
## Problem

Two related symptoms of the same architectural issue: the `dbcrypt`
wrapper is installed inside `enterprise/coderd.New`, so any access to
`options.Database` that happens before `newAPI` runs bypasses
encryption.

**Symptom 1 (reads):** Provider keys added via the admin UI are
encrypted at rest. `BuildProviders` was running *before* `newAPI`,
against the unwrapped store, so the ciphertext was read as-is and shoved
into the keypool as the upstream credential. Anthropic/OpenAI reject it,
and the interception log shows:

```
coderd.aibridged.pool: interception failed  ... error="all configured keys failed authentication"
  credential_kind=centralized  credential_hint=PaPb...4A==  credential_length=184
```

**Symptom 2 (writes):** `SeedAIProvidersFromEnv` was also running before
`newAPI`, against the unwrapped store, so env-derived keys
(`CODER_AIBRIDGE_OPENAI_KEY`, indexed `CODER_AIBRIDGE_PROVIDER_<N>_KEY`,
etc.) landed in `ai_provider_keys` as plaintext with `ApiKeyKeyID =
null` even when `CODER_EXTERNAL_TOKEN_ENCRYPTION_KEYS` was set.

## Fix

Move both `SeedAIProvidersFromEnv` and `BuildProviders` to after
`newAPI`, where `options.Database` is the dbcrypt-wrapped store. Writes
encrypt correctly; reads decrypt correctly.

The enterprise closure (`enterprise/cli/server.go`) runs *inside*
`newAPI` and calls `BuildProviders` for the aibridgeproxyd at that
point. Once the agpl seed moves to after `newAPI`, the proxy on first
boot would see no env-seeded providers. Add a matching seed call inside
the enterprise closure before its `BuildProviders` to cover that case.
Seeding is idempotent, so the agpl-side seed running again post-`newAPI`
is a no-op when the rows already exist.

## Known shortcomings

The clean version of this fix would just inherit `ctx` like every other
startup step and place these calls naturally. It can't, for two reasons
that are both about the surrounding handler architecture rather than
this change:

1. **`dbcrypt` wrapping is positioned inside `newAPI`, not around
`options.Database` at creation.** That's why both seed and build have to
wait until after `newAPI` in the first place. The principled fix is to
install the wrapper at the point the store is created (behind a hook the
enterprise build supplies), so every consumer sees a single
authoritative view and the ordering stops mattering. This would also
collapse the duplicated seed call back to a single site.

2. **The handler's shutdown sequence is not deferred.**
`coderAPICloser.Close()` and the other teardown steps run only if
control reaches the `select` at the bottom of the handler. An early
`return` from anywhere in Phase 1 (e.g. seed/build returning
`context.Canceled` when the user hits ctrl-c during startup) skips that
block and orphans all the goroutines `newAPI` spawned — tailnet workers,
gitsync, telemetry batcher, etc. `goleak` then catches them at package
teardown and `TestServer_TelemetryDisabled_FinalReport` fails. Moving
the shutdown into deferred closers (with a `sync.Once`-guarded close to
avoid double-close from the explicit Phase 2 call) is the principled
fix.

For this PR I took the smallest change that fixes the reported bugs: a
detached context (`context.WithoutCancel(ctx)` + a 30s timeout) at the
seed and build call sites in both the agpl and enterprise paths. It lets
the calls complete even if the user cancels during startup, after which
the handler reaches its shutdown select naturally and tears down through
Phase 2. Both shortcomings above are worth addressing separately.

## Test plan

- `make test RUN=TestServer_TelemetryDisabled_FinalReport` with `-race`;
passes locally with `-count=3`.
- Manually verified on a deployment with
`CODER_EXTERNAL_TOKEN_ENCRYPTION_KEYS` set and env-configured providers:
`ai_provider_keys.api_key_key_id` is populated, `api_key` is base64
ciphertext, and upstream auth succeeds.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 21:27:02 +02:00
Danny Kopping 282ab7de34 refactor: load AI providers from the database at startup (#25672)
Replace the env-based `BuildProviders` with a DB-backed loader. The database is now the single source of truth for runtime provider configuration; env config arrives via `SeedAIProvidersFromEnv` (run at boot) and `BuildProviders` reads it back as `aibridge.Provider` instances. `cli/server.go` and `enterprise/cli/server.go` both call the same path, so aibridged and aibridgeproxyd see the same provider set.

Per-provider `DumpDir` is replaced by a top-level `CODER_AI_GATEWAY_DUMP_DIR` base; each provider's effective dump path is `<base>/<provider name>`.
2026-05-26 15:57:01 +02:00
Ethan 4f1043a50a feat(scaletest): add chat scaletest command (#25553)
Adds `coder exp scaletest chat`, a harness for creating Coder Agents
chat load.
Start the mock LLM separately, prepare the scaletest workspaces you want
to target, then run the chat scaletest against the existing
`scaletest-*` fleet selected by the shared workspace targeting flags:

```sh
coder exp scaletest llm-mock --address 127.0.0.1:18080

coder exp scaletest chat --llm-mock-url http://127.0.0.1:18080/v1 --chats-per-workspace 10 --turns 1
coder exp scaletest chat --llm-mock-url http://127.0.0.1:18080/v1 --template docker --target-workspaces 0:10 --chats-per-workspace 1 --turns 10 --turn-start-delay 30s
```

This is the same pattern used by the `workspace-traffic` load generator.

Keeping the fake LLM as a separate process is intentional so it can be
scaled independently from the Coder deployment, which will likely be
necessary as we scale up and up.

This PR is the starting point: it provides the command, mock
provider/model bootstrap, existing workspace selection, chat streaming,
follow-up turns, metrics, and cleanup. Follow-up PRs will add multi-step
turns via tool calls. I'm still a bit iffy on the mechanism I have for
that. It'll likely involve having the runner send some magic strings
that the mock will recognise.


Relates to CODAGT-307
Relates to GRU-48
Relates to https://github.com/coder/scaletest/issues/124

Generated by Mux, but reviewed by a human
2026-05-26 14:19:36 +10:00
Mathias Fredriksson 7958ad6d04 fix(cli): use quartz clock in waitForTaskIdle for immediate first poll (#25648)
waitForTaskIdle used time.NewTicker(5s) which delays the first poll
by 5 seconds. Debugger tracing proved the failure mechanism: on slow
CI (Windows), the first poll at 5s sees "working" (idle patch has not
landed due to goroutine scheduling), needs poll #2 at 10s, but the
25s context expires before it fires.

Two changes:

1. Use r.clock.NewTicker (quartz) with time.Nanosecond initial
   interval and Reset(5s) for immediate first poll. Tests inject a
   mock clock via clitest.NewWithClock for deterministic control.
2. Rewrite WaitsForWorkingAppState test with quartz traps
   (NewTicker + TickerReset) for deterministic synchronization
   instead of racing goroutines. Fix PausedDuringWaitForReady
   sync point.

Closes DEVEX-381
2026-05-25 19:14:29 +03:00
Danielle Maywood 5deab9f721 test: wait for devcontainer readiness (#25567) 2026-05-22 13:55:21 +01:00
Danny Kopping ef6ee2af68 chore: tolerate empty providers at startup and log env seeds (#25605)
Since AI Gateway is now enabled by default, and if the AI Gateway Proxy is enabled too it's possible the server can start without any configured providers. This would previously block startup, which is unacceptable.

In an upstack PR we will handle reloading the providers at runtime, so the server needs to be able to start up even if it can't handle any proxy requests to AI Gateway.

This change was necessitated because if there are providers configured in the environment they need to be seeded _before_ the proxy starts.
2026-05-22 12:45:14 +02:00
Danny Kopping ddec110b0e refactor: move aibridged out of enterprise to AGPL (#25570)
In order to allow Coder Agents to use AI Gateway in OSS, we need to rehome the `aibridged`\-related code into the AGPL path.

The HTTP API is only registered under enterprise so will still require the AI Governance Add-on to be present in order to use it, whereas Coder Agents uses an in-memory pipe to the same handlers.
2026-05-22 09:11:37 +02:00
Danny Kopping c50b0e84b9 feat!: default CODER_AI_GATEWAY_ENABLED to true (#25575)
`CODER_AI_GATEWAY_ENABLED` / `CODER_AIBRIDGE_ENABLED` is now being defaulted to `true` now that it will be used by Coder Agents.

If you previously had this value disabled explicitly, that value will persist.
2026-05-22 08:57:36 +02:00
Danny Kopping 9341efec9f feat!: seed ai_providers from env on server startup (#24895)
_Disclaimer: implemented by a Coder Agent using Claude Opus 4.7_

Part of the implementation of [RFC: Common AI Provider Configs](https://www.notion.so/coderhq/RFC-Common-AI-Provider-Configs-34bd579be59280ed958feffb82024797) (AIGOV-201).

## Note

This change can cause a previously working installation to fail to start should a conflict exist between the providers configured in the environment & those now migrated to the database.

I'll raise a PR upstack to document this process and workarounds should a startup fail.

## What this PR does

Reconciles environment-derived AI provider configuration with the `ai_providers` table at server startup. The seed runs **before** the aibridged daemon is initialized, so the runtime always reads providers from the database; the legacy `CODER_AIBRIDGE_*` environment variables become a one-shot migration source.

### Behavior

- Concurrent server starts are serialized through a Postgres advisory lock (`LockIDAIProvidersEnvSeed`).
- Missing rows are inserted with an audit entry attributed to the system actor.
- Existing rows whose canonical hash matches the env-derived hash are left alone (the common no-op restart path).
- Existing rows whose canonical hash does **not** match cause server startup to fail with a descriptive error so the operator can explicitly resolve the conflict in either env or DB.
- Soft-deleted rows are NOT resurrected from env; an explicit operator deletion is sticky across restarts.
- Indexed providers whose name conflicts with a legacy env var fail startup with a clear remediation message.
- Unknown provider types (e.g. `copilot`, until the DB enum is widened) are skipped with a log entry rather than failing startup.

### Canonical hashing

The `canonicalAIProvider` shape captures exactly the fields that determine runtime behavior — `type`, `base_url`, and the Bedrock subset of settings (access key, access key secret, region, model, small fast model) — and is hashed with SHA-256. The hash is **computed on demand from the row + env**, never persisted, so the database does not need a new column for it. API keys live in the separate `ai_provider_keys` table and are intentionally excluded from the hash so operators can rotate keys via the API without forcing a server restart.

<details>
<summary>Decision log</summary>

- The hash is intentionally not persisted in the database. The RFC discussed this trade-off; computing on demand keeps the schema minimal and lets the canonical shape evolve without a migration.
- The lock uses an `iota` slot in `coderd/database/lock.go` rather than `GenLockID` so it's stable, easy to audit, and matches the convention used for every other startup lock.
- A bearer-token Anthropic provider whose env vars also set Bedrock metadata but no AWS credentials does NOT store the Bedrock fields. Without credentials the discriminated settings would misrepresent the row as Bedrock auth.
- We deliberately do NOT publish to the `ai_providers_changed` pubsub channel from the seed because the seed completes before any subscriber is started; the follow-up PR introduces that channel.

</details>
2026-05-22 08:37:27 +02:00
Zach ddc0e99c69 chore: remove coder_secret Terraform integration (#25512)
Removes the coder_secret Terraform integration: the data.coder_secret
consumption path through provisionerdserver → provisioner.proto →
provisioner/terraform, the dynamic-parameter secret-requirement
validation, and the workspace-update / resolve-autostart surfaces that
depended on it. This is being done due to a product/feature direction
change (see PLAT-243). User-secret CRUD (DB, REST, CLI, UI, telemetry, audit)
and the agent-manifest secret-injection path are untouched.

The provisionerd API is bumped from v1.17 to v1.18 rather than rolled
back: v1.17 shipped in v2.33.x, so user_secrets field numbers are
reserved and the changelog documents both versions.

Generated with assistance from Coder Agents.
2026-05-21 09:19:29 -06:00
Thomas Kosiewski 26a0805dcd fix(cli): isolate root HTTP transports (#25430)
The CLI root client shared `http.DefaultTransport` for normal API
requests and for the version-check build-info request. In parallel
tests, other clients can close idle connections on that process-global
transport, which can fail the Boundary license check before the AGPL 404
handling runs.

`TestBoundaryLicenseVerification/AGPLDeployment` configures a proxy that
returns `404` from `/api/v2/entitlements`, which `verifyLicense()` maps
to the expected AGPL deployment error. However, `clitest.SetupConfig()`
only writes the URL and session token to disk. It does not pass the
test's isolated `proxyClient.HTTPClient` into the CLI invocation, so
`coder boundary` builds a fresh client through `RootCmd.InitClient()`.
Before this change, that fresh client used `http.DefaultTransport`; if
another parallel test closed idle connections on the shared transport
while the entitlement request was in flight, Go returned `http:
CloseIdleConnections called` instead of the proxy's `404`. The command
then failed with `failed to get entitlements`, and the test never
reached the expected AGPL error path.

Clone the default transport for each CLI root HTTP client and for the
unwrapped build-info client, preserving the configured TLS settings when
present. Each CLI invocation now gets its own transport instance, so
cleanup from unrelated parallel tests cannot interrupt its entitlement
or build-info requests.

Closes https://github.com/coder/internal/issues/1538

<details>
<summary>Coder Agents notes</summary>

Generated by Coder Agents for Linear ENG-2705.

Local validation:

- `go test ./cli -run
'TestNewHTTPTransport|Test_ensureTLSConfig|Test_wrapTransportWithVersionCheck'
-count=1`
- `go test ./enterprise/cli -run
TestBoundaryLicenseVerification/AGPLDeployment -count=20 -parallel=16`
- `go test ./cli ./enterprise/cli`
- `make lint`
- `go test ./enterprise/cli -run '^TestBoundaryLicenseVerification$'
-count=50 -parallel=16`
- pre-commit hook during `git commit`

</details>
2026-05-21 16:51:34 +02:00
Paweł Banaszewski 46e93e6325 chore: add ai_gateway options that alias aibridge options (#25061)
Adds options matching new AI Gateway naming.
New options are added as alias for old options. Old options are still
working.
Old options have deprecated message.
No conflict detection was added.

Updated documentation so it mentions only new options. Added note about
old options still working.

> Various AI tools where used to create this PR
2026-05-21 11:14:11 +02:00
Spike Curtis 05e47b9c0f fix: filter out cross-talk on TestPortForward (#25503)
<!--

If you have used AI to produce some or all of this PR, please ensure you have read our [AI Contribution guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING) before submitting.

-->

Fixes https://github.com/coder/internal/issues/1539  
  
Protects from port cross-talk by adding a short random prefix to our socket communication and instructing the service on the workspace agent side of the test to ignore any connections that don't use the prefix.
2026-05-20 13:08:57 -04:00
Cian Johnston 85289464b6 fix(cli): remove unnecessary PTY from TestServerCreateAdminUser/Validates (#25444)
Fixes https://linear.app/codercom/issue/PLAT-224

The Validates subtest only checks that `Run()` returns a validation
error and never reads PTY output. We don't need it in this test, so
removing.

> 🤖 Generated by Coder Agents
2026-05-19 10:35:50 +01:00
Danielle Maywood 170a6e1fe9 feat: add chat sharing foundation (#25041) 2026-05-18 22:32:05 +01:00
Callum Styan 191dd230ae feat: add agentfake scaletest subcommand (#25072)
This PR builds on top of https://github.com/coder/coder/pull/25070 to
add a way of running the larger "fake agent" manager via the existing
CLI, pulling in the URL/credentials already set.

With this, we can run a pod per scaletest region to act as all the
workspaces in that region.


This is in a new subcommand `scaletest agentfake` currently.

---------

Signed-off-by: Callum Styan <callumstyan@gmail.com>
2026-05-15 14:36:54 -07:00
Danny Kopping 841b777ccd feat: add ai_providers table, queries, dbauthz, audit, RBAC (#24892) 2026-05-14 16:10:46 +02:00
Max Schwenk f3e90b334d fix(cli): show sync wait dependencies (#25089)
## Problem
`coder exp sync want` and `coder exp sync start` both printed generic
success messages, which hid the dependency units involved in startup
coordination.

Before, declaring dependencies with `sync want` printed:

```text
Success
```

Before, `sync start` printed while waiting, then finished with another
generic success message:

```text
Waiting for dependencies of unit 'test-unit' to be satisfied...
Success
```

## Solution
Print the dependency units in both cases, using wording that matches
where the command is in the lifecycle.

After, `sync want` prints the dependencies it declared for the unit:

```text
Unit "test-unit" declared dependencies: [dep-unit]
```

After, `sync start` enumerates the dependencies while it is waiting,
then prints the same dependencies after the unit starts executing:

```text
Unit "test-unit" is waiting for dependencies to be satisfied: [dep-unit, dep-unit-2]
Unit "test-unit" finished waiting for dependencies: [dep-unit, dep-unit-2]
```

The sync golden tests now cover the updated output, including multiple
dependencies for `sync start`.
2026-05-14 14:45:20 +02:00
Steven Masley 0f505aa4da chore: unhide flag to force unix filepaths in config-ssh (#25142)
Docs now include this flag. This flag is now also viewable in linux/mac
despite it effectively being a `no-op`.

Closes https://github.com/coder/coder/issues/24205
2026-05-13 14:59:33 -05:00
Michael Suchacz 38f586107d refactor: remove agents TUI (#25190) 2026-05-13 21:30:11 +02:00
Yevhenii Shcherbina b5e1ea33d8 feat: add AI budget policy and period deployment config (#25122)
Closes
https://linear.app/codercom/issue/AIGOV-283/add-deployment-config-for-ai-budget-policy-and-period

Adds `CODER_AI_BUDGET_POLICY` and `CODER_AI_BUDGET_PERIOD` deployment
options for AI Governance cost controls.
2026-05-12 10:48:36 -04:00
Steven Masley 19573e8aee feat!: patchTemplateMeta to use optional fields (#24984)
Closes https://github.com/coder/coder/issues/13112

**Breaking Change**: Removed status code `StatusNotModified` when no
diffs occur in a patch. Now the patch is always applied and a template
is always returned.
2026-05-11 12:43:52 -05:00
Zach 81e2be69e9 test: use typed atomics in test files (#25071)
Use typed atomics (atomic.Int64, atomic.Int32, etc.) in test files to prevent
mixing atomic and non-atomic access on the same value, guarantee 64-bit
alignment on 32-bit platforms, and provide a cleaner API.
2026-05-11 08:41:17 -06:00
Jeremy Ruppel a1dbd758bc feat: add template builder deployment config and telemetry types (#25082) 2026-05-11 09:48:55 -04:00
Thomas Kosiewski 4a6756a3e8 fix: isolate test HTTP clients (#25038) 2026-05-11 11:03:38 +02:00
Marcin Tojek febabfb8b2 feat: add request/response dump support to aibridgeproxyd (#24837)
Closes https://github.com/coder/coder/issues/24335
2026-05-11 10:59:26 +02:00
Nick Vigilante 369a191972 feat: add Quickstart template with language and IDE selection (#24904)
Add a new Quickstart starter template that lets users pick programming
languages, editors, and an optional Git repo to clone. The template uses
Docker under the hood but presents a developer-focused experience: pick
your tools, start coding.

## What's included

- **Languages parameter** (multi-select): Python, Node.js, Go, Rust,
Java, C/C++
- **IDEs parameter** (multi-select): VS Code (Browser), VS Code Desktop,
Cursor, JetBrains, Zed, Windsurf
- **Git repo parameter**: Optional URL to clone on workspace start
- **JetBrains filtering**: Maps selected languages to relevant IDE codes
(Python → PyCharm, Go → GoLand, etc.)
- **Docker precondition check**: Uses `data "external"` +
`terraform_data` precondition to surface a friendly error when Docker is
unavailable, before the Docker provider fails with a cryptic message
- **4 presets**: Web Development, Backend (Go), Data Science, Full Stack
- **Single install script**: All languages install in one `coder_script`
to avoid apt-get lock conflicts (agent scripts run in parallel via
`errgroup`)

<details><summary>Design decisions</summary>

- **Docker as invisible backend**: Docker is required on the Coder
server but never mentioned in the user-facing parameter UI. The
experience is entirely "pick languages, pick editors, start coding."
- **`coder_script` over startup_script**: Language installs use a
templated script file (`install-languages.sh.tftpl`) driven by the
languages parameter. A single script avoids dpkg lock contention since
`coder_script` resources execute concurrently.
- **`data "external"` for Docker check**: The external provider probes
Docker availability independently of the Docker provider. If Docker is
down, the `terraform_data` precondition fails with a human-readable
message before any `docker_*` resource is evaluated. This depends on the
Docker provider connecting lazily (at resource eval time, not at
provider init), which current behavior confirms.
- **JetBrains filtering by language**: Rather than showing all 9
JetBrains IDEs, the template computes relevant IDE codes from the
language selection (e.g. Python → PY, Go → GO) and passes them as
`default` to the JetBrains module.
- **Arch-aware Go install**: The install script detects `uname -m` to
download the correct Go binary for amd64 or arm64.

</details>

<details><summary>Screenshots and recordings from the UI</summary>
<p>
<img width="1851" height="1471" alt="Screenshot 2026-05-05 at 2 14
20 PM"
src="https://github.com/user-attachments/assets/d4c9cdc5-d311-43a5-9e2e-f90b0019eda7"
/>
<img width="1851" height="1471" alt="Screenshot 2026-05-05 at 2 15
06 PM"
src="https://github.com/user-attachments/assets/cf3023fe-b6db-4503-a6c4-eaa0ec0659f8"
/>


https://github.com/user-attachments/assets/7507fd7d-ddb5-457a-9f7d-cbf89b36eb20


</p>
</details> 

> [!NOTE]
> This PR was authored by Coder Agents.
2026-05-06 13:55:38 +00:00
Kayla はな f6233e622b fix(cli): use app slug instead of raw command in terminal URLs (#24827) 2026-05-05 19:43:08 -06:00
Ethan 4751416b29 fix!: persist structured chat errors (#24919)
**Breaking change for changelog:**

> `codersdk.Chat.last_error` now returns a structured `ChatError` object
(`{message, kind, provider, retryable, status_code, detail}`) instead of
a plain string. The chats API is experimental
(`/api/experimental/chats`), so this ships without a deprecation cycle;
consumers reading `chat.last_error` as a string must update to read
`chat.last_error.message`. SDK/generated TypeScript terminal error
payloads now use the single `ChatError` type; the live stream error
payload type is renamed from `ChatStreamError` to `ChatError`.

Persisted chat errors now carry the same provider-specific detail (kind,
provider, retryable, HTTP status, optional detail) as the live stream,
so refreshing a failed chat rehydrates with the full structured error
instead of a one-line headline.

Existing rows are migrated in place: legacy text errors are wrapped into
`{message, kind: "generic"}` so already-errored chats still render, and
rows with `last_error IS NULL` stay NULL. Internally, persisted fallback
decoding now reuses the existing `chaterror.KindGeneric` constant, with
no JSON value change.

Closes CODAGT-239
2026-05-05 12:56:06 +10:00
Thomas Kosiewski c3794d54ac fix: avoid PTY for ssh command mode (#24862) 2026-05-01 15:02:05 +02:00
Dean Sheather e57525002c chore: remove agents experiment flag and mark feature as beta (#24432)
Remove the `ExperimentAgents` feature flag so the Agents feature is
always available without requiring `--experiments=agents`. The feature
is now in beta.

Existing deployments that still pass `--experiments=agents` will get a
harmless "ignoring unknown experiment" warning on startup.

### Changes

**Backend:**
- Remove `RequireExperimentWithDevBypass` middleware from chat and MCP
server routes
- Always include `AgentsAccessRole` in assignable site roles (later
refactored to org-scoped on main; rebase keeps that)
- Always set `AgentsTabVisible = true`, then drop the entire dead
`AgentsTabVisible` metadata pipeline (Go htmlState field,
populateHTMLState goroutine, HTML meta tag, useEmbeddedMetadata
registration, mock); no production consumer reads it. `AgentsNavItem`
already gates on `permissions.createChat`.
- Make `blob:` CSP `img-src` addition unconditional
- Remove `ExperimentAgents` constant, `DisplayName` case, and
`ExperimentsKnown` entry

**CLI:**
- Graduate the agents TUI from `coder exp agents` to `coder agents`
(moved from `AGPLExperimental()` to `CoreSubcommands()`)
- Drop the `agent` alias so it does not collide with the hidden
workspace-agent command
- Rename implementation files `cli/exp_agents_*.go` -> `cli/agents_*.go`
and internal identifiers (`expChatsTUIModel` -> `chatsTUIModel`,
`newExpChatsTUIModel` -> `newChatsTUIModel`, `setupExpAgentsBackend` ->
`setupAgentsBackend`, `startExpAgentsSession` -> `startAgentsSession`,
`expAgentsPtr` -> `agentsPtr`, `expAgentsSession` -> `agentsSession`,
`TestExpAgents*` -> `TestAgents*`). `expClient` (the
`*codersdk.ExperimentalClient` local) is kept; `coderd/exp_chats*.go`
and other still-experimental `cli/exp_*.go` commands are intentionally
untouched.

**Frontend:**
- Remove experiment check from `AgentsNavItem` - render when
`canCreateChat` is true
- Remove `agentsEnabled` experiment check from `WorkspacesPage`, then
gate `chatsByWorkspace` on `permissions.createChat` so users without
chat access don't trigger the per-page DB query (Copilot review
feedback)
- Add `FeatureStageBadge` (beta) next to the Coder logo in the Agents
sidebar (desktop + mobile)

**Docs:**
- Remove experiment flag setup instructions from `early-access.md` and
`getting-started.md` (and rename `early-access.md`'s "Enable Coder
Agents" heading to "Set up Coder Agents", since there is no enablement
step left)
- Update `chats-api.md` and `getting-started.md`'s Chats API note to say
"beta" instead of "experimental"
- `docs/manifest.json`: drop "experimental" from the Chats API sidebar
description
- `make gen` regenerated `docs/reference/cli/agents.md` and the CLI
index
- `scripts/check_emdash.sh`: exclude `cli/testdata/*.golden` and
`enterprise/cli/testdata/*.golden` from the new repo-wide emdash lint,
since serpent emits emdash borders in every generated `--help` golden
file

**Tests:**
- Remove `ExperimentAgents` setup from all test files (14 occurrences
across 7 files)
- Update stale "with the agents experiment" comments in
`coderd/x/chatd/integration_test.go` and `coderd/mcp_test.go`


<img width="1185" height="900" alt="image"
src="https://github.com/user-attachments/assets/b420bc8f-41d6-42c6-abd8-ad572533d651"
/>


> 🤖 Generated by Coder Agents
2026-05-01 01:49:00 +10:00
Susana Ferreira dbb50ebaaf feat: remove 429 from aibridge circuit breaker failure conditions (#24701)
## Description

Removes 429 (Too Many Requests) from the circuit breaker failure conditions. Rate limiting is now handled by automatic key failover instead of tripping the circuit breaker.

## Changes

`DefaultIsFailure` no longer treats 429 as a circuit breaker failure. The circuit breaker now only trips on server overload responses (503, 529).

Tests and integration tests updated to use 503 instead of 429 for tripping circuits. Description strings in deployment config updated to reflect the change.

Closes https://github.com/coder/internal/issues/1445

> [!NOTE]
> Initially generated by Coder Agents, modified and reviewed by @ssncferreira
2026-04-30 09:31:32 +01:00
Susana Ferreira 101a4082dd feat: support multiple keys per AI Bridge provider (#24683)
## Description

Adds support for configuring multiple API keys per AI Bridge provider. This PR introduces the configuration parsing and validation only; wiring the key pools into the aibridge providers will happen in upstream PRs.

## Changes

Providers now accept a comma-separated list of keys via the `KEYS` env var (or a single key via the existing `KEY` var). The two are mutually exclusive. Bedrock follows the same pattern with `BEDROCK_ACCESS_KEYS` / `BEDROCK_ACCESS_KEY_SECRETS`, with an additional validation that the two slices have matching lengths.

Key validation at startup checks for empty values, duplicates, and a maximum of 5 keys per provider. 

Related to: https://github.com/coder/internal/issues/1445

> [!NOTE]
> Initially generated by Coder Agents, modified and reviewed by @ssncferreira
2026-04-30 09:19:32 +01:00
Callum Styan 950660e392 fix(cli): extend exp scaletest cleanup to properly clean up prebuilds (#23628)
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 10:28:27 -07:00
Paweł Banaszewski a24dc19d49 chore: clean up env var usage in aibridge (#24783)
> AI tools where used when creating this PR

This PR removes environment variable parsing from `/aibridge` directory.

Added env variables/flags for dump dir as coder options.
Only added to new indexed provider options
(`CODER_AIBRIDGE_PROVIDER_<N>_*`) not to deprecated legacy env variables
(`CODER_AIBRIDGE_ANTHROPIC_*` and `CODER_AIBRIDGE_OPENAI_KEY_*`).

Reverted adding `MaxRetries` option as it will be removed soon due to
key failover work:
https://github.com/coder/coder/pull/24783#discussion_r3155544808
2026-04-29 18:28:37 +02:00
George K 3f0e015fe5 fix: allow coderd to start with an empty DERP map when built-in DERP is disabled (#24544)
Allow coderd to start with an empty base DERP map when built-in DERP
is disabled and no static DERP map is configured, so DERP can come from
workspace proxies after startup.

Also add a DERP healthcheck warning when no DERP servers are currently
available at runtime.

Related to: https://linear.app/codercom/issue/PLAT-43/bug-coderd-unable-to-be-started-if-built-in-derp-server-disabled-and
Related to: https://github.com/coder/coder/issues/22324
2026-04-28 09:17:08 -07:00
Mathias Fredriksson 3c450899ea fix: pass agent context config explicitly instead of reading env (#24759)
The CODER_AGENT_EXP_* env vars are agent-internal options. When set
in the workspace environment they leak to MCP subprocesses and user
shells.

ReadEnvConfig() captures the values and ClearEnvVars() strips them
before the reinit loop, so config survives agent restarts. NewAPI
and ReadEnvConfig both use applyDefaults() to fill zero fields.
The chatd test passes config via agenttest.WithContextConfigFromEnv().
2026-04-28 17:58:28 +03:00
Cian Johnston 70d6efa311 feat: chat auto-archive owner digest notifications (#24643)
Depends on #24642

Adds per-owner digest notifications onto the chat auto-archive
subsystem.

Each tick's archived rows are grouped by owner, the top 25 titles per
owner are rendered into a new `Chats Auto-Archived` notification
template, and any remainder surfaces as `and N more`. Each digest is
per-tick, so users with large amounts of purgeable data may get multiple
notifications in sequence (one per user per tick).

The template body branches on `retention_days`: when retention is
disabled (`retention_days=0`), users are told archived chats are kept
indefinitely rather than falsely claiming imminent deletion.

### Changes
- migration `000XXX_chat_auto_archive_notification_template` adds new
notification template
- `dbpurge`: threads `notifications.Enqueuer` through `New`; and
enqueues notification message.
- `cli/server.go`: passes `options.NotificationsEnqueuer` into
`dbpurge.New`.
- `coderd/notifications/events.go`: new `TemplateChatAutoArchiveDigest`
UUID.
- `coderd/inboxnotifications.go`: inbox registration.
- Docs: adds a `Notifications` section to `chat-auto-archive.md`.

> 🤖
2026-04-28 08:56:36 +01:00
Sushant P 4820f13eb4 docs: add deprecation warning for login-type none (#24594)
The `--login-type none` option for `coder users create` is deprecated.
This adds deprecation warnings to all docs that reference it and updates
the CI/CD tutorial to recommend the replacement flows.

Refs DEVEX-224

<details>
<summary>Changes</summary>

- `cli/usercreate.go`: Append deprecation notice to `--login-type` flag
description.
- `docs/tutorials/testing-templates.md`: Replace `--login-type none`
example with separate Premium (`--service-account`) and OSS
(`--login-type password`) examples.
- `docs/reference/cli/users_create.md`: Regenerated from CLI source.
- `cli/testdata/coder_users_create_--help.golden`: Updated golden
snapshot.

</details>

> [!NOTE]
> Generated by Coder Agents.
2026-04-27 22:51:01 +00:00
Zach 79735f2d45 feat: plumb user secrets through provisioner chain to terraform (#24542)
This change passes user secrets from coderd to the Terraform process at
workspace build time so the `data.coder_secret` data source in
terraform-provider-coder can resolve values at plan time.

Secrets traverse two proto hops: `provisionerdserver` fetches them
via`ListUserSecretsWithValues`, attaches them to
`AcquiredJob.WorkspaceBuild.user_secrets` on `provisionerd.proto`;
`runner.go` forwards into `PlanRequest.user_secrets` on
`provisioner.proto`; the Terraform provisioner encodes each as
`CODER_SECRET_ENV_<name>` or `CODER_SECRET_FILE_<hex(path)>` before
invoking `terraform plan`. Only plan requests carry secrets; apply runs
with `nil` because values are baked into plan state.

Fetch is gated on a workspace transitioning to start. stop and delete
transitions never carry secrets, so revoking or deleting a stored secret
cannot make a workspace unstoppable. DB errors on the fetch fail the job
outright rather than silently continuing with an empty secret set.

Note that user secrets will be stored in the workspace_builds table in
provisioner_state with other Terraform state (including other sensitive data).
2026-04-27 08:26:07 -06:00
Cian Johnston d5a5be116d fix: fall back to name lookup for UUID-shaped workspace names (#24340)
`namedWorkspace` in `cli/root.go` parsed workspace identifiers with
`uuid.Parse` first and returned immediately on success, even when no
workspace had that UUID as its actual ID. This caused 404 errors for any
workspace whose name was a valid 32-char hex string (dashless UUID).

- Add `codersdk.ResolveWorkspace`: tries UUID lookup first, falls back
to name lookup on 404. `NameValid` guard skips the fallback for standard
dashed UUIDs (36 chars > 32-char name limit).
- Export `codersdk.SplitWorkspaceIdentifier`, replacing the duplicate
`splitNamedWorkspace` in `cli/root.go` (uses `strings.Cut`).
- Delete `namedWorkspace` from `cli/root.go`; all 28 call sites now use
`client.ResolveWorkspace` directly.
- Delete `namedWorkspace` and `splitNameAndOwner` from
`codersdk/toolsdk/bash.go`; inline `client.ResolveWorkspace`.
- Simplify `GetWorkspace` tool handler to a single `ResolveWorkspace`
call.
- Unit tests via httptest mock cover UUID, name, owner/name, UUID-like
fallback, not-found, server error, transport error, and invalid
identifier paths.
- Integration tests in `cli/show_test.go` and `codersdk/toolsdk` for
workspaces with UUID-like names.

> Generated with Coder Agents
2026-04-27 12:58:26 +01:00
Jeremy Ruppel 02b123518c fix: honor parameter defaults in --use-parameter-defaults and SSH auto-start (#24591)
## Problem

The CLI does not honor `default` values on template parameters in two
ways:

1. **`--use-parameter-defaults` rejects empty-string defaults.** The
check `parameterValue != ""` means `default = ""` in Terraform falls
through to an interactive prompt. In CI this causes an EOF error.

2. **`--use-parameter-defaults` only exists on `coder create`.** The
`start`, `update`, and `restart` commands never wire it through. SSH
auto-start passes empty `workspaceParameterFlags{}`, so users SSH-ing
into a stopped workspace with new template parameters get stuck in an
interactive prompt they cannot complete.

## Fix

### 1. Fix empty-string default detection and expose flag on all
commands

Replace `parameterValue != ""` with a check based on `!tvp.Required`. A
parameter with `Required==false` always has a valid default in
Terraform, even if that default is `""`. Also respect CLI defaults
provided via `--parameter-default`.

Move `--use-parameter-defaults` from a standalone option on `create`
into the shared `workspaceParameterFlags` struct. This exposes the flag
(and `CODER_WORKSPACE_USE_PARAMETER_DEFAULTS`) on `start`, `update`, and
`restart` via `allOptions()`. Wire it through
`buildWorkspaceStartRequest` so the resolver receives it.

### 2. SSH auto-start always uses defaults

Set `useParameterDefaults: true` on both `startWorkspace` calls in the
SSH auto-start path (initial start and the forbidden/upgrade fallback).
SSH is non-interactive and should never prompt.

Fixes https://linear.app/codercom/issue/DEVEX-180
Fixes https://github.com/coder/coder/issues/22272

<details><summary>Implementation notes</summary>

### Scoping decisions

- **`--yes` does not imply `--use-parameter-defaults`**: Making `--yes`
auto-accept defaults exposes a validation gap in the dynamic parameter
path (client-side validation happens during prompting, and skipping
prompts bypasses it). This is deferred to a follow-up that also
addresses `codersdk.ValidateWorkspaceBuildParameter` integration in the
resolver. Tracked in PLAT-114.
- **Explicit overrides always win**: `--parameter`,
`--rich-parameter-file`, and `--preset` are resolved in stages 1-5 of
the resolver, before `resolveWithInput` runs. No change needed for
precedence.
- **`!tvp.Required` vs `parameterValue != ""`**: The `Required` field is
set by the Terraform provider based on whether a `default` is present.
This is the canonical signal for "has a default," not the string value
itself.

</details>

> Generated with [Coder Agents](https://coder.com/agents)
2026-04-24 17:09:17 -04:00
Cian Johnston a876287d36 feat: auto-archive inactive chats with audit trail (#24642)
Adds a background job in `dbpurge` that periodically archives chats
inactive beyond a configurable threshold. Each archived root chat gets a
background audit entry tagged `chat_auto_archive`. Disabled by default.

* New `AutoArchiveInactiveChats` SQL query with LATERAL last-activity
subquery and partial index on archive candidates
* `site_configs`-backed `auto_archive_days` setting with admin-only PUT,
any-authenticated-user GET
* Cascade archive via `root_chat_id`; pinned chats and active threads
exempt
* Root-only audit dispatch on detached context, matching manual archive
(`patchChat`) behavior
* 11 subtests covering disabled no-op, boundary, deleted messages, child
activity, pinned exemption, multi-owner, idempotency, and batch
pagination

PR #24643 adds per-owner digest notifications.
PR #24704 adds the requisite UI controls.

> 🤖
2026-04-24 14:18:28 +01:00
Cian Johnston be011b210b fix(cli): fix flaky TestExpAgentsE2E/ExistingChatHistory (#24661)
- Remove racy sequential `expect("esc")` after `expect("direct open
seed")`
- Both strings appear in the same initial PTY render; their byte-stream
order depends on async title generation timing
- The seed text alone proves we are in the chat view; pressing esc +
expecting `enter: open` confirms list navigation

> 🤖
2026-04-23 11:25:24 +01:00
Cian Johnston f5ccf68e53 feat: add lima incus example (#24640)
Depends on https://github.com/coder/coder/pull/24616

Adds a sample Lima configuration for Coder+Incus.
2026-04-23 08:24:03 +01:00