* Adds `streamJanitorLoop` to clean up stale streams every 30s
* zeroes dropped slots to aid in gc-eligibliity
* Adds regression tests in coderd/x/chatd and enterprise/coderd/x/chatd
> 🤖
Add a `chat_client_type` enum (`ui` | `api`) and `client_type` column to
the `chats` table. The column defaults to `api` for new rows so API
callers don't need to set it explicitly. Existing rows are backfilled to
`ui`.
The field flows through `CreateChatRequest`, `chatd.CreateOptions`,
`InsertChat`, and is returned in the `Chat` response via `db2sdk`.
<details>
<summary>Implementation notes (Coder Agents generated)</summary>
### Changes
**Database migration (000469)**
- New enum `chat_client_type` with values `ui`, `api`.
- New `client_type` column, `NOT NULL DEFAULT 'api'`.
- Backfill: `UPDATE chats SET client_type = 'ui'`.
**SQL query** — `InsertChat` now includes `client_type`.
**SDK** — `ChatClientType` type added; `ClientType` field added to both
`CreateChatRequest` (optional, defaults server-side to `api`) and `Chat`
response.
**Handler** — `postChats` maps the request field (defaulting to `api`)
and passes it through `chatd.CreateOptions`.
**Sub-agent** — Child chats inherit their parent's `client_type`.
**db2sdk** — Maps the database value to the SDK type.
### Decision log
- Default is `api` (not `ui`) so existing API integrations get the
correct value without code changes.
- Backfill sets existing rows to `ui` per requirement.
- Child chats inherit `client_type` from parent rather than defaulting.
</details>
Relates to https://github.com/coder/internal/issues/1455
From that issue:
> Going to skip this test until the underlying race in chatd is fixed.
https://github.com/coder/coder/pull/24279 was a band-aid fix that I no
longer think is valuable pursuing short term. Hugo is working on a RFC
for a redesign of the system to prevent the class of race condition into
the future.
## Summary
Adds `--ai-gateway-allow-byok` deployment option to control whether
users can use Bring Your Own Key (BYOK) mode with AI Gateway.
When disabled (`--ai-gateway-allow-byok=false`), BYOK requests are
rejected with a 403 and a message directing the admin to enable the
flag. Centralized key authentication works regardless of this setting.
Defaults to `true` (BYOK allowed).
---------
Co-authored-by: Danny Kopping <danny@coder.com>
Addresses review findings from #23827 that were added post-merge:
- Persisted attachments now store `organizationId`; mismatched orgs
pruned on restore
- Workspace selection reconciliation: stale IDs from previous orgs
dropped via derived `effectiveWorkspaceId`
- Org picker uses `permittedOrganizations()` for RBAC-aware filtering
- Org picker hidden when user belongs to only one org
- Ref-sync `useEffect` replaced with `useEffectEvent`
- `CreateWorkspace()` and `ListTemplates()` take `organizationID` and
`db` as required function parameters instead of optional struct fields —
compiler enforces them, removes scattered nil guards
- Cross-org template check in `CreateWorkspace` is now unconditional
- `ListTemplates` org-scoping filter now has test coverage
- `setupChatInfra` comment fixed; test helpers use params structs
instead of positional UUIDs
- Enterprise test documents that org admin only sees own chats (handler
hardcodes `OwnerID` — future work needs sidebar UI before lifting that
restriction)
> 🤖
Three SQL queries (`GetUserGroupSpendLimit`,
`ResolveUserChatSpendLimit`, `GetUserChatSpendInPeriod`) aggregated chat
spend limits and usage globally across all organizations. A restrictive
group limit in org A would bleed into org B.
## Changes
- Add `organization_id` parameter to all three SQL queries in
`coderd/database/queries/chats.sql`
- When nil UUID is passed, queries fall back to global behavior
(backward compat for HTTP dashboard endpoints)
- When real org ID is passed, limits and spend are scoped to that
organization
- Thread `organizationID` through `ResolveUsageLimitStatus` →
`checkUsageLimit` → all chatd call sites
- Update dbauthz wrappers for new param structs
- HTTP endpoints (`chatCostSummary`, `getMyChatUsageLimitStatus`) pass
`uuid.Nil` with TODO for future org-scoped UI
- Add `TestResolveUsageLimitStatus_OrgScoped` with 5 test cases covering
org isolation, nil-UUID fallback, spend scoping, and user override
priority
Closescoder/internal#1466
> 🤖
## Summary
Adds a `sanitizeCredentialHint` safety check in the db-to-SDK conversion
layer to ensure credential hints are always masked before being exposed
in the API. Also adds `credential_kind` and `credential_hint` assertions
to the session threads API test.
Fixes https://github.com/coder/internal/issues/1436
* Adds organization_id to chats with backfill (workspace org → user org membership → default org)
* No support yet for ACLs (follow-up issue)
- Cross-org workspace binding rejected (both in `CreateChatRequest` and in `create_workspace` tool
- Adds `OrganizationAutocomplete` to `AgentCreateForm`
- Docs updated with `organization_id` in chats-api.md
> 🤖 Written by a Coder Agent. Reviewed by many humans and many agents.
---------
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
Fixescoder/internal#1455
Three changes to eliminate the timing-sensitive flake in
`TestSubscribeRelayEstablishedMidStream`:
1. **Reduce `PendingChatAcquireInterval` from `time.Hour` to
`time.Second`.**
The primary trigger is still `signalWake()` from `SendMessage`, but a
short fallback poll ensures the worker picks up the pending chat
even under heavy CI goroutine scheduling contention.
2. **Increase context timeout from `WaitLong` (25s) to `WaitSuperLong`
(60s).**
The worker pipeline (model resolution, message loading, LLM call)
involves multiple DB round-trips that can be slow when PostgreSQL
is shared with many parallel test packages.
3. **Add a status-polling loop while waiting for the streaming
request.**
If the worker errors out during chat processing, the test now
fails immediately with the error status and message instead of
silently timing out.
> Generated by Coder Agents
Audit and connection log pages were timing out due to expensive COUNT(*)
queries over large tables. This commit adds opt-in count capping: requests can
return a `count_cap` field signaling that the count was truncated at a threshold,
avoiding full table scans that caused page timeouts.
Text-cast UUID comparisons in regosql-generated authorization queries
also contributed to the slowdown by preventing index usage for connection
and audit log queries. These now emit native UUID operators.
Frontend changes handle the capped state in usePaginatedQuery and
PaginationWidget, optionally displaying a capped count in the pagination
UI (e.g. "Showing 2,076 to 2,100 of 2,000+ logs")
Related to:
https://linear.app/codercom/issue/PLAT-31/connectionaudit-log-performance-issue
Two new columns added to aibridge_token_usages:
- cache_read_input_tokens (BIGINT, default 0)
- cache_write_input_tokens (BIGINT, default 0)
Migration backfills existing rows by extracting values from the metadata
JSONB column (cache_read_input, input_cached, prompt_cached for reads
(max value selected since only 1 should be set), cache_creation_input
for writes).
All references to data from metadata were updated to reference new
columns. No other changes then changing where data is extracted from.
Requires aibridge library version bump to include:
https://github.com/coder/aibridge/pull/229
Fixes: https://github.com/coder/aibridge/issues/150
When the `agents` experiment is enabled, new users are automatically
granted the `agents-access` role at creation time so they can use Coder
Agents without manual admin intervention.
- Auto-assigns in `CreateUser()` — covers admin API, OAuth, and OIDC
creation paths
- Skips auto-assign for OIDC users when enterprise site role sync is
enabled (sync overwrites roles on every login; those admins should use
`--oidc-user-role-default` instead)
- CLI `create-admin-user` bypasses `CreateUser()` but creates `owner`
users who already have all permissions
> 🤖 Written by a Coder Agent. Will be reviewed by a human.
Previously, `CreateChat` inserted the `chats` row with the DB default
status (`waiting`), then updated it to `pending` in the same transaction
via `setChatPendingWithStore`. This wasted two extra queries per chat
creation (`GetChatByID` + `UpdateChatStatus`) and rewrote the same row
immediately after inserting it.
Now `CreateChat` passes the status directly to `InsertChat`, so the row
is written once in its final create-time state. The
`setChatPendingWithStore` helper is removed entirely. `InsertChat` now
requires an explicit `status` parameter at all callsites instead of
relying on a DB column default.
## Motivation
On an experimental branch we're trialing firing all chatd notifications
from plpgsql triggers. The old two-step insert made that awkward: in an
`AFTER INSERT` trigger, `NEW` only contained the insert-time row
(`waiting`), not the final committed state (`pending`). To emit the
correct event payload the trigger had to be deferred and re-read the row
from `chats` at commit time.
With this change, `NEW` already contains the correct row to publish — no
deferred trigger, no extra `SELECT`, simpler and cheaper trigger logic.
That said, this seems like a worthwhile change regardless of the trigger
experiment: writing the final row state once removes unnecessary DB work
on every chat creation and makes the create path easier to reason about.
Removes 6 fragile `require.Equal(t, codersdk.ChatStatusPending,
chat.Status)` assertions from chat relay and creation tests.
**Root cause**: In HA tests with two replicas sharing the same DB, the
worker can acquire a just-created chat (flipping `pending → running` via
`AcquireChats`) before the HTTP response reaches the test. All affected
tests already synchronize via `require.Eventually` waiting for `running`
status, making the initial assertion both redundant and racy.
- Remove 5 assertions in `enterprise/coderd/exp_chats_test.go` (all
`TestChatStreamRelay` subtests)
- Remove 1 assertion in `coderd/exp_chats_test.go` (`TestPostChats`)
- An existing comment in `TestPostChats/Success` already documents this
exact race
Fixes flake:
https://github.com/coder/coder/actions/runs/23807597632/job/69385425724
> 🤖 Written by a Coder Agent. Will be reviewed by a human.
_Disclaimer: produced using Claude Opus 4.6, reviewed by me, and
validated against Dogfood dataset._
The `ListAIBridgeSessions` query materialized and aggregated all
matching interceptions before paginating, then ran expensive
token/prompt lookups across the full dataset. For a page of 25 sessions
against ~200k interceptions (our dogfood dataset), this meant:
- Three CTEs scanning all rows (filtered_interceptions, session_tokens,
session_root)
- ARRAY_AGG(fi.id) collecting every interception ID per session
- Lateral prompt lookup via ANY(array_of_all_ids) running for every
session, not just the page
- ~90MB of disk sorts and JIT compilation kicking in
The improvement is to restructure to paginate first and enrich after: a
single CTE groups interceptions into sessions with only cheap aggregates
(MIN, MAX, COUNT), applies cursor pagination and LIMIT, then lateral
joins fetch metadata, tokens, and prompts for just the ~25-row page.
Measured against 220k interceptions / 160k sessions:
| Metric | Before | After |
|--------------------|--------|-------|
| Execution time | 1800ms | 185ms |
| Shared buffer hits | 737k | 2.6k |
| Disk sort spill | 86MB | 16MB |
| Lateral loops | 160k | 25 |
https://grafana.dev.coder.com/goto/fbODPGtvR?orgId=1 the results are
identical, just _much_ faster.
---
Also includes some additional tests which I added prior to refactoring
the query to ensure no regressions on edge-cases.
---------
Signed-off-by: Danny Kopping <danny@coder.com>
- Add `chat-access` built-in role granting chat CRUD at User scope
- Exclude `ResourceChat` from member, org member, and org service
account `allPermsExcept` calls
- Allow system, owner, and user-admin to assign the new role
- Migration auto-assigns role to users who have ever created a chat
- Update RBAC test matrix: `memberMe` denied, `chatAccessUser` allowed
**Breaking change**: Members without `chat-access` lose chat creation
ability. Migration covers existing chat creators. Members who have never
created a chat do not get this role automatically applied.
> 🤖 This PR was created by a Coder Agent and reviewed by me.
These chatd relay tests were seeding chats through
`subscriber.CreateChat(...)`, which wakes the subscriber and can race
local acquisition against the intended remote-worker setup.
Seed waiting and remote-running chats directly in the database instead,
and point the default OpenAI provider at a local safety-net server so
accidental processing fails locally instead of reaching the live API.
Closes https://github.com/coder/internal/issues/1430
Closes#22136
This pull-request implements a `<ClientFilter />` to our `Request Logs`
page for AI Bridge. This will allow the user to select a client which
they wish to filter against. Technically the backend is able to actually
filter against multiple clients at once however the frontend doesn't
currently have a nice way of supporting this (future improvement).
<img width="1447" height="831" alt="image"
src="https://github.com/user-attachments/assets/0be234e2-25f2-4a89-b971-d74817395da1"
/>
---------
Co-authored-by: Jeremy Ruppel <jeremy.ruppel@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
Add site-wide banners for AI Governance seat usage thresholds:
1. **90% capacity warning (admin-only):** When actual AI Governance
seats are ≥90% and <100% of the license limit, admins see:
> "You have used 90% of your AI governance add-on seats."
2. **Over-limit banner (admin-only):** When actual seats exceed the
license limit, admins see a prominent warning:
> "Your organization is using {actual} / {limit} AI Governance user
seats ({X}% over the limit). Contact sales@coder.com"
- Uses floor whole percentage (Go int division / `Math.floor`)
- Includes a clickable `mailto:sales@coder.com` link
> **PR Stack**
> 1. **#23351** ← `#23282` *(you are here)*
> 2. #23282 ← `#23275`
> 3. #23275 ← `#23349`
> 4. #23349 ← `main`
---
## Summary
`chatretry.Retry()` used pure exponential backoff (1 s, 2 s, 4 s, …) and
never consulted provider `Retry-After` headers. Fantasy's
`ProviderError` carries `ResponseHeaders` including `Retry-After`, but
`chaterror.Classify()` only parsed error text and silently dropped the
structured transport metadata.
This makes `Retry-After` a first-class signal in the classification →
retry pipeline.
<img width="853" height="346" alt="image"
src="https://github.com/user-attachments/assets/65f012b6-8173-43d2-957e-ab9faddea525"
/>
## Changes
### `coderd/chatd/chaterror/classify.go`
- Added `RetryAfter time.Duration` field to `ClassifiedError` — a
normalized minimum retry delay derived from provider response metadata.
- `Classify()` now calls `extractProviderErrorDetails()` before falling
back to text heuristics. Structured `ProviderError.StatusCode` takes
priority over regex extraction.
- `normalizeClassification()` preserves and clamps `RetryAfter`.
### `coderd/chatd/chaterror/provider_error.go` (new)
Provider-specific extraction, isolated from the text-based
classification logic:
- `extractProviderErrorDetails()` unwraps `*fantasy.ProviderError` from
the error chain via `errors.As`.
- `retryAfterFromHeaders()` parses headers in priority order:
1. `retry-after-ms` (OpenAI-specific, millisecond precision)
2. `retry-after` (standard HTTP — integer seconds or HTTP-date)
- Case-insensitive header key lookup.
### `coderd/chatd/chatretry/chatretry.go`
- `effectiveDelay(attempt, classified)` computes `max(Delay(attempt),
classified.RetryAfter)` — the provider hint acts as a floor without
weakening the local exponential backoff.
- `Retry()` now uses `effectiveDelay` and passes the effective delay to
both `onRetry(...)` and the sleep timer, so downstream payloads, logs,
and the frontend countdown stay aligned automatically.
### Tests
- `classify_test.go`: Structured provider status + `Retry-After`
extraction, `retry-after-ms` priority, HTTP-date parsing, invalid header
fallback, `WithProvider` preservation.
- `chatretry_test.go`: Retry-after-as-floor semantics — longer hint
wins, shorter hint keeps base delay.
## Design notes
- **No SDK/API/frontend changes needed.** `codersdk.ChatStreamRetry`
already carries `DelayMs` and `RetryingAt`, and the frontend already
consumes them. The fix is purely in the server-side delay computation.
- **Existing retryability rules unchanged.** This fixes *when* we sleep,
not *whether* an error is retryable.
- **Provider hint is a floor:** `max(baseDelay, RetryAfter)` ensures we
never retry earlier than the provider asks, and never weaken our own
backoff curve.
*Disclaimer: implemented by a Coder Agent using Claude Opus 4.6,
reviewed by me.*
Replace the transitional soft warning message:
> AI Bridge is now Generally Available in v2.30. In a future Coder
version, your deployment will require the AI Governance Add-On to
continue using this feature. Please reach out to your account team or
sales@coder.com to learn more.
with the definitive requirement message:
> The AI Governance Add-On is required to use AI Bridge. Please reach
out to your account team or sales@coder.com to learn more.
Updated in:
- `enterprise/coderd/license/license.go`
- `enterprise/coderd/license/license_test.go` (2 occurrences)
> **PR Stack**
> 1. #23351 ← `#23282`
> 2. #23282 ← `#23275`
> 3. **#23275** ← `#23349` *(you are here)*
> 4. #23349 ← `main`
---
## Summary
Extracts a structured error classification subsystem for agent chat
(`chatd`) so that retry and error payloads carry machine-readable
metadata — error kind, provider name, HTTP status code, and retryability
— instead of raw error strings.
This is the **backend half** of the error-handling work. The frontend
counterpart is in #23282.
## Changes
### New package: `coderd/chatd/chaterror/`
Canonical error classification — extracts error kind, provider, status
code, and user-facing message from raw provider errors. One source of
truth that drives both retry policy and stream payloads.
- **`kind.go`**: Error kind enum (`rate_limit`, `timeout`, `auth`,
`config`, `overloaded`, `unknown`).
- **`signals.go`**: Signal extraction — parses provider name, HTTP
status code, and retryability from error strings and wrapped types.
- **`classify.go`**: Classification logic — maps extracted signals to an
error kind.
- **`message.go`**: User-facing message templates keyed by kind +
signals.
- **`payload.go`**: Projectors that build `ChatStreamError` and
`ChatStreamRetry` payloads from a classified error.
### Modified
- **`codersdk/chats.go`**: Added `Kind`, `Provider`, `Retryable`,
`StatusCode` fields to `ChatStreamError` and `ChatStreamRetry`.
- **`coderd/chatd/chatretry/`**: Thinned to retry-policy only;
classification logic moved to `chaterror`.
- **`coderd/chatd/chatloop/`**: Added per-attempt first-chunk timeout
(60 s) via `guardedStream` wrapper — produces retryable
`startup_timeout` errors instead of hanging forever.
- **`coderd/chatd/chatd.go`**: Publishes normalized retry/error payloads
via `chaterror` projectors.
_Disclaimer:_ _produced_ _by_ _Claude_ _Opus_ _4\.6,_ _reviewed_ _by_ _me._
**This is a breaking change.** Users who are not have `owner` or sitewide `auditor` roles will no longer be able to view interceptions.
Regular users should not need to view this information; in fact, it could be used by a malicious insider to see what information we track and don't track to exfiltrate data or perform actions unobserved.
---
Changed authorization for AI Bridge interception-related operations from system-level permissions to resource-specific permissions. The following functions now authorize against `rbac.ResourceAibridgeInterception` instead of `rbac.ResourceSystem`:
- `ListAIBridgeTokenUsagesByInterceptionIDs`
- `ListAIBridgeToolUsagesByInterceptionIDs`
- `ListAIBridgeUserPromptsByInterceptionIDs`
Updated RBAC roles to grant AI Bridge interception permissions:
- **User/Member roles**: Can create and update AI Bridge interceptions but cannot read them back
- **Service accounts**: Same create/update permissions without read access
- **Owners/Auditors**: Retain full read access to all interceptions
Removed system-level authorization bypass in `populatedAndConvertAIBridgeInterceptions` function, allowing proper resource-level authorization checks.
Updated tests to reflect the new permission model where members cannot view AI Bridge interceptions, even their own, while owners and auditors maintain full visibility.
<!--
If you have used AI to produce some or all of this PR, please ensure you have read our [AI Contribution guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING) before submitting.
-->
_Disclaimer:_ _initially_ _produced_ _by_ _Claude_ _Opus_ _4\.6,_ _heavily_ _modified_ _and_ _reviewed_ _by_ _me._
Closes https://github.com/coder/internal/issues/1360
Adds a new `/api/v2/aibridge/sessions` API which returns "sessions".
Sessions, as defined in the [RFC](https://www.notion.so/coderhq/AI-Bridge-Sessions-Threads-2ccd579be59280f28021d3baf7472fbe?source=copy_link), are a set of interceptions logically grouped by a session key issued by the client.
The API design for this endpoint was done in [this doc](https://github.com/coder/internal/issues/1360).
If the client has not provided a session ID, we will revert to the thread root ID, and if that's not present we use the interception's own ID (i.e. a session of a single interception - which is effectively what we show currently in our `/api/v2/aibridge/interceptions` API).
The SQL query looks gnarly but it's relatively simple, and seems to perform well (~200ms) even when I import dogfood's `aibridge_*` tables into my workspace. If we need to improve performance on this later we can investigate materialized views, perhaps, but for now I don't think it's warranted.
---
_The PR looks large but it's got a lot of generated code; the actual changes aren't huge._
- Moves `coderd/chatd/`, `coderd/gitsync/`, `enterprise/coderd/chatd/`
under `x/` parent directories to signal instability
- Adds `Experimental:` glue code comments in `coderd/coderd.go`
> 🤖 This PR was created with the help of Coder Agents, and was
reviewed by my human. 🧑💻
- Changes all 41 chat method receivers in `codersdk/chats.go` from
`*Client` to `*ExperimentalClient` to ensure that callers are aware that
these reference potentially unstable `/api/experimental` endpoints.
> 🤖 This PR was created with the help of Coder Agents, and has been
reviewed by my human. 🧑💻
Partially addresses #21813 (still need to make changes to the "add user"
button to be complete)
Since there are a lot of user tests already, I moved them into
`coderdtest` to be shared.
> **PR Stack**
> 1. #23351 ← `#23282`
> 2. #23282 ← `#23275`
> 3. #23275 ← `#23349`
> 4. **#23349** ← `main` *(you are here)*
---
Retry events were published only to the local in-process stream via
`publishEvent()`. When pubsub is active, `Subscribe()`'s merge loop only
forwarded durable events (messages, status, errors) from pubsub
notifications,
so retry events were silently dropped for cross-replica subscribers.
This adds a `publishRetry()` helper that publishes both locally and via
pubsub,
and extends the `Subscribe()` notification handler to forward retry
events.
**Changes:**
- `coderd/pubsub/chatstreamnotify.go`: add `Retry` field to notify
message
- `coderd/chatd/chatd.go`: add `publishRetry()`, update `OnRetry`
callback,
extend `Subscribe()` to forward `notify.Retry`
- `coderd/chatd/chatd_internal_test.go`: focused pubsub delivery test
- `enterprise/coderd/chatd/chatd_test.go`: cross-replica end-to-end test
## Summary
Adds the database schema, API endpoints, SDK types, and encryption
wrappers for admin-managed MCP (Model Context Protocol) server
configurations that chatd can consume. This is the backend foundation
for allowing external MCP tools (Sentry, Linear, GitHub, etc.) to be
used during AI chat sessions.
## Database
Two new tables:
- **`mcp_server_configs`**: Admin-managed server definitions with URL,
transport (Streamable HTTP / SSE), auth config (none / OAuth2 / API key
/ custom headers), tool allow/deny lists, and an availability policy
(`force_on` / `default_on` / `default_off`). Includes CHECK constraints
on transport, auth_type, and availability values.
- **`mcp_server_user_tokens`**: Per-user OAuth2 tokens for servers
requiring individual authentication. Cascades on user/config deletion.
New column on `chats` table:
- **`mcp_server_ids UUID[]`**: Per-chat MCP server selection, following
the same pattern as `model_config_id` — passed at chat creation,
changeable per-message with nil-means-no-change semantics.
## API Endpoints
All routes are under `/api/experimental/mcp/servers/` and gated behind
the `agents` experiment.
**Admin endpoints** (`ResourceDeploymentConfig` auth):
- `POST /` — Create MCP server config
- `PATCH /{id}` — Update MCP server config (full-replace)
- `DELETE /{id}` — Delete MCP server config
**Authenticated endpoints** (all users, enabled servers only for
non-admins):
- `GET /` — List configs (admins see all, members see enabled-only with
admin fields redacted)
- `GET /{id}` — Get config by ID (with `auth_connected` populated
per-user)
**OAuth2 per-user auth flow:**
- `GET /{id}/oauth2/connect` — Initiate OAuth2 flow (state cookie CSRF
protection)
- `GET /{id}/oauth2/callback` — Handle OAuth2 callback, store tokens
- `DELETE /{id}/oauth2/disconnect` — Remove stored OAuth2 tokens
## Security
- **Secrets never returned**: `OAuth2ClientSecret`, `APIKeyValue`, and
`CustomHeaders` are never in API responses — only boolean indicators
(`has_oauth2_secret`, `has_api_key`, `has_custom_headers`).
- **Field redaction for non-admins**: `convertMCPServerConfigRedacted`
strips `OAuth2ClientID`, auth URLs, scopes, and `APIKeyHeader` from
non-admin responses.
- **dbcrypt encryption at rest**: All 5 secret fields use `dbcrypt_keys`
encryption with full encrypt-on-write / decrypt-on-read wrappers (11
dbcrypt method overrides + 2 helpers), following the same pattern as
`chat_providers.api_key`.
- **OAuth2 CSRF protection**: State parameter stored in `HttpOnly`
cookie with `HTTPCookies.Apply()` for correct `Secure`/`SameSite` behind
TLS-terminating proxies.
- **dbauthz authorization**: All 18 querier methods have authorization
wrappers. Read operations use `ActionRead`, write operations use
`ActionUpdate` on `ResourceDeploymentConfig`.
## Governance Model
| Control | Implementation |
|---------|---------------|
| **Global kill switch** | `enabled` defaults to `false` |
| **Availability policy** | `force_on` (always injected), `default_on`
(pre-selected), `default_off` (opt-in) |
| **Per-chat selection** | `mcp_server_ids` on `CreateChatRequest` /
`CreateChatMessageRequest` |
| **Auth gate** | OAuth2 servers require per-user auth before tools are
injected |
| **Tool-level allow/deny** | Arrays on `mcp_server_configs` for
granular tool filtering |
| **Secrets encrypted at rest** | Uses `dbcrypt_keys` (same pattern as
`chat_providers.api_key`) |
## Tests
8 test functions covering:
- Full CRUD lifecycle (create, list, update, delete)
- Non-admin visibility filtering (enabled-only, field redaction)
- `auth_connected` population for OAuth2 vs non-OAuth2 servers
- Availability policy validation (valid values + invalid rejection)
- Unique slug enforcement (409 Conflict)
- OAuth2 disconnect idempotency
- Chat creation with `mcp_server_ids` persistence
## Known Limitations (Deferred)
These are documented and intentional for an experimental feature:
- **Audit logging** not yet wired — will add when feature stabilizes
- **Cross-field validation** (e.g., OAuth2 fields required when
`auth_type=oauth2`) — admin-only endpoint, will add when stabilizing
- **`force_on` auto-injection** — query exists but not yet wired into
chatd tool injection (follow-up)
- **Additional test coverage** — 403 auth tests, GET-by-ID tests,
callback CSRF tests planned for follow-up
## What's NOT in this PR
- Frontend UI (admin panel + chat picker)
- Actual MCP client connections (`chatd/chatmcp/` manager)
- Tool injection into `chatloop/`
Introduce a three-way workspace sharing setting (none, everyone,
service_accounts) replacing the boolean workspace_sharing_disabled.
In service_accounts mode, only service account-owned workspaces can be
shared while regular members' share permissions are removed. Adds a
new organization-service-account system role with per-org permissions
reconciled alongside the existing organization-member system role.
Related to:
https://linear.app/codercom/issue/PLAT-28/feat-service-accounts-sharing-mode-and-rbac-role
---------
Co-authored-by: Steven Masley <Emyrk@users.noreply.github.com>
Co-authored-by: Kayla はな <mckayla@hey.com>
When a user uses an AI feature, we record them in the `ai_seat_state` as consuming a seat.
Added in debouching to prevent excessive writes to the db for this feature. There is no need for frequent updates.
File-reference parts in user messages were flattened to `TextContent` at
write time because fantasy has no file-reference content type. The
frontend never saw them as structured parts.
This moves all write paths (user, assistant, tool) from fantasy envelope
format to `codersdk.ChatMessagePart`. The streaming layer (`chatloop`)
is untouched, the conversion happens at the serialization boundary in
`persistStep`.
Old rows are still readable. `ParseContent` uses a structural heuristic
(`isFantasyEnvelopeFormat`) to distinguish legacy envelopes from SDK
parts. We chose this over try/fallback because fantasy envelopes
partially unmarshal into `ChatMessagePart` (the `type` field matches)
while silently losing content. A guard test enforces that no SDK part
can produce the envelope shape.
This is forward-only: new rows are unreadable by old code. Chat is
behind a feature flag so rollback risk is contained.
Also adds a typed `ChatMessageRole` to replace raw strings and
`fantasy.MessageRole*` casts at the persistence boundary. The type
covers `ChatMessage.Role`, `ChatStreamMessagePart.Role`, the
`PublishMessagePart` callback chain, and all DB write sites.
`fantasy.MessageRole*` remains only where we build `fantasy.Message`
structs for LLM dispatch.
Separately, `ProviderMetadata` was leaking to SSE clients via
`publishMessagePart`. `StripInternal` now runs on both the SSE and REST
paths, covering this.
Other cleanup:
- Old `db2sdk.contentBlockToPart` silently dropped metadata on
text/reasoning/tool-call content. New code preserves it.
- `providerMetadataToOptions` now logs warnings instead of silently
returning nil.
- `db2sdk` shrinks from ~250 lines of parallel conversion to ~15 lines
delegating to `chatprompt.ParseContent()`, removing the `fantasy` import
entirely.
Refs #22821
WaitBuffer is a thread-safe io.Writer that supports blocking until
accumulated output matches a substring or custom predicate. It
replaces ad-hoc safeBuffer/syncWriter types and time.Sleep-based
poll loops in tests with signal-driven waits.
- WaitFor/WaitForNth/WaitForCond for blocking on output
- Replace custom buffer types in cli/sync_test.go and
provisionersdk/agent_test.go
- Convert time.Sleep poll loops to require.Eventually/require.Never
in cli/ssh_test.go, coderd/activitybump_test.go,
coderd/workspaceagentsrpc_test.go, workspaceproxy_test.go, and
scaletest tests
Split from #22693 per review feedback.
Fixes multiple bugs in coderd/chatd and sub-packages including race
conditions, transaction safety, stream buffer bounds, retry limits, and
enterprise relay improvements.
See commit message for full list.