Emit user secret audit log entries for create/update/delete operations.
Reads stay un-audited, matching every other resource.
Audit log entries record changes in user secret name, environment
variable name, file path, and value. The secret value column is marked
`ActionSecret` so the diff records the change without showing the
ciphertext or plaintext.
Closes a TOCTOU window on delete to ensure no phantom audit logs for a
delete of a non-existent secret. Secret update accepts a small TOCTOU
window matching the other audited resources (templates, workspaces,
chats). The two-query pattern is wrapped in a transaction so audit state
can't leak from a failed mutation.
(cherry picked from commit 1c30d52b2b)
<!--
If you have used AI to produce some or all of this PR, please ensure you
have read our [AI Contribution
guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING)
before submitting.
-->
Co-authored-by: Zach <3724288+zedkipp@users.noreply.github.com>
The agents-access role previously granted chat permissions at user
scope, but chats are org-scoped objects. Rego skips user-level perms
when org_owner is set, making the grants invisible. Handler-level
band-aids used synthetic non-org-scoped objects as a workaround.
- Migrates agents-access from users.rbac_roles (site-level) to
organization_members.roles (org-scoped) via DB migration
- Redefines agents-access as a predefined org-scoped builtin role
alongside organization-admin, organization-auditor, etc., with
Member permissions granting chat create/read/update
- Excludes ResourceChat from OrgMemberPermissions so org membership
alone no longer grants chat access
- Fixes handler Authorize checks to use org-scoped objects with
semantically correct actions (ActionUpdate for message/tool operations)
- Grants org admins the ability to assign agents-access
Closes#24250
Fixes CODAGT-174
Note: this does not update the "Usage" endpoints. Tracked by CODAGT-161.
> 🤖
*Disclaimer: implemented by a Coder Agent using Claude Opus 4.6*
Marks the `GET /api/v2/aibridge/interceptions` endpoint as deprecated in
favor of `/aibridge/sessions`, which provides richer session-level
aggregation including threads and agentic actions.
Changes:
- Add `@Deprecated` Swagger annotation to the endpoint handler
- Add deprecation notice to the
`codersdk.Client.AIBridgeListInterceptions` method
- Regenerated OpenAPI spec with `"deprecated": true` flag
The endpoint remains fully functional.
Fixes https://github.com/coder/internal/issues/1339
Fixes https://github.com/coder/internal/issues/1474.
PR #24549 introduced a `quartz.NewMock` clock +
`Trap().NewTimer("drain")` to
remove the wall-clock race. However, the trap consumed only **one**
`NewTimer("drain")` call via `MustWait/MustRelease`.
The merge loop has two code paths that create drain timers with the same
tag:
- Relay result handler (`drainAndClose` path in `relayReadyCh` case):
when an async dial completes after `drainAndClose` was set.
- Status notification handler (`relayParts != nil` branch in
`statusNotifications` case): when `status!=running` arrives while an
active relay exists.
Depending on goroutine scheduling, one or both paths fire. When two
calls hit
the trap, the second blocks the merge loop in `matchCallLocked` (quartz
waits
for all traps to release). Since the test already moved past `MustWait`,
nobody
reads the second call from the trap channel, deadlocking the test.
- Replace the single `MustWait/MustRelease/Advance` with a goroutine
that
loops over `trapDrain.Wait`, releasing and advancing for every drain
timer.
- No production code changes.
> 🤖
Previously, the sessions list sorted by `MIN(started_at)` across
interceptions, so sessions with old start times but recent activity
would sink to the bottom of the list regardless of how recently they
were used.
`ListAIBridgeSessions` now sorts by `COALESCE(MAX(prompt.created_at),
MIN(started_at)) DESC`, exposed as the non-nullable `last_active_at`
field. Sessions with prompts surface by last activity; sessions with no
prompts fall back to their start time.
The original implementation used two separate columns (`last_active_at`
as a nullable prompt timestamp and `sort_at` as the non-nullable cursor
key). This revision collapses them into a single `last_active_at` that
is always set — simplifying the SQL, the Go conversion, the API type,
and the frontend.
🤖 Generated with [Claude Code](https://claude.ai/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This PR merges code from `coder/aibridge` repository into `coder/coder`.
It was split into 4 PRs for easier review but stacked PRs will need to
be merged into this PR so all checks pass.
* https://github.com/coder/coder/pull/24190 -> raw code copy (this PR,
before merging PRs on top of it, it was just 1 commit:
https://github.com/coder/coder/commit/70d33f33200c7e77df910957595715f81f9bec24)
* https://github.com/coder/coder/pull/24570 -> update imports in
`coder/coder` to use copied code
* https://github.com/coder/coder/pull/24586 -> linter fixes and CI
integration (also added README.md)
* https://github.com/coder/coder/pull/24571 -> added exclude to
scripts/check_emdash.sh check
Original PR message (before PR squash):
Moves coder/aibridge code into coder/coder repository.
Omitted files:
- `go.mod`, `go.sum`, `.gitignore`, `.github/workflows/ci.yml,`
`Makefile`, `LICENSE`, `README.md` (modified README.md is added later)
- `.github`, `example`, `buildinfo,` `scripts` directories
Simple verification script (will list omitted files)
```
tmp=$(mktemp -d)
echo "$tmp"
git clone --depth=1 https://github.com/coder/aibridge "$tmp/aibridge"
git clone --depth=1 --branch pb/aibridge-code-move https://github.com/coder/coder "$tmp/coder"
diff -rq --exclude=.git "$tmp/aibridge" "$tmp/coder/aibridge"
# rm -rf "$tmp"
```
Fixes the flake reported in
https://github.com/coder/internal/issues/1474.
- Use a `quartz.NewMock` clock for the subscriber with a drain timer
trap, so the 200ms relay drain fires only when explicitly advanced —
fully deterministic, no wall-clock race
- Give each `testutil.Eventually` call its own context so one slow
assertion cannot starve subsequent ones of their shared 25s deadline
> 🤖
- Replaces the hard-coded 500ms reconnect timer for dialing chat relays with exponential backoff via `coder/retry`.
- `dialRelay` drops the `codersdk.ExperimentalClient.StreamChat` wrapper
and calls `websocket.Dial` directly so we can capture
`*http.Response.StatusCode` without parsing error strings.
- Adds `RelayDialError` that exposes the HTTP status from `websocket.Dial`
- Modifies retry logic: 401/403 tear the stream down immediately, 5xx/network/timeouts retry
then tear down on cap. Outer stream closes cleanly so the browser SDK
reconnects with a fresh cookie.
- Retry state resets on successful dial and on target-worker change, not
on every `closeRelay()`.
> 🤖 Generated by Coder Agents.
* Adds `streamJanitorLoop` to clean up stale streams every 30s
* zeroes dropped slots to aid in gc-eligibliity
* Adds regression tests in coderd/x/chatd and enterprise/coderd/x/chatd
> 🤖
Add a `chat_client_type` enum (`ui` | `api`) and `client_type` column to
the `chats` table. The column defaults to `api` for new rows so API
callers don't need to set it explicitly. Existing rows are backfilled to
`ui`.
The field flows through `CreateChatRequest`, `chatd.CreateOptions`,
`InsertChat`, and is returned in the `Chat` response via `db2sdk`.
<details>
<summary>Implementation notes (Coder Agents generated)</summary>
### Changes
**Database migration (000469)**
- New enum `chat_client_type` with values `ui`, `api`.
- New `client_type` column, `NOT NULL DEFAULT 'api'`.
- Backfill: `UPDATE chats SET client_type = 'ui'`.
**SQL query** — `InsertChat` now includes `client_type`.
**SDK** — `ChatClientType` type added; `ClientType` field added to both
`CreateChatRequest` (optional, defaults server-side to `api`) and `Chat`
response.
**Handler** — `postChats` maps the request field (defaulting to `api`)
and passes it through `chatd.CreateOptions`.
**Sub-agent** — Child chats inherit their parent's `client_type`.
**db2sdk** — Maps the database value to the SDK type.
### Decision log
- Default is `api` (not `ui`) so existing API integrations get the
correct value without code changes.
- Backfill sets existing rows to `ui` per requirement.
- Child chats inherit `client_type` from parent rather than defaulting.
</details>
Relates to https://github.com/coder/internal/issues/1455
From that issue:
> Going to skip this test until the underlying race in chatd is fixed.
https://github.com/coder/coder/pull/24279 was a band-aid fix that I no
longer think is valuable pursuing short term. Hugo is working on a RFC
for a redesign of the system to prevent the class of race condition into
the future.
## Summary
Adds `--ai-gateway-allow-byok` deployment option to control whether
users can use Bring Your Own Key (BYOK) mode with AI Gateway.
When disabled (`--ai-gateway-allow-byok=false`), BYOK requests are
rejected with a 403 and a message directing the admin to enable the
flag. Centralized key authentication works regardless of this setting.
Defaults to `true` (BYOK allowed).
---------
Co-authored-by: Danny Kopping <danny@coder.com>
Addresses review findings from #23827 that were added post-merge:
- Persisted attachments now store `organizationId`; mismatched orgs
pruned on restore
- Workspace selection reconciliation: stale IDs from previous orgs
dropped via derived `effectiveWorkspaceId`
- Org picker uses `permittedOrganizations()` for RBAC-aware filtering
- Org picker hidden when user belongs to only one org
- Ref-sync `useEffect` replaced with `useEffectEvent`
- `CreateWorkspace()` and `ListTemplates()` take `organizationID` and
`db` as required function parameters instead of optional struct fields —
compiler enforces them, removes scattered nil guards
- Cross-org template check in `CreateWorkspace` is now unconditional
- `ListTemplates` org-scoping filter now has test coverage
- `setupChatInfra` comment fixed; test helpers use params structs
instead of positional UUIDs
- Enterprise test documents that org admin only sees own chats (handler
hardcodes `OwnerID` — future work needs sidebar UI before lifting that
restriction)
> 🤖
Three SQL queries (`GetUserGroupSpendLimit`,
`ResolveUserChatSpendLimit`, `GetUserChatSpendInPeriod`) aggregated chat
spend limits and usage globally across all organizations. A restrictive
group limit in org A would bleed into org B.
## Changes
- Add `organization_id` parameter to all three SQL queries in
`coderd/database/queries/chats.sql`
- When nil UUID is passed, queries fall back to global behavior
(backward compat for HTTP dashboard endpoints)
- When real org ID is passed, limits and spend are scoped to that
organization
- Thread `organizationID` through `ResolveUsageLimitStatus` →
`checkUsageLimit` → all chatd call sites
- Update dbauthz wrappers for new param structs
- HTTP endpoints (`chatCostSummary`, `getMyChatUsageLimitStatus`) pass
`uuid.Nil` with TODO for future org-scoped UI
- Add `TestResolveUsageLimitStatus_OrgScoped` with 5 test cases covering
org isolation, nil-UUID fallback, spend scoping, and user override
priority
Closescoder/internal#1466
> 🤖
## Summary
Adds a `sanitizeCredentialHint` safety check in the db-to-SDK conversion
layer to ensure credential hints are always masked before being exposed
in the API. Also adds `credential_kind` and `credential_hint` assertions
to the session threads API test.
Fixes https://github.com/coder/internal/issues/1436
* Adds organization_id to chats with backfill (workspace org → user org membership → default org)
* No support yet for ACLs (follow-up issue)
- Cross-org workspace binding rejected (both in `CreateChatRequest` and in `create_workspace` tool
- Adds `OrganizationAutocomplete` to `AgentCreateForm`
- Docs updated with `organization_id` in chats-api.md
> 🤖 Written by a Coder Agent. Reviewed by many humans and many agents.
---------
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
Fixescoder/internal#1455
Three changes to eliminate the timing-sensitive flake in
`TestSubscribeRelayEstablishedMidStream`:
1. **Reduce `PendingChatAcquireInterval` from `time.Hour` to
`time.Second`.**
The primary trigger is still `signalWake()` from `SendMessage`, but a
short fallback poll ensures the worker picks up the pending chat
even under heavy CI goroutine scheduling contention.
2. **Increase context timeout from `WaitLong` (25s) to `WaitSuperLong`
(60s).**
The worker pipeline (model resolution, message loading, LLM call)
involves multiple DB round-trips that can be slow when PostgreSQL
is shared with many parallel test packages.
3. **Add a status-polling loop while waiting for the streaming
request.**
If the worker errors out during chat processing, the test now
fails immediately with the error status and message instead of
silently timing out.
> Generated by Coder Agents
Audit and connection log pages were timing out due to expensive COUNT(*)
queries over large tables. This commit adds opt-in count capping: requests can
return a `count_cap` field signaling that the count was truncated at a threshold,
avoiding full table scans that caused page timeouts.
Text-cast UUID comparisons in regosql-generated authorization queries
also contributed to the slowdown by preventing index usage for connection
and audit log queries. These now emit native UUID operators.
Frontend changes handle the capped state in usePaginatedQuery and
PaginationWidget, optionally displaying a capped count in the pagination
UI (e.g. "Showing 2,076 to 2,100 of 2,000+ logs")
Related to:
https://linear.app/codercom/issue/PLAT-31/connectionaudit-log-performance-issue
Two new columns added to aibridge_token_usages:
- cache_read_input_tokens (BIGINT, default 0)
- cache_write_input_tokens (BIGINT, default 0)
Migration backfills existing rows by extracting values from the metadata
JSONB column (cache_read_input, input_cached, prompt_cached for reads
(max value selected since only 1 should be set), cache_creation_input
for writes).
All references to data from metadata were updated to reference new
columns. No other changes then changing where data is extracted from.
Requires aibridge library version bump to include:
https://github.com/coder/aibridge/pull/229
Fixes: https://github.com/coder/aibridge/issues/150
When the `agents` experiment is enabled, new users are automatically
granted the `agents-access` role at creation time so they can use Coder
Agents without manual admin intervention.
- Auto-assigns in `CreateUser()` — covers admin API, OAuth, and OIDC
creation paths
- Skips auto-assign for OIDC users when enterprise site role sync is
enabled (sync overwrites roles on every login; those admins should use
`--oidc-user-role-default` instead)
- CLI `create-admin-user` bypasses `CreateUser()` but creates `owner`
users who already have all permissions
> 🤖 Written by a Coder Agent. Will be reviewed by a human.
Previously, `CreateChat` inserted the `chats` row with the DB default
status (`waiting`), then updated it to `pending` in the same transaction
via `setChatPendingWithStore`. This wasted two extra queries per chat
creation (`GetChatByID` + `UpdateChatStatus`) and rewrote the same row
immediately after inserting it.
Now `CreateChat` passes the status directly to `InsertChat`, so the row
is written once in its final create-time state. The
`setChatPendingWithStore` helper is removed entirely. `InsertChat` now
requires an explicit `status` parameter at all callsites instead of
relying on a DB column default.
## Motivation
On an experimental branch we're trialing firing all chatd notifications
from plpgsql triggers. The old two-step insert made that awkward: in an
`AFTER INSERT` trigger, `NEW` only contained the insert-time row
(`waiting`), not the final committed state (`pending`). To emit the
correct event payload the trigger had to be deferred and re-read the row
from `chats` at commit time.
With this change, `NEW` already contains the correct row to publish — no
deferred trigger, no extra `SELECT`, simpler and cheaper trigger logic.
That said, this seems like a worthwhile change regardless of the trigger
experiment: writing the final row state once removes unnecessary DB work
on every chat creation and makes the create path easier to reason about.
Removes 6 fragile `require.Equal(t, codersdk.ChatStatusPending,
chat.Status)` assertions from chat relay and creation tests.
**Root cause**: In HA tests with two replicas sharing the same DB, the
worker can acquire a just-created chat (flipping `pending → running` via
`AcquireChats`) before the HTTP response reaches the test. All affected
tests already synchronize via `require.Eventually` waiting for `running`
status, making the initial assertion both redundant and racy.
- Remove 5 assertions in `enterprise/coderd/exp_chats_test.go` (all
`TestChatStreamRelay` subtests)
- Remove 1 assertion in `coderd/exp_chats_test.go` (`TestPostChats`)
- An existing comment in `TestPostChats/Success` already documents this
exact race
Fixes flake:
https://github.com/coder/coder/actions/runs/23807597632/job/69385425724
> 🤖 Written by a Coder Agent. Will be reviewed by a human.
_Disclaimer: produced using Claude Opus 4.6, reviewed by me, and
validated against Dogfood dataset._
The `ListAIBridgeSessions` query materialized and aggregated all
matching interceptions before paginating, then ran expensive
token/prompt lookups across the full dataset. For a page of 25 sessions
against ~200k interceptions (our dogfood dataset), this meant:
- Three CTEs scanning all rows (filtered_interceptions, session_tokens,
session_root)
- ARRAY_AGG(fi.id) collecting every interception ID per session
- Lateral prompt lookup via ANY(array_of_all_ids) running for every
session, not just the page
- ~90MB of disk sorts and JIT compilation kicking in
The improvement is to restructure to paginate first and enrich after: a
single CTE groups interceptions into sessions with only cheap aggregates
(MIN, MAX, COUNT), applies cursor pagination and LIMIT, then lateral
joins fetch metadata, tokens, and prompts for just the ~25-row page.
Measured against 220k interceptions / 160k sessions:
| Metric | Before | After |
|--------------------|--------|-------|
| Execution time | 1800ms | 185ms |
| Shared buffer hits | 737k | 2.6k |
| Disk sort spill | 86MB | 16MB |
| Lateral loops | 160k | 25 |
https://grafana.dev.coder.com/goto/fbODPGtvR?orgId=1 the results are
identical, just _much_ faster.
---
Also includes some additional tests which I added prior to refactoring
the query to ensure no regressions on edge-cases.
---------
Signed-off-by: Danny Kopping <danny@coder.com>
- Add `chat-access` built-in role granting chat CRUD at User scope
- Exclude `ResourceChat` from member, org member, and org service
account `allPermsExcept` calls
- Allow system, owner, and user-admin to assign the new role
- Migration auto-assigns role to users who have ever created a chat
- Update RBAC test matrix: `memberMe` denied, `chatAccessUser` allowed
**Breaking change**: Members without `chat-access` lose chat creation
ability. Migration covers existing chat creators. Members who have never
created a chat do not get this role automatically applied.
> 🤖 This PR was created by a Coder Agent and reviewed by me.
These chatd relay tests were seeding chats through
`subscriber.CreateChat(...)`, which wakes the subscriber and can race
local acquisition against the intended remote-worker setup.
Seed waiting and remote-running chats directly in the database instead,
and point the default OpenAI provider at a local safety-net server so
accidental processing fails locally instead of reaching the live API.
Closes https://github.com/coder/internal/issues/1430
Closes#22136
This pull-request implements a `<ClientFilter />` to our `Request Logs`
page for AI Bridge. This will allow the user to select a client which
they wish to filter against. Technically the backend is able to actually
filter against multiple clients at once however the frontend doesn't
currently have a nice way of supporting this (future improvement).
<img width="1447" height="831" alt="image"
src="https://github.com/user-attachments/assets/0be234e2-25f2-4a89-b971-d74817395da1"
/>
---------
Co-authored-by: Jeremy Ruppel <jeremy.ruppel@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
Add site-wide banners for AI Governance seat usage thresholds:
1. **90% capacity warning (admin-only):** When actual AI Governance
seats are ≥90% and <100% of the license limit, admins see:
> "You have used 90% of your AI governance add-on seats."
2. **Over-limit banner (admin-only):** When actual seats exceed the
license limit, admins see a prominent warning:
> "Your organization is using {actual} / {limit} AI Governance user
seats ({X}% over the limit). Contact sales@coder.com"
- Uses floor whole percentage (Go int division / `Math.floor`)
- Includes a clickable `mailto:sales@coder.com` link
> **PR Stack**
> 1. **#23351** ← `#23282` *(you are here)*
> 2. #23282 ← `#23275`
> 3. #23275 ← `#23349`
> 4. #23349 ← `main`
---
## Summary
`chatretry.Retry()` used pure exponential backoff (1 s, 2 s, 4 s, …) and
never consulted provider `Retry-After` headers. Fantasy's
`ProviderError` carries `ResponseHeaders` including `Retry-After`, but
`chaterror.Classify()` only parsed error text and silently dropped the
structured transport metadata.
This makes `Retry-After` a first-class signal in the classification →
retry pipeline.
<img width="853" height="346" alt="image"
src="https://github.com/user-attachments/assets/65f012b6-8173-43d2-957e-ab9faddea525"
/>
## Changes
### `coderd/chatd/chaterror/classify.go`
- Added `RetryAfter time.Duration` field to `ClassifiedError` — a
normalized minimum retry delay derived from provider response metadata.
- `Classify()` now calls `extractProviderErrorDetails()` before falling
back to text heuristics. Structured `ProviderError.StatusCode` takes
priority over regex extraction.
- `normalizeClassification()` preserves and clamps `RetryAfter`.
### `coderd/chatd/chaterror/provider_error.go` (new)
Provider-specific extraction, isolated from the text-based
classification logic:
- `extractProviderErrorDetails()` unwraps `*fantasy.ProviderError` from
the error chain via `errors.As`.
- `retryAfterFromHeaders()` parses headers in priority order:
1. `retry-after-ms` (OpenAI-specific, millisecond precision)
2. `retry-after` (standard HTTP — integer seconds or HTTP-date)
- Case-insensitive header key lookup.
### `coderd/chatd/chatretry/chatretry.go`
- `effectiveDelay(attempt, classified)` computes `max(Delay(attempt),
classified.RetryAfter)` — the provider hint acts as a floor without
weakening the local exponential backoff.
- `Retry()` now uses `effectiveDelay` and passes the effective delay to
both `onRetry(...)` and the sleep timer, so downstream payloads, logs,
and the frontend countdown stay aligned automatically.
### Tests
- `classify_test.go`: Structured provider status + `Retry-After`
extraction, `retry-after-ms` priority, HTTP-date parsing, invalid header
fallback, `WithProvider` preservation.
- `chatretry_test.go`: Retry-after-as-floor semantics — longer hint
wins, shorter hint keeps base delay.
## Design notes
- **No SDK/API/frontend changes needed.** `codersdk.ChatStreamRetry`
already carries `DelayMs` and `RetryingAt`, and the frontend already
consumes them. The fix is purely in the server-side delay computation.
- **Existing retryability rules unchanged.** This fixes *when* we sleep,
not *whether* an error is retryable.
- **Provider hint is a floor:** `max(baseDelay, RetryAfter)` ensures we
never retry earlier than the provider asks, and never weaken our own
backoff curve.
*Disclaimer: implemented by a Coder Agent using Claude Opus 4.6,
reviewed by me.*
Replace the transitional soft warning message:
> AI Bridge is now Generally Available in v2.30. In a future Coder
version, your deployment will require the AI Governance Add-On to
continue using this feature. Please reach out to your account team or
sales@coder.com to learn more.
with the definitive requirement message:
> The AI Governance Add-On is required to use AI Bridge. Please reach
out to your account team or sales@coder.com to learn more.
Updated in:
- `enterprise/coderd/license/license.go`
- `enterprise/coderd/license/license_test.go` (2 occurrences)
> **PR Stack**
> 1. #23351 ← `#23282`
> 2. #23282 ← `#23275`
> 3. **#23275** ← `#23349` *(you are here)*
> 4. #23349 ← `main`
---
## Summary
Extracts a structured error classification subsystem for agent chat
(`chatd`) so that retry and error payloads carry machine-readable
metadata — error kind, provider name, HTTP status code, and retryability
— instead of raw error strings.
This is the **backend half** of the error-handling work. The frontend
counterpart is in #23282.
## Changes
### New package: `coderd/chatd/chaterror/`
Canonical error classification — extracts error kind, provider, status
code, and user-facing message from raw provider errors. One source of
truth that drives both retry policy and stream payloads.
- **`kind.go`**: Error kind enum (`rate_limit`, `timeout`, `auth`,
`config`, `overloaded`, `unknown`).
- **`signals.go`**: Signal extraction — parses provider name, HTTP
status code, and retryability from error strings and wrapped types.
- **`classify.go`**: Classification logic — maps extracted signals to an
error kind.
- **`message.go`**: User-facing message templates keyed by kind +
signals.
- **`payload.go`**: Projectors that build `ChatStreamError` and
`ChatStreamRetry` payloads from a classified error.
### Modified
- **`codersdk/chats.go`**: Added `Kind`, `Provider`, `Retryable`,
`StatusCode` fields to `ChatStreamError` and `ChatStreamRetry`.
- **`coderd/chatd/chatretry/`**: Thinned to retry-policy only;
classification logic moved to `chaterror`.
- **`coderd/chatd/chatloop/`**: Added per-attempt first-chunk timeout
(60 s) via `guardedStream` wrapper — produces retryable
`startup_timeout` errors instead of hanging forever.
- **`coderd/chatd/chatd.go`**: Publishes normalized retry/error payloads
via `chaterror` projectors.
_Disclaimer:_ _produced_ _by_ _Claude_ _Opus_ _4\.6,_ _reviewed_ _by_ _me._
**This is a breaking change.** Users who are not have `owner` or sitewide `auditor` roles will no longer be able to view interceptions.
Regular users should not need to view this information; in fact, it could be used by a malicious insider to see what information we track and don't track to exfiltrate data or perform actions unobserved.
---
Changed authorization for AI Bridge interception-related operations from system-level permissions to resource-specific permissions. The following functions now authorize against `rbac.ResourceAibridgeInterception` instead of `rbac.ResourceSystem`:
- `ListAIBridgeTokenUsagesByInterceptionIDs`
- `ListAIBridgeToolUsagesByInterceptionIDs`
- `ListAIBridgeUserPromptsByInterceptionIDs`
Updated RBAC roles to grant AI Bridge interception permissions:
- **User/Member roles**: Can create and update AI Bridge interceptions but cannot read them back
- **Service accounts**: Same create/update permissions without read access
- **Owners/Auditors**: Retain full read access to all interceptions
Removed system-level authorization bypass in `populatedAndConvertAIBridgeInterceptions` function, allowing proper resource-level authorization checks.
Updated tests to reflect the new permission model where members cannot view AI Bridge interceptions, even their own, while owners and auditors maintain full visibility.
<!--
If you have used AI to produce some or all of this PR, please ensure you have read our [AI Contribution guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING) before submitting.
-->
_Disclaimer:_ _initially_ _produced_ _by_ _Claude_ _Opus_ _4\.6,_ _heavily_ _modified_ _and_ _reviewed_ _by_ _me._
Closes https://github.com/coder/internal/issues/1360
Adds a new `/api/v2/aibridge/sessions` API which returns "sessions".
Sessions, as defined in the [RFC](https://www.notion.so/coderhq/AI-Bridge-Sessions-Threads-2ccd579be59280f28021d3baf7472fbe?source=copy_link), are a set of interceptions logically grouped by a session key issued by the client.
The API design for this endpoint was done in [this doc](https://github.com/coder/internal/issues/1360).
If the client has not provided a session ID, we will revert to the thread root ID, and if that's not present we use the interception's own ID (i.e. a session of a single interception - which is effectively what we show currently in our `/api/v2/aibridge/interceptions` API).
The SQL query looks gnarly but it's relatively simple, and seems to perform well (~200ms) even when I import dogfood's `aibridge_*` tables into my workspace. If we need to improve performance on this later we can investigate materialized views, perhaps, but for now I don't think it's warranted.
---
_The PR looks large but it's got a lot of generated code; the actual changes aren't huge._
- Moves `coderd/chatd/`, `coderd/gitsync/`, `enterprise/coderd/chatd/`
under `x/` parent directories to signal instability
- Adds `Experimental:` glue code comments in `coderd/coderd.go`
> 🤖 This PR was created with the help of Coder Agents, and was
reviewed by my human. 🧑💻
- Changes all 41 chat method receivers in `codersdk/chats.go` from
`*Client` to `*ExperimentalClient` to ensure that callers are aware that
these reference potentially unstable `/api/experimental` endpoints.
> 🤖 This PR was created with the help of Coder Agents, and has been
reviewed by my human. 🧑💻
Partially addresses #21813 (still need to make changes to the "add user"
button to be complete)
Since there are a lot of user tests already, I moved them into
`coderdtest` to be shared.
> **PR Stack**
> 1. #23351 ← `#23282`
> 2. #23282 ← `#23275`
> 3. #23275 ← `#23349`
> 4. **#23349** ← `main` *(you are here)*
---
Retry events were published only to the local in-process stream via
`publishEvent()`. When pubsub is active, `Subscribe()`'s merge loop only
forwarded durable events (messages, status, errors) from pubsub
notifications,
so retry events were silently dropped for cross-replica subscribers.
This adds a `publishRetry()` helper that publishes both locally and via
pubsub,
and extends the `Subscribe()` notification handler to forward retry
events.
**Changes:**
- `coderd/pubsub/chatstreamnotify.go`: add `Retry` field to notify
message
- `coderd/chatd/chatd.go`: add `publishRetry()`, update `OnRetry`
callback,
extend `Subscribe()` to forward `notify.Retry`
- `coderd/chatd/chatd_internal_test.go`: focused pubsub delivery test
- `enterprise/coderd/chatd/chatd_test.go`: cross-replica end-to-end test