coder

mirror of https://github.com/coder/coder.git synced 2026-06-02 20:48:20 +00:00

Author	SHA1	Message	Date
Kyle Carberry	19e44f4136	fix: target specific chat in MarkStale instead of broadcasting to all workspace chats (#23883 ) ## Problem Subagent chats were receiving git context (branch, remote origin, PR status) from their parent or sibling chats' git operations. When a git operation triggers external auth, the workspace agent sends `chat_id` identifying which chat initiated it — but this was broken at two levels: 1. Agent side: `CODER_CHAT_ID` was never injected into process environments. `chatd` sets `Coder-Chat-Id` HTTP headers and the agent extracts them for process isolation, but never propagated `CODER_CHAT_ID` to `cmd.Env`. So `gitaskpass` always sent an empty `chat_id`. 2. Server side: `workspaceAgentsExternalAuth` ignored the `chat_id` query param. `MarkStale` broadcast git context to all chats on the workspace via `filterChatsByWorkspaceID`. ## Fix - Inject `CODER_CHAT_ID` into `cmd.Env` in `agentproc` when the chat ID is known, so `gitaskpass` can read and forward it. - Read `chat_id` from query params in `workspaceAgentsExternalAuth` and thread it through `chatGitRef`. - Refactor `MarkStale` to accept a `MarkStaleParams` struct. When `ChatID` is provided, target only that specific chat. When empty (legacy agents, non-chat git operations), fall back to the existing workspace-wide broadcast. - Extract `markStaleSingle` helper to deduplicate the upsert+publish logic. 
<details><summary>Investigation notes</summary> ### Data flow before fix ``` chatd → sets Coder-Chat-Id header on agent conn agent → extracts chatID, stores on process struct agent → does NOT set CODER_CHAT_ID in cmd.Env ← gap 1 gitaskpass → reads CODER_CHAT_ID (always empty), sends chat_id="" server handler → ignores chat_id query param ← gap 2 MarkStale → broadcasts to ALL workspace chats ``` ### Data flow after fix ``` chatd → sets Coder-Chat-Id header on agent conn agent → extracts chatID, stores on process struct agent → sets CODER_CHAT_ID in cmd.Env gitaskpass → reads CODER_CHAT_ID, sends chat_id=<uuid> server handler → reads chat_id, passes to MarkStale MarkStale → targets only that specific chat ``` </details>	2026-04-01 13:04:59 +00:00
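The targeting rule described in the `MarkStale` fix above can be sketched as follows. This is a minimal illustration, not the code from the PR: only the `MarkStaleParams` name and its `ChatID` field come from the commit message, while the `WorkspaceID` field, the `chat` type, and the `chatsToMarkStale` helper are hypothetical.

```go
package main

import "fmt"

// MarkStaleParams borrows its name from the commit message; the exact
// field set here is an assumption made for illustration.
type MarkStaleParams struct {
	WorkspaceID string
	ChatID      string // empty for legacy agents or non-chat git operations
}

// chat is a hypothetical stand-in for a stored chat row.
type chat struct {
	ID          string
	WorkspaceID string
}

// chatsToMarkStale returns the IDs of the chats that should receive the
// stale git context. With a ChatID it targets exactly that chat; without
// one it falls back to every chat on the workspace.
func chatsToMarkStale(all []chat, p MarkStaleParams) []string {
	var ids []string
	for _, c := range all {
		if c.WorkspaceID != p.WorkspaceID {
			continue
		}
		if p.ChatID != "" && c.ID != p.ChatID {
			continue
		}
		ids = append(ids, c.ID)
	}
	return ids
}

func main() {
	chats := []chat{{"parent", "ws1"}, {"sub", "ws1"}, {"other", "ws2"}}
	// Targeted: only the subagent chat that initiated the git operation.
	fmt.Println(chatsToMarkStale(chats, MarkStaleParams{WorkspaceID: "ws1", ChatID: "sub"}))
	// Legacy fallback: every chat on the workspace.
	fmt.Println(chatsToMarkStale(chats, MarkStaleParams{WorkspaceID: "ws1"}))
}
```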
Kyle Carberry	7861fcf1f6	perf(coderd): stop inline-resolving diff status on every GetChat call (#23901 ) ## Problem Every `GET /api/experimental/chats/{chatID}` call was blocking for 200-800ms because the `getChat` handler called `resolveChatDiffStatus`, which unconditionally hit the git provider API (e.g. GitHub's `GET /repos/{owner}/{repo}/pulls?head=...`) via `ResolveBranchPullRequest` — even when the cached diff status was fresh. This made every chat page load at `/agents/{id}` noticeably slow. ## Root cause The call chain was: 1. `getChat` → `resolveChatDiffStatus` 2. `resolveChatDiffStatus` → `resolveChatDiffReference` → `gp.ResolveBranchPullRequest(...)` (external HTTP call) 3. Only after the external call: `chatDiffStatusIsStale(status, now)` check The staleness check happened after the expensive work, so every request paid the cost regardless of cache freshness. ## Fix `getChat` now returns the cached `chat_diff_statuses` row directly from the database. The background `gitsync` worker already keeps these rows fresh (every `DiffStatusTTL = 120s`), so inline resolution was redundant. The `resolveChatDiffContents` endpoint (which fetches actual diff content) still uses the full resolution path since it needs to make provider API calls by design. ## Changes - `getChat` reads cached diff status from DB instead of calling `resolveChatDiffStatus` - Remove `resolveChatDiffStatus` (dead code — no production callers) - Remove `chatDiffStatusIsStale` and `chatDiffStatusTTL` (dead code) - Remove `RefreshesStaleStatusWithExternalAuth` test (tested the removed inline refresh path) <details><summary>Decision log</summary> - Why not just add a staleness gate? The background worker already handles refreshes on the same schedule. Adding an early-return-if-fresh would work but leaves dead code for the stale path that's never exercised in production (the worker gets there first). 
Removing the inline path entirely is simpler and eliminates the external API dependency from the read path. - Why keep `resolveChatDiffContents` unchanged? That endpoint's job is to fetch the actual diff content from the provider, so external API calls are inherent to its purpose. </details>	2026-04-01 12:08:13 +00:00
Cian Johnston	d6df78c9b9	chore: remove racy ChatStatusPending assertions after CreateChat (#23882 ) Removes 6 fragile `require.Equal(t, codersdk.ChatStatusPending, chat.Status)` assertions from chat relay and creation tests. Root cause: In HA tests with two replicas sharing the same DB, the worker can acquire a just-created chat (flipping `pending → running` via `AcquireChats`) before the HTTP response reaches the test. All affected tests already synchronize via `require.Eventually` waiting for `running` status, making the initial assertion both redundant and racy. - Remove 5 assertions in `enterprise/coderd/exp_chats_test.go` (all `TestChatStreamRelay` subtests) - Remove 1 assertion in `coderd/exp_chats_test.go` (`TestPostChats`) - An existing comment in `TestPostChats/Success` already documents this exact race Fixes flake: https://github.com/coder/coder/actions/runs/23807597632/job/69385425724 > 🤖 Written by a Coder Agent. Will be reviewed by a human.	2026-04-01 10:00:50 +01:00
Ethan	5cba59af79	fix(coderd): unarchive child chats with parents (#23761) Unarchiving a root chat now restores descendant chats in the database and emits lifecycle events for every affected chat so passive sessions converge without a full refetch. This keeps archive and unarchive symmetric at both the data and watch-stream layers by returning the affected chat family from the database, using those post-update rows for chatd pubsub fanout, and covering descendant lifecycle delivery with a watch-level regression test. Closes #23666	2026-04-01 15:30:25 +11:00
Cian Johnston	a164d508cf	fix(coderd/x/chatd): gate control subscriber to ignore stale pubsub notifications (#23865 ) Fixes flaky `TestOpenAIReasoningWithWebSearchRoundTripStoreFalse` and `TestOpenAIReasoningWithWebSearchRoundTrip`. ## Changes - Gate the `processChat` control subscriber's cancel callback behind a `chan struct{}` that is closed after publishing `"running"` status - Add `TestGatedControlCancel` with 4 subtests exercising the gate logic <details> <summary>Root cause analysis</summary> `SendMessage` publishes a `"pending"` notification on `chat:stream:<chatID>` via PostgreSQL `NOTIFY`. `processChat` subscribes to the same channel for control signals. Due to async NOTIFY delivery, the `"pending"` notification can arrive at the control subscriber after it registers its queue — even though it was published before. `shouldCancelChatFromControlNotification("pending")` returns `true`, immediately self-interrupting the processor before it does any work. The fix gates the cancel callback behind a closed channel. The channel is closed after `processChat` publishes `"running"` status, so stale notifications from before initialization are harmlessly ignored. `close()` provides a happens-before guarantee in the Go memory model. </details> > 🤖 Written by a Coder Agent. Reviewed by a human.	2026-03-31 22:55:20 +01:00
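The gating idea in the root cause analysis above can be sketched roughly like this. It assumes a cancel callback wrapped behind a `chan struct{}` as the commit describes; the `gatedCancel` helper name and the counter are hypothetical and only serve the illustration.

```go
package main

import "fmt"

// gatedCancel wraps cancel so that it only fires once gate has been closed.
// Calls made before the gate is closed are dropped, which is how a stale
// notification from before initialization would be ignored.
func gatedCancel(gate <-chan struct{}, cancel func()) func() {
	return func() {
		select {
		case <-gate:
			// Gate closed: initialization finished, honour the signal.
			cancel()
		default:
			// Gate not closed yet: treat the notification as stale.
		}
	}
}

func main() {
	gate := make(chan struct{})
	cancelled := 0
	onNotify := gatedCancel(gate, func() { cancelled++ })

	onNotify()  // stale "pending" notification arriving early: ignored
	close(gate) // the processor has published "running"
	onNotify()  // a real control signal: cancels

	fmt.Println(cancelled)
}
```

A receive from a closed channel never blocks, so after `close(gate)` every later call takes the first branch.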
Michael Suchacz	e2bbd12137	test(coderd/x/chatd): remove flaky OpenAI round-trip tests (#23877)	2026-03-31 17:04:56 -04:00
Yevhenii Shcherbina	84b94a8376	feat: add chatgpt support for aibridge proxy (#23826) Add ChatGPT support for AIBridgeProxy	2026-03-31 12:54:38 -04:00
Cian Johnston	2a990ce758	feat: show friendly alert for missing agents-access role (#23831 ) Replaces the generic red `ErrorAlert` ("Forbidden.") with a proactive permission check and friendly info alert when a user lacks the `agents-access` role. - Add `createChat` permission check to `permissions.json` using `owner_id: "me"` - Handle `"me"` owner substitution in `renderPermissions` (SSR path) - Pass `canCreateChat` from `useAuthenticated().permissions` into `AgentCreateForm` - Show `ChatAccessDeniedAlert` and disable input immediately (no need to trigger a 403 first) - Also catch 403 errors as a fallback in case permissions aren't yet loaded - Add `ForbiddenNoAgentsRole` Storybook story with `play` assertions - Add `TestRenderPermissionsResolvesMe` Go test to pin the `"me"` sentinel substitution <details><summary>Implementation plan & decision log</summary> - Uses the existing `permissions.json` + `checkAuthorization` system rather than a separate API call - `owner_id: "me"` is resolved to the actor's ID by both the auth-check API endpoint and the SSR `renderPermissions` function - Go test uses a real `rbac.StrictCachingAuthorizer` (not a mock) so it verifies both the sentinel substitution and the RBAC role evaluation end-to-end - Alert follows the exact same `Alert` pattern as the 409 usage-limit block - Uses `severity="info"` and links to the getting-started docs Step 3 - Textarea is disabled proactively so the user never sees the scary generic error </details> > 🤖 Created by a Coder Agent and will be reviewed by a human.	2026-03-31 17:26:58 +01:00
Yevhenii Shcherbina	9440adf435	feat: add chatgpt support for aibridge (#23822) Registers a new aibridge provider for ChatGPT by reusing the existing OpenAI provider with a different `Name` and `BaseURL` (https://chatgpt.com/backend-api/codex). The ChatGPT backend API is OpenAI-compatible, so no new provider type is needed. ChatGPT authenticates exclusively via per-user OAuth JWTs (BYOK mode) — no centralized API key is configured. The OpenAI provider already handles this: when no key is set, it falls through to the bearer token from the request's Authorization header. Depends on #23811	2026-03-31 12:08:45 -04:00
Susana Ferreira	b0036af57b	feat: register multiple Copilot providers for business and enterprise upstreams (#23811 ) ## Description Adds support for multiple Copilot provider instances to route requests to different Copilot upstreams (individual, business, enterprise). Each instance has its own name and base URL, enabling per-upstream metrics, logs, circuit breakers, API dump, and routing. ## Changes * Add Copilot business and enterprise provider names and host constants * Register three Copilot provider instances in aibridged (default, business, enterprise) * Update `defaultAIBridgeProvider` in `aibridgeproxy` to route new Copilot hosts to their corresponding providers ## Related * Depends on: https://github.com/coder/aibridge/pull/240 * Closes: https://github.com/coder/aibridge/issues/152 Note: documentation changes will be added in a follow-up PR. _Disclaimer: initially produced by Claude Opus 4.6, heavily modified and reviewed by @ssncferreira ._	2026-03-31 16:00:37 +01:00
Ethan	bbf3fbc830	fix(coderd/x/chatd): archive chat hard-interrupts active stream (#23758) Archiving a chat now transitions pending or running chats to waiting before setting the archived flag. This publishes a status notification on `ChatStreamNotifyChannel` so `subscribeChatControl` cancels the active `processChat` context via `ErrInterrupted` — the same codepath used by the stop button. The `processChat` cleanup also skips queued-message auto-promotion when the chat is archived, so archiving behaves like a hard stop rather than interrupt-and-continue. Relates to https://github.com/coder/coder/issues/23666	2026-04-01 00:23:52 +11:00
Danny Kopping	9fa103929a	perf: make `ListAIBridgeSessions` 10x faster (#23774) _Disclaimer: produced using Claude Opus 4.6, reviewed by me, and validated against Dogfood dataset._ The `ListAIBridgeSessions` query materialized and aggregated all matching interceptions before paginating, then ran expensive token/prompt lookups across the full dataset. For a page of 25 sessions against ~200k interceptions (our dogfood dataset), this meant: - Three CTEs scanning all rows (filtered_interceptions, session_tokens, session_root) - ARRAY_AGG(fi.id) collecting every interception ID per session - Lateral prompt lookup via ANY(array_of_all_ids) running for every session, not just the page - ~90MB of disk sorts and JIT compilation kicking in The improvement is to restructure to paginate first and enrich after: a single CTE groups interceptions into sessions with only cheap aggregates (MIN, MAX, COUNT), applies cursor pagination and LIMIT, then lateral joins fetch metadata, tokens, and prompts for just the ~25-row page. Measured against 220k interceptions / 160k sessions:

| Metric | Before | After |
|--------------------|--------|-------|
| Execution time | 1800ms | 185ms |
| Shared buffer hits | 737k | 2.6k |
| Disk sort spill | 86MB | 16MB |
| Lateral loops | 160k | 25 |

https://grafana.dev.coder.com/goto/fbODPGtvR?orgId=1 The results are identical, just _much_ faster. --- Also includes some additional tests which I added prior to refactoring the query to ensure no regressions on edge-cases. --------- Signed-off-by: Danny Kopping <danny@coder.com>	2026-03-31 14:42:23 +02:00
Michael Suchacz	af678606fc	fix(coderd/x/chatd): stabilize flaky request-count assertion in round-trip test (#23843) The flaky test assumed the second streamed OpenAI request had already been captured when the chat status event arrived. In practice, the capture server can record that second request slightly later, which intermittently left `streamRequestCount` at `1`. This change waits for the second captured request before asserting on the follow-up payload and relaxes the count check to a sanity check. The test still verifies the `store=false` round-trip behavior without depending on that timing race. Fixes coder/internal#1433	2026-03-31 13:09:11 +02:00
Cian Johnston	3ce82bb885	feat: add chat-access site-wide role to gate chat creation (#23724 ) - Add `chat-access` built-in role granting chat CRUD at User scope - Exclude `ResourceChat` from member, org member, and org service account `allPermsExcept` calls - Allow system, owner, and user-admin to assign the new role - Migration auto-assigns role to users who have ever created a chat - Update RBAC test matrix: `memberMe` denied, `chatAccessUser` allowed Breaking change: Members without `chat-access` lose chat creation ability. Migration covers existing chat creators. Members who have never created a chat do not get this role automatically applied. > 🤖 This PR was created by a Coder Agent and reviewed by me.	2026-03-31 10:07:21 +01:00
Kyle Carberry	b3d5b8d13c	fix: stabilize flaky chatd subscribe/promote queued tests (#23816) ## Summary Fixes three flaky chatd tests that intermittently fail due to timing races with the background run loop. Closes coder/internal#1428 ## Root Cause `CreateChat` and `PromoteQueued` call `signalWake()` which writes to `wakeCh`, triggering `processOnce` immediately. Even though `newTestServer` sets `PendingChatAcquireInterval: testutil.WaitLong` to prevent ticker-based polling, the wake channel bypasses this. This causes `processOnce` to acquire and process the chat concurrently with the test's manual DB updates and assertions. ### Failing tests

| Test | Failure | Cause |
|------|---------|-------|
| `TestPromoteQueuedAllowsAlreadyQueuedMessageWhenUsageLimitReached` | `expected: "pending", actual: "running"` | Wake from `CreateChat` races with manual `UpdateChatStatus`; wake from `PromoteQueued` acquires the chat before the status assertion |
| `TestSendMessageInterruptBehaviorQueuesAndInterruptsWhenBusy` | `should have 1 item(s), but has 2` | Wake from `CreateChat` triggers `processChat` which auto-promotes a queued message, adding an extra row to `chat_messages` |
| `TestSubscribeNoPubsubNoDuplicateMessageParts` | `Condition satisfied` (duplicate events) | Pre-existing `WaitGroup.Add/Wait` race in the `Eventually` + `WaitUntilIdleForTest` pattern |

## Fix Introduces a `waitForChatProcessed` helper that: 1. Polls until the chat reaches a terminal state (not pending AND not running) 2. Then calls `WaitUntilIdleForTest` to wait for the inflight `WaitGroup` Waiting for a terminal state (not just "not pending") avoids a `sync.WaitGroup` `Add/Wait` race: `AcquireChats` updates the DB status to `running` before `processOnce` calls `inflight.Add(1)`. Checking only `status != pending` could return while `Add(1)` hasn't happened yet, causing `Wait()` to return prematurely. 
### Per-test changes - `TestSendMessageInterruptBehaviorQueuesAndInterruptsWhenBusy`: Call `waitForChatProcessed` after `CreateChat` before manually setting running status - `TestPromoteQueuedAllowsAlreadyQueuedMessageWhenUsageLimitReached`: Call `waitForChatProcessed` after `CreateChat`; remove the inherently racy `status == pending` assertion after `PromoteQueued` (the wake immediately acquires the chat). Key assertions on promoted message, queue state, and message count remain. - `TestSubscribeNoPubsubNoDuplicateMessageParts`: Replace inline `Eventually` with the safer `waitForChatProcessed` helper ## Verification All three tests pass 150 consecutive executions with `-race -count=10` across 15 runs (0 failures).	2026-03-30 18:23:47 +00:00
Kyle Carberry	a5cc579453	feat: add last_injected_context column to chats table (#23798 ) Adds a nullable JSONB column `last_injected_context` to the `chats` table that stores the most recently persisted injected context parts (AGENTS.md context-file and skill message parts). The column is updated only when `persistInstructionFiles()` runs — on first workspace attach or when the agent changes — so there are no redundant writes on subsequent turns. Internal fields (`ContextFileContent`, `ContextFileOS`, `ContextFileDirectory`, `SkillDir`) are stripped at write time so the column only holds small metadata. No stripping needed on the read path. <details> <summary>Implementation notes</summary> - New migration `000456` adds nullable `last_injected_context JSONB` column. - New SQL query `UpdateChatLastInjectedContext` writes the column without touching `updated_at`. - `persistInstructionFiles()` strips internal fields from parts via `StripInternal()` before persisting. - Sentinel path (no AGENTS.md) persists skill-only parts when skills exist. - `codersdk.Chat` exposes `LastInjectedContext []ChatMessagePart` (omitempty). - `db2sdk.Chat()` passes through the already-clean data. </details>	2026-03-30 14:11:30 -04:00
Susana Ferreira	0fb3e5cba5	feat: extract, log, and strip aibridgeproxy request ID header in aibridged (#23731 ) ## Problem `aibridgeproxyd` sends `X-AI-Bridge-Request-Id` on every MITM request to `aibridged` for cross-service log correlation, but aibridged never reads it. The header is silently forwarded to upstream LLM providers. ## Changes * Renamed the header to `X-Coder-AI-Governance-Request-Id` to match the existing `X-Coder-AI-Governance-` convention. `aibridged` now extracts the header, logs it and strips it before forwarding upstream. * Added `TestServeHTTP_StripInternalHeaders` to verify no `X-Coder-*` headers leak to upstream	2026-03-30 15:21:30 +01:00
Michael Suchacz	73f6cd8169	feat: suffix-based chat agent selection (#23741 ) Adds suffix-based agent selection for chatd. Template authors can direct chat traffic to a specific root workspace agent by naming it with the `-coderd-chat` suffix (for example, `coder_agent "dev-coderd-chat"`). When no suffix match exists, chatd falls back to the first root agent by `DisplayOrder`, then `Name`. Multiple suffix matches return an error. The selection logic lives in `coderd/x/chatd/internal/agentselect` and is shared by chatd core plus the workspace chat tools so all chat entry points pick the same agent deterministically. No database migrations, API contract changes, or provider changes. The experimental sandbox template was split out to #23777.	2026-03-30 11:43:59 +00:00
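The selection rule in the suffix-based agent commit above can be sketched as below. This is an illustrative reading of the commit message, not the contents of `coderd/x/chatd/internal/agentselect`: the `agent` type, the `selectChatAgent` function, and the error texts are hypothetical, and the input is assumed to already contain only root agents.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// chatSuffix is the naming suffix mentioned in the commit message.
const chatSuffix = "-coderd-chat"

// agent is a hypothetical, trimmed-down view of a root workspace agent.
type agent struct {
	Name         string
	DisplayOrder int
}

// selectChatAgent picks the single suffix-named agent if there is one,
// errors on several suffix matches, and otherwise falls back to the first
// agent ordered by DisplayOrder, then Name.
func selectChatAgent(agents []agent) (agent, error) {
	if len(agents) == 0 {
		return agent{}, errors.New("no root agents available")
	}
	var matches []agent
	for _, a := range agents {
		if strings.HasSuffix(a.Name, chatSuffix) {
			matches = append(matches, a)
		}
	}
	if len(matches) == 1 {
		return matches[0], nil
	}
	if len(matches) > 1 {
		return agent{}, fmt.Errorf("%d agents match suffix %q", len(matches), chatSuffix)
	}
	// No suffix match: sort a copy so the caller's slice is left untouched.
	sorted := append([]agent(nil), agents...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].DisplayOrder != sorted[j].DisplayOrder {
			return sorted[i].DisplayOrder < sorted[j].DisplayOrder
		}
		return sorted[i].Name < sorted[j].Name
	})
	return sorted[0], nil
}

func main() {
	a, err := selectChatAgent([]agent{{"main", 0}, {"dev-coderd-chat", 5}})
	fmt.Println(a.Name, err)
}
```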
Ethan	54738e9e14	test(coderd/x/chatd): avoid zero-ttl config cache flake (#23762) This fixes a flaky `TestConfigCache_UserPrompt_ExpiredEntryRefetches` by making the seeded user prompt entry unambiguously expired before the cache lookup runs. The test previously inserted a `tlru` entry with a zero TTL, which depends on `Set` and `Get` landing in different clock ticks. Switching that seed entry to a negative TTL keeps the bounded `tlru` cache behavior while removing the same-tick race. Close https://github.com/coder/internal/issues/1432	2026-03-30 17:51:51 +11:00
Kyle Carberry	4d2b0a2f82	feat: persist skills as message parts like AGENTS.md (#23748) ## Summary Skills are now discovered once on the first turn (or when the workspace agent changes) and persisted as `skill` message parts alongside `context-file` parts. On subsequent turns, the skill index is reconstructed from persisted parts instead of re-dialing the workspace agent. This makes skills consistent with the AGENTS.md pattern and is groundwork for a future `/context` endpoint that surfaces loaded workspace context to the frontend. ## Changes - Add `skill` `ChatMessagePartType` with `SkillName` and `SkillDescription` fields - Extend `persistInstructionFiles` to also discover and persist skills as parts - Add `skillsFromParts()` to reconstruct skill index from persisted parts on subsequent turns - Update `runChat()` to use `skillsFromParts` instead of re-dialing workspace for skills - Frontend: handle new `skill` part type (skip rendering, hide metadata-only messages) ## Before / After

| | AGENTS.md | Skills |
|---|---|---|
| Before | Persist as `context-file` parts, reconstruct from parts | In-memory `skillsCache` only, re-dial workspace on cache miss |
| After | Persist as `context-file` parts, reconstruct from parts | Persist as `skill` parts, reconstruct from parts |

The in-memory `skillsCache` remains for `read_skill`/`read_skill_file` tool calls that need full skill bodies on demand. <details><summary>Design context</summary> This is the first step toward a unified workspace context representation. Currently: - Context files are persisted as message parts (works) - Skills were only in-memory (inconsistent) - Workspace MCP servers are cached in-memory (future work) Persisting skills as parts means a future `/context` endpoint can query both context files and skills from the same message parts in the DB, without depending on ephemeral server-side caches. </details>	2026-03-29 21:48:17 -04:00
Kyle Carberry	be99b3cb74	fix: prioritize context cancellation in WebSocket sendEvent (#23756 ) ## Problem Commit `386b449` (PR #23745) changed the `OneWayWebSocketEventSender` event channel from unbuffered to buffered(64) to reduce chat streaming latency. This introduced a nondeterministic race in `sendEvent`: ```go sendEvent := func(event codersdk.ServerSentEvent) error { select { case eventC <- event: // buffered channel — almost always ready case <-ctx.Done(): // also ready after cancellation } return nil } ``` After context cancellation, Go's `select` randomly picks between two ready cases, so `send()` sometimes returns `nil` instead of `ctx.Err()`. With the old unbuffered channel the send case was rarely ready (no reader), masking the bug. ## Fix Add a priority `select` that checks `ctx.Done()` before attempting the channel send: ```go select { case <-ctx.Done(): return ctx.Err() default: } select { case eventC <- event: case <-ctx.Done(): return ctx.Err() } ``` This is the standard Go pattern for prioritizing one channel over another. When the context is already cancelled, the first select returns immediately. The second select still handles the case where cancellation happens concurrently with the send. ## Verification - Ran the flaky test 20× in a loop (`-count=20`): all passed - Ran the full `TestOneWayWebSocketEventSender` suite 5× (`-count=5`): all passed - Ran the complete `coderd/httpapi` test package: all passed Fixes coder/internal#1429	2026-03-29 20:11:30 -04:00
Michael Suchacz	bfeb91d9cd	fix: scope title regeneration per chat (#23729) Previously, generating a new agent title used a page-global pending state, so one in-flight regeneration disabled the action for every chat in the Agents UI. This change tracks regenerations by chat ID, updates the Agents page contracts to use `regeneratingTitleChatIds`, and adds sidebar story coverage that proves only the active chat is disabled.	2026-03-29 00:01:53 +01:00
Kyle Carberry	386b449273	perf(coderd): reduce chat streaming latency with event-driven acquisition (#23745 ) Previously, when a user sent a message, there was a 0–1000ms (avg ~500ms) polling delay before processing began. `SendMessage`/`CreateChat`/`EditMessage` set `status='pending'` in the DB and returned, but nothing woke the processing loop — it was a blind 1-second ticker. ## Changes Event-driven acquisition (main change): Adds a `wakeCh` channel to the chatd `Server`. `CreateChat`, `SendMessage`, `EditMessage`, and `PromoteQueued` call `signalWake()` after committing their transactions, which wakes the run loop to call `processOnce` immediately. The 1-second ticker remains as a fallback safety net for edge cases (stale recovery, missed signals). Buffer WebSocket write channel: Changes the `OneWayWebSocketEventSender` event channel from unbuffered to buffered (64), decoupling the event producer from WebSocket write speed. The existing 10s write timeout guards against stuck connections. <details><summary>Implementation plan & analysis</summary> The full latency analysis identified these sources of delay in the streaming pipeline: 1. Chat acquisition polling — 0–1000ms (avg 500ms) dead time per message. Fixed by wake channel. 2. Unbuffered WebSocket write channel — each token blocked on the previous WS write completing. Fixed by buffering. 3. PersistStep DB transaction per step — `FOR UPDATE` lock + batch insert. Not addressed in this PR (medium risk, would overlap DB write with next provider TTFB). 4. Multi-hop channel pipeline — 4 channel hops per token. Not addressed (medium complexity). </details> <details><summary>Test stabilization notes</summary> `signalWake()` causes the chatd daemon to process chats immediately after creation/send/edit, which exposed timing assumptions in several tests that expected chats to remain in `pending` status long enough to assert on. 
These tests were updated with `require.Eventually` + `WaitUntilIdleForTest` patterns to wait for processing to settle before asserting. The race detector (`test-go-race-pg`) shows failures in `TestCreateWorkspaceTool_EndToEnd` and `TestAwaitSubagentCompletion` — these appear to be pre-existing races in the end-to-end chat flow that are now exercised more aggressively because processing starts immediately instead of after a 1s delay. Main branch CI (race detector) passes without these changes. </details>	2026-03-28 15:26:42 -04:00
Michael Suchacz	91217a97b9	fix(coderd/x/chatd): guard title generation meta replies (#23708) Short prompts were producing title-generation meta responses such as "I am a title generator" and prompt-echo titles. This rewrites the automatic and manual title prompts to be shorter, less self-referential, and more focused on returning only the title text. The change also removes the broader post-generation guard layer, updates manual regeneration to send real conversation text instead of a meta instruction, and keeps regression coverage focused on the slimmer prompt contract.	2026-03-28 15:58:53 +01:00
Jake Howell	71a492a374	feat: implement `<ClientFilter />` to AI Bridge request logs (#22694 ) Closes #22136 This pull-request implements a `<ClientFilter />` to our `Request Logs` page for AI Bridge. This will allow the user to select a client which they wish to filter against. Technically the backend is able to actually filter against multiple clients at once however the frontend doesn't currently have a nice way of supporting this (future improvement). <img width="1447" height="831" alt="image" src="https://github.com/user-attachments/assets/0be234e2-25f2-4a89-b971-d74817395da1" /> --------- Co-authored-by: Jeremy Ruppel <jeremy.ruppel@gmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 17:18:28 -04:00
Kyle Carberry	839165818b	feat(coderd/x/chatd): add skills discovery and tools for chatd (#23715 ) Adds skill discovery and tools to chatd so the agent can discover and load `.agents/skills/` from workspaces, following the same pattern as AGENTS.md instruction loading and MCP tool discovery. ## What changed ### `chattool/skill.go` — discovery, loading, and tools - DiscoverSkills — walks `.agents/skills/` via `conn.LS()` + `conn.ReadFile()`, parses SKILL.md frontmatter (name + description), validates kebab-case names match directory names, silently skips broken/missing entries. - FormatSkillIndex — renders a compact `<available-skills>` XML block for system prompt injection (~60 tokens for 3 skills). Progressive disclosure: only names + descriptions in context, full body loaded on demand. - LoadSkillBody / LoadSkillFile — on-demand loading with path traversal protection and size caps (64KB for SKILL.md, 512KB for supporting files). - read_skill / read_skill_file tools — `fantasy.AgentTool` implementations following the same pattern as ReadFile and WorkspaceMCPTool. Receive pre-discovered `[]SkillMeta` via closure to avoid re-scanning on every call. ### `chatd.go` — integration into runChat - Skills discovered in the `g2` errgroup parallel with instructions and MCP tools. - `skillsCache` (sync.Map) per chat+agent, same invalidation pattern as MCP tools cache. - Skill index injected via `InsertSystem` after workspace instructions. - Re-injected in `ReloadMessages` callback so it survives compaction. - `read_skill` + `read_skill_file` tools registered when skills are present (for both root and subagent chats). - Cache cleaned up in `cleanupStreamIfIdle` alongside MCP tools cache. ## Format compatibility Uses the same `.agents/skills/<name>/SKILL.md` format as [coder/mux](https://github.com/coder/mux) and [openai/codex](https://github.com/openai/codex).	2026-03-27 15:22:13 -04:00
Kyle Carberry	bcdc35ee3e	feat: add chat read/unread indicator to sidebar (#23129 ) ## Summary Adds read/unread tracking for chats so users can see which agent conversations have new assistant messages they haven't viewed. ## Backend Changes - Adds `last_read_message_id` column to the `chats` table (migration 000439). - Computes `has_unread` as a virtual column in `GetChatsByOwnerID` using an `EXISTS` subquery checking for assistant messages beyond the read cursor. - Exposes `has_unread` on the `codersdk.Chat` struct and auto-generated TypeScript types. - Updates `last_read_message_id` on stream connect/disconnect in `streamChat`, avoiding per-message API calls during active streaming. - Uses `context.WithoutCancel` for the deferred disconnect write so the DB update succeeds even after the client disconnects. ## Frontend Changes - Bold title (`font-semibold`) for unread chats in the sidebar. - Small blue dot indicator next to the relative timestamp. - Suppresses unread indicator for the currently active chat via `isActive` from NavLink. ## Design Decisions - Only `assistant` messages count as unread — the user's own messages don't trigger the indicator. - No foreign key on `last_read_message_id` since messages can be deleted (via rollback/truncation) and the column is just a high-water mark. - Zero API calls during streaming: exactly 2 DB writes per stream session (connect + disconnect). - Unread state refreshes on chat list load and window focus. The `watchChats` WebSocket optimistically marks non-active chats as unread on `status_change` events, but does not carry a server-computed `has_unread` field. Navigating to a chat optimistically clears its unread indicator in the cache.	2026-03-27 12:15:04 -04:00
Cian Johnston	a5c72ba396	fix(coderd/agentapi): trim whitespace from workspace agent metadata values (#23709 ) - Trim leading/trailing whitespace from metadata `value` and `error` fields before storage - Trimming happens before length validation so whitespace-padded values are handled correctly - Add `TrimWhitespace` test covering spaces, tabs, newlines, and preserved inner whitespace - No backfill needed (unlogged table, stores only latest value) > 🤖 Created by a Coder Agent, reviewed by me.	2026-03-27 15:08:47 +00:00
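The ordering in the metadata commit above, trimming before length validation, can be sketched as follows. The `sanitizeMetadataValue` function, the limit passed to it, and the choice to return an error for over-long values are all hypothetical; the commit message only states that trimming happens first.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeMetadataValue trims surrounding whitespace and only then checks
// the length, so padding alone cannot push a value over the limit.
// Whitespace inside the value is preserved.
func sanitizeMetadataValue(value string, maxLen int) (string, error) {
	trimmed := strings.TrimSpace(value)
	if len(trimmed) > maxLen {
		return "", fmt.Errorf("value is %d bytes, limit is %d", len(trimmed), maxLen)
	}
	return trimmed, nil
}

func main() {
	// 9 bytes of content padded with spaces, a tab, and a newline.
	v, err := sanitizeMetadataValue("  \tload: 0.4\n  ", 16)
	fmt.Printf("%q %v\n", v, err)
}
```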
Cian Johnston	3f55b35f68	refactor: replace AsSystemRestricted with narrower actors (#23712 ) Replace overly-broad `AsSystemRestricted` with purpose-built actors: - OAuth2 provider paths → `AsSystemOAuth2` (13 call sites across `tokens.go`, `registration.go`, `apikey.go`) - Provisioner daemon health read → `AsSystemReadProvisionerDaemons` (1 site in `healthcheck/provisioner.go`) - Provisionerd file cache paths → `AsProvisionerd` (2 sites in `provisionerdserver.go`, matching existing usage nearby) <details> <summary>Implementation notes</summary> Each replacement actor is a strict subset of `AsSystemRestricted`. Every DB method at each call site is already covered by the narrower actor's permissions: - `subjectSystemOAuth2`: OAuth2App/Secret/CodeToken (all), ApiKey (Read, Delete), User (Read), Organization (Read) - `subjectSystemReadProvisionerDaemons`: ProvisionerDaemon (Read) - `subjectProvisionerd`: File (Create, Read) plus provisionerd-scoped resources No new permissions added. `nolint:gocritic` comments updated to reflect the new actors. </details> > 🤖 Created by a Coder Agent, reviewed by me.	2026-03-27 15:08:30 +00:00
Kyle Carberry	d973a709df	feat: add model_intent option to MCP server configs (#23717 ) Add a per-MCP-server `model_intent` toggle that wraps tool schemas with a `model_intent` field, requiring the LLM to provide a human-readable description of each tool call's purpose. The intent string is shown as a status label in the UI instead of opaque tool names, and is transparently stripped before the call reaches the remote MCP server. Built-in tools have rich specialized renderers (terminal blocks, file diffs, etc.) and don't need this. MCP tools hit `GenericToolRenderer` which only shows raw tool names and JSON — that's where model_intent adds value. The model learns what to provide via the JSON Schema `description` on the `model_intent` property itself — no system prompt changes needed. <details> <summary>Implementation details</summary> ### Architecture Inspired by the `withModelIntent()` pattern from `coder/blink`, adapted for Go + React. The wrapping is entirely in the `mcpclient` layer — tool implementations never see `model_intent`. Schema wrapping (`mcpToolWrapper.Info()`): When enabled, wraps the original tool parameters under a `properties` key and adds a `model_intent` string field with a rich description that teaches the model inline. Input unwrapping (`mcpToolWrapper.Run()`): Strips `model_intent` and unwraps `properties` before forwarding to the remote MCP server. Handles three input shapes models may produce: 1. `{ model_intent, properties: {...} }` — correct format 2. `{ model_intent, key: val, ... }` — flat, no wrapper 3. Malformed — falls through gracefully Frontend extraction: `streamState.ts` extracts `model_intent` from incrementally parsed streaming JSON. `messageParsing.ts` extracts it from persisted tool call args. UI rendering: `GenericToolRenderer` shows the capitalized intent string as the primary label when available, falling back to the raw tool name. 
### Changes - Database: `model_intent` boolean column on `mcp_server_configs` - SDK: `ModelIntent` field on config/create/update types - API: pass-through in create/update handlers + converter - mcpclient: schema wrapping in `Info()`, input unwrapping in `Run()` - Frontend: extraction from streaming + persisted args - UI: intent label in `GenericToolRenderer`, toggle in admin panel - Tests: 6 new tests (schema wrapping, unwrapping, passthrough, fallback) ### Decision log - Option lives on MCPServerConfig, not model config: Built-in tools already have rich renderers; only MCP tools benefit from model_intent. - No system prompt changes: The JSON Schema `description` on the `model_intent` property teaches the model inline. - Pointer bool on update request: Follows existing pattern (`*bool`) so PATCH requests don't reset the value when omitted. </details>	2026-03-27 14:23:25 +00:00
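The input shapes listed for `mcpToolWrapper.Run()` in #23717 could be handled along these lines. This is a sketch under assumptions: `unwrapModelIntent` is a hypothetical name, and treating "falls through gracefully" as "forward the input unchanged" is an interpretation, not something the commit message spells out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// unwrapModelIntent returns the intent string (if any) and the arguments to
// forward to the remote MCP server with model_intent stripped.
func unwrapModelIntent(raw []byte) (string, []byte) {
	var in map[string]json.RawMessage
	if err := json.Unmarshal(raw, &in); err != nil {
		// Shape 3: malformed input. Assumed behavior: pass it through.
		return "", raw
	}

	var intent string
	if v, ok := in["model_intent"]; ok {
		_ = json.Unmarshal(v, &intent) // a non-string intent is ignored
		delete(in, "model_intent")
	}

	// Shape 1: { model_intent, properties: {...} }.
	if props, ok := in["properties"]; ok {
		var obj map[string]json.RawMessage
		if json.Unmarshal(props, &obj) == nil && obj != nil {
			return intent, props
		}
	}

	// Shape 2: { model_intent, key: val, ... } with no wrapper.
	out, err := json.Marshal(in)
	if err != nil {
		return intent, raw
	}
	return intent, out
}

func main() {
	inputs := []string{
		`{"model_intent":"List open issues","properties":{"repo":"coder/coder"}}`,
		`{"model_intent":"List open issues","repo":"coder/coder"}`,
		`not json`,
	}
	for _, in := range inputs {
		intent, args := unwrapModelIntent([]byte(in))
		fmt.Printf("intent=%q args=%s\n", intent, args)
	}
}
```

One consequence of this shape detection is that a tool with a genuine top-level object parameter named `properties` would be unwrapped by mistake in the flat case; the sketch does not try to resolve that ambiguity.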
Kyle Carberry	50c0c89503	fix(coderd): refresh expired MCP OAuth2 tokens everywhere (#23713 ) Fixes expired MCP OAuth2 tokens causing 401 errors and stale `auth_connected` status in the UI. When users authenticate MCP servers (e.g. GitHub) via OAuth2, the access token and refresh token are stored in the database. However, when the access token expired, nothing refreshed it anywhere: - chatd: sent the expired token as-is, getting a 401 and skipping the MCP server - list/get endpoints: reported `auth_connected: true` just because a token record existed, regardless of expiry ## Changes ### Shared utility: `mcpclient.RefreshOAuth2Token` Pure function that uses `golang.org/x/oauth2` `TokenSource` to check if a token is expired (or within 10s of expiry) and refresh it. No DB dependency — callers handle persistence. ### chatd (`coderd/x/chatd/chatd.go`) Before calling `mcpclient.ConnectAll`, refreshes expired tokens. Persists new credentials to the database. Falls back to the old token if refresh fails. ### List/get MCP server endpoints (`coderd/mcp.go`) Both `listMCPServerConfigs` and `getMCPServerConfig` now attempt refresh when checking `auth_connected`. If the token is expired: - Has refresh token: attempt refresh, persist result, report `auth_connected` based on success - No refresh token: report `auth_connected: false` if expired This means the UI accurately reflects whether the user's token is actually usable, rather than just whether a record exists. <details> <summary>Design notes</summary> - `RefreshOAuth2Token` lives in `mcpclient` to avoid circular imports (`coderd` → `chatd` → `mcpclient` is fine; `chatd` → `coderd` would be circular). - DB persistence is handled by each caller with their own authz context (`AsSystemRestricted` in both cases). - The `buildAuthHeaders` warning in mcpclient about expired tokens is kept as defense-in-depth logging. </details>	2026-03-27 10:06:32 -04:00
Ethan	c4ef94aacf	fix(coderd/x/chatd): prevent chat hang when workspace agent is unavailable (#23707 ) ## Problem Chats with a persisted `agent_id` binding hang indefinitely when the workspace is stopped. The stale agent row still exists in the DB, so `ensureWorkspaceAgent` succeeds, but the dial blocks forever in `AwaitReachable`. The MCP discovery goroutine used an unbounded context, so `g2.Wait()` never returned and the LLM never started. ## Fix Three targeted changes restore the pre-binding behavior where stopped workspaces degrade gracefully instead of blocking: 1. `dialWithLazyValidation`: "no agents in latest build" is now a terminal fast-fail — the hanging dial is canceled and `errChatHasNoWorkspaceAgent` returned immediately, instead of falling through to `waitForOriginalDial`. 2. Pre-LLM workspace setup: MCP discovery and instruction persistence gate on `workspaceAgentIDForConn` before attempting any dial. MCP discovery is bounded by a 5s timeout and checks the in-memory tool cache first (using the cheap cached agent from `ensureWorkspaceAgent`), so the common subsequent-turn path has zero DB queries. 3. `persistInstructionFiles`: tracks whether the workspace connection succeeded and skips sentinel persistence on failure, so the next turn retries if the workspace is restarted. ## Scenarios Running workspace, subsequent turn (hot path): MCP cache hit via in-memory cached agent. Zero DB queries, zero dials. Unchanged from #23274. Stopped workspace, persisted binding (the bug): MCP cache hit (stale descriptors, fine — they fail at invocation). Pre-LLM setup completes instantly. Tool invocation enters `dialWithLazyValidation`, dial fails or hangs, validation discovers no agents, returns `errChatHasNoWorkspaceAgent`. Model sees the error and can call `start_workspace`. New chat, running workspace: `ensureWorkspaceAgent` resolves via latest-build, persists binding. MCP discovery dials and caches tools. 
New chat, stopped workspace: `ensureWorkspaceAgent` finds no agents, returns `errChatHasNoWorkspaceAgent`. Pre-LLM setup skips. LLM starts with built-in tools only. Rebuilt workspace (agent switched): MCP cache hit with stale agent (harmless for one turn). Tool invocation dials stale agent, fails fast, `dialWithLazyValidation` switches to new agent, persists updated binding. Workspace restarted after stop: No sentinel was persisted during the stopped turn, so instruction persistence retries. Agent binding switches to the new agent via `workspaceAgentIDForConn`. Transient DB error during validation: Not `errChatHasNoWorkspaceAgent`, so `dialWithLazyValidation` falls through to `waitForOriginalDial` (cannot prove stale). No false positive. Tool invocation on stopped workspace: `getWorkspaceConn` calls `ensureWorkspaceAgent` (returns stale row), then `dialWithLazyValidation` validation discovers no agents, returns `errChatHasNoWorkspaceAgent`, cached state cleared, error returned to model.	2026-03-27 18:47:39 +11:00
Ethan	d678c6fb16	fix(coderd/x/chatd): forward local status events to fix delayed-startup banner (#23650 ) ## Problem The agent chat delayed-startup banner ("Response startup is taking longer than expected") could appear even though the model was already streaming. The root cause is in `Subscribe()`: `message_part` events were delivered via the fast local in-process stream, while `status` events were delivered via PostgreSQL pubsub. Both feed into the same `select` statement, and Go's `select` picks whichever channel is ready first — there is no ordering guarantee between channels. So a `message_part` could outrun the `status=running` that logically precedes it. The frontend saw content arrive while it still thought the chat was pending, triggering the banner. ## Fix Also forward `status` events from the local channel, alongside `message_part`. Both event types already travel through the same FIFO subscriber channel: `publishStatus()` is called before the first `message_part`, so channel ordering guarantees the frontend sees `status=running` before any content. Pubsub still delivers a duplicate `status` event later; the frontend deduplicates it (`setChatStatus` is idempotent — it early-returns when the status hasn't changed).	2026-03-27 17:55:19 +11:00
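The ordering argument in #23650 rests on a basic property of Go channels: values sent on one channel are received in send order, whereas a `select` over two ready channels picks between them at random. A minimal sketch of the single-channel case, with a hypothetical `event` type standing in for the real chat stream events:

```go
package main

import "fmt"

// event is a hypothetical stand-in for a chat stream event.
type event struct {
	Kind string // "status" or "message_part"
	Data string
}

// publish sends the status change before the first content part, both on
// the same channel.
func publish(ch chan<- event) {
	ch <- event{Kind: "status", Data: "running"}
	ch <- event{Kind: "message_part", Data: "Hello"}
}

// drain collects event kinds in the order they are received.
func drain(ch <-chan event) []string {
	var kinds []string
	for ev := range ch {
		kinds = append(kinds, ev.Kind)
	}
	return kinds
}

func main() {
	ch := make(chan event, 2)
	publish(ch)
	close(ch)
	fmt.Println(drain(ch))
}
```

Had the two events travelled on separate channels, the receiver's `select` would give no such guarantee, which is the race the commit describes.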
Michael Suchacz	2312e5c428	feat: add manual chat title regeneration (#23633 ) ## Summary Adds a "Generate new title" action that lets users manually regenerate a chat's title using richer conversation context than the automatic first-message title path. ## Changes ### Backend - New endpoint: `POST /api/experimental/chats/{chatID}/title/regenerate` returns the updated Chat with a regenerated title - Manual title algorithm: Extracts useful user/assistant text turns → selects first user turn + last 3 turns → builds context with gap markers → renders prompt with anti-recency guidance → calls lightweight model → normalizes output - Helpers: `extractManualTitleTurns`, `selectManualTitleTurnIndexes`, `buildManualTitleContext`, `renderManualTitlePrompt`, `generateManualTitle` — all private, with the public `Server.RegenerateChatTitle` method - SDK: `ExperimentalClient.RegenerateChatTitle(ctx, chatID) (Chat, error)` - Persists title via existing `UpdateChatByID` and broadcasts `ChatEventKindTitleChange` ### Frontend - API client method + React Query mutation with cache invalidation - "Generate new title" menu item (with wand icon) in both TopBar and Sidebar dropdown menus - Loading/disabled state while regeneration is in-flight - Error toast on failure - Stories updated for both menus ### Tests - `quickgen_test.go`: Table-driven tests for all 4 helper functions (turn extraction, index selection, context building, prompt rendering) - `exp_chats_test.go`: Handler tests (ChatNotFound, NotFoundForDifferentUser, NoDaemon) ## Design notes - The existing auto-title path (`maybeGenerateChatTitle`, `titleInput`) is completely unchanged - Manual regeneration uses richer context (first user turn + last 3 turns + gap markers) vs the auto path's single first message - Endpoint is experimental and marked with `@x-apidocgen {"skip": true}`	2026-03-27 01:47:19 +01:00
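The turn selection described in #23633 (first user turn plus the last 3 turns) can be sketched as an index computation. `selectTurnIndexes` here is a hypothetical illustration and not the `selectManualTitleTurnIndexes` helper itself, whose exact rules are not shown in the commit message:

```go
package main

import "fmt"

// selectTurnIndexes returns the index of the first user turn followed by the
// indexes of the last three turns, in ascending order and without duplicates.
func selectTurnIndexes(roles []string) []int {
	first := -1
	for i, r := range roles {
		if r == "user" {
			first = i
			break
		}
	}

	start := len(roles) - 3
	if start < 0 {
		start = 0
	}

	var out []int
	// Only add the first user turn separately if the tail does not already
	// include it.
	if first >= 0 && first < start {
		out = append(out, first)
	}
	for i := start; i < len(roles); i++ {
		out = append(out, i)
	}
	return out
}

func main() {
	roles := []string{"user", "assistant", "user", "assistant", "user", "assistant"}
	fmt.Println(selectTurnIndexes(roles))
}
```

A jump between consecutive selected indexes (for example from 0 to 3) is where a gap marker would go when the context is rendered.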
Matt Vollmer	113aaa79a0	feat: add pinned chats with drag-to-reorder (#23615 ) https://github.com/user-attachments/assets/bd5d12a1-61b3-4b7d-83b6-317bdfb60b3c ## Summary Adds pinned chats to the agents page sidebar with server-side persistence and drag-to-reorder. Users can pin/unpin chats via the context menu, and pinned chats appear in a dedicated "Pinned" section above the time-grouped list. ## Database Migration `000453_chat_pin_order`: adds `pin_order integer DEFAULT 0 NOT NULL` column on `chats` (0 = unpinned, 1+ = pinned in display order). Three SQL queries handle pin operations server-side using CTEs with `ROW_NUMBER()`: - `PinChatByID`: normalizes existing orders and appends to end - `UnpinChatByID`: sets target to 0 and compacts remaining pins - `UpdateChatPinOrder`: shifts neighbors, clamps to `[1, pinned_count]` All queries exclude archived chats. `ArchiveChatByID` clears `pin_order` on archive. The handler rejects pinning archived chats with 400. ## Backend Pin/unpin/reorder go through the existing `PATCH /api/experimental/chats/{chat}` via the `pin_order` field on `UpdateChatRequest`. The handler routes based on current pin state: `pin_order == 0` unpins, `> 0` on an already-pinned chat reorders, `> 0` on an unpinned chat appends to end. ## Frontend - `pinChat` / `unpinChat` / `reorderPinnedChat` optimistic mutations using shared `isChatListQuery` predicate - Sidebar renders Pinned section above time groups, excludes pinned chats from time groups - Pin/Unpin context menu items (hidden for child/delegated chats) - `@dnd-kit/core` + `@dnd-kit/sortable` for drag-to-reorder with `MouseSensor`, `TouchSensor`, and `KeyboardSensor` - Local pin-order override prevents flash on drop; click blocker prevents NavLink navigation after drag --- PR generated with Coder Agents	2026-03-26 16:52:02 -04:00
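The handler routing in #23615 depends on both the requested `pin_order` and the chat's current pin state. A sketch of that decision, using hypothetical names (`routePin`, `clampPinOrder`) and treating negative values as invalid, which the commit message does not specify:

```go
package main

import "fmt"

type pinOp string

const (
	opUnpin   pinOp = "unpin"
	opReorder pinOp = "reorder"
	opAppend  pinOp = "append"
	opInvalid pinOp = "invalid" // assumption: negative values are rejected
)

// routePin picks the operation from the chat's current pin_order and the
// requested one (0 = unpinned, 1+ = pinned position).
func routePin(current, requested int) pinOp {
	switch {
	case requested == 0:
		return opUnpin
	case requested < 0:
		return opInvalid
	case current > 0:
		return opReorder
	default:
		return opAppend
	}
}

// clampPinOrder limits a reorder target to [1, pinnedCount]. It assumes
// pinnedCount >= 1, which holds whenever the chat being moved is pinned.
func clampPinOrder(requested, pinnedCount int) int {
	if requested < 1 {
		return 1
	}
	if requested > pinnedCount {
		return pinnedCount
	}
	return requested
}

func main() {
	fmt.Println(routePin(2, 0), routePin(2, 1), routePin(0, 1))
	fmt.Println(clampPinOrder(99, 4))
}
```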
Kyle Carberry	0f86c4237e	feat: add workspace MCP tool discovery and proxying for chat (#23680 ) Coder's chat (chatd) can now discover and use MCP servers configured in a workspace's `.mcp.json` file. This brings project-specific tooling (GitHub, databases, docs servers, etc.) into the chat without any manual configuration. ## How it works The workspace agent reads `.mcp.json` from the workspace directory (same format Claude Code uses), connects to the declared MCP servers — spawning child processes for stdio servers and connecting over the network for HTTP/SSE — and caches their tool lists. Two new agent HTTP endpoints expose this: - `GET /api/v0/mcp/tools` returns the cached tool list (supports `?refresh=true`) - `POST /api/v0/mcp/call-tool` proxies calls to the correct server On each chat turn, chatd calls `ListMCPTools` through the existing `AgentConn` tailnet connection, wraps each tool as a `fantasy.AgentTool`, and adds them to the LLM's tool set alongside built-in and admin-configured MCP tools. Tool names are prefixed with the server name (`github__create_issue`) to avoid collisions. Failed server connections are logged and skipped — they never block the agent or break the chat. Child stdio processes are terminated on agent shutdown.	2026-03-26 19:57:02 +00:00
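The name prefixing in #23680 (`github__create_issue`) is simple to sketch. The helper names below are hypothetical, and splitting on the first `__` assumes server names never contain a double underscore themselves:

```go
package main

import (
	"fmt"
	"strings"
)

const toolSep = "__"

// prefixToolName namespaces a tool under its MCP server to avoid collisions
// between servers that expose tools with the same name.
func prefixToolName(server, tool string) string {
	return server + toolSep + tool
}

// splitToolName recovers the server and tool so a call can be routed back.
// It splits on the first separator, so tool names may contain "__" but
// server names may not.
func splitToolName(name string) (server, tool string, ok bool) {
	return strings.Cut(name, toolSep)
}

func main() {
	name := prefixToolName("github", "create_issue")
	server, tool, ok := splitToolName(name)
	fmt.Println(name, server, tool, ok)
}
```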
Cian Johnston	bfee7e6245	fix: populate all chat fields in pubsub events (#23664 ) Problem: `publishChatPubsubEvent` was constructing a partial `codersdk.Chat` that omitted `LastModelConfigID` and other fields. Go's zero-value UUID caused the sidebar to show "Default model" for chats received via SSE. Solution: - Extracted `convertChat`/`convertChats` from `exp_chats.go` into `db2sdk.Chat`/`db2sdk.Chats`, alongside existing `ChatMessage`, `ChatQueuedMessage`, and `ChatDiffStatus` converters. `publishChatPubsubEvent` now calls `db2sdk.Chat(chat, nil)` instead of maintaining its own copy of the conversion logic - Added backend integration test `TestWatchChats/CreatedEventIncludesAllChatFields` - Added frontend regression tests for nil-UUID and valid model config ID cases > 🤖 Created by Coder Agents, reviewed by this human.	2026-03-26 16:49:26 +00:00
Danny Kopping	801e57d430	feat: session detail API (#23203)	2026-03-26 18:09:53 +02:00
Michael Suchacz	e937f89081	feat: add enabled toggle to chat model admin panel (#23665 ) Adds an `enabled` toggle to the chat model admin create/edit form so admins can disable a model without soft-deleting it. Disabled models stay visible in admin settings but stop appearing in user-facing model selectors. The backend already supported this (`chat_model_configs.enabled` column, filtered queries, and SDK fields). This change wires it into the admin UI and adds coverage on both sides. Backend: three new subtests in `coderd/exp_chats_test.go` verifying the visibility contract (admin sees disabled models, non-admin doesn't, update-to-disabled preserves the record). Frontend: `enabled` field added to form logic and seeded from the existing model (defaults to `true` for new models). A Switch+Tooltip control renders in the form header, matching the MCP Server panel pattern. Two interaction stories cover the create-disabled and toggle-existing flows.	2026-03-26 17:07:20 +01:00
Ethan	4d74603045	fix(coderd/x/chatd): respect provider Retry-After headers in chat retry loop (#23351 ) > PR Stack > 1. #23351 ← `#23282` (you are here) > 2. #23282 ← `#23275` > 3. #23275 ← `#23349` > 4. #23349 ← `main` --- ## Summary `chatretry.Retry()` used pure exponential backoff (1 s, 2 s, 4 s, …) and never consulted provider `Retry-After` headers. Fantasy's `ProviderError` carries `ResponseHeaders` including `Retry-After`, but `chaterror.Classify()` only parsed error text and silently dropped the structured transport metadata. This makes `Retry-After` a first-class signal in the classification → retry pipeline. <img width="853" height="346" alt="image" src="https://github.com/user-attachments/assets/65f012b6-8173-43d2-957e-ab9faddea525" /> ## Changes ### `coderd/chatd/chaterror/classify.go` - Added `RetryAfter time.Duration` field to `ClassifiedError` — a normalized minimum retry delay derived from provider response metadata. - `Classify()` now calls `extractProviderErrorDetails()` before falling back to text heuristics. Structured `ProviderError.StatusCode` takes priority over regex extraction. - `normalizeClassification()` preserves and clamps `RetryAfter`. ### `coderd/chatd/chaterror/provider_error.go` (new) Provider-specific extraction, isolated from the text-based classification logic: - `extractProviderErrorDetails()` unwraps `fantasy.ProviderError` from the error chain via `errors.As`. - `retryAfterFromHeaders()` parses headers in priority order: 1. `retry-after-ms` (OpenAI-specific, millisecond precision) 2. `retry-after` (standard HTTP — integer seconds or HTTP-date) - Case-insensitive header key lookup. ### `coderd/chatd/chatretry/chatretry.go` - `effectiveDelay(attempt, classified)` computes `max(Delay(attempt), classified.RetryAfter)` — the provider hint acts as a floor without weakening the local exponential backoff. 
- `Retry()` now uses `effectiveDelay` and passes the effective delay to both `onRetry(...)` and the sleep timer, so downstream payloads, logs, and the frontend countdown stay aligned automatically. ### Tests - `classify_test.go`: Structured provider status + `Retry-After` extraction, `retry-after-ms` priority, HTTP-date parsing, invalid header fallback, `WithProvider` preservation. - `chatretry_test.go`: Retry-after-as-floor semantics: longer hint wins, shorter hint keeps base delay. ## Design notes - No SDK/API/frontend changes needed. `codersdk.ChatStreamRetry` already carries `DelayMs` and `RetryingAt`, and the frontend already consumes them. The fix is purely in the server-side delay computation. - Existing retryability rules unchanged. This fixes when we sleep, not whether an error is retryable. - Provider hint is a floor: `max(baseDelay, RetryAfter)` ensures we never retry earlier than the provider asks, and never weaken our own backoff curve.	2026-03-27 01:20:46 +11:00
Cian Johnston	847a88c6ca	chore: clean up stale and dangerous //nolint comments (#23643 ) ## Changes - Commit 1: Remove 17 unnecessary `//nolint` directives: - `//nolint:varnamelen` — linter not active - `//nolint:unused` on exported `SlimUnsupported` - `//nolint:govet` in `coderd/httpmw/csrf` — no longer fires - `//nolint:revive` on functions refactored since the nolint was added - `//nolint:paralleltest` citing Go 1.22 loop variable capture (obsolete) - Bare `//nolint` narrowed to specific `//nolint:gocritic` with justification - Commit 2: Fix root causes behind 5 dangerous nolint suppressions: - Add `MinVersion: tls.VersionTLS12` to TLS client config (removes `gosec` G402) - Delete trivial unexported wrappers `apiKey()`/`normalizeProvider()` in chatprovider (removes `revive` confusing-naming) - Add doc comments to `StartWithAssert` and `Router` (removes `revive` exported) - Rename unused parameters to `_` in integration test helpers > 🤖 This PR was created using Coder Agents and reviewed by me.	2026-03-26 14:13:53 +00:00
Michael Suchacz	4f063cdc47	feat: separate default and additional Coder Agents system prompts (#23616 ) Admins can now control whether the built-in Coder Agents default system prompt is prepended to their custom instructions, rather than having the custom prompt silently replace the default. Changes: - New `include_default_system_prompt` boolean toggle (defaults to `true` for existing deployments) stored as a site config key — no migration needed. - GET `/api/experimental/chats/config/system-prompt` returns the toggle state, the custom prompt, and a preview of the built-in default. - PUT persists both the toggle and custom prompt atomically in a single transaction. - `resolvedChatSystemPrompt()` composes `[default?, custom?]` joined by `\n\n`, falling back to the built-in default on DB errors. - Settings UI adds a Switch toggle with conditional helper text and a "Preview" button that shows the built-in default prompt via the existing `TextPreviewDialog`. - Comprehensive test coverage: 15 subtests covering toggle behavior, prompt composition matrix, auth boundaries, and integration with chat creation.	2026-03-26 13:32:41 +01:00
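The `[default?, custom?]` composition in #23616 amounts to joining whichever parts are present with a blank line. A minimal sketch, using a hypothetical `composeSystemPrompt` function rather than `resolvedChatSystemPrompt()` itself (the DB-error fallback is left out):

```go
package main

import (
	"fmt"
	"strings"
)

// composeSystemPrompt joins the built-in default (when enabled) and the
// admin's custom prompt with a blank line, skipping empty parts.
func composeSystemPrompt(defaultPrompt, custom string, includeDefault bool) string {
	var parts []string
	if includeDefault && defaultPrompt != "" {
		parts = append(parts, defaultPrompt)
	}
	if custom != "" {
		parts = append(parts, custom)
	}
	return strings.Join(parts, "\n\n")
}

func main() {
	fmt.Printf("%q\n", composeSystemPrompt("You are a coding agent.", "Prefer Go.", true))
	fmt.Printf("%q\n", composeSystemPrompt("You are a coding agent.", "Prefer Go.", false))
}
```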
Cian Johnston	d175e799da	feat: show agent badge on workspace list (#23453 ) - Adds `GET /api/experimental/chats/by-workspace` endpoint that returns workspace_id → latest chat_id mapping - Modifies FE to fetch this alongside the workspace list, gated on `agents` experiment and render an "Agent" badge similar to the existing "Task" badge in `WorkspacesTable` - Badge links to the "latest chat" linked to the given workspace. Notes: - Intentionally uses `fetchWithPostFilter` for RBAC to decouple from workspaces API — will migrate to `workspaces_expanded` view later. - If users have multiple chats linked to the same workspace, the badge will link to the most recently updated one. > 🤖 This PR was created with the help of Coder Agents, and has been reviewed by my human. 🧑‍💻	2026-03-26 11:30:12 +00:00
Jaayden Halko	3fb7c6264f	feat: display the AI add-on column in the UI on the Users and Organization Members tables (#23291 ) ## Summary Adds an entitlement-gated AI add-on column to both the Users table and the Organization Members table. When `ai_governance_user_limit` is entitled, each row shows whether the user is consuming an AI seat. ## Background The AI governance add-on tracks which users are consuming AI seats. Admins need visibility into per-user seat consumption directly from the user management tables. This change surfaces that information through both the site-wide Users table and the per-organization Members table, gated behind the `ai_governance_user_limit` entitlement so the column only appears when the feature is licensed. ## Implementation ### Backend - New SQL query `GetUserAISeatStates` (`coderd/database/queries/aiseatstate.sql`) — returns user IDs consuming an AI seat, derived from: - Users with entries in `aibridge_interceptions` (AI Bridge usage) - Users who own workspaces with `has_ai_task = true` builds (AI Tasks usage) - SDK types — added `has_ai_seat: boolean` to `codersdk.User` and `codersdk.OrganizationMemberWithUserData` - Handler wiring — both the Users list endpoint (`coderd/users.go`) and all Members endpoints (`coderd/members.go`) query AI seat state per page of user IDs and populate the response field - dbauthz — per-user `ActionRead` checks on `ResourceUserObject` ### Frontend - Shared `AISeatCell` component (`site/src/modules/users/AISeatCell.tsx`) — green `CircleCheck` for consuming, gray `X` for non-consuming - `TableColumnHelpTooltip` — extended with `ai_addon` variant with tooltip: "Users with access to AI features like AI Bridge, Boundary, or Tasks who are actively consuming a seat." 
- Column visibility gated behind `useFeatureVisibility().ai_governance_user_limit` ## Validation - Backend: dbauthz full method suite (`TestMethodTestSuite`) passes including new `GetUserAISeatStates` test - Backend: `TestGetUsers`, `TestUsersFilter`, CLI golden file tests pass - Frontend: 7/7 tests pass across `UsersPage.test.tsx` and `OrganizationMembersPage.test.tsx` (column visibility gating both directions) - `go build ./coderd/...` compiles clean - `pnpm --dir site run lint:types` passes - `make gen` clean ## Risks - Pagination performance: The AI seat query is scoped to the current page's user IDs (not a full table scan), keeping it efficient for paginated views. - Semantic scope: The workspace-side AI seat derivation uses "any build with `has_ai_task = true`" rather than "latest build only". If the product intent is latest-build-only, this can be tightened in a follow-up. --- _Generated with `mux` • Model: `anthropic:claude-opus-4-6` • Thinking: `xhigh` • Cost: `$27.25`_	2026-03-26 10:36:40 +00:00
Ethan	15f2fa55c6	perf(coderd/x/chatd): add process-wide config cache for hot DB queries (#23272 ) ## Summary Adds a process-wide cache for three hot database queries in `chatd` that were hitting Postgres on every chat turn despite returning rarely-changing configuration data: \| Query \| Before (50k turns) \| After \| Reduction \| \|---\|---\|---\|---\| \| `GetEnabledChatProviders` \| ~98.6k calls \| ~500-1000 \| ~99% \| \| `GetChatModelConfigByID` \| ~49.2k calls \| ~500-1000 \| ~98% \| \| `GetUserChatCustomPrompt` \| ~46.7k calls \| ~1000-2000 \| ~97% \| These were identified via `coder exp scaletest chat` (5000 concurrent chats × 10 turns) as the dominant source of Postgres load during chat processing. ## Design Follows the established webpush subscription cache pattern (`coderd/webpush/webpush.go`): - `sync.RWMutex` + `tailscale.com/util/singleflight` (generic) + generation-based stale prevention + TTL - 10s TTL for provider/model config, 5s TTL for user prompts - Negative caching for `sql.ErrNoRows` on user prompts (the common case — most users don't set custom prompts) - Deep-clones `ChatModelConfig.Options` (`json.RawMessage` = `[]byte`) on both store and read paths ### Invalidation Single pubsub channel (`chat:config_change`) with kind discriminator for cross-replica cache invalidation. Seven publish points in `coderd/chats.go` cover all admin mutation endpoints (create/update/delete for providers and model configs, put for user prompts). _This PR was generated with mux and was reviewed by a human_	2026-03-26 18:04:53 +11:00
Ethan	21c2acbad5	fix: refine chat retry status UX (#23651 ) Follow-up to #23282. The retry and terminal error callouts had a few UX oddities: - Auto-retrying states reused backend error text that said "Please try again" even while the UI was already retrying on behalf of the user. - Terminal error states also said "Please try again" with no action the user could take. - `startup_timeout` had no specific title or retry copy — it fell through to the generic "Retrying request" heading. - The kind pill showed raw enum values like `startup_timeout` and `rate_limit`. - Terminal error metadata showed a "Retryable" / "Not retryable" label that does not help users. - A separate "Provider anthropic" metadata row duplicated information already present in the message body. - The `usage-limit` error kind used a hyphen while every backend kind uses underscores. Changes: Backend (`chaterror/message.go`) - Split message generation into `terminalMessage()` and `retryMessage()`, replacing the old `userFacingMessage()`. - Terminal messages include HTTP status codes and actionable guidance (e.g. "Check the API key, permissions, and billing settings."). - Retry messages are clean factual statements without status codes or remediation, suitable for the retry countdown UI (e.g. "Anthropic is temporarily overloaded."). - Removed "Please try again" / "Please try again later" from all paths. - `StreamRetryPayload` calls `retryMessage()` instead of forwarding `classified.Message`. Frontend - Removed the parallel frontend message-generation system: `getRetryMessage()`, `getProviderDisplayName()`, `getRetryProviderSubject()`, and the `PROVIDER_DISPLAY_NAMES` map are all deleted from `chatStatusHelpers.ts`. - `liveStatusModel.ts` passes `retryState.error` through directly — the backend owns the copy. - Added specific title and retry copy for `startup_timeout`, and extended the title mapping to cover `auth` and `config`. 
- Kind pills now show humanized labels ("Startup timeout", "Rate limit", etc.) instead of raw enum strings. - Removed the redundant "Provider anthropic" metadata row. - Removed the terminal "Retryable" / "Not retryable" badge. - Normalized `"usage-limit"` → `"usage_limit"` and added it to `ChatProviderFailureKind` so all error kinds follow the same underscore convention and live in one enum. Refs #23282.	2026-03-26 17:37:27 +11:00
Ethan	61e31ec5cc	perf(coderd/x/chatd): persist workspace agent binding across chat turns (#23274 ) ## Summary This change removes the steady-state "resolve the latest workspace agent" query from chat execution. Instead of asking the database for the latest build's agent on every turn, a chat now persists the workspace/build/agent binding it actually uses and reuses that binding across subsequent turns. The common path becomes "load the bound agent by ID and dial it", with fallback paths to repair the binding when it is missing, stale, or intentionally changed. ## What changes - add `workspace_id`, `build_id`, and `agent_id` binding fields to `chats` - expose those fields through the chat API / SDK so the execution context is explicit - load the persisted binding first in chatd, instead of always resolving the latest build's agent - persist a refreshed binding when chatd has to re-resolve the workspace agent - keep child / subagent chats on the same bound workspace context by inheriting the parent binding - leave `build_id` / `agent_id` unset for flows like `create_workspace`, then bind them lazily on the next agent-backed turn ## Runtime behavior The binding is treated as an optimistic cache of the agent a chat should use: - if the bound agent still exists and dials successfully, we use it without a latest-build lookup - if the bound agent is missing or no longer reachable, chatd re-resolves against the latest build and persists the new binding - if a workspace mutation changes the chat's target workspace, the binding is updated as part of that mutation To avoid reintroducing a hot-path query, dialing uses lazy validation: - start dialing the cached agent immediately - only validate against the latest build if the dial is still pending after a short delay - if validation finds a different agent, cancel the stale dial, switch to the current agent, and persist the repaired binding ## Result The hot path stops issuing `GetWorkspaceAgentsInLatestBuildByWorkspaceID` for 
every user message, which is the source of the DB pressure this PR is addressing. At the same time, chats still converge to the correct workspace agent when the binding becomes stale due to rebuilds or explicit workspace changes.	2026-03-26 17:22:38 +11:00
Cian Johnston	7a9d57cd87	fix(coderd): actually wire the chat template allowlist into tools (#23626) Problem: previously, the deployment-wide chat template allowlist was never actually wired in from `chatd.go`. - Extracts `parseChatTemplateAllowlist` into shared `coderd/util/xjson.ParseUUIDList` - Adds `Server.chatTemplateAllowlist()` method that reads the allowlist from DB - Passes `AllowedTemplateIDs` callback to `ListTemplates`, `ReadTemplate`, and `CreateWorkspace` tool constructors > 🤖 Created by Coder Agents and reviewed by a human.	2026-03-25 22:15:27 +00:00
Steven Masley	9d5b7f4579	test: assert on user id, not entire user (#23632) The User struct has a `LastSeen` field, which can change during the test. Replaces https://github.com/coder/coder/pull/23622	2026-03-25 19:09:25 +00:00
Steven Masley	f65b915fe3	chore: add permissions to `coder:workspace.` scopes for functionality (#23515) `coder:workspaces.` composite scopes did not provide enough permissions to do what they say they can do. Closes https://github.com/coder/coder/issues/22537	2026-03-25 13:46:58 -05:00


3543 Commits