coder

mirror of https://github.com/coder/coder.git synced 2026-06-05 22:18:20 +00:00

Author	SHA1	Message	Date
Kyle Carberry	b9c729457b	fix(chatd): queue interrupt messages to preserve conversation order (#22736 ) ## Problem When `message_agent` is called with `interrupt=true`, two independent code paths race to persist messages: 1. `SendMessage` inserts the user message into `chat_messages` at time T1 2. `persistInterruptedStep` saves the partial assistant response at time T2 (T2 > T1) Since `chat_messages` are ordered by `(created_at, id)`, the assistant message ends up after the user message that triggered the interrupt. On reload, this produces a broken conversation where the interrupted response appears below the new user message — and Anthropic rejects the trailing assistant message as unsupported prefill. The root cause is that two independent writers can't guarantee ordering. Any solution involving timestamp manipulation or signal-then-wait coordination leaves race windows. ## Fix Route interrupt behavior through the existing queued message mechanism: 1. `SendMessage` with `BusyBehaviorInterrupt` now inserts into `chat_queued_messages` (not `chat_messages`) when the chat is busy 2. After queuing, `setChatWaiting` signals the running loop to stop 3. The deferred cleanup in `processChat` persists the partial assistant response first, then auto-promotes the queued user message This eliminates the race entirely: the assistant partial response and user message are written by the same serialized cleanup flow, so ordering is guaranteed by the DB's auto-incrementing `id` sequence. No timestamp hacks, no reordering at send time. Supersedes #22728 — fixes the root cause instead of reordering at prompt construction time.	2026-03-06 18:15:40 -05:00
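The ordering guarantee described in this fix can be illustrated with a small in-memory model. This is only a sketch: `chatStore`, `sendInterrupt`, and `cleanup` are hypothetical names standing in for the tables and functions named in the commit, not the chatd code itself.

```go
package main

import "fmt"

// message is a simplified stand-in for a row in chat_messages.
type message struct {
	ID   int64
	Role string
	Text string
}

// chatStore is a hypothetical in-memory model of the two tables.
type chatStore struct {
	nextID   int64
	messages []message // stands in for chat_messages
	queued   []string  // stands in for chat_queued_messages
}

// insert appends a message with the next auto-incrementing ID.
func (s *chatStore) insert(role, text string) {
	s.nextID++
	s.messages = append(s.messages, message{ID: s.nextID, Role: role, Text: text})
}

// sendInterrupt models SendMessage on a busy chat: the user message is
// queued instead of being written to the message log directly.
func (s *chatStore) sendInterrupt(text string) {
	s.queued = append(s.queued, text)
}

// cleanup models the single serialized writer: it persists the partial
// assistant response first, then promotes the oldest queued user message.
func (s *chatStore) cleanup(partial string) {
	if partial != "" {
		s.insert("assistant", partial)
	}
	if len(s.queued) > 0 {
		s.insert("user", s.queued[0])
		s.queued = s.queued[1:]
	}
}

func main() {
	s := &chatStore{}
	s.insert("user", "first question")
	s.sendInterrupt("actually, do this instead")
	s.cleanup("partial answer so far")
	for _, m := range s.messages {
		fmt.Println(m.ID, m.Role, m.Text)
	}
}
```

Because both writes happen in one function, the assistant row always receives the lower ID, regardless of when the interrupt arrived.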
Kyle Carberry	9bd712013f	fix(chat): fix streaming bugs in edit notifications, persist race, and frontend reconnect (#22737 )	2026-03-06 15:11:05 -08:00
Kyle Carberry	f404463317	fix: resolve bugs in chat HTTP handlers (#22722 )	2026-03-06 16:06:18 -06:00
Kyle Carberry	eecb7d0b66	fix: resolve bugs in chatd streaming system (#22720 ) Split from #22693 per review feedback. Fixes multiple bugs in coderd/chatd and sub-packages including race conditions, transaction safety, stream buffer bounds, retry limits, and enterprise relay improvements. See commit message for full list.	2026-03-06 21:02:25 +00:00
Mathias Fredriksson	a104d608a3	feat: add file/image attachment support to chat input (#22604 ) This change adds support for image attachments to chat via add button and clipboard paste. Files are stored in a new `chat_files` table and referenced by ID in message content. File data is resolved from storage at LLM dispatch time, keeping the message content column small. Upload validates MIME types via content type or content sniffing against an allowlist (png, jpeg, gif, webp). The retrieval endpoint serves files with immutable caching headers. On the frontend, uploads start eagerly on attach with a background fetch to pre-warm the browser HTTP cache so the timeline renders instantly after send.	2026-03-06 21:05:26 +02:00
Kyle Carberry	30a736c49e	fix: resolve bugs in pubsub and codersdk chat packages (#22717 )	2026-03-06 17:37:55 +00:00
Steven Masley	537260aa22	fix: early oidc refresh with fake idp tests (#22712 ) Wrote unit tests that implement a fake idp to verify the oauth package actually refreshes the token	2026-03-06 16:51:27 +00:00
Kacper Sawicki	c0ef3540a5	feat(namesgenerator): expand auto-generated name digit suffix to 00-99 (#22665 )	2026-03-06 15:09:58 +01:00
Danny Kopping	13e3df67d6	feat: track client sessions (#22470 ) This change adds support for tracking client session IDs in AI Bridge interceptions to enable better session-based auditing. Depends on https://github.com/coder/aibridge/pull/198 Fixes https://github.com/coder/internal/issues/1337 The session ID field is optional and not universally supported by all clients.	2026-03-06 14:43:53 +02:00
Danielle Maywood	f9891416c0	fix: emit Responses API lifecycle events in mock OpenAI server (#22702 )	2026-03-06 12:35:44 +00:00
Steven Masley	c805c8c02c	chore: setting time forward for expiration math (#22687 ) It was set backwards, which allowed invalid refresh tokens and made things worse.	2026-03-06 12:29:54 +00:00
Danielle Maywood	ffb47cea19	feat(chatd): add tag-based dedup to push notifications (#22669 )	2026-03-06 10:48:58 +00:00
Danielle Maywood	d91d9712f7	fix: use Eventually for web push dispatch assertion in chatd test (#22700 )	2026-03-06 09:52:28 +00:00
Hugo Dutka	48ab492f49	feat: agents git watch backend (#22565 ) Adds real-time git status watching for workspace agents, so the frontend can subscribe over WebSocket and show git file changes in near real-time. 1. Subscription is scoped to a chat via `GET /api/experimental/chats/{chat}/git/watch`. 2. The workspace agent automatically determines which paths to watch based on tool calls made by the chat (and its ancestor chats). 3. Workspace agent polls subscribed repo working trees on a 30s interval, on tool calls, and on explicit `refresh` from the client. 4. Scans are rate-limited to at most once per second. 5. Edited paths are tracked in-memory inside the workspace agent. There is no database persistence; state is lost on agent restart. This will be addressed in a future PR. 6. Messages sent over WebSocket include a full-repo snapshot (unified diff, branch, origin). A new message is emitted only when the snapshot changes. This PR was implemented with AI with me closely controlling what it's doing. The code follows a plan file that was updated continuously during implementation. Here's the file if you'd like to see it: [project.md](https://gist.github.com/hugodutka/8722cf80c92f8a56555f7bc595b770e2). It reflects the current state of the PR.	2026-03-06 10:47:55 +01:00
Cian Johnston	81468323e0	fix(coderd): use dbtime.Now() instead of time.Now() in test assertions against DB timestamps (#22685 ) `time.Now()` has nanosecond precision while Postgres timestamps are microsecond precision. When tests compare `time.Now()` against DB-sourced timestamps using `Before`/`After`/`WithinRange`/etc., there is a non-zero flake risk from the precision mismatch. This replaces `time.Now()` with `dbtime.Now()` (which rounds to microsecond precision) in all test assertions that compare against database timestamps. Follows from #22684. ## Changes (11 files) \| File \| Changes \| \|---\|---\| \| `coderd/apikey_test.go` \| 11 comparisons with `ExpiresAt` \| \| `coderd/users_test.go` \| 2 comparisons with `ExpiresAt` \| \| `coderd/oauth2_test.go` \| 1 comparison with `token.Expiry` \| \| `coderd/workspaces_test.go` \| 2 comparisons with `DormantAt` \| \| `coderd/workspaceagents_test.go` \| 3 comparisons with `ConnectedAt`/`DisconnectedAt` \| \| `coderd/workspaceapps/db_test.go` \| 1 comparison with `token.Expiry` \| \| `coderd/provisionerdserver/provisionerdserver_test.go` \| 1 comparison with `key.ExpiresAt` \| \| `enterprise/coderd/workspaces_test.go` \| 1 comparison with `DormantAt` \| \| `enterprise/coderd/license/license_test.go` \| 3 `NotBefore` values \| \| `enterprise/coderd/licenses_test.go` \| 2 `NotBefore` values \| \| `enterprise/coderd/users_test.go` \| 3 `Next()` comparisons \| ## Not changed (intentionally) - `scaletest/placebo/run_test.go` — compares wall-clock elapsed time, not DB timestamps - `cli/server_test.go`, `coderd/jwtutils/jwt_test.go`, `enterprise/aibridgeproxyd/aibridgeproxyd_test.go` — TLS cert fields, not DB-stored - `coderd/azureidentity/azureidentity_test.go` — Azure cert expiry, not DB 🤖 Generated by Claude Opus 4.6 but reviewed manually.	2026-03-06 09:14:11 +00:00
Jon Ayers	6c44de951d	feat: add Prometheus collector for DERP server expvar metrics (#22583 ) This PR does three things: - Exports derp expvars to the pprof endpoint - Exports the expvar metrics as prometheus metrics in both coderd and wsproxy - Updates our tailscale to pick up a fix I also had to make to avoid a data race. I generated this with mux, but I also manually tested that the metrics were properly emitted.	2026-03-06 01:57:58 -06:00
Kayla はな	56bdea73b8	feat: add workspace acls to task rbac objects (#22311 ) To allow tasks to be shareable, we need to share both the `task` resource and the `workspace` resource, and their sharing state needs to be kept in sync. We've already implemented all of the necessary ACL functionality for workspaces, so we can just sort of proxy those ACLs back to the task as well.	2026-03-05 13:40:53 -07:00
Mathias Fredriksson	719c24829a	build(Makefile): use atomic writes for remaining gen targets (#22670 ) Follow-up to #22612. Running `git status --short` in a loop during `make -B -j gen` still showed intermediate states for several files. This PR fixes the remaining ones. The main issues: - `generate.sh` ran `gofmt` and `goimports` in-place after moving files into the source tree. Now it formats in a workdir first and only `mv`s the final result. - `protoc` targets wrote directly to the source tree. Wrapped with `scripts/atomic_protoc.sh` which redirects output to a tmpdir. - Several generators used hardcoded `/tmp/` paths. On systems where `/tmp` is tmpfs, `mv` degrades to copy+delete. Switched to a project-local `_gen/` directory (gitignored, same filesystem). - `apidoc/.gen` and `cli/index.md` used `cp` for final output. Replaced with `mv`. - `manifest.json` was written twice (unformatted, then formatted). Now `.gen` writes to a staging file and the manifest target does one formatted atomic write. - `biome_format.sh` silently skipped files in gitignored dirs. Added `--vcs-enabled=false`. Two helpers reduce the Makefile boilerplate: `scripts/atomic_protoc.sh` (wraps protoc) and an `atomic_write` Make define (stdout-to-temp-to-target pattern). `.PRECIOUS` now also covers `.pb.go` and mock files. Verification: `make -B -j gen` x3 with `git status` polling, no changes. Refs #22612	2026-03-05 22:32:18 +02:00
Danielle Maywood	f91475cd51	test: remove unnecessary dbauthz.AsSystemRestricted calls in tests (#22663 )	2026-03-05 20:29:49 +00:00
Danielle Maywood	0ec27e3d48	feat(chatd): navigate to specific chat on push notification click (#22668 )	2026-03-05 16:40:17 +00:00
Kyle Carberry	6520159045	feat(chatd): add start_workspace tool to agent flow (#22646 ) ## Summary When a chat's workspace is stopped, the LLM previously had no way to start it — `create_workspace` would either create a duplicate workspace or fail. This adds a dedicated `start_workspace` tool to the agent flow. ## Changes ### New: `start_workspace` tool (`coderd/chatd/chattool/startworkspace.go`) - Detects if the chat's workspace is stopped and starts it via a new build with `transition=start` - Reuses the existing `waitForBuild` and `waitForAgent` helpers (shared logic) - Shares the workspace mutex with `create_workspace` to prevent races - Idempotent: returns immediately if the workspace is already running or building - Returns a `no_agent` / `not_ready` status if the agent isn't available yet (non-fatal) ### Updated: `create_workspace` stopped-workspace hint - `checkExistingWorkspace` now returns a `stopped` status with message `"use start_workspace to start it"` when it detects the chat's workspace is stopped, instead of falling through to create a new workspace ### Wiring - `chatd.Config` / `chatd.Server`: new `StartWorkspace` / `startWorkspaceFn` field - `coderd/chats.go`: new `chatStartWorkspace` method that calls `postWorkspaceBuildsInternal` with proper RBAC context - `coderd/coderd.go`: passes `chatStartWorkspace` into chatd config - Tool registered alongside `create_workspace` for root chats only (not subagents) ### Tests (`startworkspace_test.go`) - `NoWorkspace`: error when chat has no workspace - `AlreadyRunning`: idempotent return for workspace with successful start build - `StoppedWorkspace`: verifies StartFn is called, build is waited on, and success response returned	2026-03-05 15:34:24 +00:00
Mathias Fredriksson	a6a8fd94d7	build(Makefile): enable parallel `make -j gen` with correct dependency graph (#22612 ) `make gen` could not run with `-j` because inter-target dependency edges were missing. Multiple recipes compile `coderd/rbac` (which includes generated files like `object_gen.go`), and without explicit ordering, parallel runs produced syntax errors from mid-write reads. Three main changes: Dependency graph fixes declare the compile-time chain through `coderd/rbac` so that `object_gen.go` is written before anything that imports it is compiled. The DB generation targets use a GNU Make 4.3+ grouped target (`&:`) so Make knows `generate.sh` co-produces `querier.go`, `unique_constraint.go`, `dbmetrics`, and `dbauthz` in a single invocation. `SKIP_DUMP_SQL=1` avoids re-entrant `make` inside `generate.sh` when the Makefile already guarantees `dump.sql` is fresh. `scripts/atomicwrite` package replaces `os.WriteFile` in all gen scripts with a temp-file-in-same-dir + rename pattern, preventing interrupted runs from leaving partial files. `.PRECIOUS` and shell atomic writes protect git-tracked generated files from Make's default delete-on-error behavior. Since these files are committed, deletion is worse than staleness -- `git restore` is the recovery path. CI now runs `make -j --output-sync -B gen` (~32s, down from ~85s serial). \| Scenario \| Before \| After \| \|-----------------------------------\|--------------------\|----------\| \| `make gen` (serial) \| 95s \| 95s \| \| `make -j gen` (parallel) \| race error \| 22s \| \| CI `make -j --output-sync -B gen` \| forced serial ~85s \| ~32s \|	2026-03-05 11:58:10 +00:00
Cian Johnston	d0a51e1752	fix: use testutil.Eventually in chatd interrupt test (#22653 ) Follow-up to #22630. Addresses [review feedback](https://github.com/coder/coder/pull/22630#pullrequestreview-2953419963) that was missed due to auto-merge. ## Changes Replaces three `require.Eventually` calls with `testutil.Eventually` in `TestInterruptChatDoesNotSendWebPushNotification`, linking the condition to the existing test context (`ctx`) created on line 1194. This ensures the test respects context cancellation instead of using a standalone timeout/tick pattern.	2026-03-05 09:42:34 +00:00
Cian Johnston	4d0d187806	fix(chatd): wait for startup scripts before returning from create_workspace (#22498 ) The `create_workspace` tool waited for the workspace build to succeed and the agent to become connectable, but did not wait for the agent's startup scripts (e.g. git clone) to finish. This caused agents to attempt file operations on repositories that hadn't been cloned yet. Add a waitForStartupScripts step that polls the agent's lifecycle_state via GetWorkspaceAgentLifecycleStateByID until it transitions out of created/starting into a terminal state (ready, start_error, or start_timeout). The tool now only returns success once the workspace is fully initialized. If the scripts fail or time out, the tool still returns (non-fatal) with an appropriate agent_status so the model knows something went wrong. Created using thingies (Opus 4.6 Max)	2026-03-05 09:42:12 +00:00
Susana Ferreira	21c91cebaa	feat: add TLS listener support to aibridgeproxyd (#22411 ) ## Description Adds optional TLS support for the AI Bridge Proxy listener. When TLS cert and key files are provided, the proxy serves over HTTPS instead of plain HTTP. ## Changes * New configuration options to enable TLS on the proxy listener * Wraps the TCP listener in `tls.NewListener` when configured * Tests for validation errors, invalid files, and full integration (tunneled + MITM) through a TLS listener Note: Documentation for TLS listener setup and client configuration will be handled in a follow-up PR. Related to: https://github.com/coder/internal/issues/1335	2026-03-05 09:19:34 +00:00
Kyle Carberry	7bcd9f6de8	fix: skip web push notification when chat is interrupted (#22630 ) When a user interrupts a chat, the status transitions to `waiting` which previously triggered an "Agent has finished running." web push notification. This is incorrect — the user interrupted it themselves, so no notification is needed. ## Changes ### `coderd/chatd/chatd.go` - Added `wasInterrupted` flag alongside the existing `status` variable - Set the flag when `ErrInterrupted` is detected in the error handler - Added `!wasInterrupted` to the web push dispatch condition ### `coderd/chatd/chatd_test.go` - Added `TestInterruptChatDoesNotSendWebPushNotification` that creates a chat with a mock webpush dispatcher, processes it, interrupts it, and verifies no push notification was dispatched - Added `mockWebpushDispatcher` implementing the `webpush.Dispatcher` interface	2026-03-05 09:08:17 +00:00
Kyle Carberry	b28958cef9	Revert "fix(chatd): sanitize \u0000 from JSON before JSONB insertion" (#22645 ) Reverts coder/coder#22637	2026-03-05 03:35:52 +00:00
Kyle Carberry	5630390d94	fix(chatd): enable compaction between steps and re-enter after summarization (#22640 ) ## Problem Three bugs with chat summarization (compaction) share a single root cause: `ReloadMessages` was never wired up in the production `chatloop.Run()` call. ### Bug 1: Compaction never fires between steps The inline compaction guard in `chatloop.go` requires both `Compaction` and `ReloadMessages` to be non-nil: ```go if opts.Compaction != nil && opts.ReloadMessages != nil { ``` Since `ReloadMessages` was only set in tests, inline compaction was dead code in production. Long multi-step turns could blow through the context window. ### Bug 2: Compaction only occurs at end of turn The post-run safety net doesn't check `ReloadMessages`, so it was the only compaction path that fired: ```go if !alreadyCompacted && opts.Compaction != nil { // no ReloadMessages check ``` This meant compaction only happened once, after the entire agent turn finished. ### Bug 3: Agent stops after summarization After post-run compaction, `Run()` unconditionally returned `nil`. `processChat` then set the chat status to `waiting` (done). The agent never had a chance to continue with its fresh summarized context. ## Fix 1. Wire up `ReloadMessages` in `chatd.go`: reloads persisted messages from the database and re-applies system prompts (subagent instruction, workspace AGENTS.md). 2. Wrap the step loop in an outer compaction loop: when compaction fires on the model's final step (`compactedOnFinalStep`), reload messages and `continue` the outer loop so the agent re-enters with summarized context. 3. Track `compactedOnFinalStep` to distinguish inline compaction on the last step (needs re-entry) from inline compaction mid-loop followed by more tool-call steps (agent already consumed the compacted context, no re-entry needed). 4. Add `maxCompactionRetries = 3` to prevent infinite compaction loops. ## Testing - All 7 existing compaction tests pass unchanged. 
- Added `PostRunCompactionReEntersStepLoop` test: verifies that when a text-only response triggers compaction, the outer loop re-enters and the agent makes a second stream call with fresh context.	2026-03-04 22:28:23 -05:00
Kyle Carberry	27f0f2962c	fix(chatd): sanitize \u0000 from JSON before JSONB insertion (#22637 ) ## Problem Users hit this error when agent tool results contain Unicode null characters: ``` persist step: insert tool result: pq: unsupported Unicode escape sequence ``` PostgreSQL's `jsonb` type rejects `\u0000` (Unicode null, U+0000) with that error, even though it's valid JSON per RFC 8259. Tool results from agents can contain this sequence — e.g. binary data, C-style strings, or certain API responses. ## Root cause `MarshalToolResult` and `MarshalContent` in `chatprompt.go` serialize content blocks to JSON and pass them directly to `InsertChatMessage` which casts to `::jsonb`. Go's `json.Marshal` / `json.Valid` accept `\u0000`, but Postgres does not. ## Fix Added `sanitizeJSONForPG()` which strips `\u0000` escape sequences from serialized JSON before insertion. Uses `bytes.Contains` as a fast-path check to avoid allocation when no null bytes are present (the common case). Applied to both `MarshalContent` (assistant messages) and `MarshalToolResult` (tool result messages).	2026-03-04 21:14:41 -05:00
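For illustration only (this change was reverted in #22645), stripping `\u0000` escapes with a `bytes.Contains` fast path might look like the sketch below. `stripNullEscapes` is a hypothetical name, not the `sanitizeJSONForPG` implementation. The log does not say how that function treats an escaped backslash followed by `u0000`; this sketch assumes such input should be left intact.

```go
package main

import (
	"bytes"
	"fmt"
)

// nullEscape is the six-byte JSON escape sequence for U+0000.
var nullEscape = []byte(`\u0000`)

// stripNullEscapes removes \u0000 escape sequences from serialized JSON.
// It assumes data is valid JSON, where backslashes only occur in strings.
func stripNullEscapes(data []byte) []byte {
	// Fast path: no allocation when the sequence is absent.
	if !bytes.Contains(data, nullEscape) {
		return data
	}
	out := make([]byte, 0, len(data))
	for i := 0; i < len(data); {
		if data[i] != '\\' || i+1 >= len(data) {
			out = append(out, data[i])
			i++
			continue
		}
		if bytes.HasPrefix(data[i:], nullEscape) {
			// Drop the whole escape sequence.
			i += len(nullEscape)
			continue
		}
		// Any other escape: copy the backslash and the next byte as a
		// pair, so `\\u0000` (escaped backslash, then text) is preserved.
		out = append(out, data[i], data[i+1])
		i += 2
	}
	return out
}

func main() {
	fmt.Println(string(stripNullEscapes([]byte(`{"a":"x\u0000y"}`))))
	fmt.Println(string(stripNullEscapes([]byte(`{"a":"x\\u0000y"}`))))
}
```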
Kyle Carberry	d50fc374c5	fix(coderd): fix flaky TestGetUserStatusCounts timezone boundary (#22639 ) ## Problem `TestGetUserStatusCounts/OK_when_offset_is_provided_without_timezone` fails intermittently in CI: ``` Error: Should be zero, but was 1 Test: TestGetUserStatusCounts/OK_when_offset_is_provided_without_timezone ``` ## Root Cause The `happyResponseCheck` asserts `count=0` for all 61 dates. The test creates a first user, which inserts a `user_status_changes` row with `new_status=active` and `changed_at=now()`. The query computes its date range using the requested timezone/offset: ```go nextHourInLoc = dbtime.Now().Truncate(time.Hour).Add(time.Hour).In(loc) sixtyDaysAgo = dbtime.StartOfDay(nextHourInLoc).AddDate(0, 0, -60) ``` When the UTC time of day is earlier than the timezone offset (e.g. UTC 01:30 with offset `-2` means local time is 23:30 previous day), `StartOfDay(nextHourInLoc)` rounds forward to start-of-today in the target timezone, which is after the current UTC time. The last `date_of_interest` in the SQL query ends up ahead of `now()` in UTC, so the user's `changed_at` satisfies `changed_at <= date` — producing `count=1` on the last date. This happens ~8% of the time for offset `-2` (when UTC hour is 0 or 1) and ~15% for `America/St_Johns` (UTC-3:30). ## Fix Allow the last date entry to have count 0 or 1 (only 1 user exists) while keeping all earlier dates strictly zero. This correctly accounts for the timezone boundary without weakening the test's structural validation.	2026-03-04 18:01:56 -08:00
Kyle Carberry	30d534b36b	fix(chatd): fix relay race conditions, extract enterprise relay logic, move pubsub to OSS (#22589 ) ## Summary Fixes a bug where interrupting a streaming chat and sending a new message left the relay connected to the wrong replica. Expanded into a broader refactor that cleanly separates concerns: - OSS owns pubsub subscription, message catch-up, queue updates, status forwarding, and local parts merging. - Enterprise (`enterprise/coderd/chatd`) only manages relay dialing, reconnection, and stale-dial discarding for cross-replica streaming. ## Architecture ### OSS `coderd/chatd/chatd.go` `Subscribe()` builds the initial snapshot then runs a single merge goroutine that handles: - Pubsub subscription for durable events (status, messages, queue, errors) - Message catch-up via `AfterMessageID` - Local `message_part` forwarding - Relay events from enterprise (when `SubscribeFn` is set) - Sends `StatusNotification` to enterprise so it can manage relay lifecycle Key types: - `SubscribeFn` — enterprise hook, returns relay-only events channel - `SubscribeFnParams` — `ChatID`, `Chat`, `WorkerID`, `StatusNotifications`, `RequestHeader`, `DB`, `Logger` - `StatusNotification` — `Status` + `WorkerID`, sent to enterprise on pubsub status changes ### Enterprise `enterprise/coderd/chatd/chatd.go` `NewMultiReplicaSubscribeFn(cfg MultiReplicaSubscribeConfig)` returns a `SubscribeFn` that: - Opens an initial synchronous relay if the chat is running on a remote worker - Reads `StatusNotifications` from OSS to open/close relay connections - Handles async dial, reconnect timers, stale-dial discarding - Returns only relay `message_part` events ## Bug fixes ### Original bug: stale relay dial after interrupt `openRelayAsync` goroutines used `mergedCtx` (subscription-level), not a per-dial context. `closeRelay()` could not cancel in-flight dials. 
When the user interrupts and a new replica picks up the chat, the old dial goroutine could complete after the new one and deliver a stale `relayResult`. Fix: per-dial `dialCtx`/`dialCancel`, `expectedWorkerID` tracking, `workerID` on `relayResult`. `closeRelay()` cancels the dial context and drains `relayReadyCh`. Merge loop rejects mismatched worker IDs. ### Additional fixes - `statusNotifications` send-on-closed-channel race — goroutine now owns `close()` via defer - Enterprise spin-loop on `StatusNotifications` close — two-value receive with nil-out - `hasPubsub` set from `p.pubsub != nil` instead of subscription success — now tracks actual subscription result - `lastMessageID` not initialized from `afterMessageID` — caused duplicate messages on catch-up - `wrappedParts` goroutine leaked remote connection on `dialCtx` cancel - `closeRelay()` did not drain `relayReadyCh` - `setChatWaiting` race with `SendMessage(Interrupt)` — wrapped in `InTx` - `processChat` post-TX side effects fired when chat was taken by another worker — added `errChatTakenByOtherWorker` sentinel - Cancel closure data race on `reconnectTimer` - Bare blocking send on pubsub error path - `localParts` hot-spin after channel close - No-pubsub branch dropped relay events and initial snapshot - Failed relay dial caused permanent stall (no reconnect retry) - DB error during reconnect timer caused permanent stall - `time.NewTimer` replaced with `quartz.Clock` for testable timing ## Tests 9 enterprise tests covering: - Relay reconnect on drop (mock clock) - Async dial does not block merge loop - Relay snapshot delivery - Stale dial discarded after interrupt - Cancel during in-flight dial - Running-to-running worker switch - Failed dial retries (mock clock) - Local worker closes relay - Multiple consecutive reconnects (mock clock) All pass with `-race`.	2026-03-04 18:42:28 -05:00
Kayla はな	e35717bc19	fix: show a notice when workspace sharing is disabled globally in organization settings (#22580 )	2026-03-04 11:14:52 -07:00
Mathias Fredriksson	c7dd429bbf	fix(coderd/database/dbfake): prevent cross-test job stealing in WorkspaceBuildBuilder (#22598 ) Previously, WorkspaceBuildBuilder.doInTX() inserted provisioner jobs with empty tags and used a loop in AcquireProvisionerJob that could match other tests' pending jobs when parallel tests share a database. Add a unique tag (jobID -> "true") to each provisioner job at insert time, then use that tag in AcquireProvisionerJob to target only the correct job. This follows the same pattern used in dbgen.ProvisionerJob. Closes coder/internal#1367	2026-03-04 17:47:34 +00:00
Kyle Carberry	ec89abd6e5	feat(chatd): use lightweight model candidates for title generation (#22605 ) ## Problem Title generation uses the same model the user selected for chat. This breaks when: 1. Thinking/extended thinking models — `ToolChoice: None` conflicts with extended thinking on Anthropic. The bare call has no thinking config, so provider-level defaults can conflict. 2. Expensive models — User picks `o3` or `claude-opus-4`, and a trivial 8-word title generation burns through tokens/cost unnecessarily. 3. Provider quirks — Different providers have different constraints around thinking mode + tool choice combinations. ## Solution Modeled after how `coder/mux` handles this with `NAME_GEN_PREFERRED_MODELS` + ordered candidate fallback: ### Phase 1: Candidate model list with fallback - New `TitleModelFunc` type returns an ordered list of candidate models - Tries `claude-haiku-4-5` → `gpt-4o-mini` → user's model - Gracefully skips unavailable candidates (missing API key, provider not configured) - Falls back to the user's chat model as last resort ### Phase 2: Provider-safe call options - Removed `ToolChoice: None` which conflicts with extended thinking on some providers - Added `MaxOutputTokens: 256` to cap token usage - Improved title prompt with verb-noun format guidance (`Fix sidebar layout`, `Add user authentication`) and explicit no-markdown/no-code-fences instructions ### Files changed - `coderd/chatd/title.go` — Candidate loop, improved prompt, safe call options - `coderd/chatd/chatd.go` — Build `TitleModelFunc` closure with lightweight candidates	2026-03-04 16:03:03 +00:00
Kyle Carberry	f4a7fa5b95	fix(chatd): block subagents from spawning workspaces (#22603 ) ## Summary Subagent (child) chats were previously given access to workspace provisioning tools (`list_templates`, `read_template`, `create_workspace`), which could lead to uncontrolled resource consumption. This PR moves those tools behind the same `!chat.ParentChatID.Valid` gate that already protects the subagent tools (`spawn_agent`, `wait_agent`, etc.). ## Changes - `coderd/chatd/chatd.go`: Moved `list_templates`, `read_template`, and `create_workspace` tool registration into the root-chat-only block alongside subagent tools. - `coderd/chatd/chatd_test.go`: Added `TestSubagentChatExcludesWorkspaceProvisioningTools` — an E2E test that spawns a subagent via a root chat and verifies the subagent's LLM call does not include workspace provisioning or subagent tools. - `coderd/chatd/chattest/openai.go`: Added `Tools` field to `OpenAIRequest` and supporting `OpenAITool`/`OpenAIToolFunction` types so tests can inspect which tools are sent to the model.	2026-03-04 15:49:14 +00:00
Danielle Maywood	90f686d684	feat(agents): add unarchive agent support (#22579 )	2026-03-04 14:08:12 +00:00
Kyle Carberry	012a0497ce	fix(agents): remove optimistic message rendering and fix auto-promote delivery (#22588 ) ## Problem Two bugs in the agents chat flow: 1. Optimistic rendering glitch: When sending a message while the agent is busy, a fake message with a negative ID appears in the timeline, then gets rolled back to the queued state. This causes a jarring flash. 2. Auto-promoted messages not appearing: When the server auto-promotes a queued message after finishing a task, the promoted user message doesn't show up in the timeline until the LLM finishes its response. ## Root Causes Bug 1: The optimistic rendering system injected placeholder messages with `id: -Date.now()` into the store. When the server responded with `queued: true`, the optimistic message was rolled back — but the user had already seen it flash in the timeline. Bug 2: In `processChat`'s deferred cleanup, the auto-promoted message was published via `publishEvent()`, which only delivers to local in-process stream subscribers. The SSE subscriber goroutine only forwards `message_part` events from the local channel — it ignores `message` events. Durable events reach the SSE client via pubsub → DB read, but `publishEvent` doesn't trigger a pubsub notification. The explicit `PromoteQueued` endpoint correctly used `publishMessage()` (which does both), but the auto-promote path did not. ## Changes ### Frontend (`site/`) - AgentDetail.tsx: Remove optimistic message injection from send and edit flows. Instead, use the `CreateChatMessageResponse.message` from the POST response to insert the real server message into the store immediately. - ChatContext.ts: Remove the negative-ID cleanup logic from `upsertDurableMessage` that stripped optimistic placeholders when real messages arrived. - chatStore.test.ts: Remove 2 tests for negative-ID optimistic message behavior. ### Backend (`coderd/chatd/`) - chatd.go: In `processChat` cleanup, replace `publishEvent()` with `publishMessage()` for auto-promoted messages. 
This ensures the pubsub notification (`AfterMessageID`) is sent, so SSE subscribers read the new message from the DB immediately.	2026-03-04 07:49:39 -05:00
Danielle Maywood	f28f56d02c	test(coderd/rbac): parallelize TestRolePermissions subtests (#22259 )	2026-03-04 12:47:39 +00:00
Sas Swart	cfcb81fb0f	fix: user status change chart accommodates DST (#22191 ) closes https://github.com/coder/internal/issues/464 # Summary This PR resolves a flaky test that was sensitive to DST transitions in various time zones. The root of the flake was: * a bug; the query and its tests assume 24 hours per day * the tests used local system time, which resulted in failures for dates proximal to DST transitions # Changes Query: The original query assumed 24 hour intervals between each day, which is not a valid assumption. It now increments `1 day` at a time. Database tests: Database level tests for the query all assumed 24 hour days. They now increment in DST-aware days instead. Instead of using time.Now() as a base for testing, the test uses a series of dates over the course of an entire year, to ensure that DST transition dates are present in every test run. # API Endpoint The endpoint that delivers the user status chart now accepts an IANA timezone name as a parameter and passes it, keeping the existing offset as a fallback, to the database query. API level tests were added to ensure the correct response form and error behaviour. Correctness of content is tested at the database level.	2026-03-04 12:54:39 +02:00
Kyle Carberry	5b1cf4a6a3	fix(chatd): start stream buffering before publishing running status (#22571 ) ## Problem There is a race condition in the chat stream reconnect path. When a client connects (or reconnects) to `/stream`, sometimes they only see a `status: running` event but never receive any `message_part` events — the stream appears stuck. ## Root Cause In `processChat`, the sequence is: 1. `publishStatus(running)` — broadcasts `status: running` to all subscribers and via pubsub. 2. `runChat()` is called. 3. Inside `runChat`, there's significant setup work (model resolution, DB queries, title generation, prompt building, instruction resolution). 4. Only after all that setup does `runChat` set `buffering = true` on the stream state. If a client connects to `/stream` between steps 1 and 4: - `Subscribe()` reads `chat.Status == running` from the DB, so it includes `status: running` in the snapshot. - But `buffering` is still `false`, so `subscribeToStream` returns an empty local snapshot (no message_parts). - `publishToStream` drops all `message_part` events when `buffering` is false. - Result: client sees `running` but never gets any streaming content. ## Fix Move the `buffering = true` setup (and its deferred cleanup) from `runChat` into `processChat`, right before `publishStatus(running)`. This guarantees the buffer is active before any subscriber can observe `status: running`, so: - The snapshot always includes any in-flight `message_part` events. - `publishToStream` never drops parts because buffering is already on.	2026-03-03 21:27:59 +00:00
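The ordering fix described in the commit above can be sketched as follows. This is a simplified model, not the code in `chatd.go`: `streamState`, `publishPart`, `setBuffering`, and `snapshot` are hypothetical names standing in for the real stream state and `publishToStream`.

```go
package main

import (
	"fmt"
	"sync"
)

// streamState is a hypothetical stand-in for the per-chat stream state.
type streamState struct {
	mu        sync.Mutex
	buffering bool
	parts     []string
}

// setBuffering turns the part buffer on or off.
func (s *streamState) setBuffering(on bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.buffering = on
}

// publishPart stores a part only while buffering is on and reports
// whether the part was kept. Parts published earlier are dropped.
func (s *streamState) publishPart(p string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.buffering {
		return false
	}
	s.parts = append(s.parts, p)
	return true
}

// snapshot returns a copy of the buffered parts for a new subscriber.
func (s *streamState) snapshot() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]string(nil), s.parts...)
}

func main() {
	s := &streamState{}

	// Enable buffering first, then announce the running status, so that
	// any subscriber that sees "running" also sees buffered parts.
	s.setBuffering(true)
	fmt.Println("status: running")

	kept := s.publishPart("hello")
	fmt.Println("part kept:", kept, "snapshot size:", len(s.snapshot()))
}
```

If `setBuffering(true)` were called only after the status announcement, a part published in between would be dropped, which is the stuck-stream symptom the commit describes.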
Danielle Maywood	d2d956edb1	fix: add archived query parameter to chat list endpoint (#22562 ) Despite the SDK type having an `Archived` field for chats, this data was never fetched from the database — the `GetChatsByOwnerID` query hardcoded `AND archived = false`, and the `convertChat` function never mapped the field. This PR adds an optional `archived` query parameter to `GET /api/experimental/chats`:

| Value | Behavior |
|-------|----------|
| (not provided) | Returns all chats (active and archived) |
| `archived=false` | Returns only non-archived chats |
| `archived=true` | Returns only archived chats |

This follows the same pattern used by template versions (`sqlc.narg('archived')` nullable boolean). Also fixes `convertChat` to populate the `Archived` field in API responses, which was never being set despite existing on the SDK type.	2026-03-03 20:39:19 +00:00
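The three-state behavior of the `archived` parameter in the commit above can be modelled with a nullable boolean. The sketch below is an in-memory illustration only: `chat`, `parseArchived`, and `filterChats` are hypothetical names, the real filtering happens in the SQL query, and using `strconv.ParseBool` for parsing is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// chat is a hypothetical, trimmed-down chat record.
type chat struct {
	ID       int
	Archived bool
}

// parseArchived maps the raw query value to a nullable boolean:
// an empty string means "not provided" and yields nil.
func parseArchived(raw string) (*bool, error) {
	if raw == "" {
		return nil, nil
	}
	v, err := strconv.ParseBool(raw)
	if err != nil {
		return nil, fmt.Errorf("invalid archived value %q: %w", raw, err)
	}
	return &v, nil
}

// filterChats returns every chat when archived is nil, otherwise only
// the chats whose Archived flag matches the requested value.
func filterChats(chats []chat, archived *bool) []chat {
	out := make([]chat, 0, len(chats))
	for _, c := range chats {
		if archived == nil || c.Archived == *archived {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	chats := []chat{{ID: 1}, {ID: 2, Archived: true}, {ID: 3}}
	for _, raw := range []string{"", "false", "true"} {
		filter, err := parseArchived(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("archived=%q -> %d chats\n", raw, len(filterChats(chats, filter)))
	}
}
```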
Danny Kopping	1b08bc76a6	feat: store tool call IDs to determine interception lineage (#22246 ) Adds database columns and server-side logic to track interception lineage via tool call IDs. When an interception ends, the server resolves the correlating tool call ID to find the parent interception and links them via `parent_id`. New `provider_tool_call_id` column on `aibridge_tool_usages` and `parent_id` column on `aibridge_interceptions`, with indexes for lookup. `findParentInterceptionID` queries by tool call ID and filters out the current interception to find the parent. Adapted from the [coder/coder `dk/prompt_provenance_poc`](https://github.com/coder/coder/compare/main...dk/prompt_provenance_poc) branch. Depends on [coder/aibridge#188](https://github.com/coder/aibridge/pull/188). Closes https://github.com/coder/internal/issues/1334	2026-03-03 21:04:41 +02:00
Steven Masley	f49dea683c	chore: prematurely refresh oidc token near expiry during workspace build (#22502 ) Closes https://github.com/coder/coder/issues/22429	2026-03-03 18:13:00 +00:00
Steven Masley	bca638a498	feat: validate prebuild presets using dynamic parameter validation (#21858 ) Prebuilds need to be valid. Before this change, you could push a template version whose preset would fail when creating a prebuild. This PR ensures that all presets used for prebuilds are valid.	2026-03-03 16:50:18 +00:00
Kyle Carberry	059ed7ab5c	fix(chatd): return chat to pending when server shuts down during successful completion (#22559 ) ## Problem Flaky test: `TestCloseDuringShutdownContextCanceledShouldRetryOnNewReplica` (coder/internal#1371) The test intermittently fails because the chat ends up in `waiting` status instead of `pending` after server shutdown. ## Root Cause There is a race condition in `processChat` where `runChat` completes successfully just as the server context is being canceled during `Close()`. The sequence: 1. Server calls `Close()`, canceling the server context. 2. The LLM HTTP response has already been fully written by the mock server (the stream closes normally before context cancellation propagates to the HTTP client). 3. `runChat` returns `nil` (success) instead of `context.Canceled`. 4. The existing `isShutdownCancellation` check only runs when `runChat` returns an error, so the shutdown is not detected. 5. `processChat`'s deferred cleanup marks the chat as `waiting` instead of `pending`. 6. The test's assertion that the chat is `pending` never becomes true. This race is timing-dependent — it only triggers when the mock server's HTTP response completes in the narrow window between context cancellation being initiated and it propagating through the HTTP transport layer. ## Fix Add a server context check after `runChat` returns successfully. If the server is shutting down (`ctx.Err() != nil`), override the status to `pending` so another replica can pick up the chat. This is the same pattern already used for the error path (`isShutdownCancellation`), extended to cover the success path.	2026-03-03 11:34:08 -05:00
Kyle Carberry	2d7009e50d	test: reduce unnecessary sleep durations in tests (#22552 ) ## Summary Removes `time.Sleep` calls in two test files by replacing them with deterministic or event-driven alternatives. ### Changes `coderd/provisionerjobs_test.go` (34.5s → 0.25s) Replaced `time.Sleep(1500ms)` with a direct SQL `UPDATE` to bump `created_at` by 2 seconds. The sleep existed purely to ensure different timestamps for sort-order testing. The fix is deterministic and cannot flake. Uses `NewDBWithSQLDB` (the test already required real Postgres via `WithDumpOnFailure`). `coderd/database/pubsub/pubsub_test.go` (2.05s → 1.3s) Replaced `time.Sleep(1s)` with a `testutil.Eventually` retry loop that publishes and checks for subscriber receipt. This is the idiomatic pattern in the codebase. The old sleep waited for pq.Listener to re-issue LISTEN after reconnect; the new code polls until it actually works.	2026-03-03 10:19:00 -05:00
Kyle Carberry	10a33ebc75	test: reduce Await* polling interval from 250ms to 25ms (#22536 ) ## Summary Change the four main `coderdtest` Await helper functions to poll at `IntervalFast` (25ms) instead of `IntervalMedium` (250ms): - `AwaitTemplateVersionJobCompleted` - `AwaitWorkspaceBuildJobCompleted` - `WorkspaceAgentWaiter.WaitFor` - `WorkspaceAgentWaiter.Wait` These are called ~855 times across the test suite. Each call previously wasted ~125ms on average waiting for the next poll tick. `AwaitTemplateVersionJobRunning` already used `IntervalFast` — this makes all Await helpers consistent. ## Measured Impact Local benchmarks (postgres, `-short -count=1 -p 8 -parallel 8 -tags=testsmallbatch`):

| Package | Before | After | Delta |
|---|---|---|---|
| enterprise/coderd | 90.8s | 76.0s | -16.3% |
| coderd | 65.6s | 56.5s | -13.8% |
| cli | 57.9s | 37.8s | -34.7% |
| enterprise (root) | 41.1s | 39.9s | -2.9% |
| Sum of all packages | 623s | 543s | -12.8% |

Zero test failures across all 199 packages.	2026-03-03 13:48:58 +00:00
Ehab Younes	9d2aed88c4	fix: register task pause/resume routes under /api/v2 (#22544 ) The pause/resume endpoints were only registered under /api/experimental but the frontend and Go SDK were calling /api/v2, resulting in 404s. Register the routes in the v2 group, update the SDK client paths, and fix swagger annotations (Accept → Produce) since these POST endpoints have no request body.	2026-03-03 16:34:33 +03:00
Jake Howell	8aebd73466	feat: implement new default monospace font `Geist Mono` (#22081 ) This pull request follows up #22060. It felt wrong to only make use of Geist when there is a monospace variant too. It felt best to make this the default font, as it's in line with the rest of the application. This also updates the lower line for Workspace Statistics 🙂	2026-03-03 12:00:50 +00:00
Cian Johnston	517cb0ce73	refactor(webpush): use RequireExperimentWithDevBypass middleware (#22525 ) Replace manual experiment checks in web-push handlers with the `RequireExperimentWithDevBypass` middleware on the route group, matching the pattern used by OAuth2, Agents, and MCP experiments. ## Changes - `coderd/coderd.go`: Add `RequireExperimentWithDevBypass` middleware to `/webpush` route group - `coderd/webpush.go`: Remove inline `api.Experiments.Enabled(codersdk.ExperimentWebPush)` checks from all three handlers - `cli/server.go`: Gate webpush dispatcher initialization with `buildinfo.IsDev()` fallback so dev builds always init the real dispatcher - `coderd/webpush_test.go`: Remove experiment enablement from tests (dev bypass handles it) Net effect: -26 lines removed, +5 added. Created using whatchamacallits (Opus 4.6 Max)	2026-03-03 09:49:04 +00:00

3442 Commits