coder

mirror of https://github.com/coder/coder.git synced 2026-06-03 13:08:25 +00:00

Author	SHA1	Message	Date
Kyle Carberry	0ad2f9ecd7	feat(chatd): persist last_error on chats table (#22436) Adds a nullable `last_error` column to the `chats` table so error reasons survive page reloads. Backend: - Migration adds `last_error TEXT` (nullable) to chats - `UpdateChatStatus` writes the error reason when status transitions to `error`, clears it (NULL) on recovery - `convertChat` maps `sql.NullString` to `string` in the SDK Frontend: - Sidebar falls back to `chat.last_error` when no stream error reason is cached - Chat detail page does the same for `persistedErrorReason` - Fixtures updated for new required field	2026-02-28 12:27:26 -05:00
Kyle Carberry	2bdacae5f5	feat(chatd): add LLM stream retry with exponential backoff (#22418 ) ## Summary Adds automatic retry with exponential backoff for transient LLM errors during chat streaming and title generation. Inspired by [coder/mux](https://github.com/coder/mux)'s retry mechanism. ## Key Behaviors - Infinite retries with exponential backoff: 1s → 2s → 4s → ... → 60s cap - Deterministic delays (no jitter) - Error classification: retryable (429, 5xx, overloaded, rate limit, network errors) vs non-retryable (auth, quota, context exceeded, model not found, canceled) - Retry status published to SSE stream so frontend can show "Retrying in Xs..." UI - Title generation retries silently (best-effort, nil onRetry callback) ## New Package: `coderd/chatd/chatretry/` \| File \| Purpose \| \|------\|---------\| \| `classify.go` \| `IsRetryable(err)` and `StatusCodeRetryable(code)` \| \| `backoff.go` \| `Delay(attempt)` — exponential doubling with 60s cap \| \| `retry.go` \| `Retry(ctx, fn, onRetry)` — infinite loop with context-aware timer \| ## Test Helpers: `coderd/chatd/chattest/errors.go` Anthropic and OpenAI error response builders for use in chattest providers: - `AnthropicErrorResponse()`, `AnthropicOverloadedResponse()`, `AnthropicRateLimitResponse()` - `OpenAIErrorResponse()`, `OpenAIRateLimitResponse()`, `OpenAIServerErrorResponse()` ## SDK Changes: `codersdk/chats.go` - New `ChatStreamEventType: "retry"` - New `ChatStreamRetry` struct with `Attempt`, `DelayMs`, `Error`, `RetryingAt` fields - TypeScript types auto-generated ## Changed Files - `coderd/chatd/chatloop/chatloop.go` — wraps `agent.Stream()` in `chatretry.Retry()` - `coderd/chatd/chatd.go` — publishes retry events to SSE stream with logging - `coderd/chatd/title.go` — wraps `model.Generate()` in silent retry - `coderd/chatd/chattest/anthropic.go` / `openai.go` — error injection support ## Tests 42 tests covering classification (33), backoff (9), and retry scenarios (8).	2026-02-27 18:34:33 -05:00
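The backoff schedule described in the retry commit above (1s → 2s → 4s → ... → 60s cap, no jitter) can be sketched roughly as follows. This is a minimal illustration, not the code from `coderd/chatd/chatretry/backoff.go`: the name `backoffDelay`, the constants, and the 0-based `attempt` convention are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

const (
	initialDelay = 1 * time.Second  // first retry waits 1s (from the commit message)
	maxDelay     = 60 * time.Second // delays never exceed 60s (from the commit message)
)

// backoffDelay is a hypothetical helper: it returns the deterministic wait
// before retry number attempt (0-based, an assumption), doubling each time
// and stopping at maxDelay.
func backoffDelay(attempt int) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	d := initialDelay
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			// Return early so large attempt numbers cannot overflow.
			return maxDelay
		}
	}
	return d
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("attempt %d: wait %s\n", attempt, backoffDelay(attempt))
	}
}
```

Returning early once the cap is reached keeps the function safe for the "infinite retries" case, where the attempt counter can grow without bound.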
Kyle Carberry	4b5ec8a9a4	feat: add diff_status_change event to /chats/watch pubsub stream (#22419 ) ## Summary Adds a new `diff_status_change` event kind to the `/chats/watch` pubsub stream so the sidebar can update diff status (PR created, files changed, branch info) without a full page reload. ### Problem When a chat's diff status changes (e.g. PR created via GitHub, git branch pushed), the sidebar didn't update because: 1. The backend `publishChatPubsubEvent` didn't include diff status data 2. The frontend watch handler only merged `status`, `title`, and `updated_at` from events ### Solution A notify-only approach: a new `ChatEventKindDiffStatusChange` event kind tells the frontend "diff status changed for chat X" — the frontend then invalidates the relevant React Query cache entries to re-fetch. ### Backend changes - `coderd/pubsub/chatevent.go`: New `ChatEventKindDiffStatusChange = "diff_status_change"` constant - `coderd/chatd/chatd.go`: New `PublishDiffStatusChange(ctx, chatID)` method on `Server` - `coderd/chats.go`: New `publishChatDiffStatusEvent` helper. Published from: - `refreshWorkspaceChatDiffStatuses` — after each chat's diff status is refreshed via GitHub API - `storeChatGitRef` — after persisting git branch/origin info from workspace agent ### Frontend changes - `AgentsPage.tsx`: Handle `diff_status_change` event by invalidating `chatDiffStatusKey` and `chatDiffContentsKey` queries - `ChatContext.ts`: Remove redundant diff status invalidation that fired on every chat status change (the new event kind handles this properly)	2026-02-27 18:06:54 -05:00
Kyle Carberry	12083441e0	feat(chats): archive chats instead of hard-deleting them (#22406 ) ## Summary The UI has always labeled the action as "Archive agent" but the backend was performing a hard `DELETE`, permanently destroying chats and all their messages. This change replaces the hard delete with a soft archive, consistent with the pattern used by template versions. ## Changes ### Database - Migration 000423: Add `archived boolean DEFAULT false NOT NULL` column to `chats` table - Replace `DeleteChatByID` query with `ArchiveChatByID` (`UPDATE SET archived = true`) - Add `UnarchiveChatByID` query (`UPDATE SET archived = false`) - Filter archived chats from `GetChatsByOwnerID` (`WHERE archived = false`) ### API - Remove `DELETE /api/experimental/chats/{chat}` - Add `POST /api/experimental/chats/{chat}/archive` — archives a chat and all its descendants - Add `POST /api/experimental/chats/{chat}/unarchive` — unarchives a single chat (API only, no UI yet) ### Backend - `archiveChatTree()` recursively archives child chats (replaces `deleteChatTree()` which hard-deleted) - Chat daemon's `ArchiveChat()` archives the full chat tree in a transaction - Authorization uses `ActionUpdate` instead of `ActionDelete` ### SDK - Replace `DeleteChat()` with `ArchiveChat()` and `UnarchiveChat()` - Add `Archived` field to `Chat` struct ### Frontend - `archiveChat` API call uses `POST .../archive` instead of `DELETE` - No UI changes — the "Archive agent" button now actually archives instead of deleting ## Design Decision This follows the template version archive pattern (Pattern B in the codebase): - `archived boolean` column (not `deleted boolean`) - Dedicated `POST .../archive` and `POST .../unarchive` routes (not repurposing `DELETE`) - Reversible — users can unarchive via the API (UI for this will come later)	2026-02-27 16:46:19 -05:00
Kyle Carberry	52dad56462	fix(coderd): refresh OAuth token before GitHub API calls in chat diff (#22415 ) ## Problem `resolveChatGitHubAccessToken` reads the `OAuthAccessToken` directly from the database without refreshing it. When the token expires, GitHub returns "bad credentials" and the chat diff features break. ## Fix Call `config.RefreshToken()` before returning the token — the same code path used by `provisionerdserver` when handing tokens to provisioners. - Builds a map of provider ID → `*externalauth.Config` during the existing config iteration - After fetching the `ExternalAuthLink` from the DB, calls `cfg.RefreshToken()` if a matching config exists - On refresh failure, falls through to the existing token (GitHub tokens without expiry still work) with a debug log	2026-02-27 16:37:17 -05:00
Kyle Carberry	360df1d84f	fix(chatd): publish streaming message_part events during compaction (#22410) ## Problem Context compaction in chatd persisted durable messages for the `chat_summarized` tool call and result via `publishMessage`, but never published `message_part` streaming events via `publishMessagePart`. This meant connected clients had no streaming representation of the compaction. The client's `streamState` (built entirely from `message_part` events in `streamState.ts`) never saw the compaction tool call, so: - No "Summarizing..." running state was shown to the user during summary generation (which can take up to 90s). - The durable `message` events arrived after or interleaved with the `status: waiting` event, causing the tool to appear as "Summarized" with the chat appearing to just stop. ## Fix ### 1. `CompactionOptions.OnStart` callback (chatloop) Added an `OnStart` callback to `CompactionOptions`, called in `maybeCompact` right before `generateCompactionSummary` (the slow LLM call). This gives `chatd` a hook to publish the tool-call `message_part` immediately when compaction begins. ### 2. Tool-result streaming part (chatd) `persistChatContextSummary` now publishes a tool-result `message_part` before the durable `message` events, so clients transition from "Summarizing..." to "Summarized" before the status change arrives. ### Event ordering is now: 1. `message_part` (tool call via `OnStart`) — client shows "Summarizing..." 2. LLM generates summary (up to 90s) 3. `message_part` (tool result) — client shows "Summarized" in stream state 4. `message` (assistant) — durable message persisted, stream state resets 5. `message` (tool) — durable tool result persisted 6. `status: waiting` — chat transitions to idle ## Tests - `OnStartFiresBeforePersist`: Verifies callback ordering is `on_start` → `generate` → `persist`. - `OnStartNotCalledBelowThreshold`: Verifies `OnStart` is not called when context usage is below the compaction threshold.	2026-02-27 16:33:39 -05:00
Kyle Carberry	bb97ba727f	fix(coderd): allow non-admin users to list chat model configs (#22407 ) ## Problem Non-admin users of the Agents (chat) feature send `model_config_id: "00000000-0000-0000-0000-000000000000"` (nil UUID) when creating chats, because the `GET /api/experimental/chats/model-configs` endpoint requires `policy.ActionRead` on `rbac.ResourceDeploymentConfig`, which is only granted to admins. The flow: 1. `AgentsPage.tsx` calls `useQuery(chatModelConfigs())` → hits `listChatModelConfigs` 2. Non-admin users get a 403 Forbidden response 3. `chatModelConfigsQuery.data` is `undefined`, so the `modelConfigIDByModelID` map is empty 4. `handleCreateChat` falls back to `nilUUID` for `model_config_id` 5. The backend rejects the nil UUID: `"Invalid model config ID."` ## Fix Changed `listChatModelConfigs` to allow all authenticated users to read model configs: - Admin users continue to see all configs (including disabled ones) for management via `GetChatModelConfigs` - Non-admin users now see only enabled configs via `GetEnabledChatModelConfigs` with a system context, which is sufficient for using the chat feature This follows the same pattern as `listChatModels`, which already uses `dbauthz.AsSystemRestricted(ctx)` to allow all authenticated users to see available models. Write endpoints (create/update/delete) retain their existing `ResourceDeploymentConfig` authorization. ## Testing - Updated `TestListChatModelConfigs/ForbiddenForOrganizationMember` → `SuccessForOrganizationMember` to verify non-admin users can list enabled model configs - All existing chat tests continue to pass	2026-02-27 15:31:04 -05:00
Kyle Carberry	f509c841cf	fix(chatd): recover stale chats after coderd redeployment (#22405 ) ## Problem When coderd instances are redeployed (e.g. rolling deployment on dogfood), in-flight chats get stuck in `running` status permanently. The UI shows them as "thinking" with a spinning indicator, but no worker is actually processing them. They never error or resume. ## Root Cause Two bugs combine to cause this: ### Bug 1: Shutdown cleanup uses a canceled context The `processChat` defer block updates the chat status in the DB when processing completes. But it uses `ctx`, which `Close()` cancels before the defer runs. The DB transaction silently fails with `context.Canceled`, leaving the chat in `status=running` with a dead `worker_id`. ```go // Close() calls p.cancel() which cancels ctx // Then the defer tries to use the now-canceled ctx: defer func() { err := p.db.InTx(func(tx database.Store) error { tx.GetChatByIDForUpdate(ctx, chat.ID) // FAILS tx.UpdateChatStatus(ctx, ...) // FAILS }, nil) }() ``` ### Bug 2: Stale recovery runs only once at startup `recoverStaleChats()` was called only once in `start()`, not periodically. During a rolling deployment, the new instance starts while the old one is still alive (fresh heartbeat). By the time the old instance crashes, no one checks again. ## Fix 1. Use `context.WithoutCancel(ctx)` in the processChat defer — the cleanup transaction now completes even during graceful shutdown. 2. Run `recoverStaleChats` periodically — a second ticker in the `start()` loop checks for stale chats at `inFlightChatStaleAfter / 5` intervals (default: every 1 minute). This catches orphaned chats even when the instance that owns them crashes without clean shutdown. ## Tests - `TestRecoverStaleChatsPeriodically` — Verifies chats orphaned after startup are recovered by the periodic loop (not just the startup check). - `TestNewReplicaRecoversStaleChatFromDeadReplica` — Verifies a new replica recovers stale chats on startup. 
- `TestWaitingChatsAreNotRecoveredAsStale` — Negative test: `waiting` chats are not incorrectly modified by recovery.	2026-02-27 15:25:40 -05:00
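Bug 1 in the stale-chat commit above hinges on how a canceled context affects cleanup work. The sketch below shows the general pattern with `context.WithoutCancel` (Go 1.21+). The helpers `saveStatus`, `cleanup`, and `canceledContext` are hypothetical stand-ins, not functions from `coderd/chatd`; `saveStatus` only imitates a database call that refuses to run on a canceled context.

```go
package main

import (
	"context"
	"fmt"
)

// saveStatus is a hypothetical stand-in for a DB write. Like most database
// drivers, it gives up immediately if the context is already canceled.
func saveStatus(ctx context.Context, status string) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	return nil
}

// cleanup attempts the same write twice: once with the original context and
// once with a detached copy that keeps the values but ignores cancellation.
func cleanup(ctx context.Context) (canceledErr, detachedErr error) {
	canceledErr = saveStatus(ctx, "waiting")
	detachedErr = saveStatus(context.WithoutCancel(ctx), "waiting")
	return canceledErr, detachedErr
}

// canceledContext simulates the state after shutdown has canceled the
// worker's context.
func canceledContext() context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return ctx
}

func main() {
	canceledErr, detachedErr := cleanup(canceledContext())
	fmt.Println("with canceled ctx:", canceledErr)
	fmt.Println("with WithoutCancel:", detachedErr)
}
```

A detached context also drops any deadline, so real cleanup code may want to wrap it in `context.WithTimeout` to bound how long shutdown can block; whether the actual fix does so is not stated in the commit message.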
Kyle Carberry	b65c0766d2	feat: add line-based read_file tool with safety limits (#22400 ) ## Summary Adds a new line-based file reading endpoint to the workspace agent, replacing the unbounded byte-based approach for the `read_file` chat tool and `coder_workspace_read_file` MCP tool. Problem: The current `read_file` tool returns the entire file contents with no limits, which can blow up LLM context windows and cause OOM issues with large files. Solution: Inspired by [`coder/mux`](https://github.com/coder/mux) and [`openai/codex`](https://github.com/openai/codex), implement a line-based reader with safety limits. ## Changes ### Agent (`agent/agentfiles/`) - New `/read-file-lines` endpoint with `HandleReadFileLines` handler - Line-based `offset` (1-based line number, default: 1) and `limit` (line count, default: 2000) - Safety constants: \| Constant \| Value \| Purpose \| \|---\|---\|---\| \| `MaxFileSize` \| 1 MB \| Reject files larger than this at stat \| \| `MaxLineBytes` \| 1,024 \| Per-line truncation with `... 
[truncated]` marker \| \| `MaxResponseLines` \| 2,000 \| Max lines per response \| \| `MaxResponseBytes` \| 32 KB \| Max total response size \| \| `DefaultLineLimit` \| 2,000 \| Default when no limit specified \| - Line numbering format: `1\tcontent` (tab-separated) - Structured JSON response: `{ success, file_size, total_lines, lines_read, content, error }` - Hard errors when limits exceeded — tells the LLM to use `offset`/`limit` - Existing byte-based `/read-file` endpoint preserved (used by `instruction.go`) ### SDK (`codersdk/workspacesdk/`) - `ReadFileLinesResponse` type added - `ReadFileLines` method added to `AgentConn` interface - Mock regenerated ### Chat tool (`coderd/chatd/chattool/`) - `read_file` tool now uses `conn.ReadFileLines()` instead of `conn.ReadFile()` - Updated tool description to document line-based parameters - Response includes `file_size`, `total_lines`, `lines_read` metadata ### MCP tool (`codersdk/toolsdk/`) - `coder_workspace_read_file` updated to use line-based reading - Schema descriptions updated for line-based offset/limit - Removed `maxFileLimit` constant (agent handles limits now) ### Tests - 13 new test cases for `TestReadFileLines`: - Path validation (empty, relative, non-existent, directory, no permissions) - Empty file handling - Basic read, offset, limit, offset+limit combinations - Offset beyond file length - Long line truncation (>1024 bytes) - Large file rejection (>1MB) - All existing tests pass unchanged ## Design decisions \| Decision \| Rationale \| \|---\|---\| \| Line-based, not byte-based \| Both coder/mux and openai/codex use line-based — matches how LLMs reason about code \| \| Default limit of 2000 \| Matches codex; prevents accidental full-file dumps while being generous \| \| 32 KB response cap \| Compromise between mux (16 KB) and codex (no cap) \| \| 1024 byte/line truncation with marker \| More generous than codex (500), marker helps LLM know data is missing \| \| Hard errors on overflow \| Matches mux; 
forces LLM to paginate rather than getting partial data \| \| Preserve byte-based endpoint \| `instruction.go` needs raw byte access for AGENTS.md \|	2026-02-27 15:12:56 -05:00
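The line-based reading scheme in the `read_file` commit above (1-based `offset`, `limit`, per-line truncation, tab-separated line numbers, hard errors on overflow) can be illustrated with a small in-memory sketch. It is not the `HandleReadFileLines` handler: `readLines` and its error messages are assumptions, it works on a string rather than a file, and it omits the file-size check at stat time.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Limits mirror the values listed in the commit message.
const (
	maxLineBytes     = 1024
	maxResponseLines = 2000
	maxResponseBytes = 32 * 1024
	defaultLineLimit = 2000
)

// readLines is a hypothetical helper that returns up to limit lines starting
// at the 1-based line offset, formatted as "<line number>\t<content>".
func readLines(content string, offset, limit int) (string, error) {
	if offset < 1 {
		offset = 1
	}
	if limit <= 0 {
		limit = defaultLineLimit
	}
	if limit > maxResponseLines {
		return "", fmt.Errorf("limit %d exceeds the maximum of %d lines", limit, maxResponseLines)
	}
	lines := strings.Split(strings.TrimSuffix(content, "\n"), "\n")
	if offset > len(lines) {
		return "", fmt.Errorf("offset %d is beyond the file's %d lines", offset, len(lines))
	}
	var b strings.Builder
	for i := offset - 1; i < len(lines) && i < offset-1+limit; i++ {
		line := lines[i]
		if len(line) > maxLineBytes {
			// Byte-based cut for brevity; a real reader should avoid
			// splitting a multi-byte UTF-8 character here.
			line = line[:maxLineBytes] + "... [truncated]"
		}
		fmt.Fprintf(&b, "%d\t%s\n", i+1, line)
		if b.Len() > maxResponseBytes {
			// Hard error rather than partial data, so the caller paginates.
			return "", errors.New("response too large; retry with a smaller limit")
		}
	}
	return b.String(), nil
}

func main() {
	out, err := readLines("package main\n\nfunc main() {}\n", 2, 2)
	fmt.Printf("%q %v\n", out, err)
}
```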
Kyle Carberry	ff687aa780	fix: re-read chat before publishing status event to preserve AI title (#22402 ) ## Problem Chat titles revert to the fallback truncated title after briefly showing the AI-generated title. Even reloading the page doesn't help — the correct title flashes then gets overwritten. ## Root Cause Single bug, two symptoms. In `processChat` (`coderd/chatd/chatd.go`), the `chat` variable is passed by value. The flow: 1. `processChat(ctx, chat)` receives `chat` with the initial fallback title (truncated first message). 2. Inside `runChat`, `maybeGenerateChatTitle` generates an AI title, writes it to the DB via `UpdateChatByID`, and publishes a `title_change` event. The DB has the correct title. The client briefly displays it. 3. `runChat` returns. The deferred cleanup in `processChat` publishes `publishChatPubsubEvent(chat, StatusChange)` — but `chat` here is the original value copy that still has the old fallback title. 4. The frontend receives the `status_change` SSE event and unconditionally applies `title` from every event kind (see `AgentsPage.tsx` line ~305: `title: updatedChat.title`). This overwrites the correct AI title with the stale fallback. Why reload doesn't help: If the chat is still processing when the page reloads, `listChats` loads the correct title from the DB, but then the deferred `status_change` event arrives moments later and clobbers it. The title was always in the DB — it was the pubsub event that kept overwriting it. ## Fix Re-read the chat from the database in the deferred cleanup before publishing the final `status_change` event, so it carries the current (AI-generated) title.	2026-02-27 15:06:36 -05:00
Kyle Carberry	344d11fa22	feat: include OS and working directory in workspace agent prompt injection (#22399 ) When injecting system instructions into the chat prompt, include: 1. Operating system and working directory from the `workspace_agents` table 2. Home-level instructions from `~/.coder/AGENTS.md` (existing behavior) 3. Project-level instructions from `<pwd>/AGENTS.md` (new) The XML tag is renamed from `<coder-home-instructions>` to `<system-instructions>` since it now carries more than just the home instruction file. ### Example output (both files present) ```xml <system-instructions> Operating System: linux Working Directory: /home/coder/coder Source: /home/coder/.coder/AGENTS.md ... home instructions ... Source: /home/coder/coder/AGENTS.md ... project instructions ... </system-instructions> ``` ### Example output (no AGENTS.md files) ```xml <system-instructions> Operating System: linux Working Directory: /home/coder/coder </system-instructions> ``` ### Changes - `coderd/chatd/instruction.go`: - Renamed types: `homeInstructionContext` → `agentContext`, added `instructionFile` struct - Extracted `readInstructionFileAtPath` shared helper - Added `readWorkingDirectoryInstructionFile` to read `<pwd>/AGENTS.md` - Replaced `formatHomeInstruction` with `formatInstructions` that renders both files under `<system-instructions>` - `coderd/chatd/chatd.go`: - Renamed `resolveHomeInstruction` → `resolveInstructions`; now reads both home and pwd instruction files - `resolveAgentContext` returns `agentContext` (renamed from `homeInstructionContext`) - pwd file read is skipped gracefully if directory is empty or file doesn't exist - `coderd/chatd/instruction_test.go`: - Added `TestReadWorkingDirectoryInstructionFile` (success, not-found, empty-directory) - Replaced `TestFormatHomeInstruction` with `TestFormatInstructions` covering all combinations - Added ordering test (`AgentContextBeforeFiles`) to verify OS/pwd appear before file sources	2026-02-27 14:21:23 -05:00
Kyle Carberry	59cec5be65	feat: add pagination and popularity sorting to chattool list_templates (#22398 ) ## Summary The `chattool` `list_templates` tool previously returned all templates in a single response with no popularity signal. On deployments with many templates (e.g. 71 on dogfood), this wastes tokens and makes it hard for the AI to pick the right template for broad user questions. ## Changes Single file: `coderd/chatd/chattool/listtemplates.go` - `page` parameter — optional, 1-indexed, 10 results per page - Popularity sort — queries `GetWorkspaceUniqueOwnerCountByTemplateIDs` to get active developer counts, then sorts descending (most popular first). The DB query returns templates alphabetically, so this explicit sort is needed. - `active_developers` — included on each template item so the agent can see the signal - Pagination metadata — `page`, `total_pages`, `total_count` in the response so the agent knows there are more results - Updated tool description — tells the agent that results are ordered by popularity and paginated ## Frontend No frontend changes needed. The renderer already reads `rec.templates` and `rec.count` from the response — the new fields (`page`, `total_pages`, `total_count`) are additive and safely ignored.	2026-02-27 14:06:22 -05:00
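The sort-then-paginate behavior described for `list_templates` above (10 results per page, 1-indexed `page`, most active developers first, plus `total_pages`) might look roughly like this. The `template` struct and `pageByPopularity` function are illustrative names only; the real tool obtains its counts from `GetWorkspaceUniqueOwnerCountByTemplateIDs`, which is replaced here by a plain field.

```go
package main

import (
	"fmt"
	"sort"
)

// template is a hypothetical, trimmed-down item for this sketch.
type template struct {
	Name             string
	ActiveDevelopers int
}

const pageSize = 10 // results per page, per the commit message

// pageByPopularity sorts a copy of all by ActiveDevelopers (descending) and
// returns the requested 1-indexed page along with the total page count.
func pageByPopularity(all []template, page int) (items []template, totalPages int) {
	sorted := append([]template(nil), all...)
	// Stable sort keeps the incoming (alphabetical) order among ties.
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].ActiveDevelopers > sorted[j].ActiveDevelopers
	})
	totalPages = (len(sorted) + pageSize - 1) / pageSize
	if page < 1 {
		page = 1
	}
	start := (page - 1) * pageSize
	if start >= len(sorted) {
		return nil, totalPages
	}
	end := start + pageSize
	if end > len(sorted) {
		end = len(sorted)
	}
	return sorted[start:end], totalPages
}

func main() {
	all := []template{{"docker", 3}, {"kubernetes", 12}, {"aws-linux", 7}}
	items, totalPages := pageByPopularity(all, 1)
	fmt.Println(items, totalPages)
}
```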
Cian Johnston	0cfa03718e	fix(stringutil): operate on runes instead of bytes in Truncate (#22388 ) Fixes https://github.com/coder/coder/issues/22375 Updates `stringutil.Truncate` to properly handle multi-byte UTF-8 characters. Adds tests for multi-byte truncation with word boundary. Created by Mux using Opus 4.6	2026-02-27 17:46:37 +00:00
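The difference between byte-based and rune-based truncation that motivates the `stringutil.Truncate` fix above can be shown in a few lines. `truncateRunes` is a hypothetical helper written for this illustration; it is not the repository's `Truncate`, whose signature and word-boundary options are not shown in the commit message.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateRunes is a hypothetical helper that keeps at most n characters
// (runes) of s, so a multi-byte character is never cut in half.
func truncateRunes(s string, n int) string {
	if n <= 0 {
		return ""
	}
	runes := []rune(s)
	if len(runes) <= n {
		return s
	}
	return string(runes[:n])
}

func main() {
	s := "héllo" // 'é' occupies two bytes in UTF-8

	// Slicing by bytes can split 'é' and leave invalid UTF-8 behind.
	fmt.Println(utf8.ValidString(s[:2]))

	// Slicing by runes always yields valid UTF-8.
	fmt.Println(utf8.ValidString(truncateRunes(s, 2)))
}
```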
Kyle Carberry	0252205374	agents: do not use bridge config vars for models (#22392)	2026-02-27 12:24:38 -05:00
Kyle Carberry	edee917d88	feat: add experimental agents support (#22290 ) feat: add AI chat system with agent tools and chat UI Introduce the chatd subsystem and Agents UI for AI-powered chat within Coder workspaces. - Add chatd package with chat loop, message compaction, prompt management, and LLM provider integration (OpenAI, Anthropic) - Add agent tools: create workspace, list/read templates, read/write/ edit files, execute commands - Add chat API endpoints with streaming, message editing, and durable reconnection - Add database schema and migrations for chats, chat messages, chat providers, and chat model configs - Add RBAC policies and dbauthz enforcement for chat resources - Add Agents UI pages with conversation timeline, queued messages list, diff viewer, and model configuration panel - Add comprehensive test coverage including coderd integration tests, chatd unit tests, and Storybook stories - Gate feature behind experiments flag --------- Co-authored-by: Cian Johnston <cian@coder.com> Co-authored-by: Danielle Maywood <danielle@themaywoods.com> Co-authored-by: Jeremy Ruppel <jeremy@coder.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-27 16:50:56 +00:00
Susana Ferreira	ca234f346d	fix: mark presets as validation_failed to prevent endless prebuild retries (#22085 ) ## Description - Updates `wsbuilder` to return a `BuildError` with `http.StatusBadRequest` to signify a "validation error" on missing or invalid parameters - Adds a short-circuit in `prebuilds.StoreReconciler` to mark presets for which creating a build returns a "validation error" as "validation failed" and skip further attempts to reconcile. - Adds a test to verify the above - Introduces a new Prometheus metric `coderd_prebuilt_workspaces_preset_validation_failed` to track the above Closes: https://github.com/coder/coder/issues/21237 --------- Co-authored-by: Cian Johnston <cian@coder.com>	2026-02-27 14:26:48 +00:00
Dean Sheather	bef7eb9dcc	fix: avoid derp-related panic during wsproxy registration (#22322 )	2026-02-27 00:07:14 +11:00
blinkagent[bot]	d140920248	fix(coderd): bump taskname default model from Claude 3.5 Haiku to Claude Haiku 4.5 (#22304 ) Claude 3.5 Haiku (`claude-3-5-haiku-20241022`) was retired by Anthropic on February 19th, 2026. Requests to this model now return errors. Switch to Claude Haiku 4.5 (`claude-haiku-4-5`), which is the [recommended replacement](https://docs.anthropic.com/en/docs/resources/model-deprecations). --- One-line change in `coderd/taskname/taskname.go` L25: ```diff - defaultModel = anthropic.ModelClaude3_5HaikuLatest + defaultModel = anthropic.ModelClaudeHaiku4_5 ``` Co-authored-by: blink-so[bot] <211532188+blink-so[bot]@users.noreply.github.com>	2026-02-25 16:38:04 +00:00
Zach	2bac4eb739	fix: use time.Equal() for external auth token expiry comparison (#22295 ) The listen loop in workspaceAgentsExternalAuthListen compared OAuthExpiry using == which compares `time.Time` internal struct fields including the `time.Location` pointer. `time.LoadLocation` does not cache the returned `Location` pointer, so each lib/pq connection gets a distinct pointer for the same timezone. When `pq.ParseTimestamp()` applies the connection's location to a parsed timestamp, the resulting time.Time embeds that connection-specific pointer. If the `sql.DB` pool hands out different connections for the two GetExternalAuthLink reads, the identical timestamp produces `time.Time` values where == returns false despite representing the same instant. This is intermittent because the pool _usually_ reuses the same connection for sequential queries. This change uses `.Equal()` to compare instants regardless of location. Also makes the test's validation call counter atomic to fix a possible data race between the HTTP server and test goroutines.	2026-02-25 08:45:00 -07:00
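The `==` versus `.Equal()` pitfall from the external auth commit above can be reproduced without a database. In this sketch, `parseOnConnection` is a hypothetical stand-in for a timestamp parsed on a given connection: each call builds its own `*time.Location` for the same zone, which mimics the distinct per-connection pointers described in the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// parseOnConnection is a hypothetical stand-in for reading a timestamp over
// one pooled connection. Every call allocates a fresh *time.Location for the
// same named zone.
func parseOnConnection(unixSeconds int64) time.Time {
	loc := time.FixedZone("MST", -7*60*60)
	return time.Unix(unixSeconds, 0).In(loc)
}

// sameInstant compares the instants only, ignoring the Location pointer.
func sameInstant(a, b time.Time) bool {
	return a.Equal(b)
}

func main() {
	a := parseOnConnection(1772034300)
	b := parseOnConnection(1772034300)

	// == compares the struct fields, including the Location pointer, so two
	// values for the same instant can still differ.
	fmt.Println("==    :", a == b)

	// Equal reports whether the two values represent the same instant.
	fmt.Println("Equal :", sameInstant(a, b))
}
```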
Jake Howell	d2787df442	feat: add AI Bridge request logs model filter (#22230) This pull request adds a model filter so that AI Bridge request logs can be narrowed down to the model the user actually used. - Add `GET /aibridge/models` API endpoint that returns distinct model names from AI Bridge interceptions, with pagination and search support - New `ListAIBridgeModels` SQL query using case-sensitive prefix matching (`LIKE model || '%'`) to allow B-tree index usage - Hand-written `ListAuthorizedAIBridgeModels` in `modelqueries.go` for RBAC authorization filter injection - `AIBridgeModels` search query parser in `searchquery/search.go` (defaults bare terms to the `model` field) - dbauthz wrappers, dbmetrics, and dbmock implementations for the new query	2026-02-26 02:40:45 +11:00
Mathias Fredriksson	d2f33932c0	test(coderd): remove provisioner daemon from SendToNonActiveStates test (#22298) This change fixes a test flake by disabling the provisioner daemon, which was modifying jobs created by dbgen. Fixes coder/internal#1367	2026-02-25 13:14:32 +02:00
Garrett Delfosse	4057363f78	fix(coderd): add organization_name label to insights Prometheus metrics (#22296) ## Description When multiple organizations have templates with the same name, the Prometheus `/metrics` endpoint returns HTTP 500 because Prometheus rejects duplicate label combinations. The three `coderd_insights_` metrics (`coderd_insights_templates_active_users`, `coderd_insights_applications_usage_seconds`, `coderd_insights_parameters`) used only `template_name` as a distinguishing label, so two templates named e.g. `"openstack-v1"` in different orgs would produce duplicate metric series. This adds `organization_name` as a label to all three insight metric descriptors to disambiguate templates across organizations. ## Changes `coderd/prometheusmetrics/insights/metricscollector.go`: - Added `organization_name` label to all three metric descriptors - Added `organizationNames` field (template ID → org name) to the `insightsData` struct - In `doTick`: after fetching templates, collect unique org IDs, fetch organizations via `GetOrganizations`, and build a template-ID-to-org-name mapping - In `Collect()`: pass the organization name as an additional label value in every `MustNewConstMetric` call `coderd/prometheusmetrics/insights/testdata/insights-metrics.json`: Updated golden file to include `organization_name=coder` in all metric label keys. Fixes #21748	2026-02-25 08:58:50 +00:00
Steven Masley	93e823931b	fix: allow sharing ports >9999 (#22273 ) Closes https://github.com/coder/coder/issues/22267	2026-02-24 23:46:43 -06:00
Cian Johnston	6336fee3a7	feat: add telemetry for task lifecycle events (#21922 ) Relates to https://github.com/coder/internal/issues/1259 Adds new database queries and telemetry collection functions to gather task lifecycle events (pause/resume cycles, idle time) for analytics. Task events track pause/resume activity, idle duration before pausing, paused duration, and time from resume to first app status, filtered to recent activity based on the telemetry snapshot interval. 🤖 Created with Mux (Opus 4.6).	2026-02-24 17:04:42 +00:00
Danielle Maywood	974ca3eda6	fix: use "idle timeout" as task auto-pause reason (#22287 )	2026-02-24 16:45:56 +00:00
Kacper Sawicki	1e274063d4	feat(coderd): filter expired API tokens server-side (#22263 ) ## Summary Moves expired token filtering from client-side to server-side by adding an `include_expired` parameter to the `GetAPIKeysByLoginType` and `GetAPIKeysByUserID` database queries. This is more efficient for large deployments with many expired/short-lived tokens. ## Changes - Add `include_expired` parameter to SQL queries using `OR` short-circuit - Add `include_expired` query parameter to `GET /users/{user}/keys/tokens` - Add `IncludeExpired` field to `codersdk.TokensFilter` - Remove client-side filtering from CLI `tokens list` command - Add `TestTokensFilterExpired` test Fixes coder/internal#1357	2026-02-24 15:27:03 +00:00
Spike Curtis	393b3874ac	feat: add UpdateAppStatus to the workspace agent API (#22219) Part of https://github.com/coder/coder/issues/21335 This moves updating app status (used by Tasks) into the workspace agent API over dRPC. This will allow us to update the status without having to re-authenticate each time, like we would with an HTTP PATCH request. Further PRs in this stack will pipe these requests through from the CLI MCP server to the agentsock and finally to this dRPC call to coderd.	2026-02-24 13:26:55 +04:00
Jon Ayers	0a7a3da178	fix: exclude provisioner_state from workspace_build_with_user view (#22159 ) The provisioner state for a workspace build was being loaded for every long-lived agent rpc connection. Since this state can be anywhere from kilobytes to megabytes this can gradually cause the `coderd` memory footprint to grow over time. It's also a lot of unnecessary allocations for every query that fetches a workspace build since only a few callers ever actually reference the provisioner state. This PR removes it from the returned workspace build and adds a query to fetch the provisioner state explicitly.	2026-02-23 22:46:17 -06:00
Sushant P	37a8e61ea2	chore: move Shared Workspaces from experiments to beta (#22206 ) * Removed the shared-workspaces experiment and cleaned up related middleware * Added beta tagging to the UI for shared workspaces	2026-02-23 08:30:32 -08:00
Thomas Kosiewski	b776a14b46	fix(coderd): harden OAuth2 provider security (#22194 ) ## Summary Harden the OAuth2 provider with multiple security fixes addressing `coder/security#121` (CSRF session takeover) and converge on OAuth 2.1 compliance. ### Security Fixes \| Fix \| Description \| Commits \| \|-----\|-------------\|---------\| \| CSRF on `/oauth2/authorize` \| Enforce CSRF protection on the authorize endpoint POST (consent form submission) \| `ba7d646`, `b94a64e` \| \| Clickjacking: `frame-ancestors` CSP \| Prevent consent page from being iframed (`Content-Security-Policy: frame-ancestors 'none'` + `X-Frame-Options: DENY`) \| `597aeb2` \| \| Exact redirect URI matching \| Changed from prefix matching to full string exact matching per OAuth 2.1 §4.1.2.1 \| `73d64b1`, `93897f1` \| \| Store & verify `redirect_uri` \| Store redirect_uri with auth code in DB, verify at token exchange matches exactly (RFC 6749 §4.1.3) \| `50569b9`, `d7ca315` \| \| Mandatory PKCE \| Require `code_challenge` at authorization (for `response_type=code`) + unconditional `code_verifier` verification at token exchange \| `d7ca315`, `1cda1a9` \| \| Reject implicit grant \| `response_type=token` now returns `unsupported_response_type` error page (OAuth 2.1 removes implicit flow) \| `d7ca315`, `91b8863` \| ### Changes by File `coderd/httpmw/csrf.go` — Extended the CSRF `ExemptFunc` to enforce CSRF on `/oauth2/authorize` in addition to `/api` routes. The consent form POST is now CSRF-protected to prevent cross-site authorization code theft. `site/site.go` — Added `Content-Security-Policy: frame-ancestors 'none'` and `X-Frame-Options: DENY` headers to `RenderOAuthAllowPage` (consent page only — does not affect the SPA/global CSP used by AI tasks). `coderd/httpapi/queryparams.go` — Changed `RedirectURL` from prefix matching (`strings.HasPrefix(v.Path, base.Path)`) to full URI exact matching (`v.String() != base.String()`), comparing scheme, host, path, and query. 
`coderd/oauth2provider/authorize.go` — Added PKCE enforcement: `code_challenge` is required when `response_type=code` (via a conditional check, not `RequiredNotEmpty`, so `response_type=token` can reach the explicit rejection path). `ShowAuthorizePage` (GET) validates `response_type` before rendering and returns a 400 error page for unsupported types. `ProcessAuthorize` (POST) stores the `redirect_uri` with the auth code when explicitly provided. `coderd/oauth2provider/tokens.go` — PKCE verification is now unconditional (not gated on `code_challenge` being present in DB). If the stored code has a `redirect_uri`, the token endpoint verifies it matches exactly — mismatch returns `errBadCode` → `invalid_grant`. Missing `code_verifier` returns `invalid_grant`. `codersdk/oauth2.go` — `OAuth2ProviderResponseTypeToken` constant and `Valid()` acceptance are kept so the authorize handler can parse `response_type=token` and return the proper `unsupported_response_type` error rather than failing at parameter validation. *`coderd/database/migrations/000421_` — Added `redirect_uri text` column to `oauth2_provider_app_codes`. ### Design Decisions `state` parameter remains optional — The plan initially required `state` via `RequiredNotEmpty`, but this was reverted in `376a753` to avoid breaking existing clients. The `state` is still hashed and stored when provided (via `state_hash` column), securing clients that opt in. `response_type=token` kept in `Valid()` — Removing it from `Valid()` would cause the parameter parser to reject the request before the authorize handler can return the proper `unsupported_response_type` error. The constant is kept for correct error handling flow. CSP scoped to consent page only — `frame-ancestors 'none'` is set only on the OAuth consent page renderer, not globally. The SPA/global CSP was previously changed to allow framing for AI tasks ([#18102](https://github.com/coder/coder/pull/18102)); this change does not regress that. 
### Out of Scope (follow-up PRs) - Bearer tokens in query strings (needs internal caller audit) - Scope enforcement on OAuth2 tokens - Rate limiting on dynamic client registration --- <details> <summary>📋 Implementation Plan</summary> # Plan: Harden OAuth2 Provider — Security Fixes + OAuth 2.1 Compliance ## Context & Why Security issue `coder/security#121` reports a critical session takeover via CSRF on the OAuth2 provider. This plan covers all remaining security fixes from that issue plus convergence on OAuth 2.1 requirements. The goal is a single PR that closes all actionable gaps. ## Current State (already committed on branch `csrf-sjx1`) \| Fix \| Status \| Commits \| \|-----\|--------\|---------\| \| Fix 1: CSRF on `/oauth2/authorize` \| ✅ Done \| `ba7d646`, `b94a64e` \| \| CSRF token in consent form HTML \| ✅ Done \| `b94a64e` \| \| `state_hash` column + storage \| ✅ Done (hash stored, but state still optional) \| `9167d83`, `b94a64e` \| \| Tests for CSRF + state hash \| ✅ Done \| `e4119b5` \| ## Remaining Work ### ~~Fix 2 — Require `state` parameter~~ (DROPPED) > Decision: Do not enforce `state` as required. The `state` parameter is still hashed and stored when provided (via `hashOAuth2State` / `state_hash` column from prior commits), but clients are not forced to supply it. This avoids breaking existing integrations that omit state. Rollback: Remove `"state"` from the `RequiredNotEmpty` call in `coderd/oauth2provider/authorize.go:42`: ```go // BEFORE (current on branch) p.RequiredNotEmpty("response_type", "client_id", "state", "code_challenge") // AFTER p.RequiredNotEmpty("response_type", "client_id", "code_challenge") ``` No test changes needed — tests already pass `state` voluntarily. ### Fix 4 — Exact redirect URI matching Currently `coderd/httpapi/queryparams.go:233` uses prefix matching: ```go // CURRENT — prefix match if v.Host != base.Host \|\| !strings.HasPrefix(v.Path, base.Path) { ``` OAuth 2.1 requires exact string matching. 
Change to: ```go // AFTER — exact match (OAuth 2.1 §4.1.2.1) if v.Host != base.Host \|\| v.Path != base.Path { ``` File: `coderd/httpapi/queryparams.go` — `RedirectURL` method Also update the error message from "must be a subset of" to "must exactly match". Additionally, store `redirect_uri` with the auth code and verify at the token endpoint (RFC 6749 §4.1.3): 1. New migration (same migration file or a new `000421`): Add `redirect_uri text` column to `oauth2_provider_app_codes` 2. Update INSERT query in `coderd/database/queries/oauth2.sql` to include `redirect_uri` 3. `coderd/oauth2provider/authorize.go`: Store `params.redirectURL.String()` when inserting the code 4. `coderd/oauth2provider/tokens.go`: After retrieving the code from DB, verify that `redirect_uri` from the token request matches the stored value exactly. Currently `tokens.go:103` calls `p.RedirectURL(vals, callbackURL, "redirect_uri")` for prefix validation only — it must compare against the stored redirect_uri from the code, not just the app's callback URL. <details> <summary>Why both exact match AND store+verify?</summary> Exact matching at the authorize endpoint prevents open redirectors (attacker can't use a sub-path). Storing and verifying at the token endpoint prevents code injection — an attacker who steals a code can't exchange it with a different redirect_uri than was originally authorized. This is required by RFC 6749 §4.1.3 and OAuth 2.1. </details> ### Fix 7 — `frame-ancestors` CSP on consent page The consent page can be iframed by a workspace app (same-site), which is the attack vector. Add a `Content-Security-Policy` header to prevent framing. 
File: `site/site.go` — `RenderOAuthAllowPage` function (~line 731) Before writing the response, add: ```go func RenderOAuthAllowPage(rw http.ResponseWriter, r *http.Request, data RenderOAuthAllowData) { rw.Header().Set("Content-Type", "text/html; charset=utf-8") // Prevent the consent page from being framed to mitigate // clickjacking attacks (coder/security#121). rw.Header().Set("Content-Security-Policy", "frame-ancestors 'none'") rw.Header().Set("X-Frame-Options", "DENY") ... ``` Both headers for defense-in-depth (CSP for modern browsers, X-Frame-Options for legacy). ### OAuth 2.1 — Mandatory PKCE Currently PKCE is checked only when `code_challenge` was provided during authorization (`tokens.go:258`): ```go // CURRENT — conditional check if dbCode.CodeChallenge.Valid && dbCode.CodeChallenge.String != "" { // verify PKCE } ``` OAuth 2.1 requires PKCE for ALL authorization code flows. Change to: File: `coderd/oauth2provider/authorize.go` — Add `"code_challenge"` to required params: ```go p.RequiredNotEmpty("response_type", "client_id", "code_challenge") ``` File: `coderd/oauth2provider/tokens.go:257-265` — Make PKCE verification unconditional: ```go // AFTER — PKCE always required (OAuth 2.1) if req.CodeVerifier == "" { return codersdk.OAuth2TokenResponse{}, errInvalidPKCE } if !dbCode.CodeChallenge.Valid \|\| dbCode.CodeChallenge.String == "" { // Code was issued without a challenge — should not happen // with the authorize endpoint enforcement, but defend in // depth. return codersdk.OAuth2TokenResponse{}, errInvalidPKCE } if !VerifyPKCE(dbCode.CodeChallenge.String, req.CodeVerifier) { return codersdk.OAuth2TokenResponse{}, errInvalidPKCE } ``` File: `codersdk/oauth2.go` — Remove `OAuth2ProviderResponseTypeToken` from the enum or reject it explicitly in the authorize handler. Currently it's defined at line 216 but the handler ignores `response_type` and always issues a code. 
We should either: - (a) Remove the `"token"` variant from the enum and reject it with `unsupported_response_type`, OR - (b) Add an explicit check in `ProcessAuthorize` that rejects `response_type=token` Option (b) is simpler and more backwards-compatible: ```go // In ProcessAuthorize, after extracting params: if params.responseType != codersdk.OAuth2ProviderResponseTypeCode { httpapi.WriteOAuth2Error(ctx, rw, http.StatusBadRequest, codersdk.OAuth2ErrorCodeUnsupportedResponseType, "Only response_type=code is supported") return } ``` ### OAuth 2.1 — Bearer tokens in query strings `coderd/httpmw/apikey.go:743` accepts `access_token` from URL query parameters. OAuth 2.1 prohibits this. However, this may be used internally (e.g., workspace apps, DERP). Need to audit callers before removing. Approach: This is a larger change with potential breakage. Mark as a separate follow-up issue rather than including in this PR. Document the finding. ### OAuth 2.1 — Removed flows ✅ Already compliant. `tokens.go` only supports `authorization_code` and `refresh_token` grant types. The implicit grant (`response_type=token`) will be explicitly rejected per the PKCE section above. ### OAuth 2.1 — Refresh token rotation ✅ Already compliant. `tokens.go:442` deletes the old API key when a refresh token is used. ## Migration Plan All DB changes can go in a single new migration (or extend 000420 if the branch is rebased before merge). Columns to add: - `redirect_uri text` on `oauth2_provider_app_codes` The `state_hash` column is already added by migration 000420. ## Implementation Order 1. Fix 7 — CSP headers on consent page (isolated, no deps) 2. ~~Fix 2 — Require `state` parameter~~ (DROPPED — state stays optional) 3. Fix 4 — Exact redirect URI matching + store/verify redirect_uri 4. PKCE mandatory — Require `code_challenge` + reject `response_type=token` 5. Rollback — Remove `"state"` from `RequiredNotEmpty` in `authorize.go` 6. Tests — Update/add tests for all changes 7. 
`make gen` after DB changes ## Out of Scope (separate PRs) - Bearer tokens in query strings (needs internal caller audit) - Scope enforcement on OAuth2 tokens - Rate limiting / quota on dynamic client registration </details> --- _Generated with [`mux`](https://github.com/coder/mux) • Model: `anthropic:claude-opus-4-6` • Thinking: `xhigh`_	2026-02-23 12:18:44 +01:00
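The unconditional PKCE check described in #22194 can be illustrated with a minimal Go sketch. This is not the code from the commit: it assumes the S256 challenge method from RFC 7636 (the commit message does not say which method `VerifyPKCE` implements), and `challengeFor` and `verifyPKCES256` are hypothetical names introduced only for this example.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"fmt"
)

// challengeFor derives an S256 code_challenge from a verifier:
// base64url without padding of SHA-256(verifier).
// Hypothetical helper, not part of coder/coder.
func challengeFor(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

// verifyPKCES256 reports whether the verifier matches the stored challenge.
// Empty inputs are always rejected, which mirrors the "PKCE always
// required" rule: a missing verifier or challenge never passes.
func verifyPKCES256(storedChallenge, verifier string) bool {
	if storedChallenge == "" || verifier == "" {
		return false
	}
	want := challengeFor(verifier)
	// Constant-time comparison avoids leaking how many bytes matched.
	return subtle.ConstantTimeCompare([]byte(want), []byte(storedChallenge)) == 1
}

func main() {
	verifier := "example-verifier-example-verifier-example-verifier"
	challenge := challengeFor(verifier)

	fmt.Println(verifyPKCES256(challenge, verifier)) // true
	fmt.Println(verifyPKCES256(challenge, "wrong"))  // false
	fmt.Println(verifyPKCES256(challenge, ""))       // false
}
```

In a token endpoint, any `false` result here would map to the `invalid_grant` response that the commit message describes.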
Zach	6a783fc5c7	fix: floor provisioner job queue wait metric (#22184 ) After a PostgreSQL round-trip, job timestamps lose their monotonic clock component, making the subtraction susceptible to wall-clock adjustments producing a small negative delta. Floor at 1ms since a zero or negative queue wait is meaningless. Fixes TestProvisionerJobQueueWaitMetric flakes where small negative values (~ -2ms) are observed.	2026-02-20 16:12:17 -07:00
Steven Masley	b0f35316da	chore!: automatically use secure cookies if using https access-url (#22198 ) `--secure-auth-cookie` now automatically sources its default value from `--access-url`. If the access URL uses HTTPS, secure is set to `true`. To revert to the old behavior, set the value explicitly to `false`.	2026-02-20 10:33:37 -06:00
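The defaulting rule in #22198 could look roughly like the following Go sketch. It is an illustration under assumptions, not the actual option-parsing code: `defaultSecureCookie` and `resolveSecureCookie` are hypothetical names, and an unset flag is modeled here as a nil `*bool`.

```go
package main

import (
	"fmt"
	"net/url"
)

// defaultSecureCookie derives the default from the access URL:
// true only when the URL parses and its scheme is https.
// url.Parse lowercases the scheme, so "HTTPS://" also matches.
func defaultSecureCookie(accessURL string) bool {
	u, err := url.Parse(accessURL)
	if err != nil {
		return false
	}
	return u.Scheme == "https"
}

// resolveSecureCookie prefers an explicitly set value (non-nil) and
// falls back to the access-URL-derived default otherwise.
func resolveSecureCookie(explicit *bool, accessURL string) bool {
	if explicit != nil {
		return *explicit
	}
	return defaultSecureCookie(accessURL)
}

func main() {
	off := false
	fmt.Println(resolveSecureCookie(nil, "https://coder.example.com"))  // true
	fmt.Println(resolveSecureCookie(nil, "http://coder.example.com"))   // false
	fmt.Println(resolveSecureCookie(&off, "https://coder.example.com")) // false
}
```

The last call corresponds to the escape hatch in the commit message: setting the value explicitly to `false` restores the old behavior even with an HTTPS access URL.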
Steven Masley	efdaaa2c8f	chore: add oidc redirect url to override access url (#21521 ) If a deployment has 2 domains, overriding the OIDC URL allows the OIDC redirect to differ from the access_url. In response to https://github.com/coder/coder/discussions/21500. This config setting is hidden by default.	2026-02-20 09:11:01 -06:00
Steven Masley	e5f64eb21d	chore: optionally prefix authentication related cookies (#22148 ) When the deployment option is enabled auth cookies are prefixed with `__HOST-` ([info](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Set-Cookie)). This is all done in a middleware that intercepts all requests and strips the prefix on incoming request cookies.	2026-02-20 09:01:00 -06:00
Jake Howell	051ed34580	feat: convert `soft_limit` to `limit` (#22048 ) In relation to [`internal#1281`](https://github.com/coder/internal/issues/1281) Remove the `soft_limit` field from the `Feature` type and simplify license limit handling. This change: - Removes the `soft_limit` field from the API and SDK - Uses the soft limit value as the single `limit` value in the UI and API - Simplifies warning logic to only show warnings when the limit is exceeded - Updates tests to reflect the new behavior - Updates the UI to use the single limit value for display	2026-02-20 16:09:12 +11:00
Garrett Delfosse	e8d6016807	fix: allow users with workspace:create for any owner to list users (#21947 ) ## Summary Custom roles that can create workspaces on behalf of other users need to be able to list users to populate the owner dropdown in the workspace creation UI. Previously, this required a separate `user:read` permission, causing the dropdown to fail for custom roles. ## Changes - Modified `GetUsers` in `dbauthz` to check if the user can create workspaces for any owner (`workspace:create` with `owner_id: *`) - If the user has this permission, they can list all users without needing explicit `user:read` permission - Added tests to verify the new behavior ## Testing - Updated mock tests to assert the new authorization check - Added integration tests for both positive and negative cases Fixes #18203	2026-02-19 13:04:53 -05:00
Danielle Maywood	911d734df9	fix: avoid re-using `AuthInstanceID` for sub agents (#22196 ) Parent agents were re-using AuthInstanceID when spawning child agents. This caused GetWorkspaceAgentByInstanceID to return the most recently created sub agent instead of the parent when the parent tried to refetch its own manifest. Fix by not reusing AuthInstanceID for sub agents, and updating GetWorkspaceAgentByInstanceID to filter them out entirely.	2026-02-19 16:56:29 +00:00
Danielle Maywood	92a6d6c2c0	chore: remove unnecessary loop variable captures (#22180 ) Since Go 1.22, the loop variable capture issue is resolved. Variables declared by for loops are now per-iteration rather than per-loop, making the 'v := v' pattern unnecessary.	2026-02-19 09:02:19 +00:00
Danielle Maywood	31c1279202	feat: notify on task auto pause, manual pause and manual resume (#22050 )	2026-02-18 16:30:16 +00:00
Kacper Sawicki	f016d9e505	fix(coderd): add role param to agent RPC to prevent false connectivity (#22052 ) ## Summary coder-logstream-kube and other tools that use the agent token to connect to the RPC endpoint were incorrectly triggering connection monitoring, causing false connected/disconnected timestamps on the agent. This led to VSCode/JetBrains disconnections and incorrect dashboard status. ## Changes Add a `role` query parameter to `/api/v2/workspaceagents/me/rpc`: - `role=agent`: triggers connection monitoring (default for the agent SDK) - any other value (e.g. `logstream-kube`): skips connection monitoring - omitted: triggers monitoring for backward compatibility with older agents The agent SDK now sends `role=agent` by default. A new `Role` field on the `agentsdk.Client` allows non-agent callers to specify a different role. ## Required follow-up coder-logstream-kube needs to set `client.Role = "logstream-kube"` before calling `ConnectRPC20()`. Without that change, it will still send `role=agent` and trigger monitoring. Fixes #21625	2026-02-18 09:44:06 +01:00
Cian Johnston	f8eea54e97	fix(coderd): use BuildReasonTaskAutoPause for task workspaces (#22126 ) Relates to https://github.com/coder/internal/issues/1252 When a workspace with a TaskID hits its deadline, use BuildReasonTaskAutoPause instead of BuildReasonAutostop. This allows downstream systems to distinguish between regular autostop and task workspace pauses. Created by Mux using Opus 4.5.	2026-02-17 15:11:04 +00:00
Paweł Banaszewski	90c11f3386	feat: add client column to aibridge_interceptions table (#21839 ) Adds `client` column to `aibridge_interceptions` table. It is set according to what is passed from AI Bridge in `RecordInterception`. Adds interception filtering by `client` value. Depends on: https://github.com/coder/aibridge/pull/158 Updates aibridge library to include this change. Fixes: https://github.com/coder/aibridge/issues/31	2026-02-17 15:43:02 +01:00
Cian Johnston	4a3304fc38	feat(cli)!: expire tokens by default (#21783 ) ## Summary > NOTE: Calling this out as a breaking change in case existing consumers of the CLI depend on being able to see expired tokens OR being able to delete tokens immediately. Updates the `coder tokens rm` command to immediately expire a token by ID, preserving the token record for audit trail purposes. Tokens can still be deleted by passing `--delete`. ## Problem During an incident on dev.coder.com, operators needed to urgently expire an API key that was stuck in a hot loop. The only way to do this was via direct database access: ```sql UPDATE api_keys SET expires_at = NOW() WHERE id = '...'; ``` This is not ideal for operators who may not have direct DB access or want to avoid manual SQL. ## Solution This PR adds: - API endpoint: `PUT /api/v2/users/{user}/keys/{keyid}/expire` - Sets the token's `expires_at` to now - SDK method: `ExpireAPIKey(ctx, userID, keyID)` - Updates CLI: `coder tokens rm <name\|id\|token>` now _expires_ by default. You can still delete by passing the `--delete` flag. The `coder tokens list` command now also hides expired tokens by default. You can `--include-expired` if needed to include them. - Audit logging: The expire action is logged with old and new key states ## Test plan - Tests cover: owner expiring own token, admin expiring other user's token, non-admin cannot expire other's token, 404 for non-existent token Closes #21782 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-17 13:16:46 +00:00
Jeremy Ruppel	0df864fb88	fix: hide "Create Workspace" button for deleted templates (#22092 ) Background Reported in #17417, there is a `deleted` query parameter supported by /api/v2/templates, but we do not respect this field on the client, showing the "Create Workspace" button for deleted templates. Expected Behavior Don't show the "Create Workspace" button for deleted templates. Notes This PR adds a new `deleted` field to the templates API response. Co-authored-by: Danielle Maywood <danielle@themaywoods.com>	2026-02-13 19:44:50 -05:00
Steven Masley	01f06671a1	chore: return 404, not 400 if missing or authz deny (#22069 )	2026-02-13 08:19:07 -06:00
Callum Styan	5f3be6b288	feat: add provisioner job queue wait time histogram and jobs enqueued counter (#21869 ) This PR adds some metrics to help identify job enqueue rates and latencies. This work was initiated as a way to help reduce the cost of the observation/measurement itself for autostart scaletests, which impacts our ability to identify/reason about the load caused by autostart. See: https://github.com/coder/internal/issues/1209 I've extended the metrics here to account for regular user initiated builds, prebuilds, autostarts, etc. IMO there is still the question here of whether we want to include or need the `transition` label, which is only present on workspace builds. Including it does lead to an increase in cardinality, and in the case of the histogram (when not using native histograms) that's at least a few extra series for every bucket. We could remove the transition label there but keep it on the counter. Additionally, the histogram is currently observing latencies for other jobs, such as template builds/version imports, those do not have a transition type associated with them. Tested briefly in a workspace, can see metric values like the following: - `coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"} 1` - `coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"} 1` --------- Signed-off-by: Callum Styan <callumstyan@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-12 13:40:47 -08:00
Cian Johnston	194d79402e	chore: remove dbmem comment references (#22056 ) 👻 The ghost of dbmem managed to live on... until now.	2026-02-12 09:06:33 +00:00
Sas Swart	47b8ca940c	feat: add an endpoint to manually resume a coder task (#21948 ) Closes https://github.com/coder/internal/issues/1262. This PR adds: * the `POST /api/experimental/tasks/{user}/{task}/resume` endpoint * follows conventions from https://github.com/coder/internal/issues/1261 * sets the build reason to `task_resume` * a task that is not paused (i.e., is already running) cannot be resumed.	2026-02-12 09:59:53 +02:00
cryptoluks	fcf431c1d7	fix(coderd/workspaceapps): prefer app session cookie over Authorization (#22041 ) This PR fixes a workspace app authentication bug where requests that include an `Authorization` header (intended for the upstream app) can cause Coder to ignore the workspace app session cookie (`coder_subdomain_app_session_token_` / `coder_path_app_session_token`). When that happens, Coder fails to mint or renew `coder_signed_app_token` and redirects to `/api/v2/applications/auth-redirect` instead of proxying the request to the workspace. This commonly shows up when users run a frontend and backend in the same workspace and the backend requires `Authorization` (for example, `curl -H "Authorization: bearer ..."` or browser `fetch()` calls). Related issues / context: * Primary bug report and repro: [https://github.com/coder/coder/issues/21467](https://github.com/coder/coder/issues/21467) * Related symptoms reported as CORS / redirect failures for workspace apps: * [https://github.com/coder/coder/issues/20667](https://github.com/coder/coder/issues/20667) * [https://github.com/coder/coder/issues/19728](https://github.com/coder/coder/issues/19728) ## Root Cause In `coderd/workspaceapps/cookies.go`, `AppCookies.TokenFromRequest` checked `httpmw.APITokenFromRequest(r)` first. That helper returns a token from several places, including `Authorization: Bearer ...`. As a result, when a request included an upstream `Authorization` header, that header value was returned as the “session token” for the app proxy, and `coder_subdomain_app_session_token_` was never read. Authentication then failed and the request was treated as signed out. ## Fix Change the precedence in `AppCookies.TokenFromRequest`: 1. First check the access-method-specific cookie: * subdomain apps: `coder_subdomain_app_session_token_{hash}` * path apps: `coder_path_app_session_token` 2. 
If not present, fall back to `httpmw.APITokenFromRequest(r)` (so non-browser clients can still authenticate via query, header, or bearer tokens if they really want to). This ensures that: * Backend requests that require `Authorization` still reach the workspace. * `coder_signed_app_token` can be renewed from the app session cookie even when `Authorization` is present. * `Authorization` is still forwarded to the upstream app (the reverse proxy code does not strip it). Initially, I attempted workarounds ([https://github.com/coder/coder/issues/20667#issuecomment-3868578388](https://github.com/coder/coder/issues/20667#issuecomment-3868578388), [https://github.com/coder/coder/issues/19728#issuecomment-3868578093](https://github.com/coder/coder/issues/19728#issuecomment-3868578093)), but adding `/auth-redirect` to the permissive CORS paths and extending the validity of workspace app auth tokens from 1 minute to 1 hour only partially masked the issue. After workspace restarts and token expiry, I no longer saw CORS errors, but the tokens were still not renewed. After patching my local Nix-based setup on Coder v1.30.0 with this change, I can no longer observe this behavior.	2026-02-11 23:18:49 +11:00
George K	be94af386c	chore(coderd/database): enforce workspace ACL JSON object constraints (#22019 ) The constraints prevent faulty code from saving 'null' as JSON and breaking the `workspaces_expanded` view.	2026-02-10 16:17:29 -08:00

3372 Commits