coder

mirror of https://github.com/coder/coder.git synced 2026-06-02 20:48:20 +00:00

Author	SHA1	Message	Date
Garrett Delfosse	78d4cf9e47	fix: soft-delete stale workspace agents on new build (#25207 )	2026-05-18 08:33:29 -04:00
Zach	79735f2d45	feat: plumb user secrets through provisioner chain to terraform (#24542 ) This change passes user secrets from coderd to the Terraform process at workspace build time so the `data.coder_secret` data source in terraform-provider-coder can resolve values at plan time. Secrets traverse two proto hops: `provisionerdserver` fetches them via `ListUserSecretsWithValues` and attaches them to `AcquiredJob.WorkspaceBuild.user_secrets` on `provisionerd.proto`; `runner.go` forwards them into `PlanRequest.user_secrets` on `provisioner.proto`; the Terraform provisioner then encodes each as `CODER_SECRET_ENV_<name>` or `CODER_SECRET_FILE_<hex(path)>` before invoking `terraform plan`. Only plan requests carry secrets; apply runs with `nil` because values are baked into plan state. Fetch is gated on a workspace transitioning to start. Stop and delete transitions never carry secrets, so revoking or deleting a stored secret cannot make a workspace unstoppable. DB errors on the fetch fail the job outright rather than silently continuing with an empty secret set. Note that user secrets will be stored in the workspace_builds table in provisioner_state with other Terraform state (including other sensitive data).	2026-04-27 08:26:07 -06:00
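The naming scheme in the commit above can be illustrated with a small Go sketch. This is not the provisioner's code: the helper names `secretEnvName` and `secretFileName` are hypothetical, and lowercase hex from `encoding/hex` is an assumption, since the message only says `hex(path)`.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// secretEnvName builds the variable name for a secret injected as an
// environment variable. The prefix is taken from the commit message;
// the helper itself is hypothetical.
func secretEnvName(name string) string {
	return "CODER_SECRET_ENV_" + name
}

// secretFileName builds the variable name for a secret injected as a file.
// The path is hex-encoded so that characters such as '/' cannot appear in
// the variable name. Lowercase hex is an assumption.
func secretFileName(path string) string {
	return "CODER_SECRET_FILE_" + hex.EncodeToString([]byte(path))
}

func main() {
	fmt.Println(secretEnvName("GITHUB_TOKEN"))
	fmt.Println(secretFileName("/a"))
}
```

Hex-encoding keeps arbitrary file paths representable as environment variable names, at the cost of readability.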
Michael Suchacz	e5707a13d6	feat: support multiple agents with shared instance-identity auth (#24325 ) > This PR was authored by Mux on behalf of Mike. ## Summary Adds support for multiple peer root workspace agents sharing the same `auth_instance_id`, so AWS, Azure, and GCP instance-identity auth can issue the correct session token for a selected agent instead of assuming a single root agent per instance. ## Problem When a Terraform template attaches two or more `coder_agent` resources (with `auth = "aws-instance-identity"`) to a single compute instance, every agent shares the same cloud instance ID. The existing singular lookup picks whichever agent was created most recently, silently ignoring the others. ## Solution Introduce an optional pre-auth agent selector (`CODER_AGENT_NAME`) and make the server-side lookup ambiguity-aware. Database layer: - `GetWorkspaceAgentsByInstanceID` (`:many`): returns all matching root agents for an instance ID. - `GetWorkspaceAgentByInstanceIDAndName` (`:one`): returns the named root agent for disambiguation. SDK and CLI: - `agent_name` field added to AWS, Azure, and GCP request structs (`omitempty` for backward compatibility). - `CODER_AGENT_NAME` env var and `--agent-name` flag wired into the agent bootstrap before instance-identity auth runs. Server handler (`handleAuthInstanceID`): - When `agent_name` is present: direct lookup by (instance ID, name). - When absent: legacy lookup, then resource-scoped ambiguity check. Returns 409 with available agent names if multiple root agents match. - Whitespace-only names are trimmed and treated as unspecified. - Sub-agents remain excluded (`parent_id IS NULL` filter). Verification template: - `examples/templates/aws-multi-agent/` provisions one EC2 instance with two agents (`main` and `dev`), both using instance-identity auth with `CODER_AGENT_NAME` set in the cloud-init user data. ## Backward compatibility Existing single-agent deployments work unchanged. 
The `agent_name` field is optional with `omitempty`, and the unnamed path preserves today's behavior when only one root agent matches.	2026-04-16 13:59:09 +02:00
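The ambiguity-aware lookup described in this commit can be sketched roughly as follows. It is an in-memory illustration only: the `agent` type, `selectAgent`, and both error values are hypothetical stand-ins for the database queries and the 409 response named in the message.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// agent is a hypothetical stand-in for a root workspace agent row.
type agent struct{ Name string }

var (
	// errAmbiguous models the 409 response returned when several root
	// agents share one instance ID and no name was supplied.
	errAmbiguous = errors.New("multiple root agents match; specify an agent name")
	errNotFound  = errors.New("no matching agent")
)

// selectAgent picks one agent among those sharing an instance ID.
// A whitespace-only name is treated as unspecified.
func selectAgent(agents []agent, name string) (agent, error) {
	name = strings.TrimSpace(name)
	if name != "" {
		// Named path: direct lookup by (instance ID, name).
		for _, a := range agents {
			if a.Name == name {
				return a, nil
			}
		}
		return agent{}, errNotFound
	}
	// Unnamed path: only unambiguous when exactly one agent matches.
	switch len(agents) {
	case 0:
		return agent{}, errNotFound
	case 1:
		return agents[0], nil
	default:
		return agent{}, errAmbiguous
	}
}

func main() {
	agents := []agent{{Name: "main"}, {Name: "dev"}}
	_, err := selectAgent(agents, "")
	fmt.Println(err)
	a, _ := selectAgent(agents, "dev")
	fmt.Println(a.Name)
}
```

The unnamed path with a single match is what keeps existing single-agent deployments working without setting `CODER_AGENT_NAME`.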
Sas Swart	5b6b7719df	fix: make prebuild claiming durable and idempotent (#23108 ) ## Problem When a prebuilt workspace is claimed, the agent reinitializes via a single fire-and-forget pubsub event over SSE. If the agent's SSE connection is interrupted at claim time, the event is permanently lost — the workspace is stuck with no self-healing path. Additionally, regular (non-prebuild) workspaces had no way to opt out of the `/reinit` polling loop — agents would reconnect indefinitely to an endpoint that would never send them anything useful. ## Root Cause `workspaceAgentReinit` fetches the workspace (with its current `owner_id`) via `GetWorkspaceByAgentID`, but never checked whether a claim already happened. It only subscribed to pubsub for future events. The database already has durable claim state (`owner_id` changes from `PrebuildsSystemUserID` to the real user), but no layer ever consulted it on reconnection. ## Solution ### Server-side durable check with first-build-initiator gating TOCTOU-safe ordering: Subscribe to pubsub claim events before any durable checks, so a claim that fires during the check is buffered in the channel rather than lost. First-build-initiator gating: When `!workspace.IsPrebuild()` (owner is no longer the system user), look up the first build's `InitiatorID`. The prebuild reconciler always uses `PrebuildsSystemUserID` as the initiator. This distinguishes claimed prebuilds from regular workspaces without any SQL schema changes. 
- Regular workspace (first build initiator ≠ system user) → 409 Conflict, agent stops reconnecting - Claimed prebuild, build completed → pre-seed channel with reinit event and close it, transmitter delivers one-shot then exits - Claimed prebuild, build in-progress → fall through to pubsub subscription, agent waits for completion event - Unclaimed prebuild → pubsub subscription (existing happy path) ### Declarative reinit events (defense-in-depth) - Added `UserID` field to `ReinitializationEvent` with JSON tags - Switched pubsub serialization from raw string to JSON (with backward-compat fallback for rolling upgrades) - Populated `UserID` at both the publish site and the durable check ### Agent SDK: 409 handling `WaitForReinitLoop` detects 409 Conflict from the server and closes the `reinitEvents` channel, cleanly exiting the retry goroutine. ### Agent CLI: fixed two bugs + added reinitCtx - Closed channel (`!ok`): now blocks on `<-ctx.Done()` instead of `continue`, keeping the current agent running. Previously this would leak agents by skipping `agnt.Close()` and re-entering the loop. - Duplicate owner reinit: cancels `reinitCtx` (stops the reinit goroutine), then blocks on `<-ctx.Done()`. Previously `continue` would skip cleanup and create a new agent on the next loop iteration. - `reinitCtx`: a cancellable child of `ctx` passed to `WaitForReinitLoop`, allowing the agent to stop the reinit HTTP polling after reinit completes. ### Agent-side idempotency Tracks `lastOwnerID` in the agent reinit loop — duplicate events for the same owner are skipped. 
## Testing - "unclaimed prebuild receives reinit via pubsub": prebuild owned by system user, pubsub event triggers reinit - "claimed prebuild receives one-shot reinit on reconnect": first build by system user, owner changed, build completed → immediate reinit (no pubsub needed) - "claimed prebuild waits during in-progress claim build": claimed but build still running → no reinit until build completes - "regular workspace gets 409": first build by real user → 409 Conflict, agent stops polling - Updated claim publisher/listener tests: verify `UserID` survives JSON round-trip + backward compat with raw string payloads - Updated SSE round-trip test: verify `UserID` survives transmit → receive cycle Fixes #22359 ## Rolling upgrade note During a rolling deploy where old coderd instances coexist with new ones, the pubsub `ReinitializationEvent` has a new `workspace_id` field (JSON key `workspace_id`). Old publishers send a raw reason string instead of JSON; the new listener gracefully falls back by treating the entire payload as the reason and filling in `WorkspaceID` from context. The only visible effect during the upgrade window is that `WorkspaceID` may be the zero UUID in agent-side logs — this is cosmetic and resolves once all instances are updated.	2026-04-02 23:51:02 +02:00
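The backward-compatible pubsub decoding mentioned above (JSON payloads from new publishers, raw reason strings from old ones) might look like the sketch below. The `reinitEvent` struct, its JSON keys, and `parseReinitPayload` are assumptions for illustration, not the actual `ReinitializationEvent` definition.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// reinitEvent is a hypothetical, simplified event shape.
type reinitEvent struct {
	Reason string `json:"reason"`
	UserID string `json:"user_id"`
}

// parseReinitPayload decodes a JSON payload and, if that fails, falls back
// to treating the whole payload as the reason, which is how a listener can
// tolerate older publishers during a rolling upgrade.
func parseReinitPayload(payload []byte) reinitEvent {
	var ev reinitEvent
	if err := json.Unmarshal(payload, &ev); err != nil {
		return reinitEvent{Reason: string(payload)}
	}
	return ev
}

func main() {
	fmt.Printf("%+v\n", parseReinitPayload([]byte(`{"reason":"prebuild_claimed","user_id":"u1"}`)))
	fmt.Printf("%+v\n", parseReinitPayload([]byte("prebuild_claimed")))
}
```

With this shape, fields that old publishers cannot supply are simply left at their zero value until every instance has been upgraded.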
Cian Johnston	3f55b35f68	refactor: replace AsSystemRestricted with narrower actors (#23712 ) Replace overly-broad `AsSystemRestricted` with purpose-built actors: - OAuth2 provider paths → `AsSystemOAuth2` (13 call sites across `tokens.go`, `registration.go`, `apikey.go`) - Provisioner daemon health read → `AsSystemReadProvisionerDaemons` (1 site in `healthcheck/provisioner.go`) - Provisionerd file cache paths → `AsProvisionerd` (2 sites in `provisionerdserver.go`, matching existing usage nearby) <details> <summary>Implementation notes</summary> Each replacement actor is a strict subset of `AsSystemRestricted`. Every DB method at each call site is already covered by the narrower actor's permissions: - `subjectSystemOAuth2`: OAuth2App/Secret/CodeToken (all), ApiKey (Read, Delete), User (Read), Organization (Read) - `subjectSystemReadProvisionerDaemons`: ProvisionerDaemon (Read) - `subjectProvisionerd`: File (Create, Read) plus provisionerd-scoped resources No new permissions added. `nolint:gocritic` comments updated to reflect the new actors. </details> > 🤖 Created by a Coder Agent, reviewed by me.	2026-03-27 15:08:30 +00:00
Kacper Sawicki	1e07ec49a6	feat: add merge_strategy support for coder_env resources (#23107 ) ## Description Implements the server-side merge logic for the `merge_strategy` attribute added to `coder_env` in [terraform-provider-coder v2.15.0](https://github.com/coder/terraform-provider-coder/pull/489). This allows template authors to control how duplicate environment variable names are combined across multiple `coder_env` resources. Relates to https://github.com/coder/coder/issues/21885 ## Supported strategies \| Strategy \| Behavior \| \|----------\|----------\| \| `replace` (default) \| Last value wins — backward compatible \| \| `append` \| Joins values with `:` separator (e.g. PATH additions) \| \| `prepend` \| Prepends value with `:` separator \| \| `error` \| Fails the build if the variable is already defined \| ## Example ```hcl resource "coder_env" "path_tools" { agent_id = coder_agent.dev.id name = "PATH" value = "/home/coder/tools/bin" merge_strategy = "append" } ``` ## Changes - Proto: Added `merge_strategy` field to `Env` message in `provisioner.proto` - State reader: Updated `agentEnvAttributes` struct and proto construction in `resources.go` - Merge logic: Added `mergeExtraEnvs()` function in `provisionerdserver.go` with strategy-aware merging for both agent envs and devcontainer subagent envs - Tests: 15 unit tests covering all strategies, edge cases (empty values, mixed strategies, multiple appends) - Dependency: Bumped `terraform-provider-coder` v2.14.0 → v2.15.0 - Fixtures: Updated `duplicate-env-keys` test fixtures and golden files ## Ordering When multiple resources `append` or `prepend` to the same key, they are processed in alphabetical order by Terraform resource address (per the determinism fix in #22706).	2026-03-18 15:43:28 +01:00
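The strategy table in this commit can be read as the following merge loop. This is a minimal sketch and not the `mergeExtraEnvs()` implementation: the `env` type and `mergeEnvs` are hypothetical, inputs are assumed to arrive already sorted by resource address, and the handling of empty existing values is a guess.

```go
package main

import (
	"fmt"
)

// env is a hypothetical, simplified view of one coder_env resource.
type env struct {
	Name, Value, MergeStrategy string
}

// mergeEnvs combines duplicate variable names in input order.
func mergeEnvs(envs []env) (map[string]string, error) {
	out := make(map[string]string)
	for _, e := range envs {
		cur, exists := out[e.Name]
		switch e.MergeStrategy {
		case "", "replace":
			// Default: last value wins.
			out[e.Name] = e.Value
		case "append":
			if exists && cur != "" {
				out[e.Name] = cur + ":" + e.Value
			} else {
				out[e.Name] = e.Value
			}
		case "prepend":
			if exists && cur != "" {
				out[e.Name] = e.Value + ":" + cur
			} else {
				out[e.Name] = e.Value
			}
		case "error":
			if exists {
				return nil, fmt.Errorf("env %q is already defined", e.Name)
			}
			out[e.Name] = e.Value
		default:
			return nil, fmt.Errorf("unknown merge strategy %q", e.MergeStrategy)
		}
	}
	return out, nil
}

func main() {
	merged, err := mergeEnvs([]env{
		{Name: "PATH", Value: "/usr/bin"},
		{Name: "PATH", Value: "/home/coder/tools/bin", MergeStrategy: "append"},
	})
	fmt.Println(merged["PATH"], err)
}
```

Because the result depends on processing order, deterministic ordering of the inputs matters whenever more than one resource appends or prepends to the same key.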
Steven Masley	84de391f26	chore: add tallyman events for ai seat tracking (#22689 ) AI seat tracking inserted as heartbeat into usage table.	2026-03-18 09:30:22 -05:00
Steven Masley	abf59ee7a6	feat: track ai seat usage (#22682 ) When a user uses an AI feature, we record them in the `ai_seat_state` as consuming a seat. Added debouncing to prevent excessive writes to the db for this feature. There is no need for frequent updates.	2026-03-16 12:36:26 -05:00
Mathias Fredriksson	703b974757	fix(coderd): remove false devcontainers early access warning (#23056 ) The script source claimed Dev Containers are early access and told users to set CODER_AGENT_DEVCONTAINERS_ENABLE=true, which already defaults to true. Clear the script source and set RunOnStart to false since there is nothing to run.	2026-03-16 10:16:14 +02:00
Callum Styan	36665e17b2	feat: add WatchAllWorkspaceBuilds endpoint for autostart scaletests (#22057 ) This PR adds a `WatchAllWorkspaces` function with `watch-all-workspaces` endpoint, which can be used to listen on a single global pubsub channel for _all_ workspace build updates, and makes use of it in the autostart scaletest. This negates the need to use a workspace watch pubsub channel _per_ workspace, which has auth overhead associated with each call. This is especially relevant in situations such as the autostart scaletest, where we need to start/stop a set of workspaces before we can configure their autostart config. The overhead associated with all the watch requests skews the scaletest results and makes it harder to reason about the performance of the autostart feature itself. The autostart scaletest also no longer generates its own metrics nor does it wait for all the workspaces to actually start via autostart. We should update the scaletest dashboard after both PRs are merged to measure autostart performance via the new metrics. The new function/endpoint and its usage in the autostart scaletest are gated behind an experiment feature flag, this is something we should discuss whether we want to enable the endpoint in prod by default or not. If so, we can remove the experiment. --------- Signed-off-by: Callum Styan <callumstyan@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Callum Styan <callum@coder.com>	2026-03-13 20:37:41 -07:00
Mathias Fredriksson	9d33c340ec	fix(coderd): handle ignored errors across coderd packages (#22851 ) Handle previously ignored error return values in coderd: - coderd/chats.go: check sendEvent errors, log on failure - coderd/chatd/chattest: thread testing.TB through server structs, replace log.Printf with t.Logf, check writeSSEEvent errors - coderd/chatd/chattool/createworkspace.go: log UpdateChatWorkspace failure instead of discarding both return values - coderd/chatd/chattool/execute.go: surface ProcessOutput error in the timeout message returned to the caller - coderd/provisionerdserver: log stream.Send failure in the DownloadFile error helper	2026-03-13 19:53:20 +02:00
Kyle Carberry	d39f69f4c2	fix: avoid mutating proto App.Healthcheck in insertAgentApp (#22954 ) ## Problem `insertAgentApp` mutated its input by writing to `app.Healthcheck` when it was nil (line 3525): ```go if app.Healthcheck == nil { app.Healthcheck = &sdkproto.Healthcheck{} // mutation! } ``` The Devcontainers subtests share the same `tt.resource` pointer across two parallel goroutines (`WithProtoIDs` and `WithoutProtoIDs`), causing a data race on the `Healthcheck` field (and its sub-fields `Url`, `Interval`, `Threshold`). ## Fix Replace the in-place mutation with a local variable: ```go healthcheck := app.GetHealthcheck() if healthcheck == nil { healthcheck = &sdkproto.Healthcheck{} } ``` This avoids writing back to the shared proto message. All downstream reads now use the local `healthcheck` variable.	2026-03-11 16:29:10 +00:00
Steven Masley	537260aa22	fix: early oidc refresh with fake idp tests (#22712 ) Wrote unit tests that implement a fake idp to verify the oauth package actually refreshes the token	2026-03-06 16:51:27 +00:00
Steven Masley	c805c8c02c	chore: setting time forward for expiration math (#22687 ) It was set backwards, which allowed invalid refresh tokens and made things worse.	2026-03-06 12:29:54 +00:00
Cian Johnston	81468323e0	fix(coderd): use dbtime.Now() instead of time.Now() in test assertions against DB timestamps (#22685 ) `time.Now()` has nanosecond precision while Postgres timestamps are microsecond precision. When tests compare `time.Now()` against DB-sourced timestamps using `Before`/`After`/`WithinRange`/etc., there is a non-zero flake risk from the precision mismatch. This replaces `time.Now()` with `dbtime.Now()` (which rounds to microsecond precision) in all test assertions that compare against database timestamps. Follows from #22684. ## Changes (11 files) \| File \| Changes \| \|---\|---\| \| `coderd/apikey_test.go` \| 11 comparisons with `ExpiresAt` \| \| `coderd/users_test.go` \| 2 comparisons with `ExpiresAt` \| \| `coderd/oauth2_test.go` \| 1 comparison with `token.Expiry` \| \| `coderd/workspaces_test.go` \| 2 comparisons with `DormantAt` \| \| `coderd/workspaceagents_test.go` \| 3 comparisons with `ConnectedAt`/`DisconnectedAt` \| \| `coderd/workspaceapps/db_test.go` \| 1 comparison with `token.Expiry` \| \| `coderd/provisionerdserver/provisionerdserver_test.go` \| 1 comparison with `key.ExpiresAt` \| \| `enterprise/coderd/workspaces_test.go` \| 1 comparison with `DormantAt` \| \| `enterprise/coderd/license/license_test.go` \| 3 `NotBefore` values \| \| `enterprise/coderd/licenses_test.go` \| 2 `NotBefore` values \| \| `enterprise/coderd/users_test.go` \| 3 `Next()` comparisons \| ## Not changed (intentionally) - `scaletest/placebo/run_test.go` — compares wall-clock elapsed time, not DB timestamps - `cli/server_test.go`, `coderd/jwtutils/jwt_test.go`, `enterprise/aibridgeproxyd/aibridgeproxyd_test.go` — TLS cert fields, not DB-stored - `coderd/azureidentity/azureidentity_test.go` — Azure cert expiry, not DB 🤖 Generated by Claude Opus 4.6 but reviewed manually.	2026-03-06 09:14:11 +00:00
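The precision mismatch behind this fix can be reproduced without a database. The sketch below assumes that rounding with `time.Time.Round(time.Microsecond)` approximates what a Postgres round-trip and `dbtime.Now()` do; `toDBPrecision` and `exampleTime` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

// toDBPrecision rounds a timestamp to microsecond precision, mimicking the
// resolution of a Postgres timestamp column (an approximation for
// illustration only).
func toDBPrecision(t time.Time) time.Time {
	return t.Round(time.Microsecond)
}

// exampleTime returns a fixed timestamp with nanosecond detail.
func exampleTime() time.Time {
	return time.Date(2026, 3, 6, 9, 14, 11, 123456789, time.UTC)
}

func main() {
	now := exampleTime()
	stored := toDBPrecision(now)
	// The two values differ by a sub-microsecond amount, which is enough
	// to flip a strict Before/After comparison in a test.
	fmt.Println(now.Equal(stored), stored.Sub(now))
}
```

Comparing two values that were both reduced to microsecond precision removes that sub-microsecond difference from the assertion.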
Steven Masley	f49dea683c	chore: prematurely refresh oidc token near expiry during workspace build (#22502 ) Closes https://github.com/coder/coder/issues/22429	2026-03-03 18:13:00 +00:00
Jon Ayers	0a7a3da178	fix: exclude provisioner_state from workspace_build_with_user view (#22159 ) The provisioner state for a workspace build was being loaded for every long-lived agent rpc connection. Since this state can be anywhere from kilobytes to megabytes this can gradually cause the `coderd` memory footprint to grow over time. It's also a lot of unnecessary allocations for every query that fetches a workspace build since only a few callers ever actually reference the provisioner state. This PR removes it from the returned workspace build and adds a query to fetch the provisioner state explicitly.	2026-02-23 22:46:17 -06:00
Zach	6a783fc5c7	fix: floor provisioner job queue wait metric (#22184 ) After a PostgreSQL round-trip, job timestamps lose their monotonic clock component, making the subtraction susceptible to wall-clock adjustments producing a small negative delta. Floor at 1ms since a zero or negative queue wait is meaningless. Fixes TestProvisionerJobQueueWaitMetric flakes where small negative values (~ -2ms) are observed.	2026-02-20 16:12:17 -07:00
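Flooring the observed queue wait, as this commit describes, amounts to a small clamp. The function name `floorQueueWait` is hypothetical; only the 1ms floor comes from the message.

```go
package main

import (
	"fmt"
	"time"
)

// floorQueueWait clamps a measured queue wait to at least 1ms, since a zero
// or negative wait can only come from wall-clock adjustments.
func floorQueueWait(d time.Duration) time.Duration {
	if d < time.Millisecond {
		return time.Millisecond
	}
	return d
}

func main() {
	fmt.Println(floorQueueWait(-2 * time.Millisecond))
	fmt.Println(floorQueueWait(5 * time.Second))
}
```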
Danielle Maywood	911d734df9	fix: avoid re-using `AuthInstanceID` for sub agents (#22196 ) Parent agents were re-using AuthInstanceID when spawning child agents. This caused GetWorkspaceAgentByInstanceID to return the most recently created sub agent instead of the parent when the parent tried to refetch its own manifest. Fix by not reusing AuthInstanceID for sub agents, and updating GetWorkspaceAgentByInstanceID to filter them out entirely.	2026-02-19 16:56:29 +00:00
Callum Styan	5f3be6b288	feat: add provisioner job queue wait time histogram and jobs enqueued counter (#21869 ) This PR adds some metrics to help identify job enqueue rates and latencies. This work was initiated as a way to help reduce the cost of the observation/measurement itself for autostart scaletests, which impacts our ability to identify/reason about the load caused by autostart. See: https://github.com/coder/internal/issues/1209 I've extended the metrics here to account for regular user initiated builds, prebuilds, autostarts, etc. IMO there is still the question here of whether we want to include or need the `transition` label, which is only present on workspace builds. Including it does lead to an increase in cardinality, and in the case of the histogram (when not using native histograms) that's at least a few extra series for every bucket. We could remove the transition label there but keep it on the counter. Additionally, the histogram is currently observing latencies for other jobs, such as template builds/version imports, those do not have a transition type associated with them. Tested briefly in a workspace, can see metric values like the following: - `coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"} 1` - `coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"} 1` --------- Signed-off-by: Callum Styan <callumstyan@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-12 13:40:47 -08:00
Steven Masley	efd98bd93a	chore: add template toggle to disable module caching (#21931 ) There exist use cases for disabling the new module caching behavior of workspace builds. Building without the cache was the legacy behavior.	2026-02-05 14:38:55 -06:00
Danielle Maywood	37aecda165	feat(coderd/provisionerdserver): insert sub agent resource (#21699 ) Update provisionerdserver to handle the changes introduced to provisionerd in https://github.com/coder/coder/pull/21602 We now create a relationship between `workspace_agent_devcontainers` and `workspace_agents` with the newly created `subagent_id`.	2026-01-30 17:19:19 +00:00
Steven Masley	e13f2a9869	chore: remove extra `stop_modules` from provisionerd proto (#21706 ) Was a duplicate of start_modules Closes https://github.com/coder/coder/issues/21206	2026-01-28 09:25:47 -06:00
Cian Johnston	7b44976618	fix(coderd/provisionerdserver): correct managed agent tracking (#21696 ) Relates to https://github.com/coder/internal/issues/1282 Updates tracking of managed agents to be predicated instead on the presence of a related `task_id` instead of the presence of a `coder_ai_task` resource.	2026-01-27 12:14:52 +00:00
Steven Masley	60b3fd0783	chore!: send modules archive over the proto messages (#21398 ) # What this does Dynamic parameters caches the `./terraform/modules` directory for parameter usage. What this PR does is send over this archive to the provisioner when building workspaces. This allows Terraform to skip downloading modules from their registries, a step that takes seconds. <img width="1223" height="429" alt="Screenshot From 2025-12-29 12-57-52" src="https://github.com/user-attachments/assets/16066e0a-ac79-4296-819d-924f4b0418dc" /> # Wire protocol The wire protocol reuses the same mechanism used to download the modules `provisioner -> coder`. It splits up large archives into multiple protobuf messages so larger archives can be sent under the message size limit. # 🚨 Behavior Change (Breaking Change) 🚨 Before this PR modules were downloaded on every workspace build. This means unpinned modules always fetched the latest version. After this PR modules are cached at template import time, and their versions are effectively pinned for all subsequent workspace builds.	2026-01-09 11:33:34 -06:00
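Splitting an archive into size-limited messages, as the wire protocol section above describes, could be sketched like this. The `piece` struct and `splitIntoPieces` are hypothetical, and using SHA-256 for the full-data hash is an assumption; the real message types live in the protobuf definitions.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// piece is a hypothetical chunk of a larger archive. Each piece carries the
// hash of the full archive so a receiver can match pieces to one upload and
// verify the reassembled result.
type piece struct {
	Data         []byte
	FullDataHash []byte
	PieceIndex   int32
}

// splitIntoPieces cuts data into chunks of at most maxSize bytes.
func splitIntoPieces(data []byte, maxSize int) []piece {
	if maxSize <= 0 {
		return nil
	}
	sum := sha256.Sum256(data)
	var pieces []piece
	for start := 0; start < len(data); start += maxSize {
		end := start + maxSize
		if end > len(data) {
			end = len(data)
		}
		pieces = append(pieces, piece{
			Data:         data[start:end],
			FullDataHash: sum[:],
			PieceIndex:   int32(len(pieces)),
		})
	}
	return pieces
}

func main() {
	pieces := splitIntoPieces(make([]byte, 10), 4)
	for _, p := range pieces {
		fmt.Println(p.PieceIndex, len(p.Data))
	}
}
```

A receiver can reorder pieces by index, concatenate them, and compare the hash of the result against `FullDataHash` before unpacking.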
Steven Masley	d2044c2ee9	chore: update protobuf to reuse file request (#21447 ) This is just the protobuf changes for the PR https://github.com/coder/coder/pull/21398 Moved `UploadFileRequest` from `provisionerd.proto` -> `provisioner.proto`. Renamed to `FileUpload` because it is now bi-directional. This is backwards compatible. I tested it to confirm the payloads are identical. Types were just renamed and moved around. ```golang func TestTypeUpgrade(t *testing.T) { t.Parallel() x := &proto2.UploadFileRequest{ Type: &proto2.UploadFileRequest_ChunkPiece{ ChunkPiece: &proto.ChunkPiece{ Data: []byte("Hello World!"), FullDataHash: []byte("Foobar"), PieceIndex: 42, }, }, } data, err := protobuf.Marshal(x) require.NoError(t, err) // Exactly the same output // EhgKDEhlbGxvIFdvcmxkIRIGRm9vYmFyGCo= on `main` // EhgKDEhlbGxvIFdvcmxkIRIGRm9vYmFyGCo= on this branch fmt.Println(base64.StdEncoding.EncodeToString(data)) } ``` # What this does This allows provisioner daemons to download files from `coderd`'s `files` table. This is used to send over cached module files and prevent the need of downloading these modules on each workspace build.	2026-01-09 11:23:32 -06:00
Steven Masley	89f4d60e7b	chore: remove experiment "terraform-directory-reuse" (#21397 ) Experiment is no longer required, the new method will be released without an experiment and without a toggle Main PR is: https://github.com/coder/coder/pull/21398	2026-01-09 11:13:16 -06:00
Spike Curtis	bddb808b25	chore: arrange imports in a standard way (#21452 ) Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example: ``` import ( "context" "time" "github.com/prometheus/client_golang/prometheus" "golang.org/x/xerrors" "gopkg.in/natefinch/lumberjack.v2" "cdr.dev/slog/v3" "github.com/coder/coder/v2/codersdk/agentsdk" "github.com/coder/serpent" ) ``` 3 groups: standard library, 3rd party libs, Coder libs. This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.	2026-01-08 15:24:11 +04:00
Spike Curtis	49b34a716a	fix: fix slog to always use array of Fields (#21426 ) Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptable call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder). It also updates dependencies that also use slog and were updated. I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule. Other dependencies, I pushed new tags.	2026-01-08 10:29:41 +04:00
Danielle Maywood	c3224b793e	fix: handle scenario where provisionerdserver deletes task before coderd (#21220 )	2025-12-11 13:04:13 +00:00
Marcin Tojek	d004710a74	feat: add prebuild invalidation via last_invalidated_at timestamp (#20582 ) Updates #17917	2025-11-20 17:12:25 +01:00
Steven Masley	a10c5ff381	chore: protect build timings insert for invalid enums (#20821 ) Database insert errors will fail the transaction. So this error is fatal. Properly return it for a better error call stack, and not just hiding the error in the logs.	2025-11-19 09:34:19 -06:00
Susana Ferreira	79d46769fe	chore: remove warning for non-trackable workspace builds in metrics (#20775 ) Previously, `UpdateWorkspaceTimingsMetrics` would log a warning for workspace builds that aren't tracked (restarts, stops, subsequent builds after creation). This was noisy since these are legitimate operations, not errors. `UpdateWorkspaceTimingsMetrics` is specifically designed to track only workspace creation, prebuild creation, and prebuild claim timings. Related with: https://github.com/coder/coder/pull/20772	2025-11-14 12:26:32 +00:00
Danny Kopping	86c4948445	chore: add timing flag context to warn message (#20772 ) `prometheus.provisionerd_server_metrics: unsupported workspace timing flags` appears in the logs, but without knowledge of the available flags it's not possible to troubleshoot this. Signed-off-by: Danny Kopping <danny@coder.com>	2025-11-14 10:10:53 +00:00
Steven Masley	fe3b825b86	chore: per template opt into cached terraform directories (#20609 ) For experimental and dogfood purposes, this adds the ability to opt in a single template. Leaving the rest of the templates as is. For GA, this setting might be removed or changed.	2025-11-13 14:04:12 -06:00
Steven Masley	9ca5b44b56	chore: implement persistent terraform directories (experimental) (#20563 ) Prior to this, every workspace build ran `terraform init` in a fresh directory. This would mean the `modules` are downloaded fresh. If the module is not pinned, subsequent workspace builds would have different modules.	2025-11-13 07:50:17 -06:00
Steven Masley	04727c06e8	chore: add experiment toggle for terraform workspace caching (#20559 ) Experiments passed to provisioners to determine behavior. This adds `--experiments` flag to provisioner daemons. Prior to this, provisioners had no method to turn on/off experiments.	2025-11-12 14:26:15 -06:00
Steven Masley	9149c1e9f2	chore: append template metadata to protobuf config (#20558 ) Adds some extra metadata sent to provisioners. Also adds a field `reuse_terraform_workspace` to tell the provisioner whether or not to use the caching experiment.	2025-11-12 12:46:39 -06:00
Mathias Fredriksson	ce04f6cc5d	fix(coderd): remove deprecated AITaskSidebarApp column (#20680 ) This column was no longer used in `v2.28` and the codersdk field deprecated. Both can now be dropped in `v2.29`. Closes coder/internal#974	2025-11-07 12:45:45 +02:00
Mathias Fredriksson	a6b0eae38d	refactor(coderd): drop sidebar app constraint and simplify provisionerdserver for tasks (#20591 ) Updates coder/internal#973 Updates coder/internal#974	2025-11-03 13:46:38 +02:00
Cian Johnston	1961252918	chore(coderd/provisionerdserver): address flake in TestServer_ExpirePrebuildsSessionToken (#20648 ) Addresses a flake seen locally by @mafredri: ``` panic: interface conversion: proto.isAcquiredJob_Type is nil, not proto.AcquiredJob_WorkspaceBuild_ [recovered] panic: interface conversion: proto.isAcquiredJob_Type is nil, not proto.AcquiredJob_WorkspaceBuild_ goroutine 77 [running]: testing.tRunner.func1.2({0x35ba440, 0xc000f15620}) /usr/local/go/src/testing/testing.go:1734 +0x21c testing.tRunner.func1() /usr/local/go/src/testing/testing.go:1737 +0x35e panic({0x35ba440?, 0xc000f15620?}) /usr/local/go/src/runtime/panic.go:792 +0x132 github.com/coder/coder/v2/coderd/provisionerdserver_test.TestServer_ExpirePrebuildsSessionToken(0xc00010d500) /home/coder/coder/coderd/provisionerdserver/provisionerdserver_test.go:4128 +0xc4b testing.tRunner(0xc00010d500, 0x4bd8450) /usr/local/go/src/testing/testing.go:1792 +0xf4 created by testing.(*T).Run in goroutine 1 /usr/local/go/src/testing/testing.go:1851 +0x413 FAIL github.com/coder/coder/v2/coderd/provisionerdserver 20.830s FAIL ``` It's unclear why this would happen in the first place.	2025-11-03 11:39:02 +00:00
Cian Johnston	73dedcc765	fix: delete related task when deleting workspace (#20567 ) * Instead of prompting the user to start a deleted workspace (which is silly), prompt them to create a new task instead. * Adds a warning dialog when deleting a workspace * Updates provisionerdserver to delete the related task if a workspace is related to a task	2025-10-30 10:37:51 +00:00
Danielle Maywood	5a31c590e6	fix(coderd/provisionerdserver): pipe through task id and prompt (#20408 ) Pipes through the Task's ID and prompt into the provisioner. This is required to support the new `coder_ai_task.prompt` field and modified `coder_ai_task.id` field.	2025-10-24 09:43:48 +01:00
Cian Johnston	dc6e50d6b7	feat(coderd/telemetry): add telemetry for database Tasks (#20279 ) Adds Tasks to telemetry snapshots Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>	2025-10-17 10:48:56 +01:00
Mathias Fredriksson	a8f87c2625	feat(coderd): implement task to app linking (#20237 ) This change adds workspace build/agent/app linking to tasks and wires it into `wsbuilder` and `provisionerdserver`. Closes coder/internal#948 Closes coder/coder#20212 Closes coder/coder#19773	2025-10-13 12:57:06 +03:00
Danielle Maywood	f31e6e09ba	chore(provisioner): support updated coder_ai_task resource (#20160 ) Closes https://github.com/coder/internal/issues/978 - Introduce `CODER_TASK_ID` and `CODER_TASK_PROMPT` to the provisioner environment - Make use of new `app_id` field in provider, with a fallback to `sidebar_app.id` for backwards compatibility For now I've left the `taskPrompt` and `taskID` as a TODO as we do not yet create these values.	2025-10-09 10:42:01 +01:00
Rafael Rodriguez	e53bc247e9	feat: add tooltip field to workspace app that renders as markdown (#19651 ) In this pull request we're adding an optional `tooltip` field. The `tooltip` field is a string field (with markdown support) that will be used to display tooltips on hover over app buttons in a workspace dashboard. Tooltip screenshot <img width="816" height="275" alt="Screenshot 2025-08-29 at 4 11 56 PM" src="https://github.com/user-attachments/assets/52c736a1-f632-465b-89a0-35ca99bd367b" /> Tooltip video https://github.com/user-attachments/assets/21806337-accc-4acf-b8c6-450c031d98f1 Issue: https://github.com/coder/coder/issues/18431 Related provider PR: https://github.com/coder/terraform-provider-coder/pull/435 ### Changes - Added migration to add `tooltip` column to `workspace_apps` table - Updated queries to get/set the new `tooltip` column - Updated frontend to render tooltip as markdown (primary tool tip takes precedence over template tooltip) ### Testing - Added storybook test for `Applink` markdown rendering	2025-09-10 11:01:54 -05:00
Cian Johnston	06cbb2890f	fix: expire token for prebuilds user when regenerating session token (#19667) * provisionerdserver: expire the prebuilds user's token for the workspace, if it exists, when regenerating the session token * dbauthz: disallow the prebuilds user from creating API keys * dbpurge: expire stale API keys owned by the prebuilds user	2025-09-02 09:38:43 +01:00
Susana Ferreira	353f5dedc1	fix(coderd): fix logic for reporting prebuilt workspace duration metric (#19641) ## Description When creating a prebuilt workspace, both `flags.IsPrebuild` and `flags.IsFirstBuild` are true. Previously, the logic rejected cases with multiple flags, so `coderd_workspace_creation_duration_seconds` wasn’t updated for prebuilt creations. This is the only valid scenario where two flags can be true. ## Changes * Fix logic to update `coderd_workspace_creation_duration_seconds` metric for prebuilt workspaces. * Add prebuild helper functions to coderdenttest (other prebuild tests can reuse this). * Update workspace's provisionerdmetric tests to include this metric. Follow-up: https://github.com/coder/coder/pull/19503 Related to: https://github.com/coder/coder/issues/19528	2025-08-29 15:48:48 +01:00
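The flag handling described in the commit above can be sketched roughly as follows. This is an illustrative reading of the message, not the code from the repository: the `buildFlags` type, the `metricFor` helper, and the `IsClaim` field are assumptions introduced here, while the metric names and the `regular`/`prebuild` label values are taken from the commit messages in this listing.

```go
package main

import "fmt"

// buildFlags is a hypothetical stand-in for the flags mentioned in the commit
// message. IsClaim is an assumption; only IsPrebuild and IsFirstBuild are named.
type buildFlags struct {
	IsPrebuild   bool
	IsClaim      bool
	IsFirstBuild bool
}

// metricFor returns the histogram name and `type` label a build would be
// recorded under, or ok=false when the flag combination is not recorded.
func metricFor(f buildFlags) (name, typ string, ok bool) {
	switch {
	// The one valid two-flag case: the initial build of a prebuilt workspace.
	case f.IsPrebuild && f.IsFirstBuild && !f.IsClaim:
		return "coderd_workspace_creation_duration_seconds", "prebuild", true
	// First build of a regular workspace.
	case f.IsFirstBuild && !f.IsPrebuild && !f.IsClaim:
		return "coderd_workspace_creation_duration_seconds", "regular", true
	// Claim of a prebuilt workspace; this histogram has no `type` label.
	case f.IsClaim && !f.IsPrebuild && !f.IsFirstBuild:
		return "coderd_prebuilt_workspace_claim_duration_seconds", "", true
	default:
		// Any other combination is treated as not recorded in this sketch.
		return "", "", false
	}
}

func main() {
	name, typ, ok := metricFor(buildFlags{IsPrebuild: true, IsFirstBuild: true})
	fmt.Println(name, typ, ok)
}
```

A check that simply rejects any build with more than one flag set would drop the first case, which matches the bug the commit describes.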
Susana Ferreira	0ab345ca84	feat: add prebuild timing metrics to Prometheus (#19503) ## Description This PR introduces one counter and two histograms related to workspace creation and claiming. The goal is to provide clearer observability into how workspaces are created (regular vs prebuild) and the time cost of those operations. ### `coderd_workspace_creation_total` * Metric type: Counter * Name: `coderd_workspace_creation_total` * Labels: `organization_name`, `template_name`, `preset_name` This counter tracks whether a regular workspace (not created from a prebuild pool) was created using a preset or not. We already expose `coderd_prebuilt_workspaces_claimed_total` for claimed prebuilt workspaces, but we lack a comparable metric for regular workspace creations. This metric fills that gap, making it possible to compare regular creations against claims. Implementation notes: * Exposed as a `coderd_` metric, consistent with other workspace-related metrics (e.g. `coderd_api_workspace_latest_build`: https://github.com/coder/coder/blob/main/coderd/prometheusmetrics/prometheusmetrics.go#L149). * Every `defaultRefreshRate` (1 minute), the DB query `GetRegularWorkspaceCreateMetrics` is executed to fetch all regular workspaces (not created from a prebuild pool). * The counter is updated with the total from all time (not just since metric introduction). This differs from the histograms below, which only accumulate from their introduction forward. 
### `coderd_workspace_creation_duration_seconds` & `coderd_prebuilt_workspace_claim_duration_seconds` * Metric types: Histogram * Names: * `coderd_workspace_creation_duration_seconds` * Labels: `organization_name`, `template_name`, `preset_name`, `type` (`regular`, `prebuild`) * `coderd_prebuilt_workspace_claim_duration_seconds` * Labels: `organization_name`, `template_name`, `preset_name` We already have `coderd_provisionerd_workspace_build_timings_seconds`, which tracks build run times for all workspace builds handled by the provisioner daemon. However, in the context of this issue, we are only interested in creation and claim build times, not all transitions; additionally, this metric does not include `preset_name`, and adding it there would significantly increase cardinality. Therefore, separate more focused metrics are introduced here: * `coderd_workspace_creation_duration_seconds`: Build time to create a workspace (either a regular workspace or the build into a prebuild pool, for prebuild initial provisioning build). * `coderd_prebuilt_workspace_claim_duration_seconds`: Time to claim a prebuilt workspace from the pool. The reason for two separate histograms is that: * Creation (regular or prebuild): provisioning builds with similar time magnitude, generally expected to take longer than a claim operation. * Claim: expected to be a much faster provisioning build. #### Native histogram usage Provisioning times vary widely between projects. Using static buckets risks unbalanced or poorly informative histograms. 
To address this, these metrics use [Prometheus native histograms](https://prometheus.io/docs/specs/native_histograms/): * First introduced in Prometheus v2.40.0 * Recommended stable usage from v2.45+ * Requires Go client `prometheus/client_golang` v1.15.0+ * Experimental and must be explicitly enabled on the server (`--enable-feature=native-histograms`) For compatibility, we also retain a classic bucket definition (aligned with the existing provisioner metric: https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L182-L189). * If native histograms are enabled, Prometheus ingests the high-resolution histogram. * If not, it falls back to the predefined buckets. Implementation notes: * Unlike the counter, these histograms are updated in real-time at workspace build job completion. * They reflect data only from the point of introduction forward (no historical backfill). ## Relates to Closes: https://github.com/coder/coder/issues/19528 Native histograms tested in observability stack: https://github.com/coder/observability/pull/50	2025-08-28 15:00:26 +01:00

271 Commits