Replace the old `InTx` ruleguard rule in `scripts/rules.go` with a
custom in-tree `go/analysis` analyzer under `scripts/intxcheck/`. The
new analyzer catches the same direct and pass-through misuse classes as
before, plus two new classes the pattern-matcher couldn't reach:
- **Indirect same-package helper misuse** — flags `p.someHelper(ctx)`
inside `InTx` when the helper body uses the outer store (the PR #24369
bug class).
- **Nested dangerous closures** — descends into `go func() { ... }()`,
`defer func() { ... }()`, and immediately-invoked function literals.
The analyzer uses semantic `types.Object` identity instead of raw
expression string comparison, which avoids false positives from
closure-local shadowing and catches simple aliases like `outer := s.db`
and `alias := s`.
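To illustrate why object identity is more robust than comparing expression strings, here is a minimal, self-contained sketch. It is not the `intxcheck` implementation: the embedded source snippet and the `objectsNamed` helper are hypothetical, and it only shows that two identifiers spelled `db` can resolve to different `types.Object` values when one shadows the other.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"go/types"
)

// src is a hypothetical snippet: the closure first uses the outer db,
// then shadows it with a closure-local db that refers to the tx store.
const src = `package p

type Store struct{}

func run(f func(tx *Store)) {}

func use(s *Store) {}

func example(db *Store) {
	run(func(tx *Store) {
		use(db)  // outer store
		db := tx // shadows the outer db
		use(db)  // inner store
	})
}
`

// objectsNamed type-checks src and returns the types.Object behind every
// use of the identifier name, in source order.
func objectsNamed(name string) ([]types.Object, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "p.go", src, 0)
	if err != nil {
		return nil, err
	}
	info := &types.Info{Uses: make(map[*ast.Ident]types.Object)}
	conf := types.Config{}
	if _, err := conf.Check("p", fset, []*ast.File{file}, info); err != nil {
		return nil, err
	}
	var objs []types.Object
	ast.Inspect(file, func(n ast.Node) bool {
		if id, ok := n.(*ast.Ident); ok && id.Name == name {
			// Declarations live in info.Defs, so only real uses are collected.
			if obj := info.Uses[id]; obj != nil {
				objs = append(objs, obj)
			}
		}
		return true
	})
	return objs, nil
}

func main() {
	objs, err := objectsNamed("db")
	if err != nil {
		panic(err)
	}
	for i, obj := range objs {
		fmt.Printf("use %d of db resolves to the object declared at pos %d\n", i, obj.Pos())
	}
	// A string comparison would treat both uses as the same expression.
	fmt.Println("same object:", objs[0] == objs[1])
}
```

A string-based matcher would flag both `use(db)` calls (or neither); comparing objects lets a checker flag only the first.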
This PR also fixes three real outer-store-inside-transaction bugs the
new analyzer surfaced:
- `coderd/wsbuilder/wsbuilder.go`: `FindMatchingPresetID` and
`getWorkspaceTask` now use the inner transaction store instead of
`b.store`.
- `enterprise/dbcrypt/dbcrypt.go`: `ensureEncrypted` now calls
`s.InsertDBCryptKey` (the tx-wrapped store) instead of
`db.InsertDBCryptKey`. The `dbCrypt.InTx` method wraps the raw tx in a
new `*dbCrypt`, so `s.InsertDBCryptKey` still dispatches through the
encryption layer.
Two call sites need `// intxcheck:ignore` suppressions. Both are one-off
patterns that only look like misuse because the analyzer doesn't track
assignments — proving them safe would require full dataflow analysis,
which is well beyond what a targeted lint like this should attempt:
- `coderd/database/dbfake/dbfake.go` — `b.db` is reassigned to `tx` on
the preceding line, so `b.doInTX()` actually uses the transaction. The
analyzer sees the original `b.db` identity and flags it.
- `coderd/database/db_test.go` — test intentionally passes the outer
store to `require.Equal` to assert that nested `InTx` returns the same
handle.
Suppressions use `// intxcheck:ignore` instead of `//nolint:intxcheck`
because `intxcheck` runs as a standalone `go/analysis` tool outside
golangci-lint. golangci-lint's `nolintlint` checker flags `//nolint`
directives for linters it doesn't control, so we use a custom comment
prefix to avoid that conflict.
Previously, WorkspaceBuildBuilder.doInTX() inserted provisioner jobs
with empty tags and used a loop in AcquireProvisionerJob that could
match other tests' pending jobs when parallel tests share a database.
Add a unique tag (jobID -> "true") to each provisioner job at insert
time, then use that tag in AcquireProvisionerJob to target only the
correct job. This follows the same pattern used in dbgen.ProvisionerJob.
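As a rough illustration of why the unique tag helps, the sketch below models tag matching in memory. The `job` type, `acquire`, and `subset` are hypothetical and much simpler than the real `AcquireProvisionerJob` query; the only assumption carried over is that a job is eligible when all of its tags are present in the acquirer's tag set.

```go
package main

import "fmt"

// job is a hypothetical stand-in for a pending provisioner job row.
type job struct {
	ID   string
	Tags map[string]string
}

// subset reports whether every key/value pair in a is also in b.
func subset(a, b map[string]string) bool {
	for k, v := range a {
		if got, ok := b[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// acquire returns the first pending job whose tags are all present in
// acquirerTags. This is a simplified model, not the real query.
func acquire(pending []job, acquirerTags map[string]string) (job, bool) {
	for _, j := range pending {
		if subset(j.Tags, acquirerTags) {
			return j, true
		}
	}
	return job{}, false
}

func main() {
	mine := map[string]string{"mine": "true"}

	// With empty tags, any pending job matches, including another test's.
	untagged := []job{{ID: "other-test"}, {ID: "mine"}}
	j, _ := acquire(untagged, mine)
	fmt.Println("untagged queue acquired:", j.ID)

	// With a unique tag per job, only the intended job matches.
	tagged := []job{
		{ID: "other-test", Tags: map[string]string{"other-test": "true"}},
		{ID: "mine", Tags: map[string]string{"mine": "true"}},
	}
	j, _ = acquire(tagged, mine)
	fmt.Println("tagged queue acquired:", j.ID)
}
```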
Closes coder/internal#1367
The provisioner state for a workspace build was being loaded for every
long-lived agent RPC connection. Since this state can be anywhere from
kilobytes to megabytes, this can cause the `coderd` memory footprint to
grow over time. It's also a lot of unnecessary allocations
for every query that fetches a workspace build since only a few callers
ever actually reference the provisioner state.
This PR removes it from the returned workspace build and adds a query to
fetch the provisioner state explicitly.
Relates to https://github.com/coder/coder/pull/21922 /
https://github.com/coder/internal/issues/1259
* Adds `dbfake.BuilderOption func(*WorkspaceBuildBuilder)`
* Adds `BuilderOption` methods for setting various provisioner job
related fields on `WorkspaceBuildBuilder`.
* Migrates a number of existing tests that previously depended on
provisioner job timing to use these updated methods in the following
packages:
* `coderd/jobreaper`
* `coderd/notifications/reports`
* `enterprise/coderd/schedule`
* `enterprise/coderd/prebuilds`
* `scripts/workspace-runtime-audit`
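A minimal sketch of the functional-option shape described above. Only the `BuilderOption func(*WorkspaceBuildBuilder)` signature comes from this change; the builder fields, `WithJobStatus`, `WithJobError`, and `newBuilder` are hypothetical names used for illustration.

```go
package main

import "fmt"

// WorkspaceBuildBuilder here is a trimmed-down, hypothetical stand-in
// that only carries provisioner job fields.
type WorkspaceBuildBuilder struct {
	jobStatus string
	jobError  string
}

// BuilderOption mutates the builder before the rows are inserted.
type BuilderOption func(*WorkspaceBuildBuilder)

// WithJobStatus (hypothetical) sets the job status directly, so a test
// does not have to wait for a job to reach that state.
func WithJobStatus(status string) BuilderOption {
	return func(b *WorkspaceBuildBuilder) { b.jobStatus = status }
}

// WithJobError (hypothetical) marks the job as failed with a message.
func WithJobError(msg string) BuilderOption {
	return func(b *WorkspaceBuildBuilder) {
		b.jobStatus = "failed"
		b.jobError = msg
	}
}

// newBuilder (hypothetical) applies options in order over the defaults.
func newBuilder(opts ...BuilderOption) *WorkspaceBuildBuilder {
	b := &WorkspaceBuildBuilder{jobStatus: "succeeded"}
	for _, opt := range opts {
		opt(b)
	}
	return b
}

func main() {
	b := newBuilder(WithJobError("terraform apply failed"))
	fmt.Println(b.jobStatus, b.jobError)
}
```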
🤖 Created using Mux (Opus 4.5)
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
The AcquireProvisionerJob query only checked started_at IS NULL, allowing
it to acquire jobs that were canceled while pending (which have
completed_at set but started_at still NULL).
Added completed_at IS NULL check to the query to prevent this.
Also fixed JobCompleteBuilder.Do() in dbfake to set started_at when
completing jobs to match production behavior.
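The corrected condition can be sketched as a Go predicate. This is an illustration only: the real check lives in the SQL query, and `provisionerJob`, `acquirable`, and `exampleJobs` are hypothetical names.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// provisionerJob is a hypothetical subset of the job row.
type provisionerJob struct {
	StartedAt   sql.NullTime
	CompletedAt sql.NullTime
}

// acquirable mirrors "started_at IS NULL AND completed_at IS NULL".
// Checking started_at alone would also accept jobs canceled while pending.
func acquirable(j provisionerJob) bool {
	return !j.StartedAt.Valid && !j.CompletedAt.Valid
}

// exampleJobs returns a pending job, a job canceled while pending, and a
// running job.
func exampleJobs() (pending, canceledPending, running provisionerJob) {
	now := time.Now()
	canceledPending.CompletedAt = sql.NullTime{Time: now, Valid: true}
	running.StartedAt = sql.NullTime{Time: now, Valid: true}
	return pending, canceledPending, running
}

func main() {
	pending, canceledPending, running := exampleJobs()
	fmt.Println("pending:", acquirable(pending))
	fmt.Println("canceled while pending:", acquirable(canceledPending))
	fmt.Println("running:", acquirable(running))
}
```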
Fixes coder/internal#1323
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:
```
import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"golang.org/x/xerrors"
	"gopkg.in/natefinch/lumberjack.v2"

	"cdr.dev/slog/v3"
	"github.com/coder/coder/v2/codersdk/agentsdk"
	"github.com/coder/serpent"
)
```
Three groups: standard library, third-party libraries, Coder libraries.
This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
This PR adds a command to view the provisioner and agent logs for a
given workspace.
Note: I did investigate using the existing `cliui` methods to tail the
logs but they are tailored to a very specific use-case.
Other changes:
- Adds `Agents` to `dbfake.WorkspaceResponse`
- Adds methods to generate provisioner and agent logs in `dbgen`
---------
Co-authored-by: Steven Masley <Emyrk@users.noreply.github.com>
Upgrades to slog v3, which includes a small but backward-incompatible API change to the acceptable call arguments when logging. This change allows us to verify via compile-time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).
It also bumps the dependencies that use slog to versions built against v3.
I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.
For the other dependencies, I pushed new tags.
This removes the deprecated AITaskPromptParameterName constant and all
backward compatibility code that was added for v2.28.
- Remove AITaskPromptParameterName constant from codersdk/aitasks.go
- Remove backward compatibility code in coderd/aitasks.go that populated
the "AI Prompt" parameter for templates that defined it
- Remove the backward compatibility test (OK AIPromptBackCompat)
- Update dbfake to no longer set the AI Prompt parameter
- Remove AITaskPromptParameterName from frontend TypeScript types
- Remove preset prompt read-only feature from TaskPrompt component
- Update docs to reflect that pre-2.28 definition is no longer supported
Task prompts are now exclusively stored in the tasks.prompt database
column, as introduced in the migration that added the tasks table.
## Problem
With the new tasks data model, a task starts with an `initializing`
status. However, the API returns `current_state: null` to represent the
agent state, causing the frontend to display "No message available".
This PR updates `codersdk.Task` to return a `current_state` when the
task is initializing with meaningful messages about what's happening
during task initialization.
**Previous message**
<img width="2764" height="288" alt="Screenshot 2025-11-07 at 09 06 13"
src="https://github.com/user-attachments/assets/feec9f15-91ca-4378-8565-5f9de062d11a"
/>
**New message**
<img width="2726" height="226" alt="Screenshot 2025-11-12 at 11 00 15"
src="https://github.com/user-attachments/assets/2f9bee3e-7ac4-4382-b1c3-1d06bbc2906e"
/>
## Changes
- Populate `current_state` with descriptive initialization messages when
task status is `initializing` and no valid app status exists for the
current build
- **dbfake**: Fix `WorkspaceBuild` builder to properly handle
pending/running jobs by linking tasks without requiring agent/app
resources
**Note:** UI Storybook changes to reflect these new messages will be
addressed in a follow-up PR.
Closes: https://github.com/coder/internal/issues/1063
Adds some extra metadata sent to provisioners. Also adds a field
`reuse_terraform_workspace` to tell the provisioner whether or not to
use the caching experiment.
While investigating a flake I noticed that the dbfake workspace builder
executes all database inserts without a transaction. Since our real
wsbuilder implementation utilizes one it makes sense to do here as well.
For example, our normal workspace <-> build relationship is such that a
workspace cannot exist without at least one build. However, our
GetWorkspaces query left joins workspace builds but has types that are
non-nullable, leading to flakes like coder/internal#1103.
## Description
This PR introduces an optimization to automatically cancel pending
prebuild-related jobs from non-active template versions in the
reconciliation loop.
## Problem
Currently, when a template is configured with more prebuild instances
than available provisioners, the provisioner queue can become flooded
with pending prebuild jobs. This issue is worsened when
provisioning/deprovisioning operations take a long time.
When the prebuild reconciliation loop generates jobs faster than
provisioners can process them, pending jobs accumulate in the queue.
Since prebuilt workspaces should always run the latest active template
version, pending prebuild jobs from non-active versions become obsolete
once a new version is promoted.
## Solution
The reconciliation loop cancels pending prebuild-related jobs from
non-active template versions that match the following criteria:
* Build number: 1 (initial build created by the reconciliation loop)
* Job status: `pending`
* Not yet picked up by a provisioner (`worker_id` is `NULL`)
* Owned by the prebuilds system user
* Workspace transition: `start`
This prevents the queue from being cluttered with stale prebuild jobs
that would provision workspaces on an outdated template version that
would consequently need to be deprovisioned.
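The criteria above can be sketched as a single predicate. The real selection is done in SQL; `prebuildJob` and `shouldCancel` are hypothetical names, and the struct only models the fields the criteria mention.

```go
package main

import "fmt"

// prebuildJob is a hypothetical, flattened view of a build and its job.
type prebuildJob struct {
	TemplateVersionActive bool
	BuildNumber           int
	Status                string
	WorkerID              *string // nil until a provisioner picks the job up
	OwnedByPrebuildsUser  bool
	Transition            string
}

// shouldCancel reports whether a job meets every cancellation criterion.
func shouldCancel(j prebuildJob) bool {
	return !j.TemplateVersionActive &&
		j.BuildNumber == 1 &&
		j.Status == "pending" &&
		j.WorkerID == nil &&
		j.OwnedByPrebuildsUser &&
		j.Transition == "start"
}

func main() {
	stale := prebuildJob{
		BuildNumber:          1,
		Status:               "pending",
		OwnedByPrebuildsUser: true,
		Transition:           "start",
	}
	fmt.Println("stale pending prebuild:", shouldCancel(stale))

	// A job already picked up by a provisioner is left alone.
	worker := "worker-1"
	pickedUp := stale
	pickedUp.WorkerID = &worker
	fmt.Println("picked up by a provisioner:", shouldCancel(pickedUp))
}
```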
## Changes
* Added new SQL query `CountPendingNonActivePrebuilds` to identify
presets with pending jobs from non-active versions
* Added new SQL query `UpdatePrebuildProvisionerJobWithCancel` to cancel
jobs for a specific preset
* New reconciliation action type `ActionTypeCancelPending` handles the
cancellation logic
* Cancellation is non-blocking: failures to cancel prebuild jobs are
logged as errors and don't prevent other reconciliation actions
## Follow-up PR
Canceling pending prebuild jobs leaves workspaces in a Canceled state.
While no Terraform resources need to be destroyed (since jobs were
canceled before provisioning started), these database records should
still be cleaned up. This will be addressed in a follow-up PR.
Closes: https://github.com/coder/coder/issues/20242
Relates to #778
Somehow in `TestWorkspaceAgent` the agent with the test instance identifier is not being added to the database, or is getting deleted.
I'm adding some additional logging to `dbfake` and setting the affected tests to dump postgres on error, to see if we can get to the bottom of the issue.
This PR sets a constraint of 1MB on the provisioner job logs written to
the database. This is consistent with the constraint we place on
workspace agent logs:
https://github.com/coder/coder/blob/4ac6be6d835dc36c242e35a26b584b784040bf28/coderd/database/dump.sql#L2030
It also adds a message printed to the front end about the provisioner
log overflow, and updates the message printed to the front end when
workspace startup logs exceed the max, as it was causing some customers
to think their startup script had failed to run.
## Description
This PR adds support for `description` and `icon` fields to
`template_version_presets`. These fields will allow displaying richer
information for presets in the UI, improving the user experience when
creating a workspace.
Both fields are optional, non-nullable, and default to empty strings.
## Changes
* Database migration with the addition of `description VARCHAR(128)` and
`icon VARCHAR(256)` columns to the `template_version_presets` table.
* Updated the `CreateWorkspacePageView` in the UI
Note: UI changes will be addressed in a separate PR
Closes https://github.com/coder/internal/issues/312
Depends on https://github.com/coder/terraform-provider-coder/pull/408
This PR adds support for defining an **autoscaling block** for
prebuilds, allowing the number of desired instances to scale dynamically
based on a schedule.
Example usage:
```
data "coder_workspace_preset" "us-nix" {
...
prebuilds = {
instances = 0 # default to 0 instances
scheduling = {
timezone = "UTC" # a single timezone is used for simplicity
# Scale to 3 instances during the work week
schedule {
cron = "* 8-18 * * 1-5" # from 8AM–6:59PM, Mon–Fri, UTC
instances = 3 # scale to 3 instances
}
# Scale to 1 instance on Saturdays for urgent support queries
schedule {
cron = "* 8-14 * * 6" # from 8AM–2:59PM, Sat, UTC
instances = 1 # scale to 1 instance
}
}
}
}
```
### Behavior
- Multiple `schedule` blocks per `prebuilds` block are supported.
- If the current time matches any defined autoscaling schedule, the
corresponding number of instances is used.
- If no schedule matches, the **default instance count**
(`prebuilds.instances`) is used as a fallback.
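The fallback behavior can be sketched as follows. This is a simplified model and not the actual implementation: it does not use the `coderd/schedule/cron` library, it matches on whole hours and weekdays only, and it returns the first matching schedule (an assumption). `schedule`, `desiredInstances`, `exampleSchedules`, and `utc` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// schedule is a hypothetical stand-in for one `schedule` block. It matches
// when the UTC hour is within [StartHour, EndHour] on one of Weekdays.
type schedule struct {
	StartHour, EndHour int
	Weekdays           map[time.Weekday]bool
	Instances          int
}

func (s schedule) matches(t time.Time) bool {
	t = t.UTC()
	return s.Weekdays[t.Weekday()] && t.Hour() >= s.StartHour && t.Hour() <= s.EndHour
}

// desiredInstances returns the instance count of the first matching
// schedule, or defaultInstances when none matches.
func desiredInstances(now time.Time, defaultInstances int, schedules []schedule) int {
	for _, s := range schedules {
		if s.matches(now) {
			return s.Instances
		}
	}
	return defaultInstances
}

// exampleSchedules mirrors the two schedule blocks from the example usage.
func exampleSchedules() []schedule {
	workWeek := map[time.Weekday]bool{
		time.Monday: true, time.Tuesday: true, time.Wednesday: true,
		time.Thursday: true, time.Friday: true,
	}
	return []schedule{
		{StartHour: 8, EndHour: 18, Weekdays: workWeek, Instances: 3},
		{StartHour: 8, EndHour: 14, Weekdays: map[time.Weekday]bool{time.Saturday: true}, Instances: 1},
	}
}

// utc builds a UTC timestamp for the examples.
func utc(year int, month time.Month, day, hour, minute int) time.Time {
	return time.Date(year, month, day, hour, minute, 0, 0, time.UTC)
}

func main() {
	s := exampleSchedules()
	fmt.Println("Wed 10:30:", desiredInstances(utc(2025, 1, 8, 10, 30), 0, s))
	fmt.Println("Sat 09:00:", desiredInstances(utc(2025, 1, 11, 9, 0), 0, s))
	fmt.Println("Sun 12:00:", desiredInstances(utc(2025, 1, 12, 12, 0), 0, s))
}
```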
### Why
This feature allows prebuild instance capacity to adapt to predictable
usage patterns, such as:
- Scaling up during business hours or high-demand periods
- Reducing capacity during off-hours to save resources
### Cron specification
The cron specification is interpreted as a **continuous time range.**
For example, the expression:
```
* 9-18 * * 1-5
```
is intended to represent a continuous range from **09:00 to 18:59**,
Monday through Friday.
However, due to minor implementation imprecision, it is currently
interpreted as a range from **08:59:00 to 18:58:59**, Monday through
Friday.
This slight discrepancy arises because the evaluation is based on
whether a specific **point in time** falls within the range, using the
`github.com/coder/coder/v2/coderd/schedule/cron` library, which performs
per-minute matching rather than strict range evaluation.
---------
Co-authored-by: Danny Kopping <danny@coder.com>
Deletion of data is uncommon in our database, so the introduction of sub agents
and the deletion of them introduced issues with foreign key assumptions, as can
be seen in coder/internal#685. We could have only addressed the specific case by
allowing cascade deletion of stats as well as handling in the stats collector,
but it's unclear how many more such edge-cases we could run into.
In this change, we mark the rows as deleted via boolean instead, and filter them
out in all relevant queries.
Fixes coder/internal#685
This pull request allows coder workspace agents to be reinitialized when
a prebuilt workspace is claimed by a user. This facilitates the transfer
of ownership between the anonymous prebuilds system user and the new
owner of the workspace.
Only a single agent per prebuilt workspace is supported for now, but
plumbing has already been done to facilitate the seamless transition to
multi-agent support.
---------
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
Deleted organizations are still attempting to sync members. This causes
an error on inserting the member, and would likely cause issues later in
the sync process even if that member is inserted. Deleted orgs should be
skipped.
- Refactors existing `mcp` package to use `kylecarbs/aisdk-go` and moves
to `codersdk/toolsdk` package.
- Updates existing MCP server implementation to use `codersdk/toolsdk`
Co-authored-by: Kyle Carberry <kyle@coder.com>
- Update go.mod to use Go 1.24.1
- Update GitHub Actions setup-go action to use Go 1.24.1
- Fix linting issues with golangci-lint by:
  - Updating to golangci-lint v1.57.1 (more compatible with Go 1.24.1)
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <claude@anthropic.com>
Underscores and double hyphens are now blocked. The regex is almost the
exact same as the `coder_app` `slug` regex, but uppercase characters are
still permitted.
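For illustration, a pattern with these properties could look like the sketch below. This is an assumption written from the description above, not the regex added in this change: `nameRegex` and `validName` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// nameRegex (hypothetical) accepts alphanumeric segments of either case,
// separated by single hyphens. Underscores and "--" are rejected.
var nameRegex = regexp.MustCompile(`^[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*$`)

// validName reports whether name matches the sketch pattern.
func validName(name string) bool {
	return nameRegex.MatchString(name)
}

func main() {
	for _, name := range []string{"dev-Box1", "dev_box", "dev--box"} {
		fmt.Printf("%q valid: %v\n", name, validName(name))
	}
}
```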
The failure condition being fixed is that `w1` and `w2` could belong
to different users, organizations, and templates and still cause a
serializable failure if run concurrently. This is because the old query
did a `seq scan` on the `workspace_builds` table. Since that is the
table being updated, we really want to prevent that.
So before this would fail for any 2 workspaces. Now it only fails if
`w1` and `w2` are owned by the same user and organization.
Closes #14716
Closes #14717
Adds a new user-scoped tailnet API endpoint (`api/v2/tailnet`) with a new RPC stream for receiving updates on workspaces owned by a specific user, as defined in #14716.
When a stream is started, the `WorkspaceUpdatesProvider` will begin listening on the user-scoped pubsub events implemented in #14964. When a relevant event type is seen (such as a workspace state transition), the provider will query the DB for all the workspaces (and agents) owned by the user. This gets compared against the result of the previous query to produce a set of workspace updates.
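The compare-against-previous step can be sketched as a snapshot diff. This is a simplified illustration that tracks only workspaces (not agents); `workspace`, `workspaceUpdates`, and `diff` are hypothetical names, not the provider's actual types.

```go
package main

import (
	"fmt"
	"sort"
)

// workspace is a hypothetical, minimal snapshot row.
type workspace struct {
	ID     string
	Status string
}

// workspaceUpdates is the delta between two snapshots.
type workspaceUpdates struct {
	Upserted []workspace // new or changed since the previous query
	Deleted  []string    // IDs present before but missing now
}

// diff compares the previous and current query results, keyed by ID.
func diff(prev, curr map[string]workspace) workspaceUpdates {
	var out workspaceUpdates
	for id, ws := range curr {
		if old, ok := prev[id]; !ok || old != ws {
			out.Upserted = append(out.Upserted, ws)
		}
	}
	for id := range prev {
		if _, ok := curr[id]; !ok {
			out.Deleted = append(out.Deleted, id)
		}
	}
	// Map iteration order is random; sort for deterministic output.
	sort.Slice(out.Upserted, func(i, j int) bool { return out.Upserted[i].ID < out.Upserted[j].ID })
	sort.Strings(out.Deleted)
	return out
}

func main() {
	prev := map[string]workspace{
		"a": {ID: "a", Status: "running"},
		"b": {ID: "b", Status: "running"},
		"c": {ID: "c", Status: "stopped"},
	}
	curr := map[string]workspace{
		"a": {ID: "a", Status: "running"},
		"b": {ID: "b", Status: "stopped"},
		"d": {ID: "d", Status: "starting"},
	}
	fmt.Printf("%+v\n", diff(prev, curr))
}
```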
Workspace updates can be requested for any user ID, however only workspaces the authorised user is permitted to `ActionRead` will have their updates streamed.
Opening a tunnel to an agent requires that the user can perform `ActionSSH` against the workspace containing it.
We currently send empty payloads to pubsub channels of the form `workspace:<workspace_id>` to notify listeners of updates to workspaces (such as for refreshing the workspace dashboard).
To support https://github.com/coder/coder/issues/14716, we'll instead send `WorkspaceEvent` payloads to pubsub channels of the form `workspace_owner:<owner_id>`. This enables a listener to receive events for all workspaces owned by a user.
This PR replaces the usage of the old channels without modifying any existing behaviors.
```
type WorkspaceEvent struct {
Kind WorkspaceEventKind `json:"kind"`
WorkspaceID uuid.UUID `json:"workspace_id" format:"uuid"`
// AgentID is only set for WorkspaceEventKindAgent* events
// (excluding AgentTimeout)
AgentID *uuid.UUID `json:"agent_id,omitempty" format:"uuid"`
}
```
We've defined `WorkspaceEventKind`s based on how the old channel was used, but it's not yet necessary to inspect the types of any of the events, as the existing listeners are designed to fire on any of them.
```
WorkspaceEventKindStateChange WorkspaceEventKind = "state_change"
WorkspaceEventKindStatsUpdate WorkspaceEventKind = "stats_update"
WorkspaceEventKindMetadataUpdate WorkspaceEventKind = "mtd_update"
WorkspaceEventKindAppHealthUpdate WorkspaceEventKind = "app_health"
WorkspaceEventKindAgentLifecycleUpdate WorkspaceEventKind = "agt_lifecycle_update"
WorkspaceEventKindAgentLogsUpdate WorkspaceEventKind = "agt_logs_update"
WorkspaceEventKindAgentConnectionUpdate WorkspaceEventKind = "agt_connection_update"
WorkspaceEventKindAgentLogsOverflow WorkspaceEventKind = "agt_logs_overflow"
WorkspaceEventKindAgentTimeout WorkspaceEventKind = "agt_timeout"
```
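As a rough sketch of how an event maps to a channel and payload: the example below uses plain strings in place of `uuid.UUID` to stay within the standard library, and `workspaceEvent`, `ownerChannel`, and `encodeEvent` are hypothetical helpers rather than the functions added in this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// workspaceEvent mirrors the JSON shape of WorkspaceEvent, with string IDs
// standing in for uuid.UUID.
type workspaceEvent struct {
	Kind        string  `json:"kind"`
	WorkspaceID string  `json:"workspace_id"`
	AgentID     *string `json:"agent_id,omitempty"`
}

// ownerChannel builds the per-owner channel name described above.
func ownerChannel(ownerID string) string {
	return "workspace_owner:" + ownerID
}

// encodeEvent serializes the event as the pubsub payload.
func encodeEvent(ev workspaceEvent) (string, error) {
	b, err := json.Marshal(ev)
	return string(b), err
}

func main() {
	payload, err := encodeEvent(workspaceEvent{Kind: "state_change", WorkspaceID: "w1"})
	if err != nil {
		panic(err)
	}
	fmt.Println(ownerChannel("u1"), payload)
}
```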
Joins in fields like `username`, `avatar_url`, `organization_name`,
`template_name` to `workspaces` via a **view**.
The view must be maintained moving forward, but this prevents needing to
add RBAC permissions to fetch related workspace fields.
* chore: create type for unique role names
Using `string` made it unclear when a name should be combined with
org context and when it should not. The new type is named `RoleIdentifier`.
* chore: add org ID as optional param to AcquireJob
* chore: plumb through organization id to provisioner daemons
* add org id to provisioner domain key
* enforce org id argument
* dbgen provisioner jobs defaults to default org
Drop "New" and "Builder" from the function names, in favor of the top-level resource created. This shortens tests and gives a nice syntax. Since everything is a builder, the prefix and suffix don't add much value and just make things harder to read.
I've also chosen to leave `Do()` as the function to insert into the database. Even though it's a builder pattern, I fear `.Build()` might be confusing with Workspace Builds. One other idea is `Insert()` but if we later add dbfake functions that update, this might be inconsistent.
Convert to builder for consistency with rest of the package. This will make it easier to use, and means we can drop "Builder" from function arguments since they are all builders in the package.
I'd like to convert dbfake into a builder pattern to prevent a proliferation of XXXWithYYY methods. This is one step of the way, removing the non-builder function.