The lifecycle executor did not handle unique-violation errors from
InsertWorkspaceBuild. When a concurrent actor (API handler, another
lifecycle executor, or prebuilds reconciler) inserts a workspace build
with the same build number, PostgreSQL returns a unique constraint
violation on workspace_builds_workspace_id_build_number_key. The
lifecycle executor treated this as a hard error, logging it and storing
it in stats.Errors.
The per-workspace advisory lock (pg_try_advisory_xact_lock) prevents
two lifecycle executors from racing, but does not protect against
races with the CreateWorkspaceBuild API handler or the prebuilds
reconciler, which use different (or no) locking.
Catch the specific unique-violation error after InTx returns (where
the transaction is already rolled back) and clear it. The concurrent
actor's build takes effect; the lifecycle executor treats the
workspace as a no-op for this tick.
Closescoder/internal#455
Closes PLAT-290
Closes https://github.com/coder/coder/issues/13112
**Breaking Change**: Removed status code `StatusNotModified` when no
diffs occur in a patch. Now the patch is always applied and a template
is always returned.
Relates to https://github.com/coder/internal/issues/1252
When a workspace with a TaskID hits its deadline, use
BuildReasonTaskAutoPause instead of BuildReasonAutostop. This allows
downstream systems to distinguish between regular autostop and task
workspace pauses.
Created by Mux using Opus 4.5.
AI agents report status via patchWorkspaceAgentAppStatus, but this wasn't
extending workspace deadlines. This prevented proper task auto-pause behavior,
causing tasks to pause mid-execution when there were no human connections.
Now we call ActivityBumpWorkspace when agents report status, using the same
logic as SSH/IDE connections. We bump when transitioning to or from the working
state.
Closescoder/internal#1251
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:
```
import (
"context"
"time"
"github.com/prometheus/client_golang/prometheus"
"golang.org/x/xerrors"
"gopkg.in/natefinch/lumberjack.v2"
"cdr.dev/slog/v3"
"github.com/coder/coder/v2/codersdk/agentsdk"
"github.com/coder/serpent"
)
```
3 groups: standard library, 3rd partly libs, Coder libs.
This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).
It also updates dependencies that also use slog and were updated.
I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.
Other dependencies, I pushed new tags.
Provisioner steps broken into smaller granular actions.
Changes:
- `ExtractArchive` moved to `init` request (was in `configure`)
- Writing `tfstate` moved to `plan` (was in `configure`)
- Moved most plan/apply outputs to `GraphComplete`
We recently made a change to the `wsbuilder` to handle task related
logic. Our test coverage for the lifecycle executor didn't handle this
scenario and so we missed that it had insufficient permissions.
This PR adds `Update` and `Read` permissions for `Task`s in the
lifecycle executor, as well as an autostart/autostop test tailored to
task workspaces to verify the change.
---
Anthropic's Claude Sonnet 4.5 Thinking was involved in writing the tests
This test still flakes occasionally, see
https://github.com/coder/internal/issues/954#issuecomment-3237154735
The cause appears to be related to the assignment of `time.Now()` as the
`LastSeenAt` time when creating a provisioner which can flake with the
calculated scheduled next autostart and the code to set then
`require.Eventually` the updated provisioner LastSeenAt.
Instead we should simply calculate all time values for the stale portion
of the test based on the provisioners LastSeenAt value to avoid such
issues.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
The flake here had two causes:
1. related to usage of time.Now() in MustWaitForProvisionersAvailable
and
2. the fact that UpdateProvisionerLastSeenAt can not use a time that is
further in the past than the current LastSeenAt time
Previously the test here was calling
`coderdtest.MustWaitForProvisionersAvailable` which was using `time.Now`
rather than the next tick time like the real `hasProvisionersAvailable`
function does. Additionally, when using `UpdateProvisionerLastSeenAt`
the underlying db query enforces that the time we're trying to set
`LastSeenAt` to cannot be older than the current value.
I was able to reliably reproduce the flake by executing both the
`UpdateProvisionerLastSeenAt` call and `tickCh <- next` in their own
goroutines, the former with a small sleep to reliably ensure we'd
trigger the autobuild before we set the `LastSeenAt` time. That's when I
also noticed that `coderdtest.MustWaitForProvisionersAvailable` was
using `time.Now` instead of the tick time. When I updated that function
to take in a tick time + added a 2nd call to
`UpdateProvisionerLastSeenAt` to set an original non-stale time, we
could then never get the test to pass because the later call to set the
stale time would not actually modify `LastSeenAt`. On top of that,
calling the provisioner daemons closer in the middle of the function
doesn't really do anything of value in this test.
**The fix for the flake is to keep the go routines, ensuring there would
be a flake if there was not a relevant fix, but to include the fix which
is to ensure that we explicitly wait for the provisioner to be stale
before passing the time to `tickCh`.**
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
## Description
This PR ensures that prebuilt workspaces are properly excluded from the
lifecycle executor and treated as a separate class of workspaces, fully
managed by the prebuild reconciliation loop.
It introduces two lifecycle guarantees:
* When a prebuilt workspace is created (i.e., when the workspace build
completes), all lifecycle-related fields are unset, ensuring the
workspace does not participate in TTL, autostop, autostart, dormancy, or
auto-deletion logic.
* When a prebuilt workspace is claimed, it transitions into a regular
user workspace. At this point, all lifecycle fields are correctly
populated according to template-level configurations, allowing the
workspace to be managed by the lifecycle executor as expected.
## Changes
* Prebuilt workspaces now have all lifecycle-relevant fields unset
during creation
* When a prebuild is claimed:
* Lifecycle fields are set based on template and workspace level
configurations. This ensures a clean transition into the standard
workspace lifecycle flow.
* Updated lifecycle-related SQL update queries to explicitly exclude
prebuilt workspaces.
## Relates
Related issue: https://github.com/coder/coder/issues/18898
To reduce the scope of this PR and make the review process more
manageable, the original implementation has been split into the
following focused PRs:
* https://github.com/coder/coder/pull/19259
* https://github.com/coder/coder/pull/19263
* https://github.com/coder/coder/pull/19264
* https://github.com/coder/coder/pull/19265
These PRs should be considered in conjunction with this one to
understand the complete set of lifecycle separation changes for prebuilt
workspaces.
## Description
This PR updates the lifecycle executor to explicitly exclude prebuilt
workspaces from being considered for lifecycle operations such as
`autostart`, `autostop`, `dormancy`, `default TTL` and `failure TTL`.
Prebuilt workspaces (i.e., those owned by the prebuild system user) are
handled separately by the prebuild reconciliation loop. Including them
in the lifecycle executor could lead to unintended behavior such as
incorrect scheduling or state transitions.
## Changes
* Updated the lifecycle executor query
`GetWorkspacesEligibleForTransition` to exclude workspaces with
`owner_id = 'c42fdf75-3097-471c-8c33-fb52454d81c0'` (prebuilds).
* Added tests to verify prebuilt workspaces are not considered in:
* Autostop
* Autostart
* Default TTL
* Dormancy
* Failure TTL
Fixes: https://github.com/coder/coder/issues/18740
Related to: https://github.com/coder/coder/issues/18658
Fixes https://github.com/coder/coder/issues/17840
NOTE: calling this out as a breaking change so that it is highly visible
in the changelog.
* CLI: Modifies `coder update` to stop the workspace if already running.
* UI: Modifies "update" button to always stop the workspace if already
running.
Fixes https://github.com/coder/coder/issues/9775
When a workspace's TTL is removed, and the workspace is running, the
deadline is removed from the workspace.
This also modifies the frontend to not show a confirmation dialog when
the change is to remove autostop.
- Adds `testutil.GoleakOptions` and consolidates existing options to
this location
- Pre-emptively adds required ignore for this Dependabot PR to pass CI
https://github.com/coder/coder/pull/16066
When Coder is ran in High Availability mode, each Coder instance has a
lifecycle executor. These lifecycle executors are all trying to do the
same work, and whilst transactions saves us from this causing an issue,
we are still doing extra work that could be prevented.
This PR adds a `TryAcquireLock` call for each attempted workspace
transition, meaning two Coder instances shouldn't duplicate effort.
Relates to https://github.com/coder/coder/issues/15082
Further to https://github.com/coder/coder/pull/15429, this reduces the
amount of false-positives returned by the 'is eligible for autostart'
part of the query. We achieve this by calculating the 'next start at'
time of the workspace, storing it in the database, and using it in our
`GetWorkspacesEligibleForTransition` query.
The prior implementation of the 'is eligible for autostart' query would
return _all_ workspaces that at some point in the future _might_ be
eligible for autostart. This now ensures we only return workspaces that
_should_ be eligible for autostart.
We also now pass `currentTick` instead of `t` to the
`GetWorkspacesEligibleForTransition` query as otherwise we'll have one
round of workspaces that are skipped by `isEligibleForTransition` due to
`currentTick` being a truncated version of `t`.
- Assert rbac in fake notifications enqueuer
- Move fake notifications enqueuer to separate notificationstest package
- Update dbauthz rbac policy to allow provisionerd and autostart to create and read notification messages
- Update tests as required
Fixes flake seen here: https://github.com/coder/coder/actions/runs/6716682414/job/18253279654
The test used a cron schedule to compute autobuild ticks, with ticks every hour on the hour. The default TTL was set to an hour. Usually, the next tick is less than one hour in the future, unless the test runs at :00 past the hour, which it did in my flake'd
run. But, given that this is an autostop test, the cron schedule is irrelevant (such schedules are used for auto_start_). So, I've removed it from the test and compute the build ticks directly.
Also, the test originally had the workspace TTL set to longer than the default template TTL, and then tested that no build happened when the tick was prior to both. This seems odd to me, as we want to demonstrate the the executor disregards the workspace TTL.
So, I changed the test to set the workspace TTL shorter, and then send in a tick between the two, verify that we don't autostop, then a tick after the template TTL and verify that we do.
The refactored ActivityBump query did not take into account the
template-level TTL, resulting in potentially incorrect bump
amounts for workspaces that have both a user-defined and template-
defined TTL that differ.
This change is ported over from PR#10035 to reduce the overall
size of that PR.
Also includes a drive-by unit test in autobuild for checking template autostop/TTL.
Co-authored-by: Dean Sheather <dean@deansheather.com>
This removes an indirect import of `coderd/database` from the CLI and
results in a logical separation between server related and generalized
schedule.
No size change (yet).
Ref: #9380