mirror of
https://github.com/coder/coder.git
synced 2026-06-02 20:48:20 +00:00
fix: make prebuild claiming durable and idempotent (#23108)
## Problem When a prebuilt workspace is claimed, the agent reinitializes via a single fire-and-forget pubsub event over SSE. If the agent's SSE connection is interrupted at claim time, the event is permanently lost — the workspace is stuck with no self-healing path. Additionally, regular (non-prebuild) workspaces had no way to opt out of the `/reinit` polling loop — agents would reconnect indefinitely to an endpoint that would never send them anything useful. ## Root Cause `workspaceAgentReinit` fetches the workspace (with its current `owner_id`) via `GetWorkspaceByAgentID`, but never checked whether a claim already happened. It only subscribed to pubsub for future events. The database already has durable claim state (`owner_id` changes from `PrebuildsSystemUserID` to the real user), but no layer ever consulted it on reconnection. ## Solution ### Server-side durable check with first-build-initiator gating **TOCTOU-safe ordering**: Subscribe to pubsub claim events *before* any durable checks, so a claim that fires during the check is buffered in the channel rather than lost. **First-build-initiator gating**: When `!workspace.IsPrebuild()` (owner is no longer the system user), look up the first build's `InitiatorID`. The prebuild reconciler always uses `PrebuildsSystemUserID` as the initiator. This distinguishes claimed prebuilds from regular workspaces without any SQL schema changes. - **Regular workspace** (first build initiator ≠ system user) → **409 Conflict**, agent stops reconnecting - **Claimed prebuild, build completed** → pre-seed channel with reinit event and close it, transmitter delivers one-shot then exits - **Claimed prebuild, build in-progress** → fall through to pubsub subscription, agent waits for completion event - **Unclaimed prebuild** → pubsub subscription (existing happy path) ### Declarative reinit events (defense-in-depth) - Added `UserID` field to `ReinitializationEvent` with JSON tags - Switched pubsub serialization from raw string to JSON (with backward-compat fallback for rolling upgrades) - Populated `UserID` at both the publish site and the durable check ### Agent SDK: 409 handling `WaitForReinitLoop` detects 409 Conflict from the server and closes the `reinitEvents` channel, cleanly exiting the retry goroutine. ### Agent CLI: fixed two bugs + added reinitCtx - **Closed channel (`!ok`)**: now blocks on `<-ctx.Done()` instead of `continue`, keeping the current agent running. Previously this would leak agents by skipping `agnt.Close()` and re-entering the loop. - **Duplicate owner reinit**: cancels `reinitCtx` (stops the reinit goroutine), then blocks on `<-ctx.Done()`. Previously `continue` would skip cleanup and create a new agent on the next loop iteration. - **`reinitCtx`**: a cancellable child of `ctx` passed to `WaitForReinitLoop`, allowing the agent to stop the reinit HTTP polling after reinit completes. ### Agent-side idempotency Tracks `lastOwnerID` in the agent reinit loop — duplicate events for the same owner are skipped. ## Testing - **"unclaimed prebuild receives reinit via pubsub"**: prebuild owned by system user, pubsub event triggers reinit - **"claimed prebuild receives one-shot reinit on reconnect"**: first build by system user, owner changed, build completed → immediate reinit (no pubsub needed) - **"claimed prebuild waits during in-progress claim build"**: claimed but build still running → no reinit until build completes - **"regular workspace gets 409"**: first build by real user → 409 Conflict, agent stops polling - Updated claim publisher/listener tests: verify `UserID` survives JSON round-trip + backward compat with raw string payloads - Updated SSE round-trip test: verify `UserID` survives transmit → receive cycle Fixes #22359 ## Rolling upgrade note During a rolling deploy where old coderd instances coexist with new ones, the pubsub `ReinitializationEvent` has a new `workspace_id` field (JSON key `workspace_id`). Old publishers send a raw reason string instead of JSON; the new listener gracefully falls back by treating the entire payload as the reason and filling in `WorkspaceID` from context. The only visible effect during the upgrade window is that `WorkspaceID` may be the zero UUID in agent-side logs — this is cosmetic and resolves once all instances are updated.
This commit is contained in:
+33
-6
@@ -17,6 +17,7 @@ import (
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"github.com/google/uuid"
|
||||
"github.com/prometheus/client_golang/prometheus"
|
||||
"golang.org/x/xerrors"
|
||||
"gopkg.in/natefinch/lumberjack.v2"
|
||||
@@ -272,11 +273,14 @@ func workspaceAgent() *serpent.Command {
|
||||
logger.Info(ctx, "agent devcontainer detection not enabled")
|
||||
}
|
||||
|
||||
reinitEvents := agentsdk.WaitForReinitLoop(ctx, logger, client)
|
||||
reinitCtx, reinitCancel := context.WithCancel(ctx)
|
||||
defer reinitCancel()
|
||||
reinitEvents := agentsdk.WaitForReinitLoop(reinitCtx, logger, client)
|
||||
|
||||
var (
|
||||
lastErr error
|
||||
mustExit bool
|
||||
lastOwnerID uuid.UUID
|
||||
lastErr error
|
||||
mustExit bool
|
||||
)
|
||||
for {
|
||||
prometheusRegistry := prometheus.NewRegistry()
|
||||
@@ -343,9 +347,32 @@ func workspaceAgent() *serpent.Command {
|
||||
case <-ctx.Done():
|
||||
logger.Info(ctx, "agent shutting down", slog.Error(context.Cause(ctx)))
|
||||
mustExit = true
|
||||
case event := <-reinitEvents:
|
||||
logger.Info(ctx, "agent received instruction to reinitialize",
|
||||
slog.F("workspace_id", event.WorkspaceID), slog.F("reason", event.Reason))
|
||||
case event, ok := <-reinitEvents:
|
||||
switch {
|
||||
case !ok:
|
||||
// Channel closed — the reinit loop exited
|
||||
// (terminal 409 or context expired). Keep
|
||||
// running the current agent until the parent
|
||||
// context is canceled.
|
||||
logger.Info(ctx, "reinit channel closed, running without reinit capability")
|
||||
reinitEvents = nil
|
||||
<-ctx.Done()
|
||||
mustExit = true
|
||||
case event.OwnerID != uuid.Nil && event.OwnerID == lastOwnerID:
|
||||
// Duplicate reinit for same owner — already
|
||||
// reinitialized. Cancel the reinit loop
|
||||
// goroutine and keep the current agent.
|
||||
logger.Info(ctx, "skipping redundant reinit, owner unchanged",
|
||||
slog.F("owner_id", event.OwnerID))
|
||||
reinitCancel()
|
||||
reinitEvents = nil
|
||||
<-ctx.Done()
|
||||
mustExit = true
|
||||
default:
|
||||
lastOwnerID = event.OwnerID
|
||||
logger.Info(ctx, "agent received instruction to reinitialize",
|
||||
slog.F("workspace_id", event.WorkspaceID), slog.F("reason", event.Reason))
|
||||
}
|
||||
}
|
||||
|
||||
lastErr = agnt.Close()
|
||||
|
||||
Generated
+21
-2
@@ -10205,12 +10205,26 @@ const docTemplate = `{
|
||||
],
|
||||
"summary": "Get workspace agent reinitialization",
|
||||
"operationId": "get-workspace-agent-reinitialization",
|
||||
"parameters": [
|
||||
{
|
||||
"type": "boolean",
|
||||
"description": "Opt in to durable reinit checks",
|
||||
"name": "wait",
|
||||
"in": "query"
|
||||
}
|
||||
],
|
||||
"responses": {
|
||||
"200": {
|
||||
"description": "OK",
|
||||
"schema": {
|
||||
"$ref": "#/definitions/agentsdk.ReinitializationEvent"
|
||||
}
|
||||
},
|
||||
"409": {
|
||||
"description": "Conflict",
|
||||
"schema": {
|
||||
"$ref": "#/definitions/codersdk.Response"
|
||||
}
|
||||
}
|
||||
},
|
||||
"security": [
|
||||
@@ -12647,11 +12661,16 @@ const docTemplate = `{
|
||||
"agentsdk.ReinitializationEvent": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"owner_id": {
|
||||
"type": "string",
|
||||
"format": "uuid"
|
||||
},
|
||||
"reason": {
|
||||
"$ref": "#/definitions/agentsdk.ReinitializationReason"
|
||||
},
|
||||
"workspaceID": {
|
||||
"type": "string"
|
||||
"workspace_id": {
|
||||
"type": "string",
|
||||
"format": "uuid"
|
||||
}
|
||||
}
|
||||
},
|
||||
|
||||
Generated
+21
-2
@@ -9038,12 +9038,26 @@
|
||||
"tags": ["Agents"],
|
||||
"summary": "Get workspace agent reinitialization",
|
||||
"operationId": "get-workspace-agent-reinitialization",
|
||||
"parameters": [
|
||||
{
|
||||
"type": "boolean",
|
||||
"description": "Opt in to durable reinit checks",
|
||||
"name": "wait",
|
||||
"in": "query"
|
||||
}
|
||||
],
|
||||
"responses": {
|
||||
"200": {
|
||||
"description": "OK",
|
||||
"schema": {
|
||||
"$ref": "#/definitions/agentsdk.ReinitializationEvent"
|
||||
}
|
||||
},
|
||||
"409": {
|
||||
"description": "Conflict",
|
||||
"schema": {
|
||||
"$ref": "#/definitions/codersdk.Response"
|
||||
}
|
||||
}
|
||||
},
|
||||
"security": [
|
||||
@@ -11229,11 +11243,16 @@
|
||||
"agentsdk.ReinitializationEvent": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"owner_id": {
|
||||
"type": "string",
|
||||
"format": "uuid"
|
||||
},
|
||||
"reason": {
|
||||
"$ref": "#/definitions/agentsdk.ReinitializationReason"
|
||||
},
|
||||
"workspaceID": {
|
||||
"type": "string"
|
||||
"workspace_id": {
|
||||
"type": "string",
|
||||
"format": "uuid"
|
||||
}
|
||||
}
|
||||
},
|
||||
|
||||
+29
-16
@@ -2,6 +2,7 @@ package prebuilds
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
"sync"
|
||||
|
||||
"github.com/google/uuid"
|
||||
@@ -22,7 +23,11 @@ type PubsubWorkspaceClaimPublisher struct {
|
||||
|
||||
func (p PubsubWorkspaceClaimPublisher) PublishWorkspaceClaim(claim agentsdk.ReinitializationEvent) error {
|
||||
channel := agentsdk.PrebuildClaimedChannel(claim.WorkspaceID)
|
||||
if err := p.ps.Publish(channel, []byte(claim.Reason)); err != nil {
|
||||
payload, err := json.Marshal(claim)
|
||||
if err != nil {
|
||||
return xerrors.Errorf("marshal claim event: %w", err)
|
||||
}
|
||||
if err := p.ps.Publish(channel, payload); err != nil {
|
||||
return xerrors.Errorf("failed to trigger prebuilt workspace agent reinitialization: %w", err)
|
||||
}
|
||||
return nil
|
||||
@@ -37,33 +42,41 @@ type PubsubWorkspaceClaimListener struct {
|
||||
ps pubsub.Pubsub
|
||||
}
|
||||
|
||||
// ListenForWorkspaceClaims subscribes to a pubsub channel and sends any received events on the chan that it returns.
|
||||
// pubsub.Pubsub does not communicate when its last callback has been called after it has been closed. As such the chan
|
||||
// returned by this method is never closed. Call the returned cancel() function to close the subscription when it is no longer needed.
|
||||
// cancel() will be called if ctx expires or is canceled.
|
||||
func (p PubsubWorkspaceClaimListener) ListenForWorkspaceClaims(ctx context.Context, workspaceID uuid.UUID, reinitEvents chan<- agentsdk.ReinitializationEvent) (func(), error) {
|
||||
// ListenForWorkspaceClaims subscribes to a pubsub channel and returns a
|
||||
// receive-only channel that emits claim events for the given workspace.
|
||||
// The returned channel is owned by this function and is never closed,
|
||||
// because pubsub.Pubsub does not guarantee that all in-flight callbacks
|
||||
// have returned after unsubscribe. Call the returned cancel function to
|
||||
// unsubscribe when events are no longer needed; cancel is also called
|
||||
// automatically if ctx expires or is canceled.
|
||||
func (p PubsubWorkspaceClaimListener) ListenForWorkspaceClaims(ctx context.Context, workspaceID uuid.UUID) (<-chan agentsdk.ReinitializationEvent, func(), error) {
|
||||
select {
|
||||
case <-ctx.Done():
|
||||
return func() {}, ctx.Err()
|
||||
return nil, func() {}, ctx.Err()
|
||||
default:
|
||||
}
|
||||
|
||||
cancelSub, err := p.ps.Subscribe(agentsdk.PrebuildClaimedChannel(workspaceID), func(inner context.Context, reason []byte) {
|
||||
claim := agentsdk.ReinitializationEvent{
|
||||
WorkspaceID: workspaceID,
|
||||
Reason: agentsdk.ReinitializationReason(reason),
|
||||
reinitEvents := make(chan agentsdk.ReinitializationEvent, 1)
|
||||
|
||||
cancelSub, err := p.ps.Subscribe(agentsdk.PrebuildClaimedChannel(workspaceID), func(inner context.Context, payload []byte) {
|
||||
var event agentsdk.ReinitializationEvent
|
||||
if err := json.Unmarshal(payload, &event); err != nil {
|
||||
// Rolling upgrade: old publishers send the raw reason
|
||||
// string instead of JSON.
|
||||
event = agentsdk.ReinitializationEvent{
|
||||
WorkspaceID: workspaceID,
|
||||
Reason: agentsdk.ReinitializationReason(payload),
|
||||
}
|
||||
}
|
||||
|
||||
select {
|
||||
case <-ctx.Done():
|
||||
return
|
||||
case <-inner.Done():
|
||||
return
|
||||
case reinitEvents <- claim:
|
||||
case reinitEvents <- event:
|
||||
}
|
||||
})
|
||||
if err != nil {
|
||||
return func() {}, xerrors.Errorf("failed to subscribe to prebuild claimed channel: %w", err)
|
||||
return nil, func() {}, xerrors.Errorf("failed to subscribe to prebuild claimed channel: %w", err)
|
||||
}
|
||||
|
||||
var once sync.Once
|
||||
@@ -78,5 +91,5 @@ func (p PubsubWorkspaceClaimListener) ListenForWorkspaceClaims(ctx context.Conte
|
||||
cancel()
|
||||
}()
|
||||
|
||||
return cancel, nil
|
||||
return reinitEvents, cancel, nil
|
||||
}
|
||||
|
||||
@@ -25,24 +25,26 @@ func TestPubsubWorkspaceClaimPublisher(t *testing.T) {
|
||||
logger := testutil.Logger(t)
|
||||
ps := pubsub.NewInMemory()
|
||||
workspaceID := uuid.New()
|
||||
reinitEvents := make(chan agentsdk.ReinitializationEvent, 1)
|
||||
publisher := prebuilds.NewPubsubWorkspaceClaimPublisher(ps)
|
||||
listener := prebuilds.NewPubsubWorkspaceClaimListener(ps, logger)
|
||||
|
||||
cancel, err := listener.ListenForWorkspaceClaims(ctx, workspaceID, reinitEvents)
|
||||
events, cancel, err := listener.ListenForWorkspaceClaims(ctx, workspaceID)
|
||||
require.NoError(t, err)
|
||||
defer cancel()
|
||||
|
||||
userID := uuid.New()
|
||||
claim := agentsdk.ReinitializationEvent{
|
||||
WorkspaceID: workspaceID,
|
||||
Reason: agentsdk.ReinitializeReasonPrebuildClaimed,
|
||||
OwnerID: userID,
|
||||
}
|
||||
err = publisher.PublishWorkspaceClaim(claim)
|
||||
require.NoError(t, err)
|
||||
|
||||
gotEvent := testutil.RequireReceive(ctx, t, reinitEvents)
|
||||
gotEvent := testutil.RequireReceive(ctx, t, events)
|
||||
require.Equal(t, workspaceID, gotEvent.WorkspaceID)
|
||||
require.Equal(t, claim.Reason, gotEvent.Reason)
|
||||
require.Equal(t, userID, gotEvent.OwnerID)
|
||||
})
|
||||
|
||||
t.Run("fail to publish claim", func(t *testing.T) {
|
||||
@@ -69,10 +71,8 @@ func TestPubsubWorkspaceClaimListener(t *testing.T) {
|
||||
ps := pubsub.NewInMemory()
|
||||
listener := prebuilds.NewPubsubWorkspaceClaimListener(ps, slogtest.Make(t, nil))
|
||||
|
||||
claims := make(chan agentsdk.ReinitializationEvent, 1) // Buffer to avoid messing with goroutines in the rest of the test
|
||||
|
||||
workspaceID := uuid.New()
|
||||
cancelFunc, err := listener.ListenForWorkspaceClaims(context.Background(), workspaceID, claims)
|
||||
events, cancelFunc, err := listener.ListenForWorkspaceClaims(context.Background(), workspaceID)
|
||||
require.NoError(t, err)
|
||||
defer cancelFunc()
|
||||
|
||||
@@ -84,9 +84,10 @@ func TestPubsubWorkspaceClaimListener(t *testing.T) {
|
||||
|
||||
// Verify we receive the claim
|
||||
ctx := testutil.Context(t, testutil.WaitShort)
|
||||
claim := testutil.RequireReceive(ctx, t, claims)
|
||||
claim := testutil.RequireReceive(ctx, t, events)
|
||||
require.Equal(t, workspaceID, claim.WorkspaceID)
|
||||
require.Equal(t, reason, claim.Reason)
|
||||
require.Equal(t, uuid.Nil, claim.OwnerID)
|
||||
})
|
||||
|
||||
t.Run("ignores claim events for other workspaces", func(t *testing.T) {
|
||||
@@ -95,10 +96,9 @@ func TestPubsubWorkspaceClaimListener(t *testing.T) {
|
||||
ps := pubsub.NewInMemory()
|
||||
listener := prebuilds.NewPubsubWorkspaceClaimListener(ps, slogtest.Make(t, nil))
|
||||
|
||||
claims := make(chan agentsdk.ReinitializationEvent)
|
||||
workspaceID := uuid.New()
|
||||
otherWorkspaceID := uuid.New()
|
||||
cancelFunc, err := listener.ListenForWorkspaceClaims(context.Background(), workspaceID, claims)
|
||||
events, cancelFunc, err := listener.ListenForWorkspaceClaims(context.Background(), workspaceID)
|
||||
require.NoError(t, err)
|
||||
defer cancelFunc()
|
||||
|
||||
@@ -109,7 +109,7 @@ func TestPubsubWorkspaceClaimListener(t *testing.T) {
|
||||
|
||||
// Verify we don't receive the claim
|
||||
select {
|
||||
case <-claims:
|
||||
case <-events:
|
||||
t.Fatal("received claim for wrong workspace")
|
||||
case <-time.After(100 * time.Millisecond):
|
||||
// Expected - no claim received
|
||||
@@ -119,11 +119,10 @@ func TestPubsubWorkspaceClaimListener(t *testing.T) {
|
||||
t.Run("communicates the error if it can't subscribe", func(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
claims := make(chan agentsdk.ReinitializationEvent)
|
||||
ps := &brokenPubsub{}
|
||||
listener := prebuilds.NewPubsubWorkspaceClaimListener(ps, slogtest.Make(t, nil))
|
||||
|
||||
_, err := listener.ListenForWorkspaceClaims(context.Background(), uuid.New(), claims)
|
||||
_, _, err := listener.ListenForWorkspaceClaims(context.Background(), uuid.New())
|
||||
require.ErrorContains(t, err, "failed to subscribe to prebuild claimed channel")
|
||||
})
|
||||
}
|
||||
|
||||
@@ -2539,6 +2539,7 @@ func (s *server) completeWorkspaceBuildJob(ctx context.Context, job database.Pro
|
||||
err = prebuilds.NewPubsubWorkspaceClaimPublisher(s.Pubsub).PublishWorkspaceClaim(agentsdk.ReinitializationEvent{
|
||||
WorkspaceID: workspace.ID,
|
||||
Reason: agentsdk.ReinitializeReasonPrebuildClaimed,
|
||||
OwnerID: workspace.OwnerID,
|
||||
})
|
||||
if err != nil {
|
||||
s.Logger.Error(ctx, "failed to publish workspace claim event", slog.Error(err))
|
||||
|
||||
@@ -51,7 +51,6 @@ import (
|
||||
"github.com/coder/coder/v2/coderd/usage/usagetypes"
|
||||
"github.com/coder/coder/v2/coderd/wspubsub"
|
||||
"github.com/coder/coder/v2/codersdk"
|
||||
"github.com/coder/coder/v2/codersdk/agentsdk"
|
||||
"github.com/coder/coder/v2/provisionerd/proto"
|
||||
"github.com/coder/coder/v2/provisionersdk"
|
||||
sdkproto "github.com/coder/coder/v2/provisionersdk/proto"
|
||||
@@ -2787,8 +2786,7 @@ func TestCompleteJob(t *testing.T) {
|
||||
require.NoError(t, err)
|
||||
|
||||
// GIVEN something is listening to process workspace reinitialization:
|
||||
reinitChan := make(chan agentsdk.ReinitializationEvent, 1) // Buffered to simplify test structure
|
||||
cancel, err := agplprebuilds.NewPubsubWorkspaceClaimListener(ps, testutil.Logger(t)).ListenForWorkspaceClaims(ctx, workspace.ID, reinitChan)
|
||||
reinitChan, cancel, err := agplprebuilds.NewPubsubWorkspaceClaimListener(ps, testutil.Logger(t)).ListenForWorkspaceClaims(ctx, workspace.ID)
|
||||
require.NoError(t, err)
|
||||
defer cancel()
|
||||
|
||||
|
||||
+100
-3
@@ -1465,7 +1465,9 @@ func (api *API) workspaceAgentPostLogSource(rw http.ResponseWriter, r *http.Requ
|
||||
// @Security CoderSessionToken
|
||||
// @Produce json
|
||||
// @Tags Agents
|
||||
// @Param wait query bool false "Opt in to durable reinit checks"
|
||||
// @Success 200 {object} agentsdk.ReinitializationEvent
|
||||
// @Failure 409 {object} codersdk.Response
|
||||
// @Router /workspaceagents/me/reinit [get]
|
||||
func (api *API) workspaceAgentReinit(rw http.ResponseWriter, r *http.Request) {
|
||||
// Allow us to interrupt watch via cancel.
|
||||
@@ -1482,18 +1484,113 @@ func (api *API) workspaceAgentReinit(rw http.ResponseWriter, r *http.Request) {
|
||||
if err != nil {
|
||||
log.Error(ctx, "failed to retrieve workspace from agent token", slog.Error(err))
|
||||
httpapi.InternalServerError(rw, xerrors.New("failed to determine workspace from agent token"))
|
||||
return
|
||||
}
|
||||
log = log.With(slog.F("workspace_id", workspace.ID))
|
||||
|
||||
log.Info(ctx, "agent waiting for reinit instruction")
|
||||
|
||||
reinitEvents := make(chan agentsdk.ReinitializationEvent)
|
||||
cancel, err = prebuilds.NewPubsubWorkspaceClaimListener(api.Pubsub, log).ListenForWorkspaceClaims(ctx, workspace.ID, reinitEvents)
|
||||
// Subscribe to claim events BEFORE any durable checks to avoid a
|
||||
// TOCTOU race: without this, a claim could fire between the
|
||||
// IsPrebuild() check and the subscribe call, and we'd miss the
|
||||
// pubsub event entirely. By subscribing first, any event that
|
||||
// fires during the checks below is buffered in the channel.
|
||||
pubsubCh, cancelSub, err := prebuilds.NewPubsubWorkspaceClaimListener(api.Pubsub, log).ListenForWorkspaceClaims(ctx, workspace.ID)
|
||||
if err != nil {
|
||||
log.Error(ctx, "subscribe to prebuild claimed channel", slog.Error(err))
|
||||
httpapi.InternalServerError(rw, xerrors.New("failed to subscribe to prebuild claimed channel"))
|
||||
return
|
||||
}
|
||||
defer cancel()
|
||||
defer cancelSub()
|
||||
|
||||
reinitEvents := pubsubCh
|
||||
|
||||
// Only perform the durable claim check when the agent opts in via
|
||||
// the "wait" query parameter. Older agents don't send the
|
||||
// "wait" query parameter and lack the duplicate-reinit guard, so
|
||||
// they would enter an infinite reinit loop if we pre-seeded the
|
||||
// channel on every connection.
|
||||
waitParam, _ := strconv.ParseBool(r.URL.Query().Get("wait"))
|
||||
if waitParam && !workspace.IsPrebuild() {
|
||||
firstBuild, err := api.Database.GetWorkspaceBuildByWorkspaceIDAndBuildNumber(ctx,
|
||||
database.GetWorkspaceBuildByWorkspaceIDAndBuildNumberParams{
|
||||
WorkspaceID: workspace.ID,
|
||||
BuildNumber: 1,
|
||||
})
|
||||
if err != nil {
|
||||
log.Error(ctx, "failed to get first workspace build", slog.Error(err))
|
||||
httpapi.InternalServerError(rw, xerrors.New("failed to get first workspace build"))
|
||||
return
|
||||
}
|
||||
if firstBuild.InitiatorID != database.PrebuildsSystemUserID {
|
||||
// Not a claimed prebuild — this is a regular workspace.
|
||||
// Return 409 so the agent stops reconnecting to this
|
||||
// endpoint.
|
||||
httpapi.Write(ctx, rw, http.StatusConflict, codersdk.Response{
|
||||
Message: "Workspace is not a prebuilt workspace waiting to be claimed.",
|
||||
Detail: "This endpoint is only for agents running in prebuilt workspaces.",
|
||||
})
|
||||
return
|
||||
}
|
||||
|
||||
// This workspace was a prebuild that got claimed. Check if
|
||||
// the claim build completed successfully before sending
|
||||
// reinit. We assume the latest build is the claim build
|
||||
// (build 2). If a third build (e.g. a restart) starts
|
||||
// between the claim and the agent's reconnection, this
|
||||
// would check that build instead. The window is extremely
|
||||
// small in practice, and a restart would trigger its own
|
||||
// reinit path.
|
||||
latestBuild, err := api.Database.GetLatestWorkspaceBuildByWorkspaceID(ctx, workspace.ID)
|
||||
if err != nil {
|
||||
log.Error(ctx, "failed to get latest workspace build", slog.Error(err))
|
||||
httpapi.InternalServerError(rw, xerrors.New("failed to get latest workspace build"))
|
||||
return
|
||||
}
|
||||
job, err := api.Database.GetProvisionerJobByID(ctx, latestBuild.JobID)
|
||||
if err != nil {
|
||||
log.Error(ctx, "failed to get provisioner job", slog.Error(err))
|
||||
httpapi.InternalServerError(rw, xerrors.New("failed to get provisioner job"))
|
||||
return
|
||||
}
|
||||
|
||||
if job.CompletedAt.Valid && !job.Error.Valid {
|
||||
// Claim build succeeded — cancel the pubsub
|
||||
// subscription (no longer needed) and swap in a
|
||||
// pre-seeded channel so the transmitter delivers
|
||||
// exactly one reinit event.
|
||||
cancelSub()
|
||||
seeded := make(chan agentsdk.ReinitializationEvent, 1)
|
||||
seeded <- agentsdk.ReinitializationEvent{
|
||||
WorkspaceID: workspace.ID,
|
||||
Reason: agentsdk.ReinitializeReasonPrebuildClaimed,
|
||||
OwnerID: workspace.OwnerID,
|
||||
}
|
||||
reinitEvents = seeded
|
||||
} else if job.CompletedAt.Valid && job.Error.Valid {
|
||||
// Claim build failed permanently. Return 409 so the
|
||||
// agent treats this as terminal and stops retrying
|
||||
// (WaitForReinitLoop exits on any 409).
|
||||
cancelSub()
|
||||
log.Warn(ctx, "claim build failed",
|
||||
slog.F("job_id", job.ID),
|
||||
slog.F("error", job.Error.String))
|
||||
httpapi.Write(ctx, rw, http.StatusConflict, codersdk.Response{
|
||||
Message: "Claim build failed permanently.",
|
||||
Detail: job.Error.String,
|
||||
})
|
||||
return
|
||||
}
|
||||
|
||||
// Claim build still in progress — fall through to the
|
||||
// transmitter. The pubsub subscription (set up above)
|
||||
// will deliver the event when the build completes
|
||||
// successfully. Note: FailJob does not publish a claim
|
||||
// event, so a failed in-progress build will leave the
|
||||
// agent blocking here until it disconnects and
|
||||
// reconnects (at which point the durable check above
|
||||
// handles it).
|
||||
}
|
||||
|
||||
transmitter := agentsdk.NewSSEAgentReinitTransmitter(log, rw, r)
|
||||
|
||||
|
||||
+190
-35
@@ -2,6 +2,7 @@ package coderd_test
|
||||
|
||||
import (
|
||||
"context"
|
||||
"database/sql"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"io"
|
||||
@@ -3278,51 +3279,205 @@ func TestAgentConnectionInfo(t *testing.T) {
|
||||
func TestReinit(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
db, ps := dbtestutil.NewDB(t)
|
||||
pubsubSpy := pubsubReinitSpy{
|
||||
Pubsub: ps,
|
||||
triedToSubscribe: make(chan string),
|
||||
// Helper to create the prebuilds system user's workspace (an
|
||||
// unclaimed prebuild) and return the build result. The first
|
||||
// build's InitiatorID defaults to PrebuildsSystemUserID via
|
||||
// dbfake.
|
||||
setupPrebuildWorkspace := func(t *testing.T, db database.Store, orgID uuid.UUID) dbfake.WorkspaceResponse {
|
||||
t.Helper()
|
||||
return dbfake.WorkspaceBuild(t, db, database.WorkspaceTable{
|
||||
OrganizationID: orgID,
|
||||
OwnerID: database.PrebuildsSystemUserID,
|
||||
}).WithAgent().Do()
|
||||
}
|
||||
client := coderdtest.New(t, &coderdtest.Options{
|
||||
Database: db,
|
||||
Pubsub: &pubsubSpy,
|
||||
|
||||
// Helper to simulate claiming a prebuild: change the workspace
|
||||
// owner to the real user and create a second (claim) build.
|
||||
claimPrebuild := func(t *testing.T, db database.Store, sqlDB *sql.DB, ws database.WorkspaceTable, claimerID uuid.UUID, templateVersionID uuid.UUID, complete bool) dbfake.WorkspaceResponse {
|
||||
t.Helper()
|
||||
// Change the workspace owner to the claiming user.
|
||||
_, err := sqlDB.Exec("UPDATE workspaces SET owner_id = $1 WHERE id = $2", claimerID, ws.ID)
|
||||
require.NoError(t, err)
|
||||
|
||||
// Update the in-memory workspace to reflect the new owner
|
||||
// so that dbfake uses it for the second build.
|
||||
ws.OwnerID = claimerID
|
||||
|
||||
builder := dbfake.WorkspaceBuild(t, db, ws).
|
||||
Seed(database.WorkspaceBuild{
|
||||
TemplateVersionID: templateVersionID,
|
||||
BuildNumber: 2,
|
||||
InitiatorID: claimerID,
|
||||
Transition: database.WorkspaceTransitionStart,
|
||||
}).
|
||||
WithAgent()
|
||||
if !complete {
|
||||
builder = builder.Starting()
|
||||
}
|
||||
return builder.Do()
|
||||
}
|
||||
|
||||
t.Run("unclaimed prebuild receives reinit via pubsub", func(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
db, ps := dbtestutil.NewDB(t)
|
||||
pubsubSpy := pubsubReinitSpy{
|
||||
Pubsub: ps,
|
||||
triedToSubscribe: make(chan string),
|
||||
}
|
||||
client := coderdtest.New(t, &coderdtest.Options{
|
||||
Database: db,
|
||||
Pubsub: &pubsubSpy,
|
||||
})
|
||||
user := coderdtest.CreateFirstUser(t, client)
|
||||
|
||||
r := setupPrebuildWorkspace(t, db, user.OrganizationID)
|
||||
|
||||
pubsubSpy.Lock()
|
||||
pubsubSpy.expectedEvent = agentsdk.PrebuildClaimedChannel(r.Workspace.ID)
|
||||
pubsubSpy.Unlock()
|
||||
|
||||
agentCtx := testutil.Context(t, testutil.WaitShort)
|
||||
agentClient := agentsdk.New(client.URL, agentsdk.WithFixedToken(r.AgentToken))
|
||||
|
||||
agentReinitializedCh := make(chan *agentsdk.ReinitializationEvent)
|
||||
go func() {
|
||||
reinitEvent, err := agentClient.WaitForReinit(agentCtx)
|
||||
assert.NoError(t, err)
|
||||
agentReinitializedCh <- reinitEvent
|
||||
}()
|
||||
|
||||
// We need to subscribe before we publish, lest we miss the
|
||||
// event.
|
||||
ctx := testutil.Context(t, testutil.WaitShort)
|
||||
testutil.TryReceive(ctx, t, pubsubSpy.triedToSubscribe)
|
||||
|
||||
// Now that we're subscribed, publish the event.
|
||||
err := prebuilds.NewPubsubWorkspaceClaimPublisher(ps).PublishWorkspaceClaim(agentsdk.ReinitializationEvent{
|
||||
WorkspaceID: r.Workspace.ID,
|
||||
Reason: agentsdk.ReinitializeReasonPrebuildClaimed,
|
||||
})
|
||||
require.NoError(t, err)
|
||||
|
||||
ctx = testutil.Context(t, testutil.WaitShort)
|
||||
reinitEvent := testutil.TryReceive(ctx, t, agentReinitializedCh)
|
||||
require.NotNil(t, reinitEvent)
|
||||
require.Equal(t, r.Workspace.ID, reinitEvent.WorkspaceID)
|
||||
})
|
||||
user := coderdtest.CreateFirstUser(t, client)
|
||||
|
||||
r := dbfake.WorkspaceBuild(t, db, database.WorkspaceTable{
|
||||
OrganizationID: user.OrganizationID,
|
||||
OwnerID: user.UserID,
|
||||
}).WithAgent().Do()
|
||||
// Verifies the durable claim check: when an agent reconnects
|
||||
// after missing the pubsub event, the handler detects that the
|
||||
// workspace was originally a prebuild (first build initiated by
|
||||
// PrebuildsSystemUserID), is now claimed (owner changed), and
|
||||
// the claim build completed, so it sends a one-shot reinit
|
||||
// event immediately.
|
||||
t.Run("claimed prebuild receives one-shot reinit on reconnect", func(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
pubsubSpy.Lock()
|
||||
pubsubSpy.expectedEvent = agentsdk.PrebuildClaimedChannel(r.Workspace.ID)
|
||||
pubsubSpy.Unlock()
|
||||
db, ps, sqlDB := dbtestutil.NewDBWithSQLDB(t)
|
||||
client := coderdtest.New(t, &coderdtest.Options{
|
||||
Database: db,
|
||||
Pubsub: ps,
|
||||
})
|
||||
user := coderdtest.CreateFirstUser(t, client)
|
||||
|
||||
agentCtx := testutil.Context(t, testutil.WaitShort)
|
||||
agentClient := agentsdk.New(client.URL, agentsdk.WithFixedToken(r.AgentToken))
|
||||
// Create an unclaimed prebuild (build 1, completed).
|
||||
r := setupPrebuildWorkspace(t, db, user.OrganizationID)
|
||||
|
||||
agentReinitializedCh := make(chan *agentsdk.ReinitializationEvent)
|
||||
go func() {
|
||||
reinitEvent, err := agentClient.WaitForReinit(agentCtx)
|
||||
assert.NoError(t, err)
|
||||
agentReinitializedCh <- reinitEvent
|
||||
}()
|
||||
// Claim it: change owner + create build 2 (completed).
|
||||
claimR := claimPrebuild(t, db, sqlDB, r.Workspace, user.UserID, r.TemplateVersion.ID, true)
|
||||
|
||||
// We need to subscribe before we publish, lest we miss the event
|
||||
ctx := testutil.Context(t, testutil.WaitShort)
|
||||
testutil.TryReceive(ctx, t, pubsubSpy.triedToSubscribe)
|
||||
agentCtx := testutil.Context(t, testutil.WaitShort)
|
||||
agentClient := agentsdk.New(client.URL, agentsdk.WithFixedToken(claimR.AgentToken))
|
||||
|
||||
// Now that we're subscribed, publish the event
|
||||
err := prebuilds.NewPubsubWorkspaceClaimPublisher(ps).PublishWorkspaceClaim(agentsdk.ReinitializationEvent{
|
||||
WorkspaceID: r.Workspace.ID,
|
||||
Reason: agentsdk.ReinitializeReasonPrebuildClaimed,
|
||||
agentReinitializedCh := make(chan *agentsdk.ReinitializationEvent)
|
||||
go func() {
|
||||
reinitEvent, err := agentClient.WaitForReinit(agentCtx)
|
||||
assert.NoError(t, err)
|
||||
agentReinitializedCh <- reinitEvent
|
||||
}()
|
||||
|
||||
// The agent should receive a reinit event immediately from
|
||||
// the durable claim check — no pubsub publish needed.
|
||||
ctx := testutil.Context(t, testutil.WaitShort)
|
||||
reinitEvent := testutil.TryReceive(ctx, t, agentReinitializedCh)
|
||||
require.NotNil(t, reinitEvent)
|
||||
require.Equal(t, r.Workspace.ID, reinitEvent.WorkspaceID)
|
||||
require.Equal(t, agentsdk.ReinitializeReasonPrebuildClaimed, reinitEvent.Reason)
|
||||
require.Equal(t, user.UserID, reinitEvent.OwnerID)
|
||||
})
|
||||
require.NoError(t, err)
|
||||
|
||||
ctx = testutil.Context(t, testutil.WaitShort)
|
||||
reinitEvent := testutil.TryReceive(ctx, t, agentReinitializedCh)
|
||||
require.NotNil(t, reinitEvent)
|
||||
require.Equal(t, r.Workspace.ID, reinitEvent.WorkspaceID)
|
||||
// Verifies that when the claim build completed with an error,
|
||||
// the handler returns 409 so the agent treats it as terminal
|
||||
// and stops retrying (WaitForReinitLoop exits on any 409).
|
||||
t.Run("failed claim build returns terminal 409", func(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
db, ps, sqlDB := dbtestutil.NewDBWithSQLDB(t)
|
||||
client := coderdtest.New(t, &coderdtest.Options{
|
||||
Database: db,
|
||||
Pubsub: ps,
|
||||
})
|
||||
user := coderdtest.CreateFirstUser(t, client)
|
||||
|
||||
// Create an unclaimed prebuild (build 1, completed).
|
||||
r := setupPrebuildWorkspace(t, db, user.OrganizationID)
|
||||
|
||||
// Claim it: create build 2 as completed (so agent rows
|
||||
// exist and the token is valid for auth).
|
||||
claimR := claimPrebuild(t, db, sqlDB, r.Workspace, user.UserID, r.TemplateVersion.ID, true)
|
||||
|
||||
// Simulate a claim build failure: set an error on the
|
||||
// provisioner job. This models the case where terraform
|
||||
// apply partially succeeded (creating resources/agents)
|
||||
// but ultimately errored.
|
||||
_, err := sqlDB.Exec(
|
||||
"UPDATE provisioner_jobs SET error = 'simulated claim failure' WHERE id = $1",
|
||||
claimR.Build.JobID,
|
||||
)
|
||||
require.NoError(t, err)
|
||||
|
||||
agentCtx := testutil.Context(t, testutil.WaitShort)
|
||||
agentClient := agentsdk.New(client.URL, agentsdk.WithFixedToken(claimR.AgentToken))
|
||||
|
||||
_, err = agentClient.WaitForReinit(agentCtx)
|
||||
require.Error(t, err)
|
||||
var sdkErr *codersdk.Error
|
||||
require.ErrorAs(t, err, &sdkErr)
|
||||
require.Equal(t, http.StatusConflict, sdkErr.StatusCode())
|
||||
})
|
||||
|
||||
// Verifies that a regular workspace (never a prebuild) gets a
|
||||
// 409 Conflict response, causing the agent's reinit loop to
|
||||
// close the channel gracefully.
|
||||
t.Run("regular workspace gets 409", func(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
db, ps := dbtestutil.NewDB(t)
|
||||
client := coderdtest.New(t, &coderdtest.Options{
|
||||
Database: db,
|
||||
Pubsub: ps,
|
||||
})
|
||||
user := coderdtest.CreateFirstUser(t, client)
|
||||
|
||||
// Create a regular workspace (not a prebuild). The first
|
||||
// build's initiator will be the user, not the prebuilds
|
||||
// system user.
|
||||
r := dbfake.WorkspaceBuild(t, db, database.WorkspaceTable{
|
||||
OrganizationID: user.OrganizationID,
|
||||
OwnerID: user.UserID,
|
||||
}).WithAgent().Do()
|
||||
|
||||
agentCtx := testutil.Context(t, testutil.WaitShort)
|
||||
agentClient := agentsdk.New(client.URL, agentsdk.WithFixedToken(r.AgentToken))
|
||||
|
||||
// WaitForReinit should return an error wrapping a 409.
|
||||
_, err := agentClient.WaitForReinit(agentCtx)
|
||||
require.Error(t, err)
|
||||
var sdkErr *codersdk.Error
|
||||
require.ErrorAs(t, err, &sdkErr)
|
||||
require.Equal(t, http.StatusConflict, sdkErr.StatusCode())
|
||||
})
|
||||
}
|
||||
|
||||
type pubsubReinitSpy struct {
|
||||
|
||||
@@ -3,6 +3,7 @@ package agentsdk
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"fmt"
|
||||
"io"
|
||||
"net/http"
|
||||
@@ -706,8 +707,9 @@ const (
|
||||
)
|
||||
|
||||
type ReinitializationEvent struct {
|
||||
WorkspaceID uuid.UUID
|
||||
WorkspaceID uuid.UUID `json:"workspace_id" format:"uuid"`
|
||||
Reason ReinitializationReason `json:"reason"`
|
||||
OwnerID uuid.UUID `json:"owner_id,omitzero" format:"uuid"`
|
||||
}
|
||||
|
||||
func PrebuildClaimedChannel(id uuid.UUID) string {
|
||||
@@ -722,6 +724,9 @@ func (c *Client) WaitForReinit(ctx context.Context) (*ReinitializationEvent, err
|
||||
if err != nil {
|
||||
return nil, xerrors.Errorf("parse url: %w", err)
|
||||
}
|
||||
q := rpcURL.Query()
|
||||
q.Set("wait", "true")
|
||||
rpcURL.RawQuery = q.Encode()
|
||||
|
||||
httpClient := &http.Client{
|
||||
Transport: c.SDK.HTTPClient.Transport,
|
||||
@@ -750,21 +755,33 @@ func (c *Client) WaitForReinit(ctx context.Context) (*ReinitializationEvent, err
|
||||
return reinitEvent, nil
|
||||
}
|
||||
|
||||
// WaitForReinitLoop polls the /reinit SSE endpoint in a retry loop and
|
||||
// forwards received reinitialization events to the returned channel. The
|
||||
// channel is closed when ctx is canceled or the server returns 409
|
||||
// Conflict (indicating the workspace is not a prebuilt workspace or the
|
||||
// claim build failed permanently). The caller should select on both the
|
||||
// channel and ctx.Done().
|
||||
func WaitForReinitLoop(ctx context.Context, logger slog.Logger, client *Client) <-chan ReinitializationEvent {
|
||||
reinitEvents := make(chan ReinitializationEvent)
|
||||
|
||||
go func() {
|
||||
defer close(reinitEvents)
|
||||
for retrier := retry.New(100*time.Millisecond, 10*time.Second); retrier.Wait(ctx); {
|
||||
logger.Debug(ctx, "waiting for agent reinitialization instructions")
|
||||
reinitEvent, err := client.WaitForReinit(ctx)
|
||||
if err != nil {
|
||||
var sdkErr *codersdk.Error
|
||||
if errors.As(err, &sdkErr) && sdkErr.StatusCode() == http.StatusConflict {
|
||||
logger.Info(ctx, "received terminal 409, stopping reinit polling",
|
||||
slog.Error(sdkErr))
|
||||
return
|
||||
}
|
||||
logger.Error(ctx, "failed to wait for agent reinitialization instructions", slog.Error(err))
|
||||
continue
|
||||
}
|
||||
retrier.Reset()
|
||||
select {
|
||||
case <-ctx.Done():
|
||||
close(reinitEvents)
|
||||
return
|
||||
case reinitEvents <- *reinitEvent:
|
||||
}
|
||||
|
||||
@@ -26,6 +26,7 @@ func TestStreamAgentReinitEvents(t *testing.T) {
|
||||
eventToSend := agentsdk.ReinitializationEvent{
|
||||
WorkspaceID: uuid.New(),
|
||||
Reason: agentsdk.ReinitializeReasonPrebuildClaimed,
|
||||
OwnerID: uuid.New(),
|
||||
}
|
||||
|
||||
events := make(chan agentsdk.ReinitializationEvent, 1)
|
||||
|
||||
Generated
+12
-4
@@ -483,22 +483,30 @@ curl -X GET http://coder-server:8080/api/v2/workspaceagents/me/reinit \
|
||||
|
||||
`GET /workspaceagents/me/reinit`
|
||||
|
||||
### Parameters
|
||||
|
||||
| Name | In | Type | Required | Description |
|
||||
|--------|-------|---------|----------|---------------------------------|
|
||||
| `wait` | query | boolean | false | Opt in to durable reinit checks |
|
||||
|
||||
### Example responses
|
||||
|
||||
> 200 Response
|
||||
|
||||
```json
|
||||
{
|
||||
"owner_id": "8826ee2e-7933-4665-aef2-2393f84a0d05",
|
||||
"reason": "prebuild_claimed",
|
||||
"workspaceID": "string"
|
||||
"workspace_id": "0967198e-ec7b-4c6b-b4d3-f71244cadbe9"
|
||||
}
|
||||
```
|
||||
|
||||
### Responses
|
||||
|
||||
| Status | Meaning | Description | Schema |
|
||||
|--------|---------------------------------------------------------|-------------|----------------------------------------------------------------------------|
|
||||
| 200 | [OK](https://tools.ietf.org/html/rfc7231#section-6.3.1) | OK | [agentsdk.ReinitializationEvent](schemas.md#agentsdkreinitializationevent) |
|
||||
| Status | Meaning | Description | Schema |
|
||||
|--------|---------------------------------------------------------------|-------------|----------------------------------------------------------------------------|
|
||||
| 200 | [OK](https://tools.ietf.org/html/rfc7231#section-6.3.1) | OK | [agentsdk.ReinitializationEvent](schemas.md#agentsdkreinitializationevent) |
|
||||
| 409 | [Conflict](https://tools.ietf.org/html/rfc7231#section-6.5.8) | Conflict | [codersdk.Response](schemas.md#codersdkresponse) |
|
||||
|
||||
To perform this operation, you must be authenticated. [Learn more](authentication.md).
|
||||
|
||||
|
||||
Generated
+7
-5
@@ -186,17 +186,19 @@
|
||||
|
||||
```json
|
||||
{
|
||||
"owner_id": "8826ee2e-7933-4665-aef2-2393f84a0d05",
|
||||
"reason": "prebuild_claimed",
|
||||
"workspaceID": "string"
|
||||
"workspace_id": "0967198e-ec7b-4c6b-b4d3-f71244cadbe9"
|
||||
}
|
||||
```
|
||||
|
||||
### Properties
|
||||
|
||||
| Name | Type | Required | Restrictions | Description |
|
||||
|---------------|--------------------------------------------------------------------|----------|--------------|-------------|
|
||||
| `reason` | [agentsdk.ReinitializationReason](#agentsdkreinitializationreason) | false | | |
|
||||
| `workspaceID` | string | false | | |
|
||||
| Name | Type | Required | Restrictions | Description |
|
||||
|----------------|--------------------------------------------------------------------|----------|--------------|-------------|
|
||||
| `owner_id` | string | false | | |
|
||||
| `reason` | [agentsdk.ReinitializationReason](#agentsdkreinitializationreason) | false | | |
|
||||
| `workspace_id` | string | false | | |
|
||||
|
||||
## agentsdk.ReinitializationReason
|
||||
|
||||
|
||||
Reference in New Issue
Block a user