mirror of
https://github.com/coder/coder.git
synced 2026-06-03 04:58:23 +00:00
5b6b7719df
## Problem

When a prebuilt workspace is claimed, the agent reinitializes via a single fire-and-forget pubsub event over SSE. If the agent's SSE connection is interrupted at claim time, the event is permanently lost and the workspace is stuck with no self-healing path.

Additionally, regular (non-prebuild) workspaces had no way to opt out of the `/reinit` polling loop: agents would reconnect indefinitely to an endpoint that would never send them anything useful.

## Root Cause

`workspaceAgentReinit` fetched the workspace (with its current `owner_id`) via `GetWorkspaceByAgentID`, but never checked whether a claim had already happened. It only subscribed to pubsub for future events. The database already has durable claim state (`owner_id` changes from `PrebuildsSystemUserID` to the real user), but no layer ever consulted it on reconnection.

## Solution

### Server-side durable check with first-build-initiator gating

**TOCTOU-safe ordering**: Subscribe to pubsub claim events *before* any durable checks, so a claim that fires during the check is buffered in the channel rather than lost.

**First-build-initiator gating**: When `!workspace.IsPrebuild()` (owner is no longer the system user), look up the first build's `InitiatorID`. The prebuild reconciler always uses `PrebuildsSystemUserID` as the initiator. This distinguishes claimed prebuilds from regular workspaces without any SQL schema changes.

- **Regular workspace** (first build initiator ≠ system user) → **409 Conflict**, agent stops reconnecting
- **Claimed prebuild, build completed** → pre-seed channel with reinit event and close it, transmitter delivers one-shot then exits
- **Claimed prebuild, build in-progress** → fall through to pubsub subscription, agent waits for completion event
- **Unclaimed prebuild** → pubsub subscription (existing happy path)

### Declarative reinit events (defense-in-depth)

- Added `UserID` field to `ReinitializationEvent` with JSON tags
- Switched pubsub serialization from raw string to JSON (with backward-compat fallback for rolling upgrades)
- Populated `UserID` at both the publish site and the durable check

### Agent SDK: 409 handling

`WaitForReinitLoop` detects 409 Conflict from the server and closes the `reinitEvents` channel, cleanly exiting the retry goroutine.

### Agent CLI: fixed two bugs + added reinitCtx

- **Closed channel (`!ok`)**: now blocks on `<-ctx.Done()` instead of `continue`, keeping the current agent running. Previously this would leak agents by skipping `agnt.Close()` and re-entering the loop.
- **Duplicate owner reinit**: cancels `reinitCtx` (stops the reinit goroutine), then blocks on `<-ctx.Done()`. Previously `continue` would skip cleanup and create a new agent on the next loop iteration.
- **`reinitCtx`**: a cancellable child of `ctx` passed to `WaitForReinitLoop`, allowing the agent to stop the reinit HTTP polling after reinit completes.

### Agent-side idempotency

Tracks `lastOwnerID` in the agent reinit loop; duplicate events for the same owner are skipped.

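The idempotency check described above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: `reinitEvent` and `ownerTracker` are hypothetical stand-ins (the real event type is `agentsdk.ReinitializationEvent`), IDs are plain strings rather than UUIDs, and the reason value is only an example.

```go
package main

import "fmt"

// reinitEvent is a hypothetical, simplified stand-in for
// agentsdk.ReinitializationEvent; only the owner matters here.
type reinitEvent struct {
	UserID string
	Reason string
}

// ownerTracker remembers the last owner the agent reinitialized for.
// The name and shape are assumptions made for this sketch.
type ownerTracker struct {
	seen        bool
	lastOwnerID string
}

// ShouldReinit reports whether ev names an owner the agent has not
// already reinitialized for, and records that owner if so.
func (t *ownerTracker) ShouldReinit(ev reinitEvent) bool {
	if t.seen && ev.UserID == t.lastOwnerID {
		return false // duplicate event for the same owner: skip
	}
	t.seen = true
	t.lastOwnerID = ev.UserID
	return true
}

func main() {
	var t ownerTracker
	events := []reinitEvent{
		{UserID: "alice", Reason: "prebuild_claimed"},
		{UserID: "alice", Reason: "prebuild_claimed"}, // e.g. pubsub event after a one-shot replay
	}
	for _, ev := range events {
		fmt.Println(ev.UserID, t.ShouldReinit(ev))
	}
}
```

The `seen` flag keeps an event with an empty `UserID` (as an old publisher might produce) from being mistaken for a duplicate of the zero value.
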
## Testing

- **"unclaimed prebuild receives reinit via pubsub"**: prebuild owned by system user, pubsub event triggers reinit
- **"claimed prebuild receives one-shot reinit on reconnect"**: first build by system user, owner changed, build completed → immediate reinit (no pubsub needed)
- **"claimed prebuild waits during in-progress claim build"**: claimed but build still running → no reinit until build completes
- **"regular workspace gets 409"**: first build by real user → 409 Conflict, agent stops polling
- Updated claim publisher/listener tests: verify `UserID` survives JSON round-trip + backward compat with raw string payloads
- Updated SSE round-trip test: verify `UserID` survives transmit → receive cycle

Fixes #22359

## Rolling upgrade note

During a rolling deploy where old coderd instances coexist with new ones, the pubsub `ReinitializationEvent` has a new `UserID` field. Old publishers send a raw reason string instead of JSON; the new listener gracefully falls back by treating the entire payload as the reason and filling in `WorkspaceID` from context. The only visible effect during the upgrade window is that `UserID` may be the zero UUID in agent-side logs; this is cosmetic and resolves once all instances are updated.
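
As a rough illustration of the fallback described in the rolling upgrade note, the sketch below decodes a payload as JSON and, if that fails, treats the whole payload as the reason. `claimEvent`, `decodeClaim`, and the JSON keys are assumptions made for this example (the real type is `agentsdk.ReinitializationEvent`, and IDs are UUIDs rather than strings).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// claimEvent is a hypothetical, simplified event type; the JSON keys
// are assumed for illustration only.
type claimEvent struct {
	WorkspaceID string `json:"workspace_id"`
	UserID      string `json:"user_id"`
	Reason      string `json:"reason"`
}

// decodeClaim tries JSON first. If the payload is not a JSON object,
// it is assumed to come from an old publisher that sent the raw reason,
// so the workspace ID is filled in from the caller's context and the
// owner is left empty.
func decodeClaim(workspaceID string, payload []byte) claimEvent {
	var ev claimEvent
	if err := json.Unmarshal(payload, &ev); err != nil {
		return claimEvent{WorkspaceID: workspaceID, Reason: string(payload)}
	}
	return ev
}

func main() {
	newStyle := []byte(`{"workspace_id":"ws-1","user_id":"u-1","reason":"prebuild_claimed"}`)
	oldStyle := []byte("prebuild_claimed")

	fmt.Printf("%+v\n", decodeClaim("ws-1", newStyle))
	fmt.Printf("%+v\n", decodeClaim("ws-1", oldStyle))
}
```

In the old-style case the owner field stays at its zero value, which matches the cosmetic effect the note describes for the upgrade window.
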
96 lines
2.7 KiB
Go
package prebuilds

import (
	"context"
	"encoding/json"
	"sync"

	"github.com/google/uuid"
	"golang.org/x/xerrors"

	"cdr.dev/slog/v3"
	"github.com/coder/coder/v2/coderd/database/pubsub"
	"github.com/coder/coder/v2/codersdk/agentsdk"
)

func NewPubsubWorkspaceClaimPublisher(ps pubsub.Pubsub) *PubsubWorkspaceClaimPublisher {
	return &PubsubWorkspaceClaimPublisher{ps: ps}
}

type PubsubWorkspaceClaimPublisher struct {
	ps pubsub.Pubsub
}

func (p PubsubWorkspaceClaimPublisher) PublishWorkspaceClaim(claim agentsdk.ReinitializationEvent) error {
	channel := agentsdk.PrebuildClaimedChannel(claim.WorkspaceID)
	payload, err := json.Marshal(claim)
	if err != nil {
		return xerrors.Errorf("marshal claim event: %w", err)
	}
	if err := p.ps.Publish(channel, payload); err != nil {
		return xerrors.Errorf("failed to trigger prebuilt workspace agent reinitialization: %w", err)
	}
	return nil
}

func NewPubsubWorkspaceClaimListener(ps pubsub.Pubsub, logger slog.Logger) *PubsubWorkspaceClaimListener {
	return &PubsubWorkspaceClaimListener{ps: ps, logger: logger}
}

type PubsubWorkspaceClaimListener struct {
	logger slog.Logger
	ps     pubsub.Pubsub
}

// ListenForWorkspaceClaims subscribes to a pubsub channel and returns a
// receive-only channel that emits claim events for the given workspace.
// The returned channel is owned by this function and is never closed,
// because pubsub.Pubsub does not guarantee that all in-flight callbacks
// have returned after unsubscribe. Call the returned cancel function to
// unsubscribe when events are no longer needed; cancel is also called
// automatically if ctx expires or is canceled.
func (p PubsubWorkspaceClaimListener) ListenForWorkspaceClaims(ctx context.Context, workspaceID uuid.UUID) (<-chan agentsdk.ReinitializationEvent, func(), error) {
	select {
	case <-ctx.Done():
		return nil, func() {}, ctx.Err()
	default:
	}

	reinitEvents := make(chan agentsdk.ReinitializationEvent, 1)

	cancelSub, err := p.ps.Subscribe(agentsdk.PrebuildClaimedChannel(workspaceID), func(inner context.Context, payload []byte) {
		var event agentsdk.ReinitializationEvent
		if err := json.Unmarshal(payload, &event); err != nil {
			// Rolling upgrade: old publishers send the raw reason
			// string instead of JSON.
			event = agentsdk.ReinitializationEvent{
				WorkspaceID: workspaceID,
				Reason:      agentsdk.ReinitializationReason(payload),
			}
		}

		select {
		case <-ctx.Done():
		case <-inner.Done():
		case reinitEvents <- event:
		}
	})
	if err != nil {
		return nil, func() {}, xerrors.Errorf("failed to subscribe to prebuild claimed channel: %w", err)
	}

	var once sync.Once
	cancel := func() {
		once.Do(func() {
			cancelSub()
		})
	}

	go func() {
		<-ctx.Done()
		cancel()
	}()

	return reinitEvents, cancel, nil
}