mirror of
https://github.com/coder/coder.git
synced 2026-06-03 04:58:23 +00:00
e48d12160f
## Summary Restores `v2.33.0-rc.2`-equivalent query cost for agent instance-identity auth on `v2.33.0-rc.3`, which currently saturates the pgx pool when multiple agents share an instance ID. Customer report against rc.3 traced 233× `Internal error fetching provisioner job resource. fetch related workspace build: context canceled` 500s during a 50-minute incident window to this path. Backport to `release/2.33` will follow as a separate PR after this merges. ## Root cause [#24325](https://github.com/coder/coder/pull/24325) ("support multiple agents with shared instance-identity auth") rewrote `coderd/workspaceresourceauth.go::handleAuthInstanceID` to use the new `:many` agent lookup followed by a per-candidate filter loop. Each iteration synchronously calls `GetWorkspaceResourceByID` and `GetProvisionerJobByID`. Both go through `dbauthz`, and both fan out into the same `provisioner_job → workspace_build → workspace` cascade because `authorizeProvisionerJob` always re-authorizes the workspace via `GetWorkspaceBuildByJobID → GetWorkspaceByID`. The handler then re-fetches resource and job again for the surviving agent. Net effect on the agent-auth happy path: | | SQL | RBAC | |---|---|---| | rc.2 baseline | 13 | 5 | | rc.3 today, 1 agent | 19 | 7 | | rc.3 today, 2 agents | 26 | 9 | | **After this PR, 1 agent** | **6** | **3** | | **After this PR, 2 agents** | **7** | **3** | Under load, the rc.3 chain blocks on pool acquire and the request blows past the 30s HTTP write timeout. ## Changes ### 1. System fast-path on `authorizeProvisionerJob` (`coderd/database/dbauthz/dbauthz.go`) Add an `AsSystemRestricted` early-return at the top of `authorizeProvisionerJob`. Instance-identity auth has already proven cloud identity before reaching the DB layer, so re-authorizing the workspace on every provisioner-job lookup is pure overhead. Existing `GetWorkspaceAgentsByInstanceID` already uses the same fast-path pattern. ```go if err := q.authorizeContext(ctx, policy.ActionRead, rbac.ResourceSystem); err == nil { return nil } ``` ### 2. Drop survivor re-fetch in `handleAuthInstanceID` (`coderd/workspaceresourceauth.go`) Capture the provisioner job alongside each candidate during the filter loop so the survivor lookup does not re-fetch resource and job after selection. The previous code fired the resource→job→build→workspace cascade twice for the surviving agent. ## Tests Adds `TestAuthorizeProvisionerJob_SystemFastPath` in `coderd/database/dbauthz/dbauthz_test.go` with two sub-tests: - `AsSystemRestricted/SkipsCascade` — strict mock fails the test if `GetWorkspaceBuildByJobID` or `GetWorkspaceByID` is called. - `NonSystemActor/StillCascades` — auditor (no `ResourceSystem`) still pays the cascade and produces a `NotAuthorized` error, proving the fast-path is gated correctly. Updates 12 existing dbauthz suite cases to expect the new `ResourceSystem.Read` check ahead of the workspace/template-version check, with `FailSystemObjectChecks()` to force the slow path. Existing integration coverage in `TestPostWorkspaceAuthAWSInstanceIdentity/Ambiguous/{SingleAgent, MultipleAgentsWithSelector, MultipleAgentsNoSelector, SubAgentExcluded, ...}` exercises Part 2 end-to-end and continues to pass. ## Footprint - 3 files changed, +166/-48 - No SQL changes - No `make gen` - No migrations - No audit-table updates ## Validation - [x] `go test ./coderd/database/dbauthz/` — full suite, ~6s - [x] `go test -run TestPostWorkspaceAuth ./coderd/` — instance-identity handler tests - [x] `go test -run TestProvisionerJob ./coderd/` - [x] `go test -run TestWorkspaceAgent ./coderd/` - [x] `go test ./coderd/provisionerdserver/` - [x] `gofmt -l` clean ## Alternatives considered - **SQL-side filter:** rewrite `GetWorkspaceAgentsByInstanceID` to join `workspace_resources`/`provisioner_jobs` and filter `job.type = 'workspace_build'` server-side, eliminating the filter loop entirely. Cleaner long-term, but changes generated SQL and is too much surface for a release-branch hotfix. Worth doing as a follow-up. - **Full revert of #24325:** removes the multi-agent feature outright; conflicts with downstream commits ([#24441](https://github.com/coder/coder/pull/24441), [#24438](https://github.com/coder/coder/pull/24438), [#24313](https://github.com/coder/coder/pull/24313)). Reserved as fallback if the surgical fix doesn't hold under load testing.