Files
coder/coderd/workspaceresourceauth.go
T
Dean Sheather e48d12160f fix(coderd): cut DB fan-out on agent instance-identity auth (#24973)
## Summary

Restores `v2.33.0-rc.2`-equivalent query cost for agent
instance-identity auth on `v2.33.0-rc.3`, which currently saturates the
pgx pool when multiple agents share an instance ID. Customer report
against rc.3 traced 233× `Internal error fetching provisioner job
resource. fetch related workspace build: context canceled` 500s during a
50-minute incident window to this path.

Backport to `release/2.33` will follow as a separate PR after this
merges.

## Root cause

[#24325](https://github.com/coder/coder/pull/24325) ("support multiple
agents with shared instance-identity auth") rewrote
`coderd/workspaceresourceauth.go::handleAuthInstanceID` to use the new
`:many` agent lookup followed by a per-candidate filter loop. Each
iteration synchronously calls `GetWorkspaceResourceByID` and
`GetProvisionerJobByID`. Both go through `dbauthz`, and both fan out
into the same `provisioner_job → workspace_build → workspace` cascade
because `authorizeProvisionerJob` always re-authorizes the workspace via
`GetWorkspaceBuildByJobID → GetWorkspaceByID`. The handler then
re-fetches resource and job again for the surviving agent.

Net effect on the agent-auth happy path:

| | SQL | RBAC |
|---|---|---|
| rc.2 baseline | 13 | 5 |
| rc.3 today, 1 agent | 19 | 7 |
| rc.3 today, 2 agents | 26 | 9 |
| **After this PR, 1 agent** | **6** | **3** |
| **After this PR, 2 agents** | **7** | **3** |

Under load, the rc.3 chain blocks on pool acquire and the request blows
past the 30s HTTP write timeout.

## Changes

### 1. System fast-path on `authorizeProvisionerJob`
(`coderd/database/dbauthz/dbauthz.go`)

Add an `AsSystemRestricted` early-return at the top of
`authorizeProvisionerJob`. Instance-identity auth has already proven
cloud identity before reaching the DB layer, so re-authorizing the
workspace on every provisioner-job lookup is pure overhead. Existing
`GetWorkspaceAgentsByInstanceID` already uses the same fast-path
pattern.

```go
if err := q.authorizeContext(ctx, policy.ActionRead, rbac.ResourceSystem); err == nil {
    return nil
}
```

### 2. Drop survivor re-fetch in `handleAuthInstanceID`
(`coderd/workspaceresourceauth.go`)

Capture the provisioner job alongside each candidate during the filter
loop so the survivor lookup does not re-fetch resource and job after
selection. The previous code fired the resource→job→build→workspace
cascade twice for the surviving agent.

## Tests

Adds `TestAuthorizeProvisionerJob_SystemFastPath` in
`coderd/database/dbauthz/dbauthz_test.go` with two sub-tests:
- `AsSystemRestricted/SkipsCascade` — strict mock fails the test if
`GetWorkspaceBuildByJobID` or `GetWorkspaceByID` is called.
- `NonSystemActor/StillCascades` — auditor (no `ResourceSystem`) still
pays the cascade and produces a `NotAuthorized` error, proving the
fast-path is gated correctly.

Updates 12 existing dbauthz suite cases to expect the new
`ResourceSystem.Read` check ahead of the workspace/template-version
check, with `FailSystemObjectChecks()` to force the slow path.

Existing integration coverage in
`TestPostWorkspaceAuthAWSInstanceIdentity/Ambiguous/{SingleAgent,
MultipleAgentsWithSelector, MultipleAgentsNoSelector, SubAgentExcluded,
...}` exercises Part 2 end-to-end and continues to pass.

## Footprint

- 3 files changed, +166/-48
- No SQL changes
- No `make gen`
- No migrations
- No audit-table updates

## Validation

- [x] `go test ./coderd/database/dbauthz/` — full suite, ~6s
- [x] `go test -run TestPostWorkspaceAuth ./coderd/` — instance-identity
handler tests
- [x] `go test -run TestProvisionerJob ./coderd/`
- [x] `go test -run TestWorkspaceAgent ./coderd/`
- [x] `go test ./coderd/provisionerdserver/`
- [x] `gofmt -l` clean

## Alternatives considered

- **SQL-side filter:** rewrite `GetWorkspaceAgentsByInstanceID` to join
`workspace_resources`/`provisioner_jobs` and filter `job.type =
'workspace_build'` server-side, eliminating the filter loop entirely.
Cleaner long-term, but changes generated SQL and is too much surface for
a release-branch hotfix. Worth doing as a follow-up.
- **Full revert of #24325:** removes the multi-agent feature outright;
conflicts with downstream commits
([#24441](https://github.com/coder/coder/pull/24441),
[#24438](https://github.com/coder/coder/pull/24438),
[#24313](https://github.com/coder/coder/pull/24313)). Reserved as
fallback if the surgical fix doesn't hold under load testing.
2026-05-05 15:15:39 -04:00

271 lines
9.5 KiB
Go

package coderd
import (
"encoding/json"
"fmt"
"net/http"
"sort"
"strings"
"github.com/google/uuid"
"github.com/mitchellh/mapstructure"
"github.com/coder/coder/v2/coderd/awsidentity"
"github.com/coder/coder/v2/coderd/azureidentity"
"github.com/coder/coder/v2/coderd/database"
"github.com/coder/coder/v2/coderd/database/dbauthz"
"github.com/coder/coder/v2/coderd/httpapi"
"github.com/coder/coder/v2/coderd/provisionerdserver"
"github.com/coder/coder/v2/codersdk"
"github.com/coder/coder/v2/codersdk/agentsdk"
)
// Azure supports instance identity verification:
// https://docs.microsoft.com/en-us/azure/virtual-machines/windows/instance-metadata-service?tabs=linux#tabgroup_14
//
// @Summary Authenticate agent on Azure instance
// @ID authenticate-agent-on-azure-instance
// @Security CoderSessionToken
// @Accept json
// @Produce json
// @Tags Agents
// @Param request body agentsdk.AzureInstanceIdentityToken true "Instance identity token. The optional agent_name field disambiguates when multiple agents share the same instance ID."
// @Success 200 {object} agentsdk.AuthenticateResponse
// @Router /api/v2/workspaceagents/azure-instance-identity [post]
func (api *API) postWorkspaceAuthAzureInstanceIdentity(rw http.ResponseWriter, r *http.Request) {
ctx := r.Context()
var req agentsdk.AzureInstanceIdentityToken
if !httpapi.Read(ctx, rw, r, &req) {
return
}
instanceID, err := azureidentity.Validate(r.Context(), req.Signature, azureidentity.Options{
VerifyOptions: api.AzureCertificates,
})
if err != nil {
httpapi.Write(ctx, rw, http.StatusUnauthorized, codersdk.Response{
Message: "Invalid Azure identity.",
Detail: err.Error(),
})
return
}
api.handleAuthInstanceID(rw, r, instanceID, req.AgentName)
}
// AWS supports instance identity verification:
// https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-identity-documents.html
// Using this, we can exchange a signed instance payload for an agent token.
//
// @Summary Authenticate agent on AWS instance
// @ID authenticate-agent-on-aws-instance
// @Security CoderSessionToken
// @Accept json
// @Produce json
// @Tags Agents
// @Param request body agentsdk.AWSInstanceIdentityToken true "Instance identity token. The optional agent_name field disambiguates when multiple agents share the same instance ID."
// @Success 200 {object} agentsdk.AuthenticateResponse
// @Router /api/v2/workspaceagents/aws-instance-identity [post]
func (api *API) postWorkspaceAuthAWSInstanceIdentity(rw http.ResponseWriter, r *http.Request) {
ctx := r.Context()
var req agentsdk.AWSInstanceIdentityToken
if !httpapi.Read(ctx, rw, r, &req) {
return
}
identity, err := awsidentity.Validate(req.Signature, req.Document, api.AWSCertificates)
if err != nil {
httpapi.Write(ctx, rw, http.StatusUnauthorized, codersdk.Response{
Message: "Invalid AWS identity.",
Detail: err.Error(),
})
return
}
api.handleAuthInstanceID(rw, r, identity.InstanceID, req.AgentName)
}
// Google Compute Engine supports instance identity verification:
// https://cloud.google.com/compute/docs/instances/verifying-instance-identity
// Using this, we can exchange a signed instance payload for an agent token.
//
// @Summary Authenticate agent on Google Cloud instance
// @ID authenticate-agent-on-google-cloud-instance
// @Security CoderSessionToken
// @Accept json
// @Produce json
// @Tags Agents
// @Param request body agentsdk.GoogleInstanceIdentityToken true "Instance identity token. The optional agent_name field disambiguates when multiple agents share the same instance ID."
// @Success 200 {object} agentsdk.AuthenticateResponse
// @Router /api/v2/workspaceagents/google-instance-identity [post]
func (api *API) postWorkspaceAuthGoogleInstanceIdentity(rw http.ResponseWriter, r *http.Request) {
ctx := r.Context()
var req agentsdk.GoogleInstanceIdentityToken
if !httpapi.Read(ctx, rw, r, &req) {
return
}
// We leave the audience blank. It's not important we validate who made the token.
payload, err := api.GoogleTokenValidator.Validate(ctx, req.JSONWebToken, "")
if err != nil {
httpapi.Write(ctx, rw, http.StatusUnauthorized, codersdk.Response{
Message: "Invalid GCP identity.",
Detail: err.Error(),
})
return
}
claims := struct {
Google struct {
ComputeEngine struct {
InstanceID string `mapstructure:"instance_id"`
} `mapstructure:"compute_engine"`
} `mapstructure:"google"`
}{}
err = mapstructure.Decode(payload.Claims, &claims)
if err != nil {
httpapi.Write(ctx, rw, http.StatusBadRequest, codersdk.Response{
Message: "Error decoding JWT claims.",
Detail: err.Error(),
})
return
}
api.handleAuthInstanceID(rw, r, claims.Google.ComputeEngine.InstanceID, req.AgentName)
}
func (api *API) handleAuthInstanceID(rw http.ResponseWriter, r *http.Request, instanceID string, agentName string) {
ctx := r.Context()
// Instance identity auth happens before the agent has a session token, so
// these lookups must use a restricted system context.
//nolint:gocritic // Instance identity auth happens before agent auth.
systemCtx := dbauthz.AsSystemRestricted(ctx)
agentName = strings.TrimSpace(agentName)
agents, err := api.Database.GetWorkspaceAgentsByInstanceID(systemCtx, instanceID)
if err != nil {
httpapi.Write(ctx, rw, http.StatusInternalServerError, codersdk.Response{
Message: "Internal error fetching provisioner job agent.",
Detail: err.Error(),
})
return
}
// Template version agents can share an instance ID with workspace build
// agents. Keep only workspace build agents before resolving ambiguity so
// template version agents do not force CODER_AGENT_NAME.
//
// We attach the provisioner job to each candidate during the filter
// loop so the post-selection code below can read it directly from the
// chosen candidate instead of re-querying. The previous code re-fetched
// the resource and job for the surviving agent, firing the
// resource->job->build->workspace dbauthz cascade twice and saturating
// the pgx pool under load.
type instanceCandidate struct {
agent database.WorkspaceAgent
job database.ProvisionerJob
}
buildCandidates := make([]instanceCandidate, 0, len(agents))
for _, candidate := range agents {
resource, err := api.Database.GetWorkspaceResourceByID(systemCtx, candidate.ResourceID)
if err != nil {
httpapi.Write(ctx, rw, http.StatusInternalServerError, codersdk.Response{
Message: "Internal error fetching provisioner job resource.",
Detail: err.Error(),
})
return
}
job, err := api.Database.GetProvisionerJobByID(systemCtx, resource.JobID)
if err != nil {
httpapi.Write(ctx, rw, http.StatusInternalServerError, codersdk.Response{
Message: "Internal error fetching provisioner job.",
Detail: err.Error(),
})
return
}
if job.Type == database.ProvisionerJobTypeWorkspaceBuild {
buildCandidates = append(buildCandidates, instanceCandidate{
agent: candidate,
job: job,
})
}
}
if len(buildCandidates) == 0 {
httpapi.Write(ctx, rw, http.StatusNotFound, codersdk.Response{
Message: fmt.Sprintf("Instance with id %q not found.", instanceID),
})
return
}
var selected instanceCandidate
if agentName != "" {
for _, candidate := range buildCandidates {
if candidate.agent.Name == agentName {
selected = candidate
break
}
}
if selected.agent.ID == uuid.Nil {
httpapi.Write(ctx, rw, http.StatusNotFound, codersdk.Response{
Message: fmt.Sprintf("No agent found with instance ID %q and name %q.", instanceID, agentName),
})
return
}
} else {
if len(buildCandidates) != 1 {
// Include agent names in the error message to help operators
// configure CODER_AGENT_NAME. The caller has already proven
// cloud instance identity, so agent names are not sensitive
// here.
names := make([]string, len(buildCandidates))
for i, candidate := range buildCandidates {
names[i] = candidate.agent.Name
}
sort.Strings(names)
httpapi.Write(ctx, rw, http.StatusConflict, codersdk.Response{
Message: fmt.Sprintf(
"Multiple agents found with instance ID %q. Set CODER_AGENT_NAME to one of: %s",
instanceID,
strings.Join(names, ", "),
),
})
return
}
selected = buildCandidates[0]
}
agent := selected.agent
job := selected.job
var jobData provisionerdserver.WorkspaceProvisionJob
err = json.Unmarshal(job.Input, &jobData)
if err != nil {
httpapi.Write(ctx, rw, http.StatusInternalServerError, codersdk.Response{
Message: "Internal error extracting job data.",
Detail: err.Error(),
})
return
}
resourceHistory, err := api.Database.GetWorkspaceBuildByID(systemCtx, jobData.WorkspaceBuildID)
if err != nil {
httpapi.Write(ctx, rw, http.StatusInternalServerError, codersdk.Response{
Message: "Internal error fetching workspace build.",
Detail: err.Error(),
})
return
}
// This token should only be exchanged if the instance ID is valid
// for the latest history. If an instance ID is recycled by a cloud,
// we'd hate to leak access to a user's workspace.
latestHistory, err := api.Database.GetLatestWorkspaceBuildByWorkspaceID(systemCtx, resourceHistory.WorkspaceID)
if err != nil {
httpapi.Write(ctx, rw, http.StatusInternalServerError, codersdk.Response{
Message: "Internal error fetching the latest workspace build.",
Detail: err.Error(),
})
return
}
if latestHistory.ID != resourceHistory.ID {
httpapi.Write(ctx, rw, http.StatusBadRequest, codersdk.Response{
Message: fmt.Sprintf("Resource found for id %q, but isn't registered on the latest history.", instanceID),
})
return
}
httpapi.Write(ctx, rw, http.StatusOK, agentsdk.AuthenticateResponse{
SessionToken: agent.AuthToken.String(),
})
}