Migrates Azure instance identity verification from
`go.mozilla.org/pkcs7` and `github.com/fullsailor/pkcs7` to
`github.com/smallstep/pkcs7`, using `VerifyWithChainAtTime` to validate
both the PKCS7 signature and the certificate chain in one call. The
previous code only verified the signer certificate against a set of
intermediates/roots but did not verify that the PKCS7 signature itself
covered the content, meaning tampered payloads could be accepted.
The `Options` struct is restructured to accept `Roots`, `Intermediates`,
and `CurrentTime` as explicit fields instead of embedding
`x509.VerifyOptions`. The test helper `NewAzureInstanceIdentity` now
builds a realistic 3-level certificate chain (Root CA -> Intermediate CA
-> Signing Cert) matching real Azure trust hierarchy. New tests
(`TestValidate_TamperedContent`,
`TestValidate_UntrustedCertWithValidSignature`) confirm tampered and
untrusted envelopes are rejected.
Addresses GHSA-6x44-w3xg-hqqf.
> [!NOTE]
> This PR was authored by Coder Agents.
<details>
<summary>Implementation Plan</summary>
### Files Changed
| File | Summary |
|------|---------|
| `coderd/azureidentity/azureidentity.go` | Replace `signer.Verify()`
with `VerifyWithChainAtTime`; restructure `Options` struct; add
`ParseCertificates()` helper |
| `coderd/azureidentity/azureidentity_test.go` | Add `testCertChain`
builder, tampered-content and untrusted-cert tests; update existing
tests for new `Options` API |
| `coderd/coderd.go` | Change `AzureCertificates` field from
`x509.VerifyOptions` to `azureidentity.Options` |
| `coderd/workspaceresourceauth.go` | Pass `api.AzureCertificates`
directly instead of wrapping |
| `coderd/coderdtest/coderdtest.go` | Migrate to `smallstep/pkcs7`;
build 3-level cert chain in test helper |
| `go.mod` / `go.sum` | Add `github.com/smallstep/pkcs7`; remove
`fullsailor/pkcs7` and `go.mozilla.org/pkcs7` |
</details>
Security improvements:
- Restrict cert fetches to a host+port allowlist (Microsoft and DigiCert
on 80/443).
- Route requests through a dedicated `http.Client` that resolves the
host once and dials the validated IP directly, preventing DNS rebinding.
- Reject loopback, private (RFC 1918 / IPv6 ULA), link-local, multicast,
unspecified, CGNAT, benchmarking, and IPv4-mapped IPv6 addresses.
- Cap the certificate response body at 1 MiB.
- Log the underlying error via slog and return a generic detail to the
caller to prevent information disclosure.
* fix(coderd): Harden Azure identity certificate fetch
- Restrict cert fetches to a host+port allowlist (Microsoft and
DigiCert on 80/443).
- Route requests through a dedicated `http.Client` that resolves
the host once and dials the validated IP directly.
- Reject loopback, private (RFC 1918 / IPv6 ULA), link-local,
multicast, unspecified, CGNAT, benchmarking, and IPv4-mapped
IPv6 addresses.
- Cap the certificate response body at 1 MiB.
- Log the underlying error via slog and return a generic detail
to the caller.
- Add unit tests for the URL allowlist, IP classification, and
dialer.
* fix(coderd/azureidentity): add IPv6 special-use ranges to SSRF blocklist
The extraBlockedNetworks list only contained IPv4 CIDRs. Add IPv6
equivalents that Go's stdlib classification methods do not cover:
- 64:ff9b:1::/48 RFC 8215 NAT64 translation
- 100::/64 RFC 6666 discard-only
- 2001:2::/48 RFC 5180 benchmarking
- 2001:db8::/32 RFC 3849 documentation
IPv6 ranges already handled by stdlib (unchanged):
- ::1/128 (IsLoopback)
- fc00::/7 (IsPrivate, ULA)
- fe80::/10 (IsLinkLocalUnicast)
- ff00::/8 (IsMulticast)
- ::/128 (IsUnspecified)
Replaces the per-agent Go-side template-version filter in
`handleAuthInstanceID` with a purpose-built SQL query.
`GetWorkspaceBuildAgentsByInstanceID` joins `workspace_agents ->
workspace_resources -> workspace_builds -> provisioner_jobs ->
workspaces` and excludes:
- non-`workspace_build` provisioner jobs (template-version-import,
dry-run)
- deleted agents and sub-agents
- deleted workspaces
The handler:
- drops the per-candidate `GetWorkspaceResourceByID` /
`GetProvisionerJobByID` lookups
- drops the `provisioner_jobs.input` JSON parsing and the follow-up
`GetWorkspaceBuildByID` call
- compares `latestHistory.ID` against `selected.WorkspaceBuildID`
returned directly from the query
- preserves the existing recycled-instance safety check and matching
response codes
One intentional behavior tightening: agents whose workspace is deleted
now return 404 (previously they could reach the recycled-instance check
and return 400, or 200 if the stale build was still latest). This
matches the existing token-auth path, which already refuses to
authenticate against deleted workspaces.
The original `GetWorkspaceAgentsByInstanceID` query is intentionally
untouched. It remains the generic raw lookup used elsewhere in tests and
helpers.
The dbauthz wrapper for the new query uses the system-read fast path
with `fetchWithPostFilter` for non-system reads, with `RBACObject()`
delegating to the embedded `WorkspaceTable`.
Tests:
- new `TestGetWorkspaceBuildAgentsByInstanceID` covering newest-first
ordering, exclusion of deleted/sub agents, exclusion of template-import
and dry-run jobs, and exclusion of deleted workspaces
- new dbauthz mock test for `GetWorkspaceBuildAgentsByInstanceID`
- new `TestPostWorkspaceAuthAWSInstanceIdentity/RecycledInstanceID`
exercising the recycled-instance rejection branch (HTTP 400 when the
agent's build is no longer latest)
- existing `TestPostWorkspaceAuth{AWS,Azure,Google}InstanceIdentity`
continue to cover the handler end to end (including the template-version
+ workspace-build same-instance-ID scenario via
`setupInstanceIDWorkspace`)
> Mux is acting on Mike's behalf.
## Summary
Restores `v2.33.0-rc.2`-equivalent query cost for agent
instance-identity auth on `v2.33.0-rc.3`, which currently saturates the
pgx pool when multiple agents share an instance ID. Customer report
against rc.3 traced 233× `Internal error fetching provisioner job
resource. fetch related workspace build: context canceled` 500s during a
50-minute incident window to this path.
Backport to `release/2.33` will follow as a separate PR after this
merges.
## Root cause
[#24325](https://github.com/coder/coder/pull/24325) ("support multiple
agents with shared instance-identity auth") rewrote
`coderd/workspaceresourceauth.go::handleAuthInstanceID` to use the new
`:many` agent lookup followed by a per-candidate filter loop. Each
iteration synchronously calls `GetWorkspaceResourceByID` and
`GetProvisionerJobByID`. Both go through `dbauthz`, and both fan out
into the same `provisioner_job → workspace_build → workspace` cascade
because `authorizeProvisionerJob` always re-authorizes the workspace via
`GetWorkspaceBuildByJobID → GetWorkspaceByID`. The handler then
re-fetches resource and job again for the surviving agent.
Net effect on the agent-auth happy path:
| | SQL | RBAC |
|---|---|---|
| rc.2 baseline | 13 | 5 |
| rc.3 today, 1 agent | 19 | 7 |
| rc.3 today, 2 agents | 26 | 9 |
| **After this PR, 1 agent** | **6** | **3** |
| **After this PR, 2 agents** | **7** | **3** |
Under load, the rc.3 chain blocks on pool acquire and the request blows
past the 30s HTTP write timeout.
## Changes
### 1. System fast-path on `authorizeProvisionerJob`
(`coderd/database/dbauthz/dbauthz.go`)
Add an `AsSystemRestricted` early-return at the top of
`authorizeProvisionerJob`. Instance-identity auth has already proven
cloud identity before reaching the DB layer, so re-authorizing the
workspace on every provisioner-job lookup is pure overhead. Existing
`GetWorkspaceAgentsByInstanceID` already uses the same fast-path
pattern.
```go
if err := q.authorizeContext(ctx, policy.ActionRead, rbac.ResourceSystem); err == nil {
return nil
}
```
### 2. Drop survivor re-fetch in `handleAuthInstanceID`
(`coderd/workspaceresourceauth.go`)
Capture the provisioner job alongside each candidate during the filter
loop so the survivor lookup does not re-fetch resource and job after
selection. The previous code fired the resource→job→build→workspace
cascade twice for the surviving agent.
## Tests
Adds `TestAuthorizeProvisionerJob_SystemFastPath` in
`coderd/database/dbauthz/dbauthz_test.go` with two sub-tests:
- `AsSystemRestricted/SkipsCascade` — strict mock fails the test if
`GetWorkspaceBuildByJobID` or `GetWorkspaceByID` is called.
- `NonSystemActor/StillCascades` — auditor (no `ResourceSystem`) still
pays the cascade and produces a `NotAuthorized` error, proving the
fast-path is gated correctly.
Updates 12 existing dbauthz suite cases to expect the new
`ResourceSystem.Read` check ahead of the workspace/template-version
check, with `FailSystemObjectChecks()` to force the slow path.
Existing integration coverage in
`TestPostWorkspaceAuthAWSInstanceIdentity/Ambiguous/{SingleAgent,
MultipleAgentsWithSelector, MultipleAgentsNoSelector, SubAgentExcluded,
...}` exercises Part 2 end-to-end and continues to pass.
## Footprint
- 3 files changed, +166/-48
- No SQL changes
- No `make gen`
- No migrations
- No audit-table updates
## Validation
- [x] `go test ./coderd/database/dbauthz/` — full suite, ~6s
- [x] `go test -run TestPostWorkspaceAuth ./coderd/` — instance-identity
handler tests
- [x] `go test -run TestProvisionerJob ./coderd/`
- [x] `go test -run TestWorkspaceAgent ./coderd/`
- [x] `go test ./coderd/provisionerdserver/`
- [x] `gofmt -l` clean
## Alternatives considered
- **SQL-side filter:** rewrite `GetWorkspaceAgentsByInstanceID` to join
`workspace_resources`/`provisioner_jobs` and filter `job.type =
'workspace_build'` server-side, eliminating the filter loop entirely.
Cleaner long-term, but changes generated SQL and is too much surface for
a release-branch hotfix. Worth doing as a follow-up.
- **Full revert of #24325:** removes the multi-agent feature outright;
conflicts with downstream commits
([#24441](https://github.com/coder/coder/pull/24441),
[#24438](https://github.com/coder/coder/pull/24438),
[#24313](https://github.com/coder/coder/pull/24313)). Reserved as
fallback if the surgical fix doesn't hold under load testing.
> This PR was authored by Mux on behalf of Mike.
## Summary
Adds support for multiple peer root workspace agents sharing the same
`auth_instance_id`, so AWS, Azure, and GCP instance-identity auth can
issue the correct session token for a selected agent instead of assuming
a
single root agent per instance.
## Problem
When a Terraform template attaches two or more `coder_agent` resources
(with `auth = "aws-instance-identity"`) to a single compute instance,
every agent shares the same cloud instance ID. The existing singular
lookup picks whichever agent was created most recently, silently
ignoring
the others.
## Solution
Introduce an optional pre-auth agent selector (`CODER_AGENT_NAME`) and
make the server-side lookup ambiguity-aware.
**Database layer:**
- `GetWorkspaceAgentsByInstanceID` (`:many`): returns all matching root
agents for an instance ID.
- `GetWorkspaceAgentByInstanceIDAndName` (`:one`): returns the named
root
agent for disambiguation.
**SDK and CLI:**
- `agent_name` field added to AWS, Azure, and GCP request structs
(`omitempty` for backward compatibility).
- `CODER_AGENT_NAME` env var and `--agent-name` flag wired into the
agent
bootstrap before instance-identity auth runs.
**Server handler (`handleAuthInstanceID`):**
- When `agent_name` is present: direct lookup by (instance ID, name).
- When absent: legacy lookup, then resource-scoped ambiguity check.
Returns 409 with available agent names if multiple root agents match.
- Whitespace-only names are trimmed and treated as unspecified.
- Sub-agents remain excluded (`parent_id IS NULL` filter).
**Verification template:**
- `examples/templates/aws-multi-agent/` provisions one EC2 instance with
two agents (`main` and `dev`), both using instance-identity auth with
`CODER_AGENT_NAME` set in the cloud-init user data.
## Backward compatibility
Existing single-agent deployments work unchanged. The `agent_name` field
is optional with `omitempty`, and the unnamed path preserves today's
behavior when only one root agent matches.
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:
```
import (
"context"
"time"
"github.com/prometheus/client_golang/prometheus"
"golang.org/x/xerrors"
"gopkg.in/natefinch/lumberjack.v2"
"cdr.dev/slog/v3"
"github.com/coder/coder/v2/codersdk/agentsdk"
"github.com/coder/serpent"
)
```
3 groups: standard library, 3rd partly libs, Coder libs.
This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
* chore: add /v2 to import module path
go mod requires semantic versioning with versions greater than 1.x
This was a mechanical update by running:
```
go install github.com/marwan-at-work/mod/cmd/mod@latest
mod upgrade
```
Migrate generated files to import /v2
* Fix gen
* chore: Rbac errors should be returned, and not hidden behind 404
SqlErrNoRows was hiding actual errors
* Replace sql.ErrNoRow checks
* Remove sql err no rows check from dbauthz test
* Fix to use dbauthz system user
- rbac: export rbac.Permissions
- dbauthz: move GetDeploymentDAUs, GetTemplateDAUs,
GetTemplateAverageBuildTime from querier.go to system.go
and removes auth checks
- dbauthz: remove AsSystem(), add individual roles for
autostart, provisionerd, add restricted system role for
everything else
feat: Add initial AuthzQuerier implementation
- Adds package database/dbauthz that adds a database.Store implementation where each method goes through AuthZ checks
- Implements all database.Store methods on AuthzQuerier
- Updates and fixes unit tests where required
- Updates coderd initialization to use AuthzQuerier if codersdk.ExperimentAuthzQuerier is enabled
* chore: rename `AgentConn` to `WorkspaceAgentConn`
The codersdk was becoming bloated with consts for the workspace
agent that made no sense to a reader. `Tailnet*` is an example
of these consts.
* chore: remove `Get` prefix from *Client functions
* chore: remove `BypassRatelimits` option in `codersdk.Client`
It feels wrong to have this as a direct option because it's so infrequently
needed by API callers. It's better to directly modify headers in the two
places that we actually use it.
* Merge `appearance.go` and `buildinfo.go` into `deployment.go`
* Merge `experiments.go` and `features.go` into `deployment.go`
* Fix `make gen` referencing old type names
* Merge `error.go` into `client.go`
`codersdk.Response` lived in `error.go`, which is wrong.
* chore: refactor workspace agent functions into agentsdk
It was odd conflating the codersdk that clients should use
with functions that only the agent should use. This separates
them into two SDKs that are closely coupled, but separate.
* Merge `insights.go` into `deployment.go`
* Merge `organizationmember.go` into `organizations.go`
* Merge `quota.go` into `workspaces.go`
* Rename `sse.go` to `serversentevents.go`
* Rename `codersdk.WorkspaceAppHostResponse` to `codersdk.AppHostResponse`
* Format `.vscode/settings.json`
* Fix outdated naming in `api.ts`
* Fix app host response
* Fix unsupported type
* Fix imported type
* chore: Separate the provisionerd server into it's own package
This code should be thoroughly tested now that we understand the abstraction.
I separated it to make our lives a bit easier for external provisioner daemons
as well!
* Add tests
* Add workspace builds
* Add test for workspace resources
Abstracting coderd into an interface added misdirection because
the interface was never intended to be fulfilled outside of a single
implementation.
This lifts the abstraction, and attaches all handlers to a root struct
named `*coderd.API`.
* feat: Add AWS instance identity authentication
This allows zero-trust authentication for all AWS instances.
Prior to this, AWS instances could be used by passing `CODER_TOKEN`
as an environment variable to the startup script. AWS explicitly
states that secrets should not be passed in startup scripts because
it's user-readable.
* Fix sha256 verbosity
* Fix HTTP client being exposed on auth
* chore: Move httpmw to /coderd directory
httpmw is specific to coderd and should be scoped under coderd
* chore: Move httpapi to /coderd directory
httpapi is specific to coderd and should be scoped under coderd
* chore: Move database to /coderd directory
database is specific to coderd and should be scoped under coderd
* chore: Update codecov & gitattributes for generated files
* chore: Update Makefile