<!--
If you have used AI to produce some or all of this PR, please ensure you have read our [AI Contribution guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING) before submitting.
-->
relates to GRU-18
Adds basic implementation for Workspace Agent Connection Watch and tests.
Missing are handling of logs.
Migrates Azure instance identity verification from
`go.mozilla.org/pkcs7` and `github.com/fullsailor/pkcs7` to
`github.com/smallstep/pkcs7`, using `VerifyWithChainAtTime` to validate
both the PKCS7 signature and the certificate chain in one call. The
previous code only verified the signer certificate against a set of
intermediates/roots but did not verify that the PKCS7 signature itself
covered the content, meaning tampered payloads could be accepted.
The `Options` struct is restructured to accept `Roots`, `Intermediates`,
and `CurrentTime` as explicit fields instead of embedding
`x509.VerifyOptions`. The test helper `NewAzureInstanceIdentity` now
builds a realistic 3-level certificate chain (Root CA -> Intermediate CA
-> Signing Cert) matching real Azure trust hierarchy. New tests
(`TestValidate_TamperedContent`,
`TestValidate_UntrustedCertWithValidSignature`) confirm tampered and
untrusted envelopes are rejected.
Addresses GHSA-6x44-w3xg-hqqf.
> [!NOTE]
> This PR was authored by Coder Agents.
<details>
<summary>Implementation Plan</summary>
### Files Changed
| File | Summary |
|------|---------|
| `coderd/azureidentity/azureidentity.go` | Replace `signer.Verify()`
with `VerifyWithChainAtTime`; restructure `Options` struct; add
`ParseCertificates()` helper |
| `coderd/azureidentity/azureidentity_test.go` | Add `testCertChain`
builder, tampered-content and untrusted-cert tests; update existing
tests for new `Options` API |
| `coderd/coderd.go` | Change `AzureCertificates` field from
`x509.VerifyOptions` to `azureidentity.Options` |
| `coderd/workspaceresourceauth.go` | Pass `api.AzureCertificates`
directly instead of wrapping |
| `coderd/coderdtest/coderdtest.go` | Migrate to `smallstep/pkcs7`;
build 3-level cert chain in test helper |
| `go.mod` / `go.sum` | Add `github.com/smallstep/pkcs7`; remove
`fullsailor/pkcs7` and `go.mozilla.org/pkcs7` |
</details>
Chat tests previously constructed a real `openai` provider with a fake
API key and no `BaseURL`, so background title generation hit
`api.openai.com` and timed out under `-race`. The same root cause
produced several distinct flakes: title regeneration races with
synchronous `UpdateChat`/`ProposeChatTitle`, and pagination races
against `updated_at` bumps from real-network processing.
This moves the fake OpenAI-compatible provider and the chat-settle wait
into first-class `coderdtest` capabilities.
`coderd.Options.ChatProviderAPIKeys` is the new seam tests use to
redirect chat traffic to a local `httptest.Server`.
`coderdtest.WaitForChatSettled` replaces per-test waiters and drains
tracked chat-daemon work after the chat row leaves `pending`/`running`.
The `newChatClient*` constructors funnel through one options builder
that installs the fake provider before the coderd test server so cleanup
ordering is deterministic.
Closes https://github.com/coder/internal/issues/1528 & Closes ENG-2659
Closes https://github.com/coder/internal/issues/1480 & Closes CODAGT-359
Closes https://github.com/coder/internal/issues/1507 & Closes CODAGT-368
Relates to https://github.com/coder/internal/issues/1397 & Relates to
CODAGT-374
The username suffix could put the name past the 32 character limit,
causing the test to flake. Instead of using a suffix, match on the
expected ID instead.
Adds `coder exp chat context add` and `coder exp chat context clear`
commands that run inside a workspace to manage chat context files via
the agent token.
`add` reads instruction and skill files from a directory (defaulting to
cwd) and inserts them as context-file messages into an active chat.
Multiple calls are additive — `instructionFromContextFiles` already
accumulates all context-file parts across messages.
`clear` soft-deletes all context-file messages, causing
`contextFileAgentID()` to return `!found` on the next turn, which
triggers `needsInstructionPersist=true` and re-fetches defaults from the
agent.
Both commands auto-detect the target chat via `CODER_CHAT_ID` (already
set by `agentproc` on chat-spawned processes), or fall back to
single-active-chat resolution for the agent. The `--chat` flag overrides
both.
Also adds sub-agent context inheritance: `createChildSubagentChat` now
copies parent context-file messages to child chats at spawn time, so
delegated sub-agents share the same instruction context without
independently re-fetching from the workspace agent.
<details><summary>Implementation details</summary>
**New files:**
- `cli/exp_chat.go` — CLI command tree under `coder exp chat context`
**Modified files:**
- `agent/agentcontextconfig/api.go` — `ConfigFromDir()` reads context
from an arbitrary directory without env vars
- `codersdk/agentsdk/agentsdk.go` — `AddChatContext`/`ClearChatContext`
SDK methods
- `coderd/workspaceagents.go` — POST/DELETE handlers on
`/workspaceagents/me/chat-context`
- `coderd/coderd.go` — Route registration
- `coderd/database/queries/chats.sql` — `GetActiveChatsByAgentID`,
`SoftDeleteContextFileMessages`
- `coderd/database/dbauthz/dbauthz.go` — RBAC implementations for new
queries
- `coderd/x/chatd/subagent.go` — `copyParentContextFiles` for sub-agent
inheritance
- `cli/root.go` — Register `chatCommand()` in `AGPLExperimental()`
**Auth pattern:** Uses `AgentAuth` (same as `coder external-auth`) —
agent token via `CODER_AGENT_TOKEN` + `CODER_AGENT_URL` env vars.
</details>
> 🤖 Generated by Coder Agents
---------
Co-authored-by: Michael Suchacz <203725896+ibetitsmike@users.noreply.github.com>
Piggybacks on #23878. Moves instruction file reading and skill discovery
from `chatd` (server-side, via multiple `LS`/`ReadFile` round-trips
through the agent connection) to the agent itself (local filesystem
access).
This intentionally drops backward compatibility with older agents that
don't support the context-config endpoint. Agents and server are
deployed together; there is no rolling-update contract to maintain here.
## What changed
The agent's `GET /api/v0/context-config` response now returns
`[]ChatMessagePart` directly — the same types chatd persists. This
eliminates intermediate type conversions and makes the protocol
extensible.
| Field | Type | Description |
|---|---|---|
| `parts` | `[]ChatMessagePart` | Context-file and skill parts, ready to
persist |
| `working_dir` | `string` | Agent's resolved working directory |
Removed from the response: `instructions_dirs`, `instructions_file`,
`skills_dirs`, `skill_meta_file`, `mcp_config_files` — the agent reads
files locally and returns their content as parts.
Removed from chatd: all legacy `LS`/`ReadFile` fallback code
(`readHomeInstructionFile`, `readInstructionDirFile`, `DiscoverSkills`
via LS, etc).
## Why
The previous architecture had the agent resolve paths, serve them over
HTTP, then `chatd` make N+1 round-trips back through the agent
connection to read files. The agent has direct filesystem access and
should just read the files.
## Key design decisions
- **Agent returns `ChatMessagePart` directly** — same types chatd
persists. No intermediate `InstructionFileEntry`/`SkillEntry` types
needed.
- **`SkillMeta.MetaFile`** — persisted via `ContextFileSkillMetaFile` on
the skill part, so custom meta file names
(`CODER_AGENT_EXP_SKILL_META_FILE`) survive across chat turns.
- **No pre-read body** — `read_skill` always dials the workspace to
fetch the skill body on demand. Simpler than caching the body in the
response.
- **MCP config paths kept agent-internal** — `MCPConfigFiles()` getter,
not sent over the wire.
- **No backward compat fallback** — old agents that don't support
context-config get no instruction files. This is acceptable since agent
and server deploy together.
Partially addresses #21813 (still need to make changes to the "add user"
button to be complete)
Since there are a lot of user tests already, I moved them into
`coderdtest` to be shared.
Introduce a three-way workspace sharing setting (none, everyone,
service_accounts) replacing the boolean workspace_sharing_disabled.
In service_accounts mode, only service account-owned workspaces can be
shared while regular members' share permissions are removed. Adds a
new organization-service-account system role with per-org permissions
reconciled alongside the existing organization-member system role.
Related to:
https://linear.app/codercom/issue/PLAT-28/feat-service-accounts-sharing-mode-and-rbac-role
---------
Co-authored-by: Steven Masley <Emyrk@users.noreply.github.com>
Co-authored-by: Kayla はな <mckayla@hey.com>
## Summary
Change the four main `coderdtest` Await helper functions to poll at
`IntervalFast` (25ms) instead of `IntervalMedium` (250ms):
- `AwaitTemplateVersionJobCompleted`
- `AwaitWorkspaceBuildJobCompleted`
- `WorkspaceAgentWaiter.WaitFor`
- `WorkspaceAgentWaiter.Wait`
These are called **~855 times** across the test suite. Each call
previously wasted ~125ms on average waiting for the next poll tick.
`AwaitTemplateVersionJobRunning` already used `IntervalFast` — this
makes all Await helpers consistent.
## Measured Impact
Local benchmarks (postgres, `-short -count=1 -p 8 -parallel 8
-tags=testsmallbatch`):
| Package | Before | After | Delta |
|---|---|---|---|
| enterprise/coderd | 90.8s | 76.0s | **-16.3%** |
| coderd | 65.6s | 56.5s | **-13.8%** |
| cli | 57.9s | 37.8s | **-34.7%** |
| enterprise (root) | 41.1s | 39.9s | -2.9% |
| **Sum of all packages** | **623s** | **543s** | **-12.8%** |
Zero test failures across all 199 packages.
This PR adds some metrics to help identify job enqueue rates and
latencies. This work was initiated as a way to help reduce the cost of
the observation/measurement itself for autostart scaletests, which
impacts our ability to identify/reason about the load caused by
autostart. See: https://github.com/coder/internal/issues/1209
I've extended the metrics here to account for regular user initiated
builds, prebuilds, autostarts, etc. IMO there is still the question here
of whether we want to include or need the `transition` label, which is
only present on workspace builds. Including it does lead to an increase
in cardinality, and in the case of the histogram (when not using native
histograms) that's at least a few extra series for every bucket. We
could remove the transition label there but keep it on the counter.
Additionally, the histogram is currently observing latencies for other
jobs, such as template builds/version imports, those do not have a
transition type associated with them.
Tested briefly in a workspace, can see metric values like the following:
-
`coderd_workspace_builds_enqueued_total{build_reason="autostart",provisioner_type="terraform",status="success",transition="start"}
1`
-
`coderd_provisioner_job_queue_wait_seconds_bucket{build_reason="autostart",job_type="workspace_build",provisioner_type="terraform",transition="start",le="0.025"}
1`
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Only task workspaces have the checks in wsbuilder for violating the
managed agent caps in the license.
Stopped tasks that are resumed with a regular workspace start **still
count as usage**.
Relates to https://github.com/coder/internal/issues/272
This flake has been persisting for a while, and unfortunately there's no
detail on which healthcheck in particular is holding things up.
This PR adds a concurrency-safe `healthcheck.Progress` and wires it
through `healthcheck.Run`. If the healthcheck times out, it will provide
information on which healthchecks are completed / running, and how long
they took / are still taking.
🤖 Claude Opus 4.5 completed the first round of this implementation,
which I then refactored.
Replace the external moby/moby/pkg/namesgenerator dependency with an
internal implementation using gofakeit/v7. The moby package has ~25k
unique name combinations, and with its retry parameter only adds a
random digit 0-9, giving ~250k possibilities. In parallel tests, this
has led to collisions (flakes).
The new internal API at coderd/util/namesgenerator eliminates the
external dependnecy and offers functions with explicit uniqueness
guarantees. This PR also consolidates fragmented name generation in a
few places to use the new package.
| Old (moby/moby) | New |
|-------------------------------------|------------------------|
| namesgenerator.GetRandomName(0) | NameWith("_") |
| namesgenerator.GetRandomName(>0) | NameDigitWith("_") |
| testutil.GetRandomName(t) | UniqueName() |
| testutil.GetRandomNameHyphenated(t) | UniqueNameWith("-") |
namesgenerator package API:
- NameWith(delim): random name, not unique
- NameDigitWith(delim): random name with 1-9 suffix, not unique
- UniqueName(): guaranteed unique via atomic counter
- UniqueNameWith(delim): unique with custom delimiter
Names continue to be docker style `[adjective][delim][surname]`. Unique
names are truncated to 32 characters (preserving the numeric suffix) to
fit common name length limits in Coder.
Related test flakes:
https://github.com/coder/internal/issues/1212https://github.com/coder/internal/issues/118https://github.com/coder/internal/issues/1068
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:
```
import (
"context"
"time"
"github.com/prometheus/client_golang/prometheus"
"golang.org/x/xerrors"
"gopkg.in/natefinch/lumberjack.v2"
"cdr.dev/slog/v3"
"github.com/coder/coder/v2/codersdk/agentsdk"
"github.com/coder/serpent"
)
```
3 groups: standard library, 3rd partly libs, Coder libs.
This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).
It also updates dependencies that also use slog and were updated.
I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.
Other dependencies, I pushed new tags.
**Breaking Change:** Existing oauth apps might now use PKCE. If an
unknown IdP type was being used, and it does not support PKCE, it will
break.
To fix, set the PKCE methods on the external auth to `none`
```
export CODER_EXTERNAL_AUTH_1_PKCE_METHODS=none
```
Provisioner steps broken into smaller granular actions.
Changes:
- `ExtractArchive` moved to `init` request (was in `configure`)
- Writing `tfstate` moved to `plan` (was in `configure`)
- Moved most plan/apply outputs to `GraphComplete`
With `%w` it prints an address instead of the error, like `<op> <url>
0xc001329370` instead of `<op> <url>: some error`, honestly idk why you
even can log with `%w` it seems like it makes no sense to use `%w`
outside of `fmt.Errorf`.
This is to help debug https://github.com/coder/internal/issues/1010.
Adds the following debug routes for people with the `debug_info:read`
permission:
- `/api/v2/debug/pprof` for `net/http/pprof`
- `/`
- `/cmdline`
- `/profile`
- `/symbol`
- `/trace`
- `/*`
- `/api/v2/debug/metrics` for Prometheus metrics
Fixes https://github.com/coder/internal/issues/1067
- Adds `WorkspaceAgentWaiter.WithContext()`
- Updates usage of `WorkspaceAgentWaiter` in `aitasks_test.go` with
context bumped to `testutil.WaitMedium`
Authored by Claude with manual review and updates.
# Add API key allow_list for resource-scoped tokens
This PR adds support for API key allow lists, enabling tokens to be scoped to specific resources. The implementation:
1. Adds a new `allow_list` field to the `CreateTokenRequest` struct, allowing clients to specify resource-specific scopes when creating API tokens
2. Implements `APIAllowListTarget` type to represent resource targets in the format `<type>:<id>` with support for wildcards
3. Adds validation and normalization logic for allow lists to handle wildcards and deduplication
4. Integrates with RBAC by creating an `APIKeyEffectiveScope` that merges API key scopes with allow list restrictions
5. Updates API documentation and TypeScript types to reflect the new functionality
This feature enables creating tokens that are limited to specific resources (like workspaces or templates) by ID, making it possible to create more granular API tokens with limited access.
# Add support for low-level API key scopes
This PR adds support for fine-grained API key scopes based on RBAC resource:action pairs. It includes:
1. A new endpoint `/api/v2/auth/scopes` to list all public low-level API key scopes
2. Generated constants in the SDK for all public scopes
3. Tests to verify scope validation during token creation
4. Updated API documentation to reflect the expanded scope options
The implementation allows users to create API keys with specific permissions like `workspace:read` or `template:use` instead of only the legacy `all` or `application_connect` scopes.
Fixes#19847
relates to https://github.com/coder/internal/issues/912
Adds a new scaletest Runner to generate dynamic parameters load.
A later PR will add the CLI command, including creating the template & version.
Solves #15575
Adds OAuth access token revocation when unlinking external auth
provider. Due to revocation not being consistently implemented by
providers this is only best effort attempt. Unsuccessful revocation
won't influence link removal.
Closes https://github.com/coder/internal/issues/935
This PR enhances the AwaitWorkspaceBuildJobCompleted func in coderdtest
pkg to provide better visibility into test failures and debugging
information.
Refactors `codersdk.Client`'s use of session tokens to use a `SessionTokenProvider`, which abstracts the obtaining and storing of the session token.
The main motiviation is to unify Agent authentication an an upstack PR, which can use cloud instance identity via token exchange, rather than a fixed session token.
However, the abstraction could also allow functionality like obtaining the session token from other external sources like the OS credential manager, or an external secret/key management system like Vault.
The flake here had two causes:
1. related to usage of time.Now() in MustWaitForProvisionersAvailable
and
2. the fact that UpdateProvisionerLastSeenAt can not use a time that is
further in the past than the current LastSeenAt time
Previously the test here was calling
`coderdtest.MustWaitForProvisionersAvailable` which was using `time.Now`
rather than the next tick time like the real `hasProvisionersAvailable`
function does. Additionally, when using `UpdateProvisionerLastSeenAt`
the underlying db query enforces that the time we're trying to set
`LastSeenAt` to cannot be older than the current value.
I was able to reliably reproduce the flake by executing both the
`UpdateProvisionerLastSeenAt` call and `tickCh <- next` in their own
goroutines, the former with a small sleep to reliably ensure we'd
trigger the autobuild before we set the `LastSeenAt` time. That's when I
also noticed that `coderdtest.MustWaitForProvisionersAvailable` was
using `time.Now` instead of the tick time. When I updated that function
to take in a tick time + added a 2nd call to
`UpdateProvisionerLastSeenAt` to set an original non-stale time, we
could then never get the test to pass because the later call to set the
stale time would not actually modify `LastSeenAt`. On top of that,
calling the provisioner daemons closer in the middle of the function
doesn't really do anything of value in this test.
**The fix for the flake is to keep the go routines, ensuring there would
be a flake if there was not a relevant fix, but to include the fix which
is to ensure that we explicitly wait for the provisioner to be stale
before passing the time to `tickCh`.**
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
## Description
This PR introduces one counter and two histograms related to workspace
creation and claiming. The goal is to provide clearer observability into
how workspaces are created (regular vs prebuild) and the time cost of
those operations.
### `coderd_workspace_creation_total`
* Metric type: Counter
* Name: `coderd_workspace_creation_total`
* Labels: `organization_name`, `template_name`, `preset_name`
This counter tracks whether a regular workspace (not created from a
prebuild pool) was created using a preset or not.
Currently, we already expose `coderd_prebuilt_workspaces_claimed_total`
for claimed prebuilt workspaces, but we lack a comparable metric for
regular workspace creations. This metric fills that gap, making it
possible to compare regular creations against claims.
Implementation notes:
* Exposed as a `coderd_` metric, consistent with other workspace-related
metrics (e.g. `coderd_api_workspace_latest_build`:
https://github.com/coder/coder/blob/main/coderd/prometheusmetrics/prometheusmetrics.go#L149).
* Every `defaultRefreshRate` (1 minute ), DB query
`GetRegularWorkspaceCreateMetrics` is executed to fetch all regular
workspaces (not created from a prebuild pool).
* The counter is updated with the total from all time (not just since
metric introduction). This differs from the histograms below, which only
accumulate from their introduction forward.
### `coderd_workspace_creation_duration_seconds` &
`coderd_prebuilt_workspace_claim_duration_seconds`
* Metric types: Histogram
* Names:
* `coderd_workspace_creation_duration_seconds`
* Labels: `organization_name`, `template_name`, `preset_name`, `type`
(`regular`, `prebuild`)
* `coderd_prebuilt_workspace_claim_duration_seconds`
* Labels: `organization_name`, `template_name`, `preset_name`
We already have `coderd_provisionerd_workspace_build_timings_seconds`,
which tracks build run times for all workspace builds handled by the
provisioner daemon.
However, in the context of this issue, we are only interested in
creation and claim build times, not all transitions; additionally, this
metric does not include `preset_name`, and adding it there would
significantly increase cardinality. Therefore, separate more focused
metrics are introduced here:
* `coderd_workspace_creation_duration_seconds`: Build time to create a
workspace (either a regular workspace or the build into a prebuild pool,
for prebuild initial provisioning build).
* `coderd_prebuilt_workspace_claim_duration_seconds`: Time to claim a
prebuilt workspace from the pool.
The reason for two separate histograms is that:
* Creation (regular or prebuild): provisioning builds with similar time
magnitude, generally expected to take longer than a claim operation.
* Claim: expected to be a much faster provisioning build.
#### Native histogram usage
Provisioning times vary widely between projects. Using static buckets
risks unbalanced or poorly informative histograms.
To address this, these metrics use [Prometheus native
histograms](https://prometheus.io/docs/specs/native_histograms/):
* First introduced in Prometheus v2.40.0
* Recommended stable usage from v2.45+
* Requires Go client `prometheus/client_golang` v1.15.0+
* Experimental and must be explicitly enabled on the server
(`--enable-feature=native-histograms`)
For compatibility, we also retain a classic bucket definition (aligned
with the existing provisioner metric:
https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L182-L189).
* If native histograms are enabled, Prometheus ingests the
high-resolution histogram.
* If not, it falls back to the predefined buckets.
Implementation notes:
* Unlike the counter, these histograms are updated in real-time at
workspace build job completion.
* They reflect data only from the point of introduction forward (no
historical backfill).
## Relates to
Closes: https://github.com/coder/coder/issues/19528
Native histograms tested in observability stack:
https://github.com/coder/observability/pull/50
This pull request introduces support for external workspace management, allowing users to register and manage workspaces that are provisioned and managed outside of the Coder.
Depends on: https://github.com/coder/terraform-provider-coder/pull/424
* GET /api/v2/init-script - Gets the agent initialization script
* By default, it returns a script for Linux (amd64), but with query parameters (os and arch) you can get the init script for different platforms
* GET /api/v2/workspaces/{workspace}/external-agent/{agent}/credentials - Gets credentials for an external agent **(enterprise)**
* Updated queries to filter workspaces/templates by the has_external_agent field