Refactors `codersdk.Client`'s use of session tokens to use a `SessionTokenProvider`, which abstracts the obtaining and storing of the session token.
The main motiviation is to unify Agent authentication an an upstack PR, which can use cloud instance identity via token exchange, rather than a fixed session token.
However, the abstraction could also allow functionality like obtaining the session token from other external sources like the OS credential manager, or an external secret/key management system like Vault.
The flake here had two causes:
1. related to usage of time.Now() in MustWaitForProvisionersAvailable
and
2. the fact that UpdateProvisionerLastSeenAt can not use a time that is
further in the past than the current LastSeenAt time
Previously the test here was calling
`coderdtest.MustWaitForProvisionersAvailable` which was using `time.Now`
rather than the next tick time like the real `hasProvisionersAvailable`
function does. Additionally, when using `UpdateProvisionerLastSeenAt`
the underlying db query enforces that the time we're trying to set
`LastSeenAt` to cannot be older than the current value.
I was able to reliably reproduce the flake by executing both the
`UpdateProvisionerLastSeenAt` call and `tickCh <- next` in their own
goroutines, the former with a small sleep to reliably ensure we'd
trigger the autobuild before we set the `LastSeenAt` time. That's when I
also noticed that `coderdtest.MustWaitForProvisionersAvailable` was
using `time.Now` instead of the tick time. When I updated that function
to take in a tick time + added a 2nd call to
`UpdateProvisionerLastSeenAt` to set an original non-stale time, we
could then never get the test to pass because the later call to set the
stale time would not actually modify `LastSeenAt`. On top of that,
calling the provisioner daemons closer in the middle of the function
doesn't really do anything of value in this test.
**The fix for the flake is to keep the go routines, ensuring there would
be a flake if there was not a relevant fix, but to include the fix which
is to ensure that we explicitly wait for the provisioner to be stale
before passing the time to `tickCh`.**
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Addresses comment raised on previous PR
https://github.com/coder/coder/pull/19619#discussion_r2307943410
We know we can skip sub agents when searching for which agent is related
to the task, as this is not an explicitly supported feature at the
moment. When we come to properly setting up a Task -> Agent relationship
this limitation will be dropped.
Closes https://github.com/coder/internal/issues/949
Adds the following fields to `codersdk.Task`
- OwnerName
- TemplateName
- TemplateDisplayName
- TemplateIcon
- WorkspaceAgentID
- WorkspaceAgentLifecycle
- WorkspaceAgentHealth
The implementation is unfortunately not compatible with multiple agents
as we have no reliable way to tell which agent has the AI task running
in it. For now we just pick the first agent found, but in the future
this will need to be changed.
## Description
This PR introduces one counter and two histograms related to workspace
creation and claiming. The goal is to provide clearer observability into
how workspaces are created (regular vs prebuild) and the time cost of
those operations.
### `coderd_workspace_creation_total`
* Metric type: Counter
* Name: `coderd_workspace_creation_total`
* Labels: `organization_name`, `template_name`, `preset_name`
This counter tracks whether a regular workspace (not created from a
prebuild pool) was created using a preset or not.
Currently, we already expose `coderd_prebuilt_workspaces_claimed_total`
for claimed prebuilt workspaces, but we lack a comparable metric for
regular workspace creations. This metric fills that gap, making it
possible to compare regular creations against claims.
Implementation notes:
* Exposed as a `coderd_` metric, consistent with other workspace-related
metrics (e.g. `coderd_api_workspace_latest_build`:
https://github.com/coder/coder/blob/main/coderd/prometheusmetrics/prometheusmetrics.go#L149).
* Every `defaultRefreshRate` (1 minute ), DB query
`GetRegularWorkspaceCreateMetrics` is executed to fetch all regular
workspaces (not created from a prebuild pool).
* The counter is updated with the total from all time (not just since
metric introduction). This differs from the histograms below, which only
accumulate from their introduction forward.
### `coderd_workspace_creation_duration_seconds` &
`coderd_prebuilt_workspace_claim_duration_seconds`
* Metric types: Histogram
* Names:
* `coderd_workspace_creation_duration_seconds`
* Labels: `organization_name`, `template_name`, `preset_name`, `type`
(`regular`, `prebuild`)
* `coderd_prebuilt_workspace_claim_duration_seconds`
* Labels: `organization_name`, `template_name`, `preset_name`
We already have `coderd_provisionerd_workspace_build_timings_seconds`,
which tracks build run times for all workspace builds handled by the
provisioner daemon.
However, in the context of this issue, we are only interested in
creation and claim build times, not all transitions; additionally, this
metric does not include `preset_name`, and adding it there would
significantly increase cardinality. Therefore, separate more focused
metrics are introduced here:
* `coderd_workspace_creation_duration_seconds`: Build time to create a
workspace (either a regular workspace or the build into a prebuild pool,
for prebuild initial provisioning build).
* `coderd_prebuilt_workspace_claim_duration_seconds`: Time to claim a
prebuilt workspace from the pool.
The reason for two separate histograms is that:
* Creation (regular or prebuild): provisioning builds with similar time
magnitude, generally expected to take longer than a claim operation.
* Claim: expected to be a much faster provisioning build.
#### Native histogram usage
Provisioning times vary widely between projects. Using static buckets
risks unbalanced or poorly informative histograms.
To address this, these metrics use [Prometheus native
histograms](https://prometheus.io/docs/specs/native_histograms/):
* First introduced in Prometheus v2.40.0
* Recommended stable usage from v2.45+
* Requires Go client `prometheus/client_golang` v1.15.0+
* Experimental and must be explicitly enabled on the server
(`--enable-feature=native-histograms`)
For compatibility, we also retain a classic bucket definition (aligned
with the existing provisioner metric:
https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L182-L189).
* If native histograms are enabled, Prometheus ingests the
high-resolution histogram.
* If not, it falls back to the predefined buckets.
Implementation notes:
* Unlike the counter, these histograms are updated in real-time at
workspace build job completion.
* They reflect data only from the point of introduction forward (no
historical backfill).
## Relates to
Closes: https://github.com/coder/coder/issues/19528
Native histograms tested in observability stack:
https://github.com/coder/observability/pull/50
The previous logic verified a generated name was valid, _and then
appended a suffix to it_. This was flawed as it would allow a 32
character name, and then append an extra 5 characters to it.
Instead we now append the suffix _and then_ verify it is valid.
closes https://github.com/coder/coder/issues/18274
This pull request makes system users visible in various group related
queries so that they can be added to and removed from groups. This
allows system user quotas to be configured. System users are still
ignored in certain queries, such as when license seat consumption is
determined.
This pull request further ensures the existence of a
"coder_prebuilt_workspaces" group in any organization that needs
prebuilt workspaces
---------
Co-authored-by: Susana Ferreira <susana@coder.com>
Quick fix for following issue in CLI:
```
$ go run ./cmd/coder exp task list
Encountered an error running "coder exp task list", see "coder exp task list --help" for more information
error: Trace=[list tasks: ]
Internal error fetching task prompts and states.
workspace 14d548f4-aaad-40dd-833b-6ffe9c9d31bc is not an AI task workspace
exit status 1
```
This occurs in a short time window directly after creating a new task.
I took a stab at writing a test for this, but ran out of time. I'm not
entirely sure what causes non-AI-task workspaces to be returned in the
query but I suspect it's when a workspace build is pending or running.
Fix https://github.com/coder/internal/issues/826
I wasn't able to recreate the flake, but my underlying assumption (from
reading the logs we have) is that there is a race condition where the
test will begin cleanup before the dev container recreation goroutine
has a chance to call `devcontainer up`.
I've refactored the test slightly and made it so that the test will not
finish until either the context has timed out, or `Up` has been called.
This works around the issue where a task may "disappear" on stop.
Re-using the previous value of `has_ai_task` and `sidebar_app_id` on a
stop transition.
---------
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
Blink helped here but it's suggestion was to have a set map of sensitive
fields based on predefined constants in various files, such as the api
token string names. For now we'll add additional query param logging for fields we know are safe/that we want to log, such as query pagination/limit fields and ID list counts which may help identify P99 DB query latencies.
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Closes https://github.com/coder/internal/issues/716
This prevents a scan over the entire `workspace_build` table by removing
a `join`. This is still imperfect as we are still scanning over the
number of builds for the workspaces in the arguments. Ideally we would
have some index or something precomputed. Then we could skip scanning
over the builds for the correct workspaces that are not the latest.
Fixescoder/internal#892Fixescoder/internal#896
Example output:
```
❯ coder exp task list
ID NAME STATUS STATE STATE CHANGED MESSAGE
a7a27450-ca16-4553-a6c5-9d6f04808569 task-hardcore-herschel-bd08 running idle 5h22m3s ago Listed root directory contents, working directory reset
50f92138-f463-4f2b-abad-1816264b065f task-musing-dewdney-f058 running idle 6h3m8s ago Completed arithmetic calculation
```
- Adds `usagetypes.UnknownEventTypeError` type, which is returned by
`ParseEventWithType`
- Changes `ParseEvent` to not be a generic function since it doesn't
really need it
- Adds `User-Agent` to tallyman requests
## Summary
In this pull request we're adding support for additional filtering
options to the `provisioners list` CLI command and the
`/provisionerdaemons` API endpoint.
Resolves: https://github.com/coder/coder/issues/18783
### Changes
#### Added CLI Options
- `--show-offline`: When this option is provided, all provisioner
daemons will be returned. This means that when `--show-offline` is not
provided only `idle` and `busy` provisioner daemons will be returned.
- `--status=<list_of_statuses>`: When this option is provided with a
comma-separated list of valid statuses (`idle`, `busy`, or `offline`)
only provisioner daemons that have these statuses will be returned.
- `--max-age=<duration>`: When this option is provided with a valid
duration value (e.g., `24h`, `30s`) only provisioner daemons with a
`last_seen_at` timestamp within the provided max age will be returned.
#### Query Params
- `?offline=true`: Include offline provisioner daemons in the results.
Offline provisioner daemons will be excluded if `?offline=false` or if
offline is not provided.
- `?status=<list_of_statuses>`: Include provisioner daemons with the
specified statuses.
- `?max_age=<duration>`: Include provisioner daemons with a
`last_seen_at` timestamp within the max age duration.
#### Frontend
- Since offline provisioners will not be returned by default anymore
(`--show-offline` has to be provided to see them), a checkbox was added
to the provisioners list page to allow for offline provisioners to be
displayed
- A revamp of the provisioners page will be done in:
https://github.com/coder/coder/issues/17156, this checkbox change was
just added to maintain currently functionality with the backend updates
Current provisioners page (without checkbox)
<img width="1329" height="574" alt="Screenshot 2025-08-20 at 10 51
00 AM"
src="https://github.com/user-attachments/assets/77b73650-0b62-44f0-a77f-acbe5710809f"
/>
Provisioners page with checkbox (unchecked)
<img width="1314" height="626" alt="Screenshot 2025-08-20 at 10 48
40 AM"
src="https://github.com/user-attachments/assets/7ba164ad-6d3f-417b-bd39-338c0161b145"
/>
Provisioner page with checkbox (checked) and URL updated with query
parameters
<img width="1306" height="597" alt="Screenshot 2025-08-20 at 10 50
14 AM"
src="https://github.com/user-attachments/assets/e78d0986-bbf8-491b-9d56-b682973237a0"
/>
### Show Offline vs Offline Status
To list offline provisioner daemons, users can either:
1. Include the `--show-offline` option
OR
2. Include `offline` in the list of values provided to the `--status`
option
At the moment, the loop which retrieves and updates the values of the
agents metrics excessively calls `GetUserByID` (a DB query). First it
retrieves a list of all workspaces, filtering out inactive agents (not
entirely clear to me whether this is non-running workspaces, or just
dead agents), and then iterates over those workspaces to get the rest of
the relevant data for the metrics. The next call is `GetUserByID` for
`workspace.OwnerID`. This is unnecessary because the `workspaces_visible` view we pull workspaces from has already been joined with the users table to get the username/name/etc.
This should at least partially resolve
https://github.com/coder/internal/issues/726
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Username length and format, via regex, are already enforced at the
application layer, but we have some code paths with database queries
where we could optimize away many of the DB query calls if we could be
sure at the database level that the username is never an empty string.
For example: https://github.com/coder/coder/pull/19395
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
## Summary
In this pull request we're adding support for OIDC allowed groups in the
OSS version as part of work for
https://github.com/coder/coder/issues/17027.
### Changes
- Restored support for parsing group allow list in OSS code
### Testing
- Added tests for OSS code
- Tested allowed/prohibited group OIDC flows in premium and OSS
## Description
This PR ensures that activity-based deadline extensions ("activity
bumping") are not applied to prebuilt workspaces. Prebuilds are managed
by the reconciliation loop and must not have `deadline` or
`max_deadline` values set or extended, as they are not part of the
regular lifecycle executor path.
## Changes
- Update `ActivityBumpWorkspace` SQL query to discard prebuilt
workspaces
- Update application layer to avoid calling activity bump logic on
prebuilt workspaces
Related with:
* Issue: https://github.com/coder/coder/issues/18898
* PR: https://github.com/coder/coder/pull/19252
Closes https://github.com/coder/coder/issues/18356.
This change finds and selects a matching preset if one was not chosen
during workspace creation. This solidifies the relationship between
presets and parameters.
When a workspace is created without in explicitly chosen preset, it will
now still be eligible to claim a prebuilt workspace if one is available.
Fixes https://github.com/coder/internal/issues/907
We convert `workspacesdk.AgentConn` to an interface and generate a mock
for it. This allows writing `coderd` tests that rely on the agent's HTTP
api to not have to set up an entire tailnet networking stack.
Closes https://github.com/coder/coder/issues/18159
If an Anthropic API key is available, we call out to Claude to generate
a task name based on the user-provided prompt instead of our random name
generator.
## Description
This PR ensures that lifecycle-related changes made via template
schedule updates do **not affect prebuilt workspaces**. Since prebuilds
are managed by the reconciliation loop and do not participate in the
regular lifecycle executor flow, they must be excluded from any updates
triggered by template configuration changes.
This includes changes to TTL, dormant-deletion scheduling, deadline and
autostart scheduling.
## Changes
- Updated SQL query `UpdateWorkspacesTTLByTemplateID` to exclude
prebuilt workspaces
- Updated SQL query `UpdateWorkspacesDormantDeletingAtByTemplateID` to
exclude prebuilt workspaces
- Updated application-layer logic to skip any updates to lifecycle
parameters if a workspace is a prebuild
- Preserved all existing update behavior for regular user workspaces
This change guarantees that only lifecycle-managed workspaces are
affected when template-level configurations are modified, preserving
strict boundaries between prebuild and user workspace lifecycles.
Related with:
* Issue: https://github.com/coder/coder/issues/18898
* PR: https://github.com/coder/coder/pull/19252
This pull request introduces support for external workspace management, allowing users to register and manage workspaces that are provisioned and managed outside of the Coder.
Depends on: https://github.com/coder/terraform-provider-coder/pull/424
* GET /api/v2/init-script - Gets the agent initialization script
* By default, it returns a script for Linux (amd64), but with query parameters (os and arch) you can get the init script for different platforms
* GET /api/v2/workspaces/{workspace}/external-agent/{agent}/credentials - Gets credentials for an external agent **(enterprise)**
* Updated queries to filter workspaces/templates by the has_external_agent field
This pull request introduces support for external workspace management, allowing users to register and manage workspaces that are provisioned and managed outside of the Coder.
* Added has_external_agent field to workspace builds and template versions
Relates to https://github.com/coder/internal/issues/907
The test can take around 10s when it is the only one running, so in a
constrained environment like CI it makes sense that it still hits the 25
second timeout. For now we up the limit to 60 seconds until the test is
rewritten to greatly reduce the time taken.
Closes https://github.com/coder/internal/issues/906
This test was using dbmem until we removed it. The test just makes sure the background job runs at all, so a mock db continues to be fine here.
No other tests in this package used dbmem, so this is the only test I've changed.
Not used in coderd yet, see stack.
Adds two new packages:
- `coderd/usage`: provides an interface for the "Collector" as well as a stub implementation for AGPL
- `enterprise/coderd/usage`: provides an interface for the "Publisher" as well as a Tallyman implementation
Relates to https://github.com/coder/internal/issues/814
Fixes https://github.com/coder/coder/issues/19372
We increase the read limit to 4MiB (we use this limit elsewhere). We
also make sure to stop sending messages when `containersCh` becomes
closed.