Fixes https://github.com/coder/internal/issues/970
The test doesn't wait for `monitor()` to complete, and the mock database call that we assert takes place in a `defer` within `monitor()`. This allows the mock assertions to race with the defer and flake the test.
Solution is to explicitly wait for `monitor()` to complete before the end of the test, so that mock assertions (which happen in a `t.Cleanup()`) don't race.
Breaking API Change:
> The presence of the `ip` field on `codersdk.ConnectionLog` cannot be
guaranteed, and so the field has been made optional. It may be omitted
on API responses.
When running a scaletest, I noticed logs of the form:
```
2025-09-12 06:34:10.924 [erro] coderd.workspaceapps: upsert connection log failed trace=0xa17580 span=0xa17620 workspace_id=81b937d7-5777-4df5-b5cb-80241f30326f agent_id=78b2ff6d-b4a6-4a4e-88a7-283e05455a88 app_id=00000000-0000-0000-0000-000000000000 user_id=00000000-0000-0000-0000-000000000000 user_agent="" app_slug_or_port=terminal status_code=404 request_id=67f03cf8-9523-444a-97bc-90de080a54c8 ...
error= 1 error occurred:
* pq: null value in column "ip" of relation "connection_logs" violates not-null constraint
```
to ensure logs are never omitted from the connection log due to a
missing IP again (i.e. I'm not sure if we can always rely on a valid,
parseable, IP from `(http.Request).RemoteAddr`), I've removed the `NOT
NULL` constraint on `ip` on `connection_logs`, and made `ip` on the API
response optional.
The specific cause for these null IPs was the
`/workspaceproxies/me/issue-signed-app-token [post]` endpoint
constructing it's own `http.Request` without a `RemoteAddr` set, and
then passing that to the token issuer.
To solve this, we'll have workspace proxies send the real IP of the
client when calling `/workspaceproxies/me/issue-signed-app-token [post]`
via the header `Coder-Workspace-Proxy-Real-IP`.
## Description
Adds support for sending an ad‑hoc custom notification to the
authenticated user via API and CLI. This is useful for surfacing the
result of scripts or long‑running tasks. Notifications are delivered
through the configured method and the dashboard Inbox, respecting
existing preferences and delivery settings.
## Changes
* New notification template: “Custom Notification” with a label for a
custom title and a custom message.
* New API endpoint: `POST /api/v2/notifications/custom` to send a custom
notification to the requesting user.
* New API endpoint: `GET /notifications/templates/custom` to get custom
notification template.
* New CLI subcommand: `coder notifications custom <title> <message>` to
send a custom notification to the requesting user.
* Documentation updates: Add a “Custom notifications” section under
Administration > Monitoring > Notifications, including instructions on
sending custom notifications and examples of when to use them.
Closes: https://github.com/coder/coder/issues/19611
Follows similarly to the bash tool (and some code to connect to an agent
was extracted from it).
There are two main parts: a new agent endpoint, and then a new MCP tool
that consumes that endpoint.
Closes https://github.com/coder/internal/issues/935
This PR enhances the AwaitWorkspaceBuildJobCompleted func in coderdtest
pkg to provide better visibility into test failures and debugging
information.
## Summary
In this pull request we're updating search to support queries with
spaces in addition to the `field:value` pattern that is currently
supported.
Additionally templates search now defaults to `display_name` (since
`display_name` is optional the search will fallback to `name`) when
searching without the `field:value` pattern
Closes: https://github.com/coder/coder/issues/14384
### Downsides with searching on `name` and `display_name`
Because the `name` field cannot include spaces, we end up in a situation
where including a space in the query will result in no results since the
query searches on both `name` AND `display_name`. In the following
example, we can see the results of searching by both `name` and
`display_name` on these templates:
| Name | Display Name |
| ------ | ------------- |
| docker | Docker Template |
| faketemplate | A Fake Template |
| azure | Fake Azure Template |
| anotherfake | Another Fake Template |
| azurefake | Another Fake Fake Azure Template |
https://github.com/user-attachments/assets/b0e0793e-e77d-46bc-9a42-d7cf4f8bd910
### Proposal: Search on `display_name` by default and allow for `name`
using the `field:value` pattern
If we remove `name` from the default template search, we're now able to
search with spaces on template `display_names`. Since `display_names`
are what users see in the templates list they might expect the search to
work this way.
Below is an example of `name` being removed from the default template
search.
https://github.com/user-attachments/assets/9aba5911-4960-4384-befb-08ea1acaa3ab
With this approach users would still be able to search on template names
by specifying `exact_name:foo`.
### Testing
Added additional test cases to ensure spaces were handled as expected in
combination with `field:value` patterns.
Closes https://github.com/coder/internal/issues/885
Adds a new database method GetProvisionerJobByIDWithLock that uses FOR
UPDATE without SKIP LOCKED to fix workspace build cancellation returning
500 errors when jobs are locked.
Update OAuth2 metadata endpoint routes to support path suffixes
This PR updates the OAuth2 metadata endpoint routes to include a wildcard character (*) at the end of the paths. This change allows the endpoints to match requests with path suffixes, making our OAuth2 discovery implementation more flexible and compliant with the relevant RFCs.
The updated routes are:
- `/.well-known/oauth-authorization-server*` for RFC 8414 discovery
- `/.well-known/oauth-protected-resource*` for RFC 9728 discovery
This change adds a support-index for `GetAPIKeysLastUsedAfter`.
On dogfood (24h): called 4.4k times, 380ms average.
Change for tested time range: `170ms` -> `3.3ms`.
Fixescoder/internal#724
This change adds an index to optimize the
`GetProvisionerDaemonsWithStatusByOrganization` query.
Execution time dropped from `18s 838ms` to `107ms`.
Closes https://github.com/coder/internal/issues/961
Likely the same deal as in #19599, the body of `require.Eventually` now fires immediately, when it used to fire after 250ms (the interval). Presumably, the deployment stats become ready before the vs code session count gets incremented. This was never an issue with the 250ms delay, as this flake has only cropped up after the testify version bump.
We'll fix the issue by making it possible to wait for a full metrics cache refresh, i.e. removing `require.Eventually` in this test altogether.
When clients disconnected from the /containers/watch endpoint, the WebSocket
connection between coderd and the agent stayed open. This caused heartbeat
traffic every 15s that was incorrectly counted as workspace activity,
extending workspace lifetimes indefinitely.
Now properly cancels the agent connection context when the client disconnects.
see https://github.com/coder/internal/issues/959 but the tl; dr is:
- we call this DB query on an interval (every 15s) and it would be
called on each coderd replica as well
- the generated values update very infrequently (for our most used
internal template I saw the builds created/claimed update twice in a 1h
period)
- we have no index on the initiator ID, so this query has to scan the
entire workspace_builds table on every request
In reality this should likely just be a Prometheus metric, and
Prometheus can handle the counter reset behaviour at query time, but for
now this should at least cut the load of the query to 25% of it's
current impact.
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
This PR improves the ruleguard rule for detecting `t.Fail` calls in goroutines. It picks up additional violations, of which are fixed in this PR.
See self-review for details.
The motivation for fixing this comes from a flake I fixed in https://github.com/coder/coder/pull/19599, where tests would fail from a `require` in an `Eventually`.
The latest release of all `pg_dump` major versions, going back to 13,
started inserting `\restrict` `\unrestrict` keywords into dumps. This
currently breaks sqlc in `gen/dump` and our check migration script. Full
details of the postgres change are available here:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=575f54d4c
To fix, we'll always use the `pg_dump` in our postgres 13.21 docker
image for schema dumps, instead of what's on the runner/local machine.
Coder doesn't restore from postgres dumps, so we're not vulnerable to
attacks that would be patched by the latest postgres version.
Regardless, we'll unpin ASAP.
Once sqlc is updated to handle these keywords, we need to start
stripping them when comparing the schema in the migration check script,
and then we can unpin the pg_dump version. This is being tracked at
https://github.com/coder/internal/issues/965
Refactors Agent instance identity to be a SessionTokenProvider.
Refactors the CLI to create Agent clients via a centralized function, rather than add-hoc via individual command handlers and their flags.
This allows commands besides `coder agent`, but which still use the agent identity, to support instance identity authentication.
Fixes#19111 by unifying all API requests to go thru the SessionTokenProvider for auth credentials.
Fixes: https://github.com/coder/internal/issues/950
Pretty sure the intention of the `hold` wait group is to try to get the two goroutines that the test starts running at the same time. But, that should be the case for two goroutines started anyway.
The use of `hold` doesn't actually guarantee concurrent execution of `Acquire`, just that both goroutines get as far as `Done()` --- the go scheduler could run them serially without incident.
So I've chosen to just remove the use of `hold` to simplify.
But, for posterity, the data race was due to incrementing by 1 in the loop along with the goroutine that calls Done. You could increment by 1 and then back down to 0 before the second iteration of the loop starts. This then causes a data race with calling `Wait()` in the first goroutine and `Add()` in the second iteration. c.f. https://pkg.go.dev/sync#WaitGroup.Add
Due to how we currently label a workspace as a task, there is a delay
between when a task workspace is created and when it is labelled as a
task.
This PR introduces fallback check for when a workspace does _not_ have
`HasAITask` set. This fallback check tests to see if the special "AI
Prompt" parameter is present in the workspace's build parameters.
* provisionerdserver: Expires prebuild user token for workspace, if it
exists, when regenerating session token.
* dbauthz: disallow prebuilds user from creating api keys
* dbpurge: added functionality to expire stale api keys owned by the
prebuilds user
This PR should resolve https://github.com/coder/internal/issues/719 by
limiting the `workspace_builds` rows selected by the query to the most
recent 100 builds of a template, as opposed to all builds in the last
30d. For our own internal templates with the most builds (1700-2000 in a
30d period) this should cut the query execution time by about 80%.
Unless we have some restriction on keeping the 30d period, contract
related or otherwise, this seems like a safe change to make. In addition
to the execution speed improvements it also means the memory for the
query is bounded as well.
If we want to keep a 30d time period for the avg build time value I
think it's worth exploring a purpose built solution such as histogram
structures where the build times could be bucketized by template ID as
they're observed.
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
This test still flakes occasionally, see
https://github.com/coder/internal/issues/954#issuecomment-3237154735
The cause appears to be related to the assignment of `time.Now()` as the
`LastSeenAt` time when creating a provisioner which can flake with the
calculated scheduled next autostart and the code to set then
`require.Eventually` the updated provisioner LastSeenAt.
Instead we should simply calculate all time values for the stale portion
of the test based on the provisioners LastSeenAt value to avoid such
issues.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
- Removes GetManagedAgentCount query
- Adds new table `usage_events_daily` which stores aggregated usage
events by the type and UTC day
- Adds trigger to update the values in this table when a new row is
inserted into `usage_events`
- Adds a migration that adds `usage_events_daily` rows for existing data
in `usage_events`
- Adds tests for the trigger
- Adds tests for the backfill query in the migration
Since the `usage_events` table is unreleased currently, this migration
will do nothing on real deployments and will only affect preview
deployments such as dogfood.
Closes https://github.com/coder/internal/issues/943
## Description
When creating a prebuilt workspace, both `flags.IsPrebuild` and
`flags.IsFirstBuild` are true. Previously, the logic rejected cases with
multiple flags, so `coderd_workspace_creation_duration_seconds` wasn’t
updated for prebuilt creations. This is the only valid scenario where
two flags can be true.
## Changes
* Fix logic to update `coderd_workspace_creation_duration_seconds`
metric for prebuilt workspaces.
* Add prebuild helper functions to coderdenttest (other prebuild tests
can reuse this).
* Update workspace's provisionerdmetric tests to include this metric.
Follow-up: https://github.com/coder/coder/pull/19503
Related to: https://github.com/coder/coder/issues/19528
Coder Tasks requires us to create a workspace, but we want to be able to
return a `codersdk.Task` instead of a `codersdk.Workspace`. This
requires untangling `createWorkspace` from directly writing to
`http.ResponseWriter`.
Refactors `codersdk.Client`'s use of session tokens to use a `SessionTokenProvider`, which abstracts the obtaining and storing of the session token.
The main motiviation is to unify Agent authentication an an upstack PR, which can use cloud instance identity via token exchange, rather than a fixed session token.
However, the abstraction could also allow functionality like obtaining the session token from other external sources like the OS credential manager, or an external secret/key management system like Vault.
The flake here had two causes:
1. related to usage of time.Now() in MustWaitForProvisionersAvailable
and
2. the fact that UpdateProvisionerLastSeenAt can not use a time that is
further in the past than the current LastSeenAt time
Previously the test here was calling
`coderdtest.MustWaitForProvisionersAvailable` which was using `time.Now`
rather than the next tick time like the real `hasProvisionersAvailable`
function does. Additionally, when using `UpdateProvisionerLastSeenAt`
the underlying db query enforces that the time we're trying to set
`LastSeenAt` to cannot be older than the current value.
I was able to reliably reproduce the flake by executing both the
`UpdateProvisionerLastSeenAt` call and `tickCh <- next` in their own
goroutines, the former with a small sleep to reliably ensure we'd
trigger the autobuild before we set the `LastSeenAt` time. That's when I
also noticed that `coderdtest.MustWaitForProvisionersAvailable` was
using `time.Now` instead of the tick time. When I updated that function
to take in a tick time + added a 2nd call to
`UpdateProvisionerLastSeenAt` to set an original non-stale time, we
could then never get the test to pass because the later call to set the
stale time would not actually modify `LastSeenAt`. On top of that,
calling the provisioner daemons closer in the middle of the function
doesn't really do anything of value in this test.
**The fix for the flake is to keep the go routines, ensuring there would
be a flake if there was not a relevant fix, but to include the fix which
is to ensure that we explicitly wait for the provisioner to be stale
before passing the time to `tickCh`.**
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Addresses comment raised on previous PR
https://github.com/coder/coder/pull/19619#discussion_r2307943410
We know we can skip sub agents when searching for which agent is related
to the task, as this is not an explicitly supported feature at the
moment. When we come to properly setting up a Task -> Agent relationship
this limitation will be dropped.
Closes https://github.com/coder/internal/issues/949
Adds the following fields to `codersdk.Task`
- OwnerName
- TemplateName
- TemplateDisplayName
- TemplateIcon
- WorkspaceAgentID
- WorkspaceAgentLifecycle
- WorkspaceAgentHealth
The implementation is unfortunately not compatible with multiple agents
as we have no reliable way to tell which agent has the AI task running
in it. For now we just pick the first agent found, but in the future
this will need to be changed.
## Description
This PR introduces one counter and two histograms related to workspace
creation and claiming. The goal is to provide clearer observability into
how workspaces are created (regular vs prebuild) and the time cost of
those operations.
### `coderd_workspace_creation_total`
* Metric type: Counter
* Name: `coderd_workspace_creation_total`
* Labels: `organization_name`, `template_name`, `preset_name`
This counter tracks whether a regular workspace (not created from a
prebuild pool) was created using a preset or not.
Currently, we already expose `coderd_prebuilt_workspaces_claimed_total`
for claimed prebuilt workspaces, but we lack a comparable metric for
regular workspace creations. This metric fills that gap, making it
possible to compare regular creations against claims.
Implementation notes:
* Exposed as a `coderd_` metric, consistent with other workspace-related
metrics (e.g. `coderd_api_workspace_latest_build`:
https://github.com/coder/coder/blob/main/coderd/prometheusmetrics/prometheusmetrics.go#L149).
* Every `defaultRefreshRate` (1 minute ), DB query
`GetRegularWorkspaceCreateMetrics` is executed to fetch all regular
workspaces (not created from a prebuild pool).
* The counter is updated with the total from all time (not just since
metric introduction). This differs from the histograms below, which only
accumulate from their introduction forward.
### `coderd_workspace_creation_duration_seconds` &
`coderd_prebuilt_workspace_claim_duration_seconds`
* Metric types: Histogram
* Names:
* `coderd_workspace_creation_duration_seconds`
* Labels: `organization_name`, `template_name`, `preset_name`, `type`
(`regular`, `prebuild`)
* `coderd_prebuilt_workspace_claim_duration_seconds`
* Labels: `organization_name`, `template_name`, `preset_name`
We already have `coderd_provisionerd_workspace_build_timings_seconds`,
which tracks build run times for all workspace builds handled by the
provisioner daemon.
However, in the context of this issue, we are only interested in
creation and claim build times, not all transitions; additionally, this
metric does not include `preset_name`, and adding it there would
significantly increase cardinality. Therefore, separate more focused
metrics are introduced here:
* `coderd_workspace_creation_duration_seconds`: Build time to create a
workspace (either a regular workspace or the build into a prebuild pool,
for prebuild initial provisioning build).
* `coderd_prebuilt_workspace_claim_duration_seconds`: Time to claim a
prebuilt workspace from the pool.
The reason for two separate histograms is that:
* Creation (regular or prebuild): provisioning builds with similar time
magnitude, generally expected to take longer than a claim operation.
* Claim: expected to be a much faster provisioning build.
#### Native histogram usage
Provisioning times vary widely between projects. Using static buckets
risks unbalanced or poorly informative histograms.
To address this, these metrics use [Prometheus native
histograms](https://prometheus.io/docs/specs/native_histograms/):
* First introduced in Prometheus v2.40.0
* Recommended stable usage from v2.45+
* Requires Go client `prometheus/client_golang` v1.15.0+
* Experimental and must be explicitly enabled on the server
(`--enable-feature=native-histograms`)
For compatibility, we also retain a classic bucket definition (aligned
with the existing provisioner metric:
https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L182-L189).
* If native histograms are enabled, Prometheus ingests the
high-resolution histogram.
* If not, it falls back to the predefined buckets.
Implementation notes:
* Unlike the counter, these histograms are updated in real-time at
workspace build job completion.
* They reflect data only from the point of introduction forward (no
historical backfill).
## Relates to
Closes: https://github.com/coder/coder/issues/19528
Native histograms tested in observability stack:
https://github.com/coder/observability/pull/50
The previous logic verified a generated name was valid, _and then
appended a suffix to it_. This was flawed as it would allow a 32
character name, and then append an extra 5 characters to it.
Instead we now append the suffix _and then_ verify it is valid.