There is currently an issue with subdomain workspace apps on workspace
proxies, where if you have a workspace proxy wildcard nested beneath the
primary wildcard, cookies from the primary may be sent to the server
before cookies from the proxy specifically.
Currently:
1. Use a subdomain app via the primary proxy `*.coder.corp.com`
a. Client sends no cookies
a. Server does token smuggling flow
a. Server sets a cookie `coder_subdomain_app_session_token` on
`*.coder.corp.com`
a. Server redirects client to reload the page
a. Request should succeed as usual
1. Wait until the primary proxy's session token cookie has expired in
the database (or make it invalid yourself)
1. Use a subdomain app via a separate proxy `*.sydney.coder.corp.com`
a. Client sends `coder_subdomain_app_session_token` cookie from
`*.coder.corp.com`
a. Server validates supplied cookie, it fails because it's expired
a. Server does token smuggling flow
a. Server sets a cookie `coder_subdomain_app_session_token` on
`*.sydney.coder.corp.com`
a. Server redirects client to reload page
a. Client sends BOTH cookies.
a. The server will only process the first cookie it receives, so if the
expired cookie for the primary proxy is sent first the request will end
up in a permanent loop on step b.
The fix is to append `_{hash(wildcard_access_url)}` to the subdomain
cookies as we cannot control browser behavior further. This avoids the
conflict as each proxy will only read it's specific cookie.
This PR adds a readiness wait to OAuth2 metadata endpoint tests to avoid rare races with server startup. Instead of immediately making HTTP requests, the tests now use `testutil.Eventually` to retry the requests until they succeed, with a short interval between attempts. This helps prevent flaky tests that might fail due to timing issues during server initialization.
Fixes: https://github.com/coder/internal/issues/996
relates to https://github.com/coder/internal/issues/927
Refactors dbtestutil to use a `Broker` struct to create test databases. Additionally uses a `coder_testing` database to record test databases that are created and when they are dropped.
This is in preparation for the PR above this in the stack which adds a "cleaner" subprocess that cleans out any databases that were left when the test process ends.
Adds shared_with_user and shared_with_group filters to the /workspaces
endpoint.
- `shared_with_user`: filters workspaces shared with a specific user.
Accepts a user UUID or username.
- `shared_with_group`: filters workspaces shared with a specific group.
Accepts:
- a group UUID, or
- `<organization name>/<group name>`, or
- `<group name>` (resolved in the default organization).
Closes
[coder/internal#1004](https://github.com/coder/internal/issues/1004)
Solves #15575
Adds OAuth access token revocation when unlinking external auth
provider. Due to revocation not being consistently implemented by
providers this is only best effort attempt. Unsuccessful revocation
won't influence link removal.
I noticed during a scaletest that many warning logs were being generated when enqueuing notifications. The error was:
```
failed to notify of workspace creation: notification is not enabled
```
I don't think we should be warning if automated notifications fail to send to users because they have them disabled.
To fix, we'll stop returning these errors.
Fixes https://github.com/coder/internal/issues/970
The test doesn't wait for `monitor()` to complete, and the mock database call that we assert takes place in a `defer` within `monitor()`. This allows the mock assertions to race with the defer and flake the test.
Solution is to explicitly wait for `monitor()` to complete before the end of the test, so that mock assertions (which happen in a `t.Cleanup()`) don't race.
Breaking API Change:
> The presence of the `ip` field on `codersdk.ConnectionLog` cannot be
guaranteed, and so the field has been made optional. It may be omitted
on API responses.
When running a scaletest, I noticed logs of the form:
```
2025-09-12 06:34:10.924 [erro] coderd.workspaceapps: upsert connection log failed trace=0xa17580 span=0xa17620 workspace_id=81b937d7-5777-4df5-b5cb-80241f30326f agent_id=78b2ff6d-b4a6-4a4e-88a7-283e05455a88 app_id=00000000-0000-0000-0000-000000000000 user_id=00000000-0000-0000-0000-000000000000 user_agent="" app_slug_or_port=terminal status_code=404 request_id=67f03cf8-9523-444a-97bc-90de080a54c8 ...
error= 1 error occurred:
* pq: null value in column "ip" of relation "connection_logs" violates not-null constraint
```
to ensure logs are never omitted from the connection log due to a
missing IP again (i.e. I'm not sure if we can always rely on a valid,
parseable, IP from `(http.Request).RemoteAddr`), I've removed the `NOT
NULL` constraint on `ip` on `connection_logs`, and made `ip` on the API
response optional.
The specific cause for these null IPs was the
`/workspaceproxies/me/issue-signed-app-token [post]` endpoint
constructing it's own `http.Request` without a `RemoteAddr` set, and
then passing that to the token issuer.
To solve this, we'll have workspace proxies send the real IP of the
client when calling `/workspaceproxies/me/issue-signed-app-token [post]`
via the header `Coder-Workspace-Proxy-Real-IP`.
## Description
Adds support for sending an ad‑hoc custom notification to the
authenticated user via API and CLI. This is useful for surfacing the
result of scripts or long‑running tasks. Notifications are delivered
through the configured method and the dashboard Inbox, respecting
existing preferences and delivery settings.
## Changes
* New notification template: “Custom Notification” with a label for a
custom title and a custom message.
* New API endpoint: `POST /api/v2/notifications/custom` to send a custom
notification to the requesting user.
* New API endpoint: `GET /notifications/templates/custom` to get custom
notification template.
* New CLI subcommand: `coder notifications custom <title> <message>` to
send a custom notification to the requesting user.
* Documentation updates: Add a “Custom notifications” section under
Administration > Monitoring > Notifications, including instructions on
sending custom notifications and examples of when to use them.
Closes: https://github.com/coder/coder/issues/19611
Follows similarly to the bash tool (and some code to connect to an agent
was extracted from it).
There are two main parts: a new agent endpoint, and then a new MCP tool
that consumes that endpoint.
Closes https://github.com/coder/internal/issues/935
This PR enhances the AwaitWorkspaceBuildJobCompleted func in coderdtest
pkg to provide better visibility into test failures and debugging
information.
## Summary
In this pull request we're updating search to support queries with
spaces in addition to the `field:value` pattern that is currently
supported.
Additionally templates search now defaults to `display_name` (since
`display_name` is optional the search will fallback to `name`) when
searching without the `field:value` pattern
Closes: https://github.com/coder/coder/issues/14384
### Downsides with searching on `name` and `display_name`
Because the `name` field cannot include spaces, we end up in a situation
where including a space in the query will result in no results since the
query searches on both `name` AND `display_name`. In the following
example, we can see the results of searching by both `name` and
`display_name` on these templates:
| Name | Display Name |
| ------ | ------------- |
| docker | Docker Template |
| faketemplate | A Fake Template |
| azure | Fake Azure Template |
| anotherfake | Another Fake Template |
| azurefake | Another Fake Fake Azure Template |
https://github.com/user-attachments/assets/b0e0793e-e77d-46bc-9a42-d7cf4f8bd910
### Proposal: Search on `display_name` by default and allow for `name`
using the `field:value` pattern
If we remove `name` from the default template search, we're now able to
search with spaces on template `display_names`. Since `display_names`
are what users see in the templates list they might expect the search to
work this way.
Below is an example of `name` being removed from the default template
search.
https://github.com/user-attachments/assets/9aba5911-4960-4384-befb-08ea1acaa3ab
With this approach users would still be able to search on template names
by specifying `exact_name:foo`.
### Testing
Added additional test cases to ensure spaces were handled as expected in
combination with `field:value` patterns.
Closes https://github.com/coder/internal/issues/885
Adds a new database method GetProvisionerJobByIDWithLock that uses FOR
UPDATE without SKIP LOCKED to fix workspace build cancellation returning
500 errors when jobs are locked.
Update OAuth2 metadata endpoint routes to support path suffixes
This PR updates the OAuth2 metadata endpoint routes to include a wildcard character (*) at the end of the paths. This change allows the endpoints to match requests with path suffixes, making our OAuth2 discovery implementation more flexible and compliant with the relevant RFCs.
The updated routes are:
- `/.well-known/oauth-authorization-server*` for RFC 8414 discovery
- `/.well-known/oauth-protected-resource*` for RFC 9728 discovery
This change adds a support-index for `GetAPIKeysLastUsedAfter`.
On dogfood (24h): called 4.4k times, 380ms average.
Change for tested time range: `170ms` -> `3.3ms`.
Fixescoder/internal#724
This change adds an index to optimize the
`GetProvisionerDaemonsWithStatusByOrganization` query.
Execution time dropped from `18s 838ms` to `107ms`.
Closes https://github.com/coder/internal/issues/961
Likely the same deal as in #19599, the body of `require.Eventually` now fires immediately, when it used to fire after 250ms (the interval). Presumably, the deployment stats become ready before the vs code session count gets incremented. This was never an issue with the 250ms delay, as this flake has only cropped up after the testify version bump.
We'll fix the issue by making it possible to wait for a full metrics cache refresh, i.e. removing `require.Eventually` in this test altogether.
When clients disconnected from the /containers/watch endpoint, the WebSocket
connection between coderd and the agent stayed open. This caused heartbeat
traffic every 15s that was incorrectly counted as workspace activity,
extending workspace lifetimes indefinitely.
Now properly cancels the agent connection context when the client disconnects.
see https://github.com/coder/internal/issues/959 but the tl; dr is:
- we call this DB query on an interval (every 15s) and it would be
called on each coderd replica as well
- the generated values update very infrequently (for our most used
internal template I saw the builds created/claimed update twice in a 1h
period)
- we have no index on the initiator ID, so this query has to scan the
entire workspace_builds table on every request
In reality this should likely just be a Prometheus metric, and
Prometheus can handle the counter reset behaviour at query time, but for
now this should at least cut the load of the query to 25% of it's
current impact.
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
This PR improves the ruleguard rule for detecting `t.Fail` calls in goroutines. It picks up additional violations, of which are fixed in this PR.
See self-review for details.
The motivation for fixing this comes from a flake I fixed in https://github.com/coder/coder/pull/19599, where tests would fail from a `require` in an `Eventually`.
The latest release of all `pg_dump` major versions, going back to 13,
started inserting `\restrict` `\unrestrict` keywords into dumps. This
currently breaks sqlc in `gen/dump` and our check migration script. Full
details of the postgres change are available here:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=575f54d4c
To fix, we'll always use the `pg_dump` in our postgres 13.21 docker
image for schema dumps, instead of what's on the runner/local machine.
Coder doesn't restore from postgres dumps, so we're not vulnerable to
attacks that would be patched by the latest postgres version.
Regardless, we'll unpin ASAP.
Once sqlc is updated to handle these keywords, we need to start
stripping them when comparing the schema in the migration check script,
and then we can unpin the pg_dump version. This is being tracked at
https://github.com/coder/internal/issues/965
Refactors Agent instance identity to be a SessionTokenProvider.
Refactors the CLI to create Agent clients via a centralized function, rather than add-hoc via individual command handlers and their flags.
This allows commands besides `coder agent`, but which still use the agent identity, to support instance identity authentication.
Fixes#19111 by unifying all API requests to go thru the SessionTokenProvider for auth credentials.
Fixes: https://github.com/coder/internal/issues/950
Pretty sure the intention of the `hold` wait group is to try to get the two goroutines that the test starts running at the same time. But, that should be the case for two goroutines started anyway.
The use of `hold` doesn't actually guarantee concurrent execution of `Acquire`, just that both goroutines get as far as `Done()` --- the go scheduler could run them serially without incident.
So I've chosen to just remove the use of `hold` to simplify.
But, for posterity, the data race was due to incrementing by 1 in the loop along with the goroutine that calls Done. You could increment by 1 and then back down to 0 before the second iteration of the loop starts. This then causes a data race with calling `Wait()` in the first goroutine and `Add()` in the second iteration. c.f. https://pkg.go.dev/sync#WaitGroup.Add
Due to how we currently label a workspace as a task, there is a delay
between when a task workspace is created and when it is labelled as a
task.
This PR introduces fallback check for when a workspace does _not_ have
`HasAITask` set. This fallback check tests to see if the special "AI
Prompt" parameter is present in the workspace's build parameters.
* provisionerdserver: Expires prebuild user token for workspace, if it
exists, when regenerating session token.
* dbauthz: disallow prebuilds user from creating api keys
* dbpurge: added functionality to expire stale api keys owned by the
prebuilds user
This PR should resolve https://github.com/coder/internal/issues/719 by
limiting the `workspace_builds` rows selected by the query to the most
recent 100 builds of a template, as opposed to all builds in the last
30d. For our own internal templates with the most builds (1700-2000 in a
30d period) this should cut the query execution time by about 80%.
Unless we have some restriction on keeping the 30d period, contract
related or otherwise, this seems like a safe change to make. In addition
to the execution speed improvements it also means the memory for the
query is bounded as well.
If we want to keep a 30d time period for the avg build time value I
think it's worth exploring a purpose built solution such as histogram
structures where the build times could be bucketized by template ID as
they're observed.
---------
Signed-off-by: Callum Styan <callumstyan@gmail.com>
This test still flakes occasionally, see
https://github.com/coder/internal/issues/954#issuecomment-3237154735
The cause appears to be related to the assignment of `time.Now()` as the
`LastSeenAt` time when creating a provisioner which can flake with the
calculated scheduled next autostart and the code to set then
`require.Eventually` the updated provisioner LastSeenAt.
Instead we should simply calculate all time values for the stale portion
of the test based on the provisioners LastSeenAt value to avoid such
issues.
Signed-off-by: Callum Styan <callumstyan@gmail.com>