Fixes https://github.com/coder/internal/issues/1035
Or, at least, closes a remaining race that seems pretty likely.
The tests in question write a file, close the file, then execute the file. Sometimes Linux errors saying "text file busy" which means the file is still open for writing.
What I think is going on is:
1. Test_sshConfigProxyCommandEscape goroutine opens the file and begins writing.
2. Some other, unrelated test execs a command, which causes a `fork()` syscall. The child process now has a copy of the file descriptor to our open file.
3. Test_sshConfigProxyCommandEscape goroutine executes the file and gets "text file busy".
4. The child process calls the `exec` syscall, which closes the file (due to `CLOEXEC` being set).
The race is very tight because 3 has to happen before 4 (and, 3 involves it's own fork/exec), but it's not impossible on a busy system.
c.f. #14233 which was an earlier attempt to fix this. It only prevented the subtests from running in parallel. When the subtests were all running in parallel, the flake was fairly likely because you've got all this fork() activity happening at the same time. But, since the main test was in parallel there is still a chance a totally different test is `fork`'ing at in inopportune time.
Builds upon https://github.com/coder/coder/pull/19970
I got kinda carried away when I saw the extra stuff we could add in
here, so I went ahead and added it:
* User ID
* Organization IDs
* Roles
This technically duplicates functionality from `coder users show` but I
figure folks may find it useful.
* Improves logic for `exp task status --watch` so that it will also exit
on task idle status.
* Adds workspace agent health to `exp task status` output.
This changes the task get endpoint to omit app statuses for previous
'lifetimes' of a workspace.
It also introduces a [breaking
change](https://github.com/coder/coder/blob/release/2.26/codersdk/aitasks.go#L83)
to bring `TaskStateComplete` in line with
`WorkspaceAppStatusStateComplete`. I can alternatively revert this
change and add a conversion function between the two SDK types.
This addresses a long-standing gripe of mine: to get your logged in
username you would have to do
```bash
coder whoami | awk '{print $9}'
```
This allows you to do:
```
coder whoami -o json | jq -r '.username'
```
or
```
coder whoami -f table -c username
```
Closes#19812
## Problem
> When I try to SSH into my workspace with multiple agents. It does not
provide an intuitive way to do that successfully and instead misguides
by printing wrong instructions.
This PR enhances the error handling to provide suggestions with SSH
commands that users can copy and paste directly.
Before:
```
Encountered an error running "coder ssh", see "coder ssh --help" for more information
error: multiple agents found, please specify the agent name, available agents: [coder dev]
```
After:
```
Encountered an error running "coder ssh", see "coder ssh --help" for more information
error: multiple agents found, please specify the agent name, available agents: [coder dev]
Try running:
$ ssh coder.dogfood.me.coder
$ ssh dev.dogfood.me.coder
```
As part of converting production code to use the new ClientBuilder, I noticed some dead code that creates a client with a URL for the only purpose of later accessing the URL. This PR removes the cruft.
Refactors the CLI to create the `*codersdk.Client` in the handlers. This is groundwork for changing the `rootCmd.InitClient()` to use the new `ClientOption`s.
It also improves variable locality, scoping the Client to the handler. This makes misuse less likely and reduces the memory allocations to just the command being executed, rather than allocating a Client for every command regardless of whether it is executed.
Relates to https://github.com/coder/internal/issues/985.
Some scaletest runners would autogenerate names if they weren't supplied on the config, while others required a name be supplied, and a name was autogenerated in the CLI command handler. This PR unifies the runners to make names and emails optional on each config, and generate them in the scaletest runner if omitted.
The create user runner in the PR above in the stack will do this too.
Solves #15575
Adds OAuth access token revocation when unlinking external auth
provider. Due to revocation not being consistently implemented by
providers this is only best effort attempt. Unsuccessful revocation
won't influence link removal.
fixes https://github.com/coder/internal/issues/946
Some tests tear down the server before we are done with PostgreSQL work, and the default `clitest` infrastructure fails the test if any errors like that are thrown. This PR modifies the tests like that to ignore postgreSQL errors like this.
fixes https://github.com/coder/internal/issues/966
TestCloserStack_Timeout creates `asyncCloser`s which allow control over the exact timing and order of their close method returning. They also, as a final backstop will throw an error if the test context ends before they are unblocked.
TestCloserStack_Timeout unblocks all `asyncCloser`s in a defer and then ends the test. This defer _unblocks_ the running close goroutines, but does not wait for them to finish. Since the test context is canceled as soon as the test completes, this creates a race condition where the close goroutines can trigger the context cancelled arm of the `select` statement.
The fix is to both unblock and wait for all close goroutines to complete before ending the test and cancelling the context.
## Description
Adds support for sending an ad‑hoc custom notification to the
authenticated user via API and CLI. This is useful for surfacing the
result of scripts or long‑running tasks. Notifications are delivered
through the configured method and the dashboard Inbox, respecting
existing preferences and delivery settings.
## Changes
* New notification template: “Custom Notification” with a label for a
custom title and a custom message.
* New API endpoint: `POST /api/v2/notifications/custom` to send a custom
notification to the requesting user.
* New API endpoint: `GET /notifications/templates/custom` to get custom
notification template.
* New CLI subcommand: `coder notifications custom <title> <message>` to
send a custom notification to the requesting user.
* Documentation updates: Add a “Custom notifications” section under
Administration > Monitoring > Notifications, including instructions on
sending custom notifications and examples of when to use them.
Closes: https://github.com/coder/coder/issues/19611
Adds a `sharing add` command for sharing Workspaces with other users and
groups.
The command allows sharing with multiple users, and groups within one
command as well as specifying the role (`use`, or `admin`) defaulting to
`use` if none is specified.
In the current implementation when the command completes we show the
user the current state of the workspace ACL.
```
$ coder sharing add apricot-catfish-86 --user=member:admin --group=contractors:use
USER GROUP ROLE
member - admin
member contractors use
```
If a user is a part of multiple groups, or the workspace has been
individually shared with them they will show up multiple times. Although
this is a bit confusing at first glance it's important to be able to
tell what the maximum role a user may have, and via what ACL they have
it.
---
One piece of UX to consider is that in order to be able to share a
Workspace with a user they must have a role that can read that user. In
the tests we give the user the `ScopedRoleOrgAuditor` role.
Closes
[coder/internal#859](https://github.com/coder/internal/issues/859)
In trying to address confusion with the `-` (for stdin) directory flag last year, I had `template push` read from stdin if stdin was not a TTY. However, I made the mistake of checking if the directory flag was set or not by comparing it to the default value. This meant in something like GitHub Actions, where you don't have a TTY for stdin, it was impossible to read from the current working directory. The fix is just to check if the flag was explicitly set, using pflags.
If users encounter this bug, and this fix is unavailable in their version of Coder, they can workaround it by setting `-d "$(pwd)"`
Fixes https://github.com/coder/internal/issues/933
Refactors CLI tests that check the `--auth` flag parsing for various public clouds into a unit test that just creates the agent Client and asserts on the type.
Testing that the agent client actually authenticates correctly with these auth types is well covered by Coderd tests, so we don't need to retread that ground here, and the deleted tests were flaky on Windows.
Refactors Agent instance identity to be a SessionTokenProvider.
Refactors the CLI to create Agent clients via a centralized function, rather than add-hoc via individual command handlers and their flags.
This allows commands besides `coder agent`, but which still use the agent identity, to support instance identity authentication.
Fixes#19111 by unifying all API requests to go thru the SessionTokenProvider for auth credentials.
Relates to https://github.com/coder/internal/issues/893
Instead of `coder task create <template> --input <input>`, it is now
`coder task create <input> --template <template>`.
If there is only one AI task template on the deployment, the
`--template` parameter can be omitted.
Refactors `codersdk.Client`'s use of session tokens to use a `SessionTokenProvider`, which abstracts the obtaining and storing of the session token.
The main motiviation is to unify Agent authentication an an upstack PR, which can use cloud instance identity via token exchange, rather than a fixed session token.
However, the abstraction could also allow functionality like obtaining the session token from other external sources like the OS credential manager, or an external secret/key management system like Vault.
Relates to https://github.com/coder/internal/issues/888
As part of our renewed connection scaletesting efforts, we want to
scaletest coder in a scenario where direct connections aren't available
(relatively common for our customers), and all concurrent connections
are relayed via DERP.
This PR adds a flag, `--disable-direct` that can be included on the
existing`coder exp scaletest workspace-traffic -ssh` to disable direct
connections.
Closes https://github.com/coder/internal/issues/949
Adds the following fields to `codersdk.Task`
- OwnerName
- TemplateName
- TemplateDisplayName
- TemplateIcon
- WorkspaceAgentID
- WorkspaceAgentLifecycle
- WorkspaceAgentHealth
The implementation is unfortunately not compatible with multiple agents
as we have no reliable way to tell which agent has the AI task running
in it. For now we just pick the first agent found, but in the future
this will need to be changed.
## Description
This PR introduces one counter and two histograms related to workspace
creation and claiming. The goal is to provide clearer observability into
how workspaces are created (regular vs prebuild) and the time cost of
those operations.
### `coderd_workspace_creation_total`
* Metric type: Counter
* Name: `coderd_workspace_creation_total`
* Labels: `organization_name`, `template_name`, `preset_name`
This counter tracks whether a regular workspace (not created from a
prebuild pool) was created using a preset or not.
Currently, we already expose `coderd_prebuilt_workspaces_claimed_total`
for claimed prebuilt workspaces, but we lack a comparable metric for
regular workspace creations. This metric fills that gap, making it
possible to compare regular creations against claims.
Implementation notes:
* Exposed as a `coderd_` metric, consistent with other workspace-related
metrics (e.g. `coderd_api_workspace_latest_build`:
https://github.com/coder/coder/blob/main/coderd/prometheusmetrics/prometheusmetrics.go#L149).
* Every `defaultRefreshRate` (1 minute ), DB query
`GetRegularWorkspaceCreateMetrics` is executed to fetch all regular
workspaces (not created from a prebuild pool).
* The counter is updated with the total from all time (not just since
metric introduction). This differs from the histograms below, which only
accumulate from their introduction forward.
### `coderd_workspace_creation_duration_seconds` &
`coderd_prebuilt_workspace_claim_duration_seconds`
* Metric types: Histogram
* Names:
* `coderd_workspace_creation_duration_seconds`
* Labels: `organization_name`, `template_name`, `preset_name`, `type`
(`regular`, `prebuild`)
* `coderd_prebuilt_workspace_claim_duration_seconds`
* Labels: `organization_name`, `template_name`, `preset_name`
We already have `coderd_provisionerd_workspace_build_timings_seconds`,
which tracks build run times for all workspace builds handled by the
provisioner daemon.
However, in the context of this issue, we are only interested in
creation and claim build times, not all transitions; additionally, this
metric does not include `preset_name`, and adding it there would
significantly increase cardinality. Therefore, separate more focused
metrics are introduced here:
* `coderd_workspace_creation_duration_seconds`: Build time to create a
workspace (either a regular workspace or the build into a prebuild pool,
for prebuild initial provisioning build).
* `coderd_prebuilt_workspace_claim_duration_seconds`: Time to claim a
prebuilt workspace from the pool.
The reason for two separate histograms is that:
* Creation (regular or prebuild): provisioning builds with similar time
magnitude, generally expected to take longer than a claim operation.
* Claim: expected to be a much faster provisioning build.
#### Native histogram usage
Provisioning times vary widely between projects. Using static buckets
risks unbalanced or poorly informative histograms.
To address this, these metrics use [Prometheus native
histograms](https://prometheus.io/docs/specs/native_histograms/):
* First introduced in Prometheus v2.40.0
* Recommended stable usage from v2.45+
* Requires Go client `prometheus/client_golang` v1.15.0+
* Experimental and must be explicitly enabled on the server
(`--enable-feature=native-histograms`)
For compatibility, we also retain a classic bucket definition (aligned
with the existing provisioner metric:
https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L182-L189).
* If native histograms are enabled, Prometheus ingests the
high-resolution histogram.
* If not, it falls back to the predefined buckets.
Implementation notes:
* Unlike the counter, these histograms are updated in real-time at
workspace build job completion.
* They reflect data only from the point of introduction forward (no
historical backfill).
## Relates to
Closes: https://github.com/coder/coder/issues/19528
Native histograms tested in observability stack:
https://github.com/coder/observability/pull/50
Closes https://github.com/coder/internal/issues/942
The flakey test, `RemoteForwardUnixSocket`, was using `netstat` to check if the unix socket was forwarded properly. In the flake, it looks like netstat was hanging. This PR has `RemoteForwardUnixSocket` be rewritten to match the implementation of `RemoteForwardMultipleUnixSockets`, where we send bytes over the socket in-process instead. More importantly, that test hasn't flaked (yet).
Note: The implementation has been copied directly from the other test, comments and all.
Partially implements https://github.com/coder/internal/issues/893
This isn't the full implementation of `coder exp tasks create` as
defined in the issue, but it is the minimum required to create a task.