This PR removes a log field that could expose sensitive information in
agent logs for workspaces that pass such information to the agent via
its manifest.
(cherry picked from commit 1d726c81bb)
Co-authored-by: Sas Swart <sas.swart.cdk@gmail.com>
This is to debug context timeouts on API requests to the agent.
Because rbac and database cannot be imported in slim, split the logger
middleware into slim and non-slim versions and break out the recovery
middleware.
fixes https://github.com/coder/internal/issues/1123
We want to tests that ports are not included after they are no longer used, but this isn't safe on the real OS networking stack because there is no way to guarantee a port _won't_ be used. Instead, we introduce an interface and fake implementation for testing.
On order to leave the filtering logic in the test path, this PR also does some refactoring.
Caching logic is left in the real OS querying implementation and a new test case is added for it in this PR.
relates to: https://github.com/coder/internal/issues/1094
This is number 2 of 5 pull requests in an effort to add agent script
ordering. It adds a drpc API that is exposed via a local socket. This
API serves access to a lightweight DAG based dependency manager that was
inspired by systemd.
In follow-up PRs:
* This unit manager will be plumbed into the workspace agent struct.
* CLI commands will use this agentsocket api to express dependencies
between coder scripts
I used an LLM to produce some of these changes, but I have conducted
thorough self review and consider this contribution to be ready for an
external reviewer.
relates to: https://github.com/coder/internal/issues/1094
This is number 1 of 5 pull requests in an effort to add agent script
ordering. It adds a unit manager, which uses an underlying DAG and a
list of subscribers to inform units when their dependencies have changed
in status.
In follow-up PRs:
* This unit manager will be plumbed into the workspace agent struct.
* It will then be exposed to users via a new socket based drpc API
* The agentsocket API will then become accessible via CLI commands that
allow coder scripts to express their dependencies on one another.
This is an experimental feature. There may be ways to improve the
efficiency of the manager struct, but it is more important to validate
this feature with customers before we invest in such optimizations.
See the tests for examples of how units may communicate with one
another. Actual CLI usage will be analogous.
I used an LLM to produce some of these changes, but I have conducted
thorough self review and consider this contribution to be ready for an
external reviewer.
Closes https://github.com/coder/internal/issues/769
According to the `time.NewTicker` documentation [^1] (which is used
under the hood by https://github.com/coder/quartz) it will automatically
adjust the time interval to make up for slow receivers. This means we
should be safe to drop the default branch.
> NewTicker returns a new Ticker containing a channel that will send the
current time on the channel after each tick. The period of the ticks is
specified by the duration argument. The ticker will adjust the time
interval or drop ticks to make up for slow receivers. The duration d
must be greater than zero; if not, NewTicker will panic.
[^1]: https://pkg.go.dev/time#Ticker
Relates to https://github.com/coder/internal/issues/1093
This is the first of N pull requests to allow coder script ordering.
It introduces what is for now dead code, but paves the way for various
interfaces that allow coder scripts and other processes to depend on one
another via CLI commands and terraform configurations.
The next step is to add reactivity to the graph, such that changes in
the status of one vertex will propagate and allow other vertices to
change their own statuses.
Concurrency and stress testing yield the following:
CPU Profile:
<img width="1512" height="862" alt="Screenshot 2025-10-17 at 10 38 52"
src="https://github.com/user-attachments/assets/f46cf1a2-a0b2-4c02-81a0-069798108ee5"
/>
Mem Profile:
<img width="1512" height="862" alt="Screenshot 2025-10-17 at 10 38 01"
src="https://github.com/user-attachments/assets/45be1235-fff6-45ba-a50d-db9880377bd0"
/>
Predictably, lock contention and memory allocation are the largest
components of this system under stress. Nothing seems untoward.
Second flake for this test today 😮💨.
Flake seen here, though I couldn't replicate this locally, some CI
exclusive networking issue.
https://github.com/coder/coder/actions/runs/18770305895/job/53553517887?pr=20448
```
agent_test.go:3619:
Error Trace: /home/runner/work/coder/coder/agent/agent_test.go:3619
Error: Received unexpected error:
expected 1, got 0.000000:
github.com/coder/coder/v2/agent_test.TestAgent_Metrics_SSH.func7
/home/runner/work/coder/coder/agent/agent_test.go:3557
Test: TestAgent_Metrics_SSH
Messages: check fn for coderd_agentstats_currently_reachable_peers failed
```
This value is incremented by a successful ping to the peer from the
agent, which is dependent on all the networking code, which I think is
definitely out of scope of this test for agent metrics. So, we'll just
assert that the metrics exist with the correct labels (`derp`, `p2p`)
Closes https://github.com/coder/internal/issues/921
The flake in the linked issue was caused by the startup script taking longer than 1 second in CI. The existing conditional, that the startup script duration was under a second, was incorrect; the correct conditional is that the metric exists with the `success` label set to `true`.
- Ignore errors when reporting a connection from the server, just log
them instead
- Translate connection log IP `localhost` to `127.0.0.1` on both the
server and the agent
Note that the temporary fix for converting invalid IPs to localhost is
not required in main since the database no longer forbids NULL for the
IP column since https://github.com/coder/coder/pull/19788
Relates to #20194
When we added support for connection tracking in the Workspace agent, we modified the ReconnectingPTY tests to add an initial connection that we immediately hang up and check that connections are logged.
In the case of `screen`-based pty handling, hanging up the initial connection can race with the initial attachment to the `screen` process, and cause that process to exit early. This leaves subsequent connections to the same session ID to fail.
In this PR we just use different pty session IDs so that the initial connections we do to verify logging don't interfere with the rest of the test.
_Arguably_ it's a bug in our Reconnecting PTY code that hanging up immediately can leave the system in a weird state, but we do eventually recover and error out, so I don't think it's worth trying to fix.
This PR makes the initial steps at removing usage of the global Go HTTP
client, which was seen to have impacts on test flakiness in
https://github.com/coder/internal/issues/1020. The first commit removes
uses from tests, with the exception of one test that is tightly coupled
to the default client. The second commit makes easy/low-risk removals
from application code. This should have some impact to reduce test flakiness.
Relates to: https://github.com/coder/coder/issues/18101
This PR introduces a new `backedpipe` package that provides reliable
bidirectional byte streams over unreliable network connections. The
implementation includes:
- `BackedPipe`: Orchestrates a reader and writer to provide transparent
reconnection and data replay
- `BackedReader`: Handles reading with automatic reconnection, blocking
reads when disconnected
- `BackedWriter`: Maintains a ring buffer of recent writes for replay
during reconnection
- `RingBuffer`: Efficient circular buffer implementation for storing
data
The package enables resilient connections by tracking sequence numbers
and replaying missed data after reconnection. It handles connection
failures gracefully, automatically reconnecting and resuming data
transfer from the appropriate point.
Follows similarly to the bash tool (and some code to connect to an agent
was extracted from it).
There are two main parts: a new agent endpoint, and then a new MCP tool
that consumes that endpoint.
Refactors Agent instance identity to be a SessionTokenProvider.
Refactors the CLI to create Agent clients via a centralized function, rather than add-hoc via individual command handlers and their flags.
This allows commands besides `coder agent`, but which still use the agent identity, to support instance identity authentication.
Fixes#19111 by unifying all API requests to go thru the SessionTokenProvider for auth credentials.
closes https://github.com/coder/internal/issues/921
Not sure what I was thinking when I wrote this test case, but it was
relying on the connection being p2p on every ping, which is technically
and evidently not always the case. Instead we'll require a DERP peer,
and block direct connections.
Fixes https://github.com/coder/coder/issues/18350
I attempted the route of relying on just the session env vars, in hopes
that this issue was fixed in Toolbox and the process name matching was
no longer need, but it was not a fruitful endeavor and it seems to be
using the same connection logic as it did in gateway, just with new
binary and flag names.
Fixes https://github.com/coder/internal/issues/907
We convert `workspacesdk.AgentConn` to an interface and generate a mock
for it. This allows writing `coderd` tests that rely on the agent's HTTP
api to not have to set up an entire tailnet networking stack.
Fixes https://github.com/coder/coder/issues/19372
We increase the read limit to 4MiB (we use this limit elsewhere). We
also make sure to stop sending messages when `containersCh` becomes
closed.
Closes https://github.com/coder/internal/issues/884
We're adding this as a `go run` in `lint/go` for now, since adding it to
golangci-lint ourselves involves recompiling golangci-lint and then
running that new binary. I'll look into proposing it being added to the
public golangci-lint linters.
Doesn't appear to cause the lint ci job to take any longer, which is
nice.
fixes https://github.com/coder/internal/issues/878
On my dev system it takes 900ms, but looking at timestamps in CI it took
25 seconds. Bumping timeout to 60s.
Also fixes the segfault.
fixes https://github.com/coder/internal/issues/863
We read an output file in a loop, but this could lead to races where the other process has created the file but not written, or a partial write in progress. Fix is to retry if the content is shorter than we expect.
We disable the logic that allows autostarting discovered devcontainers
by default. We want this behavior to be opt-in rather than opt-out.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
As it turns out, prebuilds + devcontainers appear to already work
together. This PR has created a test that simulates a prebuild claim
happening to `agentcontainers.API`, to see how we handle it.
Closes https://github.com/coder/internal/issues/711
When a `devcontainer.json` has been found and it has `.customizations.coder.autoStart = true`, we will now auto start this dev container.
Relates to https://github.com/coder/internal/issues/711
This PR implements a project discovery mechanism that searches for any
dev container projects and makes them visible in the UI so that they can
be started. To make the wording on the site more clear, "Rebuild" has
been changed to "Start" when there is no container associated with a
known dev container configuration. I've also made it so that site will
show the dev container config path when there is no other name
available.
### Design decisions
Just want to ensure my explanation for a few design decisions are noted
down:
- We only search for dev container configurations inside git
repositories
- We only search for these git repositories if they're at the top level
or a direct child of the agent directory.
This limited approach is to reduce the amount of files we ultimately
walk when trying to find these projects. It makes sense to limit it to
only the agent directory, although I'm open to expanding how deep we
search.
The agentsdk currently does a remap of the DERP map to change the
EmbeddedRelay node's URL to match the agent's access URL.
This PR makes changes to the `workspacesdk` (used by clients like the
CLI) and `vpn` (used by Coder Desktop) to match this behavior.
This enables us the ability to try Coder clients in dogfood over a VPN
without changing the global access URL.
This (week-old) test was failing in my workspace because I use fish shell.
I really do not like that Fish shell does not support `$?`, but I also do like Fish shell! We have a few people at Coder who use it who would appreciate this change.
This change allows a devcontainer to be opened via the agent syntax,
`coder open vscode <workspace>.<agent>` and removes the `--container`
option to simplify the subcommand. Accessing the subagent will behave
similarly to how the `--container` option behaved.
Fixescoder/internal#748