Commit Graph

536 Commits

Author SHA1 Message Date
Jon Ayers 8b2f472f71 chore: use old slog (#21959) 2026-02-05 16:35:41 -06:00
Jon Ayers b275be2e7a chore: backport fixes (#21957) 2026-02-05 16:09:41 -06:00
blinkagent[bot] ba71b321bc fix: remove a sensitive field from an agent log line (#20968) (#21063)
This PR removes a log field that could expose sensitive information in
agent logs for workspaces that pass such information to the agent via
its manifest.

(cherry picked from commit 1d726c81bb)

Co-authored-by: Sas Swart <sas.swart.cdk@gmail.com>
2025-12-02 11:33:50 -06:00
Sas Swart abe66a38eb feat: implement agent socket api, client and cli (#20758) (#20976) 2025-12-01 14:07:40 -06:00
Asher c266bb830c chore: add debug logging and recovery to agent api requests (#20785)
This is to debug context timeouts on API requests to the agent.

Because rbac and database cannot be imported in slim, split the logger
middleware into slim and non-slim versions and break out the recovery
middleware.
2025-11-25 14:59:20 -09:00
Spike Curtis afd40436f0 fix: mock Agent querying OS for listening ports in tests (#20842)
fixes https://github.com/coder/internal/issues/1123

We want to test that ports are not included after they are no longer used, but this isn't safe on the real OS networking stack because there is no way to guarantee a port _won't_ be used. Instead, we introduce an interface and fake implementation for testing.

In order to leave the filtering logic in the test path, this PR also does some refactoring.

Caching logic is left in the real OS querying implementation and a new test case is added for it in this PR.
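
The interface-plus-fake approach described above could look roughly like the sketch below. The names `portLister`, `fakeLister`, and `filterPorts`, and the "drop ports below a minimum" rule, are hypothetical and not taken from the PR; they only illustrate how filtering logic can stay in the tested path while the OS query is replaced by a deterministic fake.

```go
package main

import "fmt"

// portLister is a hypothetical interface that abstracts "ask the OS which
// ports are listening" so that tests can substitute a fake.
type portLister interface {
	ListeningPorts() ([]uint16, error)
}

// fakeLister is a test double that reports whatever ports the test sets.
type fakeLister struct {
	ports []uint16
}

func (f *fakeLister) ListeningPorts() ([]uint16, error) {
	// Return a copy so callers cannot mutate the fake's state.
	return append([]uint16(nil), f.ports...), nil
}

// filterPorts holds the (assumed) filtering logic that stays in the tested
// path: it drops any port below minPort.
func filterPorts(l portLister, minPort uint16) ([]uint16, error) {
	all, err := l.ListeningPorts()
	if err != nil {
		return nil, err
	}
	var out []uint16
	for _, p := range all {
		if p >= minPort {
			out = append(out, p)
		}
	}
	return out, nil
}

func main() {
	fake := &fakeLister{ports: []uint16{22, 8080}}
	ports, _ := filterPorts(fake, 1024)
	fmt.Println(ports)

	// Simulate the port being closed; with a fake this is deterministic.
	fake.ports = []uint16{22}
	ports, _ = filterPorts(fake, 1024)
	fmt.Println(len(ports))
}
```

Because the fake fully controls which ports are reported, a test can assert that a port disappears without depending on what else is listening on the host.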
2025-11-25 14:25:24 +04:00
Sas Swart 2840fdcb54 feat(agent): add agent socket API (#20717)
relates to: https://github.com/coder/internal/issues/1094

This is number 2 of 5 pull requests in an effort to add agent script
ordering. It adds a drpc API that is exposed via a local socket. This
API serves access to a lightweight DAG based dependency manager that was
inspired by systemd.

In follow-up PRs:

* This unit manager will be plumbed into the workspace agent struct.
* CLI commands will use this agentsocket api to express dependencies
between coder scripts

I used an LLM to produce some of these changes, but I have conducted
thorough self review and consider this contribution to be ready for an
external reviewer.
2025-11-21 13:09:27 +02:00
Sas Swart 500c17e257 feat(agent): add agent unit manager (#20715)
relates to: https://github.com/coder/internal/issues/1094

This is number 1 of 5 pull requests in an effort to add agent script
ordering. It adds a unit manager, which uses an underlying DAG and a
list of subscribers to inform units when their dependencies have changed
in status.
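
As a rough illustration of the idea (not the PR's implementation), a unit manager that notifies subscribers when a dependency changes status might be sketched as follows. All type and method names here are hypothetical, and the sketch omits locking and cycle detection for brevity.

```go
package main

import "fmt"

type status string

const (
	statusPending  status = "pending"
	statusComplete status = "complete"
)

// manager is a hypothetical, single-goroutine unit manager.
type manager struct {
	deps     map[string][]string                     // unit -> units it depends on
	statuses map[string]status                       // unit -> last reported status
	subs     map[string][]func(dep string, s status) // unit -> callbacks for dependency changes
}

func newManager() *manager {
	return &manager{
		deps:     map[string][]string{},
		statuses: map[string]status{},
		subs:     map[string][]func(dep string, s status){},
	}
}

// addDependency records that unit depends on dep.
func (m *manager) addDependency(unit, dep string) {
	m.deps[unit] = append(m.deps[unit], dep)
}

// subscribe registers fn to be called when one of unit's dependencies
// changes status.
func (m *manager) subscribe(unit string, fn func(dep string, s status)) {
	m.subs[unit] = append(m.subs[unit], fn)
}

// setStatus updates a unit's status and informs every unit that depends on it.
func (m *manager) setStatus(unit string, s status) {
	m.statuses[unit] = s
	for dependent, deps := range m.deps {
		for _, d := range deps {
			if d != unit {
				continue
			}
			for _, fn := range m.subs[dependent] {
				fn(unit, s)
			}
		}
	}
}

// ready reports whether all of unit's dependencies are complete.
func (m *manager) ready(unit string) bool {
	for _, d := range m.deps[unit] {
		if m.statuses[d] != statusComplete {
			return false
		}
	}
	return true
}

func main() {
	m := newManager()
	m.addDependency("app", "db")
	m.subscribe("app", func(dep string, s status) {
		fmt.Printf("app: dependency %s is now %s\n", dep, s)
	})
	fmt.Println("app ready:", m.ready("app"))
	m.setStatus("db", statusComplete)
	fmt.Println("app ready:", m.ready("app"))
}
```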

In follow-up PRs:
* This unit manager will be plumbed into the workspace agent struct. 
* It will then be exposed to users via a new socket based drpc API 
* The agentsocket API will then become accessible via CLI commands that
allow coder scripts to express their dependencies on one another.

This is an experimental feature. There may be ways to improve the
efficiency of the manager struct, but it is more important to validate
this feature with customers before we invest in such optimizations.

See the tests for examples of how units may communicate with one
another. Actual CLI usage will be analogous.

I used an LLM to produce some of these changes, but I have conducted
thorough self review and consider this contribution to be ready for an
external reviewer.
2025-11-19 19:03:37 +02:00
Asher 643fe38b1e fix: use temp file on same device with mcp file edit (#20477)
Otherwise you can get errors like "invalid cross-device link".
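
A minimal sketch of the general technique, assuming the usual write-then-rename pattern (the function name `writeFileAtomic` is hypothetical and not taken from the PR): creating the temp file in the target's own directory keeps both paths on one filesystem, so the rename cannot cross devices.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temporary file created in the same
// directory as path, then renames it into place. Because both files live in
// one directory, they are on one filesystem and os.Rename does not fail with
// a cross-device link error.
func writeFileAtomic(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	// Best-effort cleanup; after a successful rename the temp name no longer
	// exists and the error from Remove is ignored.
	defer os.Remove(tmp.Name())

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// demo writes a file in a scratch directory and returns what was written.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "atomic-example")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	target := filepath.Join(dir, "file.txt")
	if err := writeFileAtomic(target, []byte("hello")); err != nil {
		return "", err
	}
	got, err := os.ReadFile(target)
	return string(got), err
}

func main() {
	got, err := demo()
	fmt.Println(got, err)
}
```

By contrast, a temp file created under the system temp directory may sit on a different mount than the target, which is the situation that produces the error quoted above.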
2025-10-29 12:23:06 -08:00
Danielle Maywood e4e4669feb fix(agent/agentcontainers): remove unneeded default branch (#20511)
Closes https://github.com/coder/internal/issues/769

According to the `time.NewTicker` documentation [^1] (which is used
under the hood by https://github.com/coder/quartz) it will automatically
adjust the time interval to make up for slow receivers. This means we
should be safe to drop the default branch.

> NewTicker returns a new Ticker containing a channel that will send the
current time on the channel after each tick. The period of the ticks is
specified by the duration argument. The ticker will adjust the time
interval or drop ticks to make up for slow receivers. The duration d
must be greater than zero; if not, NewTicker will panic.

[^1]: https://pkg.go.dev/time#Ticker
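
For illustration, a loop that simply blocks on the ticker channel, with no `default` branch in the `select`, could look like the sketch below. It uses the standard library's `time.NewTicker` directly rather than quartz, and `countTicks` is a hypothetical helper, not code from the PR.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// countTicks blocks on the ticker channel (no default branch) and returns the
// number of ticks received before ctx is done.
func countTicks(ctx context.Context, interval time.Duration) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	n := 0
	for {
		select {
		case <-ctx.Done():
			return n
		case <-ticker.C:
			// A slow receiver here does not block the ticker; per the
			// documentation quoted above, ticks are adjusted or dropped.
			n++
		}
	}
}

// runDemo counts ticks for a short, fixed window.
func runDemo() int {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	return countTicks(ctx, 10*time.Millisecond)
}

func main() {
	fmt.Println("ticks:", runDemo())
}
```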
2025-10-28 12:16:42 +00:00
Sas Swart 6c621364f8 feat: add a dependency management graph for agents (#20208)
Relates to https://github.com/coder/internal/issues/1093

This is the first of N pull requests to allow coder script ordering.
It introduces what is for now dead code, but paves the way for various
interfaces that allow coder scripts and other processes to depend on one
another via CLI commands and terraform configurations.

The next step is to add reactivity to the graph, such that changes in
the status of one vertex will propagate and allow other vertices to
change their own statuses.

Concurrency and stress testing yield the following:

CPU Profile:
<img width="1512" height="862" alt="Screenshot 2025-10-17 at 10 38 52"
src="https://github.com/user-attachments/assets/f46cf1a2-a0b2-4c02-81a0-069798108ee5"
/>

Mem Profile:
<img width="1512" height="862" alt="Screenshot 2025-10-17 at 10 38 01"
src="https://github.com/user-attachments/assets/45be1235-fff6-45ba-a50d-db9880377bd0"
/>

Predictably, lock contention and memory allocation are the largest
components of this system under stress. Nothing seems untoward.
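
As a rough sketch of the kind of structure described (hypothetical names, no locking, and not the PR's code), a dependency graph that refuses edges which would introduce a cycle might look like this:

```go
package main

import (
	"errors"
	"fmt"
)

// graph is a hypothetical minimal dependency graph: edges[a] lists the
// vertices that a depends on.
type graph struct {
	edges map[string][]string
}

func newGraph() *graph {
	return &graph{edges: map[string][]string{}}
}

var errCycle = errors.New("dependency would create a cycle")

// reaches reports whether `to` can be reached from `from` by following
// dependency edges.
func (g *graph) reaches(from, to string, seen map[string]bool) bool {
	if from == to {
		return true
	}
	if seen[from] {
		return false
	}
	seen[from] = true
	for _, next := range g.edges[from] {
		if g.reaches(next, to, seen) {
			return true
		}
	}
	return false
}

// addDependency records that unit depends on dep, rejecting any edge that
// would make the graph cyclic.
func (g *graph) addDependency(unit, dep string) error {
	if g.reaches(dep, unit, map[string]bool{}) {
		return errCycle
	}
	g.edges[unit] = append(g.edges[unit], dep)
	return nil
}

func main() {
	g := newGraph()
	fmt.Println(g.addDependency("app", "db"))
	fmt.Println(g.addDependency("db", "network"))
	fmt.Println(g.addDependency("network", "app"))
}
```

A concurrent version would need to guard the map with a mutex, which is consistent with lock contention showing up in the profiles above.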
2025-10-24 16:18:16 +02:00
Ethan 33b42fca7a test: fix flake in TestAgent_Metrics_SSH (#20450)
Second flake for this test today 😮‍💨.

Flake seen here, though I couldn't replicate this locally, some CI
exclusive networking issue.

https://github.com/coder/coder/actions/runs/18770305895/job/53553517887?pr=20448
```
    agent_test.go:3619: 
        	Error Trace:	/home/runner/work/coder/coder/agent/agent_test.go:3619
        	Error:      	Received unexpected error:
        	            	expected 1, got 0.000000:
        	            	    github.com/coder/coder/v2/agent_test.TestAgent_Metrics_SSH.func7
        	            	        /home/runner/work/coder/coder/agent/agent_test.go:3557
        	Test:       	TestAgent_Metrics_SSH
        	Messages:   	check fn for coderd_agentstats_currently_reachable_peers failed
```
This value is incremented by a successful ping to the peer from the
agent, which is dependent on all the networking code, which I think is
definitely out of scope of this test for agent metrics. So, we'll just
assert that the metrics exist with the correct labels (`derp`, `p2p`).
2025-10-24 17:28:57 +11:00
Ethan 86ef3fb497 test: fix flake in TestAgent_Metrics_SSH (#20447)
Closes https://github.com/coder/internal/issues/921

The flake in the linked issue was caused by the startup script taking longer than 1 second in CI. The existing conditional, that the startup script duration was under a second, was incorrect; the correct conditional is that the metric exists with the `success` label set to `true`.
2025-10-24 14:06:25 +11:00
Dean Sheather 6c99d5eca2 fix: avoid connection logging crashes in agent (#20307)
- Ignore errors when reporting a connection from the server, just log
them instead
- Translate connection log IP `localhost` to `127.0.0.1` on both the
server and the agent
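
The second point can be sketched as a small normalization step before parsing. The helper name `normalizeIP` is hypothetical, and the real code may differ in where and how it performs the translation.

```go
package main

import (
	"fmt"
	"net/netip"
)

// normalizeIP is a hypothetical helper that maps the hostname "localhost" to
// the loopback address before parsing, so that the result is always a valid
// IP address or an error.
func normalizeIP(s string) (netip.Addr, error) {
	if s == "localhost" {
		s = "127.0.0.1"
	}
	return netip.ParseAddr(s)
}

func main() {
	addr, err := normalizeIP("localhost")
	fmt.Println(addr, err)

	// Anything that is still not an IP is reported as an error, which the
	// caller can log instead of crashing.
	_, err = normalizeIP("not-an-ip")
	fmt.Println(err != nil)
}
```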

Note that the temporary fix for converting invalid IPs to localhost is
not required in main since the database no longer forbids NULL for the
IP column since https://github.com/coder/coder/pull/19788

Relates to #20194
2025-10-16 01:56:43 +11:00
Spike Curtis 5807fe01e4 test: prevent TestAgent_ReconnectingPTY connection reporting check from interfering (#20210)
When we added support for connection tracking in the Workspace agent, we modified the ReconnectingPTY tests to add an initial connection that we immediately hang up and check that connections are logged.

In the case of `screen`-based pty handling, hanging up the initial connection can race with the initial attachment to the `screen` process, and cause that process to exit early. This causes subsequent connections to the same session ID to fail.

In this PR we just use different pty session IDs so that the initial connections we do to verify logging don't interfere with the rest of the test.

_Arguably_ it's a bug in our Reconnecting PTY code that hanging up immediately can leave the system in a weird state, but we do eventually recover and error out, so I don't think it's worth trying to fix.
2025-10-08 16:23:46 +04:00
Zach 4d1003eace fix: remove initial global HTTP client usage (#20128)
This PR takes the initial steps toward removing usage of the global Go HTTP
client, which was seen to contribute to test flakiness in
https://github.com/coder/internal/issues/1020. The first commit removes
uses from tests, with the exception of one test that is tightly coupled
to the default client. The second commit makes easy/low-risk removals
from application code. This should help reduce test flakiness.
2025-10-02 11:43:13 -06:00
Asher be7aa58075 feat: add coder_workspace_ls MCP tool (#19652) 2025-09-12 15:57:15 -08:00
Asher 30330abaea feat: add coder_workspace_edit_file MCP tool (#19629) 2025-09-12 15:36:14 -08:00
Michael Suchacz 336e62bc37 fix: deflake BackedWriter tests (#19802) 2025-09-12 14:00:08 +00:00
Asher d5a02d570f feat: add coder_workspace_write_file MCP tool (#19591) 2025-09-11 12:17:15 -08:00
Michael Suchacz 4c98decfb7 chore: add backed reader, writer and pipe implementation (#19147)
Relates to: https://github.com/coder/coder/issues/18101

This PR introduces a new `backedpipe` package that provides reliable
bidirectional byte streams over unreliable network connections. The
implementation includes:

- `BackedPipe`: Orchestrates a reader and writer to provide transparent
reconnection and data replay
- `BackedReader`: Handles reading with automatic reconnection, blocking
reads when disconnected
- `BackedWriter`: Maintains a ring buffer of recent writes for replay
during reconnection
- `RingBuffer`: Efficient circular buffer implementation for storing
data

The package enables resilient connections by tracking sequence numbers
and replaying missed data after reconnection. It handles connection
failures gracefully, automatically reconnecting and resuming data
transfer from the appropriate point.
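
To make the ring buffer idea concrete, here is a minimal sketch of a fixed-capacity byte buffer that overwrites its oldest data once full. It is an illustration only: the type and method names are hypothetical, it assumes a positive capacity, and it leaves out the sequence-number tracking and locking that a replay-capable writer would need.

```go
package main

import "fmt"

// ringBuffer is a hypothetical fixed-capacity byte buffer that keeps only the
// most recent bytes written to it.
type ringBuffer struct {
	buf   []byte
	start int // index of the oldest byte
	size  int // number of valid bytes
}

// newRingBuffer allocates a buffer; capacity must be greater than zero.
func newRingBuffer(capacity int) *ringBuffer {
	return &ringBuffer{buf: make([]byte, capacity)}
}

// Write appends p, overwriting the oldest bytes once the buffer is full.
func (r *ringBuffer) Write(p []byte) {
	for _, b := range p {
		end := (r.start + r.size) % len(r.buf)
		r.buf[end] = b
		if r.size < len(r.buf) {
			r.size++
		} else {
			// The buffer was full, so the write replaced the oldest byte;
			// advance start to the next-oldest one.
			r.start = (r.start + 1) % len(r.buf)
		}
	}
}

// Bytes returns the buffered data in write order, oldest first.
func (r *ringBuffer) Bytes() []byte {
	out := make([]byte, 0, r.size)
	for i := 0; i < r.size; i++ {
		out = append(out, r.buf[(r.start+i)%len(r.buf)])
	}
	return out
}

func main() {
	rb := newRingBuffer(4)
	rb.Write([]byte("abcdef"))
	fmt.Println(string(rb.Bytes())) // prints "cdef"
}
```

After a reconnection, a writer built on such a buffer could replay whatever suffix of the retained bytes the peer reports as missing.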
2025-09-11 14:05:14 +02:00
Asher 4bf63b4068 feat: add coder_workspace_read_file MCP tool (#19562)
Follows similarly to the bash tool (and some code to connect to an agent
was extracted from it).

There are two main parts: a new agent endpoint, and then a new MCP tool
that consumes that endpoint.
2025-09-09 15:12:24 -08:00
Spike Curtis 1354d84eb4 chore: refactor instance identity to be a SessionTokenProvider (#19566)
Refactors Agent instance identity to be a SessionTokenProvider.

Refactors the CLI to create Agent clients via a centralized function, rather than ad hoc via individual command handlers and their flags.

This allows commands besides `coder agent`, but which still use the agent identity, to support instance identity authentication.

Fixes #19111 by unifying all API requests to go through the SessionTokenProvider for auth credentials.
2025-09-03 10:38:42 +04:00
Ethan 51d8a05301 test: disable direct connections for a deterministic reachable peers metric (#19458)
closes https://github.com/coder/internal/issues/921

Not sure what I was thinking when I wrote this test case, but it was
relying on the connection being p2p on every ping, which is technically
and evidently not always the case. Instead we'll require a DERP peer,
and block direct connections.
2025-08-21 11:46:56 +10:00
Garrett Delfosse dd867bd743 fix: fix jetbrains toolbox connection tracking (#19348)
Fixes https://github.com/coder/coder/issues/18350

I attempted the route of relying on just the session env vars, in hopes
that this issue was fixed in Toolbox and the process name matching was
no longer needed, but it was not a fruitful endeavor and it seems to be
using the same connection logic as it did in gateway, just with new
binary and flag names.
2025-08-20 08:39:08 -04:00
Danielle Maywood 5e84d257b7 refactor: convert workspacesdk.AgentConn to an interface (#19392)
Fixes https://github.com/coder/internal/issues/907

We convert `workspacesdk.AgentConn` to an interface and generate a mock
for it. This allows writing `coderd` tests that rely on the agent's HTTP
api to not have to set up an entire tailnet networking stack.
2025-08-20 10:00:44 +01:00
Danielle Maywood 23c494f36b fix(agent/agentcontainers): resolve symlink in tests (#19440)
Fixes https://github.com/coder/internal/issues/917
2025-08-20 09:32:28 +01:00
Danielle Maywood e8795269e4 fix: resolve TestAPI/Error/DuringInjection flake (#19407)
Resolves https://github.com/coder/internal/issues/905
2025-08-19 12:23:37 +01:00
Dean Sheather c6c8b00b07 chore: require nolint for testutil.RunRetry (#19394) 2025-08-19 00:48:10 +10:00
Dean Sheather e2ba9e7d62 chore: retry TestAgent_Dial subtests (#19387)
Closes https://github.com/coder/internal/issues/595
2025-08-18 13:51:19 +00:00
Danielle Maywood 205eb29e60 fix: stop reading closed channel for /watch devcontainers endpoint (#19373)
Fixes https://github.com/coder/coder/issues/19372

We increase the read limit to 4MiB (we use this limit elsewhere). We
also make sure to stop sending messages when `containersCh` becomes
closed.
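
The second part of the fix follows a common Go pattern: check the `ok` value of a channel receive and stop once the channel is closed. The sketch below is a generic illustration with hypothetical names (`forward`, `updates`, `send`), not the endpoint's actual code.

```go
package main

import "fmt"

// forward relays values from updates to send until either done is closed or
// updates is closed.
func forward(done <-chan struct{}, updates <-chan string, send func(string)) {
	for {
		select {
		case <-done:
			return
		case msg, ok := <-updates:
			if !ok {
				// A closed channel yields zero values forever; stop instead
				// of sending empty messages in a tight loop.
				return
			}
			send(msg)
		}
	}
}

func main() {
	updates := make(chan string, 2)
	updates <- "container started"
	updates <- "container stopped"
	close(updates)

	forward(make(chan struct{}), updates, func(s string) {
		fmt.Println("sent:", s)
	})
}
```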
2025-08-15 12:32:33 +01:00
Ethan d7bdb3cdef ci: add paralleltestctx to lint/go (#19369)
Closes https://github.com/coder/internal/issues/884

We're adding this as a `go run` in `lint/go` for now, since adding it to
golangci-lint ourselves involves recompiling golangci-lint and then
running that new binary. I'll look into proposing it being added to the
public golangci-lint linters.

Doesn't appear to cause the lint ci job to take any longer, which is
nice.
2025-08-15 16:16:18 +10:00
Spike Curtis 6ba55213fb test: fix timeout on TestServer_X11_EvictionLRU (#19217)
fixes https://github.com/coder/internal/issues/878

On my dev system it takes 900ms, but looking at timestamps in CI it took
25 seconds. Bumping timeout to 60s.

Also fixes the segfault.
2025-08-07 16:40:38 +04:00
Danielle Maywood 760dc8b467 fix(agent/agentcontainers): fix TestDevcontainerDiscovery/AutoStart flake (#19179)
Fixes https://github.com/coder/internal/issues/864
2025-08-05 13:58:55 +01:00
Spike Curtis 7eb41193f8 test: fix TestSSHServer_ClosesStdin to handle non-atomic write (#19174)
fixes https://github.com/coder/internal/issues/863

We read an output file in a loop, but this could race with the other process: the file may have been created but not yet written, or a partial write may be in progress. The fix is to retry if the content is shorter than we expect.
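
A minimal sketch of that retry, assuming a simple poll with a fixed delay (the helper `readAtLeast` and its parameters are hypothetical and not taken from the test):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// readAtLeast polls path until it contains at least want bytes, retrying when
// the file is missing or shorter than expected.
func readAtLeast(path string, want, attempts int, delay time.Duration) ([]byte, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		data, err := os.ReadFile(path)
		if err == nil && len(data) >= want {
			return data, nil
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = fmt.Errorf("short read: got %d bytes, want at least %d", len(data), want)
		}
		time.Sleep(delay)
	}
	return nil, lastErr
}

// demo simulates a writer that produces the file in two separate writes.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "retry-example")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "out.txt")

	go func() {
		f, err := os.Create(path)
		if err != nil {
			return
		}
		defer f.Close()
		f.WriteString("hel")
		time.Sleep(20 * time.Millisecond)
		f.WriteString("lo")
	}()

	data, err := readAtLeast(path, len("hello"), 100, 10*time.Millisecond)
	return string(data), err
}

func main() {
	got, err := demo()
	fmt.Println(got, err)
}
```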
2025-08-05 11:36:21 +04:00
Danielle Maywood b8e2344ef5 chore(agent/agentcontainers): disable project autostart by default (#19114)
We disable the logic that allows autostarting discovered devcontainers
by default. We want this behavior to be opt-in rather than opt-out.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-04 16:21:13 +01:00
Danielle Maywood ddb5b87815 chore(agent/agentcontainers): test current prebuilds integration (#19074)
As it turns out, prebuilds + devcontainers appear to already work
together. This PR has created a test that simulates a prebuild claim
happening to `agentcontainers.API`, to see how we handle it.
2025-07-31 15:31:44 +01:00
Danielle Maywood cc4f8da6e1 fix(agent/agentcontainers): fix devcontainer integration tests (#19109)
It appears we accidentally merged a change that broke our devcontainer
integration tests https://github.com/coder/coder/pull/18570.
2025-07-31 13:24:23 +01:00
Danielle Maywood 219d1b4101 chore(agent/agentcontainers): skip part of test if on darwin (#19081) 2025-07-29 17:06:17 +01:00
Danielle Maywood 66cf90c736 feat(agent/agentcontainers): allow auto start for discovered containers (#19040)
Closes https://github.com/coder/internal/issues/711

When a `devcontainer.json` has been found and it has `.customizations.coder.autoStart = true`, we will now auto start this dev container.
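
For illustration, reading that flag from a config file could be sketched as below. The struct and function names are hypothetical. Note that real `devcontainer.json` files may contain comments, which `encoding/json` does not accept, so this sketch only handles plain JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// devcontainerConfig models only the field path named above:
// .customizations.coder.autoStart
type devcontainerConfig struct {
	Customizations struct {
		Coder struct {
			AutoStart bool `json:"autoStart"`
		} `json:"coder"`
	} `json:"customizations"`
}

// autoStartEnabled reports whether the config opts in to auto start. A
// missing field is treated as false.
func autoStartEnabled(raw []byte) (bool, error) {
	var cfg devcontainerConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return false, err
	}
	return cfg.Customizations.Coder.AutoStart, nil
}

func main() {
	raw := []byte(`{
		"name": "example",
		"customizations": {"coder": {"autoStart": true}}
	}`)
	fmt.Println(autoStartEnabled(raw))
}
```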
2025-07-28 12:30:52 +01:00
Danielle Maywood 25d70ce7bc fix(agent/agentcontainers): respect ignore files (#19016)
Closes https://github.com/coder/coder/issues/19011

We now use
[go-git](https://pkg.go.dev/github.com/go-git/go-git/v5@v5.16.2/plumbing/format/gitignore)'s
`gitignore` plumbing implementation to parse the `.gitignore` files and
match against the patterns generated. We use this to ignore any ignored
files in the git repository.

Unfortunately I've had to slightly re-implement some of the interface
exposed by `go-git` because they use `billy.Filesystem` instead of
`afero.Fs`.
2025-07-24 12:12:05 +01:00
Danielle Maywood f41275eb39 feat(agent/agentcontainers): auto detect dev containers (#18950)
Relates to https://github.com/coder/internal/issues/711

This PR implements a project discovery mechanism that searches for any
dev container projects and makes them visible in the UI so that they can
be started. To make the wording on the site more clear, "Rebuild" has
been changed to "Start" when there is no container associated with a
known dev container configuration. I've also made it so that the site will
show the dev container config path when there is no other name
available.

### Design decisions

Just want to ensure my explanations for a few design decisions are noted
down:
- We only search for dev container configurations inside git
repositories
- We only search for these git repositories if they're at the top level
or a direct child of the agent directory.

This limited approach is to reduce the number of files we ultimately
walk when trying to find these projects. It makes sense to limit it to
only the agent directory, although I'm open to expanding how deep we
search.
2025-07-22 19:02:43 +01:00
Dean Sheather a1b87a67c6 fix: use client preferred URL for the default DERP (#18911)
The agentsdk currently does a remap of the DERP map to change the
EmbeddedRelay node's URL to match the agent's access URL.

This PR makes changes to the `workspacesdk` (used by clients like the
CLI) and `vpn` (used by Coder Desktop) to match this behavior.

This gives us the ability to try Coder clients in dogfood over a VPN
without changing the global access URL.
2025-07-17 20:17:44 +10:00
Danielle Maywood fb00cd2c1a fix(agent/agentcontainers): fix TestAPI/NoUpdaterLoopLogspam flake (#18905) 2025-07-17 10:59:02 +01:00
Danielle Maywood bd3d0ea482 fix(agent/agentcontainers): fix TestAPI/IgnoreCustomization flake (#18863) 2025-07-15 10:01:04 +01:00
Danielle Maywood 43b0bb7f61 feat(site): use websocket connection for devcontainer updates (#18808)
Instead of polling every 10 seconds, we instead use a WebSocket
connection for more timely updates.
2025-07-14 21:35:35 +01:00
Ethan c1b2304d18 test(agent/agentssh): use fish shell compatible exit status checking (#18824)
This (week-old) test was failing in my workspace because I use fish shell. 
I really do not like that Fish shell does not support `$?`, but I also do like Fish shell! We have a few people at Coder who use it who would appreciate this change.
2025-07-10 19:50:30 +10:00
Mathias Fredriksson 6c4db7a2bc feat(cli): replace open vscode container with devcontainer subagent (#18765)
This change allows a devcontainer to be opened via the agent syntax,
`coder open vscode <workspace>.<agent>` and removes the `--container`
option to simplify the subcommand. Accessing the subagent will behave
similarly to how the `--container` option behaved.

Fixes coder/internal#748
2025-07-08 19:21:41 +03:00
Danielle Maywood 0118e75009 fix(agent): disable dev container integration inside sub agents (#18781)
It appears we accidentally broke this logic in a previous PR. This
should now correctly disable the agent api as we'd expect.
2025-07-08 11:05:30 +01:00
blink-so[bot] 2c95a1dd71 chore: update gofumpt from v0.4.0 to v0.8.0 (#18652) 2025-07-03 11:28:00 -06:00