1 of 9 [next >>](https://github.com/coder/coder/pull/24811)
RFC: [Bridge ↔ Boundaries Correlation
RFC](https://www.notion.so/Bridge-Boundaries-Correlation-313d579be59281f3b4efdbfd6896775a)
Adds three new proto fields for boundary session correlation.
**`ReportBoundaryLogsRequest`**
- `session_id` (string, field 2) — UUID generated by boundary at
startup,
shared across all batches from a single run.
- `confined_process` (string, field 3) — name of the confined process
(e.g. `claude-code`, `codex`, `copilot`).
**`BoundaryLog`**
- `sequence_number` (uint64, field 4) — monotonically increasing counter
per session, primary ordering key when boundary is in use.
`BoundaryLog.time` already existed at field 2; no change needed there.
API version bumped to v2.9.
No behaviour change in coderd or the agent. This is a pure schema bump
that the boundary repo will consume in its own stack.
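As a rough illustration of how a consumer might use `sequence_number` as the ordering key, here is a minimal sketch. The `boundaryLog` struct and `orderLogs` helper are hypothetical stand-ins written for this example; they are not the generated proto types, and nothing here is part of this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// boundaryLog is a hypothetical stand-in for the generated BoundaryLog type;
// only the field relevant to ordering is modelled.
type boundaryLog struct {
	SequenceNumber uint64
	Message        string
}

// orderLogs returns a copy of logs sorted by sequence number, the primary
// ordering key within a single session.
func orderLogs(logs []boundaryLog) []boundaryLog {
	out := make([]boundaryLog, len(logs))
	copy(out, logs)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].SequenceNumber < out[j].SequenceNumber
	})
	return out
}

func main() {
	// Assumption for the example: entries were observed out of order.
	logs := []boundaryLog{{3, "c"}, {1, "a"}, {2, "b"}}
	for _, l := range orderLogs(logs) {
		fmt.Println(l.SequenceNumber, l.Message)
	}
}
```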
> Generated by Coder Agents
When the control plane connection drops and reconnects, a new
`tunnelUpdater` is created with empty workspace state. This causes the
in-memory DNS resolver to lose all host records, breaking `.coder` name
resolution until the server sends a fresh workspace snapshot.
If the API is unreachable (e.g., the route goes through a VPN that is
also reconnecting), the snapshot never arrives and DNS stays broken
indefinitely — requiring a full Coder Desktop restart.
Fix: carry workspace state from the previous `tunnelUpdater` to the new
one on reconnect, and immediately re-apply DNS hosts so the resolver
stays populated during the reconnection window.
Fixes https://linear.app/codercom/issue/PLAT-110
<details><summary>Investigation & decision log</summary>
### Root cause analysis
Customer diagnostic data from Roblox (March 31) showed:
- NRPT rule present (`.coder` → `fd60:627a:a42b::53`) — routing is
correct
- DNS resolver returns NXDOMAIN for everything including the sentinel
`is.coder--connect--enabled--right--now.coder` — resolver is running but
has zero host records
- Coder Connect UI shows "connected" — the WireGuard data plane is up
The resolver is empty because
`TunnelAllWorkspaceUpdatesController.New()` creates a fresh
`tunnelUpdater` with `workspaces: make(map[uuid.UUID]*Workspace)`
(empty). The previous updater's workspace data is discarded. If the
server's workspace snapshot is delayed or the API is unreachable, the
resolver has no records to serve.
This is compounded by GlobalProtect VPN reconnects: the Coder API is
behind the VPN, so when GP reconnects, the API route is temporarily lost
and the snapshot can't arrive.
### What this PR changes
- `TunnelAllWorkspaceUpdatesController.New()` now clones workspace state
from the previous updater before creating the new one
- Immediately re-applies DNS hosts with the inherited state (log:
`re-applying DNS hosts from previous session`)
- When the server's snapshot arrives, it replaces the inherited data
normally
- If `SetDNSHosts` fails during re-apply, it's logged as a warning and
not fatal — the recvLoop will program DNS when the snapshot arrives
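The state carry-over described above can be sketched roughly as follows. The `workspace` and `updater` types are simplified, hypothetical stand-ins (string IDs instead of `uuid.UUID`, no locking), so treat this as an illustration of the idea rather than the actual change.

```go
package main

import "fmt"

// workspace is a hypothetical, simplified stand-in for the tailnet Workspace type.
type workspace struct {
	Name  string
	Hosts []string
}

// updater is a hypothetical stand-in for tunnelUpdater, keyed by workspace ID.
type updater struct {
	workspaces map[string]*workspace
}

// newUpdater creates an updater for a new connection. If prev is non-nil its
// workspaces are deep-copied, so the new updater starts populated, not empty.
func newUpdater(prev *updater) *updater {
	u := &updater{workspaces: make(map[string]*workspace)}
	if prev == nil {
		return u
	}
	for id, w := range prev.workspaces {
		u.workspaces[id] = &workspace{
			Name:  w.Name,
			Hosts: append([]string(nil), w.Hosts...),
		}
	}
	return u
}

// dnsHosts flattens the inherited state into the host list that would be
// re-applied to the resolver right after reconnecting.
func (u *updater) dnsHosts() []string {
	var hosts []string
	for _, w := range u.workspaces {
		hosts = append(hosts, w.Hosts...)
	}
	return hosts
}

func main() {
	old := &updater{workspaces: map[string]*workspace{
		"ws-1": {Name: "dev", Hosts: []string{"dev.coder"}},
	}}
	fresh := newUpdater(old)
	fmt.Println(fresh.dnsHosts())
}
```

Copying rather than sharing the map means the old updater can be torn down without affecting the records the new one serves.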
### What this PR does NOT fix (future work)
- **Tunnel binary restart**: when the tunnel process itself is killed
and relaunched, all in-memory state is lost. A DNS host cache on disk
would be needed for this case.
- **NRPT rule cleanup on startup**: the Tailscale fork's
`nrptRuleDatabase` constructor unconditionally deletes all NRPT rules on
engine creation. Deferring cleanup to the first successful `SetDNS` call
would reduce the DNS gap.
- **Hosts file retry**: the `setHosts()` retry in the Tailscale fork
(5×10ms) is too short for environments where endpoint security locks the
file.
These are tracked as follow-up items in the `coder/tailscale` fork.
</details>
> 🤖 Generated by Coder Agents
Wire DERPTLSConfig through the CLI, SDK, tailnet, VPN client, agent, and
health checks to allow custom TLS configuration for DERP connections.
The main use case is to be able to set a custom CA and also present
client certs (mTLS). See https://github.com/coder/tailscale/pull/105 for
related changes.
Adds three new global CLI flags:
- `--client-tls-ca-file` / `CODER_CLIENT_TLS_CA_FILE`
- `--client-tls-cert-file` / `CODER_CLIENT_TLS_CERT_FILE`
- `--client-tls-key-file` / `CODER_CLIENT_TLS_KEY_FILE`
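For readers unfamiliar with client-side TLS configuration in Go, here is a minimal sketch of turning a CA bundle and a client key pair into a `tls.Config` using only the standard library. The `buildClientTLS` helper is hypothetical and takes PEM bytes directly; how the PR actually loads the files and wires the result through is not shown here.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// buildClientTLS is a hypothetical helper that assembles a tls.Config from
// optional PEM inputs: a custom CA bundle and a client certificate/key pair.
func buildClientTLS(caPEM, certPEM, keyPEM []byte) (*tls.Config, error) {
	cfg := &tls.Config{MinVersion: tls.VersionTLS12}
	if len(caPEM) > 0 {
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(caPEM) {
			return nil, errors.New("no certificates found in CA PEM")
		}
		cfg.RootCAs = pool // trust only the custom CA
	}
	if (len(certPEM) > 0) != (len(keyPEM) > 0) {
		return nil, errors.New("client cert and key must be provided together")
	}
	if len(certPEM) > 0 {
		pair, err := tls.X509KeyPair(certPEM, keyPEM)
		if err != nil {
			return nil, fmt.Errorf("load client key pair: %w", err)
		}
		cfg.Certificates = []tls.Certificate{pair} // presented for mTLS
	}
	return cfg, nil
}

func main() {
	// With no inputs the config falls back to the system roots and no client cert.
	cfg, err := buildClientTLS(nil, nil, nil)
	fmt.Println(cfg != nil, err)
}
```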
Based on community PR #22695 by @ibdafna, with autogeneration issues
fixed (protobuf version mismatches in .pb.go files, golden file
regeneration, lint fixes).
> [!NOTE]
> This PR was authored by Coder Agents on behalf of a Coder team member.
<details>
<summary>Relationship to #22695</summary>
This is a clean reimplementation of the changes from #22695 on top of
current `main`, with the following differences:
- **Removed**: Accidental protobuf version changes in `.pb.go` files
(contributor had `protoc v6.33.4` vs project's `protoc v4.23.4`)
- **Added**: Properly regenerated golden files and docs via `make gen`
- **Fixed**: Lint issue (`var-declaration` revive warning on explicit
type in `createHTTPClient`)
- All meaningful code changes are identical to the original PR
</details>
## Changes
- **Commit 1**: Remove 17 unnecessary `//nolint` directives:
  - `//nolint:varnamelen` — linter not active
  - `//nolint:unused` on exported `SlimUnsupported`
  - `//nolint:govet` in `coderd/httpmw/csrf` — no longer fires
  - `//nolint:revive` on functions refactored since the nolint was added
  - `//nolint:paralleltest` citing Go 1.22 loop variable capture
    (obsolete)
  - Bare `//nolint` narrowed to specific `//nolint:gocritic` with
    justification
- **Commit 2**: Fix root causes behind 5 dangerous nolint suppressions:
  - Add `MinVersion: tls.VersionTLS12` to TLS client config (removes
    `gosec` G402)
  - Delete trivial unexported wrappers `apiKey()`/`normalizeProvider()` in
    chatprovider (removes `revive` confusing-naming)
  - Add doc comments to `StartWithAssert` and `Router` (removes `revive`
    exported)
  - Rename unused parameters to `_` in integration test helpers
> 🤖 This PR was created using Coder Agents and reviewed by me.
_Generated with mux but reviewed by a human_
This PR fixes a bug where Coder Desktop could stop retrying connections
to coderd after a prolonged network interruption. When that happened,
the client would no longer recoordinate or receive workspace updates,
even after connectivity returned.
This is likely the long-standing “stale connection” issue that has been
reported both internally and by customers. In practice, it would cause
all Coder Desktop workspaces to appear yellow or red in the UI and
become unreachable.
The underlying behavior matches the reports: peers are removed after 15
minutes without a handshake. So if network connectivity is lost for that
long, the client must recoordinate to recover. This bug prevented that
recoordination from happening.
For that reason, I’m marking this as:
Closes https://github.com/coder/coder-desktop-macos/issues/227
## Problem
The tailnet controller owns a long-lived retry loop in `Controller.Run`.
That loop already had an important graceful-shutdown guard added in
[`ba21ba87`](https://github.com/coder/coder/commit/ba21ba87ba2209fad3c9f4bb131d7de1fc0e58be)
to prevent a phantom redial after cancellation:
```go
if c.ctx.Err() != nil {
return
}
```
That guard was correct. It made controller lifetime depend on the
controller's own context rather than on retry timing races.
But the post-dial error path had since grown a broader terminal check:
```go
if xerrors.Is(err, context.Canceled) ||
xerrors.Is(err, context.DeadlineExceeded) {
return
}
```
That turns out to be too broad for desktop reconnects. A dial attempt
can fail with a wrapped `context.DeadlineExceeded` even while the
controller's own context is still live.
## Why that happens
The workspace tailnet dialer uses the SDK HTTP client, which inherits
`http.DefaultTransport`. That transport uses a `net.Dialer` with a 30s
`Timeout`. Go implements that timeout by creating an internal
deadline-bound sub-context for the TCP connect.
So during a control-plane blackhole, the reconnect path can look like
this:
- the existing control-plane connection dies
- `Controller.Run` re-enters the retry path
- the next websocket/TCP dial hangs against unreachable coderd
- `net.Dialer` times out the connect after ~30s
- the returned error unwraps to `context.DeadlineExceeded`
- `Controller.Run` treats that as terminal and returns
- the retry goroutine exits forever even though `c.ctx` is still alive
At that point the data plane can remain partially alive, the desktop app
can still look online, and unblocking coderd does nothing because the
process is no longer trying to redial.
## How this was found
We reproduced the issue in the macOS vpn-daemon process with temporary
diagnostics, blackholed coderd with `pfctl`, and captured multiple
goroutine dumps while the daemon was wedged.
Those dumps showed:
- `manageGracefulTimeout` was still blocked on `<-c.ctx.Done()`, proving
the controller context was not canceled
- the `Controller.Run` retry goroutine was missing from later dumps
- control-plane consumers stayed idle longer over time
- once coderd became reachable again the daemon still did not dial it
That narrowed the failure from "slow retry" to "retry loop exited", and
tracing the dial path back through `http.DefaultTransport` and
`net.Dialer` explained why a transport timeout was being mistaken for
controller shutdown.
In my testing with coderd blocked, as expected, I did retain a
connection to the workspace agent. I suspect the scenarios where the
connection to the agent is lost happen because we can't retry coordination.
## Fix
Keep the graceful-shutdown guard from
[`ba21ba87`](https://github.com/coder/coder/commit/ba21ba87ba2209fad3c9f4bb131d7de1fc0e58be)
exactly as-is, but narrow the post-dial exit condition so it keys off
the controller's own context instead of the error unwrap chain.
Before:
```go
if xerrors.Is(err, context.Canceled) ||
xerrors.Is(err, context.DeadlineExceeded) {
return
}
```
After:
```go
if c.ctx.Err() != nil {
return
}
```
## Why this is the right behavior
This preserves the original graceful-shutdown invariant from
[`ba21ba87`](https://github.com/coder/coder/commit/ba21ba87ba2209fad3c9f4bb131d7de1fc0e58be)
while restoring retryability for transient transport failures:
- if `c.ctx` is canceled before dialing, the pre-dial guard still
prevents a phantom redial
- if `c.ctx` is canceled during a dial attempt, the error path still
exits cleanly because `c.ctx.Err()` is non-nil
- if a live controller hits a wrapped transport timeout, the loop no
longer dies and instead retries as intended
In other words, controller state remains the only authoritative signal
for loop shutdown.
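The distinction can be demonstrated with a small, self-contained model. `runRetryLoop` and `simulate` below are hypothetical simplifications written for this description (no backoff, a fake dialer, a bounded attempt count); they are not the real `Controller.Run`.

```go
package main

import (
	"context"
	"fmt"
	"net"
)

// runRetryLoop is a simplified, hypothetical model of the retry loop. It
// returns how many dial attempts were made before the loop exited.
func runRetryLoop(ctx context.Context, dial func(attempt int) error, maxAttempts int) int {
	attempts := 0
	for attempts < maxAttempts {
		// Pre-dial guard: never start a dial once the controller is done.
		if ctx.Err() != nil {
			return attempts
		}
		attempts++
		if err := dial(attempts); err == nil {
			return attempts // connected; the real loop would serve the connection
		}
		// Post-dial guard: exit only when the controller's own context is done.
		// The dial error is deliberately not inspected for context errors.
		if ctx.Err() != nil {
			return attempts
		}
	}
	return attempts
}

// simulate runs the loop against a fake dialer that fails with a wrapped
// context.DeadlineExceeded for the first `failures` attempts and cancels the
// controller context during attempt `cancelAt` (0 means never).
func simulate(failures, cancelAt int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	dial := func(attempt int) error {
		if attempt == cancelAt {
			cancel()
			return context.Canceled
		}
		if attempt <= failures {
			return &net.OpError{Op: "dial", Net: "tcp", Err: context.DeadlineExceeded}
		}
		return nil
	}
	return runRetryLoop(ctx, dial, 10)
}

func main() {
	fmt.Println("timeouts then success:", simulate(2, 0))
	fmt.Println("cancelled mid-dial:", simulate(5, 2))
}
```

In the first case the loop keeps dialing through two wrapped timeouts; in the second it stops at the attempt during which the context was cancelled.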
## Non-regression coverage
This also preserves the earlier flaky-test fix from
[`ba21ba87`](https://github.com/coder/coder/commit/ba21ba87ba2209fad3c9f4bb131d7de1fc0e58be):
- `pipeDialer` still returns errors instead of asserting from background
goroutines
- `TestController_Disconnects` still waits for `uut.Closed()` before the
test exits
On top of that, this change adds focused controller tests that assert:
- a wrapped `net.OpError(context.DeadlineExceeded)` under a live
controller causes another dial attempt instead of closing the controller
- cancellation still shuts the controller down without an extra redial
## Validation
After blocking TCP connections to coderd for 20 minutes to force the
retry path, unblocking coderd allowed the daemon to recover on its own
without toggling Coder Connect.
This PR does three things:
- Exports derp expvars to the pprof endpoint
- Exports the expvar metrics as prometheus metrics in both coderd and
wsproxy
- Updates our tailscale fork to include a fix I also had to make to avoid
a data race
I generated this with mux but I also manually tested that the metrics
were getting properly emitted
part of https://github.com/coder/coder/issues/21335
This moves updating app status (used by Tasks) into the workspace agent
API over dRPC. This will allow us to update the status without having to
re-authenticate each time, like we would with an HTTP PATCH request.
Further PRs in this stack will pipe these requests through from the CLI MCP
server to the agentsock and finally to this dRPC call to coderd.
Update the agent protobuf schema (agent/proto/agent.proto) to include:
- subagent_id field in WorkspaceAgentDevcontainer message
- id field in CreateSubAgentRequest message
Bump the Agent API version from v2.7 to v2.8 and update all client
references throughout the codebase (ConnectRPC27 -> ConnectRPC28,
DRPCAgentClient27 -> DRPCAgentClient28).
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:
```go
import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"golang.org/x/xerrors"
	"gopkg.in/natefinch/lumberjack.v2"

	"cdr.dev/slog/v3"
	"github.com/coder/coder/v2/codersdk/agentsdk"
	"github.com/coder/serpent"
)
```
3 groups: standard library, 3rd-party libs, Coder libs.
This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
Upgrades to slog v3, which includes a small but backward-incompatible API change to the acceptable call arguments when logging. This change lets us verify via compile-time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).
It also updates the dependencies that use slog and have themselves been updated.
I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.
For the other dependencies, I pushed new tags.
Add the AgentAPI changes to support the feature that transmits boundary
logs from workspaces to coderd via the agent API for eventual re-emission to
stderr. The API handlers are stubs for now because I'm trying to land
this feature from multiple smaller PRs.
High level architecture:
- Boundary records resource access in batches and sends proto message to
agent
- Agent proxies messages to coderd **(captured by the API changes in
this PR)**
- coderd re-emits logs to stderr
RFC:
https://www.notion.so/coderhq/Agent-Boundary-Logs-2afd579be59280f29629fc9823ac41ba
fixes: https://github.com/coder/internal/issues/1143
Both gVisor and the Go standard library implementations of `net.Conn` can, under certain circumstances, return `nil` from `RemoteAddr()` and `LocalAddr()` calls. If we then call methods on the returned `nil` address, we segfault.
This PR fixes these calls and adds ruleguard rules.
Note that `slog.F("remote_addr", conn.RemoteAddr())` is fine because slog detects the `nil` before attempting to stringify the type.
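A minimal sketch of the kind of guard involved is shown below. The `addrString` helper and the `nilAddrConn` test double are hypothetical names invented for this example; the actual call sites and ruleguard rules in the PR may look different.

```go
package main

import (
	"fmt"
	"net"
)

// addrString returns addr.String(), or a placeholder when addr is nil, so
// callers never invoke a method on a nil net.Addr interface value.
func addrString(addr net.Addr) string {
	if addr == nil {
		return "<nil>"
	}
	return addr.String()
}

// nilAddrConn is a test double whose address methods return nil, mimicking
// the behaviour described above. Its embedded net.Conn is left nil on purpose.
type nilAddrConn struct{ net.Conn }

func (nilAddrConn) RemoteAddr() net.Addr { return nil }
func (nilAddrConn) LocalAddr() net.Addr  { return nil }

func main() {
	var conn net.Conn = nilAddrConn{}
	// conn.RemoteAddr().String() would panic here; the helper does not.
	fmt.Println(addrString(conn.RemoteAddr()))
	fmt.Println(addrString(&net.TCPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 22}))
}
```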
Fixes https://github.com/coder/internal/issues/66
The problem is a race during test tear down where the node callback can fire after the destination tailnet.Conn has already closed, causing an error.
The fix I have employed is to remove the callback in a t.Cleanup() and also refactor some tests to ensure they close the tailnet.Conn in a Cleanup deeper in the stack.
The agentsdk currently does a remap of the DERP map to change the
EmbeddedRelay node's URL to match the agent's access URL.
This PR makes changes to the `workspacesdk` (used by clients like the
CLI) and `vpn` (used by Coder Desktop) to match this behavior.
This lets us try Coder clients in dogfood over a VPN
without changing the global access URL.
Fixes #15523
Uses latest https://github.com/coder/tailscale which includes https://github.com/coder/tailscale/pull/85 to stop selecting paths with small MTU for direct connections.
Also updates the tailnet integration test to reproduce the issue. The previous version had the 2 peers connected by a single veth, but this allows the OS to fragment the packet. In the new version, the 2 peers (and server) are all connected by a central router. The link between peer 1 and the router has an adjustable MTU. IPv6 does not allow packets to be fragmented by intermediate routers, so sending a too-large packet in this scenario forces the router to drop packets and reproduce the issue (without the tailscale changes).
Refactors the tailnet integration test and adds UDP echo tests with different MTUs, related to #15523
I still haven't gotten to the bottom of what's causing the issue (the added test case I expected to fail actually succeeds), but these integration test improvements are generally useful.
also:
* consolidates networking setup with easy and hard NAT
* consolidates client setup
* makes Client2 act like an agent at the tailnet layer, so it will send ReadyForHandshake and speed up the tunnel establishment
* adds support for logging tunneled packets
* adds support for dumping outer (underlay) IP traffic
* adds support for adjusting veth MTU
* adds support for IPv6 in the outer (underlay) network topology
Closes #17982.
The purpose of this PR is to expose network latency via the API used by Coder Desktop.
This PR has the tunnel ping all known agents every 5 seconds, in order to produce an instance of:
```proto
message LastPing {
// latency is the RTT of the ping to the agent.
google.protobuf.Duration latency = 1;
// did_p2p indicates whether the ping was sent P2P, or over DERP.
bool did_p2p = 2;
// preferred_derp is the human readable name of the preferred DERP region,
// or the region used for the last ping, if it was sent over DERP.
string preferred_derp = 3;
// preferred_derp_latency is the last known latency to the preferred DERP
// region. Unset if the region does not appear in the DERP map.
optional google.protobuf.Duration preferred_derp_latency = 4;
}
```
The contents of this message are stored and included on all subsequent upsertions of the agent.
Note that we upsert existing agents every 5 seconds to update the `last_handshake` value.
On the desktop apps, this message will be used to produce a tooltip similar to that of the VS Code extension:
<img width="495" alt="image" src="https://github.com/user-attachments/assets/d8b65f3d-f536-4c64-9af9-35c1a42c92d2" />
(wording not final)
Unlike the VS Code extension, we omit:
- The latency of *all* available DERP regions. It seems not ideal to send a copy of this entire map for every online agent, and it certainly doesn't make sense for it to be on the `Agent` or `LastPing` message.
  If we do want to expose this info on Coder Desktop, we should consider how best to do so; maybe we want to include it on a more generic `Netcheck` message.
- The current throughput (Bytes up/down). This is out of scope of the linked issue, and is non-trivial to implement. I'm also not sure of the value given the frequency we're doing these status updates (every 5 seconds).
  If we want to expose it, it'll be in a separate PR.
<img width="343" alt="image" src="https://github.com/user-attachments/assets/8447d03b-9721-4111-8ac1-332d70a1e8f1" />
Closes https://github.com/coder/internal/issues/619
Implement the `coderd` side of the AgentAPI for the upcoming
dev-container agents work.
`agent/agenttest/client.go` is left unimplemented for a future PR
working to implement the agent side of this feature.
Closes https://github.com/coder/internal/issues/648
This change introduces a new `ParentId` field to the agent's manifest.
This will allow an agent to know if it is a child or not, as well as
knowing who the owner is.
This is part of the Dev Container Agents work
Closes https://github.com/coder/internal/issues/563
The [Coder Connect
tunnel](https://github.com/coder/coder/blob/main/vpn/tunnel.go) receives
workspace state from the Coder server over a [dRPC
stream.](https://github.com/coder/coder/blob/114ba4593b2a82dfd41cdcb7fd6eb70d866e7b86/tailnet/controllers.go#L1029)
When first connecting to this stream, the current state of the user's
workspaces is received, with subsequent messages being diffs on top of
that state.
However, if the client disconnects from this stream, such as when the
user's device is suspended, and then reconnects later, no mechanism
exists for the tunnel to differentiate that message containing the
entire initial state from another diff, and so that state is incorrectly
applied as a diff.
In practice:
- Tunnel connects, receives a workspace update containing all the
existing workspaces & agents.
- Tunnel loses connection, but isn't completely stopped.
- All the user's workspaces are restarted, producing a new set of
agents.
- Tunnel regains connection, and receives a workspace update containing
all the existing workspaces & agents.
- This initial update is incorrectly applied as a diff, with the
Tunnel's state containing both the old & new agents.
This PR introduces a solution in which tunnelUpdater, when created,
sends a FreshState flag with the WorkspaceUpdate type. This flag is
handled in the vpn tunnel in the following fashion:
- Preserve existing Agents
- Remove current Agents in the tunnel that are not present in the
WorkspaceUpdate
- Remove unreferenced Workspaces
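A rough sketch of that reconciliation follows. The `agent` and `state` types are hypothetical simplifications with string IDs, and the sketch assumes "unreferenced" means "absent from the fresh update"; the real tunnel code also has to emit the corresponding removals to the client, which is omitted here.

```go
package main

import "fmt"

// agent and state are hypothetical, simplified stand-ins for the tunnel's types.
type agent struct {
	ID          string
	WorkspaceID string
}

type state struct {
	workspaces map[string]string // workspace ID -> name
	agents     map[string]agent  // agent ID -> agent
}

// applyFreshState reconciles s against a full snapshot instead of treating
// the snapshot as a diff.
func (s *state) applyFreshState(workspaces map[string]string, agents []agent) {
	keepAgent := make(map[string]bool, len(agents))
	for _, a := range agents {
		keepAgent[a.ID] = true
		s.agents[a.ID] = a // agents in the snapshot are preserved or added
	}
	for id := range s.agents {
		if !keepAgent[id] {
			delete(s.agents, id) // stale agent from before the reconnect
		}
	}
	for id := range s.workspaces {
		if _, ok := workspaces[id]; !ok {
			delete(s.workspaces, id) // assumption: absent from snapshot means gone
		}
	}
	for id, name := range workspaces {
		s.workspaces[id] = name
	}
}

func main() {
	s := &state{
		workspaces: map[string]string{"ws-1": "dev"},
		agents:     map[string]agent{"a-old": {ID: "a-old", WorkspaceID: "ws-1"}},
	}
	// The workspace was restarted while disconnected, producing a new agent.
	s.applyFreshState(
		map[string]string{"ws-1": "dev"},
		[]agent{{ID: "a-new", WorkspaceID: "ws-1"}},
	)
	fmt.Println(len(s.agents), s.agents["a-new"].WorkspaceID)
}
```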
Adds support to our coordinator implementations to send Error updates before disconnecting clients.
I was recently debugging a connection issue where the client was getting repeatedly disconnected from the Coordinator, but since we never send any error information it was really hard to diagnose without server logs.
This PR aims to correct that, by sending a CoordinateResponse with `Error` set in cases where we disconnect a client without them asking us to.
It also logs the error whenever we get one in the client controller.
Closes https://github.com/coder/internal/issues/466
```
$ dig -6 @fd60:627a:a42b::53 is.coder--connect--enabled--right--now.coder AAAA
; <<>> DiG 9.10.6 <<>> -6 @fd60:627a:a42b::53 is.coder--connect--enabled--right--now.coder AAAA
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62390
;; flags: qr aa rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;is.coder--connect--enabled--right--now.coder. IN AAAA
;; ANSWER SECTION:
is.coder--connect--enabled--right--now.coder. 2 IN AAAA fd60:627a:a42b::53
;; Query time: 3 msec
;; SERVER: fd60:627a:a42b::53#53(fd60:627a:a42b::53)
;; WHEN: Wed Apr 09 16:59:18 AEST 2025
;; MSG SIZE rcvd: 134
```
Hostname considerations:
- Workspace names, usernames, and agent names can't contain double hyphens, so this name can't conflict with a real Coder Connect hostname.
- Components can't start or end with hyphens according to [RFC 952](https://www.rfc-editor.org/rfc/rfc952.html)
- DNS records can't have hyphens in the 3rd and 4th positions, so as not to conflict with IDNs https://datatracker.ietf.org/doc/html/rfc5891
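The considerations above can be checked mechanically. The helpers below are hypothetical and written only for illustration; they encode the three rules as stated in this description and are not taken from the codebase.

```go
package main

import (
	"fmt"
	"strings"
)

// labelIssues reports which of the rules above a single DNS label violates.
func labelIssues(label string) []string {
	var issues []string
	if strings.HasPrefix(label, "-") || strings.HasSuffix(label, "-") {
		issues = append(issues, "starts or ends with a hyphen")
	}
	if len(label) >= 4 && label[2] == '-' && label[3] == '-' {
		issues = append(issues, "hyphens in 3rd and 4th positions")
	}
	return issues
}

// couldBeUserName reports whether label could collide with a workspace, user,
// or agent name, which (per the note above) never contain a double hyphen.
func couldBeUserName(label string) bool {
	return !strings.Contains(label, "--")
}

func main() {
	host := "is.coder--connect--enabled--right--now.coder"
	for _, label := range strings.Split(host, ".") {
		fmt.Printf("%q issues=%v couldBeUserName=%v\n",
			label, labelIssues(label), couldBeUserName(label))
	}
}
```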
- Update go.mod to use Go 1.24.1
- Update GitHub Actions setup-go action to use Go 1.24.1
- Fix linting issues with golangci-lint by:
  - Updating to golangci-lint v1.57.1 (more compatible with Go 1.24.1)
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <claude@anthropic.com>
Updating golden files is an extra step on top of gen that is easily
overlooked. The developer then only notices the issue in CI and loses
time waiting for tests to complete.
The experimental functions in `golang.org/x/exp/slices` have been
available in the standard library since Go 1.21.
Reference: https://go.dev/doc/go1.21#slices
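As a small illustration of the standard library package (Go 1.21 or newer), the sketch below uses `slices.Clone`, `slices.Sort`, `slices.Compact`, and `slices.Contains`. The `sortedUnique` helper and its inputs are made up for this example and do not appear in the change.

```go
package main

import (
	"fmt"
	"slices"
)

// sortedUnique returns a sorted copy of in with duplicates removed, using
// only the standard library slices package.
func sortedUnique(in []string) []string {
	out := slices.Clone(in)
	slices.Sort(out)
	return slices.Compact(out)
}

func main() {
	names := []string{"codex", "claude-code", "codex"}
	fmt.Println(sortedUnique(names))
	fmt.Println(slices.Contains(names, "copilot"))
}
```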
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
Fixes: https://github.com/coder/internal/issues/377
Added an additional SSH listener on port 22, so the agent now listens on both port 1 and port 22.
---
Change-Id: Ifd986b260f8ac317e37d65111cd4e0bd1dc38af8
Signed-off-by: Thomas Kosiewski <tk@coder.com>
Addresses https://github.com/coder/coder/issues/16231.
This PR reduces the volume of logs we print after server startup in
order to surface the web UI URL better.
Here are the logs after the changes a couple of seconds after starting
the server:
<img width="868" alt="Screenshot 2025-02-18 at 16 31 32"
src="https://github.com/user-attachments/assets/786dc4b8-7383-48c8-a5c3-a997c01ca915"
/>
The warning is due to running a development site-less build. It wouldn't
show in a release build.
This change adds a new `ReportConnection` endpoint to the `agentapi`.
The protocol version was bumped previously, so it has been omitted here.
This allows the agent to report connection events, for example when the
user connects to the workspace via SSH or VS Code.
Updates #15139
Closes https://github.com/coder/coder-desktop-macos/issues/54
I've also double checked that agents with hyphens & underscores play nice once programmed, as do workspaces with hyphens:
```
$ ping6 main_agent-1.main-workspace.admin.coder
PING6(56=40+8+8 bytes) fd60:627a:a42b:4e91:88c0:da4a:df4f:b54e --> fd60:627a:a42b:46d4:8b55:e549:e498:e6f5
```
also fine in Firefox & Safari, though I'm a little surprised underscores work.
As part of the new resources monitoring logic, and more specifically for
OOM & OOD notifications, we need to update the AgentAPI and the
agent's logic.
In more detail:
We are updating the AgentAPI & TailnetAPI to version 24 to add two new
methods to the AgentAPI:
- One method to fetch the resources monitoring configuration
- One method to push the datapoints for the resources monitoring.
This PR also adds new logic on the agent side: a routine that runs on a
ticker, fetching the resource usage on each tick and storing
it in a FIFO-like queue.
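To make the FIFO-like behaviour concrete, here is a minimal sketch of a bounded queue that evicts the oldest datapoint when full. The `datapoint` and `queue` types are hypothetical and chosen for this example; the agent's actual types, sizes, and tick interval are not shown here.

```go
package main

import "fmt"

// datapoint is a hypothetical resource usage sample.
type datapoint struct {
	MemoryUsedBytes int64
}

// queue is a bounded FIFO: once full, pushing evicts the oldest item.
type queue struct {
	items []datapoint
	size  int
}

// newQueue creates a queue holding at most size items (minimum 1).
func newQueue(size int) *queue {
	if size < 1 {
		size = 1
	}
	return &queue{items: make([]datapoint, 0, size), size: size}
}

// push appends d, dropping the oldest item first if the queue is full.
func (q *queue) push(d datapoint) {
	if len(q.items) >= q.size {
		q.items = q.items[1:]
	}
	q.items = append(q.items, d)
}

// snapshot returns a copy of the queued items, oldest first.
func (q *queue) snapshot() []datapoint {
	return append([]datapoint(nil), q.items...)
}

func main() {
	q := newQueue(3)
	// Each push stands in for one tick of the monitoring routine.
	for i := int64(1); i <= 4; i++ {
		q.push(datapoint{MemoryUsedBytes: i})
	}
	fmt.Println(q.snapshot())
}
```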
Finally, this PR fixes a problem we had with the RBAC logic on the resources
monitoring model by applying the same logic that we use for similar
entities.