Commit Graph

251 Commits

Author SHA1 Message Date
Cian Johnston 847a88c6ca chore: clean up stale and dangerous //nolint comments (#23643)
## Changes

- **Commit 1**: Remove 17 unnecessary `//nolint` directives:
  - `//nolint:varnamelen` — linter not active
  - `//nolint:unused` on exported `SlimUnsupported`
  - `//nolint:govet` in `coderd/httpmw/csrf` — no longer fires
  - `//nolint:revive` on functions refactored since the nolint was added
  - `//nolint:paralleltest` citing Go 1.22 loop variable capture
    (obsolete)
  - Bare `//nolint` narrowed to specific `//nolint:gocritic` with
    justification

- **Commit 2**: Fix root causes behind 5 dangerous nolint suppressions:
  - Add `MinVersion: tls.VersionTLS12` to TLS client config (removes
    `gosec` G402)
  - Delete trivial unexported wrappers `apiKey()`/`normalizeProvider()` in
    chatprovider (removes `revive` confusing-naming)
  - Add doc comments to `StartWithAssert` and `Router` (removes `revive`
    exported)
  - Rename unused parameters to `_` in integration test helpers

> 🤖 This PR was created using Coder Agents and reviewed by me.
2026-03-26 14:13:53 +00:00
Ethan 5130404f2a fix(tailnet): retry after transport dial timeouts (#22977)
_Generated with mux but reviewed by a human_

This PR fixes a bug where Coder Desktop could stop retrying connections
to coderd after a prolonged network interruption. When that happened,
the client would no longer recoordinate or receive workspace updates,
even after connectivity returned.

This is likely the long-standing “stale connection” issue that has been
reported both internally and by customers. In practice, it would cause
all Coder Desktop workspaces to appear yellow or red in the UI and
become unreachable.

The underlying behavior matches the reports: peers are removed after 15
minutes without a handshake. So if network connectivity is lost for that
long, the client must recoordinate to recover. This bug prevented that
recoordination from happening.

For that reason, I’m marking this as:

Closes https://github.com/coder/coder-desktop-macos/issues/227

## Problem

The tailnet controller owns a long-lived retry loop in `Controller.Run`.
That loop already had an important graceful-shutdown guard added in
[`ba21ba87`](https://github.com/coder/coder/commit/ba21ba87ba2209fad3c9f4bb131d7de1fc0e58be)
to prevent a phantom redial after cancellation:

```go
if c.ctx.Err() != nil {
    return
}
```

That guard was correct. It made controller lifetime depend on the
controller's own context rather than on retry timing races.

But the post-dial error path had since grown a broader terminal check:

```go
if xerrors.Is(err, context.Canceled) ||
   xerrors.Is(err, context.DeadlineExceeded) {
    return
}
```

That turns out to be too broad for desktop reconnects. A dial attempt
can fail with a wrapped `context.DeadlineExceeded` even while the
controller's own context is still live.

## Why that happens

The workspace tailnet dialer uses the SDK HTTP client, which inherits
`http.DefaultTransport`. That transport uses a `net.Dialer` with a 30s
`Timeout`. Go implements that timeout by creating an internal
deadline-bound sub-context for the TCP connect.

So during a control-plane blackhole, the reconnect path can look like
this:

- the existing control-plane connection dies
- `Controller.Run` re-enters the retry path
- the next websocket/TCP dial hangs against unreachable coderd
- `net.Dialer` times out the connect after ~30s
- the returned error unwraps to `context.DeadlineExceeded`
- `Controller.Run` treats that as terminal and returns
- the retry goroutine exits forever even though `c.ctx` is still alive

At that point the data plane can remain partially alive, the desktop app
can still look online, and unblocking coderd does nothing because the
process is no longer trying to redial.

## How this was found

We reproduced the issue in the macOS vpn-daemon process with temporary
diagnostics, blackholed coderd with `pfctl`, and captured multiple
goroutine dumps while the daemon was wedged.

Those dumps showed:

- `manageGracefulTimeout` was still blocked on `<-c.ctx.Done()`, proving
the controller context was not canceled
- the `Controller.Run` retry goroutine was missing from later dumps
- control-plane consumers stayed idle longer over time
- once coderd became reachable again the daemon still did not dial it

That narrowed the failure from "slow retry" to "retry loop exited", and
tracing the dial path back through `http.DefaultTransport` and
`net.Dialer` explained why a transport timeout was being mistaken for
controller shutdown.

In my testing with coderd blocked, as expected, I did retain a
connection to the workspace agent. I suspect that in the scenarios where
the connection to the agent is lost, it is because we can't retry coordination.

## Fix

Keep the graceful-shutdown guard from
[`ba21ba87`](https://github.com/coder/coder/commit/ba21ba87ba2209fad3c9f4bb131d7de1fc0e58be)
exactly as-is, but narrow the post-dial exit condition so it keys off
the controller's own context instead of the error unwrap chain.

Before:

```go
if xerrors.Is(err, context.Canceled) ||
   xerrors.Is(err, context.DeadlineExceeded) {
    return
}
```

After:

```go
if c.ctx.Err() != nil {
    return
}
```

## Why this is the right behavior

This preserves the original graceful-shutdown invariant from
[`ba21ba87`](https://github.com/coder/coder/commit/ba21ba87ba2209fad3c9f4bb131d7de1fc0e58be)
while restoring retryability for transient transport failures:

- if `c.ctx` is canceled before dialing, the pre-dial guard still
prevents a phantom redial
- if `c.ctx` is canceled during a dial attempt, the error path still
exits cleanly because `c.ctx.Err()` is non-nil
- if a live controller hits a wrapped transport timeout, the loop no
longer dies and instead retries as intended

In other words, controller state remains the only authoritative signal
for loop shutdown.

## Non-regression coverage

This also preserves the earlier flaky-test fix from
[`ba21ba87`](https://github.com/coder/coder/commit/ba21ba87ba2209fad3c9f4bb131d7de1fc0e58be):

- `pipeDialer` still returns errors instead of asserting from background
goroutines
- `TestController_Disconnects` still waits for `uut.Closed()` before the
test exits

On top of that, this change adds focused controller tests that assert:

- a wrapped `net.OpError(context.DeadlineExceeded)` under a live
controller causes another dial attempt instead of closing the controller
- cancellation still shuts the controller down without an extra redial

## Validation

After blocking TCP connections to coderd for 20 minutes to force the
retry path, unblocking coderd allowed the daemon to recover on its own
without toggling Coder Connect.
2026-03-12 18:05:56 +11:00
Jon Ayers 6c44de951d feat: add Prometheus collector for DERP server expvar metrics (#22583)
This PR does three things:
- Exports derp expvars to the pprof endpoint
- Exports the expvar metrics as prometheus metrics in both coderd and
wsproxy
- Updates our tailscale to include a fix I also had to make to avoid a
data race

I generated this with mux but I also manually tested that the metrics
were getting properly emitted
2026-03-06 01:57:58 -06:00
Jon Ayers 43b8df86c1 fix: log WARN on ErrConnectionClosed in tailnet.Controller.Run (#22293) 2026-02-25 01:27:53 -06:00
Spike Curtis 393b3874ac feat: add UpdateAppStatus to the workspace agent API (#22219)
part of https://github.com/coder/coder/issues/21335  
  
This moves updating app status (used by Tasks) into the workspace agent
API over dRPC. This will allow us to update the status without having to
re-authenticate each time, like we would with an HTTP PATCH request.
  
Further PRs in this stack will pipe these requests thru from the CLI MCP
server to the agentsock and finally to this dRPC call to coderd.
2026-02-24 13:26:55 +04:00
Danielle Maywood 2de8cdf160 feat(agent): add subagent ID fields to devcontainers in manifest (#21848)
Update the agent protobuf schema (agent/proto/agent.proto) to include:
- subagent_id field in WorkspaceAgentDevcontainer message
- id field in CreateSubAgentRequest message

Bump the Agent API version from v2.7 to v2.8 and update all client
references throughout the codebase (ConnectRPC27 -> ConnectRPC28,
DRPCAgentClient27 -> DRPCAgentClient28).
2026-02-03 12:37:30 +00:00
Spike Curtis bddb808b25 chore: arrange imports in a standard way (#21452)
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:

```
import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"golang.org/x/xerrors"
	"gopkg.in/natefinch/lumberjack.v2"

	"cdr.dev/slog/v3"
	"github.com/coder/coder/v2/codersdk/agentsdk"
	"github.com/coder/serpent"
)
```

3 groups: standard library, 3rd party libs, Coder libs.

This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
2026-01-08 15:24:11 +04:00
Spike Curtis 49b34a716a fix: fix slog to always use array of Fields (#21426)
Upgrades to slog v3, which includes a small but backward-incompatible API change to the acceptable call arguments when logging. This change allows us to verify via compile-time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).

It also updates dependencies that also use slog and were updated.

I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.

Other dependencies, I pushed new tags.
2026-01-08 10:29:41 +04:00
Zach 9d1493a13a feat: add initial API for boundary log forwarding to coderd (#21293)
Add the AgentAPI changes to support the feature that transmits boundary
logs from workspaces to coderd via the agent API for eventual re-emission to
stderr. The API handlers are stubs for now because I'm trying to land
this feature from multiple smaller PRs.

High level architecture:
- Boundary records resource access in batches and sends proto message to
agent
- Agent proxies messages to coderd **(captured by the API changes in
this PR)**
- coderd re-emits logs to stderr

RFC:
https://www.notion.so/coderhq/Agent-Boundary-Logs-2afd579be59280f29629fc9823ac41ba
2025-12-19 10:41:39 -07:00
Spike Curtis 40df21ed62 fix: fixes use of possibly nil RemoteAddr() and LocalAddr() return values (#21076)
fixes: https://github.com/coder/internal/issues/1143

Both gVisor and the Go standard library implementations of `net.Conn` can under certain circumstances return `nil` for `RemoteAddr()` and `LocalAddr()` calls. If we call their methods, we segfault.

This PR fixes these calls and adds ruleguard rules.

Note that `slog.F("remote_addr", conn.RemoteAddr())` is fine because slog detects the `nil` before attempting to stringify the type.
2025-12-03 15:06:00 +04:00
Spike Curtis e96ab0ef59 test: fix PingDirect test tear down (#20687)
Fixes https://github.com/coder/internal/issues/66

The problem is a race during test tear down where the node callback can fire after the destination tailnet.Conn has already closed, causing an error.

The fix I have employed is to remove the callback in a t.Cleanup() and also refactor some tests to ensure they close the tailnet.Conn in a Cleanup deeper in the stack.
2025-11-11 10:34:31 +04:00
Dean Sheather e2ba9e7d62 chore: retry TestAgent_Dial subtests (#19387)
Closes https://github.com/coder/internal/issues/595
2025-08-18 13:51:19 +00:00
Dean Sheather bf78966256 chore: remove soft isolation configurability (#19069)
Undoes a lot of the changes in 5319d47dfa

Keeps the `netns.SetCoderSoftIsolation()` call, but always sets it to
`true` when using a TUN device.
2025-07-29 22:30:17 +10:00
Dean Sheather 5319d47dfa chore: add support for tailscale soft isolation in VPN (#19023) 2025-07-24 04:18:29 +00:00
Dean Sheather a1b87a67c6 fix: use client preferred URL for the default DERP (#18911)
The agentsdk currently does a remap of the DERP map to change the
EmbeddedRelay node's URL to match the agent's access URL.

This PR makes changes to the `workspacesdk` (used by clients like the
CLI) and `vpn` (used by Coder Desktop) to match this behavior.

This lets us try Coder clients in dogfood over a VPN without changing
the global access URL.
2025-07-17 20:17:44 +10:00
ケイラ fae30a00fd chore: remove unnecessary redeclarations in for loops (#18440) 2025-06-20 13:16:55 -06:00
ケイラ 5df70a613d feat: add organization scope for shared ports (#18314) 2025-06-16 16:15:59 -06:00
Spike Curtis af4a6682b4 fix: use tailscale that avoids small MTU paths (#18323)
Fixes #15523

Uses latest https://github.com/coder/tailscale which includes https://github.com/coder/tailscale/pull/85 to stop selecting paths with small MTU for direct connections.

Also updates the tailnet integration test to reproduce the issue. The previous version had the 2 peers connected by a single veth, but this allows the OS to fragment the packet. In the new version, the 2 peers (and server) are all connected by a central router. The link between peer 1 and the router has an adjustable MTU. IPv6 does not allow packets to be fragmented by intermediate routers, so sending a too-large packet in this scenario forces the router to drop packets and reproduce the issue (without the tailscale changes).
2025-06-11 14:16:25 +04:00
Spike Curtis 08eff7f433 chore: improve tailnet integration test (#18124)
Refactors tailnet integration test and adds UDP echo tests with different MTU related to #15523

I still haven't gotten to the bottom of what's causing the issue (the added test case I expected to fail actually succeeds), but these integration test improvements are generally useful.

also:
 * consolidates networking setup with easy and hard NAT
 * consolidates client setup
 * makes Client2 act like an agent at the tailnet layer, so it will send ReadyForHandshake and speed up the tunnel establishment
 * adds support for logging tunneled packets
 * adds support for dumping outer (underlay) IP traffic
 * adds support for adjusting veth MTU
 * adds support for IPv6 in the outer (underlay) network topology
2025-06-06 10:18:08 +04:00
Ethan 0076e8479f chore(vpn): send ping results over tunnel (#18200)
Closes #17982.

The purpose of this PR is to expose network latency via the API used by Coder Desktop.

This PR has the tunnel ping all known agents every 5 seconds, in order to produce an instance of:
```proto
message LastPing {
	// latency is the RTT of the ping to the agent.
	google.protobuf.Duration latency = 1;
	// did_p2p indicates whether the ping was sent P2P, or over DERP.
	bool did_p2p = 2;
	// preferred_derp is the human readable name of the preferred DERP region,
	// or the region used for the last ping, if it was sent over DERP.
	string preferred_derp = 3;
	// preferred_derp_latency is the last known latency to the preferred DERP
	// region. Unset if the region does not appear in the DERP map.
	optional google.protobuf.Duration preferred_derp_latency = 4;
}
```
The contents of this message are stored and included on all subsequent upsertions of the agent. 
Note that we upsert existing agents every 5 seconds to update the `last_handshake` value.

On the desktop apps, this message will be used to produce a tooltip similar to that of the VS Code extension:
<img width="495" alt="image" src="https://github.com/user-attachments/assets/d8b65f3d-f536-4c64-9af9-35c1a42c92d2" />
(wording not final)

Unlike the VS Code extension, we omit:
- The Latency of *all* available DERP regions. It seems not ideal to send a copy of this entire map for every online agent, and it certainly doesn't make sense for it to be on the `Agent` or `LastPing` message. 
If we do want to expose this info on Coder Desktop, we should consider how best to do so; maybe we want to include it on a more generic `Netcheck` message.
- The current throughput (Bytes up/down). This is out of scope of the linked issue, and is non-trivial to implement. I'm also not sure of the value given the frequency we're doing these status updates (every 5 seconds).
If we want to expose it, it'll be in a separate PR.

<img width="343" alt="image" src="https://github.com/user-attachments/assets/8447d03b-9721-4111-8ac1-332d70a1e8f1" />
2025-06-06 14:18:57 +10:00
Danielle Maywood b712d0b23f feat(coderd/agentapi): implement sub agent api (#17823)
Closes https://github.com/coder/internal/issues/619

Implement the `coderd` side of the AgentAPI for the upcoming
dev-container agents work.

`agent/agenttest/client.go` is left unimplemented for a future PR
working to implement the agent side of this feature.
2025-05-29 12:15:47 +01:00
Spike Curtis 6c0bed0f53 chore: update to coder/quartz v0.2.0 (#18007)
Upgrade to coder/quartz v0.2.0 including fixing up a minor API breaking change.
2025-05-27 16:05:03 +04:00
Danielle Maywood 61f22a59ba feat(agent): add ParentId to agent manifest (#17888)
Closes https://github.com/coder/internal/issues/648

This change introduces a new `ParentId` field to the agent's manifest.
This will allow an agent to know if it is a child or not, as well as
knowing who the owner is.

This is part of the Dev Container Agents work
2025-05-19 16:09:56 +01:00
Steven Masley 64807e1d61 chore: apply the 4mb max limit on drpc protocol message size (#17771)
Respect the 4mb max limit on proto messages
2025-05-13 11:24:51 -05:00
Steven Masley 37832413ba chore: resolve internal drpc package conflict (#17770)
Our internal drpc package name conflicts with the external one in usage. 
`drpc.*` == external
`drpcsdk.*` == internal
2025-05-12 10:31:38 -05:00
Steven Masley e4c6c10369 chore: fix comment regarding provisioner api version release (#17705)
See
https://github.com/coder/coder/commit/bc609d0056adeb11b1d2dc282db4d0ad20f3444b
2025-05-07 15:05:00 -05:00
Michael Suchacz 5f516ed135 feat: improve coder connect tunnel handling on reconnect (#17598)
Closes https://github.com/coder/internal/issues/563

The [Coder Connect
tunnel](https://github.com/coder/coder/blob/main/vpn/tunnel.go) receives
workspace state from the Coder server over a [dRPC
stream.](https://github.com/coder/coder/blob/114ba4593b2a82dfd41cdcb7fd6eb70d866e7b86/tailnet/controllers.go#L1029)
When first connecting to this stream, the current state of the user's
workspaces is received, with subsequent messages being diffs on top of
that state.

However, if the client disconnects from this stream, such as when the
user's device is suspended, and then reconnects later, no mechanism
exists for the tunnel to differentiate that message containing the
entire initial state from another diff, and so that state is incorrectly
applied as a diff.

In practice:
- Tunnel connects, receives a workspace update containing all the
existing workspaces & agents.
- Tunnel loses connection, but isn't completely stopped.
- All the user's workspaces are restarted, producing a new set of
agents.
- Tunnel regains connection, and receives a workspace update containing
all the existing workspaces & agents.
- This initial update is incorrectly applied as a diff, with the
Tunnel's state containing both the old & new agents.

This PR introduces a solution in which tunnelUpdater, when created,
sends a FreshState flag with the WorkspaceUpdate type. This flag is
handled in the vpn tunnel in the following fashion:
- Preserve existing Agents
- Remove current Agents in the tunnel that are not present in the
WorkspaceUpdate
- Remove unreferenced Workspaces
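
The handling above can be sketched with simplified types. Everything in this example (`agent`, `workspaceUpdate`, `tunnelState`, `apply`) is a hypothetical stand-in rather than the real `vpn` or `tailnet` types, deletions carried by ordinary diffs are omitted, and "unreferenced" is interpreted here as "not listed in the fresh update".

```go
package main

import "fmt"

// agent and workspaceUpdate are simplified, hypothetical stand-ins.
type agent struct {
	ID          string
	WorkspaceID string
}

type workspaceUpdate struct {
	FreshState bool
	Workspaces []string // workspace IDs
	Agents     []agent
}

type tunnelState struct {
	workspaces map[string]struct{}
	agents     map[string]agent
}

func newTunnelState() *tunnelState {
	return &tunnelState{
		workspaces: map[string]struct{}{},
		agents:     map[string]agent{},
	}
}

// apply merges an update. With FreshState set, the update is treated as a
// full snapshot: anything not listed in it is dropped afterwards.
func (s *tunnelState) apply(u workspaceUpdate) {
	listedWS := make(map[string]struct{}, len(u.Workspaces))
	for _, id := range u.Workspaces {
		s.workspaces[id] = struct{}{}
		listedWS[id] = struct{}{}
	}
	listedAgents := make(map[string]struct{}, len(u.Agents))
	for _, a := range u.Agents {
		s.agents[a.ID] = a
		listedAgents[a.ID] = struct{}{}
	}
	if !u.FreshState {
		return // plain diff: nothing else to reconcile in this sketch
	}
	for id := range s.agents {
		if _, ok := listedAgents[id]; !ok {
			delete(s.agents, id) // stale agent from before the reconnect
		}
	}
	for id := range s.workspaces {
		if _, ok := listedWS[id]; !ok {
			delete(s.workspaces, id)
		}
	}
}

func main() {
	s := newTunnelState()
	s.apply(workspaceUpdate{FreshState: true, Workspaces: []string{"ws1"},
		Agents: []agent{{ID: "old", WorkspaceID: "ws1"}}})
	// After a reconnect, the workspace has been restarted with a new agent.
	s.apply(workspaceUpdate{FreshState: true, Workspaces: []string{"ws1"},
		Agents: []agent{{ID: "new", WorkspaceID: "ws1"}}})
	fmt.Println(len(s.agents)) // 1
}
```

Without the `FreshState` branch, the second update would be merged as a diff and both the old and new agents would remain, which is the bug described above.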
2025-05-06 16:00:16 +02:00
Dean Sheather d566008087 fix: update tailscale to improve block endpoints functionality (#17496)
Direct endpoints from the peer will no longer be processed.
2025-04-22 09:32:21 +00:00
Spike Curtis 345435a04c feat: modify coordinators to send errors and peers to log them (#17467)
Adds support to our coordinator implementations to send Error updates before disconnecting clients.

I was recently debugging a connection issue where the client was getting repeatedly disconnected from the Coordinator, but since we never send any error information it was really hard without server logs.

This PR aims to correct that, by sending a CoordinateResponse with `Error` set in cases where we disconnect a client without them asking us to.

It also logs the error whenever we get one in the client controller.
2025-04-21 11:40:56 +04:00
ケイラ f670bc31f5 chore: update testutil chan helpers (#17408) 2025-04-16 10:37:09 -06:00
Michael Suchacz 06d39151dc feat: extend request logs with auth & DB info (#17304)
Closes #16903
2025-04-15 13:27:23 +02:00
Danny Kopping 0b18e458f4 fix: reduce excessive logging when database is unreachable (#17363)
Fixes #17045

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
2025-04-15 10:55:30 +02:00
Spike Curtis 9e2af3e127 feat: add configurable DNS match domain for tailnet connections (#17336)
Use the hostname suffix to set the DNS match domain when creating a Tailnet as part of the vpn `Tunnel`.

part of: #16828
2025-04-11 15:00:48 +04:00
Spike Curtis 2c573dc023 feat: vpn uses WorkspaceHostnameSuffix for DNS names (#17335)
Use the hostname suffix to set DNS names as programmed into the DNS service and returned by the vpn `Tunnel`.

part of: #16828
2025-04-11 13:24:20 +04:00
Ethan 3c1cb5d05a chore: add generic DNS record for checking if Coder Connect is running (#17298)
Closes https://github.com/coder/internal/issues/466

```
$ dig -6 @fd60:627a:a42b::53 is.coder--connect--enabled--right--now.coder AAAA

; <<>> DiG 9.10.6 <<>> -6 @fd60:627a:a42b::53 is.coder--connect--enabled--right--now.coder AAAA
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62390
;; flags: qr aa rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;is.coder--connect--enabled--right--now.coder. IN AAAA

;; ANSWER SECTION:
is.coder--connect--enabled--right--now.coder. 2	IN AAAA	fd60:627a:a42b::53

;; Query time: 3 msec
;; SERVER: fd60:627a:a42b::53#53(fd60:627a:a42b::53)
;; WHEN: Wed Apr 09 16:59:18 AEST 2025
;; MSG SIZE  rcvd: 134
```

Hostname considerations:
- Workspace names, usernames, and agent names can't have double hyphens, so this name can't conflict with a real Coder Connect hostname.
- Components can't start or end with hyphens according to [RFC 952](https://www.rfc-editor.org/rfc/rfc952.html)
- DNS records can't have hyphens in the 3rd and 4th positions, so as not to conflict with IDNs https://datatracker.ietf.org/doc/html/rfc5891
2025-04-11 13:59:25 +10:00
Jon Ayers 17ddee05e5 chore: update golang to 1.24.1 (#17035)
- Update go.mod to use Go 1.24.1
- Update GitHub Actions setup-go action to use Go 1.24.1
- Fix linting issues with golangci-lint by:
  - Updating to golangci-lint v1.57.1 (more compatible with Go 1.24.1)

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <claude@anthropic.com>
2025-03-26 01:56:39 -05:00
Mathias Fredriksson b79167293c chore(Makefile): update golden files as part of make gen (#17039)
Updating golden files is an unnecessary extra step in addition to gen
that is easily overlooked. The developer then only notices the issue in
CI, losing time waiting for tests to complete.
2025-03-21 13:04:30 +00:00
Eng Zer Jun 04c33968cf refactor: replace golang.org/x/exp/slices with slices (#16772)
The experimental functions in `golang.org/x/exp/slices` are now
available in the standard library since Go 1.21.

Reference: https://go.dev/doc/go1.21#slices

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2025-03-04 00:46:49 +11:00
Thomas Kosiewski d0e2060692 feat(agent): add second SSH listener on port 22 (#16627)
Fixes: https://github.com/coder/internal/issues/377

Added an additional SSH listener on port 22, so the agent now listens on both port 1 and port 22.

---
Change-Id: Ifd986b260f8ac317e37d65111cd4e0bd1dc38af8
Signed-off-by: Thomas Kosiewski <tk@coder.com>
2025-03-03 04:47:42 +01:00
Hugo Dutka 44499315ed chore: reduce log volume on server startup (#16608)
Addresses https://github.com/coder/coder/issues/16231.

This PR reduces the volume of logs we print after server startup in
order to surface the web UI URL better.

Here are the logs after the changes a couple of seconds after starting
the server:

<img width="868" alt="Screenshot 2025-02-18 at 16 31 32"
src="https://github.com/user-attachments/assets/786dc4b8-7383-48c8-a5c3-a997c01ca915"
/>

The warning is due to running a development site-less build. It wouldn't
show in a release build.
2025-02-20 16:33:14 +01:00
Mathias Fredriksson b07b33ec9d feat: add agentapi endpoint to report connections for audit (#16507)
This change adds a new `ReportConnection` endpoint to the `agentapi`.

The protocol version was bumped previously, so it has been omitted here.

This allows the agent to report connection events, for example when the
user connects to the workspace via SSH or VS Code.

Updates #15139
2025-02-20 14:52:01 +02:00
Ethan 92870f0642 fix: force lowercase DNS hostnames for VPN (#16613)
Closes https://github.com/coder/coder-desktop-macos/issues/54


I've also double checked that agents with hyphens & underscores play nice once programmed, as do workspaces with hyphens:

```
$ ping6 main_agent-1.main-workspace.admin.coder
PING6(56=40+8+8 bytes) fd60:627a:a42b:4e91:88c0:da4a:df4f:b54e --> fd60:627a:a42b:46d4:8b55:e549:e498:e6f5
```
also fine in Firefox & Safari, though I'm a little surprised underscores work.
2025-02-20 13:02:45 +11:00
Vincent Vielle bc609d0056 feat: integrate agentAPI with resources monitoring logic (#16438)
As part of the new resources monitoring logic, specifically for OOM and
OOD notifications, we need to update the AgentAPI and the agent logic.

This PR does that. Specifically:  
We are updating the AgentAPI & TailnetAPI to version 24 to add two new
methods in the AgentAPI:
- One method to fetch the resources monitoring configuration
- One method to push the datapoints for the resources monitoring.

This PR also adds new logic on the agent side: a routine that runs on a
ticker, fetching the resource usage on each tick and storing it in a
FIFO-like queue.

Finally, this PR fixes a problem we had with RBAC logic on the resources
monitoring model, applying the same logic as we have for similar
entities.
2025-02-14 10:28:15 +01:00
Mathias Fredriksson c069563af1 test: fix use of t.Logf where t.Log would suffice (#16328) 2025-01-29 14:35:04 +00:00
Dean Sheather 28088165a1 chore: get TUN/DNS working on Windows for CoderVPN (#16310) 2025-01-29 08:09:36 +00:00
Thomas Kosiewski 1336925c9f feat(flake.nix): switch dogfood dev image to buildNixShellImage from dockerTools (#16223)
Replace Depot build action with Nix for Nix dogfood image builds

The dogfood Nix image is now built using Nix's native container tooling instead of Depot. This change:

- Adds Nix setup steps to the GitHub Actions workflow
- Removes the Dockerfile.nix in favor of a Nix-native container build
- Updates the flake.nix to support building Docker images
- Introduces a hash file to track Nix-related changes
- Updates the vendorHash for Go dependencies

Change-Id: I4e011fe3a19d9a1375fbfd5223c910e59d66a5d9
Signed-off-by: Thomas Kosiewski <tk@coder.com>
2025-01-28 16:38:37 +01:00
Cian Johnston 7b88776403 chore(testutil): add testutil.GoleakOptions (#16070)
- Adds `testutil.GoleakOptions` and consolidates existing options to
this location
- Pre-emptively adds required ignore for this Dependabot PR to pass CI
https://github.com/coder/coder/pull/16066
2025-01-08 15:38:37 +00:00
Spike Curtis 2c7f8ac65f chore: migrate to coder/websocket 1.8.12 (#15898)
Migrates us to `coder/websocket` v1.8.12 rather than `nhooyr/websocket` on an older version.

Works around https://github.com/coder/websocket/issues/504 by adding an explicit test for `xerrors.Is(err, io.EOF)` where we were previously getting `io.EOF` from the netConn.
2024-12-19 00:51:30 +04:00
Ethan ba48069325 chore: implement CoderVPN client & tunnel (#15612)
Addresses #14734.

This PR wires up `tunnel.go` to a `tailnet.Conn` via the new `/tailnet` endpoint, with all the necessary controllers such that a VPN connection can be started, stopped and inspected via the CoderVPN protocol.
2024-12-05 13:30:22 +11:00
Spike Curtis 029cd5d064 fix(tailnet): prevent redial after Coord graceful restart (#15586)
fixes: https://github.com/coder/internal/issues/217

> There are a couple problems:
>
> One is that we assert the RPCs succeed, but if the pipeDialer context is canceled at the end of the test, then these assertions happen after the test is officially complete, which panics and affects other tests.

This converts these to just return the error rather than assert.

> The other is that the retrier is slightly bugged: if the current retry delay is 0 AND the ctx is done, (e.g. after successfully connecting and then gracefully disconnecting), then retrier.Wait(c.ctx) is racy and could return either true or false.

Fixes the phantom redial by explicitly checking the context before dialing. Also, in the test, we assert that the controller is closed before completing the test.
2024-11-19 11:37:11 +04:00