coder

mirror of https://github.com/coder/coder.git synced 2026-06-02 20:48:20 +00:00

Author	SHA1	Message	Date
Steven Masley	51b531f5b3	chore: 'go generate' mockgen to use `go tool` wrapper (#25490 ) Calling `mockgen` relies on the executable in the `$PATH`. Using `go tool` uses the one defined in `go.mod`	2026-05-19 14:53:13 +00:00
Steven Masley	1afc6d4fd0	feat: structured disconnect attribution for agent logs (#25191 ) Implements [PLAT-60](https://linear.app/codercom/issue/PLAT-60/enhance-disconnect-logs-with-structured-reason-attribution): adds structured disconnect attribution to disconnect logs throughout the agent and tailnet packages. Every disconnect log site now carries structured slog fields. All existing logs remain; existing messages are preserved with the fields added alongside. New fields on disconnect log lines: - `connect_type` — which layer disconnected: `server_to_agent`, `agent_to_client`, or `client_to_server` - `disconnect_reason` — categorical reason: `graceful`, `network_error`, `server_shutdown`, etc. - `disconnect_expected` — whether the disconnect is normal operation (`true`) or should be investigated (`false`) - `disconnect_initiator` — who started it: `client`, `agent`, `server`, or `network` (control-plane sites only) - `disconnect_detail` — free-form supplemental info (where useful) ## What's covered Control plane (`server_to_agent`): coordination RPC, DERP map subscriber, agent runLoop, agent Close, `BasicCoordination.Close`, `Controller.run`. Data plane (`agent_to_client`): SSH sessions, reconnecting PTY, JetBrains port-forwarding. 
<details> <summary>Control-plane sites</summary> \| Site \| Reason \| Initiator \| \|---\|---\|---\| \| `agent/agent.go` `runLoop` EOF \| `network_error` \| `network` \| \| `agent/agent.go` `runCoordinator` deferred exit \| `server_shutdown` / `graceful` / `network_error` \| `agent` / `server` / `network` \| \| `agent/agent.go` `runDERPMapSubscriber` deferred exit \| same (shared `classifyCoordinatorRPCExit`) \| same \| \| `agent/agent.go` `Close` shutdown timeout \| `server_shutdown` + detail \| `agent` \| \| `agent/agent.go` `Close` clean coord disconnect \| `server_shutdown` \| `agent` \| \| `tailnet/controllers.go` `BasicCoordination.Close` \| `graceful` or `network_error` \| `c.initiator` \| \| `tailnet/controllers.go` `Controller.run` `net.ErrClosed` \| `network_error` \| `network` \| </details> <details> <summary>Data-plane sites</summary> \| Site \| Reason \| Notes \| \|---\|---\|---\| \| `agent/agentssh/agentssh.go` SSH session closed \| free-form (`graceful`, `process exited with error status: N`, etc.) \| Also sets `closeCause("normal exit")` for clean exits so coderd's `connection_log.DisconnectReason` is no longer empty \| \| `agent/reconnectingpty/server.go` PTY closed \| `server_shutdown`, error string, or `graceful` \| \| \| `agent/agentssh/jetbrainstrack.go` channel closed \| `normal close` or error string \| Previously passed empty reason \| </details> <details> <summary>Bug fix</summary> The deferred `disconnected from coordination RPC` log no longer fires when the initial `Coordinate()` RPC call fails before any connection is established. </details> Refs PLAT-60. --- _This PR was prepared by Coder Agents on behalf of @Emyrk._ Manually QA'd a lot of common disconnects --------- Co-authored-by: Coder Agents <noreply@coder.com>	2026-05-19 09:47:03 -05:00
Jon Ayers	5e4647bb3a	fix: synchronize access to drpc Send (#24600 )	2026-05-06 14:14:10 -05:00
Jon Ayers	2cab1b41ad	fix: increase MaxMessageSize to 16 MiB (#24599 )	2026-05-06 10:12:56 -05:00
Sas Swart	1ba7139f21	feat: add session correlation fields to BoundaryLog proto (#24809 ) 1 of 9 [next >>](https://github.com/coder/coder/pull/24811) RFC: [Bridge ↔ Boundaries Correlation RFC](https://www.notion.so/Bridge-Boundaries-Correlation-313d579be59281f3b4efdbfd6896775a) Adds three new proto fields for boundary session correlation. `ReportBoundaryLogsRequest` - `session_id` (string, field 2) — UUID generated by boundary at startup, shared across all batches from a single run. - `confined_process` (string, field 3) — name of the confined process (e.g. `claude-code`, `codex`, `copilot`). `BoundaryLog` - `sequence_number` (uint64, field 4) — monotonically increasing counter per session, primary ordering key when boundary is in use. `BoundaryLog.time` already existed at field 2; no change needed there. API version bumped to v2.9. No behaviour change in coderd or the agent. This is a pure schema bump that the boundary repo will consume in its own stack. > Generated by Coder Agents	2026-05-05 10:36:26 +02:00
Garrett Delfosse	54d650ea79	fix(tailnet): preserve DNS hosts across control plane reconnections (#24253 ) When the control plane connection drops and reconnects, a new `tunnelUpdater` is created with empty workspace state. This causes the in-memory DNS resolver to lose all host records, breaking `.coder` name resolution until the server sends a fresh workspace snapshot. If the API is unreachable (e.g., the route goes through a VPN that is also reconnecting), the snapshot never arrives and DNS stays broken indefinitely — requiring a full Coder Desktop restart. Fix: carry workspace state from the previous `tunnelUpdater` to the new one on reconnect, and immediately re-apply DNS hosts so the resolver stays populated during the reconnection window. Fixes https://linear.app/codercom/issue/PLAT-110 <details><summary>Investigation & decision log</summary> ### Root cause analysis Customer diagnostic data from Roblox (March 31) showed: - NRPT rule present (`.coder` → `fd60:627a:a42b::53`) — routing is correct - DNS resolver returns NXDOMAIN for everything including the sentinel `is.coder--connect--enabled--right--now.coder` — resolver is running but has zero host records - Coder Connect UI shows "connected" — the WireGuard data plane is up The resolver is empty because `TunnelAllWorkspaceUpdatesController.New()` creates a fresh `tunnelUpdater` with `workspaces: make(map[uuid.UUID]Workspace)` (empty). The previous updater's workspace data is discarded. If the server's workspace snapshot is delayed or the API is unreachable, the resolver has no records to serve. This is compounded by GlobalProtect VPN reconnects: the Coder API is behind the VPN, so when GP reconnects, the API route is temporarily lost and the snapshot can't arrive. 
### What this PR changes - `TunnelAllWorkspaceUpdatesController.New()` now clones workspace state from the previous updater before creating the new one - Immediately re-applies DNS hosts with the inherited state (log: `re-applying DNS hosts from previous session`) - When the server's snapshot arrives, it replaces the inherited data normally - If `SetDNSHosts` fails during re-apply, it's logged as a warning and not fatal — the recvLoop will program DNS when the snapshot arrives ### What this PR does NOT fix (future work) - Tunnel binary restart: when the tunnel process itself is killed and relaunched, all in-memory state is lost. A DNS host cache on disk would be needed for this case. - NRPT rule cleanup on startup: the Tailscale fork's `nrptRuleDatabase` constructor unconditionally deletes all NRPT rules on engine creation. Deferring cleanup to the first successful `SetDNS` call would reduce the DNS gap. - Hosts file retry*: the `setHosts()` retry in the Tailscale fork (5×10ms) is too short for environments where endpoint security locks the file. These are tracked as follow-up items in the `coder/tailscale` fork. </details> > 🤖 Generated by Coder Agents	2026-04-29 12:29:44 -04:00
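The state carry-over described in this commit can be sketched roughly as follows. The `workspace` and `updater` types and the `newUpdater` function are simplified, hypothetical stand-ins for `tunnelUpdater` and `TunnelAllWorkspaceUpdatesController.New()`; the real types carry much more state.

```go
package main

import "fmt"

// workspace is a hypothetical, reduced stand-in for the tailnet Workspace type.
type workspace struct {
	Name  string
	Hosts []string
}

// updater is a hypothetical stand-in for tunnelUpdater.
type updater struct {
	workspaces map[string]workspace
}

// newUpdater copies workspace state from the previous updater, if any, so
// DNS hosts can be re-applied before the server's snapshot arrives.
func newUpdater(prev *updater) *updater {
	next := &updater{workspaces: make(map[string]workspace)}
	if prev == nil {
		return next
	}
	for id, ws := range prev.workspaces {
		next.workspaces[id] = ws
	}
	return next
}

func main() {
	first := newUpdater(nil)
	first.workspaces["w1"] = workspace{Name: "main", Hosts: []string{"main.coder"}}
	second := newUpdater(first) // simulated reconnect
	fmt.Println(len(second.workspaces))
}
```

Copying into a new map, rather than sharing the old one, keeps the previous updater's state untouched while the new one is later overwritten by the snapshot.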
Spike Curtis	4c1a32cd7c	feat: wire DERPTLSConfig through CLI, SDK, tailnet, VPN, agent, and health checks (#24435 ) Wire DERPTLSConfig through the CLI, SDK, tailnet, VPN client, agent, and health checks to allow custom TLS configuration for DERP connections. The main use case is to be able to set a custom CA and also present client certs (mTLS). See https://github.com/coder/tailscale/pull/105 for related changes. Adds three new global CLI flags: - `--client-tls-ca-file` / `CODER_CLIENT_TLS_CA_FILE` - `--client-tls-cert-file` / `CODER_CLIENT_TLS_CERT_FILE` - `--client-tls-key-file` / `CODER_CLIENT_TLS_KEY_FILE` Based on community PR #22695 by @ibdafna, with autogeneration issues fixed (protobuf version mismatches in .pb.go files, golden file regeneration, lint fixes). > [!NOTE] > This PR was authored by Coder Agents on behalf of a Coder team member. <details> <summary>Relationship to #22695</summary> This is a clean reimplementation of the changes from #22695 on top of current `main`, with the following differences: - Removed: Accidental protobuf version changes in `.pb.go` files (contributor had `protoc v6.33.4` vs project's `protoc v4.23.4`) - Added: Properly regenerated golden files and docs via `make gen` - Fixed: Lint issue (`var-declaration` revive warning on explicit type in `createHTTPClient`) - All meaningful code changes are identical to the original PR </details>	2026-04-16 12:46:52 -04:00
Cian Johnston	847a88c6ca	chore: clean up stale and dangerous //nolint comments (#23643 ) ## Changes - Commit 1: Remove 17 unnecessary `//nolint` directives: - `//nolint:varnamelen` — linter not active - `//nolint:unused` on exported `SlimUnsupported` - `//nolint:govet` in `coderd/httpmw/csrf` — no longer fires - `//nolint:revive` on functions refactored since the nolint was added - `//nolint:paralleltest` citing Go 1.22 loop variable capture (obsolete) - Bare `//nolint` narrowed to specific `//nolint:gocritic` with justification - Commit 2: Fix root causes behind 5 dangerous nolint suppressions: - Add `MinVersion: tls.VersionTLS12` to TLS client config (removes `gosec` G402) - Delete trivial unexported wrappers `apiKey()`/`normalizeProvider()` in chatprovider (removes `revive` confusing-naming) - Add doc comments to `StartWithAssert` and `Router` (removes `revive` exported) - Rename unused parameters to `_` in integration test helpers > 🤖 This PR was created using Coder Agents and reviewed by me.	2026-03-26 14:13:53 +00:00
Ethan	5130404f2a	fix(tailnet): retry after transport dial timeouts (#22977 ) _Generated with mux but reviewed by a human_ This PR fixes a bug where Coder Desktop could stop retrying connections to coderd after a prolonged network interruption. When that happened, the client would no longer recoordinate or receive workspace updates, even after connectivity returned. This is likely the long-standing “stale connection” issue that has been reported both internally and by customers. In practice, it would cause all Coder Desktop workspaces to appear yellow or red in the UI and become unreachable. The underlying behavior matches the reports: peers are removed after 15 minutes without a handshake. So if network connectivity is lost for that long, the client must recoordinate to recover. This bug prevented that recoordination from happening. For that reason, I’m marking this as: Closes https://github.com/coder/coder-desktop-macos/issues/227 ## Problem The tailnet controller owns a long-lived retry loop in `Controller.Run`. That loop already had an important graceful-shutdown guard added in [`ba21ba87`](https://github.com/coder/coder/commit/ba21ba87ba2209fad3c9f4bb131d7de1fc0e58be) to prevent a phantom redial after cancellation: ```go if c.ctx.Err() != nil { return } ``` That guard was correct. It made controller lifetime depend on the controller's own context rather than on retry timing races. But the post-dial error path had since grown a broader terminal check: ```go if xerrors.Is(err, context.Canceled) || xerrors.Is(err, context.DeadlineExceeded) { return } ``` That turns out to be too broad for desktop reconnects. A dial attempt can fail with a wrapped `context.DeadlineExceeded` even while the controller's own context is still live. ## Why that happens The workspace tailnet dialer uses the SDK HTTP client, which inherits `http.DefaultTransport`. That transport uses a `net.Dialer` with a 30s `Timeout`. 
Go implements that timeout by creating an internal deadline-bound sub-context for the TCP connect. So during a control-plane blackhole, the reconnect path can look like this: - the existing control-plane connection dies - `Controller.Run` re-enters the retry path - the next websocket/TCP dial hangs against unreachable coderd - `net.Dialer` times out the connect after ~30s - the returned error unwraps to `context.DeadlineExceeded` - `Controller.Run` treats that as terminal and returns - the retry goroutine exits forever even though `c.ctx` is still alive At that point the data plane can remain partially alive, the desktop app can still look online, and unblocking coderd does nothing because the process is no longer trying to redial. ## How this was found We reproduced the issue in the macOS vpn-daemon process with temporary diagnostics, blackholed coderd with `pfctl`, and captured multiple goroutine dumps while the daemon was wedged. Those dumps showed: - `manageGracefulTimeout` was still blocked on `<-c.ctx.Done()`, proving the controller context was not canceled - the `Controller.Run` retry goroutine was missing from later dumps - control-plane consumers stayed idle longer over time - once coderd became reachable again the daemon still did not dial it That narrowed the failure from "slow retry" to "retry loop exited", and tracing the dial path back through `http.DefaultTransport` and `net.Dialer` explained why a transport timeout was being mistaken for controller shutdown. In my testing with coderd blocked, as expected, I did retain a connection to the workspace agent. I suspect the scenarios where connection to the agent are lost is because we can't retry coordination. ## Fix Keep the graceful-shutdown guard from [`ba21ba87`](https://github.com/coder/coder/commit/ba21ba87ba2209fad3c9f4bb131d7de1fc0e58be) exactly as-is, but narrow the post-dial exit condition so it keys off the controller's own context instead of the error unwrap chain. 
Before: ```go if xerrors.Is(err, context.Canceled) || xerrors.Is(err, context.DeadlineExceeded) { return } ``` After: ```go if c.ctx.Err() != nil { return } ``` ## Why this is the right behavior This preserves the original graceful-shutdown invariant from [`ba21ba87`](https://github.com/coder/coder/commit/ba21ba87ba2209fad3c9f4bb131d7de1fc0e58be) while restoring retryability for transient transport failures: - if `c.ctx` is canceled before dialing, the pre-dial guard still prevents a phantom redial - if `c.ctx` is canceled during a dial attempt, the error path still exits cleanly because `c.ctx.Err()` is non-nil - if a live controller hits a wrapped transport timeout, the loop no longer dies and instead retries as intended In other words, controller state remains the only authoritative signal for loop shutdown. ## Non-regression coverage This also preserves the earlier flaky-test fix from [`ba21ba87`](https://github.com/coder/coder/commit/ba21ba87ba2209fad3c9f4bb131d7de1fc0e58be): - `pipeDialer` still returns errors instead of asserting from background goroutines - `TestController_Disconnects` still waits for `uut.Closed()` before the test exits On top of that, this change adds focused controller tests that assert: - a wrapped `net.OpError(context.DeadlineExceeded)` under a live controller causes another dial attempt instead of closing the controller - cancellation still shuts the controller down without an extra redial ## Validation After blocking TCP connections to coderd for 20 minutes to force the retry path, unblocking coderd allowed the daemon to recover on its own without toggling Coder Connect.	2026-03-12 18:05:56 +11:00
Jon Ayers	6c44de951d	feat: add Prometheus collector for DERP server expvar metrics (#22583 ) This PR does three things: - Exports derp expvars to the pprof endpoint - Exports the expvar metrics as prometheus metrics in both coderd and wsproxy - Updates our tailscale to a fix I also had to make to avoid a data race condition I generated this with mux but I also manually tested that the metrics were getting properly emitted	2026-03-06 01:57:58 -06:00
Jon Ayers	43b8df86c1	fix: log WARN on ErrConnectionClosed in tailnet.Controller.Run (#22293 )	2026-02-25 01:27:53 -06:00
Spike Curtis	393b3874ac	feat: add UpdateAppStatus to the workspace agent API (#22219 ) <!-- If you have used AI to produce some or all of this PR, please ensure you have read our [AI Contribution guidelines](https://coder.com/docs/about/contributing/AI_CONTRIBUTING) before submitting. --> part of https://github.com/coder/coder/issues/21335 This moves updating app status (used by Tasks) into the workspace agent API over dRPC. This will allow us to update the status without having to re-authenticate each time, like we would with an HTTP PATCH request. Further PRs in this stack will pipe these requests thru from the CLI MCP server to the agentsock and finally to this dRPC call to coderd.	2026-02-24 13:26:55 +04:00
Danielle Maywood	2de8cdf160	feat(agent): add subagent ID fields to devcontainers in manifest (#21848 ) Update the agent protobuf schema (agent/proto/agent.proto) to include: - subagent_id field in WorkspaceAgentDevcontainer message - id field in CreateSubAgentRequest message Bump the Agent API version from v2.7 to v2.8 and update all client references throughout the codebase (ConnectRPC27 -> ConnectRPC28, DRPCAgentClient27 -> DRPCAgentClient28).	2026-02-03 12:37:30 +00:00
Spike Curtis	bddb808b25	chore: arrange imports in a standard way (#21452 ) Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example: ``` import ( "context" "time" "github.com/prometheus/client_golang/prometheus" "golang.org/x/xerrors" "gopkg.in/natefinch/lumberjack.v2" "cdr.dev/slog/v3" "github.com/coder/coder/v2/codersdk/agentsdk" "github.com/coder/serpent" ) ``` 3 groups: standard library, 3rd party libs, Coder libs. This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.	2026-01-08 15:24:11 +04:00
Spike Curtis	49b34a716a	fix: fix slog to always use array of Fields (#21426 ) Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptable call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder). It also updates dependencies that also use slog and were updated. I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule. Other dependencies, I pushed new tags.	2026-01-08 10:29:41 +04:00
Zach	9d1493a13a	feat: add initial API for boundary log forwarding to coderd (#21293 ) Add the AgentAPI changes to support the feature that transmits boundary logs from workspaces to coderd via the agent API for eventual re-emission to stderr. The API handlers are stubs for now because I'm trying to land this feature from multiple smaller PRs. High level architecture: - Boundary records resource access in batches and sends proto message to agent - Agent proxies messages to coderd (captured by the API changes in this PR) - coderd re-emits logs to stderr RFC: https://www.notion.so/coderhq/Agent-Boundary-Logs-2afd579be59280f29629fc9823ac41ba	2025-12-19 10:41:39 -07:00
Spike Curtis	40df21ed62	fix: fixes use of possibly nil RemoteAddr() and LocalAddr() return values (#21076 ) fixes: https://github.com/coder/internal/issues/1143 Both gVisor and the Go standard library implementations of `net.Conn` can under certain circumstances return `nil` for `RemoteAddr()` and `LocalAddr()` calls. If we call their methods, we segfault. This PR fixes these calls and adds ruleguard rules. Note that `slog.F("remote_addr", conn.RemoteAddr())` is fine because slog detects the `nil` before attempting to stringify the type.	2025-12-03 15:06:00 +04:00
Spike Curtis	e96ab0ef59	test: fix PingDirect test tear down (#20687 ) Fixes https://github.com/coder/internal/issues/66 The problem is a race during test tear down where the node callback can fire after the destination tailnet.Conn has already closed, causing an error. The fix I have employed is to remove the callback in a t.Cleanup() and also refactor some tests to ensure they close the tailnet.Conn in a Cleanup deeper in the stack.	2025-11-11 10:34:31 +04:00
Dean Sheather	e2ba9e7d62	chore: retry TestAgent_Dial subtests (#19387 ) Closes https://github.com/coder/internal/issues/595	2025-08-18 13:51:19 +00:00
Dean Sheather	bf78966256	chore: remove soft isolation configurability (#19069 ) Undoes a lot of the changes in `5319d47dfa` Keeps the `netns.SetCoderSoftIsolation()` call, but always sets it to `true` when using a TUN device.	2025-07-29 22:30:17 +10:00
Dean Sheather	5319d47dfa	chore: add support for tailscale soft isolation in VPN (#19023 )	2025-07-24 04:18:29 +00:00
Dean Sheather	a1b87a67c6	fix: use client preferred URL for the default DERP (#18911 ) The agentsdk currently does a remap of the DERP map to change the EmbeddedRelay node's URL to match the agent's access URL. This PR makes changes to the `workspacesdk` (used by clients like the CLI) and `vpn` (used by Coder Desktop) to match this behavior. This enables us the ability to try Coder clients in dogfood over a VPN without changing the global access URL.	2025-07-17 20:17:44 +10:00
ケイラ	fae30a00fd	chore: remove unnecessary redeclarations in for loops (#18440 )	2025-06-20 13:16:55 -06:00
ケイラ	5df70a613d	feat: add organization scope for shared ports (#18314 )	2025-06-16 16:15:59 -06:00
Spike Curtis	af4a6682b4	fix: use tailscale that avoids small MTU paths (#18323 ) Fixes #15523 Uses latest https://github.com/coder/tailscale which includes https://github.com/coder/tailscale/pull/85 to stop selecting paths with small MTU for direct connections. Also updates the tailnet integration test to reproduce the issue. The previous version had the 2 peers connected by a single veth, but this allows the OS to fragment the packet. In the new version, the 2 peers (and server) are all connected by a central router. The link between peer 1 and the router has an adjustable MTU. IPv6 does not allow packets to be fragmented by intermediate routers, so sending a too-large packet in this scenario forces the router to drop packets and reproduce the issue (without the tailscale changes).	2025-06-11 14:16:25 +04:00
Spike Curtis	08eff7f433	chore: improve tailnet integration test (#18124 ) Refactors tailnet integration test and adds UDP echo tests with different MTU related to #15523 I still haven't gotten to the bottom of what's causing the issue (the added test case I expected to fail actually succeeds), but these integration test improvements are generally useful. also: * consolidates networking setup with easy and hard NAT * consolidates client setup * makes Client2 act like an agent at the tailnet layer, so it will send ReadyForHandshake and speed up the tunnel establishment * adds support for logging tunneled packets * adds support for dumping outer (underlay) IP traffic * adds support for adjusting veth MTU * adds support for IPv6 in the outer (underlay) network topology	2025-06-06 10:18:08 +04:00
Ethan	0076e8479f	chore(vpn): send ping results over tunnel (#18200 ) Closes #17982. The purpose of this PR is to expose network latency via the API used by Coder Desktop. This PR has the tunnel ping all known agents every 5 seconds, in order to produce an instance of: ```proto message LastPing { // latency is the RTT of the ping to the agent. google.protobuf.Duration latency = 1; // did_p2p indicates whether the ping was sent P2P, or over DERP. bool did_p2p = 2; // preferred_derp is the human readable name of the preferred DERP region, // or the region used for the last ping, if it was sent over DERP. string preferred_derp = 3; // preferred_derp_latency is the last known latency to the preferred DERP // region. Unset if the region does not appear in the DERP map. optional google.protobuf.Duration preferred_derp_latency = 4; } ``` The contents of this message are stored and included on all subsequent upsertions of the agent. Note that we upsert existing agents every 5 seconds to update the `last_handshake` value. On the desktop apps, this message will be used to produce a tooltip similar to that of the VS Code extension: <img width="495" alt="image" src="https://github.com/user-attachments/assets/d8b65f3d-f536-4c64-9af9-35c1a42c92d2" /> (wording not final) Unlike the VS Code extension, we omit: - The Latency of all available DERP regions. It seems not ideal to send a copy of this entire map for every online agent, and it certainly doesn't make sense for it to be on the `Agent` or `LastPing` message. If we do want to expose this info on Coder Desktop, we should consider how best to do so; maybe we want to include it on a more generic `Netcheck` message. - The current throughput (Bytes up/down). This is out of scope of the linked issue, and is non-trivial to implement. I'm also not sure of the value given the frequency we're doing these status updates (every 5 seconds). If we want to expose it, it'll be in a separate PR. 
<img width="343" alt="image" src="https://github.com/user-attachments/assets/8447d03b-9721-4111-8ac1-332d70a1e8f1" />	2025-06-06 14:18:57 +10:00
Danielle Maywood	b712d0b23f	feat(coderd/agentapi): implement sub agent api (#17823 ) Closes https://github.com/coder/internal/issues/619 Implement the `coderd` side of the AgentAPI for the upcoming dev-container agents work. `agent/agenttest/client.go` is left unimplemented for a future PR working to implement the agent side of this feature.	2025-05-29 12:15:47 +01:00
Spike Curtis	6c0bed0f53	chore: update to coder/quartz v0.2.0 (#18007 ) Upgrade to coder/quartz v0.2.0 including fixing up a minor API breaking change.	2025-05-27 16:05:03 +04:00
Danielle Maywood	61f22a59ba	feat(agent): add `ParentId` to agent manifest (#17888 ) Closes https://github.com/coder/internal/issues/648 This change introduces a new `ParentId` field to the agent's manifest. This will allow an agent to know if it is a child or not, as well as knowing who the owner is. This is part of the Dev Container Agents work	2025-05-19 16:09:56 +01:00
Steven Masley	64807e1d61	chore: apply the 4mb max limit on drpc protocol message size (#17771 ) Respect the 4mb max limit on proto messages	2025-05-13 11:24:51 -05:00
Steven Masley	37832413ba	chore: resolve internal drpc package conflict (#17770 ) Our internal drpc package name conflicts with the external one in usage. `drpc.` == external `drpcsdk.` == internal	2025-05-12 10:31:38 -05:00
Steven Masley	e4c6c10369	chore: fix comment regarding provisioner api version release (#17705 ) See https://github.com/coder/coder/commit/bc609d0056adeb11b1d2dc282db4d0ad20f3444b	2025-05-07 15:05:00 -05:00
Michael Suchacz	5f516ed135	feat: improve coder connect tunnel handling on reconnect (#17598 ) Closes https://github.com/coder/internal/issues/563 The [Coder Connect tunnel](https://github.com/coder/coder/blob/main/vpn/tunnel.go) receives workspace state from the Coder server over a [dRPC stream.](https://github.com/coder/coder/blob/114ba4593b2a82dfd41cdcb7fd6eb70d866e7b86/tailnet/controllers.go#L1029) When first connecting to this stream, the current state of the user's workspaces is received, with subsequent messages being diffs on top of that state. However, if the client disconnects from this stream, such as when the user's device is suspended, and then reconnects later, no mechanism exists for the tunnel to differentiate that message containing the entire initial state from another diff, and so that state is incorrectly applied as a diff. In practice: - Tunnel connects, receives a workspace update containing all the existing workspaces & agents. - Tunnel loses connection, but isn't completely stopped. - All the user's workspaces are restarted, producing a new set of agents. - Tunnel regains connection, and receives a workspace update containing all the existing workspaces & agents. - This initial update is incorrectly applied as a diff, with the Tunnel's state containing both the old & new agents. This PR introduces a solution in which tunnelUpdater, when created, sends a FreshState flag with the WorkspaceUpdate type. This flag is handled in the vpn tunnel in the following fashion: - Preserve existing Agents - Remove current Agents in the tunnel that are not present in the WorkspaceUpdate - Remove unreferenced Workspaces	2025-05-06 16:00:16 +02:00
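The three FreshState rules in this commit can be sketched with plain maps. The `update` and `tunnelState` types below are hypothetical simplifications of `WorkspaceUpdate` and the vpn tunnel's internal state, keyed by string IDs instead of UUIDs.

```go
package main

import "fmt"

// update is a hypothetical, reduced stand-in for WorkspaceUpdate.
type update struct {
	FreshState bool
	Agents     map[string]string // agent ID -> workspace ID
}

// tunnelState is a hypothetical stand-in for the tunnel's tracked state.
type tunnelState struct {
	agents     map[string]string // agent ID -> workspace ID
	workspaces map[string]bool
}

func newTunnelState() *tunnelState {
	return &tunnelState{agents: map[string]string{}, workspaces: map[string]bool{}}
}

// apply treats a FreshState update as a full snapshot and anything else as a diff.
func (s *tunnelState) apply(u update) {
	if u.FreshState {
		// Remove current agents that are not present in the update.
		for id := range s.agents {
			if _, ok := u.Agents[id]; !ok {
				delete(s.agents, id)
			}
		}
	}
	// Upsert agents from the update; agents already known are preserved.
	for id, ws := range u.Agents {
		s.agents[id] = ws
		s.workspaces[ws] = true
	}
	if u.FreshState {
		// Remove workspaces no remaining agent refers to.
		referenced := map[string]bool{}
		for _, ws := range s.agents {
			referenced[ws] = true
		}
		for ws := range s.workspaces {
			if !referenced[ws] {
				delete(s.workspaces, ws)
			}
		}
	}
}

func main() {
	s := newTunnelState()
	s.apply(update{FreshState: true, Agents: map[string]string{"a1": "w1"}})
	// After a reconnect the workspace was restarted with a new agent.
	s.apply(update{FreshState: true, Agents: map[string]string{"a2": "w1"}})
	fmt.Println(len(s.agents), len(s.workspaces))
}
```

Without the flag, the second update would be merged as a diff and the stale agent `a1` would remain, which is the bug the commit describes.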
Dean Sheather	d566008087	fix: update tailscale to improve block endpoints functionality (#17496 ) Direct endpoints from the peer will no longer be processed.	2025-04-22 09:32:21 +00:00
Spike Curtis	345435a04c	feat: modify coordinators to send errors and peers to log them (#17467 ) Adds support to our coordinator implementations to send Error updates before disconnecting clients. I was recently debugging a connection issue where the client was getting repeatedly disconnected from the Coordinator, but since we never send any error information it was really hard without server logs. This PR aims to correct that, by sending a CoordinateResponse with `Error` set in cases where we disconnect a client without them asking us to. It also logs the error whenever we get one in the client controller.	2025-04-21 11:40:56 +04:00
ケイラ	f670bc31f5	chore: update testutil chan helpers (#17408 )	2025-04-16 10:37:09 -06:00
Michael Suchacz	06d39151dc	feat: extend request logs with auth & DB info (#17304 ) Closes #16903	2025-04-15 13:27:23 +02:00
Danny Kopping	0b18e458f4	fix: reduce excessive logging when database is unreachable (#17363 ) Fixes #17045 --------- Signed-off-by: Danny Kopping <dannykopping@gmail.com>	2025-04-15 10:55:30 +02:00
Spike Curtis	9e2af3e127	feat: add configurable DNS match domain for tailnet connections (#17336 ) Use the hostname suffix to set the DNS match domain when creating a Tailnet as part of the vpn `Tunnel`. part of: #16828	2025-04-11 15:00:48 +04:00
Spike Curtis	2c573dc023	feat: vpn uses WorkspaceHostnameSuffix for DNS names (#17335 ) Use the hostname suffix to set DNS names as programmed into the DNS service and returned by the vpn `Tunnel`. part of: #16828	2025-04-11 13:24:20 +04:00
Ethan	3c1cb5d05a	chore: add generic DNS record for checking if Coder Connect is running (#17298 ) Closes https://github.com/coder/internal/issues/466 ``` $ dig -6 @fd60:627a:a42b::53 is.coder--connect--enabled--right--now.coder AAAA ; <<>> DiG 9.10.6 <<>> -6 @fd60:627a:a42b::53 is.coder--connect--enabled--right--now.coder AAAA ; (1 server found) ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62390 ;; flags: qr aa rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0 ;; QUESTION SECTION: ;is.coder--connect--enabled--right--now.coder. IN AAAA ;; ANSWER SECTION: is.coder--connect--enabled--right--now.coder. 2 IN AAAA fd60:627a:a42b::53 ;; Query time: 3 msec ;; SERVER: fd60:627a:a42b::53#53(fd60:627a:a42b::53) ;; WHEN: Wed Apr 09 16:59:18 AEST 2025 ;; MSG SIZE rcvd: 134 ``` Hostname considerations: - Workspace names, usernames, and agent names can't have double hyphens, so this name can't conflict with a real Coder Connect hostname. - Components can't start or end with hyphens according to [RFC 952](https://www.rfc-editor.org/rfc/rfc952.html) - DNS records can't have hyphens in the 3rd and 4th positions, as to not conflict with IDNs https://datatracker.ietf.org/doc/html/rfc5891	2025-04-11 13:59:25 +10:00
Jon Ayers	17ddee05e5	chore: update golang to 1.24.1 (#17035 ) - Update go.mod to use Go 1.24.1 - Update GitHub Actions setup-go action to use Go 1.24.1 - Fix linting issues with golangci-lint by: - Updating to golangci-lint v1.57.1 (more compatible with Go 1.24.1) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <claude@anthropic.com>	2025-03-26 01:56:39 -05:00
Mathias Fredriksson	b79167293c	chore(Makefile): update golden files as part of make gen (#17039 ) Updating golden files is an unnecessary extra step in addition to gen that is easily overlooked, leading to the developer noticing the issue in CI leading to lost developer time waiting for tests to complete.	2025-03-21 13:04:30 +00:00
Eng Zer Jun	04c33968cf	refactor: replace `golang.org/x/exp/slices` with `slices` (#16772 ) The experimental functions in `golang.org/x/exp/slices` are now available in the standard library since Go 1.21. Reference: https://go.dev/doc/go1.21#slices Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>	2025-03-04 00:46:49 +11:00
Thomas Kosiewski	d0e2060692	feat(agent): add second SSH listener on port 22 (#16627 ) Fixes: https://github.com/coder/internal/issues/377 Added an additional SSH listener on port 22, so the agent now listens on both, port one and port 22. --- Change-Id: Ifd986b260f8ac317e37d65111cd4e0bd1dc38af8 Signed-off-by: Thomas Kosiewski <tk@coder.com>	2025-03-03 04:47:42 +01:00
Hugo Dutka	44499315ed	chore: reduce log volume on server startup (#16608 ) Addresses https://github.com/coder/coder/issues/16231. This PR reduces the volume of logs we print after server startup in order to surface the web UI URL better. Here are the logs after the changes a couple of seconds after starting the server: <img width="868" alt="Screenshot 2025-02-18 at 16 31 32" src="https://github.com/user-attachments/assets/786dc4b8-7383-48c8-a5c3-a997c01ca915" /> The warning is due to running a development site-less build. It wouldn't show in a release build.	2025-02-20 16:33:14 +01:00
Mathias Fredriksson	b07b33ec9d	feat: add agentapi endpoint to report connections for audit (#16507 ) This change adds a new `ReportConnection` endpoint to the `agentapi`. The protocol version was bumped previously, so it has been omitted here. This allows the agent to report connection events, for example when the user connects to the workspace via SSH or VS Code. Updates #15139	2025-02-20 14:52:01 +02:00
Ethan	92870f0642	fix: force lowercase DNS hostnames for VPN (#16613 ) Closes https://github.com/coder/coder-desktop-macos/issues/54 I've also double checked that agents with hyphens & underscores play nice once programmed, as do workspaces with hyphens: ``` $ ping6 main_agent-1.main-workspace.admin.coder PING6(56=40+8+8 bytes) fd60:627a:a42b:4e91:88c0:da4a:df4f:b54e --> fd60:627a:a42b:46d4:8b55:e549:e498:e6f5 ``` also fine in Firefox & Safari, though I'm a little surprised underscores work.	2025-02-20 13:02:45 +11:00
Vincent Vielle	bc609d0056	feat: integrate agentAPI with resources monitoring logic (#16438 ) As part of the new resources monitoring logic - more specifically for OOM & OOD Notifications , we need to update the AgentAPI , and the agents logic. This PR aims to do it, and more specifically : We are updating the AgentAPI & TailnetAPI to version 24 to add two new methods in the AgentAPI : - One method to fetch the resources monitoring configuration - One method to push the datapoints for the resources monitoring. Also, this PR adds a new logic on the agent side, with a routine running and ticking - fetching the resources usage each time , but also storing it in a FIFO like queue. Finally, this PR fixes a problem we had with RBAC logic on the resources monitoring model, applying the same logic than we have for similar entities.	2025-02-14 10:28:15 +01:00


258 Commits