mirror of
https://github.com/coder/coder.git
synced 2026-06-03 04:58:23 +00:00
5130404f2a
_Generated with mux but reviewed by a human_ This PR fixes a bug where Coder Desktop could stop retrying connections to coderd after a prolonged network interruption. When that happened, the client would no longer recoordinate or receive workspace updates, even after connectivity returned. This is likely the long-standing “stale connection” issue that has been reported both internally and by customers. In practice, it would cause all Coder Desktop workspaces to appear yellow or red in the UI and become unreachable. The underlying behavior matches the reports: peers are removed after 15 minutes without a handshake. So if network connectivity is lost for that long, the client must recoordinate to recover. This bug prevented that recoordination from happening. For that reason, I’m marking this as: Closes https://github.com/coder/coder-desktop-macos/issues/227 ## Problem The tailnet controller owns a long-lived retry loop in `Controller.Run`. That loop already had an important graceful-shutdown guard added in [`ba21ba87`](https://github.com/coder/coder/commit/ba21ba87ba2209fad3c9f4bb131d7de1fc0e58be) to prevent a phantom redial after cancellation: ```go if c.ctx.Err() != nil { return } ``` That guard was correct. It made controller lifetime depend on the controller's own context rather than on retry timing races. But the post-dial error path had since grown a broader terminal check: ```go if xerrors.Is(err, context.Canceled) || xerrors.Is(err, context.DeadlineExceeded) { return } ``` That turns out to be too broad for desktop reconnects. A dial attempt can fail with a wrapped `context.DeadlineExceeded` even while the controller's own context is still live. ## Why that happens The workspace tailnet dialer uses the SDK HTTP client, which inherits `http.DefaultTransport`. That transport uses a `net.Dialer` with a 30s `Timeout`. Go implements that timeout by creating an internal deadline-bound sub-context for the TCP connect. So during a control-plane blackhole, the reconnect path can look like this: - the existing control-plane connection dies - `Controller.Run` re-enters the retry path - the next websocket/TCP dial hangs against unreachable coderd - `net.Dialer` times out the connect after ~30s - the returned error unwraps to `context.DeadlineExceeded` - `Controller.Run` treats that as terminal and returns - the retry goroutine exits forever even though `c.ctx` is still alive At that point the data plane can remain partially alive, the desktop app can still look online, and unblocking coderd does nothing because the process is no longer trying to redial. ## How this was found We reproduced the issue in the macOS vpn-daemon process with temporary diagnostics, blackholed coderd with `pfctl`, and captured multiple goroutine dumps while the daemon was wedged. Those dumps showed: - `manageGracefulTimeout` was still blocked on `<-c.ctx.Done()`, proving the controller context was not canceled - the `Controller.Run` retry goroutine was missing from later dumps - control-plane consumers stayed idle longer over time - once coderd became reachable again the daemon still did not dial it That narrowed the failure from "slow retry" to "retry loop exited", and tracing the dial path back through `http.DefaultTransport` and `net.Dialer` explained why a transport timeout was being mistaken for controller shutdown. In my testing with coderd blocked, as expected, I did retain a connection to the workspace agent. I suspect the scenarios where connection to the agent are lost is because we can't retry coordination. ## Fix Keep the graceful-shutdown guard from [`ba21ba87`](https://github.com/coder/coder/commit/ba21ba87ba2209fad3c9f4bb131d7de1fc0e58be) exactly as-is, but narrow the post-dial exit condition so it keys off the controller's own context instead of the error unwrap chain. Before: ```go if xerrors.Is(err, context.Canceled) || xerrors.Is(err, context.DeadlineExceeded) { return } ``` After: ```go if c.ctx.Err() != nil { return } ``` ## Why this is the right behavior This preserves the original graceful-shutdown invariant from [`ba21ba87`](https://github.com/coder/coder/commit/ba21ba87ba2209fad3c9f4bb131d7de1fc0e58be) while restoring retryability for transient transport failures: - if `c.ctx` is canceled before dialing, the pre-dial guard still prevents a phantom redial - if `c.ctx` is canceled during a dial attempt, the error path still exits cleanly because `c.ctx.Err()` is non-nil - if a live controller hits a wrapped transport timeout, the loop no longer dies and instead retries as intended In other words, controller state remains the only authoritative signal for loop shutdown. ## Non-regression coverage This also preserves the earlier flaky-test fix from [`ba21ba87`](https://github.com/coder/coder/commit/ba21ba87ba2209fad3c9f4bb131d7de1fc0e58be): - `pipeDialer` still returns errors instead of asserting from background goroutines - `TestController_Disconnects` still waits for `uut.Closed()` before the test exits On top of that, this change adds focused controller tests that assert: - a wrapped `net.OpError(context.DeadlineExceeded)` under a live controller causes another dial attempt instead of closing the controller - cancellation still shuts the controller down without an extra redial ## Validation After blocking TCP connections to coderd for 20 minutes to force the retry path, unblocking coderd allowed the daemon to recover on its own without toggling Coder Connect.