mirror of
https://github.com/coder/coder.git
synced 2026-06-02 20:48:20 +00:00
54d650ea79
When the control plane connection drops and reconnects, a new `tunnelUpdater` is created with empty workspace state. This causes the in-memory DNS resolver to lose all host records, breaking `.coder` name resolution until the server sends a fresh workspace snapshot. If the API is unreachable (e.g., the route goes through a VPN that is also reconnecting), the snapshot never arrives and DNS stays broken indefinitely, requiring a full Coder Desktop restart.

Fix: carry workspace state from the previous `tunnelUpdater` to the new one on reconnect, and immediately re-apply DNS hosts so the resolver stays populated during the reconnection window.

Fixes https://linear.app/codercom/issue/PLAT-110

<details><summary>Investigation & decision log</summary>

### Root cause analysis

Customer diagnostic data from Roblox (March 31) showed:

- NRPT rule present (`.coder` → `fd60:627a:a42b::53`): routing is correct
- DNS resolver returns NXDOMAIN for everything, including the sentinel `is.coder--connect--enabled--right--now.coder`: the resolver is running but has zero host records
- Coder Connect UI shows "connected": the WireGuard data plane is up

The resolver is empty because `TunnelAllWorkspaceUpdatesController.New()` creates a fresh `tunnelUpdater` with `workspaces: make(map[uuid.UUID]*Workspace)` (empty). The previous updater's workspace data is discarded. If the server's workspace snapshot is delayed or the API is unreachable, the resolver has no records to serve.

This is compounded by GlobalProtect VPN reconnects: the Coder API is behind the VPN, so when GP reconnects, the API route is temporarily lost and the snapshot can't arrive.
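
The carry-over described above can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions, not the code in this PR: it uses string IDs instead of `uuid.UUID`, a hypothetical `dnsSetter` interface with a simplified `SetDNSHosts` signature, no locking, and an in-memory fake resolver. The names `controller`, `newUpdater`, `applySnapshot`, `newWorkspace`, and `memResolver` are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
)

// workspace is a simplified stand-in: string ID, FQDN -> addresses.
type workspace struct {
	ID    string
	Hosts map[string][]netip.Addr
}

// newWorkspace builds a workspace with one host record (illustrative helper).
func newWorkspace(id, fqdn, ip string) *workspace {
	return &workspace{ID: id, Hosts: map[string][]netip.Addr{fqdn: {netip.MustParseAddr(ip)}}}
}

// clone deep-copies w so the new updater shares no maps or slices with the old one.
func (w *workspace) clone() *workspace {
	hosts := make(map[string][]netip.Addr, len(w.Hosts))
	for name, addrs := range w.Hosts {
		hosts[name] = append([]netip.Addr(nil), addrs...)
	}
	return &workspace{ID: w.ID, Hosts: hosts}
}

// dnsSetter is a hypothetical, simplified view of the DNS resolver.
type dnsSetter interface {
	SetDNSHosts(hosts map[string][]netip.Addr) error
}

type updater struct {
	workspaces map[string]*workspace
}

// allHosts flattens every workspace's records into one table for the resolver.
func (u *updater) allHosts() map[string][]netip.Addr {
	out := make(map[string][]netip.Addr)
	for _, w := range u.workspaces {
		for name, addrs := range w.Hosts {
			out[name] = addrs
		}
	}
	return out
}

type controller struct {
	dns  dnsSetter
	prev *updater // updater from the previous session, if any
}

// newUpdater runs on every (re)connect and inherits the previous session's state.
func (c *controller) newUpdater() *updater {
	u := &updater{workspaces: make(map[string]*workspace)}
	if c.prev != nil {
		for id, w := range c.prev.workspaces {
			u.workspaces[id] = w.clone()
		}
	}
	if len(u.workspaces) > 0 {
		// Non-fatal on failure: the next snapshot programs DNS anyway.
		if err := c.dns.SetDNSHosts(u.allHosts()); err != nil {
			fmt.Println("warn: re-applying DNS hosts from previous session:", err)
		}
	}
	c.prev = u
	return u
}

// applySnapshot replaces inherited state with the server's view.
func (c *controller) applySnapshot(u *updater, snapshot []*workspace) error {
	u.workspaces = make(map[string]*workspace, len(snapshot))
	for _, w := range snapshot {
		u.workspaces[w.ID] = w.clone()
	}
	return c.dns.SetDNSHosts(u.allHosts())
}

// memResolver is an in-memory fake used only for this example.
type memResolver struct {
	hosts map[string][]netip.Addr
	fail  bool
}

func (r *memResolver) SetDNSHosts(hosts map[string][]netip.Addr) error {
	if r.fail {
		return errors.New("resolver unavailable")
	}
	r.hosts = hosts
	return nil
}

func main() {
	r := &memResolver{}
	c := &controller{dns: r}
	u := c.newUpdater() // first connect: nothing to inherit
	_ = c.applySnapshot(u, []*workspace{newWorkspace("ws1", "dev.coder.", "fd60:627a:a42b::1")})

	r.hosts = nil  // simulate the resolver losing its records across a reconnect
	c.newUpdater() // reconnect: inherited state is re-applied before any snapshot
	fmt.Println("hosts after reconnect:", len(r.hosts))
}
```

Cloning rather than sharing the map keeps the old updater's teardown from mutating state the new one is serving, which is presumably why the PR describes the step as a clone.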
### What this PR changes

- `TunnelAllWorkspaceUpdatesController.New()` now clones workspace state from the previous updater before creating the new one
- Immediately re-applies DNS hosts with the inherited state (log: `re-applying DNS hosts from previous session`)
- When the server's snapshot arrives, it replaces the inherited data normally
- If `SetDNSHosts` fails during re-apply, it's logged as a warning and is not fatal; the `recvLoop` will program DNS when the snapshot arrives

### What this PR does NOT fix (future work)

- **Tunnel binary restart**: when the tunnel process itself is killed and relaunched, all in-memory state is lost. A DNS host cache on disk would be needed for this case.
- **NRPT rule cleanup on startup**: the Tailscale fork's `nrptRuleDatabase` constructor unconditionally deletes all NRPT rules on engine creation. Deferring cleanup to the first successful `SetDNS` call would reduce the DNS gap.
- **Hosts file retry**: the `setHosts()` retry in the Tailscale fork (5×10ms) is too short for environments where endpoint security locks the file.

These are tracked as follow-up items in the `coder/tailscale` fork.

</details>

> 🤖 Generated by Coder Agents