coder/tailnet
Garrett Delfosse 54d650ea79 fix(tailnet): preserve DNS hosts across control plane reconnections (#24253)
When the control plane connection drops and reconnects, a new
`tunnelUpdater` is created with empty workspace state. This causes the
in-memory DNS resolver to lose all host records, breaking `.coder` name
resolution until the server sends a fresh workspace snapshot.

If the API is unreachable (e.g., the route goes through a VPN that is
also reconnecting), the snapshot never arrives and DNS stays broken
until Coder Desktop is fully restarted.

Fix: carry workspace state from the previous `tunnelUpdater` to the new
one on reconnect, and immediately re-apply DNS hosts so the resolver
stays populated during the reconnection window.

Fixes https://linear.app/codercom/issue/PLAT-110

<details><summary>Investigation & decision log</summary>

### Root cause analysis

Customer diagnostic data from Roblox (March 31) showed:
- NRPT rule present (`.coder` → `fd60:627a:a42b::53`) — routing is
correct
- DNS resolver returns NXDOMAIN for everything including the sentinel
`is.coder--connect--enabled--right--now.coder` — resolver is running but
has zero host records
- Coder Connect UI shows "connected" — the WireGuard data plane is up

The resolver is empty because
`TunnelAllWorkspaceUpdatesController.New()` creates a fresh
`tunnelUpdater` with `workspaces: make(map[uuid.UUID]*Workspace)`
(empty). The previous updater's workspace data is discarded. If the
server's workspace snapshot is delayed or the API is unreachable, the
resolver has no records to serve.

This is compounded by GlobalProtect (GP) VPN reconnects: the Coder API is
behind the VPN, so when GP reconnects, the route to the API is temporarily
lost and the snapshot cannot arrive.

### What this PR changes

- `TunnelAllWorkspaceUpdatesController.New()` now clones workspace state
from the previous updater before creating the new one
- Immediately re-applies DNS hosts with the inherited state (log:
`re-applying DNS hosts from previous session`)
- When the server's snapshot arrives, it replaces the inherited data
normally
- If `SetDNSHosts` fails during re-apply, the failure is logged as a
warning rather than treated as fatal, because the `recvLoop` programs DNS
again when the snapshot arrives
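
The behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `Workspace`, `fakeDNS`, `updater`, `controller`, and `applySnapshot` are hypothetical stand-ins, workspace IDs are plain strings instead of `uuid.UUID`, the host address is made up, and locking is omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Workspace is a hypothetical, trimmed-down stand-in for the real type.
type Workspace struct {
	Name  string
	Hosts map[string]string // DNS name -> IP address
}

// clone deep-copies the workspace so the new updater never shares maps
// with the one it replaces.
func (w *Workspace) clone() *Workspace {
	hosts := make(map[string]string, len(w.Hosts))
	for name, ip := range w.Hosts {
		hosts[name] = ip
	}
	return &Workspace{Name: w.Name, Hosts: hosts}
}

// fakeDNS records the last host set it received. Setting failNext makes
// the next SetDNSHosts call return an error (assumption, for illustration).
type fakeDNS struct {
	hosts    map[string]string
	failNext bool
}

func (d *fakeDNS) SetDNSHosts(hosts map[string]string) error {
	if d.failNext {
		d.failNext = false
		return errors.New("simulated SetDNSHosts failure")
	}
	d.hosts = hosts
	return nil
}

type updater struct {
	dns        *fakeDNS
	workspaces map[string]*Workspace // keyed by workspace ID
}

// allHosts flattens every workspace's records into one host map.
func (u *updater) allHosts() map[string]string {
	out := make(map[string]string)
	for _, w := range u.workspaces {
		for name, ip := range w.Hosts {
			out[name] = ip
		}
	}
	return out
}

// applySnapshot replaces any inherited state with the server's view.
func (u *updater) applySnapshot(snapshot map[string]*Workspace) error {
	u.workspaces = snapshot
	return u.dns.SetDNSHosts(u.allHosts())
}

type controller struct {
	dns  *fakeDNS
	prev *updater
}

// New runs on every (re)connect. It clones the previous updater's
// workspaces and re-applies their DNS hosts before any snapshot arrives.
func (c *controller) New() *updater {
	next := &updater{dns: c.dns, workspaces: make(map[string]*Workspace)}
	if c.prev != nil {
		for id, w := range c.prev.workspaces {
			next.workspaces[id] = w.clone()
		}
	}
	if len(next.workspaces) > 0 {
		if err := c.dns.SetDNSHosts(next.allHosts()); err != nil {
			// Not fatal: the next snapshot programs DNS anyway.
			fmt.Println("warning: re-applying DNS hosts failed:", err)
		}
	}
	c.prev = next
	return next
}

func main() {
	dns := &fakeDNS{}
	c := &controller{dns: dns}
	first := c.New()
	_ = first.applySnapshot(map[string]*Workspace{
		"ws-1": {Name: "dev", Hosts: map[string]string{"dev.coder.": "fd60:627a:a42b::1"}},
	})
	second := c.New() // reconnect: state is inherited, not reset
	fmt.Println(len(second.workspaces), dns.hosts)
}
```

Cloning rather than sharing the map keeps the old and new updaters independent, so a late write from the old session cannot mutate state the new one is serving.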

### What this PR does NOT fix (future work)

- **Tunnel binary restart**: when the tunnel process itself is killed
and relaunched, all in-memory state is lost. A DNS host cache on disk
would be needed for this case.
- **NRPT rule cleanup on startup**: the Tailscale fork's
`nrptRuleDatabase` constructor unconditionally deletes all NRPT rules on
engine creation. Deferring cleanup to the first successful `SetDNS` call
would reduce the DNS gap.
- **Hosts file retry**: the `setHosts()` retry in the Tailscale fork
(5×10ms) is too short for environments where endpoint security locks the
file.

These are tracked as follow-up items in the `coder/tailscale` fork.
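
For the hosts file retry item, one possible direction is a capped exponential backoff instead of a fixed short interval. The sketch below is only an illustration under stated assumptions: `retryWithBackoff`, `sleepRecorder`, `lockedFor`, and the constants are hypothetical names and values, and none of this is the fork's actual `setHosts()` code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical tuning values; the source only says 5×10ms is too short.
const (
	maxAttempts  = 6
	initialDelay = 10 * time.Millisecond
	maxDelay     = time.Second
)

var errLocked = errors.New("hosts file is locked")

// sleepRecorder stands in for time.Sleep so the example runs instantly
// and the chosen delays can be inspected.
type sleepRecorder struct{ slept []time.Duration }

func (r *sleepRecorder) Sleep(d time.Duration) { r.slept = append(r.slept, d) }

// retryWithBackoff calls op until it succeeds or attempts run out,
// doubling the wait between tries up to limit.
func retryWithBackoff(attempts int, initial, limit time.Duration,
	sleep func(time.Duration), op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	delay := initial
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		sleep(delay)
		delay *= 2
		if delay > limit {
			delay = limit
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

// lockedFor returns an operation that fails n times and then succeeds,
// imitating a file that endpoint security holds locked for a while.
func lockedFor(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errLocked
		}
		return nil
	}
}

func main() {
	rec := &sleepRecorder{}
	err := retryWithBackoff(maxAttempts, initialDelay, maxDelay, rec.Sleep, lockedFor(3))
	fmt.Println(err, rec.slept)
}
```

Passing the sleep function in keeps the retry policy testable without real delays; production code would pass `time.Sleep`.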

</details>

> 🤖 Generated by Coder Agents
2026-04-29 12:29:44 -04:00