## Summary
Fixes cross-replica chat relay failing with:
```
failed to open initial relay for chat stream
error= dial relay stream: - failed to WebSocket dial: expected handshake response status code 101 but got 200
failed to open relay for message parts
error= dial relay stream: - failed to WebSocket dial: expected handshake response status code 101 but got 200
```
Subscribers see accurate `status=running` (delivered via pubsub) but
miss all in-progress `message_part` events (delivered only via the relay
WebSocket that never connects).
## Root cause
`redirectToAccessURL` in `cli/server.go` redirects any request whose
`Host` header doesn't match the access URL. The enterprise chat relay
dials another replica directly via its DERP relay address (e.g.
`http://10.0.0.2:8080`), so the `Host` header is the pod IP — not the
access URL.
This triggers a **307 redirect** to the access URL. The WebSocket
library follows the redirect, but the second request is a plain GET —
`Connection: Upgrade` and `Upgrade: websocket` headers are **not carried
over** by HTTP redirect semantics. The load-balanced access URL routes
the plain GET to any replica, which serves the SPA catch-all handler and
returns **HTTP 200 with `index.html`**.
The WebSocket library then fails: `expected handshake response status
code 101 but got 200`.
DERP mesh already has an exemption for this exact scenario
(`isDERPPath`). Chat relay was added later and didn't get one.
## Fix
Bypass `redirectToAccessURL` for requests that carry the
`X-Coder-Relay-Source-Replica` header, which the enterprise relay
already sets on every request (`enterprise/coderd/chatd/chatd.go:573`).
## Sequence diagram
**Before (broken):**
```
Replica A (subscriber) Replica B (worker) Load Balancer
| | |
|--- WS dial pod-ip:8080 ----->| |
| |-- 307 redirect to LB --->|
| | |
|<----------- plain GET (no Upgrade headers) ------------->|
| | |-- routes to any replica
|<----------- 200 index.html -------------------------------|
| |
X 'expected 101 but got 200' |
```
**After (fixed):**
```
Replica A (subscriber) Replica B (worker)
| |
|--- WS dial pod-ip:8080 ----->|
| (X-Coder-Relay-Source- |
| Replica header set) |
| |-- bypass redirect
|<--------- 101 Upgrade ------|
|<==== message_part events ====|
```
Fixes all our Go file imports to match the preferred spec that we've _mostly_ been using. For example:
```
import (
"context"
"time"
"github.com/prometheus/client_golang/prometheus"
"golang.org/x/xerrors"
"gopkg.in/natefinch/lumberjack.v2"
"cdr.dev/slog/v3"
"github.com/coder/coder/v2/codersdk/agentsdk"
"github.com/coder/serpent"
)
```
3 groups: standard library, 3rd partly libs, Coder libs.
This PR makes the change across the codebase. The PR in the stack above modifies our formatting to maintain this state of affairs, and is a separate PR so it's possible to review that one in detail.
Upgrades to slog v3 which includes a small, but backward incompatible API change to the acceptible call arguments when logging. This change allows us to verify via compile time type checking that arguments are correct and won't cause a panic, as was possible in slog v1, which this replaces (v2 was tagged but never used in coder/coder).
It also updates dependencies that also use slog and were updated.
I've left the `aibridge` dependency as a commit SHA, under the assumption that the team there (cc @pawbana @dannykopping ) will tag and update the dependency soon and on their own schedule.
Other dependencies, I pushed new tags.
Fixes: https://github.com/coder/coder/issues/16319
This PR modifies existing escaping logic for special characters in
Postgres password, so it does fail on edge cases like `#` or `$` when
parser recognizes as invalid port.
Refactors our use of `slogtest` to instantiate a "standard logger" across most of our tests. This standard logger incorporates https://github.com/coder/slog/pull/217 to also ignore database query canceled errors by default, which are a source of low-severity flakes.
Any test that has set non-default `slogtest.Options` is left alone. In particular, `coderdtest` defaults to ignoring all errors. We might consider revisiting that decision now that we have better tools to target the really common flaky Error logs on shutdown.
This cipher is included by default in Go as a fallback, but is marked as
an insecure cipher. This removes the 3des cipher by default.
Before:
```
$ nmap --script ssl-enum-ciphers -p 443 xxxxxxx
Starting Nmap 7.94 ( https://nmap.org ) at 2024-07-08 14:16 CDT
Nmap scan report for xxxxx (xxx.xxx.xxx.xxx)
Host is up (0.038s latency).
rDNS record for xxx.xxx.xxx.xxx: xxx.xxx.xxx.xxx.bc.googleusercontent.com
PORT STATE SERVICE
443/tcp open https
| ssl-enum-ciphers:
| TLSv1.2:
| ciphers:
| TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (secp256r1) - A
| TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 (secp256r1) - A
| TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (secp256r1) - A
| TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA (secp256r1) - A
| TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA (secp256r1) - A
| TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA (secp256r1) - C
| compressors:
| NULL
| cipher preference: server
| warnings:
| 64-bit block cipher 3DES vulnerable to SWEET32 attack
| TLSv1.3:
| ciphers:
| TLS_AKE_WITH_AES_128_GCM_SHA256 (ecdh_x25519) - A
| TLS_AKE_WITH_AES_256_GCM_SHA384 (ecdh_x25519) - A
| TLS_AKE_WITH_CHACHA20_POLY1305_SHA256 (ecdh_x25519) - A
| cipher preference: server
|_ least strength: C
```
After:
```
$ nmap --script ssl-enum-ciphers -p 443 xxxxxxx
Starting Nmap 7.94 ( https://nmap.org ) at 2024-07-08 15:04 CDT
Nmap scan report for xxxxx (xxx.xxx.xxx.xxx)
Host is up (0.039s latency).
rDNS record for xxx.xxx.xxx.xxx: xxx.xxx.xxx.xxx.bc.googleusercontent.com
PORT STATE SERVICE
443/tcp open https
| ssl-enum-ciphers:
| TLSv1.2:
| ciphers:
| TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA (secp256r1) - A
| TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (secp256r1) - A
| TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA (secp256r1) - A
| TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (secp256r1) - A
| TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 (secp256r1) - A
| compressors:
| NULL
| cipher preference: client
| TLSv1.3:
| ciphers:
| TLS_AKE_WITH_AES_128_GCM_SHA256 (ecdh_x25519) - A
| TLS_AKE_WITH_AES_256_GCM_SHA384 (ecdh_x25519) - A
| TLS_AKE_WITH_CHACHA20_POLY1305_SHA256 (ecdh_x25519) - A
| cipher preference: server
|_ least strength: A
```
* fixup! fix!(cli): remove `TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA` cipher by default
* fixup! fix!(cli): remove `TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA` cipher by default
> Can someone help me understand the differences between these env variables:
>
> CODER_REDIRECT_TO_ACCESS_URL
> CODER_TLS_REDIRECT_HTTP_TO_HTTPS
> CODER_TLS_REDIRECT_HTTP
Oh man, what a mess. It looks like `CODER_TLS_REDIRECT_HTTP ` appears in our config docs. Maybe that was the initial name for the environment variable?
At some point, both the flag and the environment variable were `--tls-redirect-http-to-https` and `CODER_TLS_REDIRECT_HTTP_TO_HTTPS`. `CODER_TLS_REDIRECT_HTTP` did nothing.
However, then we introduced `CODER_REDIRECT_TO_ACCESS_URL`, we put in some deprecation code that was maybe fat-fingered such that we accept the environment variable `CODER_TLS_REDIRECT_HTTP` but the flag `--tls-redirect-http-to-https`. Our docs still refer to `CODER_TLS_REDIRECT_HTTP` at https://coder.com/docs/v2/latest/admin/configure#address
So, I think what we gotta do is still accept `CODER_TLS_REDIRECT_HTTP` since it was working and in an example doc, but also fix the deprecation code to accept `CODER_TLS_REDIRECT_HTTP_TO_HTTPS` environment variable.