From e5c3d151bb322b064651174856b72a84fcdaeef6 Mon Sep 17 00:00:00 2001
From: "blinkagent[bot]" <237617714+blinkagent[bot]@users.noreply.github.com>
Date: Fri, 6 Feb 2026 16:08:59 +0000
Subject: [PATCH] docs: add upgrade best practices guide (#21656)

---
 docs/admin/networking/high-availability.md |   6 +
 docs/install/upgrade-best-practices.md     | 200 +++++++++++++++++++++
 docs/install/upgrade.md                    |   3 +
 docs/manifest.json                         |   9 +-
 4 files changed, 217 insertions(+), 1 deletion(-)
 create mode 100644 docs/install/upgrade-best-practices.md

diff --git a/docs/admin/networking/high-availability.md b/docs/admin/networking/high-availability.md
index 7dee70a293..292309d44c 100644
--- a/docs/admin/networking/high-availability.md
+++ b/docs/admin/networking/high-availability.md
@@ -29,6 +29,12 @@ user <-> Coder connections.
 Coder automatically enters HA mode when multiple instances simultaneously
 connect to the same Postgres endpoint.
 
+> [!NOTE]
+> When upgrading HA deployments, database migrations may require special
+> handling to avoid lock contention. See
+> [Upgrading Best Practices](../../install/upgrade-best-practices.md) for
+> recommended procedures.
+
 HA brings one configuration variable to set in each Coderd node:
 `CODER_DERP_SERVER_RELAY_URL`. The HA nodes use these URLs to communicate with
 each other. Inter-node communication is only required while using the embedded
diff --git a/docs/install/upgrade-best-practices.md b/docs/install/upgrade-best-practices.md
new file mode 100644
index 0000000000..e1df11bf6a
--- /dev/null
+++ b/docs/install/upgrade-best-practices.md
@@ -0,0 +1,200 @@
+# Upgrading Best Practices
+
+This guide provides best practices for upgrading Coder, along with
+troubleshooting steps for common issues encountered during upgrades,
+particularly with database migrations in high availability (HA) deployments.
+
+## Before you upgrade
+
+> [!TIP]
+> To check your current Coder version, use `coder version` from the CLI, check
+> the bottom-right of the Coder dashboard, or query the `/api/v2/buildinfo`
+> endpoint. See the [version command](../reference/cli/version.md) for details.
+
+- **Schedule upgrades during off-peak hours.** Upgrades can cause a noticeable
+  disruption to the developer experience. Plan your maintenance window when
+  the fewest developers are actively using their workspaces.
+- **The larger the version jump, the more migrations will run.** If you are
+  upgrading across multiple minor versions, expect longer migration times.
+- **Large upgrades should complete in minutes** (typically 4-7 minutes). If your
+  upgrade is taking significantly longer, there may be an issue requiring
+  investigation.
+- **Check for known issues affecting your upgrade path.** Some version upgrades
+  have known issues that may require a larger maintenance window or additional
+  steps. For example, upgrades from v2.26.0 to v2.27.8 may encounter issues with
+  the `api_keys` table—upgrading to v2.26.6 first can help mitigate this.
+  Contact [Coder support](../support/index.md) for guidance on your specific
+  upgrade path.
+
+## Pre-upgrade strategy for Kubernetes HA deployments
+
+Standard Kubernetes rolling updates may fail when exclusive database locks are
+required because old replicas keep connections open. For production deployments
+running multiple replicas (HA), active connections from existing pods can
+prevent the new pod from acquiring necessary locks.
+
+### Recommended strategy for major upgrades
+
+1. **Scale down before upgrading:** Before running `helm upgrade`, scale your
+   Coder deployment down to eliminate database connection contention from
+   existing pods.
+
+   - **Scale to zero** for a clean cutover with no active database connections
+     when the upgrade starts. This momentarily ensures no application access to
+     the database, allowing migrations to acquire locks immediately:
+
+     ```shell
+     kubectl scale deployment coder --replicas=0
+     ```
+
+   - **Scale to one** if you prefer to minimize downtime. This keeps one pod
+     running but eliminates contention from multiple replicas:
+
+     ```shell
+     kubectl scale deployment coder --replicas=1
+     ```
+
+1. **Perform upgrade:** Run your standard Helm upgrade command. When scaling to
+   zero, this will bring up a fresh pod that can run migrations without
+   competing for database locks.
+
+1. **Scale back:** Once the upgrade is healthy, scale back to your desired
+   replica count.
+
+## Kubernetes liveness probes and long-running migrations
+
+Liveness probes can cause pods to be killed during long-running database
+migrations. Starting with Coder v2.30.0, liveness probes are *disabled by
+default* in the Helm chart.
+
+This change was made because:
+
+- Liveness probes can kill pods during legitimate long-running migrations
+- If a Coder pod becomes unresponsive (due to a deadlock, etc.), it's better to
+  investigate the issue rather than have Kubernetes silently restart the pod
+
+If you have enabled liveness probes in your deployment and observe pods
+restarting with `CrashLoopBackOff` during an upgrade, the liveness probe may be
+killing the pod prematurely.
+
+### Diagnosing liveness probe issues
+
+To confirm whether Kubernetes is killing pods due to liveness probe failures,
+check the Kubernetes events and pod logs:
+
+```shell
+# Check events for the Coder deployment
+kubectl get events --field-selector involvedObject.name=coder -n <namespace>
+
+# Check pod logs for migration progress
+kubectl logs -l app.kubernetes.io/name=coder -n <namespace> --previous
+```
+
+Look for events indicating `Liveness probe failed` or `Container coder failed
+liveness probe, will be restarted`.
+
+### Recommended approach
+
+If you have liveness probes enabled and experience issues during upgrades,
+disable them before upgrading:
+
+```shell
+kubectl edit deployment coder
+```
+
+Remove the `livenessProbe` section entirely, then proceed with the upgrade.
+
+> [!NOTE]
+> For versions prior to v2.30.0, liveness probes were enabled by default. You
+> can disable them by editing the Deployment directly with `kubectl edit
+> deployment coder` or by using a ConfigMap override. See the
+> [Helm chart values](https://artifacthub.io/packages/helm/coder-v2/coder?modal=values&path=coder.livenessProbe)
+> for configuration options available in v2.30.0+.
+
+### Workaround steps
+
+1. **Remove or adjust liveness probes:** Temporarily remove the `livenessProbe`
+   from your Deployment configuration to prevent Kubernetes from restarting the
+   pod during migrations.
+
+1. **Isolate the migration:** Ensure all extra replica sets are shut down. If
+   you have clear evidence of database locks from old pods, scale the deployment
+   to 1 replica to prevent old pods from holding locks on the tables being
+   upgraded.
+
+1. **Clear database locks:** Monitor database activity. If the migration remains
+   blocked by locks despite scaling down, you may need to manually terminate
+   existing connections. See
+   [Recovering from failed database migrations](#recovering-from-failed-database-migrations)
+   below for instructions.
+
+## Recovering from failed database migrations
+
+If an upgrade gets stuck in a restart loop due to database locks:
+
+1. **Scale to zero:** Scale the Coder deployment to 0 to stop all application
+   activity.
+
+   ```shell
+   kubectl scale deployment coder --replicas=0
+   ```
+
+1. **Clear connections:** Terminate existing connections to the Coder database
+   to release any lingering locks. This PostgreSQL command drops all active
+   connections to the database:
+
+   > [!CAUTION]
+   > This command is intrusive and should be used as a last resort. Contact
+   > [Coder support](../support/index.md) before running destructive database
+   > commands in production. SQL commands may vary depending on your PostgreSQL
+   > version and configuration.
+
+   ```sql
+   SELECT pg_terminate_backend(pid)
+   FROM pg_stat_activity
+   WHERE datname = 'coder'
+   AND pid <> pg_backend_pid();
+   ```
+
+1. **Check schema migrations:** Verify the level of upgrade and check if `dirty`
+   is true. If this has progressed, this now indicates your current Coder
+   installation state.
+
+   > [!NOTE]
+   > The SQL commands below are for informational purposes. If you are unsure
+   > about querying your database directly, contact
+   > [Coder support](../support/index.md) for assistance.
+
+   ```sql
+   SELECT * FROM schema_migrations;
+   ```
+
+1. **Ensure image version:** Confirm the Deployment image is set to the
+   appropriate version (old or new, depending on the database migration state
+   found in step 3). Match your tag in the
+   [migrations directory](https://github.com/coder/coder/tree/main/coderd/database/migrations)
+   to the value in the `schema_migrations` output.
+
+1. **Resume the upgrade:** Follow the
+   [pre-upgrade strategy](#recommended-strategy-for-major-upgrades) to scale
+   back up and continue the upgrade process.
+
+## When to contact support
+
+If you encounter any of the following issues, contact
+[Coder support](../support/index.md):
+
+- Locking issues that cannot be mitigated by the steps in this guide
+- Migrations taking significantly longer than expected (more than 15 minutes)
+  without evidence of lock contention—this may indicate database resource
+  constraints requiring investigation
+- Resource consumption issues (excessive memory, CPU, or OOM kills) during
+  upgrades
+- Any other upgrade problems not covered by this documentation
+
+When contacting support, please collect and provide:
+
+- `coderd` logs with details on the stages where the upgrade stalled
+- PostgreSQL logs if available
+- The Coder versions involved (source and target)
+- Your deployment configuration (number of replicas, resource limits)
diff --git a/docs/install/upgrade.md b/docs/install/upgrade.md
index 7b8b0347bd..2559217edc 100644
--- a/docs/install/upgrade.md
+++ b/docs/install/upgrade.md
@@ -6,6 +6,9 @@ This article describes how to upgrade your Coder server.
 > Prior to upgrading a production Coder deployment, take a database snapshot since
 > Coder does not support rollbacks.
 
+For upgrade recommendations and troubleshooting, see
+[Upgrading Best Practices](./upgrade-best-practices.md).
+
 ## Reinstall Coder to upgrade
 
 To upgrade your Coder server, reinstall Coder using your original method
diff --git a/docs/manifest.json b/docs/manifest.json
index 84b7c4428f..d505a7305a 100644
--- a/docs/manifest.json
+++ b/docs/manifest.json
@@ -169,7 +169,14 @@
 					"title": "Upgrading",
 					"description": "Learn how to upgrade Coder",
 					"path": "./install/upgrade.md",
-					"icon_path": "./images/icons/upgrade.svg"
+					"icon_path": "./images/icons/upgrade.svg",
+					"children": [
+						{
+							"title": "Upgrading Best Practices",
+							"description": "Best practices and troubleshooting for Coder upgrades",
+							"path": "./install/upgrade-best-practices.md"
+						}
+					]
 				},
 				{
 					"title": "Uninstall",