From e5c3d151bb322b064651174856b72a84fcdaeef6 Mon Sep 17 00:00:00 2001 From: "blinkagent[bot]" <237617714+blinkagent[bot]@users.noreply.github.com> Date: Fri, 6 Feb 2026 16:08:59 +0000 Subject: [PATCH] docs: add upgrade best practices guide (#21656) --- docs/admin/networking/high-availability.md | 6 + docs/install/upgrade-best-practices.md | 200 +++++++++++++++++++++ docs/install/upgrade.md | 3 + docs/manifest.json | 9 +- 4 files changed, 217 insertions(+), 1 deletion(-) create mode 100644 docs/install/upgrade-best-practices.md diff --git a/docs/admin/networking/high-availability.md b/docs/admin/networking/high-availability.md index 7dee70a293..292309d44c 100644 --- a/docs/admin/networking/high-availability.md +++ b/docs/admin/networking/high-availability.md @@ -29,6 +29,12 @@ user <-> Coder connections. Coder automatically enters HA mode when multiple instances simultaneously connect to the same Postgres endpoint. +> [!NOTE] +> When upgrading HA deployments, database migrations may require special +> handling to avoid lock contention. See +> [Upgrading Best Practices](../../install/upgrade-best-practices.md) for +> recommended procedures. + HA brings one configuration variable to set in each Coderd node: `CODER_DERP_SERVER_RELAY_URL`. The HA nodes use these URLs to communicate with each other. Inter-node communication is only required while using the embedded diff --git a/docs/install/upgrade-best-practices.md b/docs/install/upgrade-best-practices.md new file mode 100644 index 0000000000..e1df11bf6a --- /dev/null +++ b/docs/install/upgrade-best-practices.md @@ -0,0 +1,200 @@ +# Upgrading Best Practices + +This guide provides best practices for upgrading Coder, along with +troubleshooting steps for common issues encountered during upgrades, +particularly with database migrations in high availability (HA) deployments. 
+ +## Before you upgrade + +> [!TIP] +> To check your current Coder version, use `coder version` from the CLI, check +> the bottom-right of the Coder dashboard, or query the `/api/v2/buildinfo` +> endpoint. See the [version command](../reference/cli/version.md) for details. + +- **Schedule upgrades during off-peak hours.** Upgrades can cause a noticeable + disruption to the developer experience. Plan your maintenance window when + the fewest developers are actively using their workspaces. +- **The larger the version jump, the more migrations will run.** If you are + upgrading across multiple minor versions, expect longer migration times. +- **Large upgrades should complete in minutes** (typically 4-7 minutes). If your + upgrade is taking significantly longer, there may be an issue requiring + investigation. +- **Check for known issues affecting your upgrade path.** Some version upgrades + have known issues that may require a larger maintenance window or additional + steps. For example, upgrades from v2.26.0 to v2.27.8 may encounter issues with + the `api_keys` table—upgrading to v2.26.6 first can help mitigate this. + Contact [Coder support](../support/index.md) for guidance on your specific + upgrade path. + +## Pre-upgrade strategy for Kubernetes HA deployments + +Standard Kubernetes rolling updates may fail when exclusive database locks are +required because old replicas keep connections open. For production deployments +running multiple replicas (HA), active connections from existing pods can +prevent the new pod from acquiring necessary locks. + +### Recommended strategy for major upgrades + +1. **Scale down before upgrading:** Before running `helm upgrade`, scale your + Coder deployment down to eliminate database connection contention from + existing pods. + + - **Scale to zero** for a clean cutover with no active database connections + when the upgrade starts. 
This briefly cuts off all application access to + the database, so migrations can acquire locks immediately: + + ```shell + kubectl scale deployment coder --replicas=0 + ``` + + - **Scale to one** if you prefer to minimize downtime. This keeps one pod + running but eliminates contention from multiple replicas: + + ```shell + kubectl scale deployment coder --replicas=1 + ``` + +1. **Perform upgrade:** Run your standard Helm upgrade command. When scaling to + zero, this will bring up a fresh pod that can run migrations without + competing for database locks. + +1. **Scale back:** Once the upgrade is healthy, scale back to your desired + replica count. + +## Kubernetes liveness probes and long-running migrations + +Liveness probes can cause pods to be killed during long-running database +migrations. Starting with Coder v2.30.0, liveness probes are *disabled by +default* in the Helm chart. + +This change was made because: + +- Liveness probes can kill pods during legitimate long-running migrations +- If a Coder pod becomes unresponsive (due to a deadlock, etc.), it's better to + investigate the issue rather than have Kubernetes silently restart the pod + +If you have enabled liveness probes in your deployment and observe pods +restarting with `CrashLoopBackOff` during an upgrade, the liveness probe may be +killing the pod prematurely. + +### Diagnosing liveness probe issues + +To confirm whether Kubernetes is killing pods due to liveness probe failures, +check the Kubernetes events and pod logs: + +```shell +# Check events for the Coder deployment +kubectl get events --field-selector involvedObject.name=coder -n <namespace> + +# Check pod logs for migration progress +kubectl logs -l app.kubernetes.io/name=coder -n <namespace> --previous +``` + +Look for events indicating `Liveness probe failed` or `Container coder failed +liveness probe, will be restarted`.
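As a rough aid for the check above, the following sketch counts probe-related lines in `kubectl get events` output. The function name `count_liveness_failures` and the `coder` namespace in the usage comment are assumptions for illustration; neither is part of Coder or kubectl. The usage example lists all events in the namespace instead of filtering by object name, because the kubelet typically records probe failures against the pod, not the Deployment.

```shell
#!/bin/sh
# Hypothetical helper: count lines on stdin that mention a liveness probe
# failure. Matches both "Liveness probe failed" and "failed liveness probe".
count_liveness_failures() {
  # grep -c exits non-zero when the count is 0, so normalize the exit status.
  grep -c -i -E 'liveness probe failed|failed liveness probe' || true
}

# Usage (assumes Coder runs in a namespace named "coder"):
#   kubectl get events -n coder | count_liveness_failures
```

A non-zero count during an upgrade suggests the probe, not the migration itself, is restarting the pod.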
+ +### Recommended approach + +If you have liveness probes enabled and experience issues during upgrades, +disable them before upgrading: + +```shell +kubectl edit deployment coder +``` + +Remove the `livenessProbe` section entirely, then proceed with the upgrade. + +> [!NOTE] +> For versions prior to v2.30.0, liveness probes were enabled by default. You +> can disable them by editing the Deployment directly with `kubectl edit +> deployment coder` or by using a ConfigMap override. See the +> [Helm chart values](https://artifacthub.io/packages/helm/coder-v2/coder?modal=values&path=coder.livenessProbe) +> for configuration options available in v2.30.0+. + +### Workaround steps + +1. **Remove or adjust liveness probes:** Temporarily remove the `livenessProbe` + from your Deployment configuration to prevent Kubernetes from restarting the + pod during migrations. + +1. **Isolate the migration:** Ensure all extra replica sets are shut down. If + you have clear evidence of database locks from old pods, scale the deployment + to 1 replica to prevent old pods from holding locks on the tables being + upgraded. + +1. **Clear database locks:** Monitor database activity. If the migration remains + blocked by locks despite scaling down, you may need to manually terminate + existing connections. See + [Recovering from failed database migrations](#recovering-from-failed-database-migrations) + below for instructions. + +## Recovering from failed database migrations + +If an upgrade gets stuck in a restart loop due to database locks: + +1. **Scale to zero:** Scale the Coder deployment to 0 to stop all application + activity. + + ```shell + kubectl scale deployment coder --replicas=0 + ``` + +1. **Clear connections:** Terminate existing connections to the Coder database + to release any lingering locks. This PostgreSQL command drops all active + connections to the database: + + > [!CAUTION] + > This command is intrusive and should be used as a last resort. 
Contact + > [Coder support](../support/index.md) before running destructive database + > commands in production. SQL commands may vary depending on your PostgreSQL + > version and configuration. + + ```sql + SELECT pg_terminate_backend(pid) + FROM pg_stat_activity + WHERE datname = 'coder' + AND pid <> pg_backend_pid(); + ``` + +1. **Check schema migrations:** Check the recorded migration version and + whether `dirty` is true. If the version has advanced, treat it as the + current state of your Coder installation. + + > [!NOTE] + > The SQL command below is read-only and for informational purposes. If you + > are unsure about querying your database directly, contact + > [Coder support](../support/index.md) for assistance. + + ```sql + SELECT * FROM schema_migrations; + ``` + +1. **Ensure image version:** Confirm the Deployment image is set to the + appropriate version (old or new, depending on the database migration state + found in step 3). To decide, open the + [migrations directory](https://github.com/coder/coder/tree/main/coderd/database/migrations) + at your release tag and compare it to the value in the `schema_migrations` output. + +1. **Resume the upgrade:** Follow the + [pre-upgrade strategy](#recommended-strategy-for-major-upgrades) to scale + back up and continue the upgrade process.
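The migration check in step 3 can be sketched as a small shell helper. This is a minimal sketch under stated assumptions: that `schema_migrations` has `version` and `dirty` columns, and that `psql -At` prints the row as `version|dirty` with `t` or `f` for the boolean. The function name `describe_migration_row` is hypothetical, and the usage comment assumes your connection URL is available in `$CODER_PG_CONNECTION_URL`.

```shell
#!/bin/sh
# Hypothetical helper: explain one "version|dirty" row as printed by `psql -At`.
describe_migration_row() {
  row=$1
  version=${row%%|*}   # text before the first "|"
  dirty=${row#*|}      # text after the first "|"
  case $dirty in
    t) echo "version $version is dirty: a migration did not finish cleanly" ;;
    f) echo "version $version is clean" ;;
    *) echo "unrecognized row: $row" ;;
  esac
}

# Usage (assumes the connection URL is in $CODER_PG_CONNECTION_URL):
#   row=$(psql -At -c 'SELECT version, dirty FROM schema_migrations;' "$CODER_PG_CONNECTION_URL")
#   describe_migration_row "$row"
```

The helper only reads and reports; which image tag to run is still decided in step 4.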
+ +## When to contact support + +If you encounter any of the following issues, contact +[Coder support](../support/index.md): + +- Locking issues that cannot be mitigated by the steps in this guide +- Migrations taking significantly longer than expected (more than 15 minutes) + without evidence of lock contention—this may indicate database resource + constraints requiring investigation +- Resource consumption issues (excessive memory, CPU, or OOM kills) during + upgrades +- Any other upgrade problems not covered by this documentation + +When contacting support, please collect and provide: + +- `coderd` logs with details on the stages where the upgrade stalled +- PostgreSQL logs if available +- The Coder versions involved (source and target) +- Your deployment configuration (number of replicas, resource limits) diff --git a/docs/install/upgrade.md b/docs/install/upgrade.md index 7b8b0347bd..2559217edc 100644 --- a/docs/install/upgrade.md +++ b/docs/install/upgrade.md @@ -6,6 +6,9 @@ This article describes how to upgrade your Coder server. > Prior to upgrading a production Coder deployment, take a database snapshot since > Coder does not support rollbacks. +For upgrade recommendations and troubleshooting, see +[Upgrading Best Practices](./upgrade-best-practices.md). + ## Reinstall Coder to upgrade To upgrade your Coder server, reinstall Coder using your original method diff --git a/docs/manifest.json b/docs/manifest.json index 84b7c4428f..d505a7305a 100644 --- a/docs/manifest.json +++ b/docs/manifest.json @@ -169,7 +169,14 @@ "title": "Upgrading", "description": "Learn how to upgrade Coder", "path": "./install/upgrade.md", - "icon_path": "./images/icons/upgrade.svg" + "icon_path": "./images/icons/upgrade.svg", + "children": [ + { + "title": "Upgrading Best Practices", + "description": "Best practices and troubleshooting for Coder upgrades", + "path": "./install/upgrade-best-practices.md" + } + ] }, { "title": "Uninstall",