# Upgrading Best Practices

This guide provides best practices for upgrading Coder, along with
troubleshooting steps for common issues encountered during upgrades,
particularly with database migrations in high availability (HA) deployments.

## Before you upgrade

> [!TIP]
> To check your current Coder version, use `coder version` from the CLI, check
> the bottom-right of the Coder dashboard, or query the `/api/v2/buildinfo`
> endpoint. See the [version command](../reference/cli/version.md) for details.

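As a minimal sketch of the endpoint option, the helper below pulls the version
string out of a `buildinfo` response. It assumes the response is JSON with a
top-level `version` string field; `CODER_URL` is a placeholder for your
deployment's access URL, not a name taken from this guide.

```shell
# Extract the "version" field from a buildinfo JSON document read on stdin.
# Assumption: the response contains a top-level "version" string.
extract_version() {
  sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example (CODER_URL is a placeholder for your access URL):
# curl -fsS "$CODER_URL/api/v2/buildinfo" | extract_version
```

Recording this value before the upgrade gives you the source version to quote
if you later need to contact support.
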
- **Schedule upgrades during off-peak hours.** Upgrades can cause a noticeable
  disruption to the developer experience. Plan your maintenance window when
  the fewest developers are actively using their workspaces.
- **The larger the version jump, the more migrations will run.** If you are
  upgrading across multiple minor versions, expect longer migration times.
- **Large upgrades should complete in minutes** (typically 4-7 minutes). If your
  upgrade is taking significantly longer, there may be an issue requiring
  investigation.
- **Check for known issues affecting your upgrade path.** Some version upgrades
  have known issues that may require a larger maintenance window or additional
  steps. For example, upgrades from v2.26.0 to v2.27.8 may encounter issues with
  the `api_keys` table; upgrading to v2.26.6 first can help mitigate this.
  Contact [Coder support](../support/index.md) for guidance on your specific
  upgrade path.

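To get a rough sense of how large a jump is, you can count the minor versions
between the source and target releases. The helper below is an illustrative
sketch, not part of Coder's tooling; it assumes versions of the form
`vMAJOR.MINOR.PATCH`.

```shell
# Print the number of minor versions between two releases,
# e.g. v2.26.0 -> v2.27.8 is a jump of 1.
minor_jump() {
  from=${1#v}
  to=${2#v}
  from_major=${from%%.*}
  to_major=${to%%.*}
  from_rest=${from#*.}
  to_rest=${to#*.}
  from_minor=${from_rest%%.*}
  to_minor=${to_rest%%.*}
  if [ "$from_major" != "$to_major" ]; then
    echo "major versions differ" >&2
    return 1
  fi
  echo $((to_minor - from_minor))
}

# Example:
# minor_jump v2.26.0 v2.27.8
```

A larger number suggests more migrations and a longer maintenance window, but
it does not predict the actual duration.
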
## Pre-upgrade strategy for Kubernetes HA deployments

Standard Kubernetes rolling updates may fail when a migration requires
exclusive database locks, because old replicas keep their connections open
while the new pod starts. In production deployments running multiple replicas
(HA), those active connections can prevent the new pod from acquiring the
locks it needs.

### Recommended strategy for major upgrades

1. **Scale down before upgrading:** Before running `helm upgrade`, scale your
   Coder deployment down to eliminate database connection contention from
   existing pods.

   - **Scale to zero** for a clean cutover with no active database connections
     when the upgrade starts. This temporarily removes all application access
     to the database, allowing migrations to acquire locks immediately:

     ```shell
     kubectl scale deployment coder --replicas=0
     ```

   - **Scale to one** if you prefer to minimize downtime. This keeps one pod
     running but eliminates contention from multiple replicas:

     ```shell
     kubectl scale deployment coder --replicas=1
     ```

1. **Perform upgrade:** Run your standard Helm upgrade command. If you scaled
   to zero, this will bring up a fresh pod that can run migrations without
   competing for database locks.

1. **Scale back:** Once the upgrade is healthy, scale back to your desired
   replica count.

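The three steps above can be sketched as a single script. This is an
illustration under stated assumptions, not a supported procedure: the
namespace `coder`, the release name `coder`, the `values.yaml` file, the
replica count, and the 15-minute timeout are all placeholders. The script
defaults to a dry run that only prints the commands.

```shell
# Placeholder defaults; adjust to your environment.
NAMESPACE=${NAMESPACE:-coder}
REPLICAS=${REPLICAS:-3}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

upgrade_coder() {
  # 1. Scale to zero so no old pod holds database connections.
  run kubectl scale deployment coder --replicas=0 -n "$NAMESPACE" &&
  # 2. Run the Helm upgrade (substitute your standard command here).
  run helm upgrade coder coder-v2/coder -n "$NAMESPACE" -f values.yaml &&
  # Wait for the rollout to become healthy before scaling back.
  run kubectl rollout status deployment/coder -n "$NAMESPACE" --timeout=15m &&
  # 3. Scale back to the desired replica count.
  run kubectl scale deployment coder --replicas="$REPLICAS" -n "$NAMESPACE"
}

# Example: review the printed commands first, then rerun with DRY_RUN=0.
# upgrade_coder
```

Chaining the steps with `&&` stops the sequence at the first failure, so the
deployment is not scaled back up on top of a failed upgrade.
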
## Kubernetes liveness probes and long-running migrations

Liveness probes can cause pods to be killed during long-running database
migrations. Starting with Coder v2.30.0, liveness probes are *disabled by
default* in the Helm chart.

This change was made because:

- Liveness probes can kill pods during legitimate long-running migrations
- If a Coder pod becomes unresponsive (due to a deadlock, etc.), it's better to
  investigate the issue rather than have Kubernetes silently restart the pod

If you have enabled liveness probes in your deployment and observe pods
restarting with `CrashLoopBackOff` during an upgrade, the liveness probe may be
killing the pod prematurely.

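If you are not sure whether your Deployment currently defines a liveness
probe, a quick check along these lines may help. It assumes the Deployment is
named `coder` and lives in the namespace given by `NAMESPACE` (a placeholder
that defaults to `coder` here).

```shell
# Report whether any container in the Deployment defines a livenessProbe.
liveness_probe_status() {
  probe=$(kubectl get deployment coder -n "${NAMESPACE:-coder}" \
    -o jsonpath='{.spec.template.spec.containers[*].livenessProbe}')
  if [ -n "$probe" ]; then
    echo "enabled"
  else
    echo "disabled"
  fi
}

# Example:
# liveness_probe_status
```
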
### Diagnosing liveness probe issues

To confirm whether Kubernetes is killing pods due to liveness probe failures,
check the Kubernetes events and pod logs:

```shell
# Check events for the Coder deployment
kubectl get events --field-selector involvedObject.name=coder -n <namespace>

# Check pod logs for migration progress
kubectl logs -l app.kubernetes.io/name=coder -n <namespace> --previous
```

Look for events indicating `Liveness probe failed` or
`Container coder failed liveness probe, will be restarted`.

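When the event list is long, counting the matching lines can make the signal
easier to spot. The sketch below filters event text on stdin for the
`Liveness probe failed` message quoted above. The `reason=Unhealthy` selector
in the usage comment is an assumption about how your cluster labels probe
failures.

```shell
# Count event lines that report a failed liveness probe.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function's exit status clean.
count_liveness_failures() {
  grep -c 'Liveness probe failed' || true
}

# Example (replace <namespace> with your namespace):
# kubectl get events -n <namespace> --field-selector reason=Unhealthy | count_liveness_failures
```
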
### Recommended approach

If you have liveness probes enabled and experience issues during upgrades,
disable them before upgrading:

```shell
kubectl edit deployment coder
```

Remove the `livenessProbe` section entirely, then proceed with the upgrade.

> [!NOTE]
> For versions prior to v2.30.0, liveness probes were enabled by default. You
> can disable them by editing the Deployment directly with
> `kubectl edit deployment coder` or by using a ConfigMap override. See the
> [Helm chart values](https://artifacthub.io/packages/helm/coder-v2/coder?modal=values&path=coder.livenessProbe)
> for configuration options available in v2.30.0+.

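If you prefer a non-interactive alternative to `kubectl edit`, a JSON patch
can remove the probe. This is a sketch that assumes the Coder container is the
first container in the pod spec (index `0`) and that `NAMESPACE` is set to
your namespace. The patch fails if no `livenessProbe` is present at that path.

```shell
# Remove the livenessProbe from the first container of the Deployment.
# Assumption: the Coder container is at index 0 in the pod template.
remove_liveness_probe() {
  kubectl patch deployment coder -n "${NAMESPACE:-coder}" --type=json \
    -p='[{"op":"remove","path":"/spec/template/spec/containers/0/livenessProbe"}]'
}

# Example:
# remove_liveness_probe
```
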
### Workaround steps

1. **Remove or adjust liveness probes:** Temporarily remove the `livenessProbe`
   from your Deployment configuration to prevent Kubernetes from restarting the
   pod during migrations.

1. **Isolate the migration:** Ensure all extra replica sets are shut down. If
   you have clear evidence of database locks from old pods, scale the deployment
   to 1 replica to prevent old pods from holding locks on the tables being
   upgraded.

1. **Clear database locks:** Monitor database activity. If the migration remains
   blocked by locks despite scaling down, you may need to manually terminate
   existing connections. See
   [Recovering from failed database migrations](#recovering-from-failed-database-migrations)
   below for instructions.

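For the "monitor database activity" step, one read-only way to see which
sessions are waiting on locks is to query `pg_stat_activity` together with
`pg_blocking_pids`. The sketch below only prints the query. It assumes the
database is named `coder` and PostgreSQL 9.6 or later; `PG_URL` is a
placeholder for your connection string.

```shell
# Print a query listing sessions in the Coder database that are blocked on locks.
# Assumption: the database is named "coder".
blocked_sessions_sql() {
  cat <<'SQL'
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE datname = 'coder'
  AND cardinality(pg_blocking_pids(pid)) > 0;
SQL
}

# Example (PG_URL is a placeholder for your PostgreSQL connection string):
# blocked_sessions_sql | psql "$PG_URL"
```

An empty result suggests the migration is not currently blocked by another
session, in which case terminating connections is unlikely to help.
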
## Recovering from failed database migrations

If an upgrade gets stuck in a restart loop due to database locks:

1. **Scale to zero:** Scale the Coder deployment to 0 to stop all application
   activity.

   ```shell
   kubectl scale deployment coder --replicas=0
   ```

1. **Clear connections:** Terminate existing connections to the Coder database
   to release any lingering locks. This PostgreSQL command terminates every
   connection to the `coder` database except the one running the command
   (adjust `datname` if your database has a different name):

   > [!CAUTION]
   > This command is intrusive and should be used as a last resort. Contact
   > [Coder support](../support/index.md) before running destructive database
   > commands in production. SQL commands may vary depending on your PostgreSQL
   > version and configuration.

   ```sql
   SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE datname = 'coder'
     AND pid <> pg_backend_pid();
   ```

1. **Check schema migrations:** Check which migration version the database has
   reached and whether `dirty` is true. If the version has advanced, it now
   reflects the current state of your Coder installation.

   > [!NOTE]
   > The SQL commands below are for informational purposes. If you are unsure
   > about querying your database directly, contact
   > [Coder support](../support/index.md) for assistance.

   ```sql
   SELECT * FROM schema_migrations;
   ```

1. **Ensure image version:** Confirm the Deployment image is set to the
   appropriate version (old or new, depending on the database migration state
   found in step 3). To decide, compare the value in the `schema_migrations`
   output with the migrations shipped at your release tag in the
   [migrations directory](https://github.com/coder/coder/tree/main/coderd/database/migrations).

1. **Resume the upgrade:** Follow the
   [pre-upgrade strategy](#recommended-strategy-for-major-upgrades) to scale
   back up and continue the upgrade process.

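As a rough aid for step 3, the helper below interprets one row of
`schema_migrations`. It assumes the table has `version` and `dirty` columns
and that the row arrives in `psql` unaligned, tuples-only format, for example
`345|f`. The version number shown is made up for illustration, and `PG_URL` is
a placeholder.

```shell
# Describe a schema_migrations row given as "version|dirty" on stdin,
# where dirty is "t" or "f" as printed by psql -At.
describe_migration_state() {
  IFS='|' read -r version dirty
  if [ "$dirty" = "t" ]; then
    echo "version $version is dirty: a migration did not finish cleanly"
  else
    echo "version $version is clean"
  fi
}

# Example (PG_URL is a placeholder for your PostgreSQL connection string):
# psql -At "$PG_URL" -c 'SELECT version, dirty FROM schema_migrations' | describe_migration_state
```

A dirty row generally means a migration was interrupted partway. In that case,
contact Coder support rather than editing the table by hand.
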
## When to contact support

If you encounter any of the following issues, contact
[Coder support](../support/index.md):

- Locking issues that cannot be mitigated by the steps in this guide
- Migrations taking significantly longer than expected (more than 15 minutes)
  without evidence of lock contention; this may indicate database resource
  constraints requiring investigation
- Resource consumption issues (excessive memory, CPU, or OOM kills) during
  upgrades
- Any other upgrade problems not covered by this documentation

When contacting support, please collect and provide:

- `coderd` logs with details on the stages where the upgrade stalled
- PostgreSQL logs if available
- The Coder versions involved (source and target)
- Your deployment configuration (number of replicas, resource limits)

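Gathering the Kubernetes-side items can be sketched as a small script. The
output directory name and the `NAMESPACE` default are placeholders, and the
label selector is the one used earlier in this guide. Review the collected
files for secrets before sharing them.

```shell
# Collect coderd logs, the Deployment manifest, and recent events into a directory.
collect_support_bundle() {
  out_dir=${1:-coder-support}
  ns=${NAMESPACE:-coder}
  mkdir -p "$out_dir"

  # Current logs; --tail=-1 requests the full log instead of the last lines.
  kubectl logs -l app.kubernetes.io/name=coder -n "$ns" \
    --all-containers --tail=-1 > "$out_dir/coderd.log"

  # Logs from the previous container instance, if a restart occurred.
  kubectl logs -l app.kubernetes.io/name=coder -n "$ns" \
    --previous --tail=-1 > "$out_dir/coderd-previous.log" 2>/dev/null || true

  # Deployment configuration (replica count, resource limits, image tag).
  kubectl get deployment coder -n "$ns" -o yaml > "$out_dir/deployment.yaml"

  # Recent events, oldest first.
  kubectl get events -n "$ns" --sort-by=.lastTimestamp > "$out_dir/events.txt"
}

# Example:
# collect_support_bundle ./coder-support
```
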