Edward Angert e5ba8b7912 docs: update aws instance recommendations (#17344)
from @jatcod3r on Slack:

> for the AWS recs on our [validated
arch](https://coder.com/docs/admin/infrastructure/validated-architectures/1k-users)
docs, should we be referencing customers to use non-T type instances?
> Once you've exceeded EC2's [CPU
credits](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances.html)
Coder starts performing poorly.
> We do suggest to [scale for peak
demand](https://coder.com/docs/tutorials/best-practices/scale-coder#scaling-3),
so does recommending something from the
[cpu](https://aws.amazon.com/ec2/instance-types/#Compute_Optimized) or
[memory
optimized](https://aws.amazon.com/ec2/instance-types/#Memory_Optimized)
types make sense?


[preview](https://coder.com/docs/@aws-ec2-arch/admin/infrastructure/validated-architectures#aws-instance-types)

---------

Co-authored-by: EdwardAngert <17991901+EdwardAngert@users.noreply.github.com>
2025-04-10 14:35:29 -04:00


# Reference Architecture: up to 3,000 users
The 3,000 users architecture targets large-scale enterprises, possibly with
on-premises network and cloud deployments.

**Target load**: API: up to 550 RPS

**High Availability**: At this scale, use a fully managed, highly available
PostgreSQL service and enable all Coder observability features for operational
purposes.

**Observability**: Deploy monitoring solutions to gather Prometheus metrics and
visualize them with Grafana to gain detailed insights into infrastructure and
application behavior. This allows operators to respond quickly to incidents and
continuously improve the reliability and performance of the platform.

## Hardware recommendations
### Coderd nodes
| Users | Node capacity | Replicas | GCP | AWS | Azure |
|-------------|----------------------|-----------------------|-----------------|-------------|-------------------|
| Up to 3,000 | 8 vCPU, 32 GB memory | 4 nodes, 1 coderd each | `n1-standard-4` | `m5.xlarge` | `Standard_D4s_v3` |
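
To put the 550 RPS target load in per-replica terms, the sketch below divides it across the four `coderd` replicas from the table. It assumes the load balancer spreads requests evenly across healthy replicas, which this page does not state; the function name is illustrative only.

```python
def rps_per_replica(target_rps: float, replicas: int, failed: int = 0) -> float:
    """Average request rate each healthy replica must absorb.

    Assumes requests are spread evenly across healthy replicas.
    """
    healthy = replicas - failed
    if healthy <= 0:
        raise ValueError("at least one healthy replica is required")
    return target_rps / healthy


# All four coderd replicas healthy.
print(rps_per_replica(550, 4))                      # 137.5
# One replica unavailable: the remaining three share the load.
print(round(rps_per_replica(550, 4, failed=1), 1))  # 183.3
```

Under that assumption, losing one replica raises the load on each remaining replica by about a third, which is worth keeping in mind when sizing for peak demand.
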
### Provisioner nodes
| Users | Node capacity | Replicas | GCP | AWS | Azure |
|-------------|----------------------|-------------------------------|------------------|--------------|-------------------|
| Up to 3,000 | 8 vCPU, 32 GB memory | 8 nodes, 30 provisioners each | `t2d-standard-8` | `c5.2xlarge` | `Standard_D8s_v3` |

**Footnotes**:

- An external provisioner is deployed as a Kubernetes pod.
- Running provisioner daemons on `coderd` nodes is strongly discouraged at
  this level of scale.
- Separate provisioners into different namespaces to support zero-trust or
  multi-cloud deployments.
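
As a rough illustration of what the provisioner sizing implies, the sketch below computes the number of build slots (8 nodes with 30 provisioners each) and estimates how long a burst of queued builds would take to drain. It assumes each provisioner runs one build at a time; the 3,000-build burst and the 2-minute build duration are hypothetical inputs, not figures from this page.

```python
import math


def build_slots(nodes: int, provisioners_per_node: int) -> int:
    """Concurrent builds, assuming one build per provisioner at a time."""
    return nodes * provisioners_per_node


def drain_minutes(queued_builds: int, slots: int, minutes_per_build: float) -> float:
    """Time to finish a queue when builds run in full 'waves' of `slots`."""
    waves = math.ceil(queued_builds / slots)
    return waves * minutes_per_build


slots = build_slots(nodes=8, provisioners_per_node=30)
print(slots)  # 240

# Hypothetical burst: every user starts a workspace at once, 2 minutes per build.
print(drain_minutes(queued_builds=3000, slots=slots, minutes_per_build=2.0))  # 26.0
```

The wave model is deliberately coarse: real builds vary in duration, so treat the result as an order-of-magnitude estimate.
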
### Workspace nodes
| Users | Node capacity | Replicas | GCP | AWS | Azure |
|-------------|----------------------|-------------------------------|------------------|--------------|-------------------|
| Up to 3,000 | 8 vCPU, 32 GB memory | 256 nodes, 12 workspaces each | `t2d-standard-8` | `m5.2xlarge` | `Standard_D8s_v3` |

**Footnotes**:

- Assumes that each workspace user needs 2 GB of memory to work.
- Maximum number of Kubernetes workspace pods per node: 256
- Because workspace nodes can be distributed across regions, on-premises
  networks, and cloud areas, consider separate namespaces to support
  zero-trust or multi-cloud deployments.
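
The workspace sizing above can be checked with simple arithmetic. The sketch below uses only the table values (256 nodes, 12 workspaces per node, 32 GB per node) and the 2 GB per-user assumption from the footnotes. It ignores memory reserved for the operating system and Kubernetes components, so the real per-workspace headroom is somewhat lower; the function names are illustrative.

```python
def workspace_capacity(nodes: int, workspaces_per_node: int) -> int:
    """Total workspaces the node pool can host."""
    return nodes * workspaces_per_node


def memory_per_workspace_gb(node_memory_gb: float, workspaces_per_node: int) -> float:
    """Memory share per workspace, ignoring system-reserved memory."""
    return node_memory_gb / workspaces_per_node


capacity = workspace_capacity(nodes=256, workspaces_per_node=12)
per_workspace = memory_per_workspace_gb(node_memory_gb=32, workspaces_per_node=12)

print(capacity)                 # 3072
print(round(per_workspace, 2))  # 2.67
print(capacity >= 3000)         # True: covers the 3,000-user target
print(per_workspace >= 2)       # True: meets the 2 GB per-user assumption
```
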
### Database nodes
| Users | Node capacity | Replicas | Storage | GCP | AWS | Azure |
|-------------|----------------------|----------|---------|---------------------|-----------------|-------------------|
| Up to 3,000 | 8 vCPU, 32 GB memory | 2 nodes | 1.5 TB | `db-custom-8-30720` | `db.m5.2xlarge` | `Standard_D8s_v3` |

**Footnotes**:

- Consider adding more replicas if workspace activity exceeds 1,500
  workspace builds per day, or to achieve higher RPS.

**Footnotes for AWS instance types**:

- For production deployments, we recommend using non-burstable instance types,
  such as `m5` or `c5`, instead of burstable instances, such as `t3`.
  Burstable instances can experience significant performance degradation once
  CPU credits are exhausted, leading to poor user experience under sustained load.
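
To illustrate why sustained load is a problem for burstable instances, the sketch below estimates how long a credit balance lasts. It relies on AWS's definition that one CPU credit equals one vCPU at 100% utilization for one minute. The `t3.2xlarge` figures (8 vCPUs, 192 credits earned per hour, 4,608 maximum balance) are assumptions taken from AWS's published tables and should be verified against current AWS documentation. The model also assumes the instance runs in standard credit mode and starts with a full balance.

```python
import math


def hours_until_throttled(
    vcpus: int,
    utilization: float,
    credits_per_hour: float,
    starting_balance: float,
) -> float:
    """Hours of sustained load before the credit balance reaches zero.

    One CPU credit = one vCPU at 100% for one minute.
    Returns math.inf when credits are earned at least as fast as they are spent.
    """
    spent_per_hour = vcpus * utilization * 60
    net_drain = spent_per_hour - credits_per_hour
    if net_drain <= 0:
        return math.inf
    return starting_balance / net_drain


# Assumed t3.2xlarge figures; verify against AWS documentation.
T3_2XLARGE = dict(vcpus=8, credits_per_hour=192, starting_balance=4608)

print(hours_until_throttled(utilization=1.0, **T3_2XLARGE))   # 16.0
print(hours_until_throttled(utilization=0.25, **T3_2XLARGE))  # inf
```

Under these assumptions, a fully loaded instance would exhaust a full balance in well under a day, after which it is held to its baseline CPU performance. Non-burstable types such as `m5` or `c5` have no credit balance to exhaust.
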