docs: rootless podman support (#6026)

* rootless podman WIP

* docs: rootless podman support
Ben Potter
2023-02-06 08:05:38 -06:00
committed by GitHub
parent e70b3f2973
commit 968d7e4dc5
7 changed files with 651 additions and 4 deletions
+74 -4
@@ -2,10 +2,11 @@
There are a few ways to run Docker within container-based Coder workspaces.
| Method | Description | Limitations |
| ---------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [Sysbox container runtime](#sysbox-container-runtime) | Install sysbox on your Kubernetes nodes for secure docker-in-docker and systemd-in-docker. Works with GKE, EKS, AKS. | Requires [compatible nodes](https://github.com/nestybox/sysbox#host-requirements). Max of 16 sysbox pods per node. [See all](https://github.com/nestybox/sysbox/blob/master/docs/user-guide/limitations.md) |
| [Privileged docker sidecar](#privileged-sidecar-container) | Run docker as a privilged sidecar container. | Requires a privileged container. Workspaces can break out to root on the host machine. |
| Method | Description | Limitations |
| ------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [Sysbox container runtime](#sysbox-container-runtime) | Install the sysbox runtime on your Kubernetes nodes for secure docker-in-docker and systemd-in-docker. Works with GKE, EKS, AKS. | Requires [compatible nodes](https://github.com/nestybox/sysbox#host-requirements). Max of 16 sysbox pods per node. [See all](https://github.com/nestybox/sysbox/blob/master/docs/user-guide/limitations.md) |
| [Rootless Podman](https://github.com/bpmct/coder-templates/tree/main/rootless-podman) | Run podman inside Coder workspaces. Does not require a custom runtime or privileged containers. Works with GKE, EKS, AKS, RKE, OpenShift | Requires smarter-device-manager for FUSE mounts. [See all](https://github.com/containers/podman/blob/main/rootless.md#shortcomings-of-rootless-podman) |
| [Privileged docker sidecar](#privileged-sidecar-container) | Run docker as a privileged sidecar container. | Requires a privileged container. Workspaces can break out to root on the host machine. |
## Sysbox container runtime
@@ -109,6 +110,75 @@ resource "kubernetes_pod" "dev" {
> Sysbox CE (Community Edition) supports a maximum of 16 pods (workspaces) per node on Kubernetes. See the [Sysbox documentation](https://github.com/nestybox/sysbox/blob/master/docs/user-guide/install-k8s.md#limitations) for more details.
## Rootless podman
[Podman](https://docs.podman.io/en/latest/) is a Docker alternative that is compatible with the OCI container specification and can run rootless inside Kubernetes pods. No custom RuntimeClass is required.
Prior to completing the steps below, please review the following Podman documentation:
- [Basic setup and use of Podman in a rootless environment](https://github.com/containers/podman/blob/main/docs/tutorials/rootless_tutorial.md)
- [Shortcomings of Rootless Podman](https://github.com/containers/podman/blob/main/rootless.md#shortcomings-of-rootless-podman)
1. Enable [smarter-device-manager](https://gitlab.com/arm-research/smarter/smarter-device-manager#enabling-access) to securely expose FUSE devices to pods.
```sh
cat <<EOF | kubectl create -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fuse-device-plugin-daemonset
namespace: kube-system
spec:
selector:
matchLabels:
name: fuse-device-plugin-ds
template:
metadata:
labels:
name: fuse-device-plugin-ds
spec:
hostNetwork: true
containers:
- image: soolaugust/fuse-device-plugin:v1.0
name: fuse-device-plugin-ctr
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
volumeMounts:
- name: device-plugin
mountPath: /var/lib/kubelet/device-plugins
volumes:
- name: device-plugin
hostPath:
path: /var/lib/kubelet/device-plugins
imagePullSecrets:
- name: registry-secret
EOF
```
2. Be sure to label your nodes to enable smarter-device-manager:
```sh
kubectl get nodes
kubectl label nodes --all smarter-device-manager=enabled
```
> ⚠️ **Warning**: If you are using a managed Kubernetes distribution (e.g. AKS, EKS, GKE), be sure to set node labels via your cloud provider. Otherwise, your nodes may drop the labels and break podman functionality.
3. For systems running SELinux (typically Fedora-, CentOS-, and Red Hat-based systems), you may need to disable SELinux or set it to permissive mode.
4. Import our [kubernetes-podman](https://github.com/coder/coder/tree/main/examples/templates/kubernetes-podman) example template, or make your own.
```sh
echo "kubernetes-podman" | coder templates init
cd ./kubernetes-podman
coder templates create
```
> For more information about the requirements of rootless podman pods, see: [How to run Podman inside of Kubernetes](https://www.redhat.com/sysadmin/podman-inside-kubernetes)
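Before importing the template, it can help to confirm that the cluster-side setup took effect. The sketch below is not part of the official instructions; the function names are hypothetical, and it assumes the device plugin advertises the `github.com/fuse` resource that the example template requests.

```sh
# Count the nodes that carry the label applied in step 2.
count_labeled_nodes() {
  kubectl get nodes -l smarter-device-manager=enabled --no-headers 2>/dev/null \
    | wc -l | tr -d ' '
}

# Print each node name with the number of FUSE devices it advertises.
# Assumption: the resource is named github.com/fuse, as in the example template.
list_fuse_capacity() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.github\.com/fuse}{"\n"}{end}'
}

# Usage (requires access to the cluster):
#   count_labeled_nodes
#   list_fuse_capacity
```

If the count is zero, or a node shows no FUSE capacity, workspace pods that request the device will likely stay unschedulable.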
## Privileged sidecar container
A [privileged container](https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities) can be added to your templates to add docker support. This may come in handy if your nodes cannot run Sysbox.
@@ -0,0 +1,117 @@
---
name: Develop in Kubernetes
description: Get started with Kubernetes development.
tags: [cloud, kubernetes]
icon: /icon/k8s.png
---
# Getting started
This template creates [rootless podman](./images) pods with either an Ubuntu or Fedora base image.
> **Warning**: This template requires additional configuration on the Kubernetes cluster, such as installing `smarter-device-manager` for FUSE mounts. See our [Docker-in-Docker documentation](https://coder.com/docs/v2/latest/templates/docker-in-docker#rootless-podman) for instructions.
Base images are pushed to [Docker Hub](https://hub.docker.com//codercom)
## RBAC
The Coder provisioner requires permission to administer pods to use this template. The template
creates workspaces in a single Kubernetes namespace, using the `workspaces_namespace` parameter set
while creating the template.
Create a role as follows and bind it to the user or service account that runs the coder host.
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: coder
rules:
- apiGroups: [""]
resources: ["pods"]
verbs: ["*"]
```
## Authentication
This template can authenticate using in-cluster authentication, or using a kubeconfig local to the
Coder host. For additional authentication options, consult the [Kubernetes provider
documentation](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs).
### kubeconfig on Coder host
If the Coder host has a local `~/.kube/config`, the template can use it to authenticate
with the Kubernetes cluster. Make sure the kubeconfig belongs to the same user that runs the `coder` service.
To use this authentication, set the parameter `use_kubeconfig` to true.
### In-cluster authentication
If the Coder host runs in a Pod on the same Kubernetes cluster as you are creating workspaces in,
you can use in-cluster authentication.
To use this authentication, set the parameter `use_kubeconfig` to false.
The Terraform provisioner will automatically use the service account associated with the pod to
authenticate to Kubernetes. Be sure to bind a [role with appropriate permission](#rbac) to the
service account. For example, assuming the Coder host runs in the same namespace as you intend
to create workspaces:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: coder
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: coder
subjects:
- kind: ServiceAccount
name: coder
roleRef:
kind: Role
name: coder
apiGroup: rbac.authorization.k8s.io
```
Then start the Coder host with `serviceAccountName: coder` in the pod spec.
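To check that the binding works, you can ask Kubernetes whether the service account is allowed to create pods. This is a minimal sketch: `can_manage_pods` is a hypothetical helper, and the namespace passed to it is whatever namespace you created the Role and RoleBinding in.

```sh
# Ask the API server whether the service account may create pods.
# kubectl prints "yes" or "no" (and exits non-zero on "no").
can_manage_pods() {
  ns="$1"           # namespace where workspaces are created
  sa="${2:-coder}"  # service account name; defaults to the one defined above
  kubectl auth can-i create pods \
    --namespace "$ns" \
    --as "system:serviceaccount:${ns}:${sa}"
}

# Usage (requires access to the cluster and impersonation rights):
#   can_manage_pods my-namespace
```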
## Namespace
The target namespace in which the pod will be deployed is defined via the `coder_workspace`
variable. The namespace must exist prior to creating workspaces.
## Persistence
The `/home/coder` directory in this example is persisted via the attached PersistentVolumeClaim.
Any data saved outside of this directory will be wiped when the workspace stops.
Since most binary installations and environment configurations live outside of
the `/home` directory, we suggest including these in the `startup_script` argument
of the `coder_agent` resource block, which will run each time the workspace starts up.
For example, when installing the `aws` CLI, the install script will place the
`aws` binary in `/usr/local/bin/aws`. To ensure the `aws` CLI is persisted across
workspace starts/stops, include the following code in the `coder_agent` resource
block of your workspace template:
```terraform
resource "coder_agent" "main" {
startup_script = <<-EOT
set -e
# install AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
EOT
}
```
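Because the startup script runs on every workspace start, it is worth making install steps safe to repeat. The following is a sketch only; `install_if_missing` is a hypothetical helper, not something the template provides. It skips the installer when the binary is already on `PATH`, which matters mostly when a tool is baked into the image or installed under the persisted home directory.

```sh
# Run the installer command only when the named binary is not already on PATH.
install_if_missing() {
  binary="$1"
  shift
  if command -v "$binary" >/dev/null 2>&1; then
    echo "$binary already installed, skipping"
    return 0
  fi
  # Everything after the first argument is the installer command.
  "$@"
}

# Usage inside a startup script (installer command is illustrative):
#   install_if_missing aws sudo ./aws/install
```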
## code-server
`code-server` is installed via the `startup_script` argument in the `coder_agent`
resource block. The `coder_app` resource is defined to access `code-server` through
the dashboard UI over `localhost:13337`.
@@ -0,0 +1,35 @@
FROM registry.fedoraproject.org/fedora:latest
LABEL org.opencontainers.image.description="Base Fedora image for rootless podman in Coder. See https://coder.com/docs/v2/latest/templates/docker-in-docker#rootless-podman"
RUN dnf -y update && \
rpm --setcaps shadow-utils 2>/dev/null && \
dnf -y install podman fuse-overlayfs openssh-clients \
--exclude container-selinux && \
dnf clean all && \
rm -rf /var/cache /var/log/dnf* /var/log/yum.*
RUN useradd podman; \
echo -e "podman:1:999\npodman:1001:64535" > /etc/subuid; \
echo -e "podman:1:999\npodman:1001:64535" > /etc/subgid;
ADD containers.conf /etc/containers/containers.conf
ADD storage.conf /etc/containers/storage.conf
RUN chmod 644 /etc/containers/containers.conf && \
chmod 644 /etc/containers/storage.conf
RUN mkdir -p /var/lib/shared/overlay-images \
/var/lib/shared/overlay-layers \
/var/lib/shared/vfs-images \
/var/lib/shared/vfs-layers && \
touch /var/lib/shared/overlay-images/images.lock && \
touch /var/lib/shared/overlay-layers/layers.lock && \
touch /var/lib/shared/vfs-images/images.lock && \
touch /var/lib/shared/vfs-layers/layers.lock
# Alias "docker" to "podman"
RUN ln -s /usr/bin/podman /usr/bin/docker
USER podman
ENV _CONTAINERS_USERNS_CONFIGURED=""
@@ -0,0 +1,59 @@
FROM ubuntu:22.04
LABEL org.opencontainers.image.description="Base Ubuntu image for rootless podman in Coder. See https://coder.com/docs/v2/latest/templates/docker-in-docker#rootless-podman"
USER root
# Install dependencies
RUN apt-get update && apt-get install -y sudo gnupg2 curl vim fuse-overlayfs libvshadow-utils openssh-client
# Install podman
RUN mkdir -p /etc/apt/keyrings
RUN curl -fsSL https://download.opensuse.org/repositories/devel:kubic:libcontainers:unstable/xUbuntu_22.04/Release.key \
| gpg --dearmor \
| tee /etc/apt/keyrings/devel_kubic_libcontainers_unstable.gpg > /dev/null
RUN echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/devel_kubic_libcontainers_unstable.gpg]\
https://download.opensuse.org/repositories/devel:kubic:libcontainers:unstable/xUbuntu_22.04/ /" \
| tee /etc/apt/sources.list.d/devel:kubic:libcontainers:unstable.list > /dev/null
RUN apt-get update && apt-get -y install podman
RUN setcap cap_setuid+ep /usr/bin/newuidmap
RUN setcap cap_setgid+ep /usr/bin/newgidmap
RUN chmod 0755 /usr/bin/newuidmap
RUN chmod 0755 /usr/bin/newgidmap
RUN useradd podman
RUN echo "podman:100000:65536" > /etc/subuid
RUN echo "podman:100000:65536" > /etc/subgid
RUN echo "podman ALL=(ALL) NOPASSWD:ALL" | sudo tee -a /etc/sudoers
ADD containers.conf /etc/containers/containers.conf
ADD storage.conf /etc/containers/storage.conf
RUN chmod 644 /etc/containers/containers.conf && \
chmod 644 /etc/containers/storage.conf
RUN mkdir -p /home/podman/.local/share/containers && \
chown podman:podman -R /home/podman && \
chmod 644 /etc/containers/containers.conf
RUN mkdir -p /var/lib/shared/overlay-images \
/var/lib/shared/overlay-layers \
/var/lib/shared/vfs-images \
/var/lib/shared/vfs-layers && \
touch /var/lib/shared/overlay-images/images.lock && \
touch /var/lib/shared/overlay-layers/layers.lock && \
touch /var/lib/shared/vfs-images/images.lock && \
touch /var/lib/shared/vfs-layers/layers.lock
ENV _CONTAINERS_USERNS_CONFIGURED=""
# Alias "docker" to "podman"
RUN ln -s /usr/bin/podman /usr/bin/docker
RUN chsh -s /bin/bash podman
USER podman
ENV SHELL=/bin/bash
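Both Dockerfiles give the `podman` user subordinate UID and GID ranges through `/etc/subuid` and `/etc/subgid`, where each line has the form `user:start:count`. As a rough way to confirm how many IDs a user ends up with, the sketch below sums the counts for one user; `subid_total` is a hypothetical helper written for illustration.

```sh
# Sum the number of subordinate IDs assigned to a user.
# $1 = user name, $2 = file to read (defaults to /etc/subuid).
subid_total() {
  awk -F: -v u="$1" '$1 == u { total += $3 } END { print total + 0 }' \
    "${2:-/etc/subuid}"
}

# Usage inside a built image:
#   subid_total podman /etc/subuid
#   subid_total podman /etc/subgid
```

For the Fedora image, the two ranges `1:999` and `1001:64535` add up to 65534 IDs; the Ubuntu image assigns a single range of 65536.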
@@ -0,0 +1,16 @@
[containers]
netns="host"
userns="host"
ipcns="host"
utsns="host"
cgroupns="host"
cgroups="disabled"
log_driver = "k8s-file"
volumes = [
"/proc:/proc",
]
default_sysctls = []
[engine]
cgroup_manager = "cgroupfs"
events_logger="file"
runtime="crun"
@@ -0,0 +1,233 @@
# This file is the configuration file for all tools
# that use the containers/storage library. The storage.conf file
# overrides all other storage.conf files. Container engines using the
# container/storage library do not inherit fields from other storage.conf
# files.
#
# Note: The storage.conf file overrides other storage.conf files based on this precedence:
# /usr/containers/storage.conf
# /etc/containers/storage.conf
# $HOME/.config/containers/storage.conf
# $XDG_CONFIG_HOME/containers/storage.conf (If XDG_CONFIG_HOME is set)
# See man 5 containers-storage.conf for more information
# The "container storage" table contains all of the server options.
[storage]
# Default Storage Driver, Must be set for proper operation.
driver = "overlay"
# Temporary storage location
runroot = "/run/containers/storage"
# Primary Read/Write location of container storage
# When changing the graphroot location on an SELINUX system, you must
# ensure the labeling matches the default locations labels with the
# following commands:
# semanage fcontext -a -e /var/lib/containers/storage /NEWSTORAGEPATH
# restorecon -R -v /NEWSTORAGEPATH
graphroot = "/var/lib/containers/storage"
# Storage path for rootless users
#
# rootless_storage_path = "$HOME/.local/share/containers/storage"
[storage.options]
# Storage options to be passed to underlying storage drivers
# AdditionalImageStores is used to pass paths to additional Read/Only image stores
# Must be comma separated list.
additionalimagestores = [
"/var/lib/shared",
]
# Allows specification of how storage is populated when pulling images. This
# option can speed the pulling process of images compressed with format
# zstd:chunked. Containers/storage looks for files within images that are being
# pulled from a container registry that were previously pulled to the host. It
# can copy or create a hard link to the existing file when it finds them,
# eliminating the need to pull them from the container registry. These options
# can deduplicate pulling of content, disk storage of content and can allow the
# kernel to use less memory when running containers.
# containers/storage supports four keys
# * enable_partial_images="true" | "false"
# Tells containers/storage to look for files previously pulled in storage
#   rather than always pulling them from the container registry.
# * use_hard_links = "false" | "true"
#   Tells containers/storage to use hard links rather than create new files in
# the image, if an identical file already existed in storage.
# * ostree_repos = ""
# Tells containers/storage where an ostree repository exists that might have
# previously pulled content which can be used when attempting to avoid
# pulling content from the container registry
pull_options = {enable_partial_images = "false", use_hard_links = "false", ostree_repos=""}
# Remap-UIDs/GIDs is the mapping from UIDs/GIDs as they should appear inside of
# a container, to the UIDs/GIDs as they should appear outside of the container,
# and the length of the range of UIDs/GIDs. Additional mapped sets can be
# listed and will be needed by libraries, but there are limits to the number of
# mappings which the kernel will allow when you later attempt to run a
# container.
#
# remap-uids = 0:1668442479:65536
# remap-gids = 0:1668442479:65536
# Remap-User/Group is a user name which can be used to look up one or more UID/GID
# ranges in the /etc/subuid or /etc/subgid file. Mappings are set up starting
# with an in-container ID of 0 and then a host-level ID taken from the lowest
# range that matches the specified name, and using the length of that range.
# Additional ranges are then assigned, using the ranges which specify the
# lowest host-level IDs first, to the lowest not-yet-mapped in-container ID,
# until all of the entries have been used for maps.
#
# remap-user = "containers"
# remap-group = "containers"
# Root-auto-userns-user is a user name which can be used to look up one or more UID/GID
# ranges in the /etc/subuid and /etc/subgid file. These ranges will be partitioned
# to containers configured to create automatically a user namespace. Containers
# configured to automatically create a user namespace can still overlap with containers
# having an explicit mapping set.
# This setting is ignored when running as rootless.
# root-auto-userns-user = "storage"
#
# Auto-userns-min-size is the minimum size for a user namespace created automatically.
# auto-userns-min-size=1024
#
# Auto-userns-max-size is the maximum size for a user namespace created automatically.
# auto-userns-max-size=65536
[storage.options.overlay]
# ignore_chown_errors can be set to allow a non privileged user running with
# a single UID within a user namespace to run containers. The user can pull
# and use any image even those with multiple uids. Note multiple UIDs will be
# squashed down to the default uid in the container. These images will have no
# separation between the users in the container. Only supported for the overlay
# and vfs drivers.
#ignore_chown_errors = "false"
# Inodes is used to set a maximum inodes of the container image.
# inodes = ""
# Path to an helper program to use for mounting the file system instead of mounting it
# directly.
mount_program = "/usr/bin/fuse-overlayfs"
# mountopt specifies comma separated list of extra mount options
mountopt = "nodev,fsync=0"
# Set to skip a PRIVATE bind mount on the storage home directory.
# skip_mount_home = "false"
# Size is used to set a maximum size of the container image.
# size = ""
# ForceMask specifies the permissions mask that is used for new files and
# directories.
#
# The values "shared" and "private" are accepted.
# Octal permission masks are also accepted.
#
# "": No value specified.
# All files/directories, get set with the permissions identified within the
# image.
# "private": it is equivalent to 0700.
# All files/directories get set with 0700 permissions. The owner has rwx
# access to the files. No other users on the system can access the files.
# This setting could be used with networked based homedirs.
# "shared": it is equivalent to 0755.
# The owner has rwx access to the files and everyone else can read, access
# and execute them. This setting is useful for sharing containers storage
# with other users. For instance have a storage owned by root but shared
# to rootless users as an additional store.
# NOTE: All files within the image are made readable and executable by any
# user on the system. Even /etc/shadow within your image is now readable by
# any user.
#
# OCTAL: Users can experiment with other OCTAL Permissions.
#
# Note: The force_mask Flag is an experimental feature, it could change in the
# future. When "force_mask" is set the original permission mask is stored in
# the "user.containers.override_stat" xattr and the "mount_program" option must
# be specified. Mount programs like "/usr/bin/fuse-overlayfs" present the
# extended attribute permissions to processes within containers rather than the
# "force_mask" permissions.
#
# force_mask = ""
[storage.options.thinpool]
# Storage Options for thinpool
# autoextend_percent determines the amount by which pool needs to be
# grown. This is specified in terms of % of pool size. So a value of 20 means
# that when threshold is hit, pool will be grown by 20% of existing
# pool size.
# autoextend_percent = "20"
# autoextend_threshold determines the pool extension threshold in terms
# of percentage of pool size. For example, if threshold is 60, that means when
# pool is 60% full, threshold has been hit.
# autoextend_threshold = "80"
# basesize specifies the size to use when creating the base device, which
# limits the size of images and containers.
# basesize = "10G"
# blocksize specifies a custom blocksize to use for the thin pool.
# blocksize="64k"
# directlvm_device specifies a custom block storage device to use for the
# thin pool. Required if you setup devicemapper.
# directlvm_device = ""
# directlvm_device_force wipes device even if device already has a filesystem.
# directlvm_device_force = "True"
# fs specifies the filesystem type to use for the base device.
# fs="xfs"
# log_level sets the log level of devicemapper.
# 0: LogLevelSuppress 0 (Default)
# 2: LogLevelFatal
# 3: LogLevelErr
# 4: LogLevelWarn
# 5: LogLevelNotice
# 6: LogLevelInfo
# 7: LogLevelDebug
# log_level = "7"
# min_free_space specifies the min free space percent in a thin pool required for
# new device creation to succeed. Valid values are from 0% - 99%.
# Value 0% disables
# min_free_space = "10%"
# mkfsarg specifies extra mkfs arguments to be used when creating the base
# device.
# mkfsarg = ""
# metadata_size is used to set the `pvcreate --metadatasize` options when
# creating thin devices. Default is 128k
# metadata_size = ""
# Size is used to set a maximum size of the container image.
# size = ""
# use_deferred_removal marks devicemapper block device for deferred removal.
# If the thinpool is in use when the driver attempts to remove it, the driver
# tells the kernel to remove it as soon as possible. Note this does not free
# up the disk space, use deferred deletion to fully remove the thinpool.
# use_deferred_removal = "True"
# use_deferred_deletion marks thinpool device for deferred deletion.
# If the device is busy when the driver attempts to delete it, the driver
# will attempt to delete device every 30 seconds until successful.
# If the program using the driver exits, the driver will continue attempting
# to cleanup the next time the driver is used. Deferred deletion permanently
# deletes the device and all data stored in device will be lost.
# use_deferred_deletion = "True"
# xfs_nospace_max_retries specifies the maximum number of retries XFS should
# attempt to complete IO when ENOSPC (no space) error is returned by
# underlying storage device.
# xfs_nospace_max_retries = "0"
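One way to confirm that `containers.conf` and `storage.conf` were picked up is to query `podman info` from inside a running workspace. This is a sketch under the assumption that your podman version exposes these Go template fields; `podman_setting` is a hypothetical wrapper.

```sh
# Print a single field from `podman info` using a Go template.
# $1 = field path without the leading dot, e.g. Store.GraphDriverName
podman_setting() {
  podman info --format "{{.$1}}"
}

# Usage inside a workspace; compare each value with the config files above:
#   podman_setting Store.GraphDriverName   # storage.conf: driver
#   podman_setting Host.CgroupManager      # containers.conf: cgroup_manager
#   podman_setting Host.OCIRuntime.Name    # containers.conf: runtime
```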
@@ -0,0 +1,117 @@
terraform {
required_providers {
coder = {
source = "coder/coder"
version = "~> 0.5.3"
}
kubernetes = {
source = "hashicorp/kubernetes"
version = "~> 2.10"
}
}
}
provider "kubernetes" {
config_path = "~/.kube/config"
}
data "coder_workspace" "me" {}
variable "os" {
description = "Operating system"
validation {
condition = contains(["ubuntu", "fedora"], var.os)
error_message = "Invalid operating system: must be ubuntu or fedora."
}
default = "ubuntu"
}
resource "coder_agent" "dev" {
os = "linux"
arch = "amd64"
dir = "/home/podman"
startup_script = <<EOF
#!/bin/sh
curl -fsSL https://code-server.dev/install.sh | sh
code-server --auth none --port 13337 &
# Run once to avoid unnecessary warning: "/" is not a shared mount
podman ps
EOF
}
# code-server
resource "coder_app" "code-server" {
agent_id = coder_agent.dev.id
name = "code-server"
icon = "/icon/code.svg"
url = "http://localhost:13337"
}
resource "kubernetes_pod" "main" {
count = data.coder_workspace.me.start_count
depends_on = [
kubernetes_persistent_volume_claim.home-directory
]
metadata {
name = "coder-${data.coder_workspace.me.id}"
namespace = "default"
annotations = {
# Disables apparmor, required for Debian- and Ubuntu-derived systems
"container.apparmor.security.beta.kubernetes.io/dev" = "unconfined"
}
}
spec {
security_context {
# Runs as the "podman" user
run_as_user = 1000
fs_group = 1000
}
container {
name = "dev"
# We recommend building your own from our reference: see ./images directory
image = "ghcr.io/coder/podman:${var.os}"
image_pull_policy = "Always"
command = ["/bin/bash", "-c", coder_agent.dev.init_script]
security_context {
# Runs as the "podman" user
run_as_user = "1000"
}
resources {
limits = {
# Acquire a FUSE device, powered by smarter-device-manager
"github.com/fuse" : 1
}
}
env {
name = "CODER_AGENT_TOKEN"
value = coder_agent.dev.token
}
volume_mount {
mount_path = "/home/podman"
name = "home-directory"
}
}
volume {
name = "home-directory"
persistent_volume_claim {
claim_name = kubernetes_persistent_volume_claim.home-directory.metadata.0.name
}
}
}
}
resource "kubernetes_persistent_volume_claim" "home-directory" {
metadata {
name = "coder-pvc-${data.coder_workspace.me.id}"
namespace = "default"
}
spec {
access_modes = ["ReadWriteOnce"]
resources {
requests = {
storage = "10Gi"
}
}
}
}
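Once a workspace built from this template is running, a quick check from its terminal can show whether the FUSE device requested in the pod's resource limits was attached. The sketch below assumes the device appears at `/dev/fuse`; `check_fuse_device` is a hypothetical helper.

```sh
# Succeed when the given path is a character device (default: /dev/fuse).
check_fuse_device() {
  dev="${1:-/dev/fuse}"
  if [ -c "$dev" ]; then
    echo "ok: $dev is available"
  else
    echo "missing: $dev not found" >&2
    return 1
  fi
}

# Usage inside the workspace:
#   check_fuse_device && podman ps
```

If the device is missing, revisit the device plugin and node label steps in the Docker-in-Docker documentation before debugging podman itself.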