Migrating OpenClaw from a Single-Host Docker Deployment to a Kubernetes Cluster: A Complete Guide

Category: AI | Source: Internet | Author: Anonymous | Published: 2026-06-29 22:29:52

Abstract

As OpenClaw grows from a single machine to a cluster, Kubernetes is the natural choice. Starting from core K8s concepts, this article walks through the K8s deployment architecture for OpenClaw: stateless Deployments, ConfigMap/Secret configuration management, PersistentVolume data persistence, Service discovery, Ingress traffic entry, HPA autoscaling, and Prometheus + Grafana monitoring. Through one complete worked example, it covers the full path from writing YAML to bringing the cluster online. The takeaway: K8s is not "a more complicated Docker"; it turns manual operations into automated ones.

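1. Introduction: The Inevitable Step from Docker to Kubernetes

1.1 Limits of Single-Host Docker

The previous article covered deploying OpenClaw with Docker: a single docker-compose up -d brings it up. Early on, that is entirely sufficient.

As the business grows, however, problems start to surface:

Scenario	Docker limitation	K8s answer
Traffic spike	Manually docker run new instances	HPA autoscaling
Instance crash	Manual restart, or wait for restart: always	Automatic restart + liveness checks + readiness probes
Rolling update	Manually stop the old and start the new, with downtime	RollingUpdate with zero downtime
Multi-machine deployment	Manual work on every machine	Unified scheduling onto any node in the cluster
Config change	Edit the file + restart the container	ConfigMap hot update (for volume-mounted files)
Service discovery	Hard-coded IPs or DNS	Service + CoreDNS automatic discovery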

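1.2 Architecture Evolution

2. Kubernetes Core Concepts at a Glance

2.1 Key Resource Objects

Before writing any YAML, get familiar with the core K8s resources:

Resource	Purpose	Analogy
Pod	Smallest deployable unit, containing one or more containers	A "process group"
Deployment	Manages Pod replica count and update strategy	"Deployment manager"
Service	Gives Pods a stable network endpoint and load balancing	"Internal load balancer"
ConfigMap	Stores non-sensitive configuration	"Config file"
Secret	Stores sensitive data (keys, tokens)	"Protected config file" (base64-encoded by default, not encrypted)
PersistentVolume	Persistent storage	"External hard drive"
Ingress	External traffic entry point with HTTP/HTTPS routing	"Nginx reverse proxy"
HPA	Automatically adjusts the number of Pods based on CPU/memory	"Autoscaling controller"

2.2 Resource Relationship Diagram

3. Complete OpenClaw K8s Deployment Configuration

3.1 Namespace and ConfigMap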

# 00-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openclaw
  labels:
    app: openclaw
    environment: production

# 01-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: openclaw-config  # assumed name; the source does not give one
  namespace: openclaw
data:
  # Key nesting below is assumed; adjust it to OpenClaw's actual config schema
  openclaw.yaml: |
    gateway:
      port: 18789
      auth_token: ${GATEWAY_AUTH_TOKEN}
    model:
      default: ${DEFAULT_MODEL}
      providers:
        openai:
          api_key: ${OPENAI_API_KEY}
    logging:
      level: info
      file: /data/logs/gateway.log
      format: json
    channels:
      feishu:
        enabled: true
        app_id: ${FEISHU_APP_ID}
    performance:
      max_concurrent_requests: 50
      request_timeout_ms: 120000
    security:
      rate_limit:
        enabled: true
        max_requests_per_minute: 100

# 02-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: openclaw-secrets  # assumed name; the source does not give one
  namespace: openclaw
type: Opaque
stringData:
  GATEWAY_AUTH_TOKEN: "your-secure-token-here"
  OPENAI_API_KEY: "sk-prod-xxxxxxxxxxxxx"
  FEISHU_APP_ID: "cli_xxxxxxxxxxxxx"
  FEISHU_APP_SECRET: "xxxxxxxxxxxxxxxxxxxx"
  TELEGRAM_BOT_TOKEN: "123456789:ABCdefGHIjklMNOpqrsTUVwxyz"

3.2 Deployment and Service

# 03-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openclaw-gateway
  namespace: openclaw
  labels:
    app: openclaw
    component: gateway
spec:
  # Replica count (adjusted automatically once the HPA is active)
  replicas: 3
  # Rolling update strategy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most 1 extra Pod during an update
      maxUnavailable: 0  # no Pod may be unavailable during an update
  # Pod selector
  selector:
    matchLabels:
      app: openclaw
      component: gateway
  # Pod template
  template:
    metadata:
      labels:
        app: openclaw
        component: gateway
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "18789"
        prometheus.io/path: "/metrics"
    spec:
      # Graceful termination period
      terminationGracePeriodSeconds: 60
      # Container definition
      containers:
        - name: gateway
          image: openclaw:latest
          imagePullPolicy: Always
          ports:
            - name: http
              containerPort: 18789
              protocol: TCP
          # Environment variables (injected from the Secret)
          envFrom:
            - secretRef:
                name: openclaw-secrets  # assumed name; must match the Secret's metadata.name
          env:
            - name: DEFAULT_MODEL
              value: "gpt-4o-mini"
            - name: NODE_ENV
              value: "production"
            - name: TZ
              value: "Asia/Shanghai"
          # Config file and data mounts
          volumeMounts:
            - name: config
              mountPath: /etc/openclaw
              readOnly: true
            - name: data
              mountPath: /data/openclaw
            - name: logs
              mountPath: /data/logs
          # Resource requests and limits
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
          # Liveness probe: is the container still alive?
          livenessProbe:
            httpGet:
              path: /health
              port: 18789
            initialDelaySeconds: 30
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3
          # Readiness probe: is the container ready to receive traffic?
          readinessProbe:
            httpGet:
              path: /health
              port: 18789
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 2
          # Startup probe: gives slow-starting containers more time
          startupProbe:
            httpGet:
              path: /health
              port: 18789
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 12  # waits up to about 2 minutes (12 x 10 s)
      # Volume definitions
      volumes:
        - name: config
          configMap:
            name: openclaw-config  # assumed name; must match the ConfigMap's metadata.name
        - name: data
          persistentVolumeClaim:
            claimName: openclaw-data-pvc
        - name: logs
          emptyDir: {}
      # Pod anti-affinity: prefer spreading replicas across nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: openclaw
                    component: gateway
                topologyKey: kubernetes.io/hostname

# 04-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: openclaw-gateway
  namespace: openclaw
  labels:
    app: openclaw
    component: gateway
spec:
  type: ClusterIP
  selector:
    app: openclaw
    component: gateway
  ports:
    - name: http
      port: 80
      targetPort: 18789
      protocol: TCP
  # Session affinity: route a given client's requests to the same Pod where possible
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600

3.3 PersistentVolume: Persistent Storage

# 05-persistentvolume.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openclaw-data-pvc
  namespace: openclaw
spec:
  accessModes:
    - ReadWriteMany  # multiple Pods read and write concurrently
  resources:
    requests:
      storage: 20Gi
  storageClassName: nfs-client  # adjust to your storage type

3.4 Ingress: Traffic Entry Point

# 06-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: openclaw-ingress  # assumed name; the source does not give one
  namespace: openclaw
  annotations:
    # Nginx Ingress settings
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
    # ingress-nginx handles WebSocket upgrades by default, so this annotation may be unnecessary
    nginx.ingress.kubernetes.io/websocket-services: "openclaw-gateway"
    # SSL redirect
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    # Rate limiting
    nginx.ingress.kubernetes.io/limit-rps: "100"
    # Certificate management
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - openclaw.your-domain.com
      secretName: openclaw-tls
  rules:
    - host: openclaw.your-domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: openclaw-gateway
                port:
                  number: 80
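With the Ingress as the intended external entry point, a NetworkPolicy can keep other workloads from reaching the gateway Pods directly. The manifest below is a minimal sketch, not part of the original configuration: the policy name and the ingress-nginx and monitoring namespace names are assumptions, and it only takes effect on a cluster whose network plugin enforces NetworkPolicy.

```yaml
# 09-networkpolicy.yaml (hypothetical file name)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: openclaw-gateway-ingress  # hypothetical name
  namespace: openclaw
spec:
  # Applies to the gateway Pods created by the Deployment
  podSelector:
    matchLabels:
      app: openclaw
      component: gateway
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Assumed namespace of the Nginx Ingress controller
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
        # Assumed namespace of Prometheus, so /metrics scraping keeps working
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 18789  # the container port, not the Service port 80
```

The port listed is the Pod's containerPort because NetworkPolicy rules are evaluated against Pod traffic after the Service has translated port 80 to 18789.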

3.5 HPA: Autoscaling

# 07-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: openclaw-gateway-hpa
  namespace: openclaw
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: openclaw-gateway
  # Minimum and maximum replica counts
  minReplicas: 2
  maxReplicas: 10
  # Scaling metrics
  metrics:
    # CPU utilization
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Memory utilization
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom metric: request rate
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second  # assumed metric name; needs a custom metrics adapter
        target:
          type: AverageValue
          averageValue: "50"
  # Scaling behavior
  behavior:
    # Scale-up policy: respond quickly
    scaleUp:
      stabilizationWindowSeconds: 60  # 60-second stabilization window
      policies:
        - type: Percent
          value: 100  # at most double per period
          periodSeconds: 60
        - type: Pods
          value: 4    # at most 4 more Pods per period
          periodSeconds: 60
      selectPolicy: Max
    # Scale-down policy: reclaim slowly
    scaleDown:
      stabilizationWindowSeconds: 300  # 5-minute stabilization window
      policies:
        - type: Percent
          value: 10   # remove at most 10% per period
          periodSeconds: 120
      selectPolicy: Min
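The targets in this manifest feed the scaling rule documented for the HorizontalPodAutoscaler. The worked example below is a sketch of that arithmetic; the replica count and the observed utilization are illustrative numbers, not measurements.

```latex
% Scaling rule from the Kubernetes HPA documentation
\[
\text{desiredReplicas} =
  \left\lceil \text{currentReplicas} \times
  \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]

% Illustrative case: 3 replicas averaging 105% CPU against the 70% target
\[
\left\lceil 3 \times \frac{105}{70} \right\rceil = \lceil 4.5 \rceil = 5
\]
```

Utilization is measured against each container's CPU request (500m here), so 105% corresponds to roughly 525m per Pod. When several metrics are configured, the HPA evaluates each one and uses the largest resulting replica count. There is no separate scale-down threshold: replicas are removed when the same ratio falls far enough below 1 (outside the default 10% tolerance), subject to the scaleDown policy.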

4. Monitoring Integration: Prometheus + Grafana

4.1 ServiceMonitor Configuration

# 08-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: openclaw-gateway  # assumed name; the source does not give one
  namespace: openclaw
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: openclaw
      component: gateway
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s

4.2 Key Metrics for the Grafana Dashboard

Metric	PromQL	Alert threshold
Pod CPU usage	rate(container_cpu_usage_seconds_total{pod=~"openclaw-gateway-.*"}[5m])	> 80%
Pod memory usage	container_memory_usage_bytes{pod=~"openclaw-gateway-.*"} / container_spec_memory_limit_bytes	> 85%
HTTP request rate	rate(http_requests_total[5m])	> 100/s
Request latency P99	histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))	> 2s
Error rate	rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])	> 1%
Pod restart count	rate(kube_pod_container_status_restarts_total{pod=~"openclaw-gateway-.*"}[1h])	> 0
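The thresholds in this table can be turned into alert rules for the Prometheus Operator. The following is a hedged sketch rather than the article's own rule set: the resource name, alert names, `for` durations, and severity labels are assumptions, and the expressions add a `sum` aggregation so that each alert fires once for the whole gateway instead of once per series.

```yaml
# 09-prometheusrule.yaml (hypothetical file name)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openclaw-gateway-alerts  # hypothetical name
  namespace: openclaw
  labels:
    release: prometheus          # same label the ServiceMonitor uses for discovery
spec:
  groups:
    - name: openclaw-gateway
      rules:
        # Error rate above 1% (table row "Error rate")
        - alert: OpenClawHighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.01
          for: 5m                # assumed duration
          labels:
            severity: critical   # assumed severity
          annotations:
            summary: "OpenClaw gateway 5xx error rate is above 1%"
        # P99 latency above 2 s (table row "Request latency P99")
        - alert: OpenClawHighP99Latency
          expr: |
            histogram_quantile(0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
          for: 5m                # assumed duration
          labels:
            severity: warning    # assumed severity
          annotations:
            summary: "OpenClaw gateway P99 latency is above 2s"
        # Any restart within the last hour (table row "Pod restart count")
        - alert: OpenClawPodRestarting
          expr: |
            increase(kube_pod_container_status_restarts_total{pod=~"openclaw-gateway-.*"}[1h]) > 0
          labels:
            severity: warning    # assumed severity
          annotations:
            summary: "An OpenClaw gateway Pod restarted in the last hour"
```

The restart rule uses `increase` instead of `rate` so the value reads as a restart count over the hour; the condition it detects is the same as in the table.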

Screenshot placeholder: Grafana dashboard showing the five core panels for the OpenClaw Gateway: CPU, memory, request rate, P99 latency, and error rate.

5. Hands-On: One-Step Deployment to a K8s Cluster

5.1 Deployment Script

#!/bin/bash
# deploy-k8s.sh
# One-step OpenClaw deployment script for K8s
set -e

NAMESPACE="openclaw"

echo "Starting OpenClaw deployment to Kubernetes..."

# 1. Check the kubectl connection
if ! kubectl cluster-info &> /dev/null; then
  echo "ERROR: cannot connect to the Kubernetes cluster"
  echo "Please check your kubeconfig"
  exit 1
fi
echo "Cluster connection OK: $(kubectl cluster-info | head -1)"

# 2. Create the Namespace
kubectl apply -f 00-namespace.yaml

# 3. Deploy the ConfigMap and Secret
kubectl apply -f 01-configmap.yaml
kubectl apply -f 02-secret.yaml

# 4. Create persistent storage
kubectl apply -f 05-persistentvolume.yaml

# 5. Deploy the application
kubectl apply -f 03-deployment.yaml
kubectl apply -f 04-service.yaml

# 6. Wait for the Deployment to become available
echo "Waiting for Pods to become ready..."
kubectl wait --for=condition=available \
  --timeout=300s \
  deployment/openclaw-gateway \
  -n "$NAMESPACE"

# 7. Deploy the Ingress
kubectl apply -f 06-ingress.yaml

# 8. Deploy the HPA
kubectl apply -f 07-hpa.yaml

# 9. Deploy monitoring
kubectl apply -f 08-servicemonitor.yaml

# 10. Show status
echo ""
echo "Deployment status:"
kubectl get all -n "$NAMESPACE"
echo ""
echo "HPA status:"
kubectl get hpa -n "$NAMESPACE"
echo ""
echo "Deployment complete!"
echo "  URL:         https://openclaw.your-domain.com"
echo "  Tail logs:   kubectl logs -f deployment/openclaw-gateway -n $NAMESPACE"
echo "  Watch Pods:  kubectl get pods -n $NAMESPACE -w"

5.2 Common Operations Commands

# Show Pod status
kubectl get pods -n openclaw -o wide

# Follow Pod logs in real time
kubectl logs -f deployment/openclaw-gateway -n openclaw

# Follow logs from all Pods (requires the stern tool)
stern -n openclaw openclaw-gateway

# Open a shell in a Pod for debugging
kubectl exec -it deployment/openclaw-gateway -n openclaw -- sh

# Scale manually
kubectl scale deployment openclaw-gateway -n openclaw --replicas=5

# Rolling restart
kubectl rollout restart deployment/openclaw-gateway -n openclaw

# Check rolling update status
kubectl rollout status deployment/openclaw-gateway -n openclaw

# Roll back to the previous revision
kubectl rollout undo deployment/openclaw-gateway -n openclaw

# Inspect the HPA
kubectl describe hpa openclaw-gateway-hpa -n openclaw

# Show resource usage
kubectl top pods -n openclaw
kubectl top nodes

Screenshot placeholder: output of kubectl get all -n openclaw, showing the running state of all K8s resources.

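6. Production Best Practices

6.1 Security Hardening Checklist

Check	Configuration	Notes
Run as non-root	securityContext.runAsNonRoot: true	The container does not run as root
Read-only filesystem	securityContext.readOnlyRootFilesystem: true	Prevents writes to the container's root filesystem
Resource limits	resources.limits	Prevents a single Pod from exhausting node resources
Network policy	NetworkPolicy	Restricts Pod-to-Pod communication
Image scanning	Trivy / Clair	Scans images for vulnerabilities
Secret encryption	Sealed Secrets / Vault	Stores sensitive data encrypted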
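The first two rows of the checklist map onto securityContext fields in the Deployment's Pod template. The patch below is a minimal sketch of how they might be applied to the existing openclaw-gateway Deployment; the file name and the UID are assumptions, and the image has to be able to run as that non-root user.

```yaml
# security-patch.yaml (hypothetical file name)
spec:
  template:
    spec:
      # Pod-level setting: refuse to start containers that would run as root
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000  # assumed UID; use one that exists in the image
      containers:
        - name: gateway  # must match the container name in the Deployment
          # Container-level setting: mount the root filesystem read-only
          securityContext:
            readOnlyRootFilesystem: true
```

It could be applied with kubectl patch deployment openclaw-gateway -n openclaw --patch-file security-patch.yaml. With a read-only root filesystem the gateway can still write to its mounted volumes (/data/openclaw and /data/logs); any other path it writes to, such as a temp directory, would need its own emptyDir mount.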

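6.2 High-Availability Architecture Summary

7. Summary

This article built a production-grade Kubernetes deployment for OpenClaw from scratch:

Key points:

Deployment manages stateless Pods: start with 3 replicas, use RollingUpdate for zero-downtime updates, and use Pod anti-affinity to spread replicas across nodes
ConfigMap and Secret separate configuration: non-sensitive settings go in the ConfigMap, keys go in the Secret and are injected as environment variables
Three probes safeguard availability: startupProbe (tolerates slow starts) → readinessProbe (ready for traffic) → livenessProbe (still alive)
HPA autoscaling: targets 70% average CPU utilization (alongside 80% memory and a per-Pod request rate), scaling between 2 and 10 replicas; scale-up is fast, scale-down is slow
Prometheus + Grafana full-stack monitoring: ServiceMonitor auto-discovery, five core metric panels, linked alert rules
Ingress as the single traffic entry point: TLS termination, WebSocket support, rate limiting, automatic certificate management

Questions to think about:

Your OpenClaw runs in a K8s cluster, but the model API (OpenAI) lives outside it. If the model API slows down (P99 > 10s), will the HPA misread that as "needs more replicas"? How would you avoid this cascade, where a slow downstream triggers scale-out upstream?
After a ConfigMap update, environment variables and subPath mounts in running Pods do not change, and volume-mounted files are refreshed only after a delay, so the application does not pick up the new values by itself. How would you design configuration hot reload: restart the Pods, or run a sidecar container that watches for ConfigMap changes?
If a node in the cluster goes down, its Pods are rescheduled onto other nodes. But if the PersistentVolume is node-local storage (hostPath), the data stays behind on the failed node and is lost to the rescheduled Pods. Which storage option would you choose: NFS, Ceph, or a cloud provider's managed storage?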
