
K8S FAQ

headless service

  • A Service that does not need a clusterIP
  • Not used for routing; used for DNS lookups - returns the endpoint addresses directly (see the lookup example below)
spec:
  clusterIP: None
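A quick way to verify the behavior (a minimal sketch; my-headless and the default namespace are placeholder names): querying cluster DNS for a headless service returns the Pod IPs of its endpoints instead of a single virtual IP.

# run from a Pod inside the cluster; service/namespace names are examples
nslookup my-headless.default.svc.cluster.local
# or
dig +short my-headless.default.svc.cluster.local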

Calling the API with curl

curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt --header "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" -X GET https://kubernetes.default.svc/api
# Point to the internal API server hostname
APISERVER=https://kubernetes.default.svc

# Path to ServiceAccount token
SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount

# Read this Pod's namespace
NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)

# Read the ServiceAccount bearer token
TOKEN=$(cat ${SERVICEACCOUNT}/token)

# Reference the internal certificate authority (CA)
CACERT=${SERVICEACCOUNT}/ca.crt

# Explore the API with TOKEN
curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET ${APISERVER}/api

Injecting environment variables

env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
# command and args expand environment variables via $(VAR) references
command: ['/bin/echo']
args: ['$(POD_IP)']

limits

  • 110 pods/node (see the check below)
  • 1 MB/object - etcd limit - #19781
    • Large CRD definitions may hit this limit - e.g. when embedding external definitions - #82292
    • Split a ConfigMap if it grows too large
  • All annotation key/value pairs together must not exceed 256 KB - TotalAnnotationSizeLimitB
  • RFC 1123 - a DNS label name is at most 63 characters - this validation is used in many places in K8S
    • All objects exposed via DNS
      • pod, service, namespace
    • label - #1830
  • gRPC messages are usually limited to 4 MB
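To check the pod limit a node actually reports (a small check; replace the node name):

# the kubelet default is 110 pods per node
kubectl get node <node-name> -o jsonpath='{.status.allocatable.pods}'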

QoS

status:
  qosClass: Guaranteed
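A Pod is classified as Guaranteed only when every container sets requests equal to limits for both cpu and memory; a minimal sketch (names and values are arbitrary):

spec:
  containers:
    - name: app
      image: busybox
      resources:
        requests:
          cpu: 500m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 128Mi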

operator vs controller

  • An operator contains one or more controllers
  • A controller handles the processing of a CRD

cni-plugins no longer bundles flannel

Resource limits

K3S FAQ

  • #396 - Initializing eviction metric for zone

Pod time zone

  1. TZ environment variable
    • Requires language/runtime support
    • Requires tzdata
  2. Mount the tz info from the host
    • Can stay consistent with the host
    • No need to install tzdata in the container
    • Requires tzdata on the host
    • Requires adjusting the Pod spec
  3. Control it in code
    • Bundle tzdata and load it at runtime
    • golang: import _ "time/tzdata"
      • Adds roughly 450 KB
apiVersion: v1
kind: Pod
metadata:
  name: busybox-sleep
spec:
  containers:
    - name: busybox
      image: busybox
      args:
        - sleep
        - '1000000'
      env:
        - name: TZ
          value: Asia/Shanghai
      volumeMounts:
        - name: tz-config
          mountPath: /etc/localtime
  volumes:
    - name: tz-config
      hostPath:
        path: /usr/share/zoneinfo/Asia/Shanghai
        type: File

taint

  • node.kubernetes.io/disk-pressure:NoSchedule
  • Set priorityClassName on the pod to ensure scheduling and avoid eviction (see the snippet after the commands below)
    • system-node-critical, system-cluster-critical
df -h
df -Hi

# remove the DiskPressure taint
kubectl taint nodes <nodename> node.kubernetes.io/disk-pressure-
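A minimal sketch of the priorityClassName setting mentioned above (pod and container names are placeholders; depending on cluster policy the system-* classes may be restricted, in which case define your own PriorityClass):

apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: system-node-critical
  containers:
    - name: app
      image: busybox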

Cluster limits

Recommended master sizing

Nodes     GCE Type         Spec
1-5       n1-standard-1    1 core, 4 GB
6-10      n1-standard-2    2 cores, 8 GB
11-100    n1-standard-4    4 cores, 16 GB
101-250   n1-standard-8    8 cores, 32 GB
251-500   n1-standard-16   16 cores, 64 GB
500+      n1-standard-32   32 cores, 128 GB

Operating system / Linux kernel requirements

./check-config.sh kernel.config

invalid capacity 0 on image filesystem

  • Make sure the kernel parameters cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory are present
    • syslinux
      • /etc/update-extlinux.conf
      • update-extlinux && reboot
    • grub
      • /etc/default/grub
cat /proc/cmdline | grep cgroup_enable

Exporting resources without status fields

  • Common fields to strip
    • status
    • metadata.managedFields
    • metadata
      • annotations
        • kubectl.kubernetes.io/last-applied-configuration
        • deployment.kubernetes.io/revision
      • resourceVersion
      • selfLink
      • uid
      • creationTimestamp
      • generation
kubectl get -o=yaml deploy whoami | yq d - status | yq d - 'metadata.managedFields'
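The command above uses yq v3 syntax (yq d). Assuming yq v4 (mikefarah), an equivalent would be:

kubectl get deploy whoami -o yaml | yq 'del(.status) | del(.metadata.managedFields)'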

Deleting resources stuck in Terminating

  • For a Pod, try force deletion
  • If a resource still cannot be deleted, try removing its finalizers
# force delete a Pod
kubectl delete pod <pod> --force --grace-period=0
# try removing the Pod's finalizers
kubectl patch pod <pod> -p '{"metadata":{"finalizers":null}}'

# list all Terminating namespaces
kubectl get ns --field-selector status.phase=Terminating

# names of all such namespaces
kubectl get ns --field-selector status.phase=Terminating -o jsonpath='{..metadata.name}'

# in bulk
# kubectl patch ns -p '{"metadata":{"finalizers": null}}' $NS
# kubectl patch ns -p '{"metadata":{"finalizers":[]}}' --type=merge $NS
kubectl patch ns -p '{"metadata":{"finalizers":[]}}' --type=merge $(kubectl get ns --field-selector status.phase=Terminating -o jsonpath='{..metadata.name}')

Deleting rancher project namespaces

  • The main difficulty is that get all does not return every resource
  • Some resources must be patched before they can be deleted
for ns in local p-66lfd; do
for error in app.project.cattle.io/cluster-alerting app.project.cattle.io/cluster-monitoring app.project.cattle.io/monitoring-operator app.project.cattle.io/project-monitoring clusteralertgroup.management.cattle.io/cluster-scan-alert clusteralertgroup.management.cattle.io/etcd-alert clusteralertgroup.management.cattle.io/event-alert clusteralertgroup.management.cattle.io/kube-components-alert clusteralertgroup.management.cattle.io/node-alert clusterroletemplatebinding.management.cattle.io/creator-cluster-owner clusterroletemplatebinding.management.cattle.io/u-b4qkhsnliz-admin node.management.cattle.io/machine-9sssc node.management.cattle.io/machine-ks6z6 node.management.cattle.io/machine-v4v89 project.management.cattle.io/p-cnj28 project.management.cattle.io/p-mbvfd projectalertgroup.management.cattle.io/projectalert-workload-alert projectalertrule.management.cattle.io/less-than-half-workload-available projectalertrule.management.cattle.io/memory-close-to-resource-limited projectroletemplatebinding.management.cattle.io/app-jdnmz projectroletemplatebinding.management.cattle.io/creator-project-owner projectroletemplatebinding.management.cattle.io/prtb-s6fhc projectroletemplatebinding.management.cattle.io/u-2gacgc4nfu-member projectroletemplatebinding.management.cattle.io/u-efxo6n6ndd-member; do
for resource in $(kubectl get -n $ns $error -o name); do
kubectl patch -n $ns $resource -p '{"metadata": {"finalizers": []}}' --type='merge'
done
done
done

# cluster-scoped resources
for res in $(kubectl api-resources --namespaced=false --api-group management.cattle.io | cut -d ' ' -f 1); do
echo "=== $res.management.cattle.io ==="
kubectl get $res.management.cattle.io
done

# namespaced
groups="management.cattle.io catalog.cattle.io project.cattle.io"
for grp in $groups; do
for res in $(kubectl api-resources --namespaced=true --api-group $grp -o name); do
echo "=== $res ==="
kubectl get --all-namespaces $res
done
done

# clear resources
cleargroup() {
kubectl patch $1 -p '{"metadata":{"finalizers":[]}}' --type=merge $(kubectl get $1 -o jsonpath='{..metadata.name}')
kubectl delete --all $1
}

cleargroup globalroles.management.cattle.io

Unable to connect to the server: x509: certificate is valid for 10.10.1.2, 10.43.0.1, 127.0.0.1, 192.168.1.10, not 123.123.123.123

The issued certificate does not include that IP. Extra SANs can be added at initialization; it is also advisable to create a separate credential (sub-account) for management.

# The existing certificates are located at
# /etc/kubernetes/pki/apiserver.*
kubeadm init --apiserver-cert-extra-sans=123.123.123.123

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "cron-1594372200-lmkcb": operation timeout: context deadline exceeded

Test whether pod creation works normally

apiVersion: batch/v1
kind: Job
metadata:
  name: alpine-pwd
spec:
  template:
    spec:
      containers:
        - name: alpine
          image: wener/base
          command: ['pwd']
      restartPolicy: Never
  backoffLimit: 4

In this case one master node was misbehaving; a reboot fixed it.

error: specifying a root certificates file with the insecure flag is not allowed

insecure-skip-tls-verify cannot be combined with a root certificate; create a new user/cluster entry instead.
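A sketch of setting up a separate cluster entry and user without a CA (server address, names and token are placeholders):

# cluster entry that skips TLS verification - no certificate-authority configured
kubectl config set-cluster my-insecure --server=https://123.123.123.123:6443 --insecure-skip-tls-verify=true
kubectl config set-credentials my-user --token=<token>
kubectl config set-context my-insecure --cluster=my-insecure --user=my-user
kubectl config use-context my-insecure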

node_lifecycle_controller.go:1433 Initializing eviction metric for zone

running "VolumeBinding" filter plugin for pod "web-0": pod has unbound immediate PersistentVolumeClaims

error: unable to retrieve the complete list of server APIs: write: broken pipe

  • An unstable network can also cause this
    • For example, it happens occasionally when using sshuttle
    • Consider plain ssh port forwarding instead - ssh -vNL 6443:10.10.1.1:6443 [email protected] -o ExitOnForwardFailure=yes
  • an error on the server ("") has prevented the request from succeeding
# fails intermittently
kubectl api-resources

# make sure the metrics apiservices are healthy
kubectl get apiservices
# make sure the system services are healthy
kubectl get pods -n kube-system

DNS not working

  • Symptom - every service returns 504 gateway timeout
  • Troubleshooting
    • Verify that kube-dns on port 53 can resolve names
    • Are all nodes affected or just one?
      • All nodes - likely a problem with the DNS service itself
      • A single node - likely an environment problem on that node
    • Ping the backend endpoints (see below for how to locate them)
    • If ping fails, the CNI plugin (e.g. flannel) or the underlying interface it uses is probably broken
  • Fixes
    • Try restarting k3s
    • Try restarting the network
    • Try rebooting the system
# verify that DNS can resolve
# k3s uses 10.43.0.10 by default
# kube-dns.kube-system.svc
nslookup wener.me 10.43.0.10
# or
dig @10.43.0.10 wener.me
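To locate the backend endpoints mentioned in the checklist above:

# list the DNS endpoints, then ping one of them from the affected node
kubectl -n kube-system get endpoints kube-dns -o wide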

MountVolume.SetUp failed for volume "config-volume" : failed to sync secret cache: timed out waiting for the condition

  • A condition was not satisfied, execution could not continue, and the wait timed out
  • Fixes
    • Wait, or delete the Pod

Inspecting the conditions

status:
  phase: Pending
  conditions:
    - type: Initialized
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2020-09-04T10:09:51Z'
    - type: Ready
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2020-09-04T10:09:51Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [alertmanager config-reloader]'
    # this step did not succeed
    - type: ContainersReady
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2020-09-04T10:09:51Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [alertmanager config-reloader]'
    - type: PodScheduled
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2020-09-04T10:09:51Z'

kubernetes swap

Using swap is not recommended.
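By default the kubelet refuses to start while swap is enabled; if you still want to keep swap on the node, there is a kubelet flag for it (a sketch, not a recommendation):

# allow the kubelet to start even though swap is on
kubelet --fail-swap-on=false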

didn't have free ports for the requested pod ports

If the Deployment uses hostPort and only one node can run it, a RollingUpdate will fail with this error because the new pod cannot bind the port while the old one still holds it; use the Recreate strategy instead (see below).
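A minimal sketch of switching the strategy (Deployment spec fields only):

spec:
  strategy:
    type: Recreate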

scale to zero

Use keda or knative

  • keda is simpler - it only handles autoscaling
  • knative is more involved - a full serving/development stack

The StorageClass is invalid: provisioner: Forbidden: updates to provisioner are forbidden.

The provisioner name does not match the existing StorageClass; either use a different StorageClass name or delete and recreate it.

Reserving resources

--kube-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]
--system-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]
--reserved-cpus=0-3
# threshold at which eviction starts
--eviction-hard=[memory.available<500Mi]
--enforce-node-allocatable=pods[,][system-reserved][,][kube-reserved]

# point to existing cgroups
--kube-reserved-cgroup=
--system-reserved-cgroup=

node(s) had volume node affinity conflict

  • If it is a PV that cannot be migrated and the impact is small, consider deleting and recreating it

The PVC and the Pod may be in conflicting zones; you can create a StorageClass per zone

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: region1storageclass
provisioner: kubernetes.io/aws-ebs
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        values:
          - eu-west-2b

How to keep workloads evenly spread

  1. Configure anti-affinity
  2. Kill pods to trigger rescheduling
  3. Try a rollout restart to reschedule (see below)
  4. Use a tool such as kubernetes-sigs/descheduler to keep things balanced automatically
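For option 3, a rollout restart recreates the pods and lets the scheduler place them again (the deployment name is a placeholder):

kubectl rollout restart deployment <name>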

What is the pause process

1 node(s) had volume node affinity conflict

  • Check the StorageClass affinity/topology settings
  • Check the PersistentVolume node affinity
    • The Pod must land on a node that matches the PV's node affinity

failed to run Kubelet: mountpoint for cpu not found

Most likely cgroups are not mounted

service cgroups start

orphaned pod found, but error not a directory occurred when trying to remove the volumes dir

pods=$(tail -n 1000 /var/log/k0s.log | grep 'orphaned pod' | sed -r 's/.*pod[^"]+"([^\\"]+).*/\1/' | uniq)
for i in $pods; do
echo "Deleting Orphaned pod id: $i"
rm -rf /var/lib/kubelet/pods/$i
done

1 node(s) didn't have free ports for the requested pod ports.

  • Check for hostPort usage

Resources

  • cpu
    • 1.0 -> 1 vCPU -> 1000m
  • memory (see the example after this list)
    • base-1000 suffixes - E, T, P, G, M, k
    • base-1024 (2^n) suffixes - Ei, Pi, Ti, Gi, Mi, Ki
  • hugepages-<size>
  • ephemeral-storage
  • storage
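A container resources block using these units (values are arbitrary examples):

resources:
  requests:
    cpu: 250m # 0.25 vCPU
    memory: 64Mi # 64 * 2^20 bytes
    ephemeral-storage: 1Gi
  limits:
    cpu: '1' # 1 vCPU
    memory: 128Mi
    ephemeral-storage: 2Gi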

failed to reserve container name "": name "" is reserved for

# see which containerd container is holding the reserved name
ctr -n k8s.io c ls

unable to find data in memory cache

parent snapshot does not exist: not found

invalid bearer token, Token has been invalidated

Node Shell

You can get a shell on a node through a Pod

apiVersion: v1
kind: Pod
metadata:
  name: node-shell
  namespace: kube-system
spec:
  containers:
    - name: shell
      image: docker.io/alpine:3.13
      command:
        - nsenter
      args:
        - '-t'
        - '1'
        - '-m'
        - '-u'
        - '-i'
        - '-n'
        - sleep
        - '14000'
      resources: {}
      volumeMounts:
        - name: kube-api-access
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent
      securityContext:
        privileged: true
  volumes:
    - name: kube-api-access
      projected:
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              name: kube-root-ca.crt
              items:
                - key: ca.crt
                  path: ca.crt
          - downwardAPI:
              items:
                - path: namespace
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
        defaultMode: 420
  restartPolicy: Never
  terminationGracePeriodSeconds: 0
  dnsPolicy: ClusterFirst
  serviceAccountName: default
  serviceAccount: default
  # the node you want to enter
  nodeName: master-1
  hostNetwork: true
  hostPID: true
  hostIPC: true
  securityContext: {}
  schedulerName: default-scheduler
  tolerations:
    - operator: Exists
  priorityClassName: system-node-critical
  priority: 2000001000
  enableServiceLinks: true
  preemptionPolicy: PreemptLowerPriority

Certificate rotation

eStargz

  • k3s/k0s support stargz through containerd's snapshotter capability
k3s server --snapshotter=stargz
/etc/containerd/config.toml
# Plug stargz snapshotter into containerd
# Containerd recognizes stargz snapshotter through specified socket address.
# The specified address below is the default which stargz snapshotter listen to.
[proxy_plugins]
  [proxy_plugins.stargz]
    type = "snapshot"
    address = "/run/containerd-stargz-grpc/containerd-stargz-grpc.sock"

# Use stargz snapshotter through CRI
[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "stargz"

could not load server certificate file : Permission denied

Usage of EmptyDir volume exceeds the limit 1m

  • The emptyDir is mounted as tmpfs (see the sketch below)
    • medium: Memory
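A sketch of an emptyDir with a memory medium and an explicit size limit (the 1m in the error message likely comes from sizeLimit):

volumes:
  - name: cache
    emptyDir:
      medium: Memory
      sizeLimit: 1Mi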

Common fieldRef usage

  • fieldRef
  • resourceFieldRef
env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  - name: POD_SERVICE_ACCOUNT
    valueFrom:
      fieldRef:
        fieldPath: spec.serviceAccountName

Image registry addresses

  • k8s.gcr.io - old
  • registry.k8s.io - new

failed to execute portforward in network namespace

High availability and load balancing

  • HA
    • Load balance the Kubernetes API server from outside the cluster
    • Implemented with an external HAProxy/Nginx/VIP/Tinc
    • The end goal is a single, stable address for reaching the API server
  • LB
    • LoadBalancer services inside the cluster
    • Allocates a ClusterIP
    • Without a cloud provider, components like MetalLB plus an overlay/LAN can provide the LoadBalancer capability
      • e.g. flannel + host + tinc switch
    • With a cloud provider, use the cloud LB component - it can expose a public IP directly