Get the NodePort of the Prometheus service

[root@k8s-master01 ~]# kubectl get svc -n monitoring | grep prometheus-k8s
prometheus-k8s          NodePort    10.0.110.215   <none>        9090:30186/TCP,8080:31400/TCP   2d2h
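
The NodePort can also be pulled out of the PORT(S) column without reading it by eye. The following is a minimal sketch, not part of the original walkthrough: the helper name nodeport_for is hypothetical, and it assumes the column format shown above (port:nodePort/protocol entries separated by commas).

```shell
# nodeport_for PORT
# Reads a PORT(S) column value on stdin and prints the NodePort mapped to PORT.
# Entries without a NodePort (for example "10257/TCP") are ignored.
nodeport_for() {
  tr ',' '\n' | awk -F'[:/]' -v p="$1" 'NF == 3 && $1 == p { print $2 }'
}

# Example with the value from the output above:
echo '9090:30186/TCP,8080:31400/TCP' | nodeport_for 9090   # prints 30186
```

Against a live cluster, the input could come from: kubectl get svc -n monitoring prometheus-k8s --no-headers | awk '{print $5}' | nodeport_for 9090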

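Open a browser, go to <node IP>:30186, and click [Alerts]. The KubeControllerManagerDown alert is firing. This does not necessarily mean the service is down; Prometheus may simply not be monitoring it.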

[Screenshot: troubleshooting KubeControllerManagerDown, 1]

At this point, work through the troubleshooting steps described earlier, one at a time:

1. Check whether the Service Monitor for kube-controller-manager was created. It exists:

[root@k8s-master01 ~]# kubectl get servicemonitor -n monitoring  kube-controller-manager
kube-controller-manager   2d18h

2. Check whether the labels in the kube-controller-manager Service Monitor are configured correctly. They are:

[root@k8s-master01 ~]# kubectl get servicemonitor -n monitoring kube-controller-manager -oyaml

[Screenshot: troubleshooting KubeControllerManagerDown, 2]

Query Services using the label from the kube-controller-manager Service Monitor. No Service is returned, so Prometheus has no target to scrape:

[root@k8s-master01 ~]# kubectl get svc -n kube-system -l app.kubernetes.io/name=kube-controller-manager
No resources found in kube-system namespace.
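
The same lookup can be wrapped in a small check that fails loudly when the selector matches nothing. This is a sketch under assumptions: the function name check_selector_has_service is hypothetical, and it relies only on kubectl get with -l and -o name.

```shell
# check_selector_has_service NAMESPACE SELECTOR
# Prints the matching Services (as service/<name>) and returns 0,
# or prints a message and returns 1 when no Service carries the label.
check_selector_has_service() {
  chk_ns=$1
  chk_selector=$2
  chk_found=$(kubectl get svc -n "$chk_ns" -l "$chk_selector" -o name 2>/dev/null)
  if [ -z "$chk_found" ]; then
    echo "no Service in $chk_ns matches $chk_selector"
    return 1
  fi
  echo "$chk_found"
}
```

For the case above it would be called as: check_selector_has_service kube-system app.kubernetes.io/name=kube-controller-manager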

Manually create the Service and Endpoints, pointing them at your own Controller Manager instances

Write the YAML file

[root@k8s-master01 ~]# vim controller-manager-svc.yaml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app.kubernetes.io/name: kube-controller-manager
  name: kube-controller-manager-prom
  namespace: kube-system
subsets:
- addresses:
  - ip: 192.168.1.31
  - ip: 192.168.1.32
  - ip: 192.168.1.33
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-controller-manager
  name: kube-controller-manager-prom
  namespace: kube-system
spec:
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
    targetPort: 10257
  sessionAffinity: None
  type: ClusterIP

Note: the port name (ports.name) and the labels must match what the kube-controller-manager Service Monitor is configured with

Create the Service and Endpoints

[root@k8s-master01 ~]# kubectl create -f controller-manager-svc.yaml -n kube-system

Check the Service

[root@k8s-master01 ~]# kubectl get svc -n kube-system kube-controller-manager-prom
NAME                           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)     AGE
kube-controller-manager-prom   ClusterIP   10.0.103.188   <none>        10257/TCP   5h51m
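
Because the Service has no selector, it only works if the manually created Endpoints object carries the expected addresses. A minimal sketch for counting them follows; the helper name endpoint_ip_count is hypothetical, and the jsonpath expression assumes the Endpoints layout from the YAML above.

```shell
# endpoint_ip_count NAMESPACE NAME
# Prints how many addresses the Endpoints object lists.
endpoint_ip_count() {
  ep_ips=$(kubectl get endpoints -n "$1" "$2" -o jsonpath='{.subsets[*].addresses[*].ip}')
  # Word-split the space-separated IP list and count the fields.
  set -- $ep_ips
  echo "$#"
}
```

With the three Master addresses configured above, endpoint_ip_count kube-system kube-controller-manager-prom should report 3.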

3. Click [Status] - [Configuration] and check whether Prometheus has generated the corresponding configuration. It has:

[Screenshot: troubleshooting KubeControllerManagerDown, 3]

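If the error shown above appears (server returned HTTP status 403 Forbidden), edit /usr/lib/systemd/system/kube-controller-manager.service on all three Master nodes. This cluster was installed from binaries, so adjust the path for your own installation. The changes are: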

(1) Change the address from 127.0.0.1 to 0.0.0.0

$ sed -i "s#address=127.0.0.1#address=0.0.0.0#g" /usr/lib/systemd/system/kube-controller-manager.service

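(2) Add two flags that tell kube-controller-manager which kubeconfig file to use at runtime. The kubeconfig holds the credentials for talking to the Kubernetes API server, which kube-controller-manager relies on to authenticate and authorize incoming requests such as Prometheus scrapes.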

      --authentication-kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \
      --authorization-kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \

After these changes, the file looks like this:

$ vim /usr/lib/systemd/system/kube-controller-manager.service
[Unit]
Description=Kubernetes Controller Manager
Documentation=https://github.com/kubernetes/kubernetes
After=network.target

[Service]
ExecStart=/usr/local/bin/kube-controller-manager \
      --v=2 \
      --logtostderr=true \
      --authentication-kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \
      --authorization-kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \
      --address=0.0.0.0 \
      --root-ca-file=/etc/kubernetes/pki/ca.pem \
      --cluster-signing-cert-file=/etc/kubernetes/pki/ca.pem \
      --cluster-signing-key-file=/etc/kubernetes/pki/ca-key.pem \
      --service-account-private-key-file=/etc/kubernetes/pki/sa.key \
      --kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \
      --leader-elect=true \
      --use-service-account-credentials=true \
      --node-monitor-grace-period=40s \
      --node-monitor-period=5s \
      --pod-eviction-timeout=2m0s \
      --controllers=*,bootstrapsigner,tokencleaner \
      --allocate-node-cidrs=true \
      --cluster-cidr=172.16.0.0/12 \
      --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.pem \
      --node-cidr-mask-size=24

Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target
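
Before restarting, it may be worth confirming on each Master node that both flags actually made it into the unit file. The following is a sketch only: the function name check_cm_flags is hypothetical, and it does a plain text search rather than validating the unit.

```shell
# check_cm_flags UNIT_FILE
# Prints "ok" when both delegated-auth flags are present in the file,
# otherwise names the first missing flag and returns 1.
check_cm_flags() {
  cm_unit=$1
  for cm_flag in --authentication-kubeconfig --authorization-kubeconfig; do
    # "--" stops grep from treating the pattern as an option.
    if ! grep -q -- "$cm_flag" "$cm_unit"; then
      echo "missing $cm_flag"
      return 1
    fi
  done
  echo "ok"
}
```

Usage: check_cm_flags /usr/lib/systemd/system/kube-controller-manager.service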

Restart kube-controller-manager on all three Master nodes

[root@k8s-master01 ~]# systemctl daemon-reload && systemctl restart kube-controller-manager

Delete all network policies in the monitoring namespace

[root@k8s-master01 ~]# kubectl delete   networkpolicy  --all  -n monitoring

Check again: the Service Monitor now finds kube-controller-manager

[Screenshot: troubleshooting KubeControllerManagerDown, 4]

[Screenshot: troubleshooting KubeControllerManagerDown, 5]

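4. In this case the problem is resolved at this point. If the alert persists, continue with the remaining checks. Confirm that a Service matching the Service Monitor exists. Here it was found to be missing at the very start, and a new Service was created by hand.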

5. Confirm that the application's metrics endpoint is reachable through the Service

Look up the Service address

[root@k8s-master01 ~]# kubectl get svc -n kube-system kube-controller-manager-prom
NAME                           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)     AGE
kube-controller-manager-prom   ClusterIP   10.0.190.144   <none>        10257/TCP   5h8m
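
With the ClusterIP known, one way to test the metrics endpoint is to send a request that carries a bearer token and look only at the HTTP status. This is a hedged sketch: the helper name scrape_status is hypothetical, and the token source in the usage note (the prometheus-k8s ServiceAccount, obtained with kubectl create token on kubectl v1.24 or later) is an assumption about a kube-prometheus style install, not something the original walkthrough shows.

```shell
# scrape_status URL TOKEN
# Prints only the HTTP status code returned for URL.
# -k skips certificate verification, as in the manual curl test.
scrape_status() {
  curl -sk -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $2" "$1"
}
```

Example call (ServiceAccount name is an assumption): scrape_status https://10.0.190.144:10257/metrics "$(kubectl create token prometheus-k8s -n monitoring)". A 200 suggests the scrape path works end to end, while 401 or 403 points back at authentication or RBAC.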

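Because the endpoint is HTTPS and requires authentication, an anonymous test request is rejected with 403. This has no impact for now, and the response itself shows that the Service is reachable: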

[root@k8s-master01 ~]# curl https://10.0.190.144:10257 -k
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

6. Confirm that the Service's port and scheme match the Service Monitor

Check port and scheme in the servicemonitor: port is https-metrics and scheme is https

[root@k8s-master01 ~]# kubectl get servicemonitor -n monitoring kube-controller-manager -oyaml
...
...
    port: https-metrics
    scheme: https
...
...

[Screenshot: troubleshooting KubeControllerManagerDown, 6]

The port name in the Service is https-metrics, which matches the servicemonitor

[root@k8s-master01 ~]# kubectl get svc -n kube-system kube-controller-manager-prom -oyaml
...
...
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
...
...
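
The comparison between the two objects can be scripted instead of reading both YAML dumps. The sketch below is an illustration under assumptions: the function name port_names_match is hypothetical, and it only compares the first endpoint of the ServiceMonitor against the Service's port names (the scheme is not checked).

```shell
# port_names_match SM_NAMESPACE SM_NAME SVC_NAMESPACE SVC_NAME
# Returns 0 when the ServiceMonitor's first endpoint port name is also
# a port name on the Service, 1 otherwise.
port_names_match() {
  sm_port=$(kubectl get servicemonitor -n "$1" "$2" -o jsonpath='{.spec.endpoints[0].port}')
  svc_ports=$(kubectl get svc -n "$3" "$4" -o jsonpath='{.spec.ports[*].name}')
  for pn in $svc_ports; do
    if [ "$pn" = "$sm_port" ]; then
      echo "match: $pn"
      return 0
    fi
  done
  echo "no Service port named '$sm_port' (found: $svc_ports)"
  return 1
}
```

For this cluster: port_names_match monitoring kube-controller-manager kube-system kube-controller-manager-prom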

[Screenshot: troubleshooting KubeControllerManagerDown, 7]