Get the NodePort of the Prometheus service
[root@k8s-master01 ~]# kubectl get svc -n monitoring | grep prometheus-k8s
prometheus-k8s NodePort 10.0.110.215 <none> 9090:30186/TCP,8080:31400/TCP 2d2h
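Open a browser, go to <node IP>:30186, and click [Alerts]. A KubeControllerManagerDown alert is firing. This does not necessarily mean the service is down; it can also mean that Prometheus is not monitoring the service at all.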
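
The NodePort can also be pulled out of the PORT(S) column with a small shell helper instead of reading it by eye. This is only a sketch: `nodeport_for` is a hypothetical helper name, and it assumes the column has the same shape as the output above.

```shell
#!/bin/sh
# nodeport_for PORT "PORT(S) column": print the NodePort mapped to a service port.
# Hypothetical helper; it only parses text shaped like "9090:30186/TCP,8080:31400/TCP".
nodeport_for() {
  want="$1"
  # Split the column on commas, then pick the entry that starts with "<port>:".
  printf '%s\n' "$2" | tr ',' '\n' | while IFS= read -r entry; do
    case "$entry" in
      "$want":*)
        np="${entry#*:}"            # drop the leading "9090:"
        printf '%s\n' "${np%%/*}"   # drop the trailing "/TCP"
        ;;
    esac
  done
}

nodeport_for 9090 "9090:30186/TCP,8080:31400/TCP"   # prints 30186
```

On a live cluster, `kubectl get svc -n monitoring prometheus-k8s -o jsonpath='{.spec.ports[?(@.port==9090)].nodePort}'` should return the same value without any text parsing.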

Now work through the troubleshooting steps above one at a time:
1. Check whether the ServiceMonitor for kube-controller-manager was created. It was.
[root@k8s-master01 ~]# kubectl get servicemonitor -n monitoring kube-controller-manager
kube-controller-manager 2d18h
2. Check whether the label selector in the kube-controller-manager ServiceMonitor is configured correctly. It is.
[root@k8s-master01 ~]# kubectl get servicemonitor -n monitoring kube-controller-manager -oyaml

Query for a Service using the label from the kube-controller-manager ServiceMonitor. No Service is returned, which is why Prometheus cannot find a target to monitor.
[root@k8s-master01 ~]# kubectl get svc -n kube-system -l app.kubernetes.io/name=kube-controller-manager
No resources found in kube-system namespace.
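
Rather than copying the label by hand, the selector can be read from the ServiceMonitor and passed straight to the Service query. A minimal sketch, assuming the ServiceMonitor uses `spec.selector.matchLabels` and that the target Service lives in kube-system; `join_selector` and `services_for_servicemonitor` are hypothetical helper names.

```shell
#!/bin/sh
# join_selector: turn "key=value" lines on stdin into "k1=v1,k2=v2".
join_selector() {
  paste -sd, -
}

# services_for_servicemonitor NAME: list Services in kube-system that match the
# ServiceMonitor's spec.selector.matchLabels. Needs kubectl and cluster access,
# so it is only defined here, not called.
services_for_servicemonitor() {
  selector=$(kubectl get servicemonitor -n monitoring "$1" \
    -o go-template='{{range $k, $v := .spec.selector.matchLabels}}{{$k}}={{$v}}{{"\n"}}{{end}}' \
    | join_selector)
  kubectl get svc -n kube-system -l "$selector"
}

# Usage on a cluster: services_for_servicemonitor kube-controller-manager
```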
So create the Service and Endpoints manually, pointing at your own Controller Manager instances.
Write the YAML file:
[root@k8s-master01 ~]# vim controller-manager-svc.yaml
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app.kubernetes.io/name: kube-controller-manager
  name: kube-controller-manager-prom
  namespace: kube-system
subsets:
- addresses:
  - ip: 192.168.1.31
  - ip: 192.168.1.32
  - ip: 192.168.1.33
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-controller-manager
  name: kube-controller-manager-prom
  namespace: kube-system
spec:
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
    targetPort: 10257
  sessionAffinity: None
  type: ClusterIP
Note: ports.name and the label must match what the kube-controller-manager ServiceMonitor is configured with.
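
The addresses in the Endpoints object have to be your own Master IPs, so it can be handy to generate that part of the manifest. The following is a sketch only: `render_cm_endpoints` is a hypothetical helper, and the name, label and port are copied from the manifest above.

```shell
#!/bin/sh
# render_cm_endpoints IP...: print an Endpoints manifest for the given master IPs.
# Hypothetical helper; it writes YAML to stdout and does not touch the cluster.
render_cm_endpoints() {
  cat <<'EOF'
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app.kubernetes.io/name: kube-controller-manager
  name: kube-controller-manager-prom
  namespace: kube-system
subsets:
- addresses:
EOF
  # One address entry per master IP passed on the command line.
  for ip in "$@"; do
    printf '  - ip: %s\n' "$ip"
  done
  cat <<'EOF'
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
EOF
}

render_cm_endpoints 192.168.1.31 192.168.1.32 192.168.1.33
```

The output can be reviewed first and then piped to `kubectl create -f -` in place of the Endpoints half of the file.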
Create the Service and Endpoints
[root@k8s-master01 ~]# kubectl create -f controller-manager-svc.yaml -n kube-system
Check the Service
[root@k8s-master01 ~]# kubectl get svc -n kube-system kube-controller-manager-prom
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-controller-manager-prom ClusterIP 10.0.103.188 <none> 10257/TCP 5h51m
3. Click [Status] - [Configuration] and check whether Prometheus has generated the corresponding configuration. It has.

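If you see the error "server returned HTTP status 403 Forbidden", edit /usr/lib/systemd/system/kube-controller-manager.service on all three Master nodes (this cluster is a binary installation; adapt the path to your own environment). The changes are:
(1) Change the address from 127.0.0.1 to 0.0.0.0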
$ sed -i "s#address=127.0.0.1#address=0.0.0.0#g" /usr/lib/systemd/system/kube-controller-manager.service
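(2) Add two flags that tell kube-controller-manager which kubeconfig file to use when it asks the Kubernetes API server to authenticate and authorize incoming requests, such as the scrape from Prometheus. The kubeconfig file holds the credentials needed to talk to the API server.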
--authentication-kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \
--authorization-kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \
After these changes the file looks like this:
$ vim /usr/lib/systemd/system/kube-controller-manager.service
[Unit]
Description=Kubernetes Controller Manager
Documentation=https://github.com/kubernetes/kubernetes
After=network.target
[Service]
ExecStart=/usr/local/bin/kube-controller-manager \
--v=2 \
--logtostderr=true \
--authentication-kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \
--authorization-kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \
--address=0.0.0.0 \
--root-ca-file=/etc/kubernetes/pki/ca.pem \
--cluster-signing-cert-file=/etc/kubernetes/pki/ca.pem \
--cluster-signing-key-file=/etc/kubernetes/pki/ca-key.pem \
--service-account-private-key-file=/etc/kubernetes/pki/sa.key \
--kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \
--leader-elect=true \
--use-service-account-credentials=true \
--node-monitor-grace-period=40s \
--node-monitor-period=5s \
--pod-eviction-timeout=2m0s \
--controllers=*,bootstrapsigner,tokencleaner \
--allocate-node-cidrs=true \
--cluster-cidr=172.16.0.0/12 \
--requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.pem \
--node-cidr-mask-size=24
Restart=always
RestartSec=10s
[Install]
WantedBy=multi-user.target
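
Since the same edit has to be made on three nodes, a quick check that each unit file really contains the changed settings can save a restart cycle. A minimal sketch; `check_cm_unit` is a hypothetical helper that only looks for the three strings discussed above.

```shell
#!/bin/sh
# check_cm_unit FILE: report whether the unit file carries the three settings
# changed above. Returns 0 only when all of them are present.
check_cm_unit() {
  rc=0
  for flag in \
    '--authentication-kubeconfig=' \
    '--authorization-kubeconfig=' \
    '--address=0.0.0.0'
  do
    # -F matches the text literally; -e keeps the leading "--" from being read as an option.
    if grep -qF -e "$flag" "$1"; then
      echo "ok:      $flag"
    else
      echo "missing: $flag"
      rc=1
    fi
  done
  return "$rc"
}

# Usage on each Master node:
#   check_cm_unit /usr/lib/systemd/system/kube-controller-manager.service
```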
Restart kube-controller-manager on all three Master nodes
[root@k8s-master01 ~]# systemctl daemon-reload && systemctl restart kube-controller-manager
Delete all network policies in the monitoring namespace
[root@k8s-master01 ~]# kubectl delete networkpolicy --all -n monitoring
Check again: the ServiceMonitor now finds kube-controller-manager.
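
The same check can be done from the command line through the Prometheus HTTP query API. This is a rough sketch under a few assumptions: the scrape job is labelled `kube-controller-manager`, the response is compact JSON, and `count_up` and `cm_targets_up` are hypothetical helper names that rely on plain grep rather than a JSON parser.

```shell
#!/bin/sh
# count_up: read a Prometheus instant-query JSON response on stdin and count
# the samples whose value is "1" (target is up).
count_up() {
  grep -oE '"value":\[[0-9.]+,"1"\]' | wc -l | tr -d ' '
}

# cm_targets_up HOST:PORT: ask Prometheus how many controller-manager targets are up.
# Needs network access to Prometheus, so it is only defined here, not called.
cm_targets_up() {
  curl -s -G "http://$1/api/v1/query" \
    --data-urlencode 'query=up{job="kube-controller-manager"}' | count_up
}

# Usage: cm_targets_up <node IP>:30186
```

With three Master nodes, a result of 3 would mean every instance is being scraped.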


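At this point the problem is resolved. If it is not, continue with the remaining steps.
4. Confirm that a Service matching the ServiceMonitor exists. This was found missing at the very start, which is why a new Service was created by hand.
5. Confirm that the application's metrics endpoint is reachable through the Service
Get the Service address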
[root@k8s-master01 ~]# kubectl get svc -n kube-system kube-controller-manager-prom
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-controller-manager-prom ClusterIP 10.0.190.144 <none> 10257/TCP 5h8m
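Because the endpoint is served over HTTPS and the request is anonymous, the test request is rejected. This has no impact for now.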
[root@k8s-master01 ~]# curl https://10.0.190.144:10257 -k
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
"reason": "Forbidden",
"details": {},
"code": 403
}
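
The 403 comes from the anonymous request, so a request that carries a bearer token should get further. A hedged sketch: it assumes kubectl 1.24 or newer (for `kubectl create token`) and a `prometheus-k8s` ServiceAccount in the monitoring namespace that is allowed to GET /metrics; `metrics_url` and `scrape_cm_metrics` are hypothetical helper names.

```shell
#!/bin/sh
# metrics_url HOST PORT: build the HTTPS metrics URL for a Service address.
metrics_url() {
  printf 'https://%s:%s/metrics\n' "$1" "$2"
}

# scrape_cm_metrics HOST PORT: fetch metrics using a ServiceAccount bearer token.
# Needs kubectl and cluster access, so it is only defined here, not called.
scrape_cm_metrics() {
  token=$(kubectl create token prometheus-k8s -n monitoring)
  # -k skips certificate verification, as in the curl test above.
  curl -sk -H "Authorization: Bearer $token" "$(metrics_url "$1" "$2")"
}

metrics_url 10.0.190.144 10257   # prints https://10.0.190.144:10257/metrics
```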
6. Confirm that the Service's port and scheme match the ServiceMonitor
Look at port and scheme in the ServiceMonitor: port is https-metrics and scheme is https
[root@k8s-master01 ~]# kubectl get servicemonitor -n monitoring kube-controller-manager -oyaml
...
...
port: https-metrics
scheme: https
...
...

In the Service, the port name is https-metrics, which matches the ServiceMonitor
[root@k8s-master01 ~]# kubectl get svc -n kube-system kube-controller-manager-prom -oyaml
...
...
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
...
...
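
The comparison in this step can be scripted so it does not depend on reading two YAML dumps side by side. A minimal sketch, assuming the resource names used in this walkthrough; `has_word` and `sm_port_matches` are hypothetical helper names.

```shell
#!/bin/sh
# has_word WORD "LIST": succeed if WORD is one of the space-separated items in LIST.
has_word() {
  for item in $2; do
    [ "$item" = "$1" ] && return 0
  done
  return 1
}

# sm_port_matches: compare the ServiceMonitor endpoint port names with the
# Service port names. Needs kubectl and cluster access, so it is only defined
# here, not called.
sm_port_matches() {
  sm_ports=$(kubectl get servicemonitor -n monitoring kube-controller-manager \
    -o jsonpath='{.spec.endpoints[*].port}')
  svc_ports=$(kubectl get svc -n kube-system kube-controller-manager-prom \
    -o jsonpath='{.spec.ports[*].name}')
  for p in $sm_ports; do
    has_word "$p" "$svc_ports" || { echo "no Service port named $p"; return 1; }
  done
  echo "port names match"
}

# Usage on a cluster: sm_port_matches
```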
