1. Background

Reference: Consul Documentation | Consul | HashiCorp Developer

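The Prometheus configuration file, prometheus-config.yaml, contains a large number of scrape rules, and most of them are maintained by hand by the operations team. Whenever a node or component is added, someone has to edit this file and hot-reload Prometheus. Can Prometheus track microservices dynamically instead? It can: Prometheus provides several dynamic service discovery mechanisms, and this article uses Consul as the example.

Consul is a service registry and discovery tool that also provides a distributed key/value store. Any service can register with Consul, Prometheus included. With Consul-based service discovery we avoid listing a large number of targets in the Prometheus configuration.

Consul-based service discovery in Prometheus works as follows:

  1. Services (the monitoring targets) are registered with or deregistered from Consul.
  2. Prometheus keeps watching Consul and updates its monitoring targets whenever a matching service changes.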

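2. Service Discovery Mechanisms Supported by Prometheus

Prometheus scrape targets are configured either statically or through dynamic discovery. The most commonly used options are:

1) static_configs: # static configuration
2) file_sd_configs: # file-based service discovery
3) dns_sd_configs: # DNS service discovery
4) kubernetes_sd_configs: # Kubernetes service discovery
5) consul_sd_configs: # Consul service discovery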

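3. How It Works

1. Prometheus queries the Consul HTTP API for the registered services and their instances (the service registry, not the KV store) and reads each service's metadata, such as its address, port, and tags.

2. Prometheus uses this information to build the scrape URL of each target and adds it to the list of discovered targets.

3. When a service is deregistered or otherwise no longer present in Consul, Prometheus automatically removes it from the target list.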
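
To make step 2 concrete, here is a minimal shell sketch of how a target address can be derived from a Consul service entry. It assumes the behaviour of Prometheus' Consul discovery in which the service address is preferred and the node address is used only when the service was registered without one. The function name build_target and the sample values are hypothetical and serve only as an illustration.

```shell
# Sketch only: emulates how a target address (__address__) is assembled
# from a Consul service entry. build_target is a hypothetical helper,
# not part of Prometheus or Consul.

# build_target SERVICE_ADDRESS NODE_ADDRESS PORT
# Prefers the service address; falls back to the node address when the
# service was registered without one.
build_target() {
  local service_addr="$1" node_addr="$2" port="$3"
  local host="${service_addr:-$node_addr}"
  printf '%s:%s\n' "$host" "$port"
}

build_target "101.201.68.158" "10.0.0.62" 9100   # service address is set
build_target ""               "10.0.0.62" 9100   # falls back to the node address
```

The first call prints 101.201.68.158:9100 and the second prints 10.0.0.62:9100, which is why registering a service with an explicit address matters when the exporter does not run on the Consul node itself.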

Day07 - Integrated Monitoring for the Container Cloud Platform - Figure 6

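4. Running Consul in a Container

This setup is for testing only and must not be used in production! A production deployment must run Consul as a multi-node cluster, be validated as a whole, and have process supervision and monitoring in place.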

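Create a Consul cluster with a single node:

docker run -id \
  --expose 8300-8302 --expose 8500 --expose 8600 \
  --restart always \
  -p 18300:8300 -p 18301:8301 -p 18302:8302 -p 18500:8500 -p 18600:8600 \
  --name server1 \
  -e 'CONSUL_LOCAL_CONFIG={"skip_leave_on_interrupt": true}' \
  registry.cn-hangzhou.aliyuncs.com/abroad_images/consul:1.15.4 \
  agent -server -bootstrap-expect=1 -node=server1 -bind=0.0.0.0 -client=0.0.0.0 -ui -datacenter dc1

Parameter reference:

--expose: ports exposed by the container, i.e. the ports Consul needs: 8300, 8301, 8302, 8500, 8600 (the flag takes two dashes and one port or port range per occurrence)
--restart: always restarts the container automatically whenever it exits
-p: maps host ports to container ports
--name: container name
-e: environment variable, used here to pass configuration to Consul
registry.cn-hangzhou.aliyuncs.com/abroad_images/consul:1.15.4: the Consul image
agent: the command run inside the container, with these options:
  -server: runs the node as a server
  -bootstrap-expect: number of server nodes to wait for before electing a leader; for a single-node cluster this is 1
  -node: node name
  -bind: address used for communication inside the cluster; defaults to 0.0.0.0
  -client: address the client interfaces (HTTP, DNS) listen on; defaults to 127.0.0.1
  -ui: enables the Consul web UI
  -datacenter: datacenter name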

Verification:

The web UI is reachable from a browser, for example: http://10.0.0.62:18500/


You can also test access from the 10.0.0.62 host itself:

# curl localhost:18500
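
Beyond checking that the port answers, it can be useful to confirm that the single node has elected itself leader. The sketch below assumes Consul's /v1/status/leader endpoint, which returns a JSON string such as "172.17.0.2:8300", or an empty string while there is no leader. The helper has_leader and the sample address are hypothetical.

```shell
# Sketch only: decide from the body returned by /v1/status/leader whether
# the cluster has elected a leader. has_leader is a hypothetical helper.

# has_leader RESPONSE_BODY
# Succeeds when the body names a leader; fails for an empty body or "".
has_leader() {
  case "$1" in
    ''|'""') return 1 ;;
    *)       return 0 ;;
  esac
}

# Offline demonstration with sample bodies (the address is made up):
has_leader '"172.17.0.2:8300"' && echo "leader elected"
has_leader '""' || echo "no leader yet"

# Against the container started above:
# has_leader "$(curl -s localhost:18500/v1/status/leader)" && echo "leader elected"
```

The commented-out last line shows how the helper would be wired to the real endpoint on the test host.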

5. Registering a Host with Consul

For example, register the node-exporter running on a virtual machine with Consul.

# Start node-exporter
docker run -d -p 9100:9100 \
-v "/proc:/host/proc" \
-v "/sys:/host/sys" \
-v "/:/rootfs" \
-v "/etc/localtime:/etc/localtime" \
registry.cn-hangzhou.aliyuncs.com/abroad_images/node-exporter:latest \
--path.procfs /host/proc \
--path.sysfs /host/sys \
--collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($|/)"

# Verify
curl localhost:9100/metrics

Register:

## Format
$ curl -X PUT -d '{"id": "'${host_name}'","name": "node-exporter","address": "'${host_addr}'","port":9100,"tags": ["dam"],"checks": [{"http": "http://'${host_addr}':9100/","interval": "5s"}]}' http://10.0.0.62:18500/v1/agent/service/register

## Example
$ curl -X PUT -d '{"id": "sh-middler2","name": "node-exporter","address": "101.201.68.158","port":9100,"tags": ["middleware"],"checks": [{"http": "http://101.201.68.158:9100/metrics","interval": "3s"}]}' http://10.0.0.62:18500/v1/agent/service/register

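## Parameters
id: registration ID, the unique identifier of the service instance in Consul
name: service name
address: IP address the service is registered with
port: port the service is registered with
tags: registration tags; more than one may be given
checks: health checks
http: URL that the health check requests
interval: interval between health checks
http://10.0.0.62:18500/v1/agent/service/register: the Consul registration endpoint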
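
The one-line curl commands above mix shell quoting and JSON quoting, which is easy to get wrong. As a minimal sketch, the payload can be built by a small function and inspected before it is sent. The helper register_payload, its defaults (port 9100, tag middleware, a 5s check interval), and the assumption that the values contain no quotes or backslashes are all illustrative choices; the JSON fields are the ones used in the examples above.

```shell
# Sketch only: build the registration payload in one place so the quoting
# stays correct. register_payload is a hypothetical helper; values are
# assumed to contain no quotes or backslashes (no JSON escaping is done).

# register_payload ID ADDRESS [PORT] [TAG]
register_payload() {
  local id="$1" addr="$2" port="${3:-9100}" tag="${4:-middleware}"
  printf '{"id":"%s","name":"node-exporter","address":"%s","port":%s,"tags":["%s"],"checks":[{"http":"http://%s:%s/metrics","interval":"5s"}]}\n' \
    "$id" "$addr" "$port" "$tag" "$addr" "$port"
}

# Print the payload for inspection.
register_payload "sh-middler2" "101.201.68.158"

# To actually register, pipe it to the Consul agent used in this article:
# register_payload "sh-middler2" "101.201.68.158" |
#   curl -X PUT -d @- http://10.0.0.62:18500/v1/agent/service/register
```

Printing first and sending second keeps the request reproducible when many hosts are registered from a loop or an inventory file.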

Day07 - Integrated Monitoring for the Container Cloud Platform - Figure 7

Deregister:

## Format
$ curl -X PUT http://10.0.0.62:18500/v1/agent/service/deregister/${id}

## Example
$ curl -X PUT http://10.0.0.62:18500/v1/agent/service/deregister/sh-middler2

6. Configuring Prometheus for Automatic Service Discovery via Consul

Edit the Prometheus ConfigMap file, prometheus-config.yaml:

# Add the following configuration
    ########## Consul monitoring config ##########
    - job_name: consul
      honor_labels: true
      metrics_path: /metrics
      scheme: http
      consul_sd_configs:    # Consul-based service discovery
        - server: 10.0.0.62:18500    # address the Consul agent listens on
          services: []                 # empty list: discover every service in Consul
      relabel_configs:             # everything below rewrites labels
      - source_labels: ['__meta_consul_tags']    # copy the value of __meta_consul_tags into servername
        target_label: 'servername'
      - source_labels: ['__meta_consul_dc']   # copy the value of __meta_consul_dc into idc
        target_label: 'idc'
      - source_labels: ['__meta_consul_service']
        regex: "consul"  # match the service named "consul"
        action: drop       # drop the matching targets
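
One detail of the first relabel rule is easy to miss: as far as we understand Prometheus' Consul discovery, the value of __meta_consul_tags is the tag list joined by the tag separator (a comma by default) and additionally wrapped in that separator. The shell sketch below only emulates that formatting so the resulting servername value is easier to predict; consul_tags_label is a hypothetical helper, not a Prometheus command.

```shell
# Sketch only: emulates how the value of __meta_consul_tags is assembled.
# consul_tags_label is a hypothetical helper, not a Prometheus command.

# consul_tags_label SEPARATOR [TAG...]
# Joins the tags with SEPARATOR and wraps the result in SEPARATOR.
consul_tags_label() {
  local sep="$1" joined="" first=1 tag
  shift
  for tag in "$@"; do
    if [ "$first" -eq 1 ]; then
      joined="$tag"
      first=0
    else
      joined="${joined}${sep}${tag}"
    fi
  done
  printf '%s%s%s\n' "$sep" "$joined" "$sep"
}

consul_tags_label "," middleware        # one tag
consul_tags_label "," middleware prod   # two tags
```

The calls print ,middleware, and ,middleware,prod, respectively. Under this assumption the servername label of the example service would be ,middleware, rather than middleware, and a rule that selects a single tag would typically match on a pattern such as .*,middleware,.* instead of the bare tag name.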

# Full configuration file
[root@master01 7]# cat prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitor
data:
  prometheus.yml: |
    global:
      scrape_interval:     15s
      evaluation_interval: 15s
      external_labels:
        cluster: "kubernetes"

    ############ Scrape jobs ###################

    scrape_configs:
    ########## prometheus monitoring config ##########
    - job_name: prometheus
      static_configs:
      - targets: ['127.0.0.1:9090']
        labels:
          instance: prometheus

    ########## kube-apiserver monitoring config ##########
    - job_name: kube-apiserver
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]
        action: keep
        regex: default;kubernetes
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    ########## kube-controller-manager monitoring config ##########
    - job_name: 'kube-controller-manager'
      # Use Kubernetes pod discovery
      kubernetes_sd_configs:
        - role: pod
      # Scrape over HTTPS
      scheme: https
      # TLS settings (certificate verification is skipped; test environments only)
      tls_config:
        insecure_skip_verify: true
      # Authenticate with the ServiceAccount token
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
        # Keep only pods labeled component=kube-controller-manager
        - source_labels: [__meta_kubernetes_pod_label_component]
          regex: kube-controller-manager
          action: keep
        # Rewrite the target address to <pod IP>:10257
        - source_labels: [__meta_kubernetes_pod_ip]
          regex: (.+)
          target_label: __address__
          replacement: "${1}:10257"
        # Force the HTTPS scheme (redundant, but explicit)
        - source_labels: []
          regex: .*
          target_label: __scheme__
          replacement: https
        # Attach metadata labels
        - source_labels: [__meta_kubernetes_endpoints_name]
          action: replace
          target_label: endpoint
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: pod
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: service
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: namespace

    ########## kube-scheduler monitoring config ##########
    - job_name: 'kube-scheduler'
      kubernetes_sd_configs:
        - role: pod
      scheme: https
      tls_config:
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
        - source_labels: [__meta_kubernetes_pod_label_component]
          regex: kube-scheduler
          action: keep
        - source_labels: [__meta_kubernetes_pod_ip]
          regex: (.+)
          target_label: __address__
          replacement: "${1}:10259"
        - source_labels: []
          regex: .*
          target_label: __scheme__
          replacement: https
        - source_labels: [__meta_kubernetes_endpoints_name]
          action: replace
          target_label: endpoint
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: pod
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: service
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: namespace

    ########## kube-state-metrics monitoring config ##########
    - job_name: kube-state-metrics
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        regex: kube-state-metrics
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:8080
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    ########## coredns monitoring config ##########
    - job_name: coredns
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels:
          - __meta_kubernetes_service_label_k8s_app
        regex: kube-dns
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:9153
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    ########## etcd monitoring config ##########
    - job_name: etcd
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels:
          - __meta_kubernetes_pod_label_component
        regex: etcd
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:2381
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    ########## kubelet monitoring config ##########
    - job_name: kubelet
      metrics_path: /metrics/cadvisor
      scheme: https
      tls_config:
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    ########## k8s-node monitoring config ##########
    - job_name: k8s-nodes
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: replace
        target_label: endpoint
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

    ########## DNS monitoring config ##########
    - job_name: "kubernetes-dns"
      metrics_path: /probe              # /probe, not /metrics
      params:
        module: [dns_tcp]               # use the DNS-over-TCP module
      static_configs:
        - targets:
          - kube-dns.kube-system:53             # do not omit the port
          - 8.8.4.4:53
          - 8.8.8.8:53
          - 223.5.5.5:53
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: blackbox-exporter.monitor:9115 # Service address; must match the blackbox-exporter Service definition

    ########## ICMP monitoring config ##########
    - job_name: icmp-status
      metrics_path: /probe
      params:
        module: [icmp]
      static_configs:
      - targets:
        - 10.0.0.61
        labels:
          group: icmp
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter.monitor:9115

    ########## HTTP monitoring config ##########
    - job_name: 'kubernetes-services'
      metrics_path: /probe
      params:
        module:         ## use the http_get_2xx and http_get_3xx modules
        - "http_get_2xx"
        - "http_get_3xx"
      kubernetes_sd_configs:            ## Kubernetes dynamic service discovery, using the service role
      - role: service
      relabel_configs:          ## probe only Services annotated with prometheus.io/http_probe: true
      - action: keep
        source_labels: [__meta_kubernetes_service_annotation_prometheus_io_http_probe]
        regex: "true"
      - action: replace
        source_labels:
        - "__meta_kubernetes_service_name"
        - "__meta_kubernetes_namespace"
        - "__meta_kubernetes_service_annotation_prometheus_io_http_probe_port"
        - "__meta_kubernetes_service_annotation_prometheus_io_http_probe_path"
        target_label: __param_target
        regex: (.+);(.+);(.+);(.+)
        replacement: $1.$2:$3$4
      - target_label: __address__
        replacement: blackbox-exporter.monitor:9115             ## Service address of the Blackbox Exporter
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name

    ########## TCP monitoring config ##########
    - job_name: "service-tcp-probe"
      scrape_interval: 1m
      metrics_path: /probe
      # Use the tcp_connect probe module from the blackbox exporter configuration
      params:
        module: [tcp_connect]
      kubernetes_sd_configs:
      - role: service
      relabel_configs:
      # Keep Services annotated with prometheus.io/scrape: "true" and prometheus.io/tcp-probe: "true"
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape, __meta_kubernetes_service_annotation_prometheus_io_tcp_probe]
        action: keep
        regex: true;true
      # Copy __meta_kubernetes_service_name into the service_name label
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        regex: (.*)
        target_label: service_name
      # Copy __meta_kubernetes_namespace into the namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        regex: (.*)
        target_label: namespace
      # Set instance to the `clusterIP:port` address
      - source_labels: [__meta_kubernetes_service_cluster_ip, __meta_kubernetes_service_annotation_prometheus_io_http_probe_port]
        action: replace
        regex: (.*);(.*)
        target_label: __param_target
        replacement: $1:$2
      - source_labels: [__param_target]
        target_label: instance
      # Set __address__ to `blackbox-exporter.monitor:9115`
      - target_label: __address__
        replacement: blackbox-exporter.monitor:9115

    ########## Ingress monitoring config ##########
    - job_name: 'blackbox-k8s-ingresses'
      scrape_interval: 30s
      scrape_timeout: 10s
      metrics_path: /probe
      params:
        module: [http_get_2xx]  # use the http module defined for the blackbox exporter
      kubernetes_sd_configs:
      - role: ingress  # service discovery with the ingress role
      relabel_configs:
      # Discover only Ingresses annotated with prometheus.io/http_probe=true
      - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_http_probe]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
        regex: (.+);(.+);(.+)
        replacement: ${1}://${2}${3}
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter.monitor:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_ingress_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_ingress_name]
        target_label: kubernetes_name

    ########## External domain monitoring config ##########
    - job_name: "blackbox-external-website"
      scrape_interval: 30s
      scrape_timeout: 15s
      metrics_path: /probe
      params:
        module: [http_get_2xx]
      static_configs:
      - targets:
        - https://www.baidu.com # replace with your company's public-facing domains
        - https://www.jd.com
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter.monitor:9115
    ########## Cloud ECS monitoring config ##########
    - job_name: 'other-ECS'
      static_configs:
        - targets: ['101.201.68.158:9100']
          labels:
            hostname: 'test-node-exporter'

    ########## Process monitoring config ##########
    - job_name: 'process-exporter'
      static_configs:
      - targets: ['10.0.0.62:9256']

    ########## MySQL monitoring config ##########
    - job_name: 'mysql-exporter'
      static_configs:
      - targets: ['10.0.0.62:9104']

    ########## Consul monitoring config ##########
    - job_name: consul
      honor_labels: true
      metrics_path: /metrics
      scheme: http
      consul_sd_configs:    # Consul-based service discovery
        - server: 10.0.0.62:18500    # address the Consul agent listens on
          services: []                 # empty list: discover every service in Consul
      relabel_configs:             # everything below rewrites labels
      - source_labels: ['__meta_consul_tags']    # copy the value of __meta_consul_tags into servername
        target_label: 'servername'
      - source_labels: ['__meta_consul_dc']   # copy the value of __meta_consul_dc into idc
        target_label: 'idc'
      - source_labels: ['__meta_consul_service']
        regex: "consul"  # match the service named "consul"
        action: drop       # drop the matching targets

    ############ Location of the alerting rule files ###################
    rule_files:
    - /etc/prometheus/rules/*.rules

# Apply (kaf is assumed to be a shell alias for kubectl apply -f)
[root@master01 7]# kaf prometheus-config.yaml

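Reload Prometheus as before (the /-/reload endpoint only works when Prometheus runs with --web.enable-lifecycle), then open the Targets page in the Prometheus UI; the consul job defined above should be listed.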

curl -XPOST http://prometheus.zhang-qing.com/-/reload

If the consul job does not appear, Consul may not have any services registered yet. Register one as follows:

curl -X PUT -d '{"id": "sh-middler2","name": "node-exporter","address": "101.201.68.158","port":9100,"tags": ["middleware"],"checks": [{"http": "http://101.201.68.158:9100/metrics","interval": "3s"}]}' http://10.0.0.62:18500/v1/agent/service/register

Day07 - Integrated Monitoring for the Container Cloud Platform - Figure 8

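7. Summary

  • Dynamic service discovery and monitoring: integrated with Consul, Prometheus maintains its target list dynamically, so new services are discovered and monitored soon after they come online.
  • Scalability: automatic service discovery makes it easier to grow the infrastructure, because new targets no longer require manual changes to the monitoring configuration.
  • Seamless integration: with Consul as the service registry, Prometheus can work alongside other tools in the Consul ecosystem to monitor and manage the service infrastructure.
  • Self-healing: automatic service discovery lets Prometheus detect changes in the service infrastructure and adjust its target list in near real time, which keeps the monitoring data continuous.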