Official documentation: https://argoproj.github.io/argo-rollouts/features/analysis/

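1. Main CRD resources

1.1 Rollout

A Rollout is a drop-in replacement for the Deployment resource. It adds blue-green and canary update strategies. During an update, these strategies can create AnalysisRuns and run verification, and the results either advance the update or abort it.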

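1.2 AnalysisTemplate

An AnalysisTemplate is a template spec that defines how to perform a canary analysis: which metrics to measure, how often to measure them, and which values count as success or failure. An AnalysisTemplate can be parameterized with input arguments.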

apiVersion: argoproj.io/v1alpha1

kind: AnalysisTemplate

metadata:

  name: success-rate

spec:

  args:

    # Template arguments; reference them inside the template as "{{args.NAME}}" and assign values when the template is invoked.

    - name: <string>

      value: <string>

      valueFrom:

        secretKeyRef:

          name: <string>

          key: <string>

  metrics:

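    # Required: the metrics used to analyze the delivery

    - name: <string>

      # Required: the metric name

      initialDelay: 5m

      # Delay before the first measurement of this metric

      interval: 5m

      # Interval between measurements when several are taken

      consecutiveErrorLimit: <integer>

      # Maximum number of consecutive measurement errors tolerated

      count: <integer>

      # Total number of measurements

      failureCondition: result[0] < 0.95

      # Expression that marks a measurement as "failed"

      # NOTE: prometheus queries return results in the form of a vector.

      # So it is common to access the index 0 of the returned array to obtain the value

      successCondition: result[0] >= 0.95

      # Expression that marks a measurement as "successful"

      failureLimit: 3

      # Maximum number of failed measurements allowed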

      provider:

        # Metric provider; supported: web, wavefront, skywalking, prometheus, plugin, newRelic, kayenta, job, influxdb, graphite, datadog, cloudWatch.

        prometheus:

          # Address of the Prometheus service

          address: http://prometheus.example.com:9090

          # Query sent to the Prometheus service (PromQL)

          query: |

            sum(irate(

              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]

            )) /

            sum(irate(

              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]

            ))

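  dryRun:

    # Metrics that run in dryRun mode; their results do not affect the final analysis result

    - metricName: <string>

      # Metric name

  measurementRetention:

    # Number of measurement results to retain; dryRun metrics support retention as well

    - metricName: <string>

      # Metric name

      limit: <integer>

      # Number of measurements to keep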
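
The "{{args.NAME}}" placeholders used in the template above can be pictured as plain string substitution. The following is a minimal Python sketch of that idea, not the controller's actual implementation; the function name render_args and the sample query are hypothetical and exist only for illustration.

```python
import re

# Matches {{args.NAME}}, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*args\.([A-Za-z0-9_-]+)\s*\}\}")


def render_args(text: str, args: dict) -> str:
    """Replace every {{args.NAME}} placeholder with the value supplied in args."""

    def substitute(match):
        name = match.group(1)
        if name not in args:
            # An argument without a value cannot be resolved.
            raise KeyError(f"no value supplied for argument {name!r}")
        return args[name]

    return PLACEHOLDER.sub(substitute, text)


query = 'istio_requests_total{destination_service=~"{{args.service-name}}"}'
rendered = render_args(
    query, {"service-name": "guestbook-svc.default.svc.cluster.local"}
)
print(rendered)
```

The sketch raises an error for an argument that has no value, which mirrors the idea that every declared argument needs either a default in the template or a value from the caller.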

1.3 ClusterAnalysisTemplate

A ClusterAnalysisTemplate is like an AnalysisTemplate, but it is not limited to a single namespace. Any Rollout in the cluster can use it.

apiVersion: argoproj.io/v1alpha1

kind: ClusterAnalysisTemplate

metadata:

  name: success-rate

spec:

  args:

    # Template arguments; reference them inside the template as "{{args.NAME}}" and assign values when the template is invoked.

    - name: <string>

      value: <string>

      valueFrom:

        secretKeyRef:

          name: <string>

          key: <string>

  metrics:

    # Required: the metrics used to analyze the delivery

    - name: <string>

      # Required: the metric name

      interval: 5m

      # Interval between measurements when several are taken

      # NOTE: prometheus queries return results in the form of a vector.

      # So it is common to access the index 0 of the returned array to obtain the value

      successCondition: result[0] >= 0.95

      # Expression that marks a measurement as "successful"

      failureLimit: 3

      # Maximum number of failed measurements allowed

      provider:

        # Metric provider; supported: web, wavefront, skywalking, prometheus, plugin, newRelic, kayenta, job, influxdb, graphite, datadog, cloudWatch.

        prometheus:

          # Address of the Prometheus service

          address: http://prometheus.example.com:9090

          # Query sent to the Prometheus service (PromQL)

          query: |

            sum(irate(

              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]

            )) /

            sum(irate(

              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]

            ))

  dryRun:

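    # Metrics that run in dryRun mode; their results do not affect the final analysis result

    - metricName: <string>

      # Metric name

  measurementRetention:

    # Number of measurement results to retain; dryRun metrics support retention as well

    - metricName: <string>

      # Metric name

      limit: <integer>

      # Number of measurements to keep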

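1.4 AnalysisRun

An AnalysisRun is an instantiation of an AnalysisTemplate. Like a Job, an AnalysisRun eventually completes. A completed run is considered successful, failed, or inconclusive, and that result determines whether the Rollout's update continues, aborts, or pauses.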

apiVersion: argoproj.io/v1alpha1

kind: AnalysisRun

metadata:

  name: success-rate

spec:

  args:

    # Template arguments; reference them inside the template as "{{args.NAME}}" and assign values when the template is invoked.

    - name: <string>

      value: <string>

      valueFrom:

        secretKeyRef:

          name: <string>

          key: <string>

  metrics:

    # Required: the metrics used to analyze the delivery

    - name: <string>

      # Required: the metric name

      interval: 5m

      # Interval between measurements when several are taken

      # NOTE: prometheus queries return results in the form of a vector.

      # So it is common to access the index 0 of the returned array to obtain the value

      successCondition: result[0] >= 0.95

      # Expression that marks a measurement as "successful"

      failureLimit: 3

      # Maximum number of failed measurements allowed

      provider:

        # Metric provider; supported: web, wavefront, skywalking, prometheus, plugin, newRelic, kayenta, job, influxdb, graphite, datadog, cloudWatch.

        prometheus:

          # Address of the Prometheus service

          address: http://prometheus.example.com:9090

          # Query sent to the Prometheus service (PromQL)

          query: |

            sum(irate(

              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]

            )) /

            sum(irate(

              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]

            ))

  dryRun:

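    # Metrics that run in dryRun mode; their results do not affect the final analysis result

    - metricName: <string>

      # Metric name

  measurementRetention:

    # Number of measurement results to retain; dryRun metrics support retention as well

    - metricName: <string>

      # Metric name

      limit: <integer>

      # Number of measurements to keep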

  terminate: <boolean>
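
The relationship between an AnalysisRun's outcome and the Rollout's next move, as described at the start of this subsection, can be summarized in a few lines. This is only an illustrative Python sketch; the enum and the function rollout_action are hypothetical names, and the mapping (successful continues, failed aborts, inconclusive pauses) is the reading assumed here.

```python
from enum import Enum


class Phase(Enum):
    """Terminal outcomes of an AnalysisRun, as listed in the text above."""

    SUCCESSFUL = "Successful"
    FAILED = "Failed"
    INCONCLUSIVE = "Inconclusive"


def rollout_action(phase: Phase) -> str:
    """Return what the Rollout does next for a finished AnalysisRun (assumed mapping)."""
    if phase is Phase.SUCCESSFUL:
        return "continue"  # move on to the next step
    if phase is Phase.FAILED:
        return "abort"  # stop the update
    return "pause"  # inconclusive: wait for a human decision


for phase in Phase:
    print(phase.value, "->", rollout_action(phase))
```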

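2. Feature overview

Argo Rollouts is a progressive delivery tool for Kubernetes. It offers several kinds of analysis to drive a progressive rollout.

These analyses help a team decide how to proceed when deploying a new application version, so that the new version stays stable and reliable in production.

Argo Rollouts analysis supports several modes:

  • Background Analysis: while the canary steps execute, an AnalysisRun runs in the background, and its result decides what the canary rollout does next (advance or abort).

  • Inline Analysis: when the canary rollout reaches a given step, it starts an AnalysisRun and blocks until the run completes; the result decides what the canary rollout does next (advance or abort).

  • Analysis with Multiple Templates: a Rollout can reference several AnalysisTemplates when it builds an AnalysisRun, so one run can combine the metrics of all of them.

  • Analysis Template Arguments: an AnalysisTemplate can declare a set of arguments that the Rollout passes in.

  • BlueGreen Pre Promotion Analysis / BlueGreen Post Promotion Analysis: a Rollout that uses the BlueGreen strategy can start an AnalysisRun before or after traffic switches to the new version. The pre-promotion result decides whether traffic is switched; a failed post-promotion analysis switches traffic back to the previous version.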

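2.1 Inline Analysis: blocking

Inline analysis integrates the analysis into the rollout as one of its steps. When the rollout reaches the analysis step, it starts an AnalysisRun and pauses until the run completes. Whether the analysis succeeds or fails decides whether the rollout moves on to the next step or aborts entirely.

Example: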

apiVersion: argoproj.io/v1alpha1

kind: Rollout

metadata:

  name: guestbook

spec:

...

  strategy:

    canary:

      steps:

        - setWeight: 20

        - pause:

            duration: 5m

        - analysis:

            templates:

              - templateName: success-rate

                args:

                  - name: service-name

                    value: guestbook-svc.default.svc.cluster.local

Because the AnalysisTemplate used in this example specifies no interval, the analysis takes a single measurement and then completes.

apiVersion: argoproj.io/v1alpha1

kind: AnalysisTemplate

metadata:

  name: success-rate

spec:

  args:

    - name: service-name

    - name: prometheus-port

      value: "9090"

  metrics:

    - name: success-rate

      successCondition: result[0] >= 0.95

      provider:

        prometheus:

          address: "http://prometheus.example.com:{{args.prometheus-port}}"

          query: |

            sum(irate(

              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]

            )) /

            sum(irate(

              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]

            ))

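This AnalysisTemplate, named success-rate, defines how the success rate is measured. It takes two arguments:

  • service-name: the name of the service to analyze.

  • prometheus-port: the port of the Prometheus server; defaults to 9090.

To take several measurements and evaluate over a longer period, set the count and interval fields: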

metrics:

  - name: success-rate

    successCondition: result[0] >= 0.95

    interval: 60s

    count: 5

    provider:

      prometheus:

        address: http://prometheus.example.com:9090

        query: ...
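
To get a feel for how long such an analysis runs, the count and interval values can be turned into a simple schedule. The sketch below is a simplification in Python, assuming the first measurement is taken right away and every later one follows exactly one interval after the previous one; real timing also depends on how long each query takes. The function name measurement_offsets is hypothetical.

```python
def measurement_offsets(count: int, interval_seconds: int,
                        initial_delay_seconds: int = 0) -> list:
    """Return the offsets (in seconds) at which each measurement would start."""
    return [initial_delay_seconds + i * interval_seconds for i in range(count)]


# Values from the snippet above: interval: 60s, count: 5
offsets = measurement_offsets(count=5, interval_seconds=60)
print(offsets)       # [0, 60, 120, 180, 240]
print(offsets[-1])   # the last measurement starts after 240 seconds
```

Under these assumptions the metric is observed over roughly four minutes instead of at a single instant.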

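2.2 Background Analysis: non-blocking

Background analysis runs the analysis while the canary rollout progresses. The application's health and other key metrics are therefore monitored and evaluated even while the new version is being rolled out step by step.

Example: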

apiVersion: argoproj.io/v1alpha1

kind: Rollout

metadata:

  name: guestbook

spec:

  strategy:

    canary:

      analysis:

        templates:

          - templateName: success-rate

        startingStep: 2 # delay the analysis until the setWeight: 40 step (step indexes are zero-based)

        args:

          - name: service-name

            value: guestbook-svc.default.svc.cluster.local

      steps:

        - setWeight: 20

        - pause:

            duration: 10m

        - setWeight: 40

        - pause:

            duration: 10m

        - setWeight: 60

        - pause:

            duration: 10m

        - setWeight: 80

        - pause:

            duration: 10m

AnalysisTemplate configuration:

apiVersion: argoproj.io/v1alpha1

kind: AnalysisTemplate

metadata:

  name: success-rate

spec:

  args:

    - name: service-name

  metrics:

    - name: success-rate

      interval: 5m

      successCondition: result[0] >= 0.95

      failureLimit: 3

      provider:

        prometheus:

          address: http://prometheus.example.com:9090

          query: |

            sum(irate(

              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]

            )) /

            sum(irate(

              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]

            ))
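
The startingStep value in the Rollout above is an index into the steps list, counted from zero. The short Python sketch below mirrors that list to show which step the index 2 points at; the list literal is just a hand-written copy of the example for illustration.

```python
# Hand-written copy of the canary steps from the background analysis example.
steps = [
    {"setWeight": 20},
    {"pause": "10m"},
    {"setWeight": 40},
    {"pause": "10m"},
    {"setWeight": 60},
    {"pause": "10m"},
    {"setWeight": 80},
    {"pause": "10m"},
]

starting_step = 2  # zero-based index, as in startingStep: 2
first_analyzed = steps[starting_step]
print(first_analyzed)  # {'setWeight': 40}
```

Steps 0 and 1 (setWeight: 20 and the first pause) therefore run without the background analysis.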

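2.3 Analysis Template

An analysis template is a manifest that defines the criteria or metrics used to judge whether the new version (usually called the Green or Canary version) is healthy. With these metrics in place, a team can set up automated rules that assess how the new version behaves and decide whether to keep promoting it. This reduces the need for manual intervention and gives a more objective evaluation of the new version in a real environment.

Example: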

apiVersion: argoproj.io/v1alpha1

kind: AnalysisTemplate

metadata:

  name: service-success-rate

spec:

  args:

    - name: service-name

    - name: prometheus-port

      value: "9090"

  metrics:

    - name: success-rate

      interval: 5m

      count: 3

      successCondition: result[0] >= 0.95

      failureLimit: 2

      provider:

        prometheus:

          address: http://prometheus.example.com:{{args.prometheus-port}}

          query: |

            sum(irate(

              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]

            )) /

            sum(irate(

              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]

            ))

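Metrics:

  • name: success-rate is the name of the metric.

  • interval: 5m means the query runs every 5 minutes.

  • count: 3 means the query runs 3 times in total.

  • successCondition: result[0] >= 0.95 means a measurement succeeds when the first element of the query result is greater than or equal to 0.95.

  • failureLimit: 2 means up to 2 failed measurements are tolerated; the analysis fails once the number of measurements that do not satisfy successCondition exceeds 2. The failures do not have to be consecutive.

  • provider: the data source configuration.

  • address: the address of the Prometheus server; {{args.prometheus-port}} is a template argument that holds the server's port.

  • query: the Prometheus query expression that computes the service's success rate.

Summary:

The template runs the query 3 times, once every 5 minutes, to check the service's success rate. Because failureLimit is 2 and count is 3, the analysis fails only if all 3 measurements report a success rate below 95%.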
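
How failureLimit interacts with count can be checked with a small simulation. The Python sketch below assumes the rule that a run fails as soon as the number of failed measurements exceeds failureLimit; the function analysis_failed and the sample values are hypothetical and only illustrate the template above (successCondition result[0] >= 0.95, failureLimit 2, count 3).

```python
def analysis_failed(measurements, success_threshold, failure_limit):
    """Return True once the number of failed measurements exceeds failure_limit."""
    failures = 0
    for value in measurements:
        if not value >= success_threshold:  # mirrors result[0] >= 0.95
            failures += 1
            if failures > failure_limit:
                return True
    return False


# Two bad measurements out of three stay within failureLimit: 2.
print(analysis_failed([0.90, 0.99, 0.91], 0.95, 2))  # False
# Three bad measurements exceed the limit.
print(analysis_failed([0.90, 0.92, 0.91], 0.95, 2))  # True
```

Setting failureLimit to 0 in this sketch makes a single failed measurement fail the run.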

3. Intelligent progressive delivery

3.1 Create the Rollout object

[root@master01 ~]# cd /root/17/argo-rollouts/argorollout-examples

[root@master01 argorollout-examples]# mkdir canary-analysis

[root@master01 argorollout-examples]# cd canary-analysis/

[root@master01 canary-analysis]# cat <<EOF >> rollout-with-analysis.yaml

apiVersion: argoproj.io/v1alpha1

kind: Rollout

metadata:

  name: canary-rollouts-analysis-demo

  namespace: demo

spec:

  replicas: 3

  strategy:

    canary:

      analysis:

        templates:

        - templateName: success-rate  # the AnalysisTemplate to use

        startingStep: 2  # zero-based step index: the analysis starts at steps[2] (setWeight: 40); steps[0] is the initial setWeight: 20

        args:  # arguments passed to the AnalysisTemplate

        - name: ingress

          value: rollouts-analysis-stable-ing

      canaryService: rollouts-analysis-canary

      stableService: rollouts-analysis-stable

      trafficRouting:

        nginx:

          stableIngress: rollouts-analysis-stable-ing

      # Release cadence

      steps:

      - setWeight: 20

      - pause: {}         # pauses indefinitely until manually promoted

      - setWeight: 40

      - pause: {duration: 60s}

      - setWeight: 60

      - pause: {duration: 60s}

      - setWeight: 80

      - pause: {duration: 60s}

  revisionHistoryLimit: 2

  selector:

    matchLabels:

      app: rollouts-analysis-demo

  template:

    metadata:

      labels:

        app: rollouts-analysis-demo

    spec:

      containers:

      - name: rollouts-analysis-demo

        image: registry.cn-hangzhou.aliyuncs.com/zhdya/kubernetes-bootcamp:v1

        ports:

        - name: http

          containerPort: 8080

          protocol: TCP

EOF

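analysis: defines the AnalysisTemplate and the arguments used during the canary rollout. This example uses the template named success-rate and starts the analysis at step index 2 (setWeight: 40). The args section passes the argument ingress with the value rollouts-analysis-stable-ing.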

3.2 Create the application's stable/canary Services

[root@master01 ~]# cd /root/17/argo-rollouts/argorollout-examples/canary-analysis

[root@master01 canary-analysis]# cat <<EOF >> rollouts-analysis-svc.yaml

apiVersion: v1

kind: Service

metadata:

  name: rollouts-analysis-canary

  namespace: demo

spec:

  ports:

  - port: 8080

    targetPort: http

    protocol: TCP

    name: http

  selector:

    app: rollouts-analysis-demo

---

apiVersion: v1

kind: Service

metadata:

  name: rollouts-analysis-stable

  namespace: demo

spec:

  ports:

  - port: 8080

    targetPort: http

    protocol: TCP

    name: http

  selector:

    app: rollouts-analysis-demo

EOF

3.3 Create the application's Ingress route

[root@master01 ~]# cd /root/17/argo-rollouts/argorollout-examples/canary-analysis

[root@master01 canary-analysis]# cat <<EOF >> canary-analysis-ing.yaml

apiVersion: networking.k8s.io/v1

kind: Ingress

metadata:

  name: rollouts-analysis-stable-ing

  namespace: demo

spec:

  ingressClassName: nginx

  rules:

  - host: canary-analysis.example.com

    http:

      paths:

      - backend:

          service:

            name: rollouts-analysis-stable

            port:

              number: 8080

        path: /

        pathType: Prefix

EOF

3.4 Create the AnalysisTemplate for automated analysis

[root@master01 ~]# cd /root/17/argo-rollouts/argorollout-examples/canary-analysis

[root@master01 canary-analysis]# cat <<EOF >> analysis-success.yaml

apiVersion: argoproj.io/v1alpha1

kind: AnalysisTemplate

metadata:

  name: success-rate

  namespace: demo

spec:

  args:

  - name: ingress

  metrics:

  - name: success-rate

    initialDelay: 1s  # wait 1s before the first measurement

    interval: 2s  # how often the metric is queried

    failureLimit: 2  # up to 2 failed measurements are tolerated; the 3rd failure fails the analysis

    successCondition: result[0] > 0.90  # success condition: ratio greater than 90% (an empty result is not handled by this expression)

    provider:

      prometheus:

        address: http://prometheus.monitor.svc:9090  # Prometheus address

        query: >+  # PromQL query

          sum(rate(nginx_ingress_controller_requests{ingress="{{args.ingress}}",status!~"[4-5].*"}[60s]))

          /

          sum(rate(nginx_ingress_controller_requests{ingress="{{args.ingress}}"}[60s]))

EOF

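The configuration is explained below:

  • metadata: its name field sets the AnalysisTemplate name to success-rate.

  • spec: defines the template's arguments and metrics.

  • args: the input arguments the template requires. This example has a single argument, ingress.

  • metrics: the set of metrics. This example has a single metric, success-rate.

  • interval: how often the metric is measured; set to 2 seconds in this example.

  • failureLimit: the number of failed measurements tolerated during the analysis; set to 2 in this example.

  • successCondition: the success condition; set to result[0] > 0.90 in this example, so a measurement succeeds only when the success rate is greater than 90%.

  • provider: the source of the metric data. This example uses Prometheus.

  • address: the address of the Prometheus instance.

  • query: the Prometheus query expression that computes the application's success rate. It computes the share of requests over the last 60 seconds whose status code is neither 4xx nor 5xx.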
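
The ratio that the query computes can be reproduced by hand on a few numbers. The Python sketch below is only an illustration with made-up request rates per status code; the function success_ratio is hypothetical. It relies on the fact that Prometheus label regex matchers are fully anchored, which is why re.fullmatch is used.

```python
import re

# Same pattern as the status!~"[4-5].*" matcher in the query.
ERROR_STATUS = re.compile(r"[4-5].*")


def success_ratio(rate_by_status: dict) -> float:
    """Share of the request rate whose status code is neither 4xx nor 5xx."""
    total = sum(rate_by_status.values())
    if total == 0:
        raise ValueError("no requests observed, the ratio is undefined")
    ok = sum(
        rate
        for status, rate in rate_by_status.items()
        if not ERROR_STATUS.fullmatch(status)
    )
    return ok / total


# Made-up per-second request rates, keyed by HTTP status code.
ratio = success_ratio({"200": 85.0, "404": 10.0, "500": 5.0})
passes = ratio > 0.90  # same comparison as successCondition: result[0] > 0.90
print(ratio, passes)
```

With these sample rates the measurement would count as failed, because 4xx responses lower the ratio just like 5xx responses do.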

3.5 Create the Application

[root@master01 ~]# cd /root/17/argo-rollouts/argorollout-examples/

[root@master01 argorollout-examples]# cat <<EOF >> argorollout-canary-analysis.yaml

apiVersion: argoproj.io/v1alpha1

kind: Application

metadata:

  name: argorollout-analysis-traffic

  namespace: argocd

spec:

  destination:

    name: ''

    namespace: default

    server: 'https://kubernetes.default.svc'

  source:

    path: canary-analysis

    repoURL: 'http://gitlab.example.com/demoteam/argocd-example-apps.git'

    targetRevision: main

  sources: []

  project: default

EOF

Put the files above into the repository http://gitlab.example.com/demoteam/argocd-example-apps.git

[root@master01 ~]# cd /root/17/argo-rollouts/argorollout-examples

[root@master01 argorollout-examples]# git init

# Add the remote repository

[root@master01 argorollout-examples]# git remote add origin http://gitlab.example.com/demoteam/argocd-example-apps.git

# Verify the remote

[root@master01 argorollout-examples]# git remote -v

origin  http://gitlab.example.com/demoteam/argocd-example-apps.git (fetch)

origin  http://gitlab.example.com/demoteam/argocd-example-apps.git (push)

# Stage the files

[root@master01 argorollout-examples]# git add .

# Commit to the local repository

[root@master01 argorollout-examples]# git commit -m "third for argocd-example-apps"

# Rename the current branch to main

[root@master01 argorollout-examples]# git branch -M main

# Push to the main branch (-f force-pushes and overwrites the remote history)

[root@master01 argorollout-examples]# git push -uf origin main

Username for 'http://gitlab.example.com': root

Password for 'http://root@gitlab.example.com': <gitlab-password>

Create the application

[root@master01 ~]# cd /root/17/argo-rollouts/argorollout-examples

[root@master01 argorollout-examples]# kubectl apply -f argorollout-canary-analysis.yaml

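Open https://argocd.example.com/ in a browser to reach the Argo CD UI, then click [SYNC] and [SYNCHRONIZE] to sync manually.

[Figure: screenshot image-20250424205709987]

View the resource tree:

[Figure: Day17-ArgoCD figure 49]

View the network flow:

[Figure: Day17-ArgoCD figure 50]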

4. Configure ingress-nginx metrics

4.1 Enable the metrics port

Option 1:

Add the metrics port to the controller Deployment:

[root@master01 ~]# kubectl patch deployment ingress-nginx-controller -n ingress-nginx --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/ports/-", "value": {"name": "prometheus","containerPort":10254}}]'

Expose the metrics port on the Service so the metrics can be scraped:

[root@master01 ~]# kubectl patch service ingress-nginx-controller -n ingress-nginx --type='json' -p='[{"op": "add", "path": "/spec/ports/-", "value": {"name": "prometheus","port":10254,"targetPort":"prometheus"}}]'
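
Both commands send a JSON Patch (RFC 6902) "add" operation whose path ends in "-", which means "append to the end of the array". The Python sketch below applies such an operation to a plain dictionary to show the effect; it handles only the "add" operation, and the function apply_add and the sample Service document are hypothetical.

```python
def apply_add(document, path, value):
    """Apply one JSON Patch 'add' operation to nested dicts and lists, in place."""
    # Split the JSON Pointer and undo its escaping (~1 -> "/", ~0 -> "~").
    tokens = [t.replace("~1", "/").replace("~0", "~")
              for t in path.lstrip("/").split("/")]
    target = document
    for token in tokens[:-1]:
        target = target[int(token)] if isinstance(target, list) else target[token]
    last = tokens[-1]
    if isinstance(target, list):
        if last == "-":
            target.append(value)  # "-" appends after the last element
        else:
            target.insert(int(last), value)
    else:
        target[last] = value
    return document


# Simplified stand-in for the Service before patching.
service = {"spec": {"ports": [{"name": "http", "port": 80}]}}
patch = [{"op": "add", "path": "/spec/ports/-",
          "value": {"name": "prometheus", "port": 10254, "targetPort": "prometheus"}}]
for operation in patch:
    apply_add(service, operation["path"], operation["value"])
print(service["spec"]["ports"])
```

Because the operation appends, running the same patch twice would add the port twice, so it is worth checking the current ports before patching.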

Option 2, for Helm-based installations:

Edit the values.yaml file:

[root@master01 ~]# cd /root/6/ingress-nginx

[root@master01 ingress-nginx]# vim values.yaml 

# Change line 655

655     enabled: true

metrics:

  port: 10254

  portName: metrics

  # if this port is changed, change healthz-port: in extraArgs: accordingly

  enabled: true

  service:

    annotations: {}

    # prometheus.io/scrape: "true"

    # prometheus.io/port: "10254"

    # -- Labels to be added to the metrics service resource

    labels: {}

...

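# Full configuration file (the egrep filter below drops blank lines and every line that contains "#", which is likely why some expr bodies in the rules appear empty or truncated)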

[root@master01 ingress-nginx]# egrep -v "#|^$" values.yaml 

commonLabels: {}

controller:

  name: controller

  image:

    chroot: false

    registry: registry.cn-hangzhou.aliyuncs.com 

    image: google_containers/nginx-ingress-controller

    tag: "v1.7.0"

    digestChroot: sha256:e84ef3b44c8efeefd8b0aa08770a886bfea1f04c53b61b4ba9a7204e9f1a7edc

    pullPolicy: IfNotPresent

    runAsUser: 101

    allowPrivilegeEscalation: true

  existingPsp: ""

  containerName: controller

  containerPort:

    http: 80

    https: 443

  config:

    load-balance: "round_robin" 

  configAnnotations: {}

  proxySetHeaders: {}

  addHeaders: {}

  dnsConfig: {}

  hostname: {}

  dnsPolicy: ClusterFirstWithHostNet 

  reportNodeInternalIp: false

  watchIngressWithoutClass: false

  ingressClassByName: false

  enableTopologyAwareRouting: false

  allowSnippetAnnotations: true

  hostNetwork: true

  hostPort:

    enabled: false

    ports:

      http: 80

      https: 443

  electionID: ""

  ingressClassResource:

    name: nginx

    enabled: true

    default: false

    controllerValue: "k8s.io/ingress-nginx"

    parameters: {}

  ingressClass: nginx

  podLabels: {}

  podSecurityContext: {}

  sysctls: {}

  publishService:

    enabled: true

    pathOverride: ""

  scope:

    enabled: false

    namespace: ""

    namespaceSelector: ""

  configMapNamespace: ""

  tcp:

    configMapNamespace: ""

    annotations: {}

  udp:

    configMapNamespace: ""

    annotations: {}

  maxmindLicenseKey: ""

  extraArgs: {}

  extraEnvs: []

  kind: DaemonSet

  annotations: {}

  labels: {}

  updateStrategy: {}

  minReadySeconds: 0

  tolerations: []

  affinity: {}

  topologySpreadConstraints: []

  terminationGracePeriodSeconds: 300

  nodeSelector:

    kubernetes.io/os: linux

    ingress: "true"

  livenessProbe:

    httpGet:

      path: "/healthz"

      port: 10254

      scheme: HTTP

    initialDelaySeconds: 10

    periodSeconds: 10

    timeoutSeconds: 1

    successThreshold: 1

    failureThreshold: 5

  readinessProbe:

    httpGet:

      path: "/healthz"

      port: 10254

      scheme: HTTP

    initialDelaySeconds: 10

    periodSeconds: 10

    timeoutSeconds: 1

    successThreshold: 1

    failureThreshold: 3

  healthCheckPath: "/healthz"

  healthCheckHost: ""

  podAnnotations: {}

  replicaCount: 1

  minAvailable: 1

  resources:

    requests:

      cpu: 100m

      memory: 90Mi

  autoscaling:

    apiVersion: autoscaling/v2

    enabled: false

    annotations: {}

    minReplicas: 1

    maxReplicas: 11

    targetCPUUtilizationPercentage: 50

    targetMemoryUtilizationPercentage: 50

    behavior: {}

  autoscalingTemplate: []

  keda:

    apiVersion: "keda.sh/v1alpha1"

    enabled: false

    minReplicas: 1

    maxReplicas: 11

    pollingInterval: 30

    cooldownPeriod: 300

    restoreToOriginalReplicaCount: false

    scaledObject:

      annotations: {}

    triggers: []

    behavior: {}

  enableMimalloc: true

  customTemplate:

    configMapName: ""

    configMapKey: ""

  service:

    enabled: true

    appProtocol: true

    annotations: {}

    labels: {}

    externalIPs: []

    loadBalancerIP: ""

    loadBalancerSourceRanges: []

    enableHttp: true

    enableHttps: true

    ipFamilyPolicy: "SingleStack"

    ipFamilies:

      - IPv4

    ports:

      http: 80

      https: 443

    targetPorts:

      http: http

      https: https

    type: LoadBalancer

    nodePorts:

      http: ""

      https: ""

      tcp: {}

      udp: {}

    external:

      enabled: true

    internal:

      enabled: false

      annotations: {}

      loadBalancerSourceRanges: []

  shareProcessNamespace: false

  extraContainers: []

  extraVolumeMounts: []

  extraVolumes: []

  extraInitContainers: 

  - name: sysctl

    image: registry.cn-hangzhou.aliyuncs.com/abroad_images/alpine:3.10 

    imagePullPolicy: IfNotPresent

    command:

      - sh

      - -c

      - |

        mount -o remount rw /proc/sys

        sysctl -w net.core.somaxconn=65535

        sysctl -w net.ipv4.tcp_tw_reuse=1

        sysctl -w net.ipv4.ip_local_port_range="1024 65535"

        sysctl -w fs.file-max=1048576

        sysctl -w fs.inotify.max_user_instances=16384

        sysctl -w fs.inotify.max_user_watches=524288

        sysctl -w fs.inotify.max_queued_events=16384

    securityContext:

      privileged: true

  extraModules: []

  opentelemetry:

    enabled: false

    image: registry.k8s.io/ingress-nginx/opentelemetry:v20230312-helm-chart-4.5.2-28-g66a760794@sha256:40f766ac4a9832f36f217bb0e98d44c8d38faeccbfe861fbc1a76af7e9ab257f

    containerSecurityContext:

      allowPrivilegeEscalation: false

  admissionWebhooks:

    annotations: {}

    enabled: true

    extraEnvs: []

    failurePolicy: Fail

    port: 8443

    certificate: "/usr/local/certificates/cert"

    key: "/usr/local/certificates/key"

    namespaceSelector: {}

    objectSelector: {}

    labels: {}

    existingPsp: ""

    networkPolicyEnabled: false

    service:

      annotations: {}

      externalIPs: []

      loadBalancerSourceRanges: []

      servicePort: 443

      type: ClusterIP

    createSecretJob:

      securityContext:

        allowPrivilegeEscalation: false

      resources: {}

    patchWebhookJob:

      securityContext:

        allowPrivilegeEscalation: false

      resources: {}

    patch:

      enabled: true

      image:

        registry: registry.cn-hangzhou.aliyuncs.com 

        image: google_containers/kube-webhook-certgen

        tag: v20230312-helm-chart-4.5.2-28-g66a760794

        pullPolicy: IfNotPresent

      priorityClassName: ""

      podAnnotations: {}

      nodeSelector:

        kubernetes.io/os: linux

      tolerations: []

      labels: {}

      securityContext:

        runAsNonRoot: true

        runAsUser: 2000

        fsGroup: 2000

    certManager:

      enabled: false

      rootCert:

        duration: ""

      admissionCert:

        duration: ""

  metrics:

    port: 10254

    portName: metrics

    enabled: true

    service:

      annotations: 

        prometheus.io/scrape: "true"

        prometheus.io/port: "10254"

      labels: {}

      externalIPs: []

      loadBalancerSourceRanges: []

      servicePort: 10254

      type: ClusterIP

    serviceMonitor:

      enabled: false

      additionalLabels: {}

      namespace: ""

      namespaceSelector: {}

      scrapeInterval: 30s

      targetLabels: []

      relabelings: []

      metricRelabelings: []

    prometheusRule:

      enabled: true

      additionalLabels: {}

      namespace: "monitoring"

      rules:

        - alert: NginxFailedReload

          expr: nginx_ingress_controller_config_last_reload_successful == 0

          for: 1m

          labels:

            severity: critical

          annotations:

            summary: "Nginx配置重载失败"

            description: "Nginx Ingress Controller配置重载失败,实例: {{ $labels.instance }}"

        - alert: HighHttp4xxRate

          expr: |

            sum(rate(nginx_ingress_controller_requests{status=~"4.."}[5m])) by (host, namespace)

            /

            sum(rate(nginx_ingress_controller_requests[5m])) by (host, namespace)

          for: 5m

          labels:

            severity: warning

          annotations:

            summary: "HTTP 4xx错误率过高 ({{ $value }}%)"

            description: "命名空间 {{ $labels.namespace }} 主机 {{ $labels.host }}"

        - alert: HighHttp5xxRate

          expr: |

            sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (host, namespace)

            /

            sum(rate(nginx_ingress_controller_requests[5m])) by (host, namespace)

            * 100 > 1

          for: 2m

          labels:

            severity: critical

          annotations:

            summary: "HTTP 5xx错误率过高 ({{ $value }}%)"

            description: "命名空间 {{ $labels.namespace }} 主机 {{ $labels.host }}"

        - alert: HighLatency

          expr: |

            histogram_quantile(0.99,

              sum by (le, host, namespace) (

                rate(nginx_ingress_controller_request_duration_seconds_bucket[2m])

              )

          for: 5m

          labels:

            severity: warning

          annotations:

            summary: "高延迟请求 (p99: {{ $value }}秒)"

            description: "命名空间 {{ $labels.namespace }} 主机 {{ $labels.host }}"

        - alert: HighRequestRate

          expr: |

          for: 5m

          labels:

            severity: warning

          annotations:

            summary: "高请求速率 ({{ $value }} req/s)"

            description: "实例 {{ $labels.instance }}"

        - alert: SSLCertExpiring15d

          expr: |

          for: 1h

          labels:

            severity: warning

          annotations:

            summary: "SSL证书即将过期 ({{ $labels.host }})"

            description: "证书 {{ $labels.secret_name }} 将在15天内过期 (剩余: {{ $value | humanizeDuration }})"

        - alert: SSLCertExpiring7d

          expr: |

          for: 1h

          labels:

            severity: critical

          annotations:

            summary: "SSL证书即将过期 ({{ $labels.host }})"

            description: "证书 {{ $labels.secret_name }} 将在7天内过期 (剩余: {{ $value | humanizeDuration }})"

  lifecycle:

    preStop:

      exec:

        command:

          - /wait-shutdown

  priorityClassName: ""

revisionHistoryLimit: 10

defaultBackend:

  enabled: false

  name: defaultbackend

  image:

    registry: registry.k8s.io

    image: defaultbackend-amd64

    tag: "1.5"

    pullPolicy: IfNotPresent

    runAsUser: 65534

    runAsNonRoot: true

    readOnlyRootFilesystem: true

    allowPrivilegeEscalation: false

  existingPsp: ""

  extraArgs: {}

  serviceAccount:

    create: true

    name: ""

    automountServiceAccountToken: true

  extraEnvs: []

  port: 8080

  livenessProbe:

    failureThreshold: 3

    initialDelaySeconds: 30

    periodSeconds: 10

    successThreshold: 1

    timeoutSeconds: 5

  readinessProbe:

    failureThreshold: 6

    initialDelaySeconds: 0

    periodSeconds: 5

    successThreshold: 1

    timeoutSeconds: 5

  updateStrategy: {}

  minReadySeconds: 0

  tolerations: []

  affinity: {}

  podSecurityContext: {}

  containerSecurityContext: {}

  podLabels: {}

  nodeSelector:

    kubernetes.io/os: linux

  podAnnotations: {}

  replicaCount: 1

  minAvailable: 1

  resources: {}

  extraVolumeMounts: []

  extraVolumes: []

  autoscaling:

    apiVersion: autoscaling/v2

    annotations: {}

    enabled: false

    minReplicas: 1

    maxReplicas: 2

    targetCPUUtilizationPercentage: 50

    targetMemoryUtilizationPercentage: 50

  service:

    annotations: {}

    externalIPs: []

    loadBalancerSourceRanges: []

    servicePort: 80

    type: ClusterIP

  priorityClassName: ""

  labels: {}

rbac:

  create: true

  scope: false

podSecurityPolicy:

  enabled: false

serviceAccount:

  create: true

  name: ""

  automountServiceAccountToken: true

  annotations: {}

imagePullSecrets: []

tcp: {}

udp: {}

portNamePrefix: ""

dhParam: ""

Upgrade the ingress-nginx release:

[root@master01 ~]# helm upgrade ingress-nginx ./ingress-nginx -f ./ingress-nginx/values.yaml -n ingress-nginx 

Verify

# List the pods (kgp and kgs are assumed to be shell aliases for kubectl get pods and kubectl get svc)

[root@master01 ingress-nginx]# kgp -ningress-nginx

NAME                             READY   STATUS    RESTARTS       AGE

ingress-nginx-controller-8lnp9   1/1     Running   11 (10h ago)   11d

ingress-nginx-controller-wx2hd   1/1     Running   11 (10h ago)   11d

# List the Services; port 10254 is exposed

[root@master01 ingress-nginx]# kgs -ningress-nginx | grep metrics

ingress-nginx-controller-metrics     ClusterIP      <cluster-ip>    <none>        10254/TCP                    11d

# View the exposed metrics

[root@master01 ~]# curl <cluster-ip>:10254/metrics

...

...

promhttp_metric_handler_requests_total{code="200"} 7377

promhttp_metric_handler_requests_total{code="500"} 0

promhttp_metric_handler_requests_total{code="503"} 0
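
Each line of that output follows the Prometheus text exposition format: a metric name, optional labels in braces, and a value. The Python sketch below parses one such line; it is a simplified illustration that ignores escaped quotes inside label values, comment lines, and optional timestamps. The function parse_sample is hypothetical.

```python
import re

# metric_name{label="value",...} value
SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line: str):
    """Split one exposition-format line into (name, labels, value)."""
    match = SAMPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


sample = parse_sample('promhttp_metric_handler_requests_total{code="200"} 7377')
print(sample)
```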

4.2 Configure metrics scraping

Configure Prometheus so that it can scrape the ingress-nginx metrics:

1) Single YAML file:

[root@master01 ~]# cd /root/7

[root@master01 7]# vim prometheus-config.yaml 

# Add the following configuration

    ########## Ingress monitoring configuration ##########

    - job_name: 'ingress-nginx-endpoints'

      kubernetes_sd_configs:

      - role: pod

        namespaces:

          names:

          - ingress-nginx

      relabel_configs:

      - source_labels: [__meta_kubernetes_pod_container_port_number]

        action: keep

        regex: "10254"
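
The relabel rule in this job keeps only the discovered targets whose container port is 10254. The Python sketch below imitates that keep action on a hand-made target list; the function keep and the sample targets are hypothetical. Prometheus anchors relabel regexes at both ends, which re.fullmatch reproduces.

```python
import re


def keep(targets, source_label, regex):
    """Return only the targets whose source label fully matches the regex."""
    pattern = re.compile(regex)
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]


PORT_LABEL = "__meta_kubernetes_pod_container_port_number"

# Hand-made discovery results: one entry per container port of a pod.
discovered = [
    {"pod": "ingress-nginx-controller-8lnp9", PORT_LABEL: "80"},
    {"pod": "ingress-nginx-controller-8lnp9", PORT_LABEL: "443"},
    {"pod": "ingress-nginx-controller-8lnp9", PORT_LABEL: "10254"},
    {"pod": "some-other-pod", PORT_LABEL: "102540"},
]

kept = keep(discovered, PORT_LABEL, "10254")
print(kept)
```

Only the entry with port 10254 survives; the port 102540 is dropped because the match is anchored rather than a substring search.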

# Full configuration file

[root@master01 7]# cat  prometheus-config.yaml 

apiVersion: v1

kind: ConfigMap

metadata:

  name: prometheus-config

  namespace: monitor

data:

  prometheus.yml: |

    global:

      scrape_interval:     15s

      evaluation_interval: 15s

      external_labels:

        cluster: "kubernetes"

    ############ Alertmanager server address ###################

    alerting:

      alertmanagers:

      - static_configs:

        - targets: ["alertmanager:9093"]         

    ############ Scrape jobs ###################

    scrape_configs:

    ########## Ingress monitoring configuration ##########

    - job_name: 'ingress-nginx-endpoints'

      kubernetes_sd_configs:

      - role: pod

        namespaces:

          names:

          - ingress-nginx

      relabel_configs:

      - source_labels: [__meta_kubernetes_pod_container_port_number]

        action: keep

        regex: "10254"

    ########## Argo CD monitoring configuration ##########

    - job_name: 'argocd-metrics'

      static_configs:

        - targets: ['argocd-metrics.argocd.svc.cluster.local:8082']

    - job_name: 'argocd-server-metrics'

      static_configs:

        - targets: ['argocd-server-metrics.argocd.svc.cluster.local:8083']

    - job_name: 'argocd-repo-server-metrics'

      static_configs:

        - targets: ['argocd-repo-server.argocd.svc.cluster.local:8084']

    ########## prometheus monitoring configuration ##########

    - job_name: prometheus

      static_configs:

      - targets: ['127.0.0.1:9090']

        labels:

          instance: prometheus

    ########## apisix monitoring configuration ##########

    - job_name: "apisix"

      scrape_interval: 15s 

      metrics_path: "/apisix/prometheus/metrics"

      static_configs:

      - targets: [metrics.example.com]

    ########## minio monitoring configuration ##########

    - job_name: minio-job

      bearer_token: <prometheus-bearer-token>

      metrics_path: /minio/v2/metrics/cluster

      scheme: http

      static_configs:

        - targets: [s3.example.com]

    ########## kube-apiserver monitoring configuration ##########

    - job_name: kube-apiserver

      kubernetes_sd_configs:

      - role: endpoints

      scheme: https

      tls_config:

        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      relabel_configs:

      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name]

        action: keep

        regex: default;kubernetes

      - source_labels: [__meta_kubernetes_endpoints_name]

        action: replace

        target_label: endpoint

      - source_labels: [__meta_kubernetes_pod_name]

        action: replace

        target_label: pod

      - source_labels: [__meta_kubernetes_service_name]

        action: replace

        target_label: service

      - source_labels: [__meta_kubernetes_namespace]

        action: replace

        target_label: namespace

    ########## kube-controller-manager monitoring configuration ##########

    - job_name: 'kube-controller-manager'

      # Use Kubernetes pod discovery

      kubernetes_sd_configs:

        - role: pod

      # Force HTTPS

      scheme: https

      # TLS configuration (certificate verification is skipped; test environments only)

      tls_config:

        insecure_skip_verify: true

      # Authenticate with the ServiceAccount token

      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      relabel_configs:

        # Keep only pods labeled component=kube-controller-manager

        - source_labels: [__meta_kubernetes_pod_label_component]

          regex: kube-controller-manager

          action: keep

        # Rewrite the target address to <pod IP>:10257

        - source_labels: [__meta_kubernetes_pod_ip]

          regex: (.+)

          target_label: __address__

          replacement: "${1}:10257"

        # Force HTTPS (redundant but explicit)

        - source_labels: []

          regex: .*

          target_label: __scheme__

          replacement: https

        # Attach metadata labels

        - source_labels: [__meta_kubernetes_endpoints_name]

          action: replace

          target_label: endpoint

        - source_labels: [__meta_kubernetes_pod_name]

          action: replace

          target_label: pod

        - source_labels: [__meta_kubernetes_service_name]

          action: replace

          target_label: service

        - source_labels: [__meta_kubernetes_namespace]

          action: replace

          target_label: namespace                

    ########## kube-scheduler monitoring configuration ##########

    - job_name: 'kube-scheduler'

      kubernetes_sd_configs:

        - role: pod

      scheme: https

      tls_config:

        insecure_skip_verify: true

      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      relabel_configs:

        - source_labels: [__meta_kubernetes_pod_label_component]

          regex: kube-scheduler

          action: keep

        - source_labels: [__meta_kubernetes_pod_ip]

          regex: (.+)

          target_label: __address__

          replacement: "${1}:10259"

        - source_labels: []

          regex: .*

          target_label: __scheme__

          replacement: https

        - source_labels: [__meta_kubernetes_endpoints_name]

          action: replace

          target_label: endpoint

        - source_labels: [__meta_kubernetes_pod_name]

          action: replace

          target_label: pod

        - source_labels: [__meta_kubernetes_service_name]

          action: replace

          target_label: service

        - source_labels: [__meta_kubernetes_namespace]

          action: replace

          target_label: namespace

    ########## kube-state-metrics monitoring configuration ##########

    - job_name: kube-state-metrics

      kubernetes_sd_configs:

      - role: endpoints

      relabel_configs:

      - source_labels: [__meta_kubernetes_service_name]

        regex: kube-state-metrics

        action: keep

      - source_labels: [__meta_kubernetes_pod_ip]

        regex: (.+)

        target_label: __address__

        replacement: ${1}:8080

      - source_labels: [__meta_kubernetes_endpoints_name]

        action: replace

        target_label: endpoint

      - source_labels: [__meta_kubernetes_pod_name]

        action: replace

        target_label: pod

      - source_labels: [__meta_kubernetes_service_name]

        action: replace

        target_label: service

      - source_labels: [__meta_kubernetes_namespace]

        action: replace

        target_label: namespace

    ########## coredns monitoring configuration ##########

    - job_name: coredns

      kubernetes_sd_configs:

      - role: endpoints

      relabel_configs:

      - source_labels:

          - __meta_kubernetes_service_label_k8s_app

        regex: kube-dns

        action: keep

      - source_labels: [__meta_kubernetes_pod_ip]

        regex: (.+)

        target_label: __address__

        replacement: ${1}:9153

      - source_labels: [__meta_kubernetes_endpoints_name]

        action: replace

        target_label: endpoint

      - source_labels: [__meta_kubernetes_pod_name]

        action: replace

        target_label: pod

      - source_labels: [__meta_kubernetes_service_name]

        action: replace

        target_label: service

      - source_labels: [__meta_kubernetes_namespace]

        action: replace

        target_label: namespace

    ########## etcd monitoring configuration ##########

    - job_name: etcd

      kubernetes_sd_configs:

      - role: pod

      relabel_configs:

      - source_labels:

          - __meta_kubernetes_pod_label_component

        regex: etcd

        action: keep

      - source_labels: [__meta_kubernetes_pod_ip]

        regex: (.+)

        target_label: __address__

        replacement: ${1}:2381

      - source_labels: [__meta_kubernetes_endpoints_name]

        action: replace

        target_label: endpoint

      - source_labels: [__meta_kubernetes_pod_name]

        action: replace

        target_label: pod

      - source_labels: [__meta_kubernetes_namespace]

        action: replace

        target_label: namespace

    ########## kubelet monitoring configuration ##########

    - job_name: kubelet

      metrics_path: /metrics/cadvisor

      scheme: https

      tls_config:

        insecure_skip_verify: true

      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      kubernetes_sd_configs:

      - role: node

      relabel_configs:

      - action: labelmap

        regex: __meta_kubernetes_node_label_(.+)

      - source_labels: [__meta_kubernetes_endpoints_name]

        action: replace

        target_label: endpoint

      - source_labels: [__meta_kubernetes_pod_name]

        action: replace

        target_label: pod

      - source_labels: [__meta_kubernetes_namespace]

        action: replace

        target_label: namespace   

    ########## k8s-node monitoring configuration ##########

    - job_name: k8s-nodes

      kubernetes_sd_configs:

      - role: node

      relabel_configs:

      - source_labels: [__address__]

        regex: '(.*):10250'

        replacement: '${1}:9100'

        target_label: __address__

        action: replace

      - action: labelmap

        regex: __meta_kubernetes_node_label_(.+)

      - source_labels: [__meta_kubernetes_endpoints_name]

        action: replace

        target_label: endpoint

      - source_labels: [__meta_kubernetes_pod_name]

        action: replace

        target_label: pod

      - source_labels: [__meta_kubernetes_namespace]

        action: replace

        target_label: namespace 

    ########## DNS monitoring configuration ##########

    - job_name: "kubernetes-dns"

      metrics_path: /probe              # the path is /probe, not /metrics

      params:

        module: [dns_tcp]               # use the DNS TCP module

      static_configs:

        - targets:

          - kube-dns.kube-system:53             # do not omit the port number

          - 8.8.4.4:53

          - 8.8.8.8:53

          - 223.5.5.5:53

      relabel_configs:

        - source_labels: [__address__]

          target_label: __param_target

        - source_labels: [__param_target]

          target_label: instance

        - target_label: __address__

          replacement: blackbox-exporter.monitor:9115 # Service address; must match the Service definition above

    ########## ICMP monitoring configuration ##########

    - job_name: icmp-status

      metrics_path: /probe

      params:

        module: [icmp]

      static_configs:

      - targets:

        - <node-ip>

        labels:

          group: icmp

      relabel_configs:

      - source_labels: [__address__]

        target_label: __param_target

      - source_labels: [__param_target]

        target_label: instance

      - target_label: __address__

        replacement: blackbox-exporter.monitor:9115

    ########## HTTP monitoring configuration ##########

    - job_name: 'kubernetes-services'

      metrics_path: /probe

      params:

        module:         ## use the http_get_2xx and http_get_3xx modules

        - "http_get_2xx"

        - "http_get_3xx"

      kubernetes_sd_configs:            ## use Kubernetes dynamic service discovery with the service role

      - role: service

      relabel_configs:          ## only probe Services whose annotations include prometheus.io/http_probe: true

      - action: keep

        source_labels: [__meta_kubernetes_service_annotation_prometheus_io_http_probe]

        regex: "true"

      - action: replace

        source_labels: 

        - "__meta_kubernetes_service_name"

        - "__meta_kubernetes_namespace"

        - "__meta_kubernetes_service_annotation_prometheus_io_http_probe_port"

        - "__meta_kubernetes_service_annotation_prometheus_io_http_probe_path"

        target_label: __param_target

        regex: (.+);(.+);(.+);(.+)

        replacement: $1.$2:$3$4

      - target_label: __address__

        replacement: blackbox-exporter.monitor:9115             ## Service address of the BlackBox Exporter

      - source_labels: [__param_target]

        target_label: instance

      - action: labelmap

        regex: __meta_kubernetes_service_label_(.+)

      - source_labels: [__meta_kubernetes_namespace]

        target_label: kubernetes_namespace

      - source_labels: [__meta_kubernetes_service_name]

        target_label: kubernetes_name   

    ########## TCP monitoring configuration ##########

    - job_name: "service-tcp-probe"

      scrape_interval: 1m

      metrics_path: /probe

      # Use the tcp_connect prober defined in the blackbox exporter configuration file

      params:

        module: [tcp_connect]

      kubernetes_sd_configs:

      - role: service

      relabel_configs:

      # Keep only Services annotated with prometheus.io/scrape: "true" and prometheus.io/tcp-probe: "true"

      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape, __meta_kubernetes_service_annotation_prometheus_io_tcp_probe]

        action: keep

        regex: true;true

      # Copy __meta_kubernetes_service_name into the service_name label

      - source_labels: [__meta_kubernetes_service_name]

        action: replace

        regex: (.*)

        target_label: service_name

      # Copy __meta_kubernetes_namespace into the namespace label

      - source_labels: [__meta_kubernetes_namespace]

        action: replace

        regex: (.*)

        target_label: namespace

      # Set the probe target (and, via the next rule, instance) to `clusterIP:port`

      - source_labels: [__meta_kubernetes_service_cluster_ip, __meta_kubernetes_service_annotation_prometheus_io_http_probe_port]

        action: replace

        regex: (.*);(.*)

        target_label: __param_target

        replacement: $1:$2

      - source_labels: [__param_target]

        target_label: instance

      # Set __address__ to `blackbox-exporter.monitor:9115`

      - target_label: __address__

        replacement: blackbox-exporter.monitor:9115

    ########## Ingress monitoring configuration ##########

    - job_name: 'blackbox-k8s-ingresses'

      scrape_interval: 30s

      scrape_timeout: 10s

      metrics_path: /probe

      params:

        module: [http_get_2xx]  # use the configured HTTP module

      kubernetes_sd_configs:

      - role: ingress  # service discovery with the ingress role

      relabel_configs:

      # Only Ingresses annotated with prometheus.io/http_probe=true are discovered

      - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_http_probe]

        action: keep

        regex: true

      - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]

        regex: (.+);(.+);(.+)

        replacement: ${1}://${2}${3}

        target_label: __param_target

      - target_label: __address__

        replacement: blackbox-exporter.monitor:9115

      - source_labels: [__param_target]

        target_label: instance

      - action: labelmap

        regex: __meta_kubernetes_ingress_label_(.+)

      - source_labels: [__meta_kubernetes_namespace]

        target_label: kubernetes_namespace

      - source_labels: [__meta_kubernetes_ingress_name]

        target_label: kubernetes_name

    ########## External domain monitoring configuration ##########

    - job_name: "blackbox-external-website"

      scrape_interval: 30s

      scrape_timeout: 15s

      metrics_path: /probe

      params:

        module: [http_get_2xx]

      static_configs:

      - targets:

        - https://www.baidu.com # replace with your company's public-facing domain

        - https://www.jd.com

      relabel_configs:

      - source_labels: [__address__]

        target_label: __param_target

      - source_labels: [__param_target]

        target_label: instance

      - target_label: __address__

        replacement: blackbox-exporter.monitor:9115

    ########## Cloud ECS monitoring configuration ##########

    - job_name: 'other-ECS'

      static_configs:

        - targets: ['101.201.68.158:9100']

          labels:

            hostname: 'test-node-exporter'

    ########## Process monitoring configuration ##########

    - job_name: 'process-exporter'

      static_configs:

      - targets: ['<node-ip>:9256']

    ########## MySQL monitoring configuration ##########

    - job_name: 'mysql-exporter'

      static_configs:

      - targets: ['<node-ip>:9104']

    ########## Consul monitoring configuration ##########

    - job_name: consul

      honor_labels: true

      metrics_path: /metrics

      scheme: http

      consul_sd_configs:    #Consul-based service discovery

        - server: <node-ip>:18500    #Consul listen address

          services: []                 #match every service registered in Consul

      relabel_configs:             #everything below rewrites labels

      - source_labels: ['__meta_consul_tags']    #copy the value of __meta_consul_tags to servername

        target_label: 'servername'

      - source_labels: ['__meta_consul_dc']   #copy the value of __meta_consul_dc to idc

        target_label: 'idc'

      - source_labels: ['__meta_consul_service']   

        regex: "consul"  #match the service named "consul"

        action: drop       #drop the matching targets

    ############ Alerting rule file path ###################

    rule_files:

    - /etc/prometheus/rules/*.rules
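
Several jobs in the configuration above rewrite __address__ or __param_target with a replace rule and capture groups. The following Python sketch models that step under the documented relabeling semantics (source label values joined with ";", regex anchored at both ends, target left untouched when the regex does not match). The function name relabel_replace and the label values are hypothetical.

```python
import re


def relabel_replace(labels, source_labels, regex, target_label, replacement, separator=";"):
    """Model of a `replace` relabel rule. Returns a new label dict."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)
    if match is None:
        return dict(labels)  # no match: the target label is left unchanged
    # Translate Prometheus-style $1 / ${1} references into Python's \g<1>.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


# kube-controller-manager job: pod IP -> <pod IP>:10257 (hypothetical IP)
pod = {"__meta_kubernetes_pod_ip": "10.0.0.5"}
pod = relabel_replace(pod, ["__meta_kubernetes_pod_ip"], "(.+)", "__address__", "${1}:10257")

# kubernetes-services job: build the blackbox probe target (hypothetical Service)
svc = {
    "__meta_kubernetes_service_name": "web",
    "__meta_kubernetes_namespace": "demo",
    "__meta_kubernetes_service_annotation_prometheus_io_http_probe_port": "8080",
    "__meta_kubernetes_service_annotation_prometheus_io_http_probe_path": "/healthz",
}
svc = relabel_replace(
    svc,
    [
        "__meta_kubernetes_service_name",
        "__meta_kubernetes_namespace",
        "__meta_kubernetes_service_annotation_prometheus_io_http_probe_port",
        "__meta_kubernetes_service_annotation_prometheus_io_http_probe_path",
    ],
    "(.+);(.+);(.+);(.+)",
    "__param_target",
    "$1.$2:$3$4",
)
print(pod["__address__"], svc["__param_target"])
```

In this model a Service that lacks the port or path annotation produces an empty field, the regex does not match, and __param_target is never set.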

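Hot-reload Prometheus (the /-/reload endpoint is only served when Prometheus is started with --web.enable-lifecycle):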

[root@master01 7]# curl -XPOST http://prometheus.example.com/-/reload
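
The same reload can be triggered from a script. The sketch below uses only the Python standard library; the helper names are hypothetical, and it assumes the Prometheus address used in the curl command above.

```python
import urllib.request


def build_reload_request(base_url):
    """Build the POST request for Prometheus's /-/reload management endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url, timeout=10):
    """Send the reload request. urlopen raises urllib.error.HTTPError on a
    non-2xx answer, for example when the lifecycle API is not enabled."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status == 200


request = build_reload_request("http://prometheus.example.com/")
print(request.get_method(), request.full_url)
# reload_prometheus("http://prometheus.example.com")  # requires a reachable Prometheus
```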

Click [Targets] in the Prometheus UI and verify that the new target is up:

Day17-ArgoCD-图51

Click [Graph] and enter the following metric name to verify the data:

nginx_ingress_controller_requests

Day17-ArgoCD-图52
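
The same check can be scripted against the Prometheus HTTP API instead of the Graph page. The sketch below builds an instant-query URL for /api/v1/query and parses a vector response. The helper names and the sample response body are hypothetical, and the base URL is the one assumed elsewhere in this section.

```python
import json
import urllib.parse


def build_query_url(base_url, promql):
    """URL for an instant query against the Prometheus HTTP API."""
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(body):
    """Return (labels, value) pairs from an instant-query response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error"))
    # Each sample carries its value as [timestamp, "value-as-string"].
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


url = build_query_url("http://prometheus.example.com", "nginx_ingress_controller_requests")

# Hypothetical response body, shaped like a Prometheus instant-vector result.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"ingress": "argocd-server-ingress", "status": "200"},
             "value": [1714000000.0, "42"]},
        ],
    },
})
samples = parse_instant_vector(sample_body)
print(url)
print(samples)
```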

Open Grafana and import dashboard template 9614.


Open Grafana and import dashboard template 14314.


2) Data collection with prometheus-operator:

apiVersion: monitoring.coreos.com/v1

kind: ServiceMonitor

metadata:

  name: nginx-ingress-controller-metrics

  namespace: monitor

  labels:

    app: nginx-ingress

    release: prometheus-operator

spec:

  endpoints:

    - interval: 10s

      port: prometheus

  selector:

    matchLabels:

      app.kubernetes.io/instance: ingress-nginx

      app.kubernetes.io/name: ingress-nginx

  namespaceSelector:

    matchNames:

      - ingress-nginx
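
As a rough mental model of the ServiceMonitor above: a Service is scraped only if it lives in one of the listed namespaces, carries every label in matchLabels, and exposes a port with the name given under endpoints. The Python sketch below encodes that check. The Service objects are hypothetical, and the real operator logic is more involved than this.

```python
def would_scrape(monitor, service):
    """Simplified model of how a ServiceMonitor selects a Service."""
    in_namespace = service["namespace"] in monitor["namespaces"]
    labels_match = all(
        service["labels"].get(key) == value
        for key, value in monitor["match_labels"].items()
    )
    has_port = monitor["port"] in service["port_names"]
    return in_namespace and labels_match and has_port


# Values copied from the ServiceMonitor manifest above.
monitor = {
    "namespaces": ["ingress-nginx"],
    "match_labels": {
        "app.kubernetes.io/instance": "ingress-nginx",
        "app.kubernetes.io/name": "ingress-nginx",
    },
    "port": "prometheus",
}

# Hypothetical Services for illustration.
metrics_svc = {
    "namespace": "ingress-nginx",
    "labels": {
        "app.kubernetes.io/instance": "ingress-nginx",
        "app.kubernetes.io/name": "ingress-nginx",
    },
    "port_names": ["prometheus"],
}
other_svc = dict(metrics_svc, port_names=["http", "https"])

print(would_scrape(monitor, metrics_svc), would_scrape(monitor, other_svc))
```

If no target appears after applying the manifest, the port name on the metrics Service is worth checking first, since it has to equal the value under endpoints.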

4.3 Verify the metric data

Send some requests to a known Ingress domain, fill in the known Ingress name in the query below, and use it to verify the Prometheus data:

# Compute the success rate (as a percentage)

(

  sum by (ingress) (

    rate(nginx_ingress_controller_requests{

      ingress="argocd-server-ingress",

      status!~"^[45]\\d{2}$"  # exclude 4xx/5xx status codes

    }[60s])

  )

/

  sum by (ingress) (

    rate(nginx_ingress_controller_requests{

      ingress="argocd-server-ingress"

    }[60s])

  )

) * 100  
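
The arithmetic behind this query can be checked by hand. The sketch below is a simplified model: it treats rate() as the counter increase divided by the window length, ignoring Prometheus's extrapolation, and the per-status request counts are made up for illustration.

```python
import re

# Same pattern as the status!~ matcher; fullmatch provides the anchoring.
ERROR_STATUS = re.compile(r"[45]\d{2}")


def success_rate_percent(increase_by_status, window_seconds=60):
    """Share of requests whose status is not 4xx/5xx, as a percentage.

    increase_by_status maps a status code string to the counter increase
    observed during the window."""
    rates = {status: inc / window_seconds for status, inc in increase_by_status.items()}
    total = sum(rates.values())
    if total == 0:
        return None  # no traffic in the window, so there is no meaningful ratio
    ok = sum(rate for status, rate in rates.items() if not ERROR_STATUS.fullmatch(status))
    return ok / total * 100


# Hypothetical increases over a 60 s window: 588 of 600 requests succeed.
sample = {"200": 570, "302": 18, "404": 6, "500": 6}
rate = success_rate_percent(sample)
print(f"{rate:.2f}%")
```

The None branch mirrors a practical caveat: with no requests in the window the ratio is undefined, so some test traffic has to be flowing while the query is evaluated.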

Day17-ArgoCD-图53

5. Hands-on verification

5.1 Successful intelligent progressive delivery

Generate test traffic in advance:

[root@master01 ~]# for i in {1..5000}; do curl canary-analysis.example.com; done
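
When generating this traffic it helps to see which status codes come back, since those are what the analysis query counts. Below is a small Python sketch of the same loop that tallies status codes. The function names are hypothetical, and the fetch function is injected so that the counting logic can be exercised without a cluster.

```python
import urllib.error
import urllib.request
from collections import Counter


def http_status(url, timeout=5):
    """Return the HTTP status code for a GET request, including 4xx/5xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def generate_traffic(fetch_status, url, count):
    """Call fetch_status(url) `count` times and tally the status codes."""
    statuses = Counter()
    for _ in range(count):
        statuses[fetch_status(url)] += 1
    return statuses


# Dry run with a stand-in fetcher; pass http_status to send real requests.
tally = generate_traffic(lambda url: 200, "http://canary-analysis.example.com", 5)
print(dict(tally))
```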

Update the image with the kubectl plugin:

[root@master01 ~]# kubectl argo rollouts set image canary-rollouts-analysis-demo rollouts-analysis-demo=registry.cn-hangzhou.aliyuncs.com/zhdya/kubernetes-bootcamp:v2 -ndemo

Watch the rollout:

[root@master01 ~]# kubectl argo rollouts get rollout canary-rollouts-analysis-demo -ndemo --watch

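Open the dashboard and watch the automated progressive delivery. The rollout is now at the step that follows the 20% canary traffic step, which is the pause step.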

Day17-ArgoCD-图54

After promotion, the rollout moves on to the 40% canary traffic step:

[root@master01 ~]# kubectl argo rollouts promote canary-rollouts-analysis-demo -ndemo

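From this step on, the automated canary analysis is running. It continues until the canary release completes and the canary is promoted to production, at which point the analysis also finishes, as shown below: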

Day17-ArgoCD-图55

At this point, one complete automated progressive delivery has finished:

Day17-ArgoCD-图56

5.2 Failed intelligent progressive delivery (note: when simulating the failure, follow the video and adjust the setup to add extreme data)

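In the experiment above, every response from the application had HTTP status code 200, so the canary analysis naturally succeeded. Next, try an experiment in which the automated progressive delivery fails.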

Generate the failing requests in advance:

[root@master01 ~]#  for i in {1..66666}; do curl canary-analysis.example.com/-/aaaa; done

# At this point (v=1) every response is still 200

...

...

Hello Kubernetes bootcamp! | Running on: canary-rollouts-analysis-demo-7549df9c55-tsjm8 | v=1

Hello Kubernetes bootcamp! | Running on: canary-rollouts-analysis-demo-7549df9c55-z4pfm | v=1

Hello Kubernetes bootcamp! | Running on: canary-rollouts-analysis-demo-7549df9c55-z4pfm | v=1

Hello Kubernetes bootcamp! | Running on: canary-rollouts-analysis-demo-7549df9c55-tsjm8 | v=1

Update the image with the kubectl plugin:

[root@master01 ~]# kubectl argo rollouts set image canary-rollouts-analysis-demo rollouts-analysis-demo=registry.cn-hangzhou.aliyuncs.com/abroad_images/nginx:1.15.12 -ndemo

Watch the rollout:

[root@master01 ~]# kubectl argo rollouts get rollout canary-rollouts-analysis-demo -ndemo --watch

Click [Promote]:


After a while the canary analysis fails, as shown below:

Day17-ArgoCD-图57
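
Why the run ends up failed can be sketched with a few lines of Python. The sketch assumes the thresholds from the AnalysisTemplate example at the top of this document (success when result[0] >= 0.95, failureLimit: 3) and assumes that a run is marked failed once the number of failed measurements exceeds failureLimit. The measurement values are invented, and the real controller also handles errors, inconclusive results, and the count and interval fields.

```python
def evaluate_run(measurements, success_threshold=0.95, failure_limit=3):
    """Simplified AnalysisRun verdict over a sequence of success-rate samples."""
    failed = 0
    for value in measurements:
        if value >= success_threshold:
            continue  # this measurement satisfied the success condition
        failed += 1
        if failed > failure_limit:
            return "Failed"  # too many failed measurements: the rollout aborts
    return "Successful"


# Hypothetical success-rate samples (fractions, as the template's query returns).
healthy = [0.99, 0.98, 1.0, 0.97]
broken = [0.40, 0.35, 0.30, 0.25]

print(evaluate_run(healthy), evaluate_run(broken))
```

Under these assumptions three bad measurements are tolerated and the fourth one fails the run, which is what triggers the abort and rollback.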

Argo Rollouts then rolls back automatically and deletes the newly created pods:

Day17-ArgoCD-图58


At this point, the failed progressive delivery experiment is complete:

Day17-ArgoCD-图59

6. Summary

Argo Rollouts Analysis enables intelligent progressive delivery: a new version is monitored and evaluated while it is gradually rolled out in production.

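This significantly improves the safety and reliability of deployments and reduces the risk of service disruption caused by problems in a new version.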

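By flexibly configuring analysis templates and progressive delivery strategies, the deployment process can be tailored to different business requirements and environments.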