1. Overview of SkyWalking-based alerting

Official guide: apache/skywalking on GitHub

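How it works and which alert channels are supported:

  • At a fixed interval, the alarm module polls the trace and metrics data collected by skywalking-oap.
  • The data is checked against the configured alarm rules (for example service response time or service response time percentile); once a threshold is reached, the corresponding alarm message is sent.
  • Supported channels include plain webhook, WeChat Hook (WeCom alerts), Dingtalk Hook (DingTalk alerts) and Feishu Hook (Feishu alerts).
  • Alarm messages can also be viewed in the UI.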

2. Alarm rules

2.1 Default alarm rules

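In SkyWalking an alarm rule is called a rule. A default installation of the SkyWalking OAP server ships with an alarm rule configuration file, alarm-settings.yml, in the config folder under the installation directory: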

# kubectl -n devops exec -it skywalking-oap-5f45c8df5-49nn9  -- bash
bash-5.0# pwd
/skywalking
bash-5.0# cat config/alarm-settings.yml

2.2 Alarm rules

# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
...

webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/

2.3 Alarm rules in detail

rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.

The leading comment states that every rule name must be unique and must end with _rule; here the rule is service_resp_time_rule (service response time).

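  • metrics-name: the metric to alert on; its value must be of type long, double or int.
  • op: how the metric value is compared with the threshold; here it is "greater than".
  • threshold: the threshold; here 1000, in milliseconds.
  • period: the length of time over which the metric is evaluated, that is, the alarm check window, in minutes.
  • count: how many times the condition must be met within the window before the alarm fires.
  • silence-period: how long identical alarms are suppressed; it defaults to the same value as period. Put simply, once an alarm fires at time N, the same alarm stays silent until N + silence-period. This resembles alert suppression in Alertmanager.
  • message: the body of the alarm message; variables such as {name} are replaced automatically when the message is sent.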
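The interaction of period, count and silence-period can be sketched in a few lines of Python. This is a toy model under stated assumptions (one metric value per minute, a fixed ">" comparison, a hypothetical RuleWindow class), not the OAP server's implementation.

```python
from collections import deque


class RuleWindow:
    """Toy model of one alarm rule; hypothetical, not OAP code."""

    def __init__(self, threshold, period, count, silence_period):
        self.threshold = threshold
        self.count = count
        self.silence_period = silence_period
        self.values = deque(maxlen=period)  # keeps the last `period` minutes
        self.silence_left = 0               # minutes of silence remaining

    def add_minute(self, value):
        """Record one per-minute value; return True if an alarm fires."""
        self.values.append(value)
        if self.silence_left > 0:
            self.silence_left -= 1
            return False
        matches = sum(1 for v in self.values if v > self.threshold)
        if matches >= self.count:
            self.silence_left = self.silence_period
            return True
        return False


# Same numbers as service_resp_time_rule above: >1000 ms, 3 of the last 10 minutes.
rule = RuleWindow(threshold=1000, period=10, count=3, silence_period=5)
samples = [1200, 1300, 900, 1500, 1600, 1700, 1800, 1900, 2000, 2100]
fired = [minute for minute, value in enumerate(samples) if rule.add_minute(value)]
print(fired)  # [3, 9]
```

The alarm fires at minute 3 (third matching value), stays silent for the next five minutes, and fires again at minute 9 because the condition still holds.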

2.4 Advanced alarm rules

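  • 1. service_resp_time_rule: average service response time over the last X minutes exceeds X seconds
  • 2. service_sla_rule: service success rate over the last X minutes is lower than X%
  • 3. service_resp_time_percentile_rule: service response time percentile over the last X minutes exceeds X seconds
  • 4. service_instance_resp_time_rule: average response time of a service instance over the last X minutes exceeds X seconds
  • 5. database_access_resp_time_rule: average response time of database access over the last X minutes exceeds X seconds
  • 6. endpoint_relation_resp_time_rule: average response time of an endpoint relation over the last X minutes exceeds X seconds
  • 7. endpoint_avg_rule: average endpoint response time over the last X minutes exceeds X seconds (disabled by default; the official note says it consumes more memory)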
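The thresholds in these rules are not all in the same unit: the default service_resp_time_rule uses milliseconds (1000 for one second), while the default service_sla_rule uses 8000 for 80%, which implies a scale where 10000 means 100%. A small sketch of the conversion, with hypothetical helper names:

```python
def resp_time_threshold(seconds: float) -> int:
    """Response-time rules take milliseconds."""
    return round(seconds * 1000)


def sla_threshold(percent: float) -> int:
    """service_sla appears to use a 0..10000 scale (8000 = 80%)."""
    return round(percent * 100)


print(resp_time_threshold(1))  # 1000
print(sla_threshold(80))       # 8000
```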

3. Configuration

3.1 Enabling the feature

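Most SkyWalking settings come from the OAP server's application.yml and from environment variables; dynamic configuration through a Kubernetes ConfigMap is also supported.

According to the SkyWalking dynamic configuration documentation, once dynamic configuration is enabled, the key alarm.default.alarm-settings overrides the default alarm-settings.yml file.

The Helm chart already reserves a parameter for the alarm module, so it only needs to be switched on; no additional declaration in values.yaml is required.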

# The default template shipped with the chart, shown for reference only; this step can be skipped
[root@master01 ~]# vim /root/8/skywalking/templates/oap-configmap.yaml

{{- if .Values.oap.dynamicConfigEnabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: skywalking-dynamic-config
  labels:
    app: {{ template "skywalking.name" . }}
    release: {{ .Release.Name }}
    component: {{ .Values.oap.name }}
data:
{{- end }}

Enable it in values.yaml:

# On line 137, change dynamicConfigEnabled: false to dynamicConfigEnabled: true
vim /root/8/skywalking/values.yaml
...
...
oap:
  antiAffinity: soft
  dynamicConfigEnabled: true    # enable dynamic configuration

# Full configuration file
[root@master01 ~]# egrep -v "#|^$" /root/8/skywalking/values.yaml
elasticsearch:
  antiAffinity: hard
  antiAffinityTopologyKey: kubernetes.io/hostname
  clusterHealthCheckParams: wait_for_status=green&timeout=1s
  clusterName: elasticsearch
  config:
    host: elasticsearch
    password:
    port:
      http: 9200
    user: elastic
  enabled: false
  esConfig: {}
  esJavaOpts: -Xmx3g -Xms1g
  esMajorVersion: ""
  extraEnvs: []
  extraInitContainers: ""
  extraVolumeMounts: ""
  extraVolumes: ""
  fsGroup: ""
  fullnameOverride: ""
  httpPort: 9200
  image: registry.cn-hangzhou.aliyuncs.com/github_images1024/elasticsearch
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  imageTag: 7.17.3
  ingress:
    annotations: {}
    enabled: false
    hosts:
    - chart-example.local
    path: /
    tls: []
  initResources: {}
  keystore: []
  labels: {}
  lifecycle: {}
  masterService: ""
  masterTerminationFix: false
  maxUnavailable: 1
  minimumMasterNodes: 2
  nameOverride: ""
  networkHost: 0.0.0.0
  nodeAffinity: {}
  nodeGroup: master
  nodeSelector: {}
  persistence:
    annotations: {}
    enabled: true
  podAnnotations: {}
  podManagementPolicy: Parallel
  podSecurityContext:
    fsGroup: 1000
    runAsUser: 1000
  podSecurityPolicy:
    create: false
    name: ""
    spec:
      fsGroup:
        rule: RunAsAny
      privileged: true
      runAsUser:
        rule: RunAsAny
      seLinux:
        rule: RunAsAny
      supplementalGroups:
        rule: RunAsAny
      volumes:
      - secret
      - configMap
      - persistentVolumeClaim
  priorityClassName: ""
  protocol: http
  rbac:
    create: false
    serviceAccountName: ""
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 3
    timeoutSeconds: 5
  replicas: 1
  resources:
    limits:
      cpu: 1000m
      memory: 2Gi
    requests:
      cpu: 100m
      memory: 2Gi
  roles:
    data: "true"
    ingest: "true"
    master: "true"
  schedulerName: ""
  secretMounts: []
  securityContext:
    capabilities:
      drop:
      - ALL
    runAsNonRoot: true
    runAsUser: 1000
  service:
    annotations: {}
    httpPortName: http
    labels: {}
    labelsHeadless: {}
    nodePort: ""
    transportPortName: transport
    type: ClusterIP
  sidecarResources: {}
  sysctlInitContainer:
    enabled: true
  sysctlVmMaxMapCount: 262144
  terminationGracePeriod: 120
  tolerations: []
  transportPort: 9300
  updateStrategy: RollingUpdate
  volumeClaimTemplate:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 50Gi
esInit:
  nodeAffinity: {}
  nodeSelector: {}
  tolerations: []
fullnameOverride: ""
imagePullSecrets: []
initContainer:
  image: registry.cn-hangzhou.aliyuncs.com/abroad_images/busybox
  tag: "1.30"
nameOverride: ""
oap:
  antiAffinity: soft
  dynamicConfigEnabled: true
  env: null
  envoy:
    als:
      enabled: false
  image:
    pullPolicy: IfNotPresent
    repository: registry.cn-hangzhou.aliyuncs.com/github_images1024/skywalking-oap-server
    tag: 8.9.0
  initEs: true
  javaOpts: -Xmx2g -Xms2g
  name: oap
  nodeAffinity: {}
  nodeSelector: {}
  ports:
    grpc: 11800
    rest: 12800
  replicas: 1
  resources: {}
  service:
    type: ClusterIP
  storageType: elasticsearch
  tolerations: []
satellite:
  antiAffinity: soft
  enabled: false
  env: null
  image:
    pullPolicy: IfNotPresent
    repository: registry.cn-hangzhou.aliyuncs.com/github_images1024/skywalking-satellite
    tag: v1.2.0
  name: satellite
  nodeAffinity: {}
  nodeSelector: {}
  podAnnotations: null
  ports:
    grpc: 11800
    prometheus: 1234
  replicas: 1
  resources: {}
  service:
    type: ClusterIP
  tolerations: []
serviceAccounts:
  oap: null
ui:
  image:
    pullPolicy: IfNotPresent
    repository: registry.cn-hangzhou.aliyuncs.com/github_images1024/skywalking-ui
    tag: 8.9.0
  ingress:
    annotations: {}
    enabled: false
    hosts: []
    path: /
    tls: []
  name: ui
  nodeAffinity: {}
  nodeSelector: {}
  replicas: 1
  service:
    annotations: {}
    externalPort: 80
    internalPort: 8080
    type: ClusterIP
  tolerations: []

Edit templates/oap-configmap.yaml in the chart to configure the custom rules and the WeCom webhook:

# Redefine oap-configmap.yaml
[root@master01 ~]# vim /root/8/skywalking/templates/oap-configmap.yaml
{{- if .Values.oap.dynamicConfigEnabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: skywalking-dynamic-config
  labels:
    app: {{ template "skywalking.name" . }}
    release: {{ .Release.Name }}
    component: {{ .Values.oap.name }}
data:
  alarm.default.alarm-settings: |-
    rules:
      # Rule unique name, must be ended with `_rule`.
      service_resp_time_rule:
        metrics-name: service_resp_time
        op: ">"
        threshold: 2000
        period: 10
        count: 3
        silence-period: 5
        message: 服务:{name}\n 指标:响应时间\n 详情:至少3次超过2秒(最近10分钟内)
      service_sla_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_sla
        op: "<"
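        # NOTE: service_sla uses a 0..10000 scale, so 2000 means 20%; the 80% named in the message would be 8000.
        threshold: 2000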
        # The length of time to evaluate the metrics
        period: 10
        # How many times after the metrics match the condition, will trigger alarm
        count: 3
        # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
        silence-period: 3
        message: 服务:{name}\n 指标:成功率\n 详情:至少3次低于80%(最近10分钟内)
      service_resp_time_percentile_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_percentile
        op: ">"
        threshold: 1000,1000,1000,1000,1000
        period: 10
        count: 2
        silence-period: 5
        message: 服务:{name}\n 指标:响应时间\n 详情:至少3次百分位超过1秒(最近10分钟内)
      service_instance_resp_time_rule:
        metrics-name: service_instance_resp_time
        op: ">"
        threshold: 2000
        period: 10
        count: 2
        silence-period: 5
        message: 实例:{name}\n 指标:响应时间\n 详情:至少2次超过2秒(最近10分钟内)
      database_access_resp_time_rule:
        metrics-name: database_access_resp_time
        threshold: 2000
        op: ">"
        period: 10
        count: 2
        # message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
        message: 数据库访问:{name}\n 指标:响应时间\n 详情:至少2次超过2秒(最近10分钟内)
      endpoint_relation_resp_time_rule:
        metrics-name: endpoint_relation_resp_time
        threshold: 2000
        op: ">"
        period: 10
        count: 2
        message: 端点关系:{name}\n 指标:响应时间\n 详情:至少2次超过2秒(最近10分钟内)
      instance_jvm_old_gc_count_rule:
        metrics-name: instance_jvm_old_gc_count
        threshold: 1
        op: ">"
        period: 1440
        count: 1
        message: 实例:{name}\n 指标:OldGC次数\n 详情:最近1天内大于1次
      instance_jvm_young_gc_count_rule:
        metrics-name: instance_jvm_young_gc_count
        threshold: 1
        op: ">"
        period: 5
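        # NOTE: count (100) is larger than period (5), so the window can never hold enough matching minutes for this rule to fire as written.
        count: 100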
        message: 实例:{name}\n 指标:YoungGC次数\n 详情:最近5分钟内大于100次
    wechatHooks:
      textTemplate: |-
        {
          "msgtype": "text",
          "text": {
            "content": "SkyWalking 链路追踪告警: \n %s."
          }
        }
      webhooks:
        - https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=9d8866d6-ab55-48f3-8336-786325667640
{{- end }}
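Before running helm upgrade, the rules can be sanity-checked offline. The sketch below is a hypothetical helper, not part of SkyWalking or the chart: it takes the rules already loaded into a Python dict and checks only what this article states (names end with _rule, the fields used above are present). The count-versus-period check is an added assumption that at most one value per minute can match.

```python
REQUIRED_KEYS = ("metrics-name", "op", "threshold", "period", "count")


def check_rules(rules: dict) -> list:
    """Return a list of human-readable problems found in the rule set."""
    problems = []
    for name, rule in rules.items():
        if not name.endswith("_rule"):
            problems.append(f"{name}: name must end with _rule")
        missing = [key for key in REQUIRED_KEYS if key not in rule]
        if missing:
            problems.append(f"{name}: missing {', '.join(missing)}")
        elif rule["count"] > rule["period"]:
            # Assumption: one check per minute, so count cannot exceed period.
            problems.append(
                f"{name}: count {rule['count']} exceeds period {rule['period']}"
            )
    return problems


# Illustrative input; "demo_gc" is a deliberately broken, made-up rule.
sample = {
    "service_resp_time_rule": {
        "metrics-name": "service_resp_time", "op": ">",
        "threshold": 2000, "period": 10, "count": 3,
    },
    "demo_gc": {
        "metrics-name": "instance_jvm_young_gc_count", "op": ">",
        "threshold": 1, "period": 5, "count": 100,
    },
}
for problem in check_rules(sample):
    print(problem)
```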

After the changes, run helm to apply the update:

# Upgrade
[root@master01 ~]# cd /root/8/
[root@master01 8]# helm upgrade skywalking skywalking -n devops --values ./skywalking/values.yaml

[root@master01 8]# kgp -ndevops  | grep skywalking-oap
skywalking-oap-fb9b8fcd7-4jgkp   1/1     Running     0                38s

# Check the logs
$ kubectl logs -f skywalking-oap-fb9b8fcd7-4jgkp -ndevops
2023-06-21 06:40:00,742 org.apache.skywalking.oap.server.library.server.grpc.GRPCServer 142 [main] INFO  [] - Bind handler JVMMetricReportServiceHandler into gRPC server 0.0.0.0:11800
2023-06-21 06:40:00,747 org.apache.skywalking.oap.server.library.server.grpc.GRPCServer 142 [main] INFO  [] - Bind handler JVMMetricReportServiceHandlerCompat into gRPC server 0.0.0.0:11800
2023-06-21 06:40:00,749 org.apache.skywalking.oap.server.library.module.BootstrapFlow 46 [main] INFO  [] - start the provider default in receiver-meter module.
2023-06-21 06:40:00,750 org.apache.skywalking.oap.server.library.server.grpc.GRPCServer 142 [main] INFO  [] - Bind handler MeterServiceHandler into gRPC server 0.0.0.0:11800
2023-06-21 06:40:00,755 org.apache.skywalking.oap.server.library.server.grpc.GRPCServer 142 [main] INFO  [] - Bind handler MeterServiceHandlerCompat into gRPC server 0.0.0.0:11800
2023-06-21 06:40:00,757 org.apache.skywalking.oap.server.configuration.api.ConfigWatcherRegister 79 [main] INFO  [] - Current configurations after the bootstrap sync.
Following dynamic config items are available.
---------------------------------------------
key:core.default.log4j-xml    module:core    provider:default    value(current):null
key:agent-analyzer.default.uninstrumentedGateways    module:agent-analyzer    provider:default    value(current):null
key:configuration-discovery.default.agentConfigurations    module:configuration-discovery    provider:default    value(current):null
key:agent-analyzer.default.traceSamplingPolicy    module:agent-analyzer    provider:default    value(current):null
key:core.default.endpoint-name-grouping    module:core    provider:default    value(current):SkyWalking endpoint rule
key:core.default.apdexThreshold    module:core    provider:default    value(current):null
key:agent-analyzer.default.slowDBAccessThreshold    module:agent-analyzer    provider:default    value(current):null
key:alarm.default.alarm-settings    module:alarm    provider:default    value(current):null

2023-06-21 06:40:00,774 org.apache.skywalking.oap.server.core.alarm.provider.AlarmRulesWatcher 102 [pool-10-thread-1] INFO  [] - Update alarm rules to Rules(rules=[AlarmRule(alarmRuleName=service_resp_time_rule, metricsName=service_resp_time, includeNames=[], includeNamesRegex=, excludeNames=[], excludeNamesRegex=, includeLabels=[], includeLabelsRegex=, excludeLabels=[], excludeLabelsRegex=, threshold=2000, op=>, period=10, count=3, silencePeriod=5, message=服务:{name}\n 指标:响应时间\n 详情:至少3次超过2秒(最近10分钟内), onlyAsCondition=false, tags={}), AlarmRule(alarmRuleName=service_sla_rule, metricsName=service_sla, includeNames=[], includeNamesRegex=, excludeNames=[], excludeNamesRegex=, includeLabels=[], includeLabelsRegex=, excludeLabels=[], excludeLabelsRegex=, threshold=2000, op=<, period=10, count=3, silencePeriod=3, message=服务:{name}\n 指标:成功率\n 详情:至少3次低于80%(最近10分钟内), onlyAsCondition=false, tags={}), AlarmRule(alarmRuleName=service_resp_time_percentile_rule, metricsName=service_percentile, includeNames=[], includeNamesRegex=, excludeNames=[], excludeNamesRegex=, includeLabels=[], includeLabelsRegex=, excludeLabels=[], excludeLabelsRegex=, threshold=1000,1000,1000,1000,1000, op=>, period=10, count=2, silencePeriod=5, message=服务:{name}\n 指标:响应时间\n 详情:至少3次百分位超过1秒(最近10分钟内), onlyAsCondition=false, tags={}), AlarmRule(alarmRuleName=service_instance_resp_time_rule, metricsName=service_instance_resp_time, includeNames=[], includeNamesRegex=, excludeNames=[], excludeNamesRegex=, includeLabels=[], includeLabelsRegex=, excludeLabels=[], excludeLabelsRegex=, threshold=2000, op=>, period=10, count=2, silencePeriod=5, message=实例:{name}\n 指标:响应时间\n 详情:至少2次超过2秒(最近10分钟内), onlyAsCondition=false, tags={}), AlarmRule(alarmRuleName=database_access_resp_time_rule, metricsName=database_access_resp_time, includeNames=[], includeNamesRegex=, excludeNames=[], excludeNamesRegex=, includeLabels=[], includeLabelsRegex=, excludeLabels=[], excludeLabelsRegex=, threshold=2000, op=>, period=10, count=2, 
silencePeriod=10, message=数据库访问:{name}\n 指标:响应时间\n 详情:至少2次超过2秒(最近10分钟内), onlyAsCondition=false, tags={}), AlarmRule(alarmRuleName=endpoint_relation_resp_time_rule, metricsName=endpoint_relation_resp_time, includeNames=[], includeNamesRegex=, excludeNames=[], excludeNamesRegex=, includeLabels=[], includeLabelsRegex=, excludeLabels=[], excludeLabelsRegex=, threshold=2000, op=>, period=10, count=2, silencePeriod=10, message=端点关系:{name}\n 指标:响应时间\n 详情:至少2次超过2秒(最近10分钟内), onlyAsCondition=false, tags={}), AlarmRule(alarmRuleName=instance_jvm_old_gc_count_rule, metricsName=instance_jvm_old_gc_count, includeNames=[], includeNamesRegex=, excludeNames=[], excludeNamesRegex=, includeLabels=[], includeLabelsRegex=, excludeLabels=[], excludeLabelsRegex=, threshold=1, op=>, period=1440, count=1, silencePeriod=1440, message=实例:{name}\n 指标:OldGC次数\n 详情:最近1天内大于1次, onlyAsCondition=false, tags={}), AlarmRule(alarmRuleName=instance_jvm_young_gc_count_rule, metricsName=instance_jvm_young_gc_count, includeNames=[], includeNamesRegex=, excludeNames=[], excludeNamesRegex=, includeLabels=[], includeLabelsRegex=, excludeLabels=[], excludeLabelsRegex=, threshold=1, op=>, period=5, count=100, silencePeriod=5, message=实例:{name}\n 指标:YoungGC次数\n 详情:最近5分钟内大于100次, onlyAsCondition=false, tags={})], webhooks=[], grpchookSetting=null, slacks=null, wecchats=WechatSettings(textTemplate={
  "msgtype": "text",
  "text": {
    "content": "SkyWalking 链路追踪告警: \n %s."
  }
}, webhooks=[https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=71c0a6f0-43a0-XXXX-b8c9-52aff88f3b68]), compositeRules=[], dingtalks=null, feishus=null, welinks=null)

3.2 Alarm channels

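  1. webhook: the list of service endpoints called when an alarm is triggered.
  2. gRPCHook: the host and port of the remote gRPC method called when an alarm is triggered.
  3. Slack Chat Hook: the Slack Chat endpoint called when an alarm is triggered.
  4. WeCom Hook: the WeChat (WeCom) endpoint called when an alarm is triggered.
  5. DingTalk Hook: the DingTalk endpoint called when an alarm is triggered.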

3.2.1 Webhook

The webhook requires the peer to be a web container. Alarm messages are sent as HTTP POST requests with Content-Type application/json, and the JSON body contains the following fields:

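  • scopeId, scope: the ID and name of the target scope.
  • name: the entity name of the target scope.
  • id0: the ID of the scope entity.
  • id1: not used.
  • ruleName: the rule name configured in alarm-settings.yml.
  • alarmMessage: the alarm message text.
  • startTime: the alarm timestamp, in milliseconds since 1970-01-01 UTC.
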
[{
    "scopeId": 1,
    "scope": "SERVICE",
    "name": "one-more-service",
    "id0": "b3JkZXItY2VudGVyLXNlYXJjaC1hcGk=.1",
    "id1": "",
    "ruleName": "service_resp_time_rule",
    "alarmMessage": "Average response time of service [one-more-service] exceeded 1s in 2 of the last 10 minutes",
    "startTime": 1617670815000
}, {
    "scopeId": 2,
    "scope": "SERVICE_INSTANCE",
    "name": "e4b31262acaa47ef92a22b6a2b8a7cb1@192.168.30.11 of one-more-service",
    "id0": "dWF0LWxib2Mtc2VydmljZQ==.1_ZTRiMzEyNjJhY2FhNDdlZjkyYTIyYjZhMmI4YTdjYjFAMTcyLjI0LjMwLjEzOA==",
    "id1": "",
    "ruleName": "instance_jvm_young_gc_count_rule",
    "alarmMessage": "YoungGC count of instance [e4b31262acaa47ef92a22b6a2b8a7cb1@192.168.30.11 of one-more-service] exceeded 10 in 2 of the last 10 minutes",
    "startTime": 1617670815000
}, {
    "scopeId": 3,
    "scope": "ENDPOINT",
    "name": "/one/more/endpoint in one-more-service",
    "id0": "b25lcGllY2UtYXBp.1_L3RlYWNoZXIvc3R1ZGVudC92aXBsZXNzb25z",
    "id1": "",
    "ruleName": "endpoint_resp_time_rule",
    "alarmMessage": "Average response time of endpoint [/one/more/endpoint in one-more-service] exceeded 1s in 2 of the last 10 minutes",
    "startTime": 1617670815000
}]
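A receiver for this payload can be very small. The following is a minimal sketch using only the Python standard library; the handler class, the summarize_alarms helper and the port are illustrative assumptions, and a production receiver would add authentication and error handling.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_alarms(body: bytes) -> list:
    """Turn the JSON array posted by the webhook into one line per alarm."""
    alarms = json.loads(body)
    return [
        f'[{alarm["scope"]}] {alarm["ruleName"]}: {alarm["alarmMessage"]}'
        for alarm in alarms
    ]


class AlarmHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        for line in summarize_alarms(self.rfile.read(length)):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Block forever, printing every alarm batch that arrives."""
    HTTPServer(("0.0.0.0", port), AlarmHandler).serve_forever()


# serve(8080)  # uncomment to start; then list http://<host>:8080/ under webhooks
```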

3.2.2 WeCom Hook

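Only the enterprise edition of WeChat (WeCom) supports webhooks; for how to use them, see "How to configure a group robot".

If the WeChat webhooks are configured as follows, alarm messages are sent by HTTP POST with Content-Type application/json.

For example: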

wechatHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking alarm: \n %s."
      }
    }
  webhooks:
    - https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=9d8866d6-ab55-48f3-8336-786325667640
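To preview what such a template produces, the substitution can be reproduced locally. This is a sketch under assumptions: the OAP server fills %s itself, and the escaping shown here (so that quotes in the message keep the JSON valid) is this sketch's own choice, not a description of the server. The functions render and build_request are hypothetical names.

```python
import json
import urllib.request

# Same shape as textTemplate above; "\\n" keeps the JSON escape literal.
TEXT_TEMPLATE = """{
  "msgtype": "text",
  "text": {
    "content": "Apache SkyWalking alarm: \\n %s."
  }
}"""


def render(template: str, message: str) -> str:
    """Replace the first %s with the JSON-escaped alarm message."""
    escaped = json.dumps(message, ensure_ascii=False)[1:-1]
    return template.replace("%s", escaped, 1)


def build_request(url: str, message: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for a webhook URL."""
    return urllib.request.Request(
        url,
        data=render(TEXT_TEMPLATE, message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = json.loads(render(TEXT_TEMPLATE, 'service "demo" is slow'))
print(payload["text"]["content"])
```

Passing the result of build_request to urllib.request.urlopen would send the message to the group robot; that call is left out so the sketch has no side effects.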

3.2.3 DingTalk Hook

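Follow the DingTalk custom robot documentation to create a new webhook. For security, an optional secret can be configured for the webhook URL.

If the DingTalk webhooks are configured as follows, alarm messages are sent by HTTP POST with Content-Type application/json.

For example: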

dingtalkHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking alarm: \n %s."
      }
    }
  webhooks:
    - url: https://oapi.dingtalk.com/robot/send?access_token=5fddceb7c1a3169016bfcad7ae5e3412fd32a90e0ff919a8b480432c810fe4d3
      secret: dummysecret
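When a secret is configured, requests to the robot carry a timestamp and a signature. As far as the DingTalk custom robot documentation describes it, the signature is an HMAC-SHA256 over "timestamp\nsecret", Base64-encoded and URL-encoded. The sketch below reproduces that scheme for manual testing of a robot; signed_url is a hypothetical helper and the URL and token are placeholders.

```python
import base64
import hashlib
import hmac
import urllib.parse


def signed_url(url: str, secret: str, timestamp_ms: int) -> str:
    """Append timestamp and sign query parameters to a robot webhook URL."""
    string_to_sign = f"{timestamp_ms}\n{secret}".encode("utf-8")
    digest = hmac.new(secret.encode("utf-8"), string_to_sign, hashlib.sha256).digest()
    sign = urllib.parse.quote_plus(base64.b64encode(digest).decode("ascii"))
    return f"{url}&timestamp={timestamp_ms}&sign={sign}"


# Placeholder values; a real call would use int(time.time() * 1000).
print(signed_url(
    "https://oapi.dingtalk.com/robot/send?access_token=PLACEHOLDER",
    "dummysecret",
    1617670815000,
))
```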

4. Testing and verification

4.1 Generating traffic

## Simulate traffic
$ for i in {1..5000}; do curl http://acme.zhang-qing.com/hello && sleep 1; done

$ for i in {1..5000}; do curl http://acme.zhang-qing.com/start && sleep 1; done

$ for i in {1..5000}; do curl http://acme.zhang-qing.com/readtimeout && sleep 1; done

4.2 Verification

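Check the alarm list in the UI.

Because the deployed application deliberately sleeps for 2 to 3 seconds in its endpoints, some of the SkyWalking alarm rules are triggered.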

[Figure 12: Day08-Observability-APM]

[Figure 13: Day08-Observability-APM]

5. Summary

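  • Alarm rules: SkyWalking offers a rich set of rule settings, such as thresholds, check periods, match counts and silence periods, which make alerting flexible and precise.
  • Alarm notification: alarm messages can be pushed through webhook, gRPC, Slack, WeCom, DingTalk, Feishu and similar channels, so users can pick the notification method they need.
  • Application performance monitoring: SkyWalking-based monitoring captures performance problems in applications, for example slow transactions, memory leaks and excessive CPU usage. When such a problem exceeds the preset threshold, the corresponding alarm is triggered.
  • Business failure monitoring: SkyWalking-based monitoring gives timely notice of abnormal conditions in the system, for example failed interface calls, HTTP errors, process crashes and connection errors. When such anomalies exceed the preset threshold, the corresponding alarm is triggered.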