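1. Overview of SkyWalking-based alerting
Official guide: apache/skywalking · GitHub
How it works and notification channels:
- At regular intervals, the alarm module polls the tracing data collected by skywalking-oap.
- It evaluates that data against the configured alarm rules (such as service response time or service response time percentile) and sends the corresponding alarm message as soon as a threshold is reached.
- Supported notification channels: plain webhook, WeChat Hook (WeCom alerts), Dingtalk Hook (DingTalk alerts), Feishu Hook (Feishu alerts).
- Alarm messages can also be viewed in the UI.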
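2. Alarm rules
2.1 Default alarm rules
In SkyWalking an alarm rule is simply called a rule. A default installation of the SkyWalking OAP server ships with an alarm rule configuration file,
alarm-settings.yml, located in the config folder under the installation directory: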
# kubectl -n devops exec -it skywalking-oap-5f45c8df5-49nn9 -- bash
bash-5.0# pwd
/skywalking
bash-5.0# cat config/alarm-settings.yml
2.2 Alarm rules
# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
...
webhooks:
# - http://127.0.0.1/notify/
# - http://127.0.0.1/go-wechat/
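The sample pairs `threshold: 8000` with a message about 80%, which reflects that `service_sla` is expressed as a percentage multiplied by 100 (10000 means 100%). A minimal sketch of that arithmetic follows; the helper names `sla_threshold` and `sla_percent` are hypothetical and not part of SkyWalking.

```python
def sla_threshold(percent: float) -> int:
    """Convert a success rate in percent to the service_sla scale (10000 = 100%)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(percent * 100)


def sla_percent(threshold: int) -> float:
    """Convert a service_sla threshold back to a percentage."""
    return threshold / 100


# 80% success rate corresponds to the threshold used in the sample rule.
print(sla_threshold(80))   # 8000
print(sla_percent(2000))   # 20.0
```

Checking a threshold this way before deploying helps keep the number in the rule and the percentage quoted in its message in agreement.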
2.3 Alarm rule details
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
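The leading comment states that each alarm rule name must be unique and must end with _rule; here it is service_resp_time_rule (service response time).
- metrics-name: the metric to alarm on; the metric value must be of type long, double, or int
- op: how the metric value is compared with the threshold; here it is "greater than"
- threshold: the threshold; here it is 1000, in milliseconds
- period: the length of time over which the metric is evaluated, that is, the alarm check window, in minutes
- count: how many times the condition must be met within the window before the alarm fires
- silence-period: the period during which repeats of the same alarm are suppressed; it defaults to the check period. Put simply, if an alarm fires at time N, the rule stays silent until N + silence-period and does not fire again during that time, which is comparable to the way Alertmanager throttles repeated notifications
- message: the alarm message body; variables such as {name} are substituted automatically when the message is sent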
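How period, count, and silence-period interact can be sketched with a small simulation. This is a simplified model under stated assumptions (one metric value per minute, a window holding the last `period` values, a silence countdown measured in checks); the class `RuleWindow` is hypothetical and is not the OAP implementation.

```python
import operator
from collections import deque

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}


class RuleWindow:
    """Simplified model of a single alarm rule (illustration only)."""

    def __init__(self, op, threshold, period, count, silence_period=None):
        self.matches = OPS[op]
        self.threshold = threshold
        self.count = count
        # silence-period defaults to the check period
        self.silence_period = period if silence_period is None else silence_period
        self.window = deque(maxlen=period)  # one value per minute
        self.silence_left = 0

    def check(self, value):
        """Feed one per-minute value; return True if an alarm fires."""
        self.window.append(value)
        hits = sum(1 for v in self.window if self.matches(v, self.threshold))
        if self.silence_left > 0:
            self.silence_left -= 1  # still silent after a previous alarm
            return False
        if hits >= self.count:
            self.silence_left = self.silence_period
            return True
        return False


rule = RuleWindow(">", 1000, period=10, count=3, silence_period=5)
values = [1200, 800, 1500, 1100, 1300, 1400, 1250, 1600, 1700, 1800]
print([rule.check(v) for v in values])
```

In this model the third value above 1000 ms arrives in minute 4, so the alarm fires there, stays silent for the next five checks, and fires again in minute 10 because the window still holds at least three matching values.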
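2.4 Advanced alarm rules
- 1. service_resp_time_rule: average service response time over the last X minutes exceeds X seconds
- 2. service_sla_rule: service success rate over the last X minutes is lower than X%
- 3. service_resp_time_percentile_rule: a service response time percentile over the last X minutes exceeds X seconds
- 4. service_instance_resp_time_rule: average response time of a service instance over the last X minutes exceeds X seconds
- 5. database_access_resp_time_rule: average database access response time over the last X minutes exceeds X seconds
- 6. endpoint_relation_resp_time_rule: average response time of an endpoint relation over the last X minutes exceeds X seconds
- 7. endpoint_avg_rule: average endpoint response time over the last X minutes exceeds X seconds (disabled by default; the official note says it consumes more memory)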
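3. Setup and verification
3.1 Enabling the feature
Most SkyWalking settings are configured through its application.yml and system environment variables; dynamic configuration through a ConfigMap is supported as well.
See the SkyWalking dynamic configuration documentation: when dynamic configuration is enabled, the key alarm.default.alarm-settings overrides the default configuration file alarm-settings.yml.
In the Helm chart a placeholder for the alarm module is already provided, so it only needs to be switched on; no alarm settings have to be declared in values.yaml.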
# The default template is shown below for reference; no action is needed in this step
[root@master01 ~]# vim /root/8/skywalking/templates/oap-configmap.yaml
{{- if .Values.oap.dynamicConfigEnabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: skywalking-dynamic-config
  labels:
    app: {{ template "skywalking.name" . }}
    release: {{ .Release.Name }}
    component: {{ .Values.oap.name }}
data:
{{- end }}
Enable it in values.yaml:
# Edit line 137: change dynamicConfigEnabled: false to dynamicConfigEnabled: true
vim /root/8/skywalking/values.yaml
...
...
oap:
  antiAffinity: soft
  dynamicConfigEnabled: true # enable dynamic configuration
# Full configuration file
[root@master01 ~]# egrep -v "#|^$" /root/8/skywalking/values.yaml
elasticsearch:
  antiAffinity: hard
  antiAffinityTopologyKey: kubernetes.io/hostname
  clusterHealthCheckParams: wait_for_status=green&timeout=1s
  clusterName: elasticsearch
  config:
    host: elasticsearch
    password:
    port:
      http: 9200
    user: elastic
  enabled: false
  esConfig: {}
  esJavaOpts: -Xmx3g -Xms1g
  esMajorVersion: ""
  extraEnvs: []
  extraInitContainers: ""
  extraVolumeMounts: ""
  extraVolumes: ""
  fsGroup: ""
  fullnameOverride: ""
  httpPort: 9200
  image: registry.cn-hangzhou.aliyuncs.com/github_images1024/elasticsearch
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  imageTag: 7.17.3
  ingress:
    annotations: {}
    enabled: false
    hosts:
    - chart-example.local
    path: /
    tls: []
  initResources: {}
  keystore: []
  labels: {}
  lifecycle: {}
  masterService: ""
  masterTerminationFix: false
  maxUnavailable: 1
  minimumMasterNodes: 2
  nameOverride: ""
  networkHost: 0.0.0.0
  nodeAffinity: {}
  nodeGroup: master
  nodeSelector: {}
  persistence:
    annotations: {}
    enabled: true
  podAnnotations: {}
  podManagementPolicy: Parallel
  podSecurityContext:
    fsGroup: 1000
    runAsUser: 1000
  podSecurityPolicy:
    create: false
    name: ""
    spec:
      fsGroup:
        rule: RunAsAny
      privileged: true
      runAsUser:
        rule: RunAsAny
      seLinux:
        rule: RunAsAny
      supplementalGroups:
        rule: RunAsAny
      volumes:
      - secret
      - configMap
      - persistentVolumeClaim
  priorityClassName: ""
  protocol: http
  rbac:
    create: false
    serviceAccountName: ""
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 3
    timeoutSeconds: 5
  replicas: 1
  resources:
    limits:
      cpu: 1000m
      memory: 2Gi
    requests:
      cpu: 100m
      memory: 2Gi
  roles:
    data: "true"
    ingest: "true"
    master: "true"
  schedulerName: ""
  secretMounts: []
  securityContext:
    capabilities:
      drop:
      - ALL
    runAsNonRoot: true
    runAsUser: 1000
  service:
    annotations: {}
    httpPortName: http
    labels: {}
    labelsHeadless: {}
    nodePort: ""
    transportPortName: transport
    type: ClusterIP
  sidecarResources: {}
  sysctlInitContainer:
    enabled: true
  sysctlVmMaxMapCount: 262144
  terminationGracePeriod: 120
  tolerations: []
  transportPort: 9300
  updateStrategy: RollingUpdate
  volumeClaimTemplate:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 50Gi
esInit:
  nodeAffinity: {}
  nodeSelector: {}
  tolerations: []
fullnameOverride: ""
imagePullSecrets: []
initContainer:
  image: registry.cn-hangzhou.aliyuncs.com/abroad_images/busybox
  tag: "1.30"
nameOverride: ""
oap:
  antiAffinity: soft
  dynamicConfigEnabled: true
  env: null
  envoy:
    als:
      enabled: false
  image:
    pullPolicy: IfNotPresent
    repository: registry.cn-hangzhou.aliyuncs.com/github_images1024/skywalking-oap-server
    tag: 8.9.0
  initEs: true
  javaOpts: -Xmx2g -Xms2g
  name: oap
  nodeAffinity: {}
  nodeSelector: {}
  ports:
    grpc: 11800
    rest: 12800
  replicas: 1
  resources: {}
  service:
    type: ClusterIP
  storageType: elasticsearch
  tolerations: []
satellite:
  antiAffinity: soft
  enabled: false
  env: null
  image:
    pullPolicy: IfNotPresent
    repository: registry.cn-hangzhou.aliyuncs.com/github_images1024/skywalking-satellite
    tag: v1.2.0
  name: satellite
  nodeAffinity: {}
  nodeSelector: {}
  podAnnotations: null
  ports:
    grpc: 11800
    prometheus: 1234
  replicas: 1
  resources: {}
  service:
    type: ClusterIP
  tolerations: []
serviceAccounts:
  oap: null
ui:
  image:
    pullPolicy: IfNotPresent
    repository: registry.cn-hangzhou.aliyuncs.com/github_images1024/skywalking-ui
    tag: 8.9.0
  ingress:
    annotations: {}
    enabled: false
    hosts: []
    path: /
    tls: []
  name: ui
  nodeAffinity: {}
  nodeSelector: {}
  replicas: 1
  service:
    annotations: {}
    externalPort: 80
    internalPort: 8080
    type: ClusterIP
  tolerations: []
Edit oap-configmap.yaml in the chart's templates directory to configure custom rules and the WeCom (enterprise WeChat) webhook:
# Redefine oap-configmap.yaml
[root@master01 ~]# vim /root/8/skywalking/templates/oap-configmap.yaml
{{- if .Values.oap.dynamicConfigEnabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: skywalking-dynamic-config
  labels:
    app: {{ template "skywalking.name" . }}
    release: {{ .Release.Name }}
    component: {{ .Values.oap.name }}
data:
  alarm.default.alarm-settings: |-
    rules:
      # Rule unique name, must be ended with `_rule`.
      # The message values are the Chinese notification texts exactly as deployed; they appear verbatim in the OAP log.
      service_resp_time_rule:
        metrics-name: service_resp_time
        op: ">"
        threshold: 2000
        period: 10
        count: 3
        silence-period: 5
        message: 服务:{name}\n 指标:响应时间\n 详情:至少3次超过2秒(最近10分钟内)
      service_sla_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_sla
        op: "<"
        # Note: service_sla is a percentage x 100 (10000 = 100%), so 2000 means 20%; the message says 80%, which would be 8000.
        threshold: 2000
        # The length of time to evaluate the metrics
        period: 10
        # How many times after the metrics match the condition, will trigger alarm
        count: 3
        # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
        silence-period: 3
        message: 服务:{name}\n 指标:成功率\n 详情:至少3次低于80%(最近10分钟内)
      service_resp_time_percentile_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_percentile
        op: ">"
        threshold: 1000,1000,1000,1000,1000
        period: 10
        # Note: count is 2, although the message text says "at least 3 times".
        count: 2
        silence-period: 5
        message: 服务:{name}\n 指标:响应时间\n 详情:至少3次百分位超过1秒(最近10分钟内)
      service_instance_resp_time_rule:
        metrics-name: service_instance_resp_time
        op: ">"
        threshold: 2000
        period: 10
        count: 2
        silence-period: 5
        message: 实例:{name}\n 指标:响应时间\n 详情:至少2次超过2秒(最近10分钟内)
      database_access_resp_time_rule:
        metrics-name: database_access_resp_time
        threshold: 2000
        op: ">"
        period: 10
        count: 2
        # message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
        message: 数据库访问:{name}\n 指标:响应时间\n 详情:至少2次超过2秒(最近10分钟内)
      endpoint_relation_resp_time_rule:
        metrics-name: endpoint_relation_resp_time
        threshold: 2000
        op: ">"
        period: 10
        count: 2
        message: 端点关系:{name}\n 指标:响应时间\n 详情:至少2次超过2秒(最近10分钟内)
      instance_jvm_old_gc_count_rule:
        metrics-name: instance_jvm_old_gc_count
        threshold: 1
        op: ">"
        # Note: the message says "last 1 day" and the OAP log below reports period=1440 for this rule, so 3 here does not match the deployed value.
        period: 3
        count: 1
        message: 实例:{name}\n 指标:OldGC次数\n 详情:最近1天内大于1次
      instance_jvm_young_gc_count_rule:
        metrics-name: instance_jvm_young_gc_count
        threshold: 1
        op: ">"
        # Note: count (100) is larger than period (5), so with one check per minute this rule is unlikely to ever fire;
        # threshold: 100 with count: 1 may have been the intent.
        period: 5
        count: 100
        message: 实例:{name}\n 指标:YoungGC次数\n 详情:最近5分钟内大于100次
    wechatHooks:
      textTemplate: |-
        {
          "msgtype": "text",
          "text": {
            "content": "SkyWalking 链路追踪告警: \n %s."
          }
        }
      webhooks:
        - https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=9d8866d6-ab55-48f3-8336-786325667640
{{- end }}
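Before upgrading, it can be worth sanity-checking the rules. The sketch below assumes the rules section has already been parsed into a Python dict (for example with a YAML library, which is not used here) and checks only two properties discussed in this article: names ending in `_rule`, and `count` not exceeding `period`. The function `validate_rules` is a hypothetical helper, not a SkyWalking tool.

```python
def validate_rules(rules):
    """Return a list of human-readable problems found in a rules mapping."""
    problems = []
    for name, rule in rules.items():
        if not name.endswith("_rule"):
            problems.append(f"{name}: rule name must end with _rule")
        if "metrics-name" not in rule:
            problems.append(f"{name}: metrics-name is missing")
        period = rule.get("period")
        count = rule.get("count")
        # Assumption: one check per minute, so at most `period` matches fit in the window.
        if isinstance(period, int) and isinstance(count, int) and count > period:
            problems.append(
                f"{name}: count {count} exceeds period {period}, "
                "so the rule can never fire"
            )
    return problems


rules = {
    "instance_jvm_young_gc_count_rule": {
        "metrics-name": "instance_jvm_young_gc_count",
        "threshold": 1, "op": ">", "period": 5, "count": 100,
    },
}
for problem in validate_rules(rules):
    print(problem)
```

Run against the YoungGC rule from the ConfigMap above, the check reports that a count of 100 cannot be reached within a 5-minute window.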
After the changes, run helm to upgrade the release:
# Upgrade
[root@master01 ~]# cd /root/8/
[root@master01 8]# helm upgrade skywalking skywalking -n devops --values ./skywalking/values.yaml
[root@master01 8]# kubectl get pods -n devops | grep skywalking-oap
skywalking-oap-fb9b8fcd7-4jgkp 1/1 Running 0 38s
# View the logs
$ kubectl logs -f skywalking-oap-fb9b8fcd7-4jgkp -n devops
2023-06-21 06:40:00,742 org.apache.skywalking.oap.server.library.server.grpc.GRPCServer 142 [main] INFO [] - Bind handler JVMMetricReportServiceHandler into gRPC server 0.0.0.0:11800
2023-06-21 06:40:00,747 org.apache.skywalking.oap.server.library.server.grpc.GRPCServer 142 [main] INFO [] - Bind handler JVMMetricReportServiceHandlerCompat into gRPC server 0.0.0.0:11800
2023-06-21 06:40:00,749 org.apache.skywalking.oap.server.library.module.BootstrapFlow 46 [main] INFO [] - start the provider default in receiver-meter module.
2023-06-21 06:40:00,750 org.apache.skywalking.oap.server.library.server.grpc.GRPCServer 142 [main] INFO [] - Bind handler MeterServiceHandler into gRPC server 0.0.0.0:11800
2023-06-21 06:40:00,755 org.apache.skywalking.oap.server.library.server.grpc.GRPCServer 142 [main] INFO [] - Bind handler MeterServiceHandlerCompat into gRPC server 0.0.0.0:11800
2023-06-21 06:40:00,757 org.apache.skywalking.oap.server.configuration.api.ConfigWatcherRegister 79 [main] INFO [] - Current configurations after the bootstrap sync.
Following dynamic config items are available.
---------------------------------------------
key:core.default.log4j-xml module:core provider:default value(current):null
key:agent-analyzer.default.uninstrumentedGateways module:agent-analyzer provider:default value(current):null
key:configuration-discovery.default.agentConfigurations module:configuration-discovery provider:default value(current):null
key:agent-analyzer.default.traceSamplingPolicy module:agent-analyzer provider:default value(current):null
key:core.default.endpoint-name-grouping module:core provider:default value(current):SkyWalking endpoint rule
key:core.default.apdexThreshold module:core provider:default value(current):null
key:agent-analyzer.default.slowDBAccessThreshold module:agent-analyzer provider:default value(current):null
key:alarm.default.alarm-settings module:alarm provider:default value(current):null
2023-06-21 06:40:00,774 org.apache.skywalking.oap.server.core.alarm.provider.AlarmRulesWatcher 102 [pool-10-thread-1] INFO [] - Update alarm rules to Rules(rules=[AlarmRule(alarmRuleName=service_resp_time_rule, metricsName=service_resp_time, includeNames=[], includeNamesRegex=, excludeNames=[], excludeNamesRegex=, includeLabels=[], includeLabelsRegex=, excludeLabels=[], excludeLabelsRegex=, threshold=2000, op=>, period=10, count=3, silencePeriod=5, message=服务:{name}\n 指标:响应时间\n 详情:至少3次超过2秒(最近10分钟内), onlyAsCondition=false, tags={}), AlarmRule(alarmRuleName=service_sla_rule, metricsName=service_sla, includeNames=[], includeNamesRegex=, excludeNames=[], excludeNamesRegex=, includeLabels=[], includeLabelsRegex=, excludeLabels=[], excludeLabelsRegex=, threshold=2000, op=<, period=10, count=3, silencePeriod=3, message=服务:{name}\n 指标:成功率\n 详情:至少3次低于80%(最近10分钟内), onlyAsCondition=false, tags={}), AlarmRule(alarmRuleName=service_resp_time_percentile_rule, metricsName=service_percentile, includeNames=[], includeNamesRegex=, excludeNames=[], excludeNamesRegex=, includeLabels=[], includeLabelsRegex=, excludeLabels=[], excludeLabelsRegex=, threshold=1000,1000,1000,1000,1000, op=>, period=10, count=2, silencePeriod=5, message=服务:{name}\n 指标:响应时间\n 详情:至少3次百分位超过1秒(最近10分钟内), onlyAsCondition=false, tags={}), AlarmRule(alarmRuleName=service_instance_resp_time_rule, metricsName=service_instance_resp_time, includeNames=[], includeNamesRegex=, excludeNames=[], excludeNamesRegex=, includeLabels=[], includeLabelsRegex=, excludeLabels=[], excludeLabelsRegex=, threshold=2000, op=>, period=10, count=2, silencePeriod=5, message=实例:{name}\n 指标:响应时间\n 详情:至少2次超过2秒(最近10分钟内), onlyAsCondition=false, tags={}), AlarmRule(alarmRuleName=database_access_resp_time_rule, metricsName=database_access_resp_time, includeNames=[], includeNamesRegex=, excludeNames=[], excludeNamesRegex=, includeLabels=[], includeLabelsRegex=, excludeLabels=[], excludeLabelsRegex=, threshold=2000, op=>, period=10, count=2, 
silencePeriod=10, message=数据库访问:{name}\n 指标:响应时间\n 详情:至少2次超过2秒(最近10分钟内), onlyAsCondition=false, tags={}), AlarmRule(alarmRuleName=endpoint_relation_resp_time_rule, metricsName=endpoint_relation_resp_time, includeNames=[], includeNamesRegex=, excludeNames=[], excludeNamesRegex=, includeLabels=[], includeLabelsRegex=, excludeLabels=[], excludeLabelsRegex=, threshold=2000, op=>, period=10, count=2, silencePeriod=10, message=端点关系:{name}\n 指标:响应时间\n 详情:至少2次超过2秒(最近10分钟内), onlyAsCondition=false, tags={}), AlarmRule(alarmRuleName=instance_jvm_old_gc_count_rule, metricsName=instance_jvm_old_gc_count, includeNames=[], includeNamesRegex=, excludeNames=[], excludeNamesRegex=, includeLabels=[], includeLabelsRegex=, excludeLabels=[], excludeLabelsRegex=, threshold=1, op=>, period=1440, count=1, silencePeriod=1440, message=实例:{name}\n 指标:OldGC次数\n 详情:最近1天内大于1次, onlyAsCondition=false, tags={}), AlarmRule(alarmRuleName=instance_jvm_young_gc_count_rule, metricsName=instance_jvm_young_gc_count, includeNames=[], includeNamesRegex=, excludeNames=[], excludeNamesRegex=, includeLabels=[], includeLabelsRegex=, excludeLabels=[], excludeLabelsRegex=, threshold=1, op=>, period=5, count=100, silencePeriod=5, message=实例:{name}\n 指标:YoungGC次数\n 详情:最近5分钟内大于100次, onlyAsCondition=false, tags={})], webhooks=[], grpchookSetting=null, slacks=null, wecchats=WechatSettings(textTemplate={
"msgtype": "text",
"text": {
"content": "SkyWalking 链路追踪告警: \n %s."
}
}, webhooks=[https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=71c0a6f0-43a0-XXXX-b8c9-52aff88f3b68]), compositeRules=[], dingtalks=null, feishus=null, welinks=null)
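3.2 Notification channels
- webhook: the list of service endpoints that are called when an alarm fires.
- gRPCHook: the host and port of the remote gRPC method that is called when an alarm fires.
- Slack Chat Hook: the Slack Chat endpoint that is called when an alarm fires.
- WeChat Hook: the WeChat (WeCom) endpoint that is called when an alarm fires.
- DingTalk Hook: the DingTalk endpoint that is called when an alarm fires.
3.2.1 Webhook
A webhook requires a peer that is a web container, that is, an HTTP endpoint able to receive requests. Alarm messages are sent as HTTP POST requests with Content-Type application/json, and the JSON body contains the following fields:
- scopeId: the ID of the target scope (the payload also carries the scope name in scope).
- name: the entity name of the target scope.
- id0: the ID of the scope entity.
- id1: empty for ordinary scopes; for relation scopes it carries the destination entity ID, with id0 as the source.
- ruleName: the rule name you configured in alarm-settings.yml.
- alarmMessage: the alarm message text.
- startTime: the alarm timestamp, in milliseconds elapsed since 1970/1/1 UTC.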
[{
  "scopeId": 1,
  "scope": "SERVICE",
  "name": "one-more-service",
  "id0": "b3JkZXItY2VudGVyLXNlYXJjaC1hcGk=.1",
  "id1": "",
  "ruleName": "service_resp_time_rule",
  "alarmMessage": "Average response time of service [one-more-service] exceeded 1s in 2 of the last 10 minutes",
  "startTime": 1617670815000
}, {
  "scopeId": 2,
  "scope": "SERVICE_INSTANCE",
  "name": "e4b31262acaa47ef92a22b6a2b8a7cb1@192.168.30.11 of one-more-service",
  "id0": "dWF0LWxib2Mtc2VydmljZQ==.1_ZTRiMzEyNjJhY2FhNDdlZjkyYTIyYjZhMmI4YTdjYjFAMTcyLjI0LjMwLjEzOA==",
  "id1": "",
  "ruleName": "instance_jvm_young_gc_count_rule",
  "alarmMessage": "YoungGC count of instance [e4b31262acaa47ef92a22b6a2b8a7cb1@192.168.30.11 of one-more-service] exceeded 10 in 2 of the last 10 minutes",
  "startTime": 1617670815000
}, {
  "scopeId": 3,
  "scope": "ENDPOINT",
  "name": "/one/more/endpoint in one-more-service",
  "id0": "b25lcGllY2UtYXBp.1_L3RlYWNoZXIvc3R1ZGVudC92aXBsZXNzb25z",
  "id1": "",
  "ruleName": "endpoint_resp_time_rule",
  "alarmMessage": "Average response time of endpoint [/one/more/endpoint in one-more-service] exceeded 1s in 2 of the last 10 minutes",
  "startTime": 1617670815000
}]
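A receiver for this payload can be sketched with the Python standard library alone. This is a minimal illustration, assuming the body is a JSON array shaped like the example above; `summarize_alarms`, `AlarmHandler`, `serve`, and port 8080 are hypothetical choices rather than anything SkyWalking prescribes.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_alarms(body: bytes):
    """Turn the webhook JSON array into one summary line per alarm."""
    alarms = json.loads(body.decode("utf-8"))
    return [
        f'[{a.get("scope", "?")}] {a["name"]} ({a["ruleName"]}): {a["alarmMessage"]}'
        for a in alarms
    ]


class AlarmHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        for line in summarize_alarms(self.rfile.read(length)):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080):
    """Start the receiver; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), AlarmHandler).serve_forever()
```

Calling `serve()` and listing `http://<receiver-host>:8080/` under `webhooks:` in the alarm settings would route alarms to this process, which could then forward them to channels that have no built-in hook.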
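3.2.2 WeChat Hook
Only the enterprise edition of WeChat (WeCom) supports webhooks; for how to use them, see "How to configure a group robot".
If you configure the WeChat webhooks as shown below, alarm messages are sent via HTTP POST with Content-Type application/json.
For example: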
wechatHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking alarm: \n %s."
      }
    }
  webhooks:
    - https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=9d8866d6-ab55-48f3-8336-786325667640
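The `%s` placeholder in textTemplate is where the alarm message is substituted. The sketch below shows that substitution for a template shaped like the one above. Escaping the message as a JSON string fragment is a design choice of this sketch, made so that quotes inside the message do not break the body; whether the OAP escapes the message the same way is not covered in this article, and `render_wechat_body` is a hypothetical helper.

```python
import json

# Same shape as the textTemplate above; "\\n" keeps the literal backslash-n
# that JSON later decodes as a newline.
TEXT_TEMPLATE = """{
  "msgtype": "text",
  "text": {
    "content": "Apache SkyWalking alarm: \\n %s."
  }
}"""


def render_wechat_body(template: str, alarm_message: str) -> str:
    """Fill the %s placeholder with a JSON-escaped alarm message."""
    escaped = json.dumps(alarm_message, ensure_ascii=False)[1:-1]
    return template % escaped


body = render_wechat_body(TEXT_TEMPLATE, 'service "demo" is slow')
print(json.loads(body)["text"]["content"])
```

The rendered string is what would be sent as the POST body to each URL listed under `webhooks:`.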
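3.2.3 DingTalk Hook
Follow DingTalk's custom robot documentation to create a new webhook. For security, you can optionally configure a secret for the webhook URL.
If you configure the DingTalk webhooks as shown below, alarm messages are sent via HTTP POST with Content-Type application/json.
For example: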
dingtalkHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking alarm: \n %s."
      }
    }
  webhooks:
    - url: https://oapi.dingtalk.com/robot/send?access_token=5fddceb7c1a3169016bfcad7ae5e3412fd32a90e0ff919a8b480432c810fe4d3
      secret: dummysecret
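When a secret is set, DingTalk's custom robot documentation describes signing each request: HMAC-SHA256 over the string made of the millisecond timestamp, a newline, and the secret, keyed with the secret, then Base64-encoded and URL-encoded, and appended as `timestamp` and `sign` query parameters. The sketch below follows that description; it is an illustration for testing a webhook by hand, and how the OAP applies the secret internally is not covered in this article. `sign_dingtalk_url` is a hypothetical helper.

```python
import base64
import hashlib
import hmac
import urllib.parse


def sign_dingtalk_url(webhook_url: str, secret: str, timestamp_ms: int) -> str:
    """Append timestamp and sign parameters to a DingTalk robot webhook URL."""
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    sign = urllib.parse.quote_plus(base64.b64encode(digest))
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign}"


# Placeholder token and the dummy secret from the example above.
url = sign_dingtalk_url(
    "https://oapi.dingtalk.com/robot/send?access_token=dummy",
    "dummysecret",
    1617670815000,
)
print(url)
```

In real use the timestamp would be the current time in milliseconds, since a stale timestamp is expected to be rejected by the receiving side.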
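4. Testing and verification
4.1 Generating traffic
## Simulate requests
$ for i in {1..5000}; do curl http://acme.zhang-qing.com/hello && sleep 1; done
$ for i in {1..5000}; do curl http://acme.zhang-qing.com/start && sleep 1; done
$ for i in {1..5000}; do curl http://acme.zhang-qing.com/readtimeout && sleep 1; done
4.2 Verifying the alarms
Check the alarm list on the UI page.
Because the deployed application deliberately sleeps for 2 to 3 seconds in its endpoints, some of the SkyWalking alarm rules are triggered.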


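5. Summary
- Alarm rules: rules are configurable per metric, with a threshold, comparison operator, evaluation period, trigger count, and silence period, which keeps alerting flexible and precise.
- Alarm notification: alarm messages can be pushed through webhook, gRPC, Slack, WeChat (WeCom), DingTalk, and Feishu hooks; channels such as email or SMS are not covered by these hooks and would need a custom webhook receiver.
- Application performance monitoring: SkyWalking-based performance monitoring captures performance problems in the application, such as slow transactions, memory leaks, and high CPU usage; when these exceed the configured thresholds, the corresponding alarm fires.
- Business fault monitoring: SkyWalking-based fault monitoring gives timely notice of abnormal conditions in the system, such as failed interface calls, HTTP errors, process crashes, and connection errors; when these exceed the configured thresholds, the corresponding alarm fires.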