Prometheus System Monitoring

Software versions

Software        Version
prometheus      2.24.1
node_exporter   1.1.0
grafana         7.3.4
consul          1.9.3
alertmanager    0.21.0

Nodes

Node    IP           OS          Roles                                                     CPU      Memory  Disk
node1   10.80.10.1   CentOS 7.9  prometheus, node_exporter, grafana, consul, alertmanager  4 cores  8 GB    20 GB
node2   10.80.10.2   CentOS 7.9  node_exporter                                             4 cores  8 GB    20 GB

Installing Prometheus from the binary release

Prometheus:

  • A time-series monitoring database written in Go.

  • Easy to install and use: it ships as a binary.

node1

Download and install Prometheus:

Download page: https://prometheus.io/download/

# cd /usr/local/src/
# wget https://github.com/prometheus/prometheus/releases/download/v2.24.1/prometheus-2.24.1.linux-amd64.tar.gz
# tar -xzvf prometheus-2.24.1.linux-amd64.tar.gz
# mv prometheus-2.24.1.linux-amd64 /usr/local/prometheus
# /usr/local/prometheus/prometheus --version
prometheus, version 2.24.1 (branch: HEAD, revision: e4487274853c587717006eeda8804e597d120340)
build user: root@0b5231a0de0f
build date: 20210120-00:09:36
go version: go1.15.6
platform: linux/amd64

Change the configuration to scrape every 60 seconds:

# vim /usr/local/prometheus/prometheus.yml
# line 3: change the setting
  scrape_interval: 60s

Manage Prometheus with systemd:

# vim /usr/lib/systemd/system/prometheus.service
[Unit]
Description=prometheus
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/data/prometheus --web.listen-address=127.0.0.1:9090

[Install]
WantedBy=multi-user.target

Start Prometheus and enable it at boot:

# systemctl start prometheus
# systemctl enable prometheus
# systemctl status prometheus

Check the listening port and the process:

# netstat -tlunp | grep prometheus
tcp 0 0 127.0.0.1:9090 0.0.0.0:* LISTEN 12402/prometheus
# ps aux | grep prometheus
root 12402 0.5 0.5 770724 45188 ? Ssl 09:48 0:00 /usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/data/prometheus --web.listen-address=127.0.0.1:9090
root 12787 0.0 0.0 112824 980 pts/0 S+ 09:48 0:00 grep --color=auto prometheus

Check the data directory:

# ls /data/prometheus/
chunks_head lock queries.active wal

Basic authentication for Prometheus with nginx

node1

Install nginx:

# yum install -y nginx

Edit the nginx unit file:

# vim /usr/lib/systemd/system/nginx.service
# lines 16-19: delete these settings
KillSignal=SIGQUIT
TimeoutStopSec=5
KillMode=process
PrivateTmp=true

Edit the nginx configuration file:

# vim /etc/nginx/nginx.conf
# line 37: add the proxy upstream
upstream prometheus {
    server 127.0.0.1:9090;
}
# rewrite the server {} block
server {
    listen 19090;
    root /usr/share/nginx/html;

    include /etc/nginx/default.d/*.conf;

    location / {
        auth_basic "prometheus auth";
        auth_basic_user_file /etc/nginx/htpasswd;
        proxy_pass http://prometheus;
    }
}

Create the password file with user student and password studentpwd:

# printf "student:$(openssl passwd -1 studentpwd)\n" > /etc/nginx/htpasswd
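
To make explicit what nginx checks on each request, here is a minimal Python sketch of the header a client sends for HTTP Basic authentication. The function name basic_auth_header is illustrative and not part of nginx or Prometheus; the credentials are the student/studentpwd pair created above.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of the Authorization header for HTTP Basic auth."""
    # Basic auth is "user:password" encoded with base64; it is not encrypted.
    token = base64.b64encode("{}:{}".format(username, password).encode("utf-8"))
    return "Basic " + token.decode("ascii")


if __name__ == "__main__":
    print(basic_auth_header("student", "studentpwd"))
```

Because base64 is reversible, this scheme only protects the endpoint on a trusted network or behind TLS.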

Check the configuration, start nginx, and enable it at boot:

# nginx -t
# systemctl start nginx
# systemctl enable nginx
# systemctl status nginx

Check the listening port and the processes:

# netstat -tlunp | grep nginx
tcp 0 0 0.0.0.0:19090 0.0.0.0:* LISTEN 15892/nginx: master
# ps aux | grep nginx
root 15892 0.0 0.0 39308 1056 ? Ss 09:51 0:00 nginx: master process /usr/sbin/nginx
nginx 15893 0.0 0.0 39696 1824 ? S 09:51 0:00 nginx: worker process
nginx 15894 0.0 0.0 39696 1824 ? S 09:51 0:00 nginx: worker process
nginx 15895 0.0 0.0 39696 1824 ? S 09:51 0:00 nginx: worker process
nginx 15896 0.0 0.0 39696 1560 ? S 09:51 0:00 nginx: worker process
root 16141 0.0 0.0 112824 976 pts/0 S+ 09:51 0:00 grep --color=auto nginx

Open http://10.80.10.1:19090/ in a browser:

Username: student
Password: studentpwd

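The Prometheus home page opens.

Open the scrape endpoint in a browser: http://10.80.10.1:19090/metrics

Query a metric: go_gc_duration_seconds

Query a metric: go_memstats_heap_objects

Click Graph to plot the data. The timestamps look wrong because the UI uses UTC by default; this does not affect usage, and you can tick "Use local time" to convert them.

List the current scrape targets:

Status -> Targets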

Disable scraping of Prometheus's own /metrics endpoint:

# vim /usr/local/prometheus/prometheus.yml
# last 7 lines: comment out the job
#  - job_name: 'prometheus'
#    static_configs:
#    - targets: ['localhost:9090']

# systemctl restart prometheus
# systemctl status prometheus

No targets are monitored now.

Restore the configuration to re-enable scraping of /metrics:

# vim /usr/local/prometheus/prometheus.yml
# last 7 lines: uncomment the job
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

# systemctl restart prometheus
# systemctl status prometheus

Wait for the target to reappear.

System monitoring with node_exporter

node1

Download and install node_exporter:

Download page: https://prometheus.io/download/

# cd /usr/local/src/
# wget https://github.com/prometheus/node_exporter/releases/download/v1.1.0/node_exporter-1.1.0.linux-amd64.tar.gz
# tar -xzvf node_exporter-1.1.0.linux-amd64.tar.gz
# mv node_exporter-1.1.0.linux-amd64 /usr/local/node_exporter
# /usr/local/node_exporter/node_exporter --version
node_exporter, version 1.1.0 (branch: HEAD, revision: 0e74fbcd5fe3b98246292829a8e81e3133e17033)
build user: root@c81c7415c0ee
build date: 20210205-22:54:09
go version: go1.15.8
platform: linux/amd64

Generate a bcrypt password hash; the password is 123456:

# yum install -y httpd-tools
# htpasswd -nBC 12 ''
New password:
Re-type new password:
:$2y$12$91cgC/lsAg2SiMniOg4w/OaXZN27aYzS3suRsK26fgQxwhHOYpzoK

Write the web configuration file with user student and password 123456:

# vim /usr/local/node_exporter/config.yml
basic_auth_users:
  student: $2y$12$91cgC/lsAg2SiMniOg4w/OaXZN27aYzS3suRsK26fgQxwhHOYpzoK

Manage node_exporter with systemd:

# vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/node_exporter/node_exporter --web.config=/usr/local/node_exporter/config.yml

[Install]
WantedBy=multi-user.target

Start node_exporter and enable it at boot:

# systemctl start node_exporter
# systemctl enable node_exporter
# systemctl status node_exporter

Check the listening port and the process:

# netstat -tlunp | grep node
tcp6 0 0 :::9100 :::* LISTEN 32107/node_exporter
# ps aux | grep node_exporter
root 32107 0.0 0.1 716908 11264 ? Ssl 10:05 0:00 /usr/local/node_exporter/node_exporter --web.config=/usr/local/node_exporter/config.yml
root 32343 0.0 0.0 112824 984 pts/0 S+ 10:05 0:00 grep --color=auto node_exporter

Open http://10.80.10.1:9100/ in a browser:

Username: student
Password: 123456

The node_exporter home page opens.

Click Metrics to view the collected data.

node1

Add a scrape job and specify the username and password:

# vim /usr/local/prometheus/prometheus.yml
# end of file: add the scrape job
  - job_name: 'node exporter'
    basic_auth:
      username: student
      password: 123456
    static_configs:
    - targets: ['10.80.10.1:9100']

# systemctl restart prometheus
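
What Prometheus fetches from the exporter on every scrape is plain text, one sample per line. The following is a minimal Python sketch of reading that format; parse_metrics is a hypothetical helper written for illustration, and it deliberately ignores optional timestamps and other corner cases of the real exposition format.

```python
def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition lines into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        # Simplification: the value is taken to be the last
        # whitespace-separated field, so timestamps are not handled.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


if __name__ == "__main__":
    sample = (
        "# HELP node_load1 1m load average.\n"
        "# TYPE node_load1 gauge\n"
        "node_load1 0.05\n"
        'node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5\n'
    )
    for series, value in parse_metrics(sample).items():
        print(series, value)
```

Each distinct combination of metric name and labels is its own time series, which is why node_cpu_seconds_total returns one series per core and mode.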

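List the current scrape targets:

Status -> Targets

Query node_load1, switch to the graph, and set the time range to 5 minutes.

node1

Raise the server load with the following command and watch the graph change:

# dd if=/dev/zero of=/dev/null

Stop the dd command after the test.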

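Common Prometheus query functions

Wait for the server load to settle, then open http://10.80.10.1:19090/ in a browser.

1-minute load: node_load1

Available memory: node_memory_MemAvailable_bytes

Total memory (bytes): node_memory_MemTotal_bytes

Total memory (MiB): node_memory_MemTotal_bytes/1024/1024

Available memory ratio: node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes

CPU time (one series per CPU core and mode): node_cpu_seconds_total

Filter by mode: node_cpu_seconds_total{mode="idle"}

Filter by mode and core: node_cpu_seconds_total{mode="idle",cpu="0"}

Raw samples from the last two minutes: node_cpu_seconds_total{mode="idle",cpu="0"}[2m]. The evaluation time is shown in UTC unless you tick "Use local time" to convert it to your time zone.

Increase over the window: increase(node_cpu_seconds_total{mode="idle",cpu="0"}[2m]). With the 60-second scrape interval the window holds two samples, so the result is roughly (last - first) * 2 = 117.34.

Per-second rate: rate(node_cpu_seconds_total{mode="idle",cpu="0"}[2m])

Network traffic (the counter is in bytes; multiplying by 8 converts the result to bits/s): rate(node_network_receive_bytes_total{device="ens33"}[2m])*8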

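Per-second rate: how the two functions differ when the window holds three or more samples

  • irate: uses only the last two samples, so short peaks are easier to catch.
  • rate: uses the first and last samples of the window, which averages over the whole window.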
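
The difference between the two functions can be sketched in a few lines of Python. This is a simplified model, not Prometheus's implementation: it ignores counter resets and the extrapolation Prometheus applies at the window edges, and the names simple_rate and simple_irate as well as the sample values are made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def simple_irate(samples: List[Sample]) -> float:
    """Per-second increase between the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


if __name__ == "__main__":
    # Hypothetical counter scraped every 60 s with a burst in the last interval.
    window = [(0.0, 100.0), (60.0, 160.0), (120.0, 400.0)]
    print(simple_rate(window))   # (400 - 100) / 120
    print(simple_irate(window))  # (400 - 160) / 60
```

With a burst in the last interval, the irate-style value is higher than the rate-style value, which is why irate suits spotting peaks and rate suits alerting and slow-moving graphs.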

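Raw samples from the last 5 minutes: node_network_receive_bytes_total{device="ens33"}[5m]

Traffic in bits/s with rate: rate(node_network_receive_bytes_total{device="ens33"}[5m])*8

Traffic in bytes/s with irate: irate(node_network_receive_bytes_total{device="ens33"}[5m])

Sum of the idle rate across all cores over 2 minutes: sum(rate(node_cpu_seconds_total{mode="idle"}[2m]))

Average idle rate across all cores over 2 minutes: avg(rate(node_cpu_seconds_total{mode="idle"}[2m]))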

Installing Grafana and configuring the data source

Download and install Grafana:

Download page: https://grafana.com/grafana/download

# cd /usr/local/src/
# wget https://dl.grafana.com/oss/release/grafana-7.3.4-1.x86_64.rpm
# yum localinstall -y grafana-7.3.4-1.x86_64.rpm

Start Grafana and enable it at boot:

# systemctl start grafana-server
# systemctl enable grafana-server
# systemctl status grafana-server

Check the listening port and the process:

# netstat -tlunp | grep grafana
tcp6 0 0 :::3000 :::* LISTEN 57366/grafana-serve
# ps aux | grep grafana
grafana 57366 2.2 0.5 1248108 41256 ? Ssl 10:27 0:00 /usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pidfile=/var/run/grafana/grafana-server.pid --packaging=rpm cfg:default.paths.logs=/var/log/grafana cfg:default.paths.data=/var/lib/grafana cfg:default.paths.plugins=/var/lib/grafana/plugins cfg:default.paths.provisioning=/etc/grafana/provisioning
root 57605 0.0 0.0 112828 980 pts/0 S+ 10:27 0:00 grep --color=auto grafana

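Open http://10.80.10.1:3000/ in a browser:

username: admin
password: admin

The first login forces a password reset; the new password is 123456.

The Grafana home page opens.

Add a Prometheus data source; choose Prometheus as the type:

Configuration -> Data Sources -> Add data source

Enter the URL, then click Save & Test. If you use the node's IP address instead of localhost, basic authentication must be configured as well:

Name: new_prometheus
URL: http://localhost:9090

Create a dashboard and save it with the button in the upper right corner:

Create -> Dashboard -> Save dashboard

Dashboard name: system

Create a folder named new folder:

Dashboards -> Manage -> New Folder

Folder name: new folder

Move the dashboard into new folder:

Search -> General -> system -> Dashboard settings -> Save dashboard -> Save

Folder: new folder
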
Return to the system dashboard and add a graph: Add new panel -> Edit

A:
  Metrics: node_load1
B:
  Metrics: node_load5
C:
  Metrics: node_load15

Setting:
  Panel title: system load
Display:
  Point: checked
  Point Radius: 1
Axes:
  Left Y:
    Unit: Misc -> none
Legend:
  As Table: checked
  Max: checked
  Avg: checked
  Current: checked
Time range: Last 1 hour

Click Apply, then press Ctrl+S to save.

Displaying CPU, memory, disk, and traffic in Grafana

Duplicate the panel and edit the copy; save after every Apply:

system load dropdown -> More -> Duplicate -> Edit

CPU idle rate over two minutes (query B adds user time):

A:
  Metrics: rate(node_cpu_seconds_total{mode="idle"}[2m])
B:
  Metrics: rate(node_cpu_seconds_total{mode="user"}[2m])

Setting:
  Panel title: system cpu
Axes:
  Left Y:
    Unit: Misc -> Percent (0.0-1.0)

Available memory size:

A:
  Metrics: node_memory_MemAvailable_bytes

Setting:
  Panel title: system memory size
Axes:
  Left Y:
    Unit: Data -> bytes(IEC)

Available memory as a percentage:

A:
  Metrics: node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes

Setting:
  Panel title: system memory percent
Axes:
  Left Y:
    Unit: Misc -> Percent (0.0-1.0)

Free disk space ratio on / (disk usage is 1 minus this value):

A:
  Metrics: node_filesystem_free_bytes{mountpoint="/"}/node_filesystem_size_bytes

Setting:
  Panel title: system disk
Axes:
  Left Y:
    Unit: Misc -> Percent (0.0-1.0)

Network traffic:

A:
  Metrics: rate(node_network_receive_bytes_total{device="ens33"}[2m])*8
  Legend: traffic in: {{instance}} {{device}}
B:
  Metrics: rate(node_network_transmit_bytes_total{device="ens33"}[2m])*8
  Legend: traffic out: {{instance}} {{device}}

Setting:
  Panel title: system traffic
Axes:
  Left Y:
    Unit: Data rate -> bits/sec(IEC)

Set the dashboard refresh interval to 1 minute in the upper right corner. This completes the system dashboard.

Press Ctrl+S to save.

Displaying data from multiple node_exporter instances in Grafana

node2

Download and install node_exporter:

Download page: https://prometheus.io/download/

# cd /usr/local/src/
# wget https://github.com/prometheus/node_exporter/releases/download/v1.1.0/node_exporter-1.1.0.linux-amd64.tar.gz
# tar -xzvf node_exporter-1.1.0.linux-amd64.tar.gz
# mv node_exporter-1.1.0.linux-amd64 /usr/local/node_exporter
# /usr/local/node_exporter/node_exporter --version
node_exporter, version 1.1.0 (branch: HEAD, revision: 0e74fbcd5fe3b98246292829a8e81e3133e17033)
build user: root@c81c7415c0ee
build date: 20210205-22:54:09
go version: go1.15.8
platform: linux/amd64

Write the web configuration file with user student and password 123456:

# vim /usr/local/node_exporter/config.yml
basic_auth_users:
  student: $2y$12$91cgC/lsAg2SiMniOg4w/OaXZN27aYzS3suRsK26fgQxwhHOYpzoK

Manage node_exporter with systemd:

# vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/node_exporter/node_exporter --web.config=/usr/local/node_exporter/config.yml

[Install]
WantedBy=multi-user.target

Start node_exporter and enable it at boot:

# systemctl start node_exporter
# systemctl enable node_exporter
# systemctl status node_exporter

Check the listening port and the process:

# netstat -tlunp | grep node
tcp6 0 0 :::9100 :::* LISTEN 101861/node_exporte
# ps aux | grep node_exporter
root 101861 0.2 0.1 715244 11004 ? Ssl 10:58 0:00 /usr/local/node_exporter/node_exporter --web.config=/usr/local/node_exporter/config.yml
root 102141 0.0 0.0 112824 984 pts/0 S+ 10:58 0:00 grep --color=auto node_exporter

node1

Add the new node to the scrape configuration:

# vim /usr/local/prometheus/prometheus.yml
# line 36: add the new node
    - targets: ['10.80.10.1:9100','10.80.10.2:9100']

# systemctl restart prometheus

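List the scrape targets in Prometheus:

Status -> Targets

Wait for Grafana to plot the new node automatically.

Configuring and using Grafana variables

Open the settings of the system dashboard in Grafana and add a variable:

Dashboard settings -> Variables -> Add variable

Name: instance
Data source: new_prometheus
Refresh: On Time Range Change
Query: label_values(instance)
Multi-value: checked
Include All option: checked

Use the instance variable in the system memory size panel:

A:
  Metrics: node_memory_MemAvailable_bytes{instance=~"$instance"}

The variable value can be chosen in the upper left corner; select All.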

Add another variable:

Dashboard settings -> Variables -> New

Name: job
Data source: new_prometheus
Refresh: On Time Range Change
Query: label_values(job)
Multi-value: checked
Include All option: checked

Use the job variable in the system memory size panel; only the node exporter job returns data:

A:
  Metrics: node_memory_MemAvailable_bytes{instance=~"$instance",job=~"$job"}
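
The =~ matcher is needed because a multi-value variable expands to a regular expression rather than a single string. The Python sketch below approximates that expansion; the exact escaping Grafana applies is an assumption here, and multi_value_regex and matches are hypothetical helper names.

```python
import re
from typing import List


def multi_value_regex(values: List[str]) -> str:
    """Join the selected values into one alternation, escaping each value."""
    return "(" + "|".join(re.escape(v) for v in values) + ")"


def matches(label_value: str, pattern: str) -> bool:
    """Prometheus regex matchers are fully anchored, so use a full match."""
    return re.fullmatch(pattern, label_value) is not None


if __name__ == "__main__":
    pattern = multi_value_regex(["10.80.10.1:9100", "10.80.10.2:9100"])
    print(pattern)
    print(matches("10.80.10.2:9100", pattern))
    print(matches("10.80.10.1:91000", pattern))
```

With = instead of =~ the expanded pattern would be compared as a literal string and no series would match once more than one value is selected.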

File-based service discovery with a JSON file

node1

Edit the Prometheus configuration file:

# vim /usr/local/prometheus/prometheus.yml
# line 36: remove the new node
    - targets: ['10.80.10.1:9100']

# systemctl restart prometheus

List the scrape targets in Prometheus:

Status -> Targets

Configure JSON file-based discovery:

# vim /usr/local/prometheus/prometheus.yml
# last 6 lines: delete this job
  - job_name: 'node exporter'
    basic_auth:
      username: student
      password: 123456
    static_configs:
    - targets: ['10.80.10.1:9100']
# last 7 lines: comment out this job
#  - job_name: 'prometheus'
#    static_configs:
#    - targets: ['localhost:9090']
# end of file: add this job
  - job_name: 'node_filediscover'
    basic_auth:
      username: student
      password: 123456
    file_sd_configs:
    - files:
      - '/usr/local/prometheus/node_filediscover.json'
      refresh_interval: 15s

# systemctl restart prometheus

List the hosts in the JSON file; they are discovered automatically:

# vim /usr/local/prometheus/node_filediscover.json
[{
  "targets": ["10.80.10.1:9100","10.80.10.2:9100"]
}]
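
The target file does not have to be edited by hand. As a minimal sketch, the Python snippet below renders the same structure from a list of hosts; file_sd_document is a hypothetical helper, and where the host list comes from (an inventory, a CMDB) is left open.

```python
import json
from typing import List


def file_sd_document(targets: List[str]) -> str:
    """Render a file_sd target group list as JSON text."""
    groups = [{"targets": targets}]
    return json.dumps(groups, indent=2)


if __name__ == "__main__":
    print(file_sd_document(["10.80.10.1:9100", "10.80.10.2:9100"]))
```

Redirecting the output to /usr/local/prometheus/node_filediscover.json is enough; Prometheus re-reads the file on its own, so no restart is needed when the host list changes.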

List the scrape targets in Prometheus:

Status -> Targets

Secure deployment of Consul service discovery behind nginx

Remove the variables from the system memory size panel:

A:
  Metrics: node_memory_MemAvailable_bytes

node1

Download and install Consul:

Download page: https://developer.hashicorp.com/consul/downloads

# cd /usr/local/src/
# wget https://releases.hashicorp.com/consul/1.9.3/consul_1.9.3_linux_amd64.zip
# unzip consul_1.9.3_linux_amd64.zip
# mkdir -p /usr/local/consul
# mv consul /usr/local/consul/
# /usr/local/consul/consul --version
Consul v1.9.3
Revision f55da9306
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Manage Consul with systemd:

# vim /usr/lib/systemd/system/consul.service
[Unit]
Description=consul
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/consul/consul agent -dev -data-dir=/data/consul

[Install]
WantedBy=multi-user.target

Start Consul and enable it at boot:

# systemctl start consul
# systemctl enable consul
# systemctl status consul

Check the listening ports and the process:

# netstat -tlunp | grep consul
tcp 0 0 127.0.0.1:8300 0.0.0.0:* LISTEN 45504/consul
tcp 0 0 127.0.0.1:8301 0.0.0.0:* LISTEN 45504/consul
tcp 0 0 127.0.0.1:8302 0.0.0.0:* LISTEN 45504/consul
tcp 0 0 127.0.0.1:8500 0.0.0.0:* LISTEN 45504/consul
tcp 0 0 127.0.0.1:8502 0.0.0.0:* LISTEN 45504/consul
tcp 0 0 127.0.0.1:8600 0.0.0.0:* LISTEN 45504/consul
udp 0 0 127.0.0.1:8301 0.0.0.0:* 45504/consul
udp 0 0 127.0.0.1:8302 0.0.0.0:* 45504/consul
udp 0 0 127.0.0.1:8600 0.0.0.0:* 45504/consul
# ps aux | grep consul
root 45504 1.8 0.5 785168 40548 ? Ssl 12:01 0:00 /usr/local/consul/consul agent -dev -data-dir=/data/consul
root 45845 0.0 0.0 112824 976 pts/0 S+ 12:01 0:00 grep --color=auto consul

Configure the nginx reverse proxy with authentication:

# vim /etc/nginx/nginx.conf
# line 52: add this configuration
upstream consul {
    server 127.0.0.1:8500;
}
server {
    listen 18500;
    root /usr/share/nginx/html;

    include /etc/nginx/default.d/*.conf;

    location / {
        auth_basic "consul auth";
        auth_basic_user_file /etc/nginx/htpasswd;
        proxy_pass http://consul;
    }
}

# systemctl restart nginx

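Open http://10.80.10.1:18500/ in a browser:

Username: student
Password: studentpwd

The Consul home page opens.

node1

Register a host by name and IP through the local address; no credentials are needed:
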
# curl -X PUT -d '{"id": "node1","name":"node1","address": "10.80.10.1","port": 9100}' http://127.0.0.1:8500/v1/agent/service/register

node2

Register a host by name through the external IP; credentials are required:

# curl -X PUT -d '{"id": "node2","name":"node2","address": "10.80.10.2","port": 9100}' http://student:studentpwd@10.80.10.1:18500/v1/agent/service/register

node1

Deregister the services:

# curl -X PUT http://127.0.0.1:8500/v1/agent/service/deregister/node1
# curl -X PUT http://127.0.0.1:8500/v1/agent/service/deregister/node2
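
The same registration call can be scripted. The Python sketch below builds the PUT request with the payload used in the curl commands above but does not send it; build_register_request is a hypothetical helper name.

```python
import json
import urllib.request


def build_register_request(base_url: str, service_id: str,
                           address: str, port: int) -> urllib.request.Request:
    """Build (but do not send) the PUT request that registers a service."""
    payload = {"id": service_id, "name": service_id,
               "address": address, "port": port}
    return urllib.request.Request(
        url=base_url + "/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
    )


if __name__ == "__main__":
    req = build_register_request("http://127.0.0.1:8500", "node1", "10.80.10.1", 9100)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```

When going through the nginx proxy on port 18500 instead of the local address, the request additionally needs an Authorization header with the student credentials.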

Discovering monitored hosts with Prometheus and Consul

node1

Update the configuration:

# vim /usr/local/prometheus/prometheus.yml
# last 8 lines: delete this job
  - job_name: 'node_filediscover'
    basic_auth:
      username: student
      password: 123456
    file_sd_configs:
    - files:
      - '/usr/local/prometheus/node_filediscover.json'
      refresh_interval: 15s
# end of file: add this job
  - job_name: 'node_consul'
    basic_auth:
      username: student
      password: 123456
    consul_sd_configs:
    - server: '127.0.0.1:8500'
      services: []

# systemctl restart prometheus

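List the current scrape targets. By default Consul itself shows up as a target on local port 8300 and its scrape fails; this can be ignored:

Status -> Targets

Register a host by name and IP; when going through the external IP, the username and password are required:
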
# curl -X PUT -d '{"id": "node1","name":"node1","address": "10.80.10.1","port": 9100}' http://127.0.0.1:8500/v1/agent/service/register

Clear the Prometheus data, because the graphs are cluttered by the earlier jobs:

# systemctl stop prometheus
# \rm -rf /data/prometheus/*
# systemctl start prometheus
# systemctl status prometheus

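Wait for the graphs to be redrawn.

Configuring email alerts in Alertmanager

The mail account must have SMTP enabled:

QQ Mail -> Settings -> Account -> enable the POP3/SMTP service -> generate an authorization code

node1

Download and install Alertmanager:

Download page: https://prometheus.io/download/
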
# cd /usr/local/src/
# wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
# tar -xzvf alertmanager-0.21.0.linux-amd64.tar.gz
# mv alertmanager-0.21.0.linux-amd64 /usr/local/alertmanager
# /usr/local/alertmanager/alertmanager --version
alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
build user: root@dee35927357f
build date: 20200617-08:54:02
go version: go1.14.4

Configure email alerts in Alertmanager; the SMTP password is the authorization code:

# vim /usr/local/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: '2794080722@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '2794080722@qq.com'
  smtp_auth_password: 'oonywsznoinedfhc'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: '2794080722@qq.com'
    send_resolved: true
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname','dev','instance']

Manage Alertmanager with systemd:

# vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --cluster.listen-address=127.0.0.1:9094 --web.listen-address=127.0.0.1:9093

[Install]
WantedBy=multi-user.target

Start Alertmanager and enable it at boot:

# systemctl start alertmanager
# systemctl enable alertmanager
# systemctl status alertmanager

Check the listening ports and the process:

# netstat -tlunp | grep alertmanager
tcp 0 0 127.0.0.1:9093 0.0.0.0:* LISTEN 88835/alertmanager
tcp 0 0 127.0.0.1:9094 0.0.0.0:* LISTEN 88835/alertmanager
udp 0 0 127.0.0.1:9094 0.0.0.0:* 88835/alertmanager
# ps aux | grep alertmanager
root 88835 0.9 0.2 724592 21136 ? Ssl 12:38 0:00 /usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --cluster.listen-address=127.0.0.1:9094 --web.listen-address=127.0.0.1:9093
root 89043 0.0 0.0 112828 976 pts/2 S+ 12:38 0:00 grep --color=auto alertmanager

Add Alertmanager as a notification channel in Grafana:

Alerting -> Notification channels -> Add channel

Name: alertmanager
Type: Prometheus Alertmanager
Url: http://127.0.0.1:9093

Click Test, then Save, and wait for the email.

A second email arrives five minutes later.

Configuring email alerts with Grafana and Alertmanager

Configure an alert on the system load panel in Grafana:

Rule:
  For: 30
Conditions:
  WHEN: last()
  OF: query(A,1m,now)
  IS ABOVE: 0.02
No Data & Error Handling:
  If no data or all values are null: Keep Last State
Notifications:
  Send to: alertmanager

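After the alert fires, raise the threshold to 1.

WeCom (Enterprise WeChat) alerts in Alertmanager

WeCom alerts require the sender's address to be added as a trusted enterprise IP, so a public IP is needed to send alerts.

Download the WeCom app and register:

Website: https://work.weixin.qq.com/

App Management -> Create App

App name: prometheus
App description: prometheus
Visible to: everyone

Record the AgentId and Secret.

To set the trusted IP, see this video: https://www.bilibili.com/video/BV11G4y1M7cV/

node1

Create the WeChat alert template:
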
# vim /usr/local/alertmanager/wechat.tmpl
{{ define "wechat.default.message" }} {{ range .Alerts }}
{{ .Status }}
{{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}
{{ .Labels }}
{{ end }} {{ end }}

Configure WeChat alerts in Alertmanager:

# vim /usr/local/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
templates:
- '/usr/local/alertmanager/wechat.tmpl'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'wechat'
receivers:
- name: 'wechat'
  wechat_configs:
  - corp_id: 'ww0f8bd660940ef774'
    to_party: '1'
    agent_id: '1000006'
    api_secret: 'NWKj9MUV9EenNWHjGc50WzFhSqiwhfajT7aLUvKoiK8'
    to_user: 'ZhangQiaoQi'
    send_resolved: true
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname','dev','instance']

  • corp_id: the enterprise ID.

# systemctl restart alertmanager
# systemctl status alertmanager

Send a test alert from Grafana; the message arrives in WeCom:

Alerting -> Notification channels -> alertmanager -> Test

Configure the system load alert in Grafana; after it fires, raise the threshold to 1:

Rule:
  For: 30
Conditions:
  WHEN: last()
  OF: query(A,1m,now)
  IS ABOVE: 0.02
No Data & Error Handling:
  If no data or all values are null: Keep Last State
Notifications:
  Send to: alertmanager

Alerting with Prometheus and Alertmanager

node1

Enable Alertmanager in Prometheus:

# vim /usr/local/prometheus/prometheus.yml
# line 12: uncomment and set the address
      - 127.0.0.1:9093
# lines 16-17: change the rule files
  - "rules/*_rules.yml"
  - "rules/*_alerts.yml"

# systemctl restart prometheus
# systemctl status prometheus

Create the alert rules:

# mkdir -p /usr/local/prometheus/rules
# cd /usr/local/prometheus/rules/
# vim test_alerts.yml
groups:
- name: student_system_alert
  rules:
  - alert: system load5 alert
    expr: node_load5 > 0.01
    for: 1m
  - alert: system load15 alert
    expr: node_load15 > 0.1
    for: 1m

# systemctl restart prometheus
# systemctl status prometheus
# dd if=/dev/zero of=/dev/null
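
The for: 1m clause means the expression must stay true for a full minute before the alert is sent. The Python sketch below models that state machine under simplifying assumptions; alert_states, the evaluation times, and the sample values are made up for illustration and this is not Prometheus's code.

```python
from typing import List, Optional, Tuple


def alert_states(samples: List[Tuple[float, float]], threshold: float,
                 hold_seconds: float) -> List[str]:
    """Return inactive/pending/firing for each (timestamp, value) evaluation."""
    states = []
    active_since: Optional[float] = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp  # condition just became true
            held = timestamp - active_since
            states.append("firing" if held >= hold_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Hypothetical node_load5 values evaluated every 30 seconds.
    evaluations = [(0, 0.0), (30, 0.02), (60, 0.02), (90, 0.02), (120, 0.0)]
    print(alert_states(evaluations, threshold=0.01, hold_seconds=60))
```

A short spike that drops back below the threshold before the hold time has passed never leaves the pending state, so no notification is sent for it.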

Two simple alert notifications arrive.

Relax the alert rules so the alerts resolve:

# vim test_alerts.yml
groups:
- name: student_system_alert
  rules:
  - alert: system load5 alert
    expr: node_load5 > 0.5
    for: 1m
  - alert: system load15 alert
    expr: node_load15 > 1
    for: 1m

# systemctl restart prometheus

Custom load monitoring with a shell script and node_exporter

Collect load1, load5, and load15 yourself; they are the three load averages printed by uptime.

node1

Write the script:

# vim /usr/local/node_exporter/myself.sh
load1=$(uptime | awk '{print $(NF-2)}' | sed 's/,//')
load5=$(uptime | awk '{print $(NF-1)}' | sed 's/,//')
load15=$(uptime | awk '{print $(NF)}' | sed 's/,//')
echo myload1 $load1
echo myload5 $load5
echo myload15 $load15
# bash /usr/local/node_exporter/myself.sh
myload1 0.05
myload5 0.09
myload15 0.24

Create the data directory:

# mkdir /data/node_exporter

Run the script from crontab once per minute:

# crontab -e
* * * * * /bin/bash /usr/local/node_exporter/myself.sh > /data/node_exporter/myself.prom
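
The cron job redirects straight into myself.prom, so the exporter could in principle read the file while it is half written. A common remedy is to write to a temporary file and rename it into place. The Python sketch below shows that pattern for the same three metrics; format_load_metrics and write_atomically are hypothetical helper names, and the demo writes to a throwaway directory rather than /data/node_exporter.

```python
import os
import tempfile


def format_load_metrics(load1: float, load5: float, load15: float) -> str:
    """Render the three load averages in the same text format as myself.sh."""
    return "myload1 {}\nmyload5 {}\nmyload15 {}\n".format(load1, load5, load15)


def write_atomically(path: str, text: str) -> None:
    """Write to a temporary file in the same directory, then rename it."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    # The rename is atomic on POSIX when both paths are on one filesystem.
    os.replace(tmp_path, path)


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()  # stand-in for /data/node_exporter
    target = os.path.join(out_dir, "myself.prom")
    write_atomically(target, format_load_metrics(*os.getloadavg()))
    with open(target) as handle:
        print(handle.read())
```

The temporary file uses a .tmp suffix so that the textfile collector, which only reads *.prom files, never picks it up.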

Update the node_exporter unit file to add the textfile collector directory:

# vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/node_exporter/node_exporter --web.config=/usr/local/node_exporter/config.yml --collector.textfile.directory=/data/node_exporter

[Install]
WantedBy=multi-user.target
# systemctl daemon-reload
# systemctl restart node_exporter
# systemctl status node_exporter

Add a graph in Grafana:

A:
  Metrics: myload1
B:
  Metrics: myload5
C:
  Metrics: myload15

Setting:
  Panel title: system self load

Custom MySQL monitoring with a shell script and node_exporter

node1

Install MySQL (MariaDB):

# yum install -y mariadb-server
# systemctl start mariadb
# systemctl enable mariadb
# systemctl status mariadb

Grant monitoring privileges. MySQL 8 requires creating the user first and then granting; the older 5.5 can grant directly. User: myuser, password: my_test

# mysql -A

MariaDB [(none)]> grant usage,REPLICATION CLIENT on *.* to 'myuser'@'127.0.0.1' identified by 'my_test';
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> flush privileges;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> quit

Test the login:

# mysql -h127.0.0.1 -umyuser -pmy_test -A -e "show global status" 2> /dev/null

Filter the status output:

# mysql -h127.0.0.1 -umyuser -pmy_test -A -e "show global status" 2> /dev/null | grep -i select
Com_insert_select 0
Com_replace_select 0
Com_select 3
Select_full_join 0
Select_full_range_join 0
Select_range 0
Select_range_check 0
Select_scan 2

Collect the MySQL metrics; a label is added in the next step:

# mysql -h127.0.0.1 -umyuser -pmy_test -A -e "show global status" 2> /dev/null | egrep -i '^Com_(select|update|insert|delete)\s|^Bytes_'
Bytes_received 753
Bytes_sent 23391
Com_delete 0
Com_insert 0
Com_select 4
Com_update 0
  • Bytes_received: inbound traffic.

  • Bytes_sent: outbound traffic.

Add a label to each metric with sed:

# mysql -h127.0.0.1 -umyuser -pmy_test -A -e "show global status" 2> /dev/null | egrep -i '^Com_(select|update|insert|delete)\s|^Bytes_' | sed 's/\s/{mytest="127.0.0.1"} /'
Bytes_received{mytest="127.0.0.1"} 904
Bytes_sent{mytest="127.0.0.1"} 34847
Com_delete{mytest="127.0.0.1"} 0
Com_insert{mytest="127.0.0.1"} 0
Com_select{mytest="127.0.0.1"} 5
Com_update{mytest="127.0.0.1"} 0

Extend the script with the database metrics:

# vim /usr/local/node_exporter/myself.sh
# end of file: add this line
mysql -h127.0.0.1 -umyuser -pmy_test -A -e "show global status" 2> /dev/null | egrep -i '^Com_(select|update|insert|delete)\s|^Bytes_' | sed 's/\s/{mytest="127.0.0.1"} /'

# bash /usr/local/node_exporter/myself.sh
myload1 0.29
myload5 0.29
myload15 0.25
Bytes_received{mytest="127.0.0.1"} 1055
Bytes_sent{mytest="127.0.0.1"} 46307
Com_delete{mytest="127.0.0.1"} 0
Com_insert{mytest="127.0.0.1"} 0
Com_select{mytest="127.0.0.1"} 6
Com_update{mytest="127.0.0.1"} 0

Save the dashboard as mysql, delete every panel except system traffic, and build the other panels from the traffic panel:

Dashboard settings -> Save As

Dashboard name: mysql

MySQL traffic:

A:
  Metrics: rate(Bytes_received[2m])*8
  Legend: traffic in:{{mytest}}
B:
  Metrics: rate(Bytes_sent[2m])*8
  Legend: traffic out:{{mytest}}

Setting:
  Panel title: my traffic

MySQL operations (Com_delete and Com_insert):

A:
  Metrics: rate(Com_delete[2m])
  Legend: {{mytest}}:delete
B:
  Metrics: rate(Com_insert[2m])
  Legend: {{mytest}}:insert

Setting:
  Panel title: my operation
Axes:
  Left Y:
    Unit: Misc -> none

Create test data in the database. While rows are inserted and deleted continuously, once per second, the graph values should approach 1:

# mysql -A

MariaDB [(none)]> use test;
Database changed

MariaDB [test]> create table test(id int);
Query OK, 0 rows affected (0.01 sec)

MariaDB [test]> quit
Bye
# while true; do mysql -A -e "insert into test.test values(1); select * from test.test; delete from test.test;"; sleep 1; done

Set the time range to the last 5 minutes and wait for the graph.

Press Ctrl+C to stop the MySQL insert loop.

Custom MySQL monitoring with a Python script and node_exporter

node1

Install Python 3:

# yum install -y python36
# pip3 install pymysql==1.0.2 -i https://mirrors.aliyun.com/pypi/simple/

Write the Python script:

# vim /usr/local/node_exporter/myself.py
import pymysql

server = '127.0.0.1'
port = '3306'
# Status variables to export as metrics.
wanted = ['Threads_running', 'Com_select', 'Com_update', 'Com_delete',
          'Com_insert', 'Connections', 'Bytes_received', 'Bytes_sent']

conn = pymysql.connect(host=server, port=int(port), user="myuser", password="my_test")
cur = conn.cursor(pymysql.cursors.DictCursor)
cur.execute('show global status')
for oneresult in cur.fetchall():
    if oneresult['Variable_name'] in wanted:
        # The "py" prefix keeps these apart from the shell script's metrics.
        print('py{}{{mytest="{}"}} {}'.format(oneresult['Variable_name'], server, oneresult['Value']))
conn.close()
# python3 /usr/local/node_exporter/myself.py
pyBytes_received{mytest="127.0.0.1"} 43881
pyBytes_sent{mytest="127.0.0.1"} 365508
pyCom_delete{mytest="127.0.0.1"} 200
pyCom_insert{mytest="127.0.0.1"} 200
pyCom_select{mytest="127.0.0.1"} 430
pyCom_update{mytest="127.0.0.1"} 0
pyConnections{mytest="127.0.0.1"} 232
pyThreads_running{mytest="127.0.0.1"} 1

Call the Python script from the collection script and wait for data:

# vim /usr/local/node_exporter/myself.sh
# end of file: add this line
python3 /usr/local/node_exporter/myself.py

Run the data loop again:

# while true; do mysql -A -e "insert into test.test values(1); select * from test.test; delete from test.test;"; sleep 1; done

Monitor the number of running threads (Threads_running) collected by the Python script; it is currently 1:

A:
  Metrics: pyThreads_running

Setting:
  Panel title: my connect python
Axes:
  Left Y:
    Unit: Misc -> none

Monitor the delete and insert operations collected by the Python script:

A:
  Metrics: rate(pyCom_delete[2m])
  Legend: {{mytest}}:delete
B:
  Metrics: rate(pyCom_insert[2m])
  Legend: {{mytest}}:insert

Setting:
  Panel title: my operation python
Axes:
  Left Y:
    Unit: Misc -> none

This completes the mysql dashboard.

Press Ctrl+C to stop the MySQL insert loop.

Custom Redis monitoring with a shell script and node_exporter

node1

Stop the MySQL database:

# systemctl stop mariadb
# systemctl disable mariadb

Install Redis:

# yum install -y redis

Edit the Redis configuration and set the password foobared:

# vim /etc/redis.conf
# line 480: uncomment
requirepass foobared

# systemctl start redis
# systemctl enable redis
# systemctl status redis

Collect the Redis info. Writing the output as-is causes errors, so the *_human lines are filtered out with | grep -v human:

# vim /usr/local/node_exporter/myself.sh
# lines 7-8: comment out
# end of file: add this job
redis-cli -h 127.0.0.1 -a foobared info | egrep '^(used_cpu|used_memory|total_net_)' | sed 's/:/ /g' | sed 's/\s$//g' | sed 's/\s/{myhost="127.0.0.1"} /' | grep -v human

# redis-cli -h 127.0.0.1 -a foobared info | egrep '^(used_cpu|used_memory|total_net_)' | sed 's/:/ /g' | sed 's/\s$//g' | sed 's/\s/{myhost="127.0.0.1"} /' | grep -v human
used_memory{myhost="127.0.0.1"} 813464
used_memory_rss{myhost="127.0.0.1"} 5885952
used_memory_peak{myhost="127.0.0.1"} 813464
used_memory_lua{myhost="127.0.0.1"} 37888
total_net_input_bytes{myhost="127.0.0.1"} 84
total_net_output_bytes{myhost="127.0.0.1"} 2143
used_cpu_sys{myhost="127.0.0.1"} 0.07
used_cpu_user{myhost="127.0.0.1"} 0.05
used_cpu_sys_children{myhost="127.0.0.1"} 0.00
used_cpu_user_children{myhost="127.0.0.1"} 0.00
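
The sed pipeline above can also be expressed in Python, in the style of the earlier myself.py. The sketch below is one possible version under stated assumptions: redis_info_to_metrics is a hypothetical helper, it takes the raw text of redis-cli info as input, and instead of filtering on the word human it drops any value that is not a number.

```python
def redis_info_to_metrics(info_text: str, host: str) -> str:
    """Convert selected `redis-cli info` lines into labeled metric lines."""
    prefixes = ("used_cpu", "used_memory", "total_net_")
    lines = []
    for raw in info_text.splitlines():
        raw = raw.strip()  # the info output ends lines with \r\n
        if not raw.startswith(prefixes) or ":" not in raw:
            continue
        name, value = raw.split(":", 1)
        try:
            float(value)  # skip non-numeric values such as used_memory_human
        except ValueError:
            continue
        lines.append('{}{{myhost="{}"}} {}'.format(name, host, value))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "# Memory\r\nused_memory:813464\r\nused_memory_human:794.40K\r\nused_cpu_sys:0.07\r\n"
    print(redis_info_to_metrics(sample, "127.0.0.1"), end="")
```

Checking that each value parses as a number keeps any other non-numeric field out of the .prom file, not only the ones whose name contains human.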

Save the dashboard as redis, delete every panel except my traffic, and build the other panels from the traffic panel:

Dashboard settings -> Save As

Dashboard name: redis

Redis traffic:

A:
  Metrics: rate(total_net_input_bytes[2m])*8
  Legend: traffic in:{{myhost}}
B:
  Metrics: rate(total_net_output_bytes[2m])*8
  Legend: traffic out:{{myhost}}

Setting:
  Panel title: my redis traffic

Redis CPU usage:

A:
  Metrics: used_cpu_sys
B:
  Metrics: used_cpu_user

Setting:
  Panel title: my redis cpu
Axes:
  Left Y:
    Unit: Misc -> Percent (0.0-1.0)

This completes the redis dashboard.