Prometheus System Monitoring

Software        Version
prometheus      2.24.1
node_exporter   1.1.0
grafana         7.3.4
consul          1.9.3
alertmanager    0.21.0

Node   IP          OS         Roles                                                      CPU      Memory  Disk
node1  10.80.10.1  centos7.9  prometheus, node_exporter, grafana, consul, alertmanager  4 cores  8GB     20GB
node2  10.80.10.2  centos7.9  node_exporter                                              4 cores  8GB     20GB
Prometheus binary installation

prometheus:

A time-series monitoring database written in Go.

Easy to install and use: it ships as a prebuilt binary.

node1

Download and install prometheus:

Download page: https://prometheus.io/download/

    # cd /usr/local/src/
    # wget https://github.com/prometheus/prometheus/releases/download/v2.24.1/prometheus-2.24.1.linux-amd64.tar.gz
    # tar -xzvf prometheus-2.24.1.linux-amd64.tar.gz
    # mv prometheus-2.24.1.linux-amd64 /usr/local/prometheus
    # /usr/local/prometheus/prometheus --version
    prometheus, version 2.24.1 (branch: HEAD, revision: e4487274853c587717006eeda8804e597d120340)
      build user:       root@0b5231a0de0f
      build date:       20210120-00:09:36
      go version:       go1.15.6
      platform:         linux/amd64

Change the configuration to scrape every 60 seconds:

    # vim /usr/local/prometheus/prometheus.yml
    # line 3, change the setting
      scrape_interval: 60s

Manage prometheus with systemctl:

    # vim /usr/lib/systemd/system/prometheus.service
    [Unit]
    Description=prometheus
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/data/prometheus --web.listen-address=127.0.0.1:9090

    [Install]
    WantedBy=multi-user.target

Start prometheus and enable it at boot:

    # systemctl start prometheus
    # systemctl enable prometheus
    # systemctl status prometheus

Check the port and process:

    # netstat -tlunp | grep prometheus
    tcp        0      0 127.0.0.1:9090          0.0.0.0:*               LISTEN      12402/prometheus
    # ps aux | grep prometheus
    root      12402  0.5  0.5 770724 45188 ?        Ssl  09:48   0:00 /usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/data/prometheus --web.listen-address=127.0.0.1:9090
    root      12787  0.0  0.0 112824   980 pts/0    S+   09:48   0:00 grep --color=auto prometheus

Check the data directory:

    # ls /data/prometheus/
    chunks_head  lock  queries.active  wal
Prometheus + nginx basic authentication

node1

Download and install nginx:

Edit the nginx unit file:

    # vim /usr/lib/systemd/system/nginx.service
    # lines 16-19, delete these settings
    KillSignal=SIGQUIT
    TimeoutStopSec=5
    KillMode=process
    PrivateTmp=true

Edit the nginx configuration file:

    # vim /etc/nginx/nginx.conf
    # line 37, add the proxy upstream
        upstream prometheus {
            server 127.0.0.1:9090;
        }
    # rewrite the server{} block
        server {
            listen       19090;
            root         /usr/share/nginx/html;
            include /etc/nginx/default.d/*.conf;
            location / {
                auth_basic "prometheus auth";
                auth_basic_user_file /etc/nginx/htpasswd;
                proxy_pass http://prometheus;
            }
        }

Create the password file with user student and password studentpwd:

    # printf "student:$(openssl passwd -1 studentpwd)\n" > /etc/nginx/htpasswd

Start nginx and enable it at boot:

    # nginx -t
    # systemctl start nginx
    # systemctl enable nginx
    # systemctl status nginx

Check the port and processes:

    # netstat -tlunp | grep nginx
    tcp        0      0 0.0.0.0:19090           0.0.0.0:*               LISTEN      15892/nginx: master
    # ps aux | grep nginx
    root      15892  0.0  0.0  39308  1056 ?        Ss   09:51   0:00 nginx: master process /usr/sbin/nginx
    nginx     15893  0.0  0.0  39696  1824 ?        S    09:51   0:00 nginx: worker process
    nginx     15894  0.0  0.0  39696  1824 ?        S    09:51   0:00 nginx: worker process
    nginx     15895  0.0  0.0  39696  1824 ?        S    09:51   0:00 nginx: worker process
    nginx     15896  0.0  0.0  39696  1560 ?        S    09:51   0:00 nginx: worker process
    root      16141  0.0  0.0 112824   976 pts/0    S+   09:51   0:00 grep --color=auto nginx

Open in a browser: http://10.80.10.1:19090/

    Username: student
    Password: studentpwd

The home page opens:

Open the scrape endpoint in a browser: http://10.80.10.1:19090/metrics

Query monitoring data: go_gc_duration_seconds

Query monitoring data: go_memstats_heap_objects

Click Graph to display the chart. The timestamps look off because UTC is used by default. This does not affect usage, and you can tick Use local time to convert them:

View the current targets:

Status -> Targets

Disable scraping of /metrics:

    # vim /usr/local/prometheus/prometheus.yml
    # last 7 lines, comment out the job
    #  - job_name: 'prometheus'
    #    static_configs:
    #    - targets: ['localhost:9090']

    # systemctl restart prometheus
    # systemctl status prometheus

No targets are monitored now:

Restore the configuration to re-enable scraping of /metrics:

    # vim /usr/local/prometheus/prometheus.yml
    # last 7 lines, uncomment the job
      - job_name: 'prometheus'
        static_configs:
        - targets: ['localhost:9090']

    # systemctl restart prometheus
    # systemctl status prometheus

Wait for the target to be added back automatically:
node_exporter system monitoring

node1

Download and install node_exporter:

Download page: https://prometheus.io/download/

    # cd /usr/local/src/
    # wget https://github.com/prometheus/node_exporter/releases/download/v1.1.0/node_exporter-1.1.0.linux-amd64.tar.gz
    # tar -xzvf node_exporter-1.1.0.linux-amd64.tar.gz
    # mv node_exporter-1.1.0.linux-amd64 /usr/local/node_exporter
    # /usr/local/node_exporter/node_exporter --version
    node_exporter, version 1.1.0 (branch: HEAD, revision: 0e74fbcd5fe3b98246292829a8e81e3133e17033)
      build user:       root@c81c7415c0ee
      build date:       20210205-22:54:09
      go version:       go1.15.8
      platform:         linux/amd64

Generate a bcrypt password hash for the password 123456:

    # yum install -y httpd-tools
    # htpasswd -nBC 12 ''
    New password:
    Re-type new password:
    :$2y$12$91cgC/lsAg2SiMniOg4w/OaXZN27aYzS3suRsK26fgQxwhHOYpzoK

Write the web configuration file with user student and password 123456:

    # vim /usr/local/node_exporter/config.yml
    basic_auth_users:
      student: $2y$12$91cgC/lsAg2SiMniOg4w/OaXZN27aYzS3suRsK26fgQxwhHOYpzoK

Manage node_exporter with systemctl:

    # vim /usr/lib/systemd/system/node_exporter.service
    [Unit]
    Description=node_exporter
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/local/node_exporter/node_exporter --web.config=/usr/local/node_exporter/config.yml

    [Install]
    WantedBy=multi-user.target

Start node_exporter and enable it at boot:

    # systemctl start node_exporter
    # systemctl enable node_exporter
    # systemctl status node_exporter

Check the port and process:

    # netstat -tlunp | grep node
    tcp6       0      0 :::9100                 :::*                    LISTEN      32107/node_exporter
    # ps aux | grep node_exporter
    root      32107  0.0  0.1 716908 11264 ?        Ssl  10:05   0:00 /usr/local/node_exporter/node_exporter --web.config=/usr/local/node_exporter/config.yml
    root      32343  0.0  0.0 112824   984 pts/0    S+   10:05   0:00 grep --color=auto node_exporter

Open in a browser: http://10.80.10.1:9100/

The home page opens:

Click Metrics to view the monitoring data:

node1

Add a scrape job to prometheus, specifying the username and password:

    # vim /usr/local/prometheus/prometheus.yml
    # last line, add the scrape job
      - job_name: 'node exporter'
        basic_auth:
          username: student
          password: 123456
        static_configs:
        - targets: ['10.80.10.1:9100']

    # systemctl restart prometheus
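To make the basic_auth setting more concrete, here is a minimal Python sketch of the Authorization header that an HTTP client sends for the student / 123456 credentials used above. It only illustrates the HTTP Basic scheme. The helper name basic_auth_header is hypothetical, and this is not a description of how Prometheus builds the header internally.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return the value of an HTTP Basic Authorization header."""
    # HTTP Basic auth is base64("username:password") prefixed with "Basic ".
    token = base64.b64encode("{}:{}".format(username, password).encode("utf-8"))
    return "Basic " + token.decode("ascii")


if __name__ == "__main__":
    # Credentials taken from the scrape job above.
    print(basic_auth_header("student", "123456"))
```

The header is only encoded, not encrypted, so anyone who can read the traffic can recover the password unless the connection also uses TLS.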
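View the current targets:

Status -> Targets

Query monitoring data: node_load1. Display the graph and set the time range to 5 minutes:

node1

Raise the server load with the following command and wait for the graph to change:

    # dd if=/dev/zero of=/dev/null

Cancel the dd command after the test.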
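Common Prometheus monitoring functions

Wait for the server load to settle, then open in a browser: http://10.80.10.1:19090/

One-minute load: node_load1

Available memory: node_memory_MemAvailable_bytes

Total memory (in bytes): node_memory_MemTotal_bytes

Total memory (in MB): node_memory_MemTotal_bytes/1024/1024

Available memory ratio: node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes

CPU (one series per core and mode): node_cpu_seconds_total

CPU filtered by mode: node_cpu_seconds_total{mode="idle"}

CPU filtered by mode and core: node_cpu_seconds_total{mode="idle",cpu="0"}

CPU samples for the last two minutes: node_cpu_seconds_total{mode="idle",cpu="0"}[2m]. The timestamps are in UTC; tick Use local time to convert them:

CPU increase: increase(node_cpu_seconds_total{mode="idle",cpu="0"}[2m]). With a 60s scrape interval the 2m window holds two samples about 60s apart, and the difference is extrapolated to the full window, so the result is roughly (max - min) * 2 = 117.34

CPU per-second rate: rate(node_cpu_seconds_total{mode="idle",cpu="0"}[2m])

Traffic (the metric is in bytes; multiplying by 8 converts it to bits/s): rate(node_network_receive_bytes_total{device="ens33"}[2m])*8

Per-second rate: how the two functions differ when the window holds three or more samples

irate: uses the last two samples, so short peaks are easier to catch.
rate: uses the first and last samples in the window.

View five minutes of raw samples: node_network_receive_bytes_total{device="ens33"}[5m]

Traffic: rate(node_network_receive_bytes_total{device="ens33"}[5m])*8

Traffic: irate(node_network_receive_bytes_total{device="ens33"}[5m])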
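The difference between rate and irate can be sketched in a few lines of Python. This is a simplified model under stated assumptions: it ignores counter resets and the window-boundary extrapolation that Prometheus applies, and the function names simple_rate and simple_irate, as well as the sample values, are made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase between the first and last sample in the window."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def simple_irate(samples: List[Sample]) -> float:
    """Per-second increase between the last two samples in the window."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)


if __name__ == "__main__":
    # Hypothetical byte counter scraped every 60s, with a burst in the last interval.
    window = [(0, 1000), (60, 1600), (120, 2200), (180, 2800), (240, 9400)]
    print(simple_rate(window))   # average over the whole window
    print(simple_irate(window))  # only the most recent interval
```

With these samples the burst in the last interval dominates the irate value, while rate smooths it across the whole window, which is why irate is the one that catches peaks.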
Sum of the two-minute CPU idle rate across cores: sum(rate(node_cpu_seconds_total{mode="idle"}[2m]))

Average of the two-minute CPU idle rate across cores: avg(rate(node_cpu_seconds_total{mode="idle"}[2m]))
Installing grafana and configuring the data source

Download and install grafana:

Download page: https://grafana.com/grafana/download

    # cd /usr/local/src/
    # wget https://dl.grafana.com/oss/release/grafana-7.3.4-1.x86_64.rpm
    # yum localinstall -y grafana-7.3.4-1.x86_64.rpm

Start grafana and enable it at boot:

    # systemctl start grafana-server
    # systemctl enable grafana-server
    # systemctl status grafana-server

Check the port and process:

    # netstat -tlunp | grep grafana
    tcp6       0      0 :::3000                 :::*                    LISTEN      57366/grafana-serve
    # ps aux | grep grafana
    grafana   57366  2.2  0.5 1248108 41256 ?       Ssl  10:27   0:00 /usr/sbin/grafana-server --config=/etc/grafana/grafana.ini --pidfile=/var/run/grafana/grafana-server.pid --packaging=rpm cfg:default.paths.logs=/var/log/grafana cfg:default.paths.data=/var/lib/grafana cfg:default.paths.plugins=/var/lib/grafana/plugins cfg:default.paths.provisioning=/etc/grafana/provisioning
    root      57605  0.0  0.0 112828   980 pts/0    S+   10:27   0:00 grep --color=auto grafana

Open in a browser: http://10.80.10.1:3000/

    username: admin
    password: admin

The first login forces a password reset; the new password is 123456:

The home page opens:

Add a prometheus data source and choose Prometheus:

Configuration -> Data Sources -> Add data source

Enter the URL. If you use the IP address instead of localhost, password authentication must be configured. Then click Save & Test:

    Name: new_prometheus
    URL: http://localhost:9090

Add a dashboard and save it from the top-right corner:

Create -> Dashboard -> Save dashboard

Create a folder named new folder:

Dashboards -> Manage -> New Folder

Move the dashboard into new folder:

Search -> General -> system -> Dashboard settings -> Save dashboard -> Save

Return to the system dashboard and add a graph: Add new panel -> Edit

    A:
      Metrics: node_load1
    B:
      Metrics: node_load5
    C:
      Metrics: node_load15

    Setting:
      Panel title: system load
    Display:
      Point: checked
      Point Radius: 1
    Axes:
      Left Y:
        Unit: Misc -> none
    Legend:
      As Table: checked
      Max: checked
      Avg: checked
      Current: checked
    Time range: Last 1 hour

Click Apply when done, then press Ctrl+S to save.
Showing CPU, memory, disk, and traffic in grafana

Duplicate the existing panel and edit the copy. Save after each Apply:

system load dropdown -> More -> Duplicate -> Edit

Monitor the two-minute CPU idle and user rates:

    A:
      Metrics: rate(node_cpu_seconds_total{mode="idle"}[2m])
    B:
      Metrics: rate(node_cpu_seconds_total{mode="user"}[2m])

    Setting:
      Panel title: system cpu
    Axes:
      Left Y:
        Unit: Misc -> Percent (0.0-1.0)

Monitor available memory size:

    A:
      Metrics: node_memory_MemAvailable_bytes

    Setting:
      Panel title: system memory size
    Axes:
      Left Y:
        Unit: Data -> bytes(IEC)

Monitor available memory as a percentage:

    A:
      Metrics: node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes

    Setting:
      Panel title: system memory percent
    Axes:
      Left Y:
        Unit: Misc -> Percent (0.0-1.0)

Monitor disk space on the root filesystem (this expression yields the free fraction; subtract it from 1 to get the used fraction):

    A:
      Metrics: node_filesystem_free_bytes{mountpoint="/"}/node_filesystem_size_bytes

    Setting:
      Panel title: system disk
    Axes:
      Left Y:
        Unit: Misc -> Percent (0.0-1.0)

Monitor traffic:

    A:
      Metrics: rate(node_network_receive_bytes_total{device="ens33"}[2m])*8
      Legend: traffic in: {{instance}} {{device}}
    B:
      Metrics: rate(node_network_transmit_bytes_total{device="ens33"}[2m])*8
      Legend: traffic out: {{instance}} {{device}}

    Setting:
      Panel title: system traffic
    Axes:
      Left Y:
        Unit: Data rate -> bits/sec(IEC)

Set the dashboard refresh interval to 1 minute in the top-right corner. The finished dashboard looks like this:

Press Ctrl+S to save.
Showing data from multiple node_exporter instances in grafana

node2

Download and install node_exporter:

Download page: https://prometheus.io/download/

    # cd /usr/local/src/
    # wget https://github.com/prometheus/node_exporter/releases/download/v1.1.0/node_exporter-1.1.0.linux-amd64.tar.gz
    # tar -xzvf node_exporter-1.1.0.linux-amd64.tar.gz
    # mv node_exporter-1.1.0.linux-amd64 /usr/local/node_exporter
    # /usr/local/node_exporter/node_exporter --version
    node_exporter, version 1.1.0 (branch: HEAD, revision: 0e74fbcd5fe3b98246292829a8e81e3133e17033)
      build user:       root@c81c7415c0ee
      build date:       20210205-22:54:09
      go version:       go1.15.8
      platform:         linux/amd64

Write the web configuration file with user student and password 123456:

    # vim /usr/local/node_exporter/config.yml
    basic_auth_users:
      student: $2y$12$91cgC/lsAg2SiMniOg4w/OaXZN27aYzS3suRsK26fgQxwhHOYpzoK

Manage node_exporter with systemctl:

    # vim /usr/lib/systemd/system/node_exporter.service
    [Unit]
    Description=node_exporter
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/local/node_exporter/node_exporter --web.config=/usr/local/node_exporter/config.yml

    [Install]
    WantedBy=multi-user.target

Start node_exporter and enable it at boot:

    # systemctl start node_exporter
    # systemctl enable node_exporter
    # systemctl status node_exporter

Check the port and process:

    # netstat -tlunp | grep node
    tcp6       0      0 :::9100                 :::*                    LISTEN      101861/node_exporte
    # ps aux | grep node_exporter
    root     101861  0.2  0.1 715244 11004 ?        Ssl  10:58   0:00 /usr/local/node_exporter/node_exporter --web.config=/usr/local/node_exporter/config.yml
    root     102141  0.0  0.0 112824   984 pts/0    S+   10:58   0:00 grep --color=auto node_exporter

node1

Add the new node to the scrape job:

    # vim /usr/local/prometheus/prometheus.yml
    # line 36, add the new node
        - targets: ['10.80.10.1:9100','10.80.10.2:9100']

    # systemctl restart prometheus

View the targets in prometheus:

Status -> Targets

Wait for grafana to chart the new node automatically:
Configuring and using grafana variables

In the grafana system dashboard, open the settings and add a variable:

Dashboard settings -> Variables -> Add variable

    Name: instance
    Data source: new_prometheus
    Refresh: On Time Range Change
    Query: label_values(instance)
    Multi-value: checked
    Include All option: checked

Apply the instance variable to the system memory size panel:

    A:
      Metrics: node_memory_MemAvailable_bytes{instance=~"$instance"}

The variable value can be chosen in the top-left corner; select All:

Add another variable:

Dashboard settings -> Variables -> New

    Name: job
    Data source: new_prometheus
    Refresh: On Time Range Change
    Query: label_values(job)
    Multi-value: checked
    Include All option: checked

Apply the job variable to the system memory size panel. Data appears only when job is set to node exporter:

    A:
      Metrics: node_memory_MemAvailable_bytes{instance=~"$instance",job=~"$job"}
Prometheus file-based discovery with JSON

node1

Edit the prometheus configuration file:

    # vim /usr/local/prometheus/prometheus.yml
    # line 36, remove the new node
        - targets: ['10.80.10.1:9100']

    # systemctl restart prometheus

View the targets in prometheus:

Status -> Targets

Configure JSON file-based discovery:

    # vim /usr/local/prometheus/prometheus.yml
    # last 6 lines, delete the job
      - job_name: 'node exporter'
        basic_auth:
          username: student
          password: 123456
        static_configs:
        - targets: ['10.80.10.1:9100']
    # last 7 lines, comment out the job
    #  - job_name: 'prometheus'
    #    static_configs:
    #    - targets: ['localhost:9090']
    # last line, add the job
      - job_name: 'node_filediscover'
        basic_auth:
          username: student
          password: 123456
        file_sd_configs:
        - files:
          - '/usr/local/prometheus/node_filediscover.json'
          refresh_interval: 15s

    # systemctl restart prometheus

List the hosts in the JSON file; they are discovered automatically:

    # vim /usr/local/prometheus/node_filediscover.json
    [{
            "targets": ["10.80.10.1:9100","10.80.10.2:9100"]
    }]
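When the host list comes from an inventory rather than being typed by hand, the JSON file can be generated. The following is a minimal Python sketch, assuming the hosts are available as a plain list; the function name build_file_sd and the optional labels argument are illustrative choices, not part of the original setup.

```python
import json
from typing import Dict, List, Optional


def build_file_sd(targets: List[str], labels: Optional[Dict[str, str]] = None) -> str:
    """Render a single target group in the file_sd JSON format."""
    group = {"targets": targets}
    if labels:
        # Labels are optional and get attached to every target in the group.
        group["labels"] = labels
    return json.dumps([group], indent=2)


if __name__ == "__main__":
    hosts = ["10.80.10.1:9100", "10.80.10.2:9100"]
    # Written to the current directory here; the path used above is
    # /usr/local/prometheus/node_filediscover.json.
    with open("node_filediscover.json", "w") as handle:
        handle.write(build_file_sd(hosts) + "\n")
```

Because the job sets refresh_interval: 15s, a regenerated file should be picked up without restarting prometheus.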
View the targets in prometheus:

Status -> Targets
Secure deployment of consul service discovery behind nginx

Remove the variables from the system memory size panel:

    A:
      Metrics: node_memory_MemAvailable_bytes

node1

Download and install consul:

Download page: https://developer.hashicorp.com/consul/downloads

    # cd /usr/local/src/
    # wget https://releases.hashicorp.com/consul/1.9.3/consul_1.9.3_linux_amd64.zip
    # unzip consul_1.9.3_linux_amd64.zip
    # mkdir -p /usr/local/consul
    # mv consul /usr/local/consul/
    # /usr/local/consul/consul --version
    Consul v1.9.3
    Revision f55da9306
    Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Manage consul with systemctl:

    # vim /usr/lib/systemd/system/consul.service
    [Unit]
    Description=consul
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/local/consul/consul agent -dev -data-dir=/data/consul

    [Install]
    WantedBy=multi-user.target

Start consul and enable it at boot:

    # systemctl start consul
    # systemctl enable consul
    # systemctl status consul

Check the ports and process:

    # netstat -tlunp | grep consul
    tcp        0      0 127.0.0.1:8300          0.0.0.0:*               LISTEN      45504/consul
    tcp        0      0 127.0.0.1:8301          0.0.0.0:*               LISTEN      45504/consul
    tcp        0      0 127.0.0.1:8302          0.0.0.0:*               LISTEN      45504/consul
    tcp        0      0 127.0.0.1:8500          0.0.0.0:*               LISTEN      45504/consul
    tcp        0      0 127.0.0.1:8502          0.0.0.0:*               LISTEN      45504/consul
    tcp        0      0 127.0.0.1:8600          0.0.0.0:*               LISTEN      45504/consul
    udp        0      0 127.0.0.1:8301          0.0.0.0:*                           45504/consul
    udp        0      0 127.0.0.1:8302          0.0.0.0:*                           45504/consul
    udp        0      0 127.0.0.1:8600          0.0.0.0:*                           45504/consul
    # ps aux | grep consul
    root      45504  1.8  0.5 785168 40548 ?        Ssl  12:01   0:00 /usr/local/consul/consul agent -dev -data-dir=/data/consul
    root      45845  0.0  0.0 112824   976 pts/0    S+   12:01   0:00 grep --color=auto consul

Configure nginx as an authenticating reverse proxy:

    # vim /etc/nginx/nginx.conf
    # line 52, add the configuration
        upstream consul {
            server 127.0.0.1:8500;
        }

        server {
            listen       18500;
            root         /usr/share/nginx/html;
            include /etc/nginx/default.d/*.conf;
            location / {
                auth_basic "consul auth";
                auth_basic_user_file /etc/nginx/htpasswd;
                proxy_pass http://consul;
            }
        }

    # systemctl restart nginx

Open in a browser: http://10.80.10.1:18500/

    Username: student
    Password: studentpwd

The home page opens:

node1

Register the host by hostname and IP. No credentials are needed on the local port:

    # curl -X PUT -d '{"id": "node1","name":"node1","address": "10.80.10.1","port": 9100}' http://127.0.0.1:8500/v1/agent/service/register

node2

Register the host by hostname and IP through the external address. Credentials are required:

    # curl -X PUT -d '{"id": "node2","name":"node2","address": "10.80.10.2","port": 9100}' http://student:studentpwd@10.80.10.1:18500/v1/agent/service/register
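The same registration call can be issued from a script. Below is a minimal Python sketch that only builds the request with the standard library, mirroring the curl command above. The function name build_register_request is hypothetical, and the request is deliberately not sent so the sketch runs without a consul agent.

```python
import base64
import json
import urllib.request
from typing import Optional, Tuple


def build_register_request(
    base_url: str,
    service_id: str,
    address: str,
    port: int,
    auth: Optional[Tuple[str, str]] = None,
) -> urllib.request.Request:
    """Build a PUT request for the /v1/agent/service/register endpoint."""
    payload = {"id": service_id, "name": service_id, "address": address, "port": port}
    request = urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
    )
    request.add_header("Content-Type", "application/json")
    if auth is not None:
        # nginx in front of consul expects HTTP Basic credentials.
        token = base64.b64encode("{}:{}".format(*auth).encode("utf-8"))
        request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request


if __name__ == "__main__":
    req = build_register_request(
        "http://10.80.10.1:18500", "node2", "10.80.10.2", 9100,
        auth=("student", "studentpwd"),
    )
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes it possible to check the payload and headers before touching a live agent.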
node1

Deregister the hosts:

    # curl -X PUT http://127.0.0.1:8500/v1/agent/service/deregister/node1
    # curl -X PUT http://127.0.0.1:8500/v1/agent/service/deregister/node2
Discovering monitored hosts with prometheus + consul

node1

Add the configuration:

    # vim /usr/local/prometheus/prometheus.yml
    # last 8 lines, delete the job
      - job_name: 'node_filediscover'
        basic_auth:
          username: student
          password: 123456
        file_sd_configs:
        - files:
          - '/usr/local/prometheus/node_filediscover.json'
          refresh_interval: 15s
    # last line, add the job
      - job_name: 'node_consul'
        basic_auth:
          username: student
          password: 123456
        consul_sd_configs:
        - server: '127.0.0.1:8500'
          services: []

    # systemctl restart prometheus

View the current targets. By default the local consul service on port 8300 is also discovered and reports an error, which can be ignored:

Status -> Targets

Register the host by hostname and IP. If you register through the external IP, a username and password are required:

    # curl -X PUT -d '{"id": "node1","name":"node1","address": "10.80.10.1","port": 9100}' http://127.0.0.1:8500/v1/agent/service/register

Clear the prometheus data, because the charts are cluttered after the job changes:

    # systemctl stop prometheus
    # \rm -rf /data/prometheus/*
    # systemctl start prometheus
    # systemctl status prometheus

Wait for the charts to repopulate:
Configuring alertmanager email alerts

The mailbox provider must have SMTP enabled:

QQ Mail -> Settings -> Account -> enable the POP3/SMTP service -> generate an authorization code

node1

Download and install alertmanager:

Download page: https://prometheus.io/download/

    # cd /usr/local/src/
    # wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
    # tar -xzvf alertmanager-0.21.0.linux-amd64.tar.gz
    # mv alertmanager-0.21.0.linux-amd64 /usr/local/alertmanager
    # /usr/local/alertmanager/alertmanager --version
    alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
      build user:       root@dee35927357f
      build date:       20200617-08:54:02
      go version:       go1.14.4

Configure email alerts in alertmanager. The password is the authorization code:

    # vim /usr/local/alertmanager/alertmanager.yml
    global:
      resolve_timeout: 5m
      smtp_from: '2794080722@qq.com'
      smtp_smarthost: 'smtp.qq.com:465'
      smtp_auth_username: '2794080722@qq.com'
      smtp_auth_password: 'oonywsznoinedfhc'
      smtp_require_tls: false
      smtp_hello: 'qq.com'
    route:
      group_by: ['alertname']
      group_wait: 5s
      group_interval: 5s
      repeat_interval: 5m
      receiver: 'email'
    receivers:
    - name: 'email'
      email_configs:
      - to: '2794080722@qq.com'
        send_resolved: true
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname','dev','instance']

Manage alertmanager with systemctl:

    # vim /usr/lib/systemd/system/alertmanager.service
    [Unit]
    Description=alertmanager
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --cluster.listen-address=127.0.0.1:9094 --web.listen-address=127.0.0.1:9093

    [Install]
    WantedBy=multi-user.target

Start alertmanager and enable it at boot:

    # systemctl start alertmanager
    # systemctl enable alertmanager
    # systemctl status alertmanager

Check the ports and process:

    # netstat -tlunp | grep alertmanager
    tcp        0      0 127.0.0.1:9093          0.0.0.0:*               LISTEN      88835/alertmanager
    tcp        0      0 127.0.0.1:9094          0.0.0.0:*               LISTEN      88835/alertmanager
    udp        0      0 127.0.0.1:9094          0.0.0.0:*                           88835/alertmanager
    # ps aux | grep alertmanager
    root      88835  0.9  0.2 724592 21136 ?        Ssl  12:38   0:00 /usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --cluster.listen-address=127.0.0.1:9094 --web.listen-address=127.0.0.1:9093
    root      89043  0.0  0.0 112828   976 pts/2    S+   12:38   0:00 grep --color=auto alertmanager

Add alertmanager as a notification channel in grafana:

Alerting -> Notification channels -> Add channel

    Name: alertmanager
    Type: Prometheus Alertmanager
    Url: http://127.0.0.1:9093

Click Test, then Save, and wait for the email:

A second email arrives 5 minutes later, matching repeat_interval: 5m:
Configuring email alerts with grafana + alertmanager

Configure an alert on the system load panel in grafana:

    Rule:
      For: 30
    Conditions:
      WHEN: last()
      OF: query(A, 1m, now)
      IS ABOVE: 0.02
    No Data & Error Handling:
      If no data or all values are null: Keep Last State
    Notifications:
      Send to: alertmanager

After the alert fires, raise the threshold to 1:
alertmanager WeChat Work alerts

WeChat Work alerts require adding a trusted IP for the company, and sending alerts requires a public IP.

Download the WeChat Work app and register:

URL: https://work.weixin.qq.com/

App Management -> Create App

    App name: prometheus
    App description: prometheus
    Visible to: everyone

Record the AgentId and Secret:

Set the trusted IP; reference video: https://www.bilibili.com/video/BV11G4y1M7cV/

node1

Configure the WeChat alert template:

    # vim /usr/local/alertmanager/wechat.tmpl
    {{ define "wechat.default.message" }}
    {{ range .Alerts }}
    {{ .Status }} {{ .StartsAt.Local.Format "2006-01-02 15:04:05" }}
    {{ .Labels }}
    {{ end }}
    {{ end }}

Configure WeChat alerts in alertmanager:

    # vim /usr/local/alertmanager/alertmanager.yml
    global:
      resolve_timeout: 5m
    templates:
    - '/usr/local/alertmanager/wechat.tmpl'
    route:
      group_by: ['alertname']
      group_wait: 5s
      group_interval: 5s
      repeat_interval: 5m
      receiver: 'wechat'
    receivers:
    - name: 'wechat'
      wechat_configs:
      - corp_id: 'ww0f8bd660940ef774'
        to_party: '1'
        agent_id: '1000006'
        api_secret: 'NWKj9MUV9EenNWHjGc50WzFhSqiwhfajT7aLUvKoiK8'
        to_user: 'ZhangQiaoQi'
        send_resolved: true
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname','dev','instance']

    # systemctl restart alertmanager
    # systemctl status alertmanager

Test the alert from grafana; WeChat Work receives the message:

Alerting -> Notification channels -> alertmanager -> Test

Configure the system load alert in grafana. After the alert fires, raise the threshold to 1:

    Rule:
      For: 30
    Conditions:
      WHEN: last()
      OF: query(A, 1m, now)
      IS ABOVE: 0.02
    No Data & Error Handling:
      If no data or all values are null: Keep Last State
    Notifications:
      Send to: alertmanager
Alerting with prometheus + alertmanager

node1

Enable alertmanager alerting in prometheus:

    # vim /usr/local/prometheus/prometheus.yml
    # line 12, uncomment and set the address
          - 127.0.0.1:9093
    # lines 16-17, change the rule file patterns
      - "rules/*_rules.yml"
      - "rules/*_alerts.yml"

    # systemctl restart prometheus
    # systemctl status prometheus

Create the alert rules:

    # mkdir -p /usr/local/prometheus/rules
    # cd /usr/local/prometheus/rules/
    # vim test_alerts.yml
    groups:
    - name: student_system_alert
      rules:
      - alert: system load5 alert
        expr: node_load5 > 0.01
        for: 1m
      - alert: system load15 alert
        expr: node_load15 > 0.1
        for: 1m

    # systemctl restart prometheus
    # systemctl status prometheus
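The for: 1m clause means the expression must stay true for a full minute before the alert moves from pending to firing. The following Python sketch models that behavior for a single series under simplifying assumptions: one evaluation per sample and no notion of resolved notifications. The function name alert_states and the sample values are made up for illustration.

```python
from typing import List, Optional, Tuple


def alert_states(
    samples: List[Tuple[float, float]], threshold: float, hold_for: float
) -> List[str]:
    """Return inactive/pending/firing per evaluation of `value > threshold`."""
    states = []
    active_since: Optional[float] = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                # First evaluation where the condition holds: start the clock.
                active_since = timestamp
            held = timestamp - active_since
            states.append("firing" if held >= hold_for else "pending")
        else:
            # Condition no longer holds: reset the clock.
            active_since = None
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Hypothetical node_load5 values evaluated every 60 seconds.
    load5 = [(0, 0.00), (60, 0.05), (120, 0.08), (180, 0.00)]
    print(alert_states(load5, threshold=0.01, hold_for=60))
```

A single dip below the threshold resets the pending timer, which is why short spikes do not produce notifications when for is set.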
Raise the load to trigger the rules:

    # dd if=/dev/zero of=/dev/null

Two simple alert messages arrive:

Raise the thresholds in the alert rules so the alerts return to normal:

    # vim test_alerts.yml
    groups:
    - name: student_system_alert
      rules:
      - alert: system load5 alert
        expr: node_load5 > 0.5
        for: 1m
      - alert: system load15 alert
        expr: node_load15 > 1
        for: 1m

    # systemctl restart prometheus
Custom exporter metrics for load (shell)

Collect custom load1, load5, and load15 metrics, taken from the three load averages printed by the uptime command:

node1

Write the script:

    # vim /usr/local/node_exporter/myself.sh
    load1=$(uptime | awk '{print $(NF-2)}' | sed 's/,//')
    load5=$(uptime | awk '{print $(NF-1)}' | sed 's/,//')
    load15=$(uptime | awk '{print $(NF)}' | sed 's/,//')
    echo myload1 $load1
    echo myload5 $load5
    echo myload15 $load15

    # bash /usr/local/node_exporter/myself.sh
    myload1 0.05
    myload5 0.09
    myload15 0.24

Create the data directory:

    # mkdir /data/node_exporter

Run the script from crontab once a minute:

    # crontab -e
    * * * * * /bin/bash /usr/local/node_exporter/myself.sh > /data/node_exporter/myself.prom
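Redirecting straight into myself.prom means node_exporter can occasionally read a half-written file. A common way around this is to write to a temporary file and rename it into place. The Python sketch below shows that pattern for the same three load metrics; the function names are hypothetical, and os.getloadavg() stands in for parsing uptime.

```python
import os
import tempfile


def render_load_metrics(load1: float, load5: float, load15: float) -> str:
    """Render the three load averages in the Prometheus text format."""
    lines = [
        "myload1 {}".format(load1),
        "myload5 {}".format(load5),
        "myload15 {}".format(load15),
    ]
    return "\n".join(lines) + "\n"


def write_prom_file(path: str, content: str) -> None:
    """Write to a temp file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp_path, path)  # atomic when both paths are on one filesystem


if __name__ == "__main__":
    # Written to the current directory here; the setup above uses
    # /data/node_exporter/myself.prom.
    write_prom_file("myself.prom", render_load_metrics(*os.getloadavg()))
```

The temporary file ends in .tmp, so the textfile collector, which only reads *.prom files, never sees the partial content.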
Update the node_exporter unit file to add the textfile collector directory:

    # vim /usr/lib/systemd/system/node_exporter.service
    [Unit]
    Description=node_exporter
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/local/node_exporter/node_exporter --web.config=/usr/local/node_exporter/config.yml --collector.textfile.directory=/data/node_exporter

    [Install]
    WantedBy=multi-user.target

    # systemctl daemon-reload
    # systemctl restart node_exporter
    # systemctl status node_exporter

Add a graph in grafana:

    A:
      Metrics: myload1
    B:
      Metrics: myload5
    C:
      Metrics: myload15

    Setting:
      Panel title: system self load
Custom exporter metrics for mysql (shell)

node1

Download and install mysql:

    # yum install -y mariadb-server
    # systemctl start mariadb
    # systemctl enable mariadb
    # systemctl status mariadb

Grant monitoring privileges in mysql. MySQL 8 requires creating the user first and then granting; the older 5.5 series can grant directly. User: myuser, password: my_test

    # mysql -A
    MariaDB [(none)]> grant usage,REPLICATION CLIENT on *.* to 'myuser'@'127.0.0.1' identified by 'my_test';
    Query OK, 0 rows affected (0.00 sec)

    MariaDB [(none)]> flush privileges;
    Query OK, 0 rows affected (0.00 sec)

    MariaDB [(none)]> quit

Test:

    # mysql -h127.0.0.1 -umyuser -pmy_test -A -e "show global status" 2> /dev/null

Filter the status variables:

    # mysql -h127.0.0.1 -umyuser -pmy_test -A -e "show global status" 2> /dev/null | grep -i select
    Com_insert_select	0
    Com_replace_select	0
    Com_select	3
    Select_full_join	0
    Select_full_range_join	0
    Select_range	0
    Select_range_check	0
    Select_scan	2

Collect the mysql metrics of interest, then add an identifying label:

    # mysql -h127.0.0.1 -umyuser -pmy_test -A -e "show global status" 2> /dev/null | egrep -i '^Com_(select|update|insert|delete)\s|^Bytes_'
    Bytes_received	753
    Bytes_sent	23391
    Com_delete	0
    Com_insert	0
    Com_select	4
    Com_update	0

Bytes_received: inbound traffic.

Bytes_sent: outbound traffic.

    # mysql -h127.0.0.1 -umyuser -pmy_test -A -e "show global status" 2> /dev/null | egrep -i '^Com_(select|update|insert|delete)\s|^Bytes_' | sed 's/\s/{mytest="127.0.0.1"} /'
    Bytes_received{mytest="127.0.0.1"} 904
    Bytes_sent{mytest="127.0.0.1"} 34847
    Com_delete{mytest="127.0.0.1"} 0
    Com_insert{mytest="127.0.0.1"} 0
    Com_select{mytest="127.0.0.1"} 5
    Com_update{mytest="127.0.0.1"} 0

Extend the script with the database metrics:

    # vim /usr/local/node_exporter/myself.sh
    # last line, add the command
    mysql -h127.0.0.1 -umyuser -pmy_test -A -e "show global status" 2> /dev/null | egrep -i '^Com_(select|update|insert|delete)\s|^Bytes_' | sed 's/\s/{mytest="127.0.0.1"} /'

    # bash /usr/local/node_exporter/myself.sh
    myload1 0.29
    myload5 0.29
    myload15 0.25
    Bytes_received{mytest="127.0.0.1"} 1055
    Bytes_sent{mytest="127.0.0.1"} 46307
    Com_delete{mytest="127.0.0.1"} 0
    Com_insert{mytest="127.0.0.1"} 0
    Com_select{mytest="127.0.0.1"} 6
    Com_update{mytest="127.0.0.1"} 0
Save the dashboard as a new one named mysql, delete every panel except system traffic, and build the other panels from the traffic panel:

Dashboard settings -> Save As

Monitor mysql traffic:

    A:
      Metrics: rate(Bytes_received[2m])*8
      Legend: traffic in:{{mytest}}
    B:
      Metrics: rate(Bytes_sent[2m])*8
      Legend: traffic out:{{mytest}}

    Setting:
      Panel title: my traffic

Monitor delete and insert operations:

    A:
      Metrics: rate(Com_delete[2m])
      Legend: {{mytest}}:delete
    B:
      Metrics: rate(Com_insert[2m])
      Legend: {{mytest}}:insert

    Setting:
      Panel title: my operation
    Axes:
      Left Y:
        Unit: Misc -> none

Create test data in the database and keep writing to it. The loop below runs one insert and one delete per second, so the charted rates should approach 1:

    # mysql -A
    MariaDB [(none)]> use test;
    Database changed

    MariaDB [test]> create table test(id int);
    Query OK, 0 rows affected (0.01 sec)

    MariaDB [test]> quit
    Bye

    # while true; do mysql -A -e "insert into test.test values(1); select * from test.test; delete from test.test;"; sleep 1; done

Set the time range to the last 5 minutes and wait for the chart:

Press Ctrl+C to stop the mysql write loop.
Custom exporter metrics for mysql (python)

node1

Download and install python3:

    # yum install -y python36
    # pip3 install pymysql==1.0.2 -i https://mirrors.aliyun.com/pypi/simple/

Write the python script:

    # vim /usr/local/node_exporter/myself.py
    import pymysql

    server = '127.0.0.1'
    port = '3306'
    conn = pymysql.connect(host=server, port=int(port), user="myuser", password="my_test")
    cur = conn.cursor(pymysql.cursors.DictCursor)
    cur.execute('show global status')
    fc = cur.fetchall()
    # Print the selected status variables in Prometheus text format, prefixed with "py".
    wanted = ['Threads_running', 'Com_select', 'Com_update', 'Com_delete', 'Com_insert', 'Connections', 'Bytes_received', 'Bytes_sent']
    for oneresult in fc:
        if oneresult['Variable_name'] in wanted:
            print('py{}{{mytest="{}"}} {}'.format(oneresult['Variable_name'], server, oneresult['Value']))
    conn.close()

    # python3 /usr/local/node_exporter/myself.py
    pyBytes_received{mytest="127.0.0.1"} 43881
    pyBytes_sent{mytest="127.0.0.1"} 365508
    pyCom_delete{mytest="127.0.0.1"} 200
    pyCom_insert{mytest="127.0.0.1"} 200
    pyCom_select{mytest="127.0.0.1"} 430
    pyCom_update{mytest="127.0.0.1"} 0
    pyConnections{mytest="127.0.0.1"} 232
    pyThreads_running{mytest="127.0.0.1"} 1
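The filtering and formatting step of the script can be checked without a database by pulling it into a function. The sketch below assumes rows shaped like the dictionaries a DictCursor returns for show global status; the function name format_status_rows and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, List

WANTED = (
    "Threads_running", "Com_select", "Com_update", "Com_delete",
    "Com_insert", "Connections", "Bytes_received", "Bytes_sent",
)


def format_status_rows(
    rows: Iterable[Dict[str, str]], server: str, prefix: str = "py"
) -> List[str]:
    """Turn selected status rows into Prometheus text-format lines."""
    lines = []
    for row in rows:
        if row["Variable_name"] in WANTED:
            # Doubled braces produce literal { and } around the label.
            lines.append('{}{}{{mytest="{}"}} {}'.format(
                prefix, row["Variable_name"], server, row["Value"]))
    return lines


if __name__ == "__main__":
    sample_rows = [
        {"Variable_name": "Com_select", "Value": "430"},
        {"Variable_name": "Uptime", "Value": "99"},  # not in WANTED, skipped
    ]
    for line in format_status_rows(sample_rows, "127.0.0.1"):
        print(line)
```

In the real script the rows would come from cur.fetchall(), and the returned lines would be printed exactly as before.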
Add the python script to the collection script and wait for data to be collected:

    # vim /usr/local/node_exporter/myself.sh
    # last line, add the command
    python3 /usr/local/node_exporter/myself.py

Run the loop to generate database activity:

    # while true; do mysql -A -e "insert into test.test values(1); select * from test.test; delete from test.test;"; sleep 1; done

Monitor database connection activity from the python script (Threads_running, currently 1):

    A:
      Metrics: pyThreads_running

    Setting:
      Panel title: my connect python
    Axes:
      Left Y:
        Unit: Misc -> none

Monitor delete and insert operations from the python script:

    A:
      Metrics: rate(pyCom_delete[2m])
      Legend: {{mytest}}:delete
    B:
      Metrics: rate(pyCom_insert[2m])
      Legend: {{mytest}}:insert

    Setting:
      Panel title: my operation python
    Axes:
      Left Y:
        Unit: Misc -> none

The final result:

Press Ctrl+C to stop the mysql write loop.
Custom exporter metrics for redis (shell)

node1

Stop the mysql database:

    # systemctl stop mariadb
    # systemctl disable mariadb

Download and install redis:

Edit the redis configuration and set the password foobared:

    # vim /etc/redis.conf
    # line 480, uncomment
    requirepass foobared

    # systemctl start redis
    # systemctl enable redis
    # systemctl status redis

Collect the redis info output. Writing it out directly causes errors because the *_human fields are not numeric, so append | grep -v human:

    # vim /usr/local/node_exporter/myself.sh
    # lines 7-8, comment out
    # last line, add the command
    redis-cli -h 127.0.0.1 -a foobared info | egrep '^(used_cpu|used_memory|total_net_)' | sed 's/:/ /g' | sed 's/\s$//g' | sed 's/\s/{myhost="127.0.0.1"} /' | grep -v human

    # redis-cli -h 127.0.0.1 -a foobared info | egrep '^(used_cpu|used_memory|total_net_)' | sed 's/:/ /g' | sed 's/\s$//g' | sed 's/\s/{myhost="127.0.0.1"} /' | grep -v human
    used_memory{myhost="127.0.0.1"} 813464
    used_memory_rss{myhost="127.0.0.1"} 5885952
    used_memory_peak{myhost="127.0.0.1"} 813464
    used_memory_lua{myhost="127.0.0.1"} 37888
    total_net_input_bytes{myhost="127.0.0.1"} 84
    total_net_output_bytes{myhost="127.0.0.1"} 2143
    used_cpu_sys{myhost="127.0.0.1"} 0.07
    used_cpu_user{myhost="127.0.0.1"} 0.05
    used_cpu_sys_children{myhost="127.0.0.1"} 0.00
    used_cpu_user_children{myhost="127.0.0.1"} 0.00
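The sed pipeline can also be expressed as a small parser, which makes the filtering rules explicit. This Python sketch assumes the INFO text has already been captured as a string, for example from redis-cli; the function name info_to_metrics and the sample input are illustrative. Unlike the shell version, it also drops any value that is not numeric instead of relying only on the human filter.

```python
from typing import List, Tuple


def info_to_metrics(
    info_text: str,
    host: str,
    prefixes: Tuple[str, ...] = ("used_cpu", "used_memory", "total_net_"),
) -> List[str]:
    """Convert selected `key:value` lines of Redis INFO into metric lines."""
    lines = []
    for raw in info_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blanks and section headers such as "# Memory"
        key, value = line.split(":", 1)
        if not key.startswith(prefixes) or "human" in key:
            continue
        try:
            float(value)
        except ValueError:
            continue  # non-numeric values would break the textfile collector
        lines.append('{}{{myhost="{}"}} {}'.format(key, host, value))
    return lines


if __name__ == "__main__":
    sample = (
        "# Memory\r\n"
        "used_memory:813464\r\n"
        "used_memory_human:794.40K\r\n"
        "# CPU\r\n"
        "used_cpu_sys:0.07\r\n"
    )
    for metric in info_to_metrics(sample, "127.0.0.1"):
        print(metric)
```

The INFO reply uses CRLF line endings, which is why the shell pipeline strips trailing whitespace; splitlines() and strip() take care of that here.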
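Save the dashboard as a new one named redis, delete every panel except my traffic, and build the other panels from the traffic panel:

Dashboard settings -> Save As

Monitor redis traffic:

    A:
      Metrics: rate(total_net_input_bytes[2m])*8
      Legend: traffic in:{{myhost}}
    B:
      Metrics: rate(total_net_output_bytes[2m])*8
      Legend: traffic out:{{myhost}}

    Setting:
      Panel title: my redis traffic

Monitor redis CPU usage. Note that used_cpu_sys and used_cpu_user are cumulative CPU seconds, so wrapping them in rate(...[2m]) is what yields a utilization fraction:

    A:
      Metrics: used_cpu_sys
    B:
      Metrics: used_cpu_user

    Setting:
      Panel title: my redis cpu
    Axes:
      Left Y:
        Unit: Misc -> Percent (0.0-1.0)

The final graphs: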