Prometheus is really good at pulling metrics, but it needs help if you want to check whether a given host is up with a simple ping.
In this post I'll show you my config, which gets a list of hosts from consul (plus some static hosts) and pings them to monitor whether they're up.
If a host goes down, prometheus fires an alert.
blackbox_exporter
blackbox_exporter is a helper daemon that accepts requests from prometheus to run probes. For instance, it can do dns lookups, check whether a given url contains a string, and, in our case, ping a host.
Install blackbox_exporter. In my case I did the following:
cd /opt
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.12.0/blackbox_exporter-0.12.0.linux-amd64.tar.gz
tar xvfz blackbox_exporter-0.12.0.linux-amd64.tar.gz
ln -s blackbox_exporter-0.12.0.linux-amd64 blackbox_exporter
# /opt/blackbox_exporter/blackbox_exporter now exists and is easy to upgrade
Create a blackbox.yml config file:
# /etc/prometheus/blackbox.yml
modules:
  ping_node:
    prober: icmp
    timeout: 3s
    icmp:
      preferred_ip_protocol: "ip4"
Now create a defaults file:
# /etc/default/blackbox_exporter
BB_OPTS="--config.file=/etc/prometheus/blackbox.yml --log.level=info"
# 'debug' is a handy log.level but don't leave it on too long
Here's my blackbox_exporter.service file for systemd.
# /etc/systemd/system/blackbox_exporter.service
[Unit]
Description=blackbox_exporter for prometheus
Documentation=https://github.com/prometheus/blackbox_exporter
After=network-online.target
[Service]
EnvironmentFile=-/etc/default/blackbox_exporter
User=prometheus
Group=prometheus
LimitNOFILE=65536
WorkingDirectory=/opt/blackbox_exporter
ExecStart=/opt/blackbox_exporter/blackbox_exporter $BB_OPTS
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
[Install]
WantedBy=multi-user.target
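One thing worth checking, since the unit above runs as the unprivileged prometheus user: on Linux, ICMP probes need permission to open raw sockets. The blackbox_exporter README suggests granting the binary the CAP_NET_RAW capability. The sketch below assumes setcap and getcap (from libcap) are installed and that the binary lives at the path used in this post; has_net_raw is a hypothetical helper name, not part of any tool.

```shell
# Succeeds if the getcap output passed as $1 mentions cap_net_raw.
# (hypothetical helper, only for checking the result)
has_net_raw() {
  case "$1" in
    *cap_net_raw*) return 0 ;;
    *) return 1 ;;
  esac
}

# Run as root (assumes the install path from this post):
#   setcap cap_net_raw+ep /opt/blackbox_exporter/blackbox_exporter
#   has_net_raw "$(getcap /opt/blackbox_exporter/blackbox_exporter)" && echo "ICMP capability set"
```

File capabilities belong to the binary itself, so they would need to be set again after each upgrade to a new release.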
Let's tell systemd about it:
systemctl daemon-reload
systemctl enable blackbox_exporter.service
systemctl start blackbox_exporter.service
systemctl status blackbox_exporter.service
netstat -ntlp | grep blackbox
# blackbox_exporter should be running and listening on port 9115
# for debug purposes you can always stop the daemon and run it manually
# blackbox_exporter --config.file=/etc/prometheus/blackbox.yml --log.level=debug
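Before involving prometheus at all, you can ask blackbox_exporter to run a probe by hand. It serves probes on /probe and takes the module and target as query parameters, then answers with plain-text metrics that include probe_success. The helpers below are a minimal sketch; probe_url and probe_ok are hypothetical names, and the host and target are just the examples used in this post.

```shell
# Build the probe URL for a blackbox_exporter instance.
# $1 = exporter host:port, $2 = module name from blackbox.yml, $3 = host to ping
probe_url() {
  printf 'http://%s/probe?module=%s&target=%s\n' "$1" "$2" "$3"
}

# Read probe output on stdin; succeed only if it reports "probe_success 1".
probe_ok() {
  grep -q '^probe_success 1$'
}

# Example (needs the exporter running on this machine):
#   curl -s "$(probe_url localhost:9115 ping_node static1.example.com)" | probe_ok && echo up
```

If the probe fails, running the same curl without the pipe shows the full metric output, which is usually enough to see what went wrong.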
prometheus
I'm assuming you already have a working prometheus running, but if not, follow the same install steps as for blackbox_exporter but with prometheus.
I'm also assuming you have a consul cluster running, and if you don't, you really should: consul is fantastic.
If you don't have consul running, keep reading, as I've also got some static hosts defined for hosts that can't run consul, like our pfsense firewalls.
# /etc/prometheus/prometheus.yml
# ...

# tell prometheus where our alertmanager is running
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - '127.0.0.1:9093'

# these are the alerting rules. aka: tests
rule_files:
  - "rules/*.yml"

scrape_configs:
  # these are statically defined hosts, in this case we have two datacenters
  - job_name: 'ping_static_hosts'
    scrape_interval: 1m
    metrics_path: /probe
    params:
      module: [ping_node]  # the module stanza in blackbox.yml
    static_configs:
      - targets:
          - static1.west.example.com
        labels:
          group: west
      - targets:
          - static1.example.com
        labels:
          group: east
    relabel_configs:
      # pass the original address to blackbox_exporter as the probe target
      # (without this the /probe request has no target parameter)
      - source_labels: ['__address__']
        target_label: '__param_target'
      # if the address is a fqdn just use hostname
      - source_labels: ['__address__']
        target_label: 'instance'
        regex: '([^.]+).*'
        replacement: '$1'
      # point prometheus at blackbox_exporter's real hostname:port
      # use fqdn so links in the prometheus ui work
      - target_label: '__address__'
        replacement: 'prometheus.example.com:9115'

  # these are dynamically found hosts, in this case we have two datacenters
  - job_name: 'ping_consul_hosts'
    scrape_interval: 1m
    metrics_path: /probe
    params:
      module: [ping_node]  # the module stanza in blackbox.yml
    consul_sd_configs:
      - server: localhost:8500
        datacenter: west
      - server: localhost:8500
        datacenter: east
    relabel_configs:
      - source_labels: ['__meta_consul_address']
        target_label: '__address__'
        separator: ';'
        replacement: '$1'
      - source_labels: ['__meta_consul_node']
        target_label: 'instance'
      - source_labels: ['__meta_consul_dc']
        target_label: 'group'
      # strip off the port if it's there (default from consul)
      - source_labels: ['__address__']
        target_label: '__param_target'
        regex: '([^:]+)(:.*)?'
        replacement: '$1'
      # point prometheus at blackbox_exporter's real hostname:port
      - target_label: '__address__'
        replacement: 'prometheus.example.com:9115'
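To see what the two relabel regexes in this config do, here is a rough imitation with sed. Prometheus anchors relabel regexes at both ends, so the sed versions add ^ and $ explicitly. The function names and the sample address 10.1.2.3:8300 are made up for illustration.

```shell
# Mimic the 'instance' rule: keep everything before the first dot.
short_host() {
  printf '%s\n' "$1" | sed -E 's/^([^.]+).*$/\1/'
}

# Mimic the '__param_target' rule: drop an optional :port suffix.
strip_port() {
  printf '%s\n' "$1" | sed -E 's/^([^:]+)(:.*)?$/\1/'
}

short_host static1.west.example.com   # prints static1
strip_port 10.1.2.3:8300              # prints 10.1.2.3
strip_port 10.1.2.3                   # prints 10.1.2.3
```

Note that the first rule would also shorten a bare IP address such as 10.1.2.3 to 10, so it only suits targets listed by hostname.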
systemctl reload prometheus.service
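A more cautious variant of that reload validates the file first. promtool ships in the prometheus release tarball, and `promtool check config` also checks the rule files the config points at. This is a sketch: reload_if_valid is a hypothetical wrapper, and the PROMTOOL and RELOAD_CMD variables exist only so the commands can be swapped out.

```shell
# Reload prometheus only if promtool accepts the config file given as $1.
# PROMTOOL and RELOAD_CMD are optional overrides (assumptions for this sketch).
reload_if_valid() {
  config="$1"
  if "${PROMTOOL:-promtool}" check config "$config"; then
    # left unquoted on purpose so the command and its arguments are split
    ${RELOAD_CMD:-systemctl reload prometheus.service}
  else
    echo "config check failed; not reloading" >&2
    return 1
  fi
}

# Example:
#   reload_if_valid /etc/prometheus/prometheus.yml
```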
Check the prometheus ui (http://prometheus.example.com:9090) and you should see the static and dynamic consul hosts listed.
Note: the state column can be misleading. It's up if prometheus can talk to blackbox_exporter, not if blackbox_exporter can talk to the target host.
alerting
So assuming all that works, let's make an alert rule:
groups:
  - name: infrastructure
    rules:
      - alert: node-down
        expr: probe_success{job=~"ping.*hosts"} == 0
        for: 2m
        annotations:
          identifier: "{{ $labels.instance }}.{{ $labels.group }}"
          description: "_{{ $labels.job }}_ is alerting on _{{ $labels.instance }}_"
          fail_msg: "is down. "
          restore_msg: "is back up. "
systemctl reload prometheus.service
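To check the alert expression without waiting for an alert, you can run the same query against the prometheus HTTP API (/api/v1/query) and list the instances it returns. The helper below is a rough sketch that pulls the instance labels out of the JSON with grep and cut instead of a real JSON parser; instances_from_json is a hypothetical name.

```shell
# Print the instance label of every series in a prometheus query response on stdin.
# Crude text matching, good enough for a quick look; not a JSON parser.
instances_from_json() {
  grep -o '"instance":"[^"]*"' | cut -d'"' -f4
}

# Example (needs a reachable prometheus; lists hosts that are currently failing the ping):
#   curl -s -G 'http://prometheus.example.com:9090/api/v1/query' \
#     --data-urlencode 'query=probe_success{job=~"ping.*hosts"} == 0' | instances_from_json
```

An empty result means every probed host answered its last ping.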
testing
I tested by going to one of my hosts and blocking icmp traffic.
# block pings
iptables -I INPUT 1 -p icmp -j DROP
# wait until prometheus alerts
# re-enable pings
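# delete by rule spec rather than by position, in case another rule was inserted at 1 meanwhile
iptables -D INPUT -p icmp -j DROP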
# wait until prometheus resolves the alert
conclusion
So there you have it, prometheus alerting via pings.