@ wrote... (4 years, 11 months ago)

Prometheus is really good at pulling metrics but it needs help if you want to test if a given host is up with a simple ping.

In this post I'll show you my config that gets a list of hosts from consul (plus some static hosts) and then pings them to monitor whether they're up or not.

If a host goes down, prometheus fires an alert.

blackbox_exporter

blackbox_exporter is a helper daemon that can accept commands from prometheus to do probes. For instance it can do dns lookups, check if a given url contains a string, and in our case, ping a host.

Install blackbox_exporter. In my case I did the following:

cd /opt
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.12.0/blackbox_exporter-0.12.0.linux-amd64.tar.gz
tar xvfz blackbox_exporter-0.12.0.linux-amd64.tar.gz
ln -s blackbox_exporter-0.12.0.linux-amd64 blackbox_exporter

# /opt/blackbox_exporter/blackbox_exporter now exists and is easy to upgrade

Create a blackbox.yml config file:

# /etc/prometheus/blackbox.yml

modules:
  ping_node:
    prober: icmp
    timeout: 3s
    icmp:
      preferred_ip_protocol: "ip4"

Now create a defaults file:

# /etc/default/blackbox_exporter
BB_OPTS="--config.file=/etc/prometheus/blackbox.yml --log.level=info"

# 'debug' is a handy log.level but don't leave it on too long

Here's my blackbox_exporter.service file for systemd.

# /etc/systemd/system/blackbox_exporter.service

[Unit]
Description=blackbox_exporter for prometheus
Documentation=https://github.com/prometheus/blackbox_exporter
After=network-online.target

[Service]
EnvironmentFile=-/etc/default/blackbox_exporter
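User=prometheus
Group=prometheus
# icmp probes need raw sockets, which a non-root user only gets with CAP_NET_RAW
AmbientCapabilities=CAP_NET_RAW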
LimitNOFILE=65536
WorkingDirectory=/opt/blackbox_exporter
ExecStart=/opt/blackbox_exporter/blackbox_exporter $BB_OPTS
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target

Let's tell systemd about it:

systemctl daemon-reload
systemctl enable blackbox_exporter.service
systemctl start blackbox_exporter.service
systemctl status blackbox_exporter.service
netstat -ntlp | grep blackbox

# blackbox_exporter should be running and listening on port 9115

# for debug purposes you can always stop the daemon and run it manually
# blackbox_exporter --config.file=/etc/prometheus/blackbox.yml --log.level=debug
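A quick way to confirm the exporter can actually ping something is to request a probe by hand and look for the probe_success line in the response. The sketch below does that in Python. It assumes the exporter is listening on localhost:9115 and that the ping_node module is loaded; the function names are mine, not part of any library.

```python
from urllib.request import urlopen


def parse_probe_success(metrics_text: str) -> bool:
    """Return True if the metrics text reports probe_success as 1."""
    for line in metrics_text.splitlines():
        # skip the '# HELP' and '# TYPE' comment lines
        if line.startswith("probe_success "):
            return float(line.split()[1]) == 1.0
    raise ValueError("no probe_success metric in response")


def ping_via_blackbox(target: str, exporter: str = "localhost:9115") -> bool:
    """Ask blackbox_exporter to ping target using the ping_node module."""
    url = f"http://{exporter}/probe?module=ping_node&target={target}"
    with urlopen(url, timeout=10) as resp:
        return parse_probe_success(resp.read().decode())

# usage, with the exporter running:
#   ping_via_blackbox("127.0.0.1")
```

If this returns False for a host you know is up, the usual suspects are a typo in the module name or the exporter lacking permission to send icmp packets.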

prometheus

I'm assuming you already have a working prometheus running, but if not, follow the same install steps as for blackbox_exporter but with prometheus.

I'm also assuming you have a consul cluster running. If you don't, you really should; consul is fantastic.

If you don't have consul running, keep reading anyway: I've also got some static hosts defined, for hosts that can't run consul such as our pfsense firewalls.

# /etc/prometheus/prometheus.yml

# ...

# tell prometheus where our alertmanager is running
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
      - targets:
        - '127.0.0.1:9093'

# these are the alerting rules. aka: tests
rule_files:
  - "rules/*.yml"

scrape_configs:

  # these are statically defined hosts, in this case we have two datacenters
  - job_name: 'ping_static_hosts'
    scrape_interval: 1m
    metrics_path: /probe
    params:
      module: [ping_node]           # the module stanza in blackbox.yml
    static_configs:
      - targets:
        - static1.west.example.com
        labels:
          group: west
      - targets:
        - static1.example.com
        labels:
          group: east
    relabel_configs:
      # if the address is a fqdn just use hostname
      - source_labels: ['__address__']
        target_label: 'instance'
        regex: '([^.]+).*'
        replacement: '$1'
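      # hand the original address to blackbox_exporter as the ?target= parameter
      - source_labels: ['__address__']
        target_label: '__param_target'
      # point prometheus at blackbox_exporter's real hostname:port
      # use fqdn so links in the prometheus ui work
      - target_label: '__address__'
        replacement: 'prometheus.example.com:9115'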

  # these are dynamically found hosts, in this case we have two datacenters
  - job_name: 'ping_consul_hosts'
    scrape_interval: 1m
    metrics_path: /probe
    params:
      module: [ping_node]           # the module stanza in blackbox.yml
    consul_sd_configs:
      - server: localhost:8500
        datacenter: west
      - server: localhost:8500
        datacenter: east
    relabel_configs:
      - source_labels: ['__meta_consul_address']
        target_label:  '__address__'
        separator: ';'
        replacement: '$1'
      - source_labels: ['__meta_consul_node']
        target_label:  'instance'
      - source_labels: ['__meta_consul_dc']
        target_label:  'group'
      # strip off the port if it's there (default from consul)
      - source_labels: ['__address__']
        target_label: '__param_target'
        regex: '([^:]+)(:.*)?'
        replacement: '$1'
      - target_label: '__address__'
        replacement: 'prometheus.example.com:9115'

systemctl reload prometheus.service
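The relabel regexes are the fiddly part, so it helps to try them on a few addresses before reloading. Prometheus anchors relabel regexes at both ends, which the sketch below imitates with a full match. This is only an approximation: it uses Python's re module rather than the RE2 engine prometheus uses, the helper name is made up, and 10.0.0.5:8301 is just an example address.

```python
import re
from typing import Optional


def relabel_first_group(value: str, regex: str) -> Optional[str]:
    """Mimic a 'replace' relabel rule whose replacement is '$1'.

    Returns None when the regex does not match, in which case
    the relabel rule would leave the target label untouched.
    """
    match = re.fullmatch(regex, value)
    return match.group(1) if match else None


if __name__ == "__main__":
    # static job: keep only the short hostname for the instance label
    print(relabel_first_group("static1.west.example.com", r"([^.]+).*"))
    # consul job: drop an optional :port before it becomes the probe target
    print(relabel_first_group("10.0.0.5:8301", r"([^:]+)(:.*)?"))
    print(relabel_first_group("10.0.0.5", r"([^:]+)(:.*)?"))
```

Note that the first regex also trims a bare IP address at its first dot, so it only makes sense for targets listed by name.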

Check the prometheus ui (http://prometheus.example.com:9090) and you should see the static and dynamic consul hosts listed.

Note: the state column can be misleading. It shows up when prometheus can talk to blackbox_exporter, not when blackbox_exporter can reach the target host.
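The real answer is in the probe_success metric. One way to list the hosts that are failing their ping is to run the query probe_success == 0 against the prometheus HTTP API (/api/v1/query) and read the instance labels out of the JSON. A rough Python sketch of the parsing side follows. The function name is mine and the sample payload is hand-written in the shape the API returns, not captured output.

```python
import json
from typing import List


def down_instances(api_response: str) -> List[str]:
    """Extract instance labels from an instant-vector query response."""
    body = json.loads(api_response)
    if body.get("status") != "success":
        raise RuntimeError("query failed")
    return sorted(r["metric"].get("instance", "?") for r in body["data"]["result"])


# hand-written sample for the query: probe_success{job=~"ping.*hosts"} == 0
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "static1", "group": "west"},
             "value": [1700000000, "0"]},
        ],
    },
})

if __name__ == "__main__":
    print(down_instances(SAMPLE))
```

An empty list means every probe succeeded at query time, whatever the state column says.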

alerting

So assuming all that works, let's make an alert rule:

groups:

- name: infrastructure
  rules:
  - alert: node-down
    expr: probe_success{job=~"ping.*hosts"} == 0
    for: 2m

    annotations:
      identifier: "{{ $labels.instance }}.{{ $labels.group }}"
      description: "_{{ $labels.job }}_ is alerting on _{{ $labels.instance }}_"
      fail_msg: "is down. "
      restore_msg: "is back up. "

systemctl reload prometheus.service
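The for: 2m clause means a single missed ping doesn't alert anyone. The expression has to stay true across evaluations for two minutes before the alert moves from pending to firing, and one successful probe in between resets the clock. Here's a small Python simulation of that behaviour. It assumes rules are evaluated once a minute; the evaluation interval isn't shown in the config above, so treat the 60 seconds as an assumption.

```python
from typing import List, Optional


def alert_states(probe_success: List[int], for_seconds: int = 120,
                 eval_interval: int = 60) -> List[str]:
    """Return the alert state after each rule evaluation."""
    states = []
    active_since: Optional[int] = None  # when the expr first became true
    for i, value in enumerate(probe_success):
        now = i * eval_interval
        if value != 0:  # probe succeeded, so the expr returns nothing
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    # one blip, then a real outage, then recovery
    print(alert_states([1, 0, 1, 0, 0, 0, 0, 1]))
```

With these numbers the blip never gets past pending, and the real outage starts firing on its third consecutive failed evaluation.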

testing

I tested by going to one of my hosts and blocking icmp traffic.

# block pings
iptables -I INPUT 1 -p icmp -j DROP

# wait until prometheus alerts

# re-enable pings (delete by rule spec, in case rule 1 is no longer ours)
iptables -D INPUT -p icmp -j DROP

# wait until prometheus resolves the alert

conclusion

So there you have it, prometheus alerting via pings.

Category: tech, Tags: consul, hashistack, prometheus