Notes on Grafana & Prometheus

This week I’ve been replacing the monitoring graphs in one of my PHP applications with a Grafana & Prometheus setup. The aim is to simplify adding new graphs / statistics and to take advantage of the featureful graphing and alerting facilities Grafana provides.

What follows are brief notes for installing and configuring Grafana & Prometheus on CentOS 7.4.

Grafana & Prometheus Server

  1. Install and configure Prometheus from Sergey Nartimov’s repository
  2. Install Grafana from the official Yum repository
  3. Configure Grafana, changing the following settings:
    1. Enable SSL in the [server] section
    2. Configure a non-default admin username and random password in the [security] section
    3. Disable allow_sign_up and allow_org_create in the [users] section
    4. Enable GitHub logins and set allowed_organizations in the [auth.github] section
  4. Restrict access to Prometheus (via iptables)
  5. Start and enable the services ‘prometheus’ and ‘grafana-server’
  6. Check Prometheus is running (point your web browser at port 9090)
    1. Check scraping is running without errors under Status => Targets
  7. Use the default Grafana admin user to set up the first GitHub login as an admin
    1. In an incognito browser window, log in using the default admin user
    2. In a normal browser window, log in using GitHub
    3. (Default admin) Edit the GitHub user
      1. Check the ‘Grafana Admin’ checkbox and click the ‘Update’ button
      2. Change the user’s organization role to ‘Admin’
    4. (GitHub user) Log out and back in, and check that you can see the ‘Admin’ list in the main menu (Grafana icon in the top right) and the ‘New dashboard’ button in the dashboard list.
  8. Edit the Grafana config and set disable_login_form = true in the [auth] section
  9. Restart Grafana
  10. Add the datasource with the following settings:
    1. Name: Prometheus
    2. Type: Prometheus
    3. URL: http://localhost:9090/
    4. Access: Proxy
  11. Install the following dashboards from Grafana.com:
    1. 1860 (Node Exporter Full)
    2. 3662 (Prometheus)
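Step 10 can also be scripted against Grafana’s HTTP API, which accepts a data source definition as JSON via POST /api/datasources. This is a minimal sketch, not a tested deployment script: it assumes you have created a Grafana API key with the Admin role, and the function names prometheusDatasource() and createDatasource() are hypothetical. It sends the same settings as listed above.

```php
<?php
declare(strict_types = 1);

// Hypothetical helper: the data source settings from step 10 as an API payload
function prometheusDatasource(): array
{
    return [
        "name" => "Prometheus",
        "type" => "prometheus",
        "url" => "http://localhost:9090/",
        "access" => "proxy",
    ];
}

// POST the payload to Grafana's /api/datasources endpoint.
// Assumes $apiKey is an API key with the Admin role.
function createDatasource(string $grafanaUrl, string $apiKey, array $datasource): string
{
    $context = stream_context_create([
        "http" => [
            "method" => "POST",
            "header" => "Content-Type: application/json\r\nAuthorization: Bearer {$apiKey}\r\n",
            "content" => json_encode($datasource),
            // Return the response body for 4xx/5xx replies too, so errors are visible
            "ignore_errors" => true,
        ],
    ]);
    $response = file_get_contents(rtrim($grafanaUrl, "/") . "/api/datasources", false, $context);
    if ($response === false) {
        throw new RuntimeException("Request to Grafana failed");
    }
    return $response;
}

// Usage: php add-datasource.php https://<grafana-host>:3000 <api-key>
if (PHP_SAPI === "cli" && isset($argv[1], $argv[2])) {
    print createDatasource($argv[1], $argv[2], prometheusDatasource()) . "\n";
}
```

Grafana replies with a short JSON message in both the success and the error case, which the script prints as-is.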

Monitored Servers

  1. Install node_exporter (plus any other exporters you want, such as apache_exporter) from Sergey Nartimov’s repository
  2. Start and enable the services
  3. Check the iptables rules to ensure the Prometheus server can talk to the monitored servers (on the appropriate ports)
  4. Update the Prometheus configuration and reload Prometheus (see example below)
    1. Check Prometheus is successfully scraping under Status => Targets in the Prometheus admin
scrape_configs:
  - job_name: 'node'
    scrape_interval: 1s
    scrape_timeout: 1s
    static_configs:
      - targets: ['<hostname>:9100']
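Step 3 can be checked from the Prometheus server itself with a plain TCP connection test before touching the Prometheus configuration. This is a minimal sketch: canReach() and the command-line usage are hypothetical, and an open port only shows that the firewall lets the connection through, not that the exporter behind it is healthy.

```php
<?php
declare(strict_types = 1);

// Hypothetical helper: true if a TCP connection to $host:$port succeeds within
// $timeout seconds. A DROP rule typically shows up as a timeout; either way
// the result is false.
function canReach(string $host, int $port, float $timeout = 2.0): bool
{
    $errno = 0;
    $errstr = "";
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($socket === false) {
        return false;
    }
    fclose($socket);
    return true;
}

// Usage from the Prometheus server: php check-targets.php web1:9100 web1:9117
if (PHP_SAPI === "cli" && isset($argv[1])) {
    foreach (array_slice($argv, 1) as $target) {
        $pos = strrpos($target, ":");
        if ($pos === false) {
            fwrite(STDERR, "Skipping {$target}: expected host:port\n");
            continue;
        }
        $host = substr($target, 0, $pos);
        $port = (int) substr($target, $pos + 1);
        printf("%s %s\n", $target, canReach($host, $port) ? "reachable" : "NOT reachable");
    }
}
```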

Grafana: Creating Custom Graphs

Host list

The host list is set up via Templating. From the Dashboard configuration (‘cog’ icon on the top bar), select ‘Templating’ and add a new entry with the following settings:

  • Name: host
  • Type: Query
  • Datasource: Prometheus
  • Refresh: On Dashboard Load
  • Query: label_values(node_filesystem_avail, instance)
    • You may want to change which metric is used depending on the dashboard
  • Regex: /([^:]+):.*/
  • Sort: Alphabetical

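Then in queries you can use {instance=~"$host:.*"} to get the metric for the instance (host) while ignoring the port, which makes it easy to combine metrics from multiple exporters in a single dashboard. The :.* suffix is required because Prometheus regex matchers are fully anchored, so "$host" on its own would not match a value such as "<hostname>:9100".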
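The PHP below imitates what the template’s Regex field and the $host:.* matcher do to instance labels. It is a sketch under the assumption that Grafana keeps capture group 1 as the variable value; the function names, hostnames and ports are made up for illustration.

```php
<?php
declare(strict_types = 1);

// Hypothetical: apply the template Regex to instance values and keep one entry per host
function hostsFromInstances(array $instances): array
{
    $hosts = [];
    foreach ($instances as $instance) {
        // Same pattern as the Regex field: everything before the first ':'
        if (preg_match('/([^:]+):.*/', $instance, $m) === 1) {
            $hosts[] = $m[1];
        }
    }
    $hosts = array_values(array_unique($hosts));
    sort($hosts, SORT_STRING);
    return $hosts;
}

// Hypothetical: mimic Prometheus' fully anchored match of instance=~"$host:.*"
function instanceMatches(string $host, string $instance): bool
{
    return preg_match('/^(?:' . preg_quote($host, '/') . ':.*)$/', $instance) === 1;
}
```

The colon also stops a host such as "web1" from matching "web10:9100", which a bare $host.* pattern would.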

Disk Consumption Rate

One stat I wanted to reproduce from my custom dashboards was the disk consumption rate and the estimated time until the disk is full. You can easily change the period for these to get figures based on different periods of time (e.g. 168h for 7 days).

I have these set up as tables with the instant query option checked. (It looks like the ability to combine them into a single table should be available in Grafana 5.)

  • Disk consumption rate based on the past 24 hours: - delta(node_filesystem_avail{instance=~"$host:.*",fstype!="tmpfs"}[24h])
  • Time to disk full based on the past 24 hours: node_filesystem_avail{instance=~"$host:.*",fstype!="tmpfs"} / (- delta(node_filesystem_avail{instance=~"$host:.*",fstype!="tmpfs"}[24h]))
    • The result is in units of the chosen period (days for [24h], weeks for [168h])
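To sanity-check the two queries above, here is the same arithmetic for a single filesystem. This is a simplified sketch: diskFullEstimate() is a hypothetical helper that works from just two samples of node_filesystem_avail (in bytes), whereas Prometheus’ delta() extrapolates across the whole range.

```php
<?php
declare(strict_types = 1);

// Hypothetical helper mirroring the two queries for one filesystem, given the
// available bytes at the start and end of a period of $periodHours hours
function diskFullEstimate(float $availThen, float $availNow, float $periodHours): ?array
{
    // Equivalent of -delta(): bytes consumed over the period
    $consumed = $availThen - $availNow;
    if ($consumed <= 0) {
        // Usage is flat or shrinking, so there is no finite estimate
        return null;
    }
    return [
        "consumed_per_period" => $consumed,
        // avail / consumption per period = number of periods until full
        "periods_until_full" => $availNow / $consumed,
        "hours_until_full" => $availNow / $consumed * $periodHours,
    ];
}
```

For example, 50 GB free after using 10 GB in the past 24 hours gives 5 periods, i.e. 5 days or 120 hours. In PromQL the equivalent cases show up as +Inf or a negative value instead of null.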

Prometheus Exporters

PHP-FPM

While I found a couple of PHP-FPM exporters, getting them to export stats for a specific pool looked to be “fun” at best (some of my applications live on multi-tenant servers with a pool-per-application setup). Instead I wrote my own exporter script in PHP, designed to run in the same pool. I already had PHP-FPM’s status page configured on the “/php-fpm-status” URL.

This code requires PHP 7 to run (because of the null coalescing operator).

<?php
declare(strict_types = 1);

header("Content-Type: text/plain; version=0.0.4");

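// Fetch PHP-FPM's status page as JSON from the same vhost, so it is answered
// by this application's pool (requires allow_url_fopen)
$stats = null;
$protocol = ((!empty($_SERVER["HTTPS"]) && $_SERVER["HTTPS"] !== "off") ? "https" : "http");
$result = @file_get_contents($protocol . "://" . $_SERVER["SERVER_NAME"] . "/php-fpm-status?json&full");
if ($result !== false) {
    $stats = json_decode($result, true);
}
// Status page unavailable or not valid JSON: return an empty response
if (! is_array($stats)) {
    exit();
}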

// Map status page fields to metrics; fields missing from the output are null and skipped below
$metrics = [
    "uptime" => [
        "type" => "counter",
        "help" => "seconds since last restart",
        "value" => ($stats["start since"] ?? null),
    ],
    "accepted_connections_total" => [
        "type" => "counter",
        "help" => "Accepted connections",
        "value" => ($stats["accepted conn"] ?? null),
    ],
    "listen_queue_connections" => [
        "type" => "gauge",
        "help" => "Pending connections queue",
        "value" => ($stats["listen queue"] ?? null),
    ],
    "listen_queue_max_connections" => [
        "type" => "counter",
        "help" => "Max. pending connections queue",
        "value" => ($stats["max listen queue"] ?? null),
    ],
    "listen_queue_length" => [
        "type" => "gauge",
        "help" => "Listen queue length",
        "value" => ($stats["listen queue len"] ?? null),
    ],
    "active_processes" => [
        "type" => "gauge",
        "help" => "Active processes",
        "value" => ($stats["active processes"] ?? null),
    ],
    "idle_processes" => [
        "type" => "gauge",
        "help" => "Idle processes",
        "value" => ($stats["idle processes"] ?? null),
    ],
    "processes" => [
        "type" => "gauge",
        "help" => "Total processes",
        "value" => ($stats["total processes"] ?? null),
    ],
    "active_processes_max" => [
        "type" => "counter",
        "help" => "Max. seen processes",
        "value" => ($stats["max active processes"] ?? null),
    ],
    "max_children_reached_count" => [
        "type" => "counter",
        "help" => "Max. number of processes hit count",
        "value" => ($stats["max children reached"] ?? null),
    ],
    "slow_requests_total" => [
        "type" => "counter",
        "help" => "Slow requests",
        "value" => ($stats["slow requests"] ?? null),
    ],
];

// Print each available metric in the Prometheus text exposition format
foreach ($metrics as $key => $metric) {
    $key = "phpfpm_" . $key;
    if ($metric["value"] === null) {
        continue;
    }

    print "# HELP {$key} {$metric["help"]}\n";
    print "# TYPE {$key} {$metric["type"]}\n";
    print $key . " " . $metric["value"] . "\n";
}
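This printing loop is repeated, almost identically, at the end of the processes exporter that follows, so a shared function could replace both. The minimal sketch below (renderMetrics() is a hypothetical name) also escapes HELP text, because the Prometheus text format requires backslashes and line feeds in HELP lines to be written as \\ and \n.

```php
<?php
declare(strict_types = 1);

// Hypothetical shared helper: render metrics in the Prometheus text format
function renderMetrics(string $prefix, array $metrics): string
{
    $out = "";
    foreach ($metrics as $name => $metric) {
        if ($metric["value"] === null) {
            // Value not available: omit the metric entirely
            continue;
        }
        $key = $prefix . $name;
        // Escape backslashes first, then line feeds, as the text format requires
        $help = str_replace(["\\", "\n"], ["\\\\", "\\n"], (string) $metric["help"]);
        $out .= "# HELP {$key} {$help}\n";
        $out .= "# TYPE {$key} {$metric["type"]}\n";
        $out .= "{$key} {$metric["value"]}\n";
    }
    return $out;
}
```

With the helper included, the loop at the end of the script would become print renderMetrics("phpfpm_", $metrics);.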

Linux Processes by Type

The following exporter gives an overview of processes by type (PHP-FPM, PHP CLI, Apache).

This code requires PHP 7 to run.

<?php
declare(strict_types = 1);

header("Content-Type: text/plain; version=0.0.4");

$returnVal = null;
$output = [];
// Format - what info do we want from ps?
$psFmt = "%cpu,%mem,c,cp,cputime,flags,group,nice,pid,ppid,rss,tty,user,vsz,cmd";
// List of process names we want to look at
$procList = ["httpd", "php", "php-fpm"];
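// Builds e.g.: ps -C "httpd" -C "php" -C "php-fpm" -o "<format>"
$cmd = "ps -C \"" . join("\" -C \"", $procList) . "\" -o \"{$psFmt}\"";
// Number of columns: splitting stops here so the last column (cmd) keeps its spaces
$fieldCount = count(explode(',', $psFmt));
@exec($cmd, $output, $returnVal);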

// ps prints just the header line (or nothing) when no processes match
if (count($output) < 2) {
    exit();
}

$masterApache = null;
$childrenApache = [];
$memVszApache = [];
$memRssApache = [];

$memVszPhp = [];
$memRssPhp = [];
$phpProcesses = [];

$memVszPhpFpm = [];
$memRssPhpFpm = [];
$phpFpmProcesses = [];

$headers = null;
$lineNo = -1;
foreach ($output as $line) {
    $lineNo++;

    $line = ltrim(preg_replace('/\s+/', " ", $line));
    $fields = explode(" ", $line, $fieldCount);
    foreach ($fields as $k => $v) {
        $fields[$k] = trim($v);
    }

    if ($headers === null) {
        $headers = $fields;
        continue;
    }

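    // Skip lines that don't split into one value per column (array_combine() would fail)
    if (count($fields) !== count($headers)) {
        continue;
    }
    $process = array_combine($headers, $fields);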

    // IMPORTANT: Note the order of the php-fpm and php blocks!
    // If php is first it will also capture the php-fpm processes
    if (strpos($process["CMD"], "php-fpm") === 0) {
        if ($process["USER"] === "root") {
            continue;
        }

        $phpFpmProcesses[] = $process;
        $memVszPhpFpm[] = $process["VSZ"];
        $memRssPhpFpm[] = $process["RSS"];
        continue;
    }

    // Also match PHP CLI processes started via their full path (e.g. /usr/bin/php)
    if (preg_match('#^(\S*/)?php#', $process["CMD"]) === 1) {
        $memVszPhp[] = $process["VSZ"];
        $memRssPhp[] = $process["RSS"];
        $phpProcesses[] = $process;
        continue;
    }

    if (strpos($process["CMD"], "httpd") !== false) {
        if ($process["USER"] === "root") {
            $masterApache = $process;
            continue;
        }

        $childrenApache[] = $process;
        $memVszApache[] = $process["VSZ"];
        $memRssApache[] = $process["RSS"];
        continue;
    }
}

// Sort ascending so that array_pop() below returns the maximum value
sort($memRssApache, SORT_NUMERIC);
sort($memVszApache, SORT_NUMERIC);

sort($memRssPhp, SORT_NUMERIC);
sort($memVszPhp, SORT_NUMERIC);

sort($memRssPhpFpm, SORT_NUMERIC);
sort($memVszPhpFpm, SORT_NUMERIC);

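// Smallest positive value in $array; if none is positive the largest value is
// returned instead, and null for an empty array
function array_min($array)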
{
    sort($array, SORT_NUMERIC);
    $retval = null;
    while (count($array)) {
        $retval = array_shift($array);
        if ($retval > 0) {
            break;
        }
    }
    return $retval;
}

// ps reports RSS and VSZ in KiB, so the memory metrics below are in KiB rather than bytes
$metrics = [
    // Apache
    "apache_children" => [
        "type" => "gauge",
        "help" => "Number of Apache child processes",
        "value" => count($childrenApache),
    ],
    "apache_memory_rss_total" => [
        "type" => "gauge",
        "help" => "Apache processes RSS memory (total)",
        "value" => array_sum($memRssApache),
    ],
    "apache_memory_rss_min" => [
        "type" => "gauge",
        "help" => "Apache processes RSS memory (min)",
        "value" => array_min($memRssApache),
    ],
    "apache_memory_rss_max" => [
        "type" => "gauge",
        "help" => "Apache processes RSS memory (max)",
        "value" => array_pop($memRssApache),
    ],
    "apache_memory_vsz_total" => [
        "type" => "gauge",
        "help" => "Apache processes VSZ memory (total)",
        "value" => array_sum($memVszApache),
    ],
    "apache_memory_vsz_min" => [
        "type" => "gauge",
        "help" => "Apache processes VSZ memory (min)",
        "value" => array_min($memVszApache),
    ],
    "apache_memory_vsz_max" => [
        "type" => "gauge",
        "help" => "Apache processes VSZ memory (max)",
        "value" => array_pop($memVszApache),
    ],

    // PHP
    "php" => [
        "type" => "gauge",
        "help" => "Number of PHP CLI processes",
        "value" => count($phpProcesses),
    ],
    "php_memory_rss_total" => [
        "type" => "gauge",
        "help" => "PHP CLI processes RSS memory (total)",
        "value" => array_sum($memRssPhp),
    ],
    "php_memory_rss_min" => [
        "type" => "gauge",
        "help" => "PHP CLI processes RSS memory (min)",
        "value" => array_min($memRssPhp),
    ],
    "php_memory_rss_max" => [
        "type" => "gauge",
        "help" => "PHP CLI processes RSS memory (max)",
        "value" => array_pop($memRssPhp),
    ],
    "php_memory_vsz_total" => [
        "type" => "gauge",
        "help" => "PHP CLI processes VSZ memory (total)",
        "value" => array_sum($memVszPhp),
    ],
    "php_memory_vsz_min" => [
        "type" => "gauge",
        "help" => "PHP CLI processes VSZ memory (min)",
        "value" => array_min($memVszPhp),
    ],
    "php_memory_vsz_max" => [
        "type" => "gauge",
        "help" => "PHP CLI processes VSZ memory (max)",
        "value" => array_pop($memVszPhp),
    ],

    // PHP FPM
    "phpfpm" => [
        "type" => "gauge",
        "help" => "Number of PHP FPM processes",
        "value" => count($phpFpmProcesses),
    ],
    "phpfpm_memory_rss_total" => [
        "type" => "gauge",
        "help" => "PHP FPM processes RSS memory (total)",
        "value" => array_sum($memRssPhpFpm),
    ],
    "phpfpm_memory_rss_min" => [
        "type" => "gauge",
        "help" => "PHP FPM processes RSS memory (min)",
        "value" => array_min($memRssPhpFpm),
    ],
    "phpfpm_memory_rss_max" => [
        "type" => "gauge",
        "help" => "PHP FPM processes RSS memory (max)",
        "value" => array_pop($memRssPhpFpm),
    ],
    "phpfpm_memory_vsz_total" => [
        "type" => "gauge",
        "help" => "PHP FPM processes VSZ memory (total)",
        "value" => array_sum($memVszPhpFpm),
    ],
    "phpfpm_memory_vsz_min" => [
        "type" => "gauge",
        "help" => "PHP FPM processes VSZ memory (min)",
        "value" => array_min($memVszPhpFpm),
    ],
    "phpfpm_memory_vsz_max" => [
        "type" => "gauge",
        "help" => "PHP FPM processes VSZ memory (max)",
        "value" => array_pop($memVszPhpFpm),
    ],
];

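// Print each metric in the Prometheus text exposition format, skipping metrics
// without a value (e.g. the min/max of an empty process list)
foreach ($metrics as $key => $metric) {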
    $key = "procs_" . $key;
    if ($metric["value"] === null) {
        continue;
    }

    print "# HELP {$key} {$metric["help"]}\n";
    print "# TYPE {$key} {$metric["type"]}\n";
    print $key . " " . $metric["value"] . "\n";
}
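The parsing loop in the processes exporter relies on collapsing whitespace and limiting explode() to the number of columns, so that the final CMD column keeps its spaces. The sketch below isolates that step: parsePsLine() is a hypothetical helper, and the ps output is a made-up sample for a four-column format. It also converts the KiB value that ps reports into bytes.

```php
<?php
declare(strict_types = 1);

// Hypothetical helper mirroring the exporter's parsing of one line of ps output:
// collapse whitespace runs, then split into at most $fieldCount fields so the
// last column (CMD) keeps its internal spaces
function parsePsLine(string $line, int $fieldCount): array
{
    $line = ltrim((string) preg_replace('/\s+/', " ", $line));
    return array_map("trim", explode(" ", $line, $fieldCount));
}

// Made-up sample output for: ps -C httpd -o pid,rss,user,cmd
$header = "  PID   RSS USER     CMD";
$row = " 1234 20480 apache   /usr/sbin/httpd -DFOREGROUND";

$process = array_combine(parsePsLine($header, 4), parsePsLine($row, 4));
// ps reports RSS (and VSZ) in KiB; multiply by 1024 for bytes
$rssBytes = (int) $process["RSS"] * 1024;

printf("%s uses %d bytes RSS\n", $process["CMD"], $rssBytes);
```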