check_icmp performance data

alext · Post by **alext** » Tue Nov 05, 2019 10:56 pm

Recently the hosts on our site picked up the nagios-plugins upgrade from EPEL, bringing the version up to 2.2.2-2.20190926git1b8ad57.el7. Our environment has several separate Nagios instances monitoring hundreds of hosts, and at some point in the past I figured check_icmp would be the better option from a performance perspective.

Over the past couple of days, one of our hosts has been experiencing hardware problems and is only intermittently reachable on the network. When it is not, check_icmp reports an unreachable status and 100% packet loss. Makes sense.

Unfortunately, this throws performance data collection off the rails. The 4 values normally reported are replaced with one, and the RRD file is not updated. When I tried converting it to STORAGE_MULTIPLE, the packet loss database was the only one updated, but then the graphing portion (which uses only the XML file, with only one data source, and can't find packet loss at offset 4) also has problems. The net result: when the performance data is most needed, i.e. to figure out when the host went down, it is not available.

Compare the performance data:
CRITICAL - XXXXXXXX: Host unreachable @ XX.XX.XX.XX rta nan, lost 100%|pl=100%;40;80;0;100
OK - XXXXXXXX rta 0.205ms lost 0%|rta=0.205ms;200.000;500.000;0; rtmax=0.314ms;;;; rtmin=0.155ms;;;; pl=0%;40;80;0;100
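
The structural difference between the two lines can be checked mechanically. Below is a minimal sketch, not taken from the plugin sources: `count_perfdata_metrics` is a hypothetical helper, and it assumes unquoted labels without spaces, which holds for the check_icmp output shown above.

```c
#include <string.h>

/* Count the label=value metrics in the perfdata part of a plugin
 * output line, i.e. everything after the first '|'.
 * Assumption: labels are unquoted and contain no spaces. */
int count_perfdata_metrics(const char *plugin_output)
{
    const char *p = strchr(plugin_output, '|');
    int count = 0;

    if (p == NULL)
        return 0;                 /* no perfdata at all */
    p++;                          /* step past the '|' */

    while (*p != '\0') {
        const char *end;
        size_t tok;

        while (*p == ' ')         /* metrics are space separated */
            p++;
        if (*p == '\0')
            break;

        end = strchr(p, ' ');
        tok = (end != NULL) ? (size_t)(end - p) : strlen(p);
        if (memchr(p, '=', tok) != NULL)
            count++;              /* token looks like label=value... */
        p += tok;
    }
    return count;
}
```

Fed the two lines above, this returns 1 for the CRITICAL output and 4 for the OK output, which is the mismatch the RRD update trips over.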

Apparently this was introduced at some point between 2.2.1 and 2.2.2.
From the source rpm of the latest version on EPEL, in plugins-root/check_icmp, @ finish(), line 1582:

Code:

  while (host) {
    ...
    if (rta_mode && host->pl < 100) {
      printf("%srta=%0.3fms;%0.3f;%0.3f;0; %srtmax=%0.3fms;;;; "
             "%srtmin=%0.3fms;;;; ",
             (targets > 1) ? host->name : "", (float)host->rta / 1000,
             (float)warn.rta / 1000, (float)crit.rta / 1000,
             (targets > 1) ? host->name : "", (float)host->rtmax / 1000,
             (targets > 1) ? host->name : "", (float)host->rtmin / 1000);
    }
(and continues with more sophisticated output conditionals...)

The conditional above removes any hope of getting the missing values when the host is unreachable with 100% packet loss.
In v2.2.1 the same area looked much simpler:

Code:

    /* iterate once more for pretty perfparse output */
   ...
    while(host) {
        if(debug) write(STDOUT_FILENO, "\n", 1);
        printf("%srta=%0.3fms;%0.3f;%0.3f;0; %spl=%u%%;%u;%u;; %srtmax=%0.3fms;;;; %srtmin=%0.3fms;;;; ",
               (targets > 1) ? host->name : "",
               host->rta / 1000, (float)warn.rta / 1000, (float)crit.rta / 1000,
               (targets > 1) ? host->name : "", host->pl, warn.pl, crit.pl,
               (targets > 1) ? host->name : "", (float)host->rtmax / 1000,
               (targets > 1) ? host->name : "", (host->rtmin < DBL_MAX) ? (float)host->rtmin / 1000 : (float)0);

        host = host->next;
    }

which produced the same set of performance data regardless of the host state so Nagios & PNP4Nagios were both happy.

The Question: Is there any way to restore the consistent behavior we had for ages? Or at least introduce some CLI switch to ensure consistency of the performance data? I realize that rt values make no sense when the host is unreachable, but I am not asking for values (which could probably be returned as NaN or empty - whichever is acceptable to pnp4nagios), just for the structure.
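
To illustrate what "just the structure" could look like, here is a minimal sketch. It is not the actual check_icmp code: `struct host_stats`, `append_rt` and `format_perfdata` are hypothetical, simplified stand-ins, and only the single-target case is covered. It uses the literal `U`, which the Nagios plugin development guidelines define as the marker for a value that could not be determined; whether PNP4Nagios accepts it in a given setup is an assumption that would need checking.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified per-host record; times in microseconds. */
struct host_stats {
    unsigned int pl;              /* packet loss, percent */
    double rta, rtmax, rtmin;     /* round-trip avg / max / min */
};

/* Append one round-trip metric to buf. For an unreachable host the
 * value becomes "U", but the label and thresholds are still written,
 * so the number and order of metrics never change. */
static void append_rt(char *buf, size_t len, const char *label,
                      int reachable, double usec, const char *thresholds)
{
    size_t used = strlen(buf);

    if (reachable)
        snprintf(buf + used, len - used, "%s=%0.3fms;%s ",
                 label, usec / 1000, thresholds);
    else
        snprintf(buf + used, len - used, "%s=U;%s ", label, thresholds);
}

/* Build the perfdata string in the same order as the current OK
 * output: rta, rtmax, rtmin, pl. */
void format_perfdata(char *buf, size_t len, const struct host_stats *h,
                     double warn_rta, double crit_rta,
                     unsigned int warn_pl, unsigned int crit_pl)
{
    char rta_thresholds[64];
    int reachable = h->pl < 100;
    size_t used;

    if (len == 0)
        return;
    buf[0] = '\0';

    snprintf(rta_thresholds, sizeof rta_thresholds, "%0.3f;%0.3f;0;",
             warn_rta / 1000, crit_rta / 1000);

    append_rt(buf, len, "rta", reachable, h->rta, rta_thresholds);
    append_rt(buf, len, "rtmax", reachable, h->rtmax, ";;;");
    append_rt(buf, len, "rtmin", reachable, h->rtmin, ";;;");

    used = strlen(buf);
    snprintf(buf + used, len - used, "pl=%u%%;%u;%u;0;100",
             h->pl, warn_pl, crit_pl);
}
```

With thresholds of 200 ms / 500 ms and 40% / 80%, a reachable host yields the same perfdata as the OK line above, while a host at 100% loss yields `rta=U;200.000;500.000;0; rtmax=U;;;; rtmin=U;;;; pl=100%;40;80;0;100`, keeping all four data sources in place.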

Please advise.

Thank you very much in advance,
Alex

scottwilkerson · Post by **scottwilkerson** » Wed Nov 06, 2019 9:31 am

Wait a second.... I see you have

2.2.2-2.20190926git1b8ad57.el7

What is this?

Plugins 2.2.2 hasn't even been released yet?
https://github.com/nagios-plugins/nagio ... s/releases

Looks like there may be an overzealous RPM EPEL packager, OR you have a dev branch setup or something...

alext · Post by **alext** » Wed Nov 06, 2019 5:14 pm

I know, I have seen your comment in another thread. That's why I posted the version we are using to avoid any misunderstanding.

As far as I can tell EPEL is using the maint (maintenance?) branch, and the code in question was introduced in commit d789904a64911abb30b214688d5a8fe876b6baa3 made on Aug 18, 2017 by dirtyren.

I do not know the general policies around the nagios-plugins release process, but chances are this code could make it into the release, therefore I feel it is important to raise an alarm as early as possible.

On the other hand, EPEL may have its own reasons to switch to the maintenance branch - this happened in Aug 2019 - maybe there were some other issues unaddressed in the released version, but the EPEL community is quite big and the issue cannot simply be dismissed.

If you feel this is not the appropriate place to discuss this problem please kindly let me know where it is.

scottwilkerson · Post by **scottwilkerson** » Wed Nov 06, 2019 5:18 pm

It's possible. I have pointed this out to the Plugins developers, and I know several of these are already fixed in maint and several are being looked into before the full Nagios Plugins release for 2.2.2.
