check_icmp performance data
Posted: Tue Nov 05, 2019 10:56 pm
Recently the hosts on our site applied the upgrade for the Nagios plugins (from EPEL), bringing the version up to 2.2.2-2.20190926git1b8ad57.el7. Our environment has several separate Nagios instances monitoring hundreds of hosts, and at some point in the past I figured check_icmp could be a better option from a performance perspective.
Over the past couple of days, one of our hosts with hardware problems has been only intermittently reachable on the network. When it is not reachable, check_icmp reports an unreachable status and 100% packet loss. Makes sense.
Unfortunately, this throws performance data collection off the rails. The four values normally reported are replaced with one, and the RRD file is not updated. When I tried converting it to STORAGE_MULTIPLE, the packet loss database was the only one updated, and the graphing side (which relies on an XML file that lists only one data source and so can't find packet loss at offset 4) then has problems as well. The net result: when the performance data is needed most, i.e. to figure out when the host went down, it is not available.
Compare the performance data:
CRITICAL - XXXXXXXX: Host unreachable @ XX.XX.XX.XX rta nan, lost 100%|pl=100%;40;80;0;100
OK - XXXXXXXX rta 0.205ms lost 0%|rta=0.205ms;200.000;500.000;0; rtmax=0.314ms;;;; rtmin=0.155ms;;;; pl=0%;40;80;0;100
Apparently this was introduced at some point between 2.2.1 and 2.2.2.
From the source RPM of the latest version on EPEL, in plugins-root/check_icmp, in finish(), line 1582:
Code:
while (host) {
    ...
    if (rta_mode && host->pl < 100) {
        printf("%srta=%0.3fms;%0.3f;%0.3f;0; %srtmax=%0.3fms;;;; "
               "%srtmin=%0.3fms;;;; ",
               (targets > 1) ? host->name : "", (float)host->rta / 1000,
               (float)warn.rta / 1000, (float)crit.rta / 1000,
               (targets > 1) ? host->name : "", (float)host->rtmax / 1000,
               (targets > 1) ? host->name : "", (float)host->rtmin / 1000);
    }
(and continues with more sophisticated output conditionals...)
The conditional above kills any hope of getting the missing values when the host is unreachable and packet loss is 100%.
In v2.2.1 the same area looked much simpler and produced the same set of performance data regardless of the host state, so Nagios and PNP4Nagios were both happy:
Code:
/* iterate once more for pretty perfparse output */
...
while(host) {
    if(debug) write(STDOUT_FILENO, "\n", 1);
    printf("%srta=%0.3fms;%0.3f;%0.3f;0; %spl=%u%%;%u;%u;; %srtmax=%0.3fms;;;; %srtmin=%0.3fms;;;; ",
           (targets > 1) ? host->name : "",
           host->rta / 1000, (float)warn.rta / 1000, (float)crit.rta / 1000,
           (targets > 1) ? host->name : "", host->pl, warn.pl, crit.pl,
           (targets > 1) ? host->name : "", (float)host->rtmax / 1000,
           (targets > 1) ? host->name : "", (host->rtmin < DBL_MAX) ? (float)host->rtmin / 1000 : (float)0);
    host = host->next;
}
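To make the structural difference between the two outputs quoted above concrete, here is a small sketch that counts the performance data items in a line of plugin output. The helper name count_perfdata_items is made up for this post and is not part of check_icmp or PNP4Nagios; it also assumes that labels and values never contain '=' themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count "label=value" items in the performance data,
 * i.e. in everything after the first '|' of the plugin output.
 * Assumes no '=' appears inside a label or value. */
size_t count_perfdata_items(const char *plugin_output)
{
    const char *p;
    size_t n = 0;

    assert(plugin_output != NULL);

    p = strchr(plugin_output, '|');
    if (p == NULL)
        return 0;               /* no performance data at all */

    for (p++; *p != '\0'; p++) {
        if (*p == '=')
            n++;                /* one '=' per label=value item */
    }
    return n;
}
```

Fed the OK line it returns 4, fed the CRITICAL line it returns 1, which is the mismatch the RRD update trips over.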
The Question: Is there any way to restore the consistent behavior we had for ages, or at least to introduce a CLI switch that ensures the performance data stays consistent? I realize that the rt values make no sense when the host is unreachable, but I am not asking for the values (which could probably be returned as NaN or empty, whichever is acceptable for PNP4Nagios), just for the structure.
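To illustrate what I mean by "just the structure", here is a rough sketch, not a patch. The struct ping_result and format_perfdata are names I made up and do not mirror the real check_icmp code, and the host-name prefix for multiple targets is left out. It uses the literal U that the Nagios plugin development guidelines allow for a value that could not be determined; whether PNP4Nagios prefers that, NaN, or an empty value is an assumption that would need checking.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-host result; times are in milliseconds. */
struct ping_result {
    double rta_ms, rtmax_ms, rtmin_ms;
    unsigned pl;                /* packet loss in percent, 0..100 */
};

/* Always emit the same four labels in the same order (rta, rtmax, rtmin, pl).
 * When no reply came back, the timings are written as "U" (unknown)
 * instead of being dropped. Returns what snprintf returns. */
int format_perfdata(char *buf, size_t size, const struct ping_result *r,
                    double warn_rta_ms, double crit_rta_ms,
                    unsigned warn_pl, unsigned crit_pl)
{
    assert(buf != NULL && r != NULL);

    if (r->pl < 100) {
        return snprintf(buf, size,
            "rta=%0.3fms;%0.3f;%0.3f;0; rtmax=%0.3fms;;;; rtmin=%0.3fms;;;; "
            "pl=%u%%;%u;%u;0;100",
            r->rta_ms, warn_rta_ms, crit_rta_ms,
            r->rtmax_ms, r->rtmin_ms,
            r->pl, warn_pl, crit_pl);
    }

    /* Host unreachable: keep the structure, mark the timings unknown. */
    return snprintf(buf, size,
        "rta=U;%0.3f;%0.3f;0; rtmax=U;;;; rtmin=U;;;; pl=%u%%;%u;%u;0;100",
        warn_rta_ms, crit_rta_ms, r->pl, warn_pl, crit_pl);
}
```

With something like this, the unreachable case would still carry all four data sources, so the position of pl would not change between OK and CRITICAL results.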
Please advise.
Thank you very much in advance,
Alex