Over the past couple of days one of our hosts, which is experiencing hardware problems, has been only intermittently available on the network. When it is down, check_icmp reports an unreachable status and 100% packet loss. Makes sense.
Unfortunately, this throws performance data collection off the rails. The four values normally reported are replaced with one, and the RRD file is not updated. When I tried converting it to STORAGE_MULTIPLE, the packet loss database was the only one updated, but then the graphing portion (which uses only the XML file, now containing a single data source, and can't find packet loss at offset 4) also has problems. The net result: when the performance data is most needed, i.e. to figure out when the host went down, it is not available.
Compare the performance data when the host is unreachable and when it is up:
CRITICAL - XXXXXXXX: Host unreachable @ XX.XX.XX.XX rta nan, lost 100%|pl=100%;40;80;0;100
OK - XXXXXXXX rta 0.205ms lost 0%|rta=0.205ms;200.000;500.000;0; rtmax=0.314ms;;;; rtmin=0.155ms;;;; pl=0%;40;80;0;100
Apparently this was introduced at some point between 2.2.1 and 2.2.2.
From the source RPM of the latest version on EPEL, in plugins-root/check_icmp, in finish(), line 1582:
while (host) {
    ...
    if (rta_mode && host->pl < 100) {
        printf("%srta=%0.3fms;%0.3f;%0.3f;0; %srtmax=%0.3fms;;;; "
               "%srtmin=%0.3fms;;;; ",
               (targets > 1) ? host->name : "", (float)host->rta / 1000,
               (float)warn.rta / 1000, (float)crit.rta / 1000,
               (targets > 1) ? host->name : "", (float)host->rtmax / 1000,
               (targets > 1) ? host->name : "", (float)host->rtmin / 1000);
    }
(and continues with more sophisticated output conditionals...)
In v2.2.1 the same area looked much simpler, and all four values were printed unconditionally:
/* iterate once more for pretty perfparse output */
...
while(host) {
    if(debug) write(STDOUT_FILENO, "\n", 1);
    printf("%srta=%0.3fms;%0.3f;%0.3f;0; %spl=%u%%;%u;%u;; %srtmax=%0.3fms;;;; %srtmin=%0.3fms;;;; ",
           (targets > 1) ? host->name : "",
           host->rta / 1000, (float)warn.rta / 1000, (float)crit.rta / 1000,
           (targets > 1) ? host->name : "", host->pl, warn.pl, crit.pl,
           (targets > 1) ? host->name : "", (float)host->rtmax / 1000,
           (targets > 1) ? host->name : "", (host->rtmin < DBL_MAX) ? (float)host->rtmin / 1000 : (float)0);
    host = host->next;
}
The question: is there any way to restore the consistent behavior we had for ages? Or at least to introduce a CLI switch that ensures the performance data always has the same structure? I realize that round-trip values make no sense when the host is unreachable, but I am not asking for values (which could probably be returned as NaN or empty, whichever is acceptable for pnp4nagios), just for the structure.
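To illustrate what I mean by keeping the structure, here is a minimal sketch, not a patch against the plugin. The struct and the names host_stats, fmt_rt and format_perfdata are hypothetical and only loosely mirror the fields in the excerpts above; it handles a single target, so the host-name prefix is left out. It assumes the literal "U" placeholder, which the Nagios plugin guidelines allow for a value that could not be determined; whether pnp4nagios handles it the way I need would still have to be verified.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified per-host state (not check_icmp's real struct). */
struct host_stats {
    unsigned int pl;          /* packet loss in percent */
    double rta, rtmax, rtmin; /* round-trip times in microseconds */
};

/* Format one round-trip field: "<value>ms" when valid, literal "U" otherwise. */
static void fmt_rt(char *out, size_t len, double usec, int valid)
{
    if (valid)
        snprintf(out, len, "%0.3fms", usec / 1000);
    else
        snprintf(out, len, "U");
}

/* Always emit rta, rtmax, rtmin and pl, in that order, so that pl stays
 * at the same offset whether or not the host answered.
 * Thresholds for rta are given in microseconds, for pl in percent. */
static int format_perfdata(char *buf, size_t len, const struct host_stats *h,
                           double warn_rta, double crit_rta,
                           unsigned int warn_pl, unsigned int crit_pl)
{
    char rta[32], rtmax[32], rtmin[32];
    int reachable;

    assert(buf != NULL && h != NULL);
    reachable = h->pl < 100; /* same condition as in the newer excerpt */

    fmt_rt(rta, sizeof rta, h->rta, reachable);
    fmt_rt(rtmax, sizeof rtmax, h->rtmax, reachable);
    fmt_rt(rtmin, sizeof rtmin, h->rtmin, reachable);

    return snprintf(buf, len,
                    "rta=%s;%0.3f;%0.3f;0; rtmax=%s;;;; rtmin=%s;;;; "
                    "pl=%u%%;%u;%u;0;100",
                    rta, warn_rta / 1000, crit_rta / 1000,
                    rtmax, rtmin, h->pl, warn_pl, crit_pl);
}
```

With thresholds of 200 ms / 500 ms and 40% / 80%, an unreachable host would then produce rta=U;200.000;500.000;0; rtmax=U;;;; rtmin=U;;;; pl=100%;40;80;0;100, i.e. the same four data sources as in the OK case.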
Please advise.
Thank you very much in advance,
Alex