Bug with NagiosXI Performance Grapher

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
wneville
Posts: 64
Joined: Wed Mar 31, 2021 3:35 pm

Bug with NagiosXI Performance Grapher

Post by wneville »

I am currently experiencing what I believe to be a bug in the NagiosXI performance grapher. I have a service set up to capture what we have defined as 'non-standard' disk mounts ('standard' mounts being /, /apps, /boot, /home, /opt, /tmp, and /var) with the following plugin:

Code: Select all

./check_snmp_storage_wizard.pl -H $HOSTADDRESS$ -G -m ^/tmp\|/boot\|/apps\|/dev/shm\|/home\|/run\|/var\|/opt\|/sys/fs/cgroup\|memory\|Memory\|Swap\|/mongoshare/.snapshot\|/emedia/.snapshot -e -2 -C <community_string> -w 96 -c 97 -o 20000 -f -S 0
This acts as a catch-all for db servers and unique application mounts.

It appears that performance graphs are not coming in correctly. On one host, all the mount points that are discovered show in the graph as 0 capacity (see attached photo dbdev3). Here are the check results for dbdev3, which correctly display perfdata:

Code: Select all

[root@nagiossrv1 libexec]# ./check_snmp_storage_wizard.pl -H dbdev3 -G -m ^/tmp\|/boot\|/apps\|/dev/shm\|/home\|/run\|/var\|/opt\|/sys/fs/cgroup\|memory\|Memory\|Swap\|/mongoshare/.snapshot\|/emedia/.snapshot -e -2 -C <community_string> -w 96 -c 97 -o 20000 -f -S 0
All selected storages (<96%) : OK | '/oradata/db07d'=17GB;31;31;0;32 '/oradata/db05d'=548GB;864;873;0;900 '/oradata/ppmdbt'=333GB;384;388;0;400 '/ck'=0GB;10;10;0;10 '/oradata/dmii'=15GB;24;24;0;25 '/oradata/mbtu'=156GB;259;262;0;270 '/oradata/db02xd'=187GB;287;290;0;299 '/oradata/db07dg'=98GB;192;194;0;200 '/oradata/tmii'=16GB;19;19;0;20 '/'=9GB;13;14;0;14 '/oradata/mbtdg'=201GB;355;359;0;370 '/oradata/ppmdbd'=337GB;383;387;0;399 '/oradata/db02dg'=635GB;671;678;0;699 '/oradata/db04d'=126GB;240;242;0;250 '/oradata/db04i'=97GB;192;194;0;200 '/orafra'=90GB;96;97;0;100 '/oradata/db07i'=13GB;31;31;0;32 '/orawork'=15GB;96;97;0;100 '/oradata/db05i'=240GB;360;364;0;375 '/orashare'=166804GB;1474560;1489920;0;1536000 '/ora'=113GB;259;262;0;270 '/oradata/db02d'=164GB;336;340;0;350 '/oradata/mbtd'=156GB;288;291;0;300 '/dev/vx'=0GB;0;0;0;0 '/oradata/ppmdba'=323GB;384;388;0;400 '/oradata/db04xd'=100GB;182;184;0;190 '/oradata/mbti'=156GB;240;242;0;250 '/oradata/ppmdbi'=318GB;384;388;0;400 '/oradata/db04dg'=229GB;288;291;0;300 '/oraarchive'=4GB;1056;1067;0;1100 '/oradata/db02i'=200GB;230;233;0;240
On another host (nagiossrv1), the performance data is not consistent, likely due to the order in which the mounts appear in the check results. Each time the check runs, the mounts come back in a different order, and it looks like the performance grapher keys on position rather than mount name.

Please let me know if this is intended behavior or if there is a fix planned.
Last edited by wneville on Mon May 06, 2024 2:53 pm, edited 2 times in total.
jmichaelson
Posts: 129
Joined: Wed Aug 23, 2023 1:02 pm

Re: Bug with NagiosXI Performance Grapher

Post by jmichaelson »

I don't have a quick solution, and this does indeed look buggy. I've opened an internal issue for the matter.
Please let us know if you have any other questions or concerns.

-Jason
jmichaelson
Posts: 129
Joined: Wed Aug 23, 2023 1:02 pm

Re: Bug with NagiosXI Performance Grapher

Post by jmichaelson »

I've been corrected. It appears that you are using the "-S 0" option for the check, which re-arranges the output of the plugin depending on what is in a Warning or Critical state:
Code: Select all

-S, --short=<type>[,<where>,<cut>]
  <type>: Make the output shorter :
    0 : only print the global result except the disk in warning or critical
        ex: "< 80% : OK"
    1 : Don't print all info for every disk
        ex : "/ : 66 %used (< 80) : OK"
  <where>: (optional) if = 1, put the OK/WARN/CRIT at the beginning
  <cut>: take the <n> first characters or <n> last if n<0
Please let us know if you have any other questions or concerns.

-Jason
wneville
Posts: 64
Joined: Wed Mar 31, 2021 3:35 pm

Re: Bug with NagiosXI Performance Grapher

Post by wneville »

That seems unrelated to me: this service has never alerted, and has only once been in a CRITICAL "Soft" state due to a timeout. The mount point perfdata is swapped almost every time the check runs. I ran these checks back-to-back on the command line:

Code: Select all

./check_snmp_storage_wizard.pl -H <hostname> -G -m ^/tmp\|/boot\|/apps\|/dev/shm\|/home\|/run\|/var\|/opt\|/sys/fs/cgroup\|memory\|Memory\|Swap\|/mongoshare/.snapshot\|/emedia/.snapshot -e -2 -C <community_string> -w 96 -c 97 -o 20000 -f -S 0
All selected storages (<96%) : OK | '/data'=51GB;163;165;0;170 '/data/nagiosramdisk'=0GB;0;0;0;0 '/'=29GB;50;50;0;52

Code: Select all

./check_snmp_storage_wizard.pl -H <hostname> -G -m ^/tmp\|/boot\|/apps\|/dev/shm\|/home\|/run\|/var\|/opt\|/sys/fs/cgroup\|memory\|Memory\|Swap\|/mongoshare/.snapshot\|/emedia/.snapshot -e -2 -C <community_string> -w 96 -c 97 -o 20000 -f -S 0
All selected storages (<96%) : OK | '/'=29GB;50;50;0;52 '/data'=51GB;163;165;0;170 '/data/nagiosramdisk'=0GB;0;0;0;0
Order of perfdata results in first check: /data, /data/nagiosramdisk, root (/)
Order of perfdata results in second check: root (/), /data, /data/nagiosramdisk
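For the curious, here's a throwaway sketch (Python, nothing to do with the plugin itself) confirming that the two runs above emit the exact same labels and differ only in ordering:

```python
import re

# Perfdata from the two back-to-back runs above (hostname redacted as in the post).
run1 = "'/data'=51GB;163;165;0;170 '/data/nagiosramdisk'=0GB;0;0;0;0 '/'=29GB;50;50;0;52"
run2 = "'/'=29GB;50;50;0;52 '/data'=51GB;163;165;0;170 '/data/nagiosramdisk'=0GB;0;0;0;0"

def labels(perfdata):
    # Pull the quoted label out of each 'label'=value;warn;crit;min;max datum.
    return re.findall(r"'([^']+)'=", perfdata)

print(labels(run1))   # ['/data', '/data/nagiosramdisk', '/']
print(labels(run2))   # ['/', '/data', '/data/nagiosramdisk']

# Same label set, different order -- exactly the symptom described above.
print(sorted(labels(run1)) == sorted(labels(run2)))  # True
```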
swolf
Developer
Posts: 312
Joined: Tue Jun 06, 2017 9:48 am

Re: Bug with NagiosXI Performance Grapher

Post by swolf »

Hi @wneville,

Right now our performance graphing backend is using a component called pnp4nagios, which does require that a check outputs the same labels in the same order, every time the check is run. We have ambitions to move away from that solution in the near future, but for now any plugin that doesn't keep a consistent label order will run into issues.
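To illustrate the requirement, here is a minimal sketch (Python, separate from the actual Perl plugin, assuming standard 'label'=value;warn;crit;min;max perfdata tokens) of normalizing a perfdata string so the same labels come out in the same order on every run:

```python
import re

def sort_perfdata(perfdata):
    """Return a perfdata string with its data sorted by label, so the
    grapher sees a stable label order across check runs."""
    tokens = re.findall(r"'[^']+'=\S+", perfdata)
    return " ".join(sorted(tokens, key=lambda t: re.match(r"'([^']+)'", t).group(1)))

# Two runs that differ only in ordering normalize to the same string.
a = sort_perfdata("'/data'=51GB;163;165;0;170 '/'=29GB;50;50;0;52")
b = sort_perfdata("'/'=29GB;50;50;0;52 '/data'=51GB;163;165;0;170")
print(a)       # '/'=29GB;50;50;0;52 '/data'=51GB;163;165;0;170
print(a == b)  # True
```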

This does look like it's from a plugin that we ship by default in XI - I've filed a bug for the plugin, and will try to get you a patch in advance of the next release.

-Sebastian
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy
wneville
Posts: 64
Joined: Wed Mar 31, 2021 3:35 pm

Re: Bug with NagiosXI Performance Grapher

Post by wneville »

Thanks so much! Any ideas on the behavior seen on the db host? The issue there is that the plugin run from the CLI shows perfdata being populated, but the graph in XI shows 0's for all the mount points.
swolf
Developer
Posts: 312
Joined: Tue Jun 06, 2017 9:48 am

Re: Bug with NagiosXI Performance Grapher

Post by swolf »

Hi @wneville,

For the graph showing all 0's, can you go into Home->Details->Service Status, click on the service, go into the Advanced tab, and copy the Performance Data entry into this thread? I think this could be caused either by a formatting error / difference between your CLI output and what the web interface sees, or it could be that the existing RRD isn't formatted properly for the number of entries you're currently seeing.

For the inconsistent label ordering, see attached for an updated plugin. Please:

1) Make a backup of your old plugin.
2) Copy the attached file to /usr/local/nagios/libexec/check_snmp_storage_wizard.pl .
3) Go into the Service Detail for the check with the bad graphs, then Configure->Re-configure this service->Monitoring. In "Monitor the service with this command", add the argument --sort-perfdata.
4) After saving, your performance data should show in a consistent order (but the graph labels may not match the returned performance data). Once that's confirmed, delete the file at /usr/local/nagios/share/perfdata/<HOST_NAME>/<SERVICE_DESCRIPTION>.rrd (or move it to a different location), and any new data should be correctly named.
wneville
Posts: 64
Joined: Wed Mar 31, 2021 3:35 pm

Re: Bug with NagiosXI Performance Grapher

Post by wneville »

My SFTP is not working at the moment, but I will test the attached version of the plugin once it is back up and working.

In the meantime, I was able to get perfdata to sort by changing line 452 in the original check_snmp_storage_wizard.pl from:

Code: Select all

foreach my $key ( keys %$resultat) {
to:

Code: Select all

foreach my $key (sort keys %$resultat) {
I don't have a use case where I wouldn't want them sorted, so with this change sorting simply becomes the default behavior, and it hasn't had any negative impact as far as I can tell.

Also, since I made that change, the service in question is no longer bringing in all 0's to the performance graph. I would've liked to get to the root of that issue, but it seems to be working now (and sorting correctly, and therefore graphing correctly). I don't think the sort had anything to do with the graph showing all 0's, but I have notified the DBA team that the graph is properly populating and am awaiting their confirmation that the NagiosXI perfdata lines up with their data.