Performance Data Not Working

Posted: Fri May 02, 2014 12:48 pm
by GreatWolfResorts
I've sifted through earlier posts on performance data issues without any real luck. The problem appeared after a two-phase change to our NagiosXI environment:

1. We needed to move our system off a physical server that requires decommissioning.
2. We wanted to upgrade to 2012 R2.9 to stay current.

The new server is a VM on ESXi. The former version of the software was 2012 R1.6 (so quite a jump). The first step was to build the VM on R1.6, then restore our latest backup to this server. Once complete, I logged in and verified things looked good. The second step was to download R2.9 and run the upgrade script. This completed without error. One item that stood out right away was that the performance graphs weren't working. This was apparent because many of our networking checks were complaining about the mrtg .rrd files not being present. I grabbed the files from the old server and populated them on the new one. These errors cleared, but we still get no new performance data.

Hopefully this information will prove helpful:

Code: Select all

[root@gwr-noc /]# tail -15 /usr/local/nagios/var/npcd.log
[05-02-2014 12:02:45] NPCD: npcd Daemon (0.4.14) started with PID=18461
[05-02-2014 12:02:45] NPCD: Please have a look at 'npcd -V' to get license information
[05-02-2014 12:02:45] NPCD: HINT: load_threshold is enabled - ('20.000000')
[05-02-2014 12:08:11] NPCD: Caught Termination Signal - Hasta la vista... baby
[05-02-2014 12:08:11] NPCD: npcd Daemon (0.4.14) started with PID=27269
[05-02-2014 12:08:11] NPCD: Please have a look at 'npcd -V' to get license information
[05-02-2014 12:08:11] NPCD: HINT: load_threshold is enabled - ('20.000000')
[05-02-2014 12:19:07] NPCD: Caught Termination Signal - Hasta la vista... baby
[05-02-2014 12:19:07] NPCD: npcd Daemon (0.4.14) started with PID=25768
[05-02-2014 12:19:07] NPCD: Please have a look at 'npcd -V' to get license information
[05-02-2014 12:19:07] NPCD: HINT: load_threshold is enabled - ('20.000000')
[05-02-2014 12:27:23] NPCD: Caught Termination Signal - Hasta la vista... baby
[05-02-2014 12:28:06] NPCD: npcd Daemon (0.4.14) started with PID=3144
[05-02-2014 12:28:06] NPCD: Please have a look at 'npcd -V' to get license information
[05-02-2014 12:28:06] NPCD: HINT: load_threshold is enabled - ('20.000000')

Code: Select all

[root@gwr-noc /]# tail -15 /usr/local/nagios/var/perfdata.log
2014-04-29 17:00:35 [29373] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1398808808.perfdata.host-PID-29373 deleted
2014-04-29 17:00:35 [29373] [0] *** Timeout while processing Host: "WB8000_DELLS_3" Service: "_HOST_"
2014-04-29 17:00:35 [29373] [0] *** process_perfdata.pl terminated on signal ALRM
2014-04-29 17:40:35 [26051] [0] *** TIMEOUT: Timeout after 12 secs. ***
2014-04-29 17:40:35 [26051] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2014-04-29 17:40:35 [26051] [0] *** TIMEOUT: Please check your npcd.cfg
2014-04-29 17:40:35 [26051] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1398811208.perfdata.host-PID-26051 deleted
2014-04-29 17:40:35 [26051] [0] *** Timeout while processing Host: "WI-FILER-IMM" Service: "_HOST_"
2014-04-29 17:40:35 [26051] [0] *** process_perfdata.pl terminated on signal ALRM
2014-04-30 13:05:31 [27739] [0] *** TIMEOUT: Timeout after 12 secs. ***
2014-04-30 13:05:31 [27739] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2014-04-30 13:05:31 [27739] [0] *** TIMEOUT: Please check your npcd.cfg
2014-04-30 13:05:31 [27739] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1398881110.perfdata.host-PID-27739 deleted
2014-04-30 13:05:31 [27739] [0] *** Timeout while processing Host: "WI-SW5" Service: "_HOST_"
2014-04-30 13:05:31 [27739] [0] *** process_perfdata.pl terminated on signal ALRM
One interesting detail: the latest timestamp in this log is from the day before yesterday. Nothing has been written to this log since the migration.
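
A quick way to confirm how long a log has been idle is to compare its modification time with the current time. This is only a sketch: the function name log_age_minutes is made up for illustration, and it assumes GNU coreutils (stat -c), as on a typical CentOS/RHEL install.

```shell
#!/bin/bash
# Print how many whole minutes have passed since a file was last modified.
# Hypothetical helper; assumes GNU stat (the -c option).
log_age_minutes() {
    local file=$1 now mtime
    [ -e "$file" ] || { echo "missing: $file" >&2; return 1; }
    now=$(date +%s)               # current time, seconds since the epoch
    mtime=$(stat -c %Y "$file")   # last modification time, same unit
    echo $(( (now - mtime) / 60 ))
}

# Example, using the path from this thread:
#   log_age_minutes /usr/local/nagios/var/perfdata.log
```

A large value only shows that nothing has been logged recently. If this log mostly records errors and timeouts, as the excerpt above suggests, silence by itself does not prove that processing has stopped.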

Code: Select all

Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      246G  6.7G  227G   3% /
tmpfs                 2.9G     0  2.9G   0% /dev/shm
/dev/sda1              97M   28M   65M  31% /boot

Code: Select all

[root@gwr-noc /]# ls /usr/local/nagios/var/spool/xidpe | wc -l
0
At this point I'm just grasping at straws, so any and all help is greatly appreciated. Thanks!

Dan

Re: Performance Data Not Working

Posted: Fri May 02, 2014 12:56 pm
by abrist
1. Was there an architecture change in this process (32bit to 64 bit)?
2. Is npcd running?

Code: Select all

service npcd status

Re: Performance Data Not Working

Posted: Fri May 02, 2014 12:58 pm
by GreatWolfResorts
Both the old server and the new VM are 64bit, so we should be okay there.

Code: Select all

[root@gwr-noc /]# service npcd status
NPCD running (pid 3144).

Re: Performance Data Not Working

Posted: Fri May 02, 2014 1:00 pm
by lmiltchev
Can you also run the following commands, and show us the output?

Code: Select all

ls /usr/local/nagios/var/spool/perfdata | wc -l
ls /usr/local/nagios/var/spool/checkresults | wc -l
Have you tried using a RAM disk on either of these two machines?

Re: Performance Data Not Working

Posted: Fri May 02, 2014 1:02 pm
by slansing
Hmm, what is the output of all three of these? I realize you already posted one above:

Code: Select all

ls /usr/local/nagios/var/spool/xidpe | wc -l
ls /usr/local/nagios/var/spool/perfdata| wc -l
ls /usr/local/nagios/var/spool/checkresults | wc -l
Is this only happening with your MRTG based checks? Or everything?
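
The three counts can also be collected in one pass. The following is a rough sketch rather than anything shipped with Nagios XI: spool_counts is a hypothetical function name, and the default base path is simply the spool directory used in the commands above.

```shell
#!/bin/bash
# Print the number of entries in each spool subdirectory named in this thread.
# Hypothetical helper; the base directory can be overridden as the first argument.
spool_counts() {
    local base=${1:-/usr/local/nagios/var/spool} dir count
    for dir in xidpe perfdata checkresults; do
        if [ -d "$base/$dir" ]; then
            # Same count as "ls DIR | wc -l"; $(( )) strips any padding from wc.
            count=$(( $(ls "$base/$dir" | wc -l) ))
            printf '%s %d\n' "$dir" "$count"
        else
            printf '%s %s\n' "$dir" "missing"
        fi
    done
}

# Example:
#   spool_counts
```

Reporting a missing directory separately from an empty one helps after a restore, where a path may not have been recreated at all.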

Re: Performance Data Not Working

Posted: Fri May 02, 2014 1:06 pm
by GreatWolfResorts

Code: Select all

[root@gwr-noc /]# ls /usr/local/nagios/var/spool/perfdata | wc -l
0
[root@gwr-noc /]# ls /usr/local/nagios/var/spool/checkresults | wc -l
6
[root@gwr-noc /]# ls /usr/local/nagios/var/spool/xidpe | wc -l
0
I haven't used a RAM disk with either system.

This is happening with all performance data, not just MRTG. The MRTG-based checks were simply what first drew my attention to the problem, because their failures were so obvious on the tactical screen.

Re: Performance Data Not Working

Posted: Fri May 02, 2014 1:40 pm
by lmiltchev
Are the perf graphs blank? What are the permissions on the "perfdata" directory and the items in it?

Code: Select all

ll -d /usr/local/nagios/share/perfdata/
ll /usr/local/nagios/share/perfdata/
Have you tried restarting npcd?

Code: Select all

service npcd restart
Can you run the following command and watch the output for a while to see if there is going to be any activity?

Code: Select all

watch 'ls /usr/local/nagios/var/spool/xidpe | wc -l'
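
For the permissions question, a long listing can be hard to scan when there are hundreds of host directories. A possible shortcut is to let find print only the entries that are not owned by the expected user. This is a sketch: not_owned_by is a hypothetical name, and the default owner of nagios is an assumption about which account writes the perfdata files on this system.

```shell
#!/bin/bash
# List everything under a directory that is NOT owned by the given user.
# Hypothetical helper; the default owner "nagios" is an assumption.
# No output means every entry has the expected owner.
not_owned_by() {
    local dir=$1 user=${2:-nagios}
    find "$dir" ! -user "$user" -print
}

# Example, using the path from this thread:
#   not_owned_by /usr/local/nagios/share/perfdata nagios
```

Files copied over from another server as root are a common source of wrong ownership, so any output here would be worth a closer look.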

Re: Performance Data Not Working

Posted: Fri May 02, 2014 1:54 pm
by GreatWolfResorts
The performance graphs are displaying historical data to the point when the backup was made. Nothing further. Image is attached.

Code: Select all

[root@gwr-noc init.d]# ll -d /usr/local/nagios/share/perfdata/
drwxrwxr-x 595 nagios nagios 20480 Apr 10 13:26 /usr/local/nagios/share/perfdata/

[root@gwr-noc init.d]# ll /usr/local/nagios/share/perfdata/
total 741032
drwxrwxr-x 2 nagios nagios      4096 Jul 20  2012 5N-ASA
drwxrwxr-x 2 nagios nagios      4096 Feb 18 15:59 5N-BACKUP
drwxrwxr-x 2 nagios nagios      4096 Feb 18 16:01 5N-BACKUP-RSA2
drwxrwxr-x 2 nagios nagios      4096 Jan 16  2013 5N-CCI
drwxrwxr-x 2 nagios nagios      4096 Jul 20  2012 5N-DS3400A
drwxrwxr-x 2 nagios nagios      4096 Jul 20  2012 5N-DS3400B
drwxrwxr-x 2 nagios nagios      4096 Jul 20  2012 5N-FSW1
drwxrwxr-x 2 nagios nagios      4096 Jul 20  2012 5N-FSW2
npcd has already been restarted a few times, without success.

The watch on xidpe is showing the value jump from 0 to 2 and back again periodically. So it does appear to have some activity in the folder.
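
Because watch only shows the current value, a short timestamped sample can make this kind of activity easier to paste into a reply. A minimal sketch, with sample_count as a hypothetical function name and arbitrary default values for the number of samples and the interval:

```shell
#!/bin/bash
# Print a timestamped entry count for a directory several times in a row.
# Hypothetical helper: sample_count DIR [SAMPLES] [INTERVAL_SECONDS]
sample_count() {
    local dir=$1 samples=${2:-10} interval=${3:-2} i
    for (( i = 0; i < samples; i++ )); do
        printf '%s %d\n' "$(date +%H:%M:%S)" "$(( $(ls "$dir" | wc -l) ))"
        sleep "$interval"
    done
}

# Example: 30 samples, 2 seconds apart
#   sample_count /usr/local/nagios/var/spool/xidpe 30 2
```

Counts that rise and then fall back to 0 suggest files are being both written and picked up; a count that only grows would point at the consumer side instead.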

Re: Performance Data Not Working

Posted: Fri May 02, 2014 2:10 pm
by lmiltchev
Can you try adding a new, "test" host, wait for 15-20 min and check to see if perf graphs work for the "new" host?

Re: Performance Data Not Working

Posted: Fri May 02, 2014 2:26 pm
by GreatWolfResorts
I added a new host and disk space check to the server. Performance data is trending properly and displaying on the graphs.
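
Since a new host graphs correctly while the restored hosts do not, it may help to separate the RRD files that are still being updated from the ones that are not. The sketch below uses file modification times as a stand-in for "last update". The function names are hypothetical, and it assumes the .rrd files live under the per-host directories in share/perfdata, which the listing above does not actually show.

```shell
#!/bin/bash
# List .rrd files modified within the last N minutes (default 30).
# Hypothetical helper; assumes GNU find (-mmin).
rrd_updated_within() {
    local dir=$1 minutes=${2:-30}
    find "$dir" -type f -name '*.rrd' -mmin "-$minutes" -print
}

# List .rrd files NOT modified for more than N minutes (default 30).
rrd_stale_for() {
    local dir=$1 minutes=${2:-30}
    find "$dir" -type f -name '*.rrd' -mmin "+$minutes" -print
}

# Example, using the path from this thread:
#   rrd_updated_within /usr/local/nagios/share/perfdata 30
#   rrd_stale_for /usr/local/nagios/share/perfdata 30 | head
```

If only the new host's files show up as recently updated, comparing one of them with a stale file (owner, permissions, size) would be a reasonable next step.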