Gaps in Performance graphs

Posted: Mon Jul 01, 2013 10:09 am
by benhank
We have an isolated case where one device's performance graph has gaps. We have checked multiple other devices and their graphs are fine. I have attached a document showing the issue.

Re: Gaps in Performance graphs

Posted: Mon Jul 01, 2013 10:20 am
by abrist
Let's check whether graphing was hitting a timeout or load limit:

Code: Select all

grep WKENCHP03.Healthone.org /usr/local/nagios/var/perfdata.log
grep WKENCHP03.Healthone.org /usr/local/nagios/var/npcd.log

Re: Gaps in Performance graphs

Posted: Mon Jul 01, 2013 10:44 am
by benhank
Here are the responses from the grep:

Code: Select all

Using username "root".
Last login: Mon Jul  1 11:35:09 2013 from 172.22.161.161
[root@LkennagiosP01 ~]# grep WKENCHP03.Healthone.org /usr/local/nagios/var/perfdata.log
[root@LkennagiosP01 ~]# grep WKENCHP03.Healthone.org /usr/local/nagios/var/npcd.log
[root@LkennagiosP01 ~]#
No data was returned.
I did run a tail on both logs and got this:

Code: Select all

[root@LkennagiosP01 ~]# tail /usr/local/nagios/var/perfdata.log
2013-07-01 11:44:25 [15088] [0] *** TIMEOUT: Please check your npcd.cfg
2013-07-01 11:44:25 [15088] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1372693442.perfdata.service-PID-15088 deleted
2013-07-01 11:44:25 [15088] [0] *** Timeout while processing Host: "PBY_F1_2960-S12" Service: "If_Vlan100"
2013-07-01 11:44:25 [15088] [0] *** process_perfdata.pl terminated on signal ALRM
2013-07-01 11:45:34 [19357] [0] *** TIMEOUT: Timeout after 5 secs. ***
2013-07-01 11:45:34 [19357] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2013-07-01 11:45:34 [19357] [0] *** TIMEOUT: Please check your npcd.cfg
2013-07-01 11:45:34 [19357] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1372693517.perfdata.service-PID-19357 deleted
2013-07-01 11:45:34 [19357] [0] *** Timeout while processing Host: "SOM-UPS-IDF-3-1" Service: "Connectivity"
2013-07-01 11:45:34 [19357] [0] *** process_perfdata.pl terminated on signal ALRM
[root@LkennagiosP01 ~]#

Code: Select all

[root@LkennagiosP01 ~]# tail /usr/local/nagios/var/npcd.log
[07-01-2013 11:35:31] NPCD: ERROR: Executed command exits with return code '7'
[07-01-2013 11:35:31] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1372692917.perfdata.service'
[07-01-2013 11:39:29] NPCD: ERROR: Executed command exits with return code '7'
[07-01-2013 11:39:29] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1372693142.perfdata.service'
[07-01-2013 11:40:41] NPCD: ERROR: Executed command exits with return code '7'
[07-01-2013 11:40:41] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1372693217.perfdata.service'
[07-01-2013 11:44:25] NPCD: ERROR: Executed command exits with return code '7'
[07-01-2013 11:44:25] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1372693442.perfdata.service'
[07-01-2013 11:45:34] NPCD: ERROR: Executed command exits with return code '7'
[07-01-2013 11:45:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1372693517.perfdata.service'
[root@LkennagiosP01 ~]#
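
Since the grep for a single host came back empty while the tail shows timeouts for other hosts, it can help to tally the timeout lines per host. This is a minimal sketch, assuming the perfdata.log line format shown in the tail output above. The function name timeouts_by_host is made up for illustration and is not part of Nagios or PNP.

```shell
# Tally "Timeout while processing" entries per host in a PNP perfdata.log.
# Assumes the line format shown in the tail output above, where the host
# name is the first double-quoted field on the line.
# Hypothetical helper; usage: timeouts_by_host /path/to/perfdata.log
timeouts_by_host() {
    awk -F'"' '
        /Timeout while processing Host:/ { count[$2]++ }
        END { for (h in count) print count[h], h }
    ' "$1" | sort -rn
}
```

For example, `timeouts_by_host /usr/local/nagios/var/perfdata.log` prints one line per host, with the number of timeouts first and the most affected host at the top.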

Re: Gaps in Performance graphs

Posted: Mon Jul 01, 2013 10:47 am
by abrist
Could you post those services' configs?

Re: Gaps in Performance graphs

Posted: Mon Jul 01, 2013 11:20 am
by benhank
Sent to you via PM.

Re: Gaps in Performance graphs

Posted: Mon Jul 01, 2013 12:57 pm
by abrist
Looks like you are hitting the timeout limit.
Edit:

Code: Select all

/usr/local/nagios/etc/pnp/process_perfdata.cfg
Change:

Code: Select all

TIMEOUT = 5
To:

Code: Select all

TIMEOUT = 20
Restart npcd:

Code: Select all

service npcd restart
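
If you would rather script the edit than change the file by hand, something like the following could work. It is only a sketch, assuming the setting appears on its own line as `TIMEOUT = <number>`, as in the snippet above. The function name set_pnp_timeout is hypothetical, and it keeps a copy of the original file with a .bak suffix.

```shell
# Set the TIMEOUT value in a PNP process_perfdata.cfg, keeping a backup.
# Hypothetical helper; usage: set_pnp_timeout <config file> <seconds>
set_pnp_timeout() {
    cfg="$1"
    secs="$2"
    # Refuse anything that is not a plain non-negative integer.
    case "$secs" in
        ''|*[!0-9]*) echo "timeout must be a whole number" >&2; return 1 ;;
    esac
    # Keep the original next to the config before rewriting it.
    cp -p "$cfg" "$cfg.bak" || return 1
    # Rewrite only the line that starts with TIMEOUT followed by "=".
    sed "s/^TIMEOUT[[:space:]]*=.*/TIMEOUT = $secs/" "$cfg.bak" > "$cfg"
}
```

Usage would be along the lines of `set_pnp_timeout /usr/local/nagios/etc/pnp/process_perfdata.cfg 20`, followed by the npcd restart shown above.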

Re: Gaps in Performance graphs

Posted: Mon Jul 01, 2013 1:18 pm
by benhank
This system has a recent gap in its CPU usage graph. We found the following in perfdata.log; there was nothing in npcd.log.

[root@LkennagiosP01 nagiosxi]# grep WKENAHPRESP236.Healthone.org /usr/local/nagios/var/perfdata.log
2013-06-29 22:37:59 [23517] [0] *** Timeout while processing Host: "WKENAHPRESP236.Healthone.org" Service: "NRPE__Event_ID_1008_Status"

Re: Gaps in Performance graphs

Posted: Mon Jul 01, 2013 1:31 pm
by abrist
Yeah. The perfdata for that check was most likely dropped when processing timed out, which leaves a gap in the graph. Increasing the timeout should prevent this in the future. Have you done so yet?

Re: Gaps in Performance graphs

Posted: Mon Jul 01, 2013 1:37 pm
by benhank
No, it is currently set to 5 secs. Should we go to 10?

Re: Gaps in Performance graphs

Posted: Mon Jul 01, 2013 1:44 pm
by slansing
10 may not be enough; you could try 15 or 20 to start with.
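
Whichever value you pick, one way to judge whether it is high enough is to count the timeout lines logged after the change. This is a rough sketch, assuming perfdata.log lines start with a `YYYY-MM-DD HH:MM:SS` timestamp as in the output posted earlier in this thread. The function name timeouts_since is made up for illustration.

```shell
# Count "Timeout while processing" lines logged at or after a given time.
# Assumes each log line starts with "YYYY-MM-DD HH:MM:SS", so a plain
# string comparison orders the timestamps correctly.
# Hypothetical helper; usage: timeouts_since "2013-07-01 13:00:00" <log file>
timeouts_since() {
    awk -v since="$1" '
        /Timeout while processing Host:/ && ($1 " " $2) >= since { n++ }
        END { print n + 0 }
    ' "$2"
}
```

Run it with the time of the npcd restart and /usr/local/nagios/var/perfdata.log. If the count keeps growing after the change, the timeout is probably still too low.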