Max Service Check Time for Graphing to Work

maglaubig · Post by **maglaubig** » Thu Aug 29, 2019 7:39 am

I have a service check for a counter on a PDU that reports its peak power (until manually reset) along with the date of the last manual reset. There isn't really any reason to poll these counters every 5 minutes, so I tried every 6 hours, then lowered it to 3 hours and cleared the RRD files after making the service change, but nothing is graphing.

I've searched for this, but I'm probably calling it the wrong thing, so I'm coming up with nothing. What is the maximum service check interval that still allows the RRD graphs in XI to work?

benjaminsmith · Post by **benjaminsmith** » Thu Aug 29, 2019 2:38 pm

Hello,

I'm not aware of any exact interval limit, but if you have checks running every 3 hours it may take some time to gather enough data to generate performance graphs.

A few things to check:

1. Is npcd up and running?

Code: Select all

systemctl status npcd

2. Count the number of spooled files. If either of these commands returns more than 20,000, you may need to delete files so the processes can catch up.

Code: Select all

ls /usr/local/nagios/var/spool/perfdata/ | wc -l
ls /usr/local/nagios/var/spool/xidpe/ | wc -l

3. Review the performance data log for errors. Please post the output of the following command:

Code: Select all

tail -25 /usr/local/nagios/var/perfdata.log
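The spool count from step 2 can be wrapped in a small helper. This is only a minimal sketch: it assumes the 20,000 figure mentioned above as the default threshold, and the function name `spool_backlog` is hypothetical, not part of Nagios XI.

```shell
#!/bin/sh
# Hypothetical helper: print how many entries a spool directory holds and
# return non-zero when the count exceeds a threshold (default 20000,
# taken from the guidance in step 2).
spool_backlog() {
    dir=$1
    limit=${2:-20000}
    # Count directory entries, one per line (same idea as `ls | wc -l`).
    count=$(ls -1 "$dir" 2>/dev/null | wc -l | tr -d ' ')
    echo "$dir: $count files"
    [ "$count" -le "$limit" ]
}

# Usage:
#   spool_backlog /usr/local/nagios/var/spool/perfdata || echo "perfdata backlog"
#   spool_backlog /usr/local/nagios/var/spool/xidpe    || echo "xidpe backlog"
```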

See: Nagios XI - Performance Graph Problems

Thanks.

maglaubig · Post by **maglaubig** » Thu Aug 29, 2019 2:57 pm

The service is up, and I'm not having trouble with graphing in general, only with this one service check; everything else is graphing as expected. I forced a few checks on this particular service a few minutes apart and a graph did generate, but nothing has graphed since.

It's been running for a few days, so I would have expected something to show up by now. I'm open to lowering this to an hour, but I don't think the data is worth collecting more frequently than that. Ideally I'd only run this once or twice a day; what I really want is for it to help drive the capacity planning reports.

The service (npcd) is running, and each of the two file-count commands returned exactly 2. The tail of the perfdata.log file follows; I changed a few switch names to generic ones because they are actual hostnames. Some of those switch ports time out on an SNMP check because the switches are old and pretty slow.

Code: Select all

[root@nagiosxi01 libexec]# tail -25 /usr/local/nagios/var/perfdata.log
2019-08-22 15:03:11 [16057] [0] *** process_perfdata.pl terminated on signal ALRM
2019-08-22 18:03:26 [64657] [0] *** TIMEOUT: Timeout after 5 secs. ***
2019-08-22 18:03:26 [64657] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2019-08-22 18:03:26 [64657] [0] *** TIMEOUT: Please check your npcd.cfg
2019-08-22 18:03:26 [64657] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1566514983.perfdata.service-PID-64657 deleted
2019-08-22 18:03:26 [64657] [0] *** Timeout while processing Host: "switch 1" Service: "Stack-Port-2_27_Packets_Error_Multicast_Broadcast_In_Out_-_Queue_Length_Out"
2019-08-22 18:03:26 [64657] [0] *** process_perfdata.pl terminated on signal ALRM
2019-08-25 18:03:20 [35564] [0] *** TIMEOUT: Timeout after 5 secs. ***
2019-08-25 18:03:20 [35564] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2019-08-25 18:03:20 [35564] [0] *** TIMEOUT: Please check your npcd.cfg
2019-08-25 18:03:20 [35564] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1566774168.perfdata.service-PID-35564 deleted
2019-08-25 18:03:20 [35566] [0] *** TIMEOUT: Timeout after 5 secs. ***
2019-08-25 18:03:20 [35564] [0] *** Timeout while processing Host: "switch 2" Service: "Stack-Port-1_40_Bandwidth"
2019-08-25 18:03:20 [35566] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2019-08-25 18:03:20 [35564] [0] *** process_perfdata.pl terminated on signal ALRM
2019-08-25 18:03:20 [35566] [0] *** TIMEOUT: Please check your npcd.cfg
2019-08-25 18:03:20 [35566] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1566774184.perfdata.service-PID-35566 deleted
2019-08-25 18:03:20 [35566] [0] *** Timeout while processing Host: "switch 3" Service: "Stack-Port-2_40_Packets_Error_Multicast_Broadcast_In_Out_-_Queue_Length_Out"
2019-08-25 18:03:20 [35566] [0] *** process_perfdata.pl terminated on signal ALRM
2019-08-29 00:03:03 [30017] [0] *** TIMEOUT: Timeout after 5 secs. ***
2019-08-29 00:03:03 [30017] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2019-08-29 00:03:03 [30017] [0] *** TIMEOUT: Please check your npcd.cfg
2019-08-29 00:03:03 [30017] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1567054968.perfdata.service-PID-30017 deleted
2019-08-29 00:03:03 [30017] [0] *** Timeout while processing Host: "switch 4" Service: "Stack-Port-2_19_Packets_Error_Multicast_Broadcast_In_Out_-_Queue_Length_Out"
2019-08-29 00:03:03 [30017] [0] *** process_perfdata.pl terminated on signal ALRM

benjaminsmith · Post by **benjaminsmith** » Thu Aug 29, 2019 4:15 pm

Hello,

Thank you for sending over the log file; there are quite a few timeout errors.

2019-08-25 18:03:20 [35566] [0] *** process_perfdata.pl terminated on signal ALRM
2019-08-29 00:03:03 [30017] [0] *** TIMEOUT: Timeout after 5 secs. ***
2019-08-29 00:03:03 [30017] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops

Let's increase the default timeout settings in process_perfdata.cfg:

1. Edit the file

Code: Select all

vi /usr/local/nagios/etc/pnp/process_perfdata.cfg

2. Change the TIMEOUT from 5 to 20, save and exit.

Code: Select all

TIMEOUT = 20
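The same edit can be scripted instead of done by hand in vi. A minimal sketch, assuming the file uses the `TIMEOUT = <n>` form shown above; the function name `set_perfdata_timeout` is hypothetical, and a `.bak` copy of the original file is kept next to it.

```shell
#!/bin/sh
# Hypothetical helper: set TIMEOUT in a process_perfdata.cfg-style file,
# keeping a .bak copy of the original alongside it.
set_perfdata_timeout() {
    cfg=$1
    secs=$2
    # Replace the whole "TIMEOUT = <n>" line; other keys are left untouched.
    sed -i.bak "s/^TIMEOUT[[:space:]]*=.*/TIMEOUT = $secs/" "$cfg"
    # Show the resulting line so the change can be confirmed.
    grep '^TIMEOUT' "$cfg"
}

# Usage:
#   set_perfdata_timeout /usr/local/nagios/etc/pnp/process_perfdata.cfg 20
```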

maglaubig · Post by **maglaubig** » Tue Sep 03, 2019 8:36 am

I've made the changes and haven't seen any new entries in the logs, so it looks like the timeouts have stopped. I also let this run over the holiday weekend and thought I might see some improvement in graphing, but no luck.

Is the answer here just to perform the check once every hour?

benjaminsmith · Post by **benjaminsmith** » Tue Sep 03, 2019 1:40 pm

Hello,

I would recommend increasing the load_threshold in npcd.cfg to 50 to help process performance graphs.

Code: Select all

vi /usr/local/nagios/etc/pnp/npcd.cfg
# change 
load_threshold = 50.00

Is the nagios user account expired on the server?

Code: Select all

chage -l nagios

Lastly, is the plugin returning valid performance data? You can view the performance data in the Advanced Tab under service details.

perf-data.png
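A quick way to sanity-check a perfdata token copied from the Advanced tab is to match it against the usual plugin format `label=value[UOM];warn;crit;min;max`. The sketch below is a simplified check only (it ignores details such as the `U` placeholder for unknown values), and the function name `perfdata_token_ok` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed when a single perfdata token looks like
# label=value[UOM][;warn;crit;min;max] with a plain numeric value.
perfdata_token_ok() {
    printf '%s\n' "$1" |
        grep -Eq "^('[^'=]+'|[^'= ]+)=-?[0-9]+([.][0-9]+)?[A-Za-z%]*(;[^; ]*){0,4}\$"
}

# Usage (tokens made up for illustration):
#   perfdata_token_ok 'peak_power=1234W;;;0;' && echo "numeric value"
#   perfdata_token_ok 'peak_power="1234"'     || echo "value is a quoted string"
```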

maglaubig · Post by **maglaubig** » Tue Sep 03, 2019 4:01 pm

I made the change to npcd.cfg; it was set to 10.0 previously. The nagios user isn't expired and is set to never expire (I deployed XI from the OVA and only added CPU/RAM resources).

The service is returning performance data, but I'm not checking it for any warning or critical conditions. I did have an issue with some SNMP checks a while back where an integer was returned as a string and the graphing plugin had trouble graphing it at all. I don't think that's the case this time since if I force several checks in succession graphing does occur.

I did also put the checks to every 2 hours from every 3 earlier this morning. It looks like I'm getting graph data now, but for some odd reason the graph is putting it at every 5 min. I'm not sure if this is a by design sort of thing.

benjaminsmith · Post by **benjaminsmith** » Tue Sep 03, 2019 4:48 pm

Hello,

maglaubig wrote: "I did also put the checks to every 2 hours from every 3 earlier this morning. It looks like I'm getting graph data now, but for some odd reason the graph is putting it at every 5 min. I'm not sure if this is a by design sort of thing"

Glad to hear it's graphing data. It's time-series data that is stored in an RRD database to conserve disk space. As such, there is some curve fitting applied to the data. Can you let it collect data for a day or so and then post a screenshot of the graph so we can take a look?
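To illustrate why a 2-hour check can still appear as 5-minute points: an RRD file stores one value per fixed-size step, so a single check result is spread across every step since the previous update. A rough sketch, assuming a 300-second step (an assumption here, not confirmed in this thread) and a hypothetical function name:

```shell
#!/bin/sh
# Number of fixed-size RRD steps covered by one check interval.
# Both arguments are in seconds; the 300 s default step is an assumption.
steps_per_check() {
    interval=$1
    step=${2:-300}
    echo $(( interval / step ))
}

# Usage:
#   steps_per_check 7200    # a 2-hour interval with a 300 s step
```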

maglaubig · Post by **maglaubig** » Wed Sep 04, 2019 7:44 am

The graph isn't exciting in the least, but at least it's populating the data so it can be used in capacity planning reports now.

BoringNagiosGraph.png

So it looks like 2 hours or less works. I didn't try anything between 2 and 3 hours, so in case anyone comes across this thread, it might be possible; it just isn't something I think is worth the time at the moment. At least it's not every 5 minutes like the other checks, which legitimately need to be that frequent, and this should reduce load overall.

I did find a reference in the Nagios best practices KB saying that graphing should work up to a maximum interval of 3 hours. However, the post is a few years old, and I'm not sure how many people would wait that long between checks and so run into it not working, so it may not have come up. Additionally, some of the internal components may have changed just enough to matter at that 3-hour mark.
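One factor that could account for a cutoff somewhere between 2 and 3 hours is the RRD data source heartbeat: rrdtool records the interval as unknown when two updates arrive further apart than the heartbeat. The sketch below assumes a heartbeat of 8460 seconds, which is an assumption and not a value confirmed anywhere in this thread; the function name is hypothetical. The real value for a given file can be read with `rrdtool info <file>.rrd | grep minimal_heartbeat`.

```shell
#!/bin/sh
# Hypothetical helper: succeed when a check interval (seconds) is short
# enough that consecutive RRD updates arrive within the heartbeat.
interval_within_heartbeat() {
    interval=$1
    heartbeat=${2:-8460}   # assumed default; verify with `rrdtool info`
    [ "$interval" -le "$heartbeat" ]
}

# Usage:
#   interval_within_heartbeat 7200  && echo "2 h fits within the heartbeat"
#   interval_within_heartbeat 10800 || echo "3 h exceeds the heartbeat"
```

If the heartbeat on a given system turns out to be larger, passing it as the second argument shows which intervals would still fit.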

benjaminsmith · Post by **benjaminsmith** » Wed Sep 04, 2019 11:42 am

Hi @maglaubig,

You should be able to set the check_interval to 4 hours and still get valid performance data.

See: Nagios XI Check Interval Considerations
