Max Service Check Time for Graphing to Work

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
maglaubig
Posts: 26
Joined: Thu Jan 03, 2019 2:02 pm

Max Service Check Time for Graphing to Work

Post by maglaubig »

I have a service check for a counter on a PDU that shows its peak power (until manually reset) along with the date of the last manual reset. There isn't really any reason to poll these counters every 5 minutes, so I tried every 6 hours, then lowered it to 3 hours and cleared the RRD files after making the service change, but nothing is graphing.

I've searched for this, but I'm probably calling it the wrong thing, so I'm coming up with nothing. What is the maximum service check interval allowed so that the RRD graphs in XI will work?
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: Max Service Check Time for Graphing to Work

Post by benjaminsmith »

Hello,

I'm not aware of any exact interval limit, but if you have checks running every 3 hours it may take some time to gather enough data to generate performance graphs.

A few things to check:

1. Is npcd up and running?

Code: Select all

systemctl status npcd
2. Count the number of spooled files. If either command returns more than 20,000, you may need to delete files so the processes can catch up.

Code: Select all

ls /usr/local/nagios/var/spool/perfdata/ | wc -l
ls /usr/local/nagios/var/spool/xidpe/ | wc -l
3. Review the performance data log for errors. Please post the output of the following command:

Code: Select all

tail -25 /usr/local/nagios/var/perfdata.log
See: Nagios XI - Performance Graph Problems
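The spool counts in step 2 can be wrapped into a small helper. This is only a sketch: the function names `count_spool`, `is_backlogged`, and `report_spool` are hypothetical, and the 20,000 limit is simply the figure mentioned above.

```shell
#!/usr/bin/env bash
# Sketch: count spooled perfdata files and flag a backlog.
# SPOOL_LIMIT mirrors the 20,000 figure from the post; all function names are hypothetical.
SPOOL_LIMIT=20000

# Print the number of entries in a directory (0 if the directory does not exist).
count_spool() {
    local dir="$1"
    if [ -d "$dir" ]; then
        ls -1 "$dir" | wc -l | tr -d ' '
    else
        echo 0
    fi
}

# Succeed (exit status 0) when the count is above the limit.
is_backlogged() {
    local count="$1" limit="${2:-$SPOOL_LIMIT}"
    [ "$count" -gt "$limit" ]
}

# Print one status line per spool directory given as an argument.
report_spool() {
    local dir count
    for dir in "$@"; do
        count=$(count_spool "$dir")
        if is_backlogged "$count"; then
            echo "$dir: $count files (backlog)"
        else
            echo "$dir: $count files (ok)"
        fi
    done
}
```

On an XI server it could be called as `report_spool /usr/local/nagios/var/spool/perfdata /usr/local/nagios/var/spool/xidpe`, using the two paths from the commands above.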

Thanks.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Be sure to check out our Knowledgebase for helpful articles and solutions!
maglaubig
Posts: 26
Joined: Thu Jan 03, 2019 2:02 pm

Re: Max Service Check Time for Graphing to Work

Post by maglaubig »

The service is up, and I'm not having trouble with graphing in general, just with this one service check; everything else is graphing as expected. I forced several checks on this particular service a few minutes apart, and a graph did generate, but nothing has graphed since.

It's been running for a few days, so I would've expected something to show up by now. I'm open to lowering this to an hour, but I don't think the data is worth collecting any more frequently than that. Ideally I'd run this only once or twice a day; what I really want is for it to help drive the capacity planning reports.

npcd is running, and each of the two file-count commands returned exactly 2. The tail of the perfdata.log file is below; I changed a few switch names to be generic because they are actual hostnames. Some of those switch ports time out on an SNMP check because the switches are old and pretty slow.

Code: Select all

[root@nagiosxi01 libexec]# tail -25 /usr/local/nagios/var/perfdata.log
2019-08-22 15:03:11 [16057] [0] *** process_perfdata.pl terminated on signal ALRM
2019-08-22 18:03:26 [64657] [0] *** TIMEOUT: Timeout after 5 secs. ***
2019-08-22 18:03:26 [64657] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2019-08-22 18:03:26 [64657] [0] *** TIMEOUT: Please check your npcd.cfg
2019-08-22 18:03:26 [64657] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1566514983.perfdata.service-PID-64657 deleted
2019-08-22 18:03:26 [64657] [0] *** Timeout while processing Host: "switch 1" Service: "Stack-Port-2_27_Packets_Error_Multicast_Broadcast_In_Out_-_Queue_Length_Out"
2019-08-22 18:03:26 [64657] [0] *** process_perfdata.pl terminated on signal ALRM
2019-08-25 18:03:20 [35564] [0] *** TIMEOUT: Timeout after 5 secs. ***
2019-08-25 18:03:20 [35564] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2019-08-25 18:03:20 [35564] [0] *** TIMEOUT: Please check your npcd.cfg
2019-08-25 18:03:20 [35564] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1566774168.perfdata.service-PID-35564 deleted
2019-08-25 18:03:20 [35566] [0] *** TIMEOUT: Timeout after 5 secs. ***
2019-08-25 18:03:20 [35564] [0] *** Timeout while processing Host: "switch 2" Service: "Stack-Port-1_40_Bandwidth"
2019-08-25 18:03:20 [35566] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2019-08-25 18:03:20 [35564] [0] *** process_perfdata.pl terminated on signal ALRM
2019-08-25 18:03:20 [35566] [0] *** TIMEOUT: Please check your npcd.cfg
2019-08-25 18:03:20 [35566] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1566774184.perfdata.service-PID-35566 deleted
2019-08-25 18:03:20 [35566] [0] *** Timeout while processing Host: "switch 3" Service: "Stack-Port-2_40_Packets_Error_Multicast_Broadcast_In_Out_-_Queue_Length_Out"
2019-08-25 18:03:20 [35566] [0] *** process_perfdata.pl terminated on signal ALRM
2019-08-29 00:03:03 [30017] [0] *** TIMEOUT: Timeout after 5 secs. ***
2019-08-29 00:03:03 [30017] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2019-08-29 00:03:03 [30017] [0] *** TIMEOUT: Please check your npcd.cfg
2019-08-29 00:03:03 [30017] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1567054968.perfdata.service-PID-30017 deleted
2019-08-29 00:03:03 [30017] [0] *** Timeout while processing Host: "switch 4" Service: "Stack-Port-2_19_Packets_Error_Multicast_Broadcast_In_Out_-_Queue_Length_Out"
2019-08-29 00:03:03 [30017] [0] *** process_perfdata.pl terminated on signal ALRM
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: Max Service Check Time for Graphing to Work

Post by benjaminsmith »

Hello,

Thank you for sending over the log file; there are quite a few timeout errors.
2019-08-25 18:03:20 [35566] [0] *** process_perfdata.pl terminated on signal ALRM
2019-08-29 00:03:03 [30017] [0] *** TIMEOUT: Timeout after 5 secs. ***
2019-08-29 00:03:03 [30017] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
Let's increase the default timeout settings in process_perfdata.cfg:

1. Edit the file

Code: Select all

vi /usr/local/nagios/etc/pnp/process_perfdata.cfg
2. Change the TIMEOUT from 5 to 20, save and exit.

Code: Select all

TIMEOUT = 20
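For anyone who would rather script the edit than use vi, here is a minimal sketch. The helper name `set_cfg_value` is hypothetical, and it assumes the file uses `KEY = VALUE` lines as shown above. It keeps a `.bak` copy and refuses to act if the key is not already present.

```shell
#!/usr/bin/env bash
# set_cfg_value FILE KEY VALUE   (hypothetical helper, not part of Nagios XI)
# Rewrites an existing "KEY = VALUE" line in place and keeps FILE.bak as a backup.
set_cfg_value() {
    local file="$1" key="$2" value="$3"
    # Stop if the key is not present, rather than appending a new line blindly.
    grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$file" || return 1
    cp -p "$file" "$file.bak"
    # Keep everything up to and including "KEY = " and swap only the value.
    sed -E "s|^([[:space:]]*${key}[[:space:]]*=[[:space:]]*).*|\1${value}|" "$file.bak" > "$file"
}
```

For the change above that would be `set_cfg_value /usr/local/nagios/etc/pnp/process_perfdata.cfg TIMEOUT 20`.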
maglaubig
Posts: 26
Joined: Thu Jan 03, 2019 2:02 pm

Re: Max Service Check Time for Graphing to Work

Post by maglaubig »

I've made the changes and haven't seen any new entries in the logs, so it looks like the timeouts have stopped happening. I also let this run over the holiday weekend and thought I might see some improvement in graphing, but no luck.

Is the answer here just to run the check once every hour?
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: Max Service Check Time for Graphing to Work

Post by benjaminsmith »

Hello,

I would recommend increasing the load_threshold in npcd.cfg to 50 to help process performance graphs.

Code: Select all

vi /usr/local/nagios/etc/pnp/npcd.cfg
# change 
load_threshold = 50.00
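To see whether load_threshold is likely to matter on a given box, the current load can be compared against the configured value. This is a sketch under two assumptions: that npcd holds off processing while the system load is above load_threshold (worth confirming against the comments in your own npcd.cfg), and that the server is Linux with /proc/loadavg. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# load_exceeds LOAD THRESHOLD -> succeeds when LOAD > THRESHOLD (floating-point compare).
load_exceeds() {
    awk -v load="$1" -v limit="$2" 'BEGIN { exit !(load + 0 > limit + 0) }'
}

# Print the value of "load_threshold = N" from an npcd.cfg-style file.
read_load_threshold() {
    awk -F= '/^[[:space:]]*load_threshold[[:space:]]*=/ { gsub(/[[:space:]]/, "", $2); print $2; exit }' "$1"
}

# Print the current 1-minute load average (Linux only; reads /proc/loadavg).
current_load() {
    cut -d' ' -f1 /proc/loadavg
}
```

A possible use: `load_exceeds "$(current_load)" "$(read_load_threshold /usr/local/nagios/etc/pnp/npcd.cfg)" && echo "load is above load_threshold"`.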
Is the nagios user account expired on the server?

Code: Select all

chage -l nagios
Lastly, is the plugin returning valid performance data? You can view the performance data in the Advanced Tab under service details.
perf-data.png
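A rough way to sanity-check a single performance data item from the command line is sketched below. It assumes the usual plugin format `label=value[UOM];warn;crit;min;max` and uses a deliberately simplified pattern (plain numbers only), so treat a "valid" result as a hint rather than proof. The function name and the sample labels are hypothetical.

```shell
#!/usr/bin/env bash
# is_valid_perfdata_item ITEM -> succeeds if ITEM looks like label=value[UOM][;warn;crit;min;max].
# Simplified on purpose: the value must be a plain number, so a quoted or text value is rejected.
is_valid_perfdata_item() {
    local re='^[^=]+=-?[0-9]+(\.[0-9]+)?[a-zA-Z%]*(;[^;[:space:]]*){0,4}$'
    [[ "$1" =~ $re ]]
}

# Example with made-up labels:
#   is_valid_perfdata_item 'peak_power=1250W;;;0;'  && echo "looks numeric"
#   is_valid_perfdata_item 'peak_power="1250"'      || echo "value is not a bare number"
```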
maglaubig
Posts: 26
Joined: Thu Jan 03, 2019 2:02 pm

Re: Max Service Check Time for Graphing to Work

Post by maglaubig »

I made the change to npcd.cfg; it was set to 10.0 previously. The nagios user isn't expired and is set to never expire (I used the OVA for XI to deploy and only added CPU/RAM resources).

The service is returning performance data, but I'm not checking it for any warning or critical conditions. I did have an issue with some SNMP checks a while back where an integer was returned as a string and the graphing plugin had trouble graphing it at all. I don't think that's the case this time since if I force several checks in succession graphing does occur.

I did also put the checks to every 2 hours from every 3 earlier this morning. It looks like I'm getting graph data now, but for some odd reason the graph is putting it at every 5 min. I'm not sure if this is a by design sort of thing.
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: Max Service Check Time for Graphing to Work

Post by benjaminsmith »

Hello,
I did also put the checks to every 2 hours from every 3 earlier this morning. It looks like I'm getting graph data now, but for some odd reason the graph is putting it at every 5 min. I'm not sure if this is a by design sort of thing
Glad to hear it's graphing data. The data is stored as a time series in an RRD database, which conserves disk space by consolidating samples into fixed-interval slots. As a result, the values are fitted to those intervals rather than plotted at the exact check times. Can you let it collect data for a day or so and then post a screenshot of the graph so we can take a look?
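The interval and the gap tolerance of a given RRD file can be read from `rrdtool info`, which prints a `step` line and a `minimal_heartbeat` line per data source. In RRDtool, an update that arrives later than the heartbeat leaves that stretch as unknown, so a check interval longer than the heartbeat would be expected to produce empty graphs. The sketch below parses that output; the function names are hypothetical, and the RRD path in the usage note is a placeholder.

```shell
#!/usr/bin/env bash
# Helpers that read `rrdtool info` output on stdin. All names are hypothetical.

# Print the smallest minimal_heartbeat (seconds) across all data sources.
min_heartbeat() {
    awk -F' = ' '/minimal_heartbeat/ { v = $2 + 0; if (min == "" || v < min) min = v }
                 END { if (min != "") print min }'
}

# Print the RRD step (seconds).
rrd_step() {
    awk -F' = ' '/^step = / { print $2 + 0; exit }'
}

# interval_fits INTERVAL_SECONDS HEARTBEAT_SECONDS -> succeeds if the interval is within the heartbeat.
interval_fits() {
    [ "$1" -le "$2" ]
}
```

For example, `rrdtool info /path/to/service.rrd | min_heartbeat` prints the heartbeat, and `interval_fits 10800 "$hb"` tells you whether a 3-hour check fits within it. As an assumption to verify on your own system: PNP4Nagios sets the heartbeat for new RRD files from RRD_HEARTBEAT in process_perfdata.cfg, and an existing file keeps whatever it was created with.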
maglaubig
Posts: 26
Joined: Thu Jan 03, 2019 2:02 pm

Re: Max Service Check Time for Graphing to Work

Post by maglaubig »

The graph isn't exciting in the least, but at least it's populating the data so it can be used in capacity planning reports now.
BoringNagiosGraph.png
So it looks like 2 hours or less works. I didn't try anything between 2 and 3 hours, so in case anyone comes across this thread, it might be possible; it just isn't something I think is worth the time at the moment. At least it's not every 5 minutes like the other checks, which legitimately need to be that frequent, so it should reduce load overall.

I did find a reference in the Nagios best practices KB saying that graphing should work up to a maximum interval of 3 hours. However, the post is a few years old, and I'm not sure how many people would wait that long between checks and notice it not working, so it may never have come up. Some of the internal components may also have changed just enough to matter at that 3-hour mark.
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: Max Service Check Time for Graphing to Work

Post by benjaminsmith »

Hi @maglaubig,

You should be able to set the check_interval as high as 4 hours and still get valid performance data.

See: Nagios XI Check Interval Considerations