A lot of time out errors with NCPA

Post by **snapon_admin** » Thu Dec 02, 2021 11:42 am

We're currently replacing all of our NRPE checks with NCPA checks, and for some reason many of our Solaris servers are getting a lot of timeout errors. There have been 79 timeout state changes in the past 24 hours. Is there anything that can be done from the Nagios server side to remediate this?

Post by **pbroste** » Thu Dec 02, 2021 4:46 pm

Hello @snapon_admin

Thanks for reaching out. Let's start by increasing the timeout in the config, then restart the ncpa_listener and ncpa_passive services so the change takes effect.

https://support.nagios.com/kb/article/n ... s-872.html

Let us know how things look.

Thanks,
Perry

Post by **snapon_admin** » Thu Dec 02, 2021 4:57 pm

The NCPA timeout is currently 90 seconds. Is it a good idea to increase it beyond that? We have a fairly busy server, and I don't know how harmful it would be to have several dozen checks waiting 120+ seconds for results.

Post by **ssax** » Fri Dec 03, 2021 12:37 pm

It would be fine to increase it but you are right that it will have an impact on your system if all of the checks are taking that long.

What I recommend is to:
- Set a timeout on all of the commands (where the plugin supports it) to a low value such as 30 or 60 seconds
- Use one-off services when you need a long timeout (that is, separate commands for the specific services that require long timeouts)
- If checks are really long-running (minutes), convert them to passive services so they don't impact the other checks
- Make sure host_check_timeout and service_check_timeout in your /usr/local/nagios/etc/nagios.cfg are longer than your highest timeout
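
To make the last point concrete, here is a minimal sketch of how one might verify it. It assumes the main config uses plain `key=value` lines with `#` comments; the helper names, the sample values, and the 90-second plugin timeout are illustrative assumptions, not taken from a real system.

```python
def parse_nagios_cfg(text):
    """Parse key=value directives from nagios.cfg-style text, skipping comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def timeouts_too_short(settings, highest_plugin_timeout):
    """Return (directive, value) pairs that do not exceed the highest plugin timeout.

    A missing directive is reported with a value of None.
    """
    problems = []
    for name in ("host_check_timeout", "service_check_timeout"):
        value = settings.get(name)
        if value is None:
            problems.append((name, None))
        elif int(value) <= highest_plugin_timeout:
            problems.append((name, int(value)))
    return problems


# Hypothetical excerpt; the values are illustrative only.
SAMPLE_CFG = """
# timeouts
host_check_timeout=30
service_check_timeout=120
"""

if __name__ == "__main__":
    settings = parse_nagios_cfg(SAMPLE_CFG)
    for name, value in timeouts_too_short(settings, highest_plugin_timeout=90):
        print(f"{name} is {value}; it should be longer than 90")
```

To run it against a live server, replace `SAMPLE_CFG` with the contents of /usr/local/nagios/etc/nagios.cfg and pass in the longest timeout you actually use on your check commands.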

You can also use these to get a better idea of what long running checks you have:

https://exchange.nagios.org/directory/A ... er/details

Or from the CLI:

https://exchange.nagios.org/directory/P ... me/details

Post by **snapon_admin** » Mon Jan 10, 2022 3:52 pm

We're talking about a LOT of checks here, so this level of granularity might be... difficult to achieve. As an example, when this happens, literally EVERY check on a specific server times out, and in the case of at least one of these servers, that means 53 checks on that one server all timing out. We never got these timeouts with NRPE, so I'm just not sure why NCPA is having this issue. Would it be possible to set up a remote session so someone could take a closer look at what I'm seeing and figure out a solution that isn't just a band-aid?

Post by **snapon_admin** » Tue Jan 11, 2022 2:01 pm

I think I'm going to open a ticket for this issue. It has become fairly critical and I need a better response on it, so I'm hoping a ticket will help with that.
