Okay, I get your point. Maybe it's the 4k custom check that is running long. That's reasonable: it sends five 4k ICMP packets, which may be lost, which may take some time, and in general the check shouldn't finish faster than 5 seconds. So I reduced the number of packets from 5 to 3, which should give me roughly a 40% benefit for this check.
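For reference, the packet-count change might look like this in a command definition, assuming the stock check_icmp plugin (-n sets the packet count, -b the ICMP payload size in bytes; the command name here is illustrative):

```
define command {
    command_name  check_icmp_4k
    # 3 packets of ~4k payload instead of 5 -- roughly 40% less ICMP traffic per check
    command_line  $USER1$/check_icmp -H $HOSTADDRESS$ -n 3 -b 4000
}
```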
But nothing changed at all (mpstat after 15 minutes of running):
The host checks aren't as alarming as the service checks, simply because the service checks are taking even longer on average. Furthermore, the proximity to "15.0" almost leads me to believe we have a zillion checks running up against a timeout of some sort. There is no doubt something under the hood isn't happening right: a well-behaved Nagios server should never have an average check time over a second or so.
Next step to troubleshoot would be to see which checks are running the most. I think you might find, if you look at /var/log/nagios.log, that there are far more of one check than all the rest. Maybe tail the last 100 lines here? If you don't wish to post publicly, PM is an option - I can make it available internally so all the Nagios folks can chime in.
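A rough sketch of that counting step: on the real box you'd point this at /var/log/nagios.log, but the few SERVICE ALERT lines below are fabricated so the pipeline is self-contained (the real log will contain other entry types too).

```shell
# Build a tiny sample in the nagios.log line format, then count which
# service name shows up most often.
log=$(mktemp)
cat > "$log" <<'EOF'
[1600000000] SERVICE ALERT: host1;ping_4k;CRITICAL;HARD;3;check timed out
[1600000001] SERVICE ALERT: host2;ping_4k;CRITICAL;HARD;3;check timed out
[1600000002] SERVICE ALERT: host1;load;OK;HARD;1;load OK
EOF
# Field 2 (split on ';') is the service name; count and sort descending.
awk -F';' '/SERVICE ALERT/ {print $2}' "$log" | sort | uniq -c | sort -rn
rm -f "$log"
```

If one service dominates the tally, that's the first check to look at.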
Not a single one of your service or host check workers is exiting cleanly. It looks to me like you should see nothing but red in your Core interface, but you never mentioned that being a problem.
Are you getting visible check results in the UI? What do service details say the last check time was for some random services?
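If the UI is awkward at this scale, the same last-check information lives in status.dat. A minimal sketch of pulling it out with awk - the status_file path varies by install (check status_file in your nagios.cfg), so the sample block below is fabricated to keep the one-liner self-contained:

```shell
# Fake a one-service status.dat block, then print "service last_check".
status=$(mktemp)
cat > "$status" <<'EOF'
servicestatus {
    host_name=host1
    service_description=ping_4k
    last_check=1600000000
    }
EOF
# Remember the service name, then emit it alongside its last_check epoch.
awk -F'=' '$1 ~ /service_description/ {svc=$2} $1 ~ /last_check/ {print svc, $2}' "$status"
rm -f "$status"
```

A last_check of 0, or a timestamp far in the past, would tell us the checks aren't actually completing.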
In fact, besides the huge load and the fact that it crashes when trying to check for updates (so I just disabled that check), Nagios 4 works fine.
That's why I didn't mention anything other than the load.
I'm somewhat at a loss. I'll have to build a lab and try to recreate your environment so I can work on it here. The problem can be "summarized" but that's about it. Apparently offloading the work to Nagios workers causes your checks to take a MINIMUM of 10 seconds for hosts and 15 seconds for services, during which time they're spinning your CPUs for some random reason.
One thing I'd like to offer as a potential solution would be to get mod_gearman working on your system and use it to perform checks instead of Nagios workers. I can't promise that would fix the problem, but if I were alone in a quiet room and didn't have the Nagios forums to ask questions on, that would be the next thing I'd try.
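The gist of that setup is loading mod_gearman's NEB module so check execution is handed off to gearman workers rather than Core's own worker processes. A sketch of the nagios.cfg side - the module and config paths below are examples and depend on how your distro packages mod_gearman:

```
# In nagios.cfg (paths are illustrative; adjust to your install):
broker_module=/usr/lib/mod_gearman/mod_gearman.o config=/etc/mod_gearman/module.conf
```

The module.conf then points at your gearmand server (typically localhost:4730) and controls which host/service groups get routed through the workers.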