Okay, I get your point. Maybe it's the 4k custom check that is running long. That's reasonable: it sends five 4k ICMP packets, which may be lost, which may take some time, and in general the check shouldn't finish faster than 5 seconds. So I reduced the number of packets from 5 to 3, which should give me roughly a 40% benefit for this check.
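For reference, the packet-count change might look like this in a command definition, assuming the stock check_icmp plugin (-n sets the packet count, -b the ICMP payload size in bytes; the command name here is illustrative):

```
define command {
    command_name  check_icmp_4k
    # 3 packets of ~4k payload instead of 5 -- roughly 40% less ICMP traffic per check
    command_line  $USER1$/check_icmp -H $HOSTADDRESS$ -n 3 -b 4000
}
```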
But nothing changed at all (mpstat after 15 minutes of running):
The host checks aren't as alarming as the service checks, simply because the service checks are taking even longer on average. Furthermore, the proximity to "15.0" almost leads me to believe we have a zillion checks running up against a timeout of some sort. There is no doubt something under the hood isn't happening right: a well-behaved Nagios server should never have an average check time over a second or so.
Next step to troubleshoot would be to see which checks are running the most. I think you might find, if you look at /var/log/nagios.log, that there are far more of one check than all the rest. Maybe tail the last 100 lines here? If you don't wish to post publicly, PM is an option - I can make it available internally so all the Nagios folks can chime in.
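A rough sketch of that counting step: on the real box you'd point this at /var/log/nagios.log, but the few SERVICE ALERT lines below are fabricated so the pipeline is self-contained (the real log will contain other entry types too).

```shell
# Build a tiny sample in the nagios.log line format, then count which
# service name shows up most often.
log=$(mktemp)
cat > "$log" <<'EOF'
[1600000000] SERVICE ALERT: host1;ping_4k;CRITICAL;HARD;3;check timed out
[1600000001] SERVICE ALERT: host2;ping_4k;CRITICAL;HARD;3;check timed out
[1600000002] SERVICE ALERT: host1;load;OK;HARD;1;load OK
EOF
# Field 2 (split on ';') is the service name; count and sort descending.
awk -F';' '/SERVICE ALERT/ {print $2}' "$log" | sort | uniq -c | sort -rn
rm -f "$log"
```

If one service dominates the tally, that's the first check to look at.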
Not a single one of your service or host check workers is exiting cleanly. It looks to me like you should see nothing but red in your Core interface, but you never mentioned that being a problem.
Are you getting visible check results in the UI? What do service details say the last check time was for some random services?
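If the UI is awkward at this scale, the same last-check information lives in status.dat. A minimal sketch of pulling it out with awk - the status_file path varies by install (check status_file in your nagios.cfg), so the sample block below is fabricated to keep the one-liner self-contained:

```shell
# Fake a one-service status.dat block, then print "service last_check".
status=$(mktemp)
cat > "$status" <<'EOF'
servicestatus {
    host_name=host1
    service_description=ping_4k
    last_check=1600000000
    }
EOF
# Remember the service name, then emit it alongside its last_check epoch.
awk -F'=' '$1 ~ /service_description/ {svc=$2} $1 ~ /last_check/ {print svc, $2}' "$status"
rm -f "$status"
```

A last_check of 0, or a timestamp far in the past, would tell us the checks aren't actually completing.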
In fact, besides the huge load and the fact that it crashes when trying to check for updates (so I just disabled that check), Nagios 4 works fine.
That's why I didn't mention anything other than the load.
I'm somewhat at a loss. I'll have to build a lab and try to recreate your environment so I can work on it here. The problem can be "summarized" but that's about it. Apparently offloading the work to Nagios workers causes your checks to take a MINIMUM of 10 seconds for hosts and 15 seconds for services, during which time they're spinning your CPUs for some random reason.
One thing I'd like to offer as a potential solution would be to get mod_gearman working on your system and use it to perform checks instead of Nagios workers. I can't promise that would fix the problem, but if I were alone in a quiet room and didn't have the Nagios forums to ask questions on, that would be the next thing I'd try.
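The gist of that setup is loading mod_gearman's NEB module so check execution is handed off to gearman workers rather than Core's own worker processes. A sketch of the nagios.cfg side - the module and config paths below are examples and depend on how your distro packages mod_gearman:

```
# In nagios.cfg (paths are illustrative; adjust to your install):
broker_module=/usr/lib/mod_gearman/mod_gearman.o config=/etc/mod_gearman/module.conf
```

The module.conf then points at your gearmand server (typically localhost:4730) and controls which host/service groups get routed through the workers.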