nagios 4 - high CPU load

drookie · Post by **drookie** » Wed Apr 15, 2015 7:31 am

Hi.

I've upgraded my 2.x installation (yeah, yeah, I know) to nagios 4.0.8, and CPU usage went sky-high. (I tried 3.5.x, but it just dumps core on my Solaris 10, and I was able to do some patching to make nagios 4.0.8 work.)
I'm using nagios to monitor about 400 hosts, and various services on 'em.

Any ideas on how to lower the CPU usage?

I've read the chapter about tuning nagios for large installations and tried a couple of the tricks, and no, they didn't help.
I'm also not using Perl scripts; I'm mostly using shell ones, so I guess the embedded Perl interpreter wouldn't give me any significant performance gain.

jdalrymple · Post by **jdalrymple** » Wed Apr 15, 2015 12:30 pm

Are you running on Solaris Sparc or Solaris x86? It shouldn't really make any difference if it compiled, although there are a lot of moving parts and some of the libraries may not be Sparc optimized. Is there a specific reason you've chosen to run it on Solaris? While supported - I promise that continuing to maintain the environment on a non EL system is not going to be in any way pleasant going forward.

That said - when you look at `top` is it obvious what process is causing the load? Is it the parent process or are there plugins running that are hogging all the CPU time? Are you seeing any weird timeouts or anything in your Core interface that might indicate host or service checks that just aren't functioning properly?

Also, are you using NDO?

It is strange to go from a 2.x system to a 4.x system and see performance degradation.

drookie · Post by **drookie** » Thu Apr 16, 2015 1:51 am

I'm running nagios on an x86 Solaris. There's no specific reason for Solaris, except that the server runs it. I run a couple of 3.x nagios installations on Solaris 11 x86, and there's no such problem there, however I didn't try to run nagios 4.x there.

I'm not using NDO.

There's no top utility in Solaris 10, but there's prstat instead. It shows that the load is caused by 4 worker processes. I see lots of complaints about the core process killing something because of a timeout; I don't know whether that's normal (for example, for hosts/services that don't answer) or a sign of the problem, since this is my first nagios 4.x installation. What is really weird is the large number of zombie processes, which Solaris shows as <defunct>. Their PPID indicates that there's a live parent, but since most of its children are zombies, it looks like it's not really handling SIGCHLD.
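
For readers unfamiliar with the zombie/SIGCHLD relationship mentioned here, the following is a minimal, generic POSIX sketch in Python. It is not Nagios code: `spawn_and_reap` and `reap_all_exited` are hypothetical names chosen for illustration. It only shows that an exited child stays in the process table (as `<defunct>`) until its parent calls one of the wait functions.

```python
import os


def spawn_and_reap():
    """Fork a child that exits at once, then reap it; return its exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately, skipping Python cleanup handlers.
        os._exit(7)
    # Parent: until waitpid() is called, the exited child remains in the
    # process table as a zombie. waitpid() collects the exit status and
    # lets the kernel remove the entry.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


def reap_all_exited():
    """Non-blocking reap loop of the kind a SIGCHLD handler typically runs."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


if __name__ == "__main__":
    print("child exit code:", spawn_and_reap())
    print("further exited children reaped:", reap_all_exited())
```

A parent that never reaches a call like `os.waitpid` (because it is busy or blocked elsewhere) accumulates zombies in the way described in the post.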

Since I ported two string functions that are missing on Solaris 10, asnprintf() and vasnprintf(), it's possible that the load is caused by them; I guess the community versions may not perform as well as stock ones, but I think it's unlikely. And this wouldn't have any connection with the many zombies.

As for Solaris, and specifically Solaris 10, I run a whole stack of applications on top of it, including various databases, php, perl and other stuff. I haven't seen any problems caused by unportability or Solaris specificness.

I tried to trace nagios with dtrace. Since nagios doesn't have dtrace support, I can trace only the syscalls it's using, not its own function calls. As far as syscalls go, nagios seems to spend most of its time polling: about 98% of syscall time (!) (the second column is time spent on CPU in nanoseconds):

[root@twilight /home/emz/dtrace]# ./syscalls-pro.d
dtrace: script './syscalls-pro.d' matched 469 probes
CPU ID FUNCTION:NAME
1 46884 :tick-1sec

ioctl 3318
lwp_self 69067
fchmod 74122
lstat 117278
lwp_sigmask 137737
fsat 160629
kill 243555
times 254004
setpgrp 254322
getdents 279730
getpid 328219
fcntl 802048
fstat 831788
rename 904890
fdsync 1021240
gtime 1178700
pipe 2181621
doorfs 3111066
read 5227757
putmsg 5419175
waitsys 5651473
open 7948332
write 21954565
close 29127623
fork1 159572813
exece 188844461
pollsys 23148773439
- all syscalls - 23584472972
- total - 91055585623

If we compare this profile to the nagios 3.4.1 profile (different installations, different number of hosts, but still):

[root@elena /home/emz/dtrace]# ./syscalls-pro.d
dtrace: script './syscalls-pro.d' matched 431 probes
CPU ID FUNCTION:NAME
14 96578 :tick-1sec

yield 2415
fdsync 4237
lwp_self 23412
getuid 24529
getgid 30776
getpid 73886
umask 91268
setpgrp 99918
setuid 111841
fchmodat 169588
ioctl 180064
lwp_sigmask 187547
setgid 219593
alarm 224394
lseek 240713
lwp_continue 315945
fcntl 330520
sigaction 392574
mmap 805212
pipe 953929
faccessat 1227547
lwp_suspend 1279602
munmap 1499004
read 1660219
getdents 1724321
schedctl 2086968
waitsys 2872757
nanosleep 3432122
close 4222696
pollsys 4391441
fstatat 6205520
unlinkat 10333991
renameat 10917926
openat 23069531
write 25901235
exece 39361330
forksys 40081089
- all syscalls - 184749660
- total - 1412231810

We can see that the two profiles differ drastically. Nagios 4.x seems to spend most of its syscall time (about 98%) polling, while nagios 3.x hardly polls at all.
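
The shares quoted in this comparison can be recomputed from the two dtrace listings. The sketch below copies only the relevant figures from the profiles; the dictionary and function names are illustrative, not part of any tool.

```python
# Nanoseconds on CPU per syscall, copied from the two dtrace profiles.
# "all" is the "- all syscalls -" line of each listing.
NAGIOS_4 = {
    "pollsys": 23148773439,
    "exece": 188844461,
    "fork1": 159572813,
    "all": 23584472972,
}
NAGIOS_3 = {
    "pollsys": 4391441,
    "exece": 39361330,
    "forksys": 40081089,
    "all": 184749660,
}


def share(profile, name):
    """Return the percentage of total syscall time spent in one syscall."""
    return 100.0 * profile[name] / profile["all"]


if __name__ == "__main__":
    for label, profile in (("nagios 4.0.8", NAGIOS_4), ("nagios 3.4.1", NAGIOS_3)):
        for name in profile:
            if name != "all":
                print(f"{label:13s} {name:8s} {share(profile, name):6.2f}%")
```

On these numbers pollsys accounts for roughly 98% of syscall time under 4.0.8 and between 2% and 3% under 3.4.1, where fork and exec dominate instead.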

jdalrymple · Post by **jdalrymple** » Thu Apr 16, 2015 3:31 pm

drookie wrote: What is really weird is the large number of zombie processes, which Solaris shows as <defunct>. Their PPID indicates that there's a live parent, but since most of its children are zombies, it looks like it's not really handling SIGCHLD.

That's actually not that weird - http://nagios.sourceforge.net/docs/nagi ... ml#workers though they shouldn't hang out for long. They wait until the core process acknowledges the results. I wonder if you have some VERY slow results coming in and that's why so much time is spent on pollsys.

What kind of checks are you running? How many hosts/services? How many CPU cores do you have on this machine? We may want to statically set the number of workers and see if that brings the load down without adversely affecting monitoring ability.

drook · Post by **drook** » Mon May 18, 2015 1:37 pm

Still me. I just got banned somehow. BTW, I just don't understand how one can be banned as a spammer on a pre-moderated (for one second!) message board. I guess the same logic got us nagios 4 with 21K syscalls/sec and unportable code.

Most of my checks are ICMP checks, because I manage the corporate VPN, so basically they check tunnel and network availability. A minor part are TCP connection checks. Some are SNMP checks, but they're really the minority. This installation watches over ~200 hosts and about 1.5K services. This server has only 2 cores, which are not SMP-capable (so only two in total); this is an Intel E3110.

So, are there any tricks to continue to run nagios on this hardware?

I really doubt this syscall rate can be called "normal":


CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  896   0    2   395  130  298   89   12    2    0 21041   89  11   0   0
  1  567   0    4   115   53  132   54   10    8    0 20081   90  10   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0 1362   0    4   678  128  380  108   16  289    1 21390   88  12   0   0
  1  575   0  282   124   46  201   73   11  332    1 20256   89  11   0   0
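
As a rough aid for reading the output above, this sketch sums the per-CPU `syscl` column for each sample to get a machine-wide syscall rate. It assumes the column layout shown in this post (a header line starting with `CPU`, followed by one row per CPU); `syscalls_per_sample` is an illustrative name.

```python
MPSTAT_OUTPUT = """\
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  896   0    2   395  130  298   89   12    2    0 21041   89  11   0   0
  1  567   0    4   115   53  132   54   10    8    0 20081   90  10   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0 1362   0    4   678  128  380  108   16  289    1 21390   88  12   0   0
  1  575   0  282   124   46  201   73   11  332    1 20256   89  11   0   0
"""


def syscalls_per_sample(text):
    """Sum the syscl column over all CPUs, once per header-delimited sample."""
    totals = []
    column = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CPU":
            # A header line starts a new sample; locate syscl by name.
            column = fields.index("syscl")
            totals.append(0)
        elif column is not None:
            totals[-1] += int(fields[column])
    return totals


if __name__ == "__main__":
    print(syscalls_per_sample(MPSTAT_OUTPUT))
```

Both CPUs are also at 0% idle in each sample, with most of the time accounted to `usr` rather than `sys`.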

I've tried setting the number of workers to 2, then to 1. Got exactly the same picture.

jdalrymple · Post by **jdalrymple** » Mon May 18, 2015 2:05 pm

We don't see your user explicitly banned. Are you using the same network for this username as you were for the other?

As for the problem, can you show us what your System/Performance Info looks like? See my attachment for an example.

drook · Post by **drook** » Mon May 18, 2015 2:23 pm

Mine looks like:

As for my ban: when I try to post this under the "drookie" account, I see this:

Right now I'm using my home ISP, which gave me a dynamic IP. I've also tried this at work, with different IPs of my corporate proxies, and got the same error, so this is not linked to my IP; it's linked to my username.
Maybe it has something to do with my e-mail domain being suspended right now (I'm changing registrars), but this is just an assumption.

tmcdonald · Post by **tmcdonald** » Mon May 18, 2015 2:44 pm

drook wrote: Maybe it has something to do with my e-mail domain being suspended right now (I'm changing registrars), but this is just an assumption.

We do check for a valid MX record, so this is likely the case. Comparing your two accounts, drookie's email failed the MX check but drook's did not.

If you would like I can see if I can swap the emails around, though not sure if that is the best thing going forward.

jdalrymple · Post by **jdalrymple** » Mon May 18, 2015 2:50 pm

Something is wonky with your checks - they're taking far too long. Do they take this long when run from the command line?

15.048 * 169 = 2543.112 seconds worth of service checks being processed every 60 seconds

10.054 * 75 = 754.05 seconds worth of host checks being processed every 60 seconds

55 minutes worth of checks being run every minute...
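
The arithmetic above can be reproduced as follows. The sketch reads 15.048 and 10.054 as average check execution times in seconds and 169 and 75 as the number of service and host checks run per minute; that reading is an assumption, since the performance screenshot itself is not reproduced in this thread.

```python
def check_seconds_per_cycle(avg_execution_time, checks):
    """Total seconds of check execution accumulated in one cycle."""
    return avg_execution_time * checks


# Figures quoted in the post (interpretation assumed, see lead-in).
service = check_seconds_per_cycle(15.048, 169)
host = check_seconds_per_cycle(10.054, 75)
total_minutes = (service + host) / 60

if __name__ == "__main__":
    print(f"service checks: {service:.3f} s per 60 s")
    print(f"host checks:    {host:.3f} s per 60 s")
    print(f"combined:       {total_minutes:.1f} minutes of checks per minute")
```

Anything much above one minute of check time per minute means checks necessarily overlap, so a combined figure near 55 implies dozens of checks in flight at any moment on a two-core machine.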

Next step is to verify that your checks are finishing in a timely fashion when run at the shell.

drook · Post by **drook** » Tue May 19, 2015 12:23 am

I'd say most of them are not: below is a bunch of checks for a typical host that is up (95% of them are), run under the account that nagios runs under, with timings (the last one is a custom command that checks how large packets are transmitted):

[nagios@twilight ~]$ time libexec/check_ping -H 10.30.30.18 -w 100.0,20% -c 500.0,60% -p 5
PING OK - Packet loss = 0%, RTA = 1.48 ms|rta=1.480000ms;100.000000;500.000000;0.000000 pl=0%;20;60;0

real 0m4.001s
user 0m0.002s
sys 0m0.005s

[nagios@twilight ~]$ time libexec/check_icmp -H 172.16.1.151 -w 100.0,20% -c 500.0,60%
OK - 172.16.1.151: rta 1.580ms, lost 0%|rta=1.580ms;100.000;500.000;0; pl=0%;20;60;; rtmax=1.959ms;;;; rtmin=1.331ms;;;;

real 0m0.012s
user 0m0.001s
sys 0m0.002s

[nagios@twilight ~]$ time libexec/check_icmp -H 192.168.104.49 -w 100.0,20% -c 500.0,60% -p 5
OK - 192.168.104.49: rta 1.522ms, lost 0%|rta=1.522ms;100.000;500.000;0; pl=0%;20;60;; rtmax=1.722ms;;;; rtmin=1.439ms;;;;

real 0m0.011s
user 0m0.001s
sys 0m0.002s

[nagios@twilight ~]$ time libexec/check_icmp_4k 192.168.104.49
ICMP 4K OK: 5

real 0m4.021s
user 0m0.004s
sys 0m0.013s
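
To repeat this kind of timing over many checks without typing `time` by hand, a small wrapper along the following lines could be used. It is only a sketch: `time_command` is a hypothetical helper, and the plugin path and arguments in the comment are the ones shown in this post, assumed to be run from the nagios user's home directory.

```python
import subprocess
import sys
import time


def time_command(argv, timeout=30.0):
    """Run a command and return (wall-clock seconds, exit code, stdout)."""
    start = time.perf_counter()
    completed = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout
    )
    elapsed = time.perf_counter() - start
    return elapsed, completed.returncode, completed.stdout.strip()


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere. On the monitoring host
    # this would be a plugin invocation such as:
    #   ["libexec/check_icmp", "-H", "172.16.1.151",
    #    "-w", "100.0,20%", "-c", "500.0,60%"]
    elapsed, code, output = time_command([sys.executable, "-c", "print('ok')"])
    print(f"{elapsed:.3f}s exit={code} output={output}")
```

Comparing these shell-side times with the execution times Nagios reports for the same checks shows whether the delay comes from the plugins themselves or from how the workers run them.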

Just to make sure that I checked all of the services, here is its config:

define host {
    use                  cisco8xx
    host_name            9maya16
    alias                9maya16
    parents              kosm65-gw2
    address              10.30.30.18
}
define service {
    use                  cisco8xx-local-service
    host_name            9maya16
    service_description  icmp
    check_command        check_ping!100.0,20%!500.0,60%
}
define service {
    use                  cisco8xx-local-service
    host_name            9maya16
    service_description  tunnel
    check_command        check_icmp!172.16.1.151!100.0,20%!500.0,60%
}
define service {
    use                  cisco8xx-local-service
    host_name            9maya16
    service_description  LAN
    check_command        check_icmp!192.168.104.49!100.0,20%!500.0,60%
}
define service {
    use                  cisco8xx-local-service
    host_name            9maya16
    service_description  icmp 4k
    check_command        check_icmp_4k!192.168.104.49
}
