nagios 4 - high CPU load
Hi.
I've upgraded my 2.x installation (yeah, yeah, I know) to nagios 4.0.8, and CPU usage went sky-high. (I tried 3.5.x first, but it just dumps core on my Solaris 10, whereas with some patching I was able to get nagios 4.0.8 to work.)
I'm using nagios to monitor about 400 hosts, and various services on 'em.
Any ideas on how to lower the CPU usage?
I've read the chapter on tuning nagios for large installations and tried a couple of the tricks, and no, it didn't help.
I'm also not using Perl plugins, mostly shell ones, so I guess the embedded Perl interpreter wouldn't give me any significant performance gain.
-
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: nagios 4 - high CPU load
Are you running on Solaris SPARC or Solaris x86? It shouldn't really make any difference if it compiled, although there are a lot of moving parts and some of the libraries may not be SPARC-optimized. Is there a specific reason you've chosen to run it on Solaris? While it is supported, I promise that continuing to maintain the environment on a non-EL system is not going to be in any way pleasant going forward.
That said - when you look at `top` is it obvious what process is causing the load? Is it the parent process or are there plugins running that are hogging all the CPU time? Are you seeing any weird timeouts or anything in your Core interface that might indicate host or service checks that just aren't functioning properly?
Also, are you using NDO?
It is strange to go from a 2.x system to a 4.x system and see performance degradation.
Re: nagios 4 - high CPU load
I'm running nagios on x86 Solaris. There's no specific reason for Solaris, except that it's what the server runs. I run a couple of nagios 3.x installations on Solaris 11 x86, and there's no such problem there; however, I haven't tried running nagios 4.x there.
I'm not using NDO.
There's no top utility in Solaris 10, but there's prstat instead. It shows that the load is caused by 4 worker processes. I see lots of complaints about processes being killed because of timeouts. I don't know whether that's normal (for example, for hosts/services that don't answer) or a sign of the problem, since this is my first nagios 4.x installation. What is really weird is the load of zombie processes, which Solaris shows as <defunct>. Their PPID indicates that there's a live parent, but since most of its children are zombies, it looks like it's not really handling SIGCHLD.
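A rough way to see which parent is piling up those <defunct> children is to count zombies per PPID. This is only a sketch: the function name count_zombies_by_parent is made up for illustration, and it assumes that ps accepts `-o s= -o ppid=` and reports zombies with state Z (Solaris ps and Linux procps both appear to).

```shell
# Count zombie (state Z) processes per parent PID, busiest parent first.
# Input on stdin: one "<state> <ppid>" pair per line.
# Assumed way to produce that input:  ps -e -o s= -o ppid=
count_zombies_by_parent() {
    awk '$1 == "Z" { n[$2]++ } END { for (p in n) print p, n[p] }' |
        sort -k2,2nr
}

# Example (not executed here):
#   ps -e -o s= -o ppid= | count_zombies_by_parent
```

Each output line is "<ppid> <zombie count>"; if the PPIDs at the top belong to the nagios workers, they are the ones not reaping their children.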
Since I ported two string functions that are missing on Solaris 10, asnprintf() and vasnprintf(), it's possible that the load is caused by them; I guess a community implementation may not perform as well as a stock one, but I think it's unlikely. And that has no connection with the many zombies.
As for Solaris, and specifically Solaris 10, I run a whole stack of applications on top of it, including various databases, php, perl and other stuff. I haven't seen any problems caused by portability issues or Solaris-specific behavior.
I tried to trace nagios with dtrace. Since nagios has no dtrace support, I can trace only the syscalls it makes, not its own function calls. In terms of syscalls, nagios seems to spend most of its time polling: about 98% of syscall time (!). The second column is time spent on CPU, in nanoseconds:
[root@twilight /home/emz/dtrace]# ./syscalls-pro.d
dtrace: script './syscalls-pro.d' matched 469 probes
CPU ID FUNCTION:NAME
1 46884 :tick-1sec
ioctl 3318
lwp_self 69067
fchmod 74122
lstat 117278
lwp_sigmask 137737
fsat 160629
kill 243555
times 254004
setpgrp 254322
getdents 279730
getpid 328219
fcntl 802048
fstat 831788
rename 904890
fdsync 1021240
gtime 1178700
pipe 2181621
doorfs 3111066
read 5227757
putmsg 5419175
waitsys 5651473
open 7948332
write 21954565
close 29127623
fork1 159572813
exece 188844461
pollsys 23148773439
- all syscalls - 23584472972
- total - 91055585623
Compare this with the nagios 3.4.1 profile (a different installation with a different number of hosts, but still):
[root@elena /home/emz/dtrace]# ./syscalls-pro.d
dtrace: script './syscalls-pro.d' matched 431 probes
CPU ID FUNCTION:NAME
14 96578 :tick-1sec
yield 2415
fdsync 4237
lwp_self 23412
getuid 24529
getgid 30776
getpid 73886
umask 91268
setpgrp 99918
setuid 111841
fchmodat 169588
ioctl 180064
lwp_sigmask 187547
setgid 219593
alarm 224394
lseek 240713
lwp_continue 315945
fcntl 330520
sigaction 392574
mmap 805212
pipe 953929
faccessat 1227547
lwp_suspend 1279602
munmap 1499004
read 1660219
getdents 1724321
schedctl 2086968
waitsys 2872757
nanosleep 3432122
close 4222696
pollsys 4391441
fstatat 6205520
unlinkat 10333991
renameat 10917926
openat 23069531
write 25901235
exece 39361330
forksys 40081089
- all syscalls - 184749660
- total - 1412231810
The two profiles differ drastically. Nagios 4.x seems to spend most of its syscall time polling (about 98%), while nagios 3.x is hardly bothered by polling at all (about 2%).
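For anyone who wants to check the proportions, here is a small sketch that divides the pollsys time by the "all syscalls" total for each profile. The helper name syscall_share is made up for this example; the numbers are copied from the two dtrace outputs above.

```shell
# Share of one syscall's time in the "all syscalls" total, as a percentage.
# Usage: syscall_share <syscall_ns> <all_syscalls_ns>
syscall_share() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

# Figures from the profiles above, in nanoseconds.
echo "nagios 4.0.8 pollsys share: $(syscall_share 23148773439 23584472972)%"
echo "nagios 3.4.1 pollsys share: $(syscall_share 4391441 184749660)%"
```

With these inputs the pollsys share works out to roughly 98% for the 4.0.8 profile and about 2.4% for 3.4.1.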
Re: nagios 4 - high CPU load
drookie wrote: What is really weird - it's load of zombie processes, which Solaris shows as <defunct>. Their PPID indicates that there's a live parent, but since most of its children are zombies, looks like it's not really handling the SIGCHLD.
That's actually not that weird - http://nagios.sourceforge.net/docs/nagi ... ml#workers though they shouldn't hang out for long. They wait until the core process acknowledges the results. I wonder if you have some VERY slow results coming in and that's why so much time is spent on pollsys.
What kind of checks are you running? How many hosts/services? How many CPU cores do you have on this machine? We may want to statically set the number of workers and see if that brings the load down without adversely affecting monitoring ability.
Re: nagios 4 - high CPU load
Still me. I just got banned somehow. BTW, I just don't understand how one can be banned as a spammer on a pre-moderated (for one second!) message board. I guess the same logic got us nagios 4 with 21K syscalls/sec and unportable code.
Most of my checks are ICMP checks, because I manage a corporate VPN, so basically these check tunnel and network availability. A minor part are TCP connection checks. Some are SNMP checks, but they're really the minority. This installation watches over ~200 hosts and about 1.5K services. The server has only 2 cores, which are not SMP-capable (so only two in total); it's an Intel E3110.
I've tried setting the number of workers to 2, then to 1, and got exactly the same picture.
So, are there any tricks to keep running nagios on this hardware?
I really doubt this syscall rate can be called "normal":
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 896 0 2 395 130 298 89 12 2 0 21041 89 11 0 0
1 567 0 4 115 53 132 54 10 8 0 20081 90 10 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 1362 0 4 678 128 380 108 16 289 1 21390 88 12 0 0
1 575 0 282 124 46 201 73 11 332 1 20256 89 11 0 0
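To put a single number on it, the per-CPU syscl values can be added up per sample. A minimal sketch, assuming mpstat-style output on stdin where each sample starts with a header row beginning with "CPU"; the function name sum_syscl is made up for this example.

```shell
# Sum the syscl column across all CPUs for each mpstat sample on stdin.
# The column is located by its header name, so its position may vary.
sum_syscl() {
    awk '
    $1 == "CPU" {                      # header row starts a new sample
        if (seen) print total
        for (i = 1; i <= NF; i++) if ($i == "syscl") col = i
        total = 0; seen = 1; next
    }
    col { total += $col }              # data row: add this CPU
    END { if (seen) print total }'
}

# The two samples posted above.
sum_syscl <<'EOF'
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 896 0 2 395 130 298 89 12 2 0 21041 89 11 0 0
1 567 0 4 115 53 132 54 10 8 0 20081 90 10 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 1362 0 4 678 128 380 108 16 289 1 21390 88 12 0 0
1 575 0 282 124 46 201 73 11 332 1 20256 89 11 0 0
EOF
```

For these two samples that comes to 41122 and 41646 syscalls across both CPUs, that is, about 21K on each core, with 0% idle on both.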
Re: nagios 4 - high CPU load
We don't see your user explicitly banned. Are you using the same network for this username as you were for the other?
As for the problem, can you show us what your System/Performance Info looks like? See my attachment for an example.
Re: nagios 4 - high CPU load
Mine looks like:
As for my ban: when I try to post this under the "drookie" account, I see this:
Right now I'm using my home ISP, which gives me a dynamic IP. I've also tried from work, through different IPs of my corporate proxies, and got the same error, so this is linked not to my IP but to my username.
Maybe it has something to do with my e-mail domain being suspended right now (I'm changing registrars), but this is just an assumption.
Re: nagios 4 - high CPU load
drook wrote: Maybe it has something to do with my e-mail domain being suspended right now (I'm changing registrars), but this is just an assumption.
We do check for a valid MX record, so this is likely the case. Comparing your two accounts, drookie's email failed the MX check but drook's did not.
If you would like I can see if I can swap the emails around, though not sure if that is the best thing going forward.
Former Nagios employee
Re: nagios 4 - high CPU load
Something is wonky with your checks - they're taking far too long. Do they take this long when run from the command line?
15.048 * 169 = 2543.112 seconds worth of service checks being processed every 60 seconds
10.054 * 75 = 754.05 seconds worth of host checks being processed every 60 seconds
55 minutes worth of checks being run every minute...
Next step is to verify that your checks are finishing in a timely fashion when run at the shell.
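The arithmetic above can be reproduced with a few lines of shell. This is just a sketch: check_seconds is a made-up helper name, and it assumes the two factors are the average check execution time in seconds and the number of checks run per minute, as in the figures quoted above.

```shell
# Seconds of check execution per minute:
#   average execution time (s) * number of checks run per minute.
check_seconds() {
    awk -v t="$1" -v n="$2" 'BEGIN { printf "%.3f\n", t * n }'
}

service=$(check_seconds 15.048 169)   # service checks
host=$(check_seconds 10.054 75)       # host checks
echo "service checks: ${service}s, host checks: ${host}s"

# Combined load, in minutes of check time per wall-clock minute.
awk -v s="$service" -v h="$host" \
    'BEGIN { printf "total: %.1f minutes per minute\n", (s + h) / 60 }'
```

The same helper can be rerun with new figures after changing plugin options, to see whether the total drops.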
Re: nagios 4 - high CPU load
I'd say most of them are not. Below is a bunch of checks for a typical host that is up (95% of them are), run under the account that nagios runs under, with timings (the last one is a custom command that checks how large packets are transmitted):
[nagios@twilight ~]$ time libexec/check_ping -H 10.30.30.18 -w 100.0,20% -c 500.0,60% -p 5
PING OK - Packet loss = 0%, RTA = 1.48 ms|rta=1.480000ms;100.000000;500.000000;0.000000 pl=0%;20;60;0
real 0m4.001s
user 0m0.002s
sys 0m0.005s
[nagios@twilight ~]$ time libexec/check_icmp -H 172.16.1.151 -w 100.0,20% -c 500.0,60%
OK - 172.16.1.151: rta 1.580ms, lost 0%|rta=1.580ms;100.000;500.000;0; pl=0%;20;60;; rtmax=1.959ms;;;; rtmin=1.331ms;;;;
real 0m0.012s
user 0m0.001s
sys 0m0.002s
[nagios@twilight ~]$ time libexec/check_icmp -H 192.168.104.49 -w 100.0,20% -c 500.0,60% -p 5
OK - 192.168.104.49: rta 1.522ms, lost 0%|rta=1.522ms;100.000;500.000;0; pl=0%;20;60;; rtmax=1.722ms;;;; rtmin=1.439ms;;;;
real 0m0.011s
user 0m0.001s
sys 0m0.002s
[nagios@twilight ~]$ time libexec/check_icmp_4k 192.168.104.49
ICMP 4K OK: 5
real 0m4.021s
user 0m0.004s
sys 0m0.013s
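To see how much wall-clock time these four checks take together, the "real" lines can be summed. A minimal sketch; sum_real_seconds is a made-up helper that assumes the `time` output format shown above ("real 0m4.001s").

```shell
# Sum the "real" lines of `time` output given on stdin, in seconds.
sum_real_seconds() {
    awk '$1 == "real" {
        split($2, t, "m")          # "0m4.001s" -> t[1]="0", t[2]="4.001s"
        sub(/s$/, "", t[2])        # drop the trailing "s"
        total += t[1] * 60 + t[2]
    } END { printf "%.3f\n", total }'
}

# The four timings above.
sum_real_seconds <<'EOF'
real 0m4.001s
real 0m0.012s
real 0m0.011s
real 0m4.021s
EOF
```

Nearly all of that time comes from the two 4-second checks (check_ping and check_icmp_4k); the user and sys figures show that almost none of it is CPU time.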
Just to make sure I've covered all of the services, here is its config:
define host {
    use cisco8xx
    host_name 9maya16
    alias 9maya16
    parents kosm65-gw2
    address 10.30.30.18
}
define service {
    use cisco8xx-local-service
    host_name 9maya16
    service_description icmp
    check_command check_ping!100.0,20%!500.0,60%
}
define service {
    use cisco8xx-local-service
    host_name 9maya16
    service_description tunnel
    check_command check_icmp!172.16.1.151!100.0,20%!500.0,60%
}
define service {
    use cisco8xx-local-service
    host_name 9maya16
    service_description LAN
    check_command check_icmp!192.168.104.49!100.0,20%!500.0,60%
}
define service {
    use cisco8xx-local-service
    host_name 9maya16
    service_description icmp 4k
    check_command check_icmp_4k!192.168.104.49
}