Hi,
I upgraded from version 3.5.1 to 4.0.7 yesterday via the ports system in a FreeBSD 10 jail and, since then, I'm stuck with performance problems in the worker threads (this is a system load comparison from before and after the upgrade).
I thought it might be some incompatible parameters in nagios.cfg, so I took nagios.cfg-sample, renamed it to nagios.cfg and added only the necessary parameters from my 3.5.1 config (mostly perfdata stuff), but with no success.
Also, the logs show no errors.
Does anyone have a hint on how I should begin debugging this?
Performance problems after upgrading from 3.5.1 to 4.0.7
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: Performance problems after upgrading from 3.5.1 to 4.0.7
We have reports of this occurring on several systems but there does not seem to be much of a lead to follow:
http://support.nagios.com/forum/viewtop ... =7&t=27068
Currently, we're looking into it being related to a possible scheduling bug; you may also be able to alleviate this by changing your Core 4 worker configuration:
I recently completed the migration of 28 servers to Nagios 2014R1.1. After seeing the performance improvements, I debated whether to move off of Mod Gearman and finally decided to do so after some minor display problems within the Thruk browser, caused by the modifications Mod Gearman requires on the Nagios 2014R1.1 release. I actually believe the Nagios XI 2014 release runs better without Mod Gearman once it is properly tuned. Since it took several iterations of tuning to stabilize some of my larger server configurations, I thought I would share my experience in an effort to save others some time.
Our main Nagios server configuration used Mod Gearman with 3 remote workers.
Nagios XI Version : 2014R1.1
nagprod01.cellnet.com 2.6.32-279.11.1.el6.x86_64 x86_64
CentOS release 6.3 (Final)
nagios (pid 7793) is running...
NPCD running (pid 23335).
ndo2db (pid 2568) is running...
CPU Load 15: 3.09
Total Hosts: 1907
Total Services: 9263
8 Core CPU with 16GB memory
To remove Mod Gearman, the following modifications were implemented:
Modification to /usr/local/nagios/etc/nagios.cfg
check_workers=6 - the default 4 workers would not keep up
commented out the 2 "embedded_perl" entries
max_host_check_spread=60 - required to provide a little more time to ramp up from a cold start
max_service_check_spread=60 - required to provide a little more time to ramp up from a cold start
use_retained_program_state=1
use_retained_scheduling_info=1
commented out the broker_module= entry for Mod Gearman
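Collected together, the list above corresponds to a nagios.cfg fragment roughly like this (a sketch reconstructed from the notes; the exact embedded-perl option names and the broker_module line are assumptions, so check them against your own config):

```ini
# /usr/local/nagios/etc/nagios.cfg - settings changed when dropping Mod Gearman
check_workers=6                  ; default of 4 would not keep up
max_host_check_spread=60         ; extra ramp-up time from a cold start
max_service_check_spread=60
use_retained_program_state=1
use_retained_scheduling_info=1

; Commented out for the non-Mod-Gearman setup:
;enable_embedded_perl=1          ; assumed names for the two "embedded_perl" entries
;use_embedded_perl_implicitly=1
;broker_module=...               ; the Mod Gearman NEB module line
```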
service restart nagios
service gearmand stop
service mod_gearman_worker stop
chkconfig gearmand off - prevents the process from starting at boot
chkconfig mod_gearman_worker off - prevents the process from starting at boot
__________________________________________
Modifications to /usr/local/nagios/etc/pnp/npcd.cfg
load_threshold = 80.0 (using 10 times the number of CPU cores)
Removing Mod Gearman increases the system load and NPCD will shut down if this threshold is exceeded.
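The 10x-cores rule of thumb can be derived directly on the box (a small sketch; getconf _NPROCESSORS_ONLN reports the online CPU count on both Linux and FreeBSD):

```shell
# Compute an npcd load_threshold of 10 times the CPU core count.
cores=$(getconf _NPROCESSORS_ONLN)
threshold=$((cores * 10))
printf 'load_threshold = %d.0\n' "$threshold"
```

On the 8-core server above this yields load_threshold = 80.0.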
__________________________________________
Modification to /usr/local/nagios/etc/pnp/process_perfdata.cfg
TIMEOUT = 30 - prevents timeouts while collecting perfdata under the increased system load
__________________________________________
Once all modifications are made, a system restart helps ensure a clean start and a stable run.
shutdown -r 0 - restarts the server with a clean, Mod Gearman-free configuration
Current status of server after removal of Mod Gearman
On smaller server configurations with 2 to 4 core CPUs and 2 to 4 GB memory, I needed to throttle down the default Nagios 2014 application to keep from overloading the server. Modify the above changes as follows for the smaller servers:
Modification to /usr/local/nagios/etc/nagios.cfg
check_workers=1 - the default 4 workers would overload the server
max_host_check_spread=120 - required to provide a little more time to ramp up from a cold start
max_service_check_spread=120 - required to provide a little more time to ramp up from a cold start
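For reference, the throttled small-server variant would look like this in nagios.cfg (values come straight from the notes above; a sketch, not a drop-in file):

```ini
; nagios.cfg for smaller boxes (2-4 core CPU, 2-4 GB memory)
check_workers=1                  ; the default 4 workers would overload the server
max_host_check_spread=120        ; extra ramp-up time from a cold start
max_service_check_spread=120
```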
I also experimented with the nagios.cfg setting max_concurrent_checks, but the modifications appeared to have no effect. Comments on this setting are welcome.
The above quotes are from a customer who experimented with Core 4 worker settings, which ultimately helped lower his system load. In my experience, increasing the workers adds only a small additional load on the Nagios system itself but can overload some of the monitored systems. I ran for a while with 12 workers at an average load of 3.0, which is not a problem in itself but could overload some of the systems Nagios is monitoring. I dropped the workers down to 5, which moved the average load down to 1.5 and provides a manageable load on the downstream servers. This is probably unique to my environment, where I have a high number of monitors on a small spread of servers.
Re: Performance problems after upgrading from 3.5.1 to 4.0.7
Thanks for your response.
I've lowered the workers from 4 to 2 and it doesn't seem to change anything so far.
One thing I haven't mentioned is that I have a very small setup: 11 hosts with a total of 39 service checks.
I've decided to convert the transport for internal checks from SSH to NRPE and it helped quite a bit but, still, I'm now averaging a load of 0.30 whereas I was at 0.20 with Core 3.5.1 and SSH transport.
So, here's a new updated graph:
- the left part is core 3.5.1 with SSH transport
- the middle part is core 4.0.7 with SSH transport
- the right part is core 4.0.7 with NRPE transport
Last edited by spiky on Tue Jul 15, 2014 12:09 pm, edited 1 time in total.
Re: Performance problems after upgrading from 3.5.1 to 4.0.7
Interesting that you were getting so much higher a load using check_by_ssh... very odd. When you run top during a spike, can you post the output here? I'd like to see what the top consumers are.
Re: Performance problems after upgrading from 3.5.1 to 4.0.7
Here are the first lines of the top output:
The two Nagios processes you see are these:
What you can see here is the very same situation as when I was running Core 4.0.7 with SSH transport, the only difference being that the 4 worker processes I had back then were taking even more CPU time than what you can see now.
Code:
last pid: 88226; load averages: 0.56, 0.39, 0.34 up 4+16:31:08 13:15:37
178 processes: 2 running, 175 sleeping, 1 zombie
CPU: 0.9% user, 0.0% nice, 11.8% system, 0.0% interrupt, 87.3% idle
Mem: 513M Active, 3363M Inact, 8372M Wired, 832K Cache, 3628M Free
ARC: 6144M Total, 1498M MFU, 3756M MRU, 2926K Anon, 144M Header, 743M Other
Swap: 2048M Total, 2048M Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
5440 nagios 1 81 0 25416K 6724K CPU3 3 81:19 29.69% nagios
1124 root 1 20 0 545M 62364K select 6 13:50 0.20% samba
1091 root 1 20 0 72064K 10060K select 7 19:20 0.10% snmpd
5439 nagios 1 20 0 25416K 6600K select 2 84:25 0.00% nagios
8028 root 4 20 0 1051M 639M vmidle 7 17:13 0.00% bhyve
37534 root 5 20 0 1051M 721M vmidle 5 17:06 0.00% bhyve
5760 root 4 20 0 1051M 554M vmidle 5 11:10 0.00% bhyve
1086 transmission 3 20 0 126M 16816K select 4 3:04 0.00% transmission-daemon
1131 root 1 20 0 542M 58888K select 7 1:34 0.00% samba
Code:
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
nagios 5439 0.9 0.0 25416 6600 - SJ 6:50PM 84:41.67 /usr/local/bin/nagios --worker /var/spool/nagios/rw/nagios.qh
nagios 5090 0.0 0.0 0 0 - ZJ 6:48PM 0:06.64 <defunct>
nagios 5440 0.0 0.0 25416 6724 - SJ 6:50PM 81:33.03 /usr/local/bin/nagios --worker /var/spool/nagios/rw/nagios.qh
Re: Performance problems after upgrading from 3.5.1 to 4.0.7
What kind of hardware is this system running on? I agree that if you were fine on Core 3 before moving to Core 4, the hardware should not really have a bearing on the issue, but it would still be good to know: number of CPUs, number of cores per CPU, and memory size. Also, is this running on local storage, or some sort of networked SAN/NAS?
I'd highly suggest playing around with the worker options in nagios.cfg; you may need to increase the number to flatten the load out, or even decrease it.
Re: Performance problems after upgrading from 3.5.1 to 4.0.7
The server is a "Dell PowerEdge T110 II" with the following:
- 1x CPU E3-1230 V2 @ 3.30GHz
- 4 cores with hyperthreading so a total of 8 threads
- 16 GB of DDR3 ECC memory
- It is running on localized storage.
I will play with workers and report here after trying.
Re: Performance problems after upgrading from 3.5.1 to 4.0.7
If I put "check_workers=8" in nagios.cfg (compared to "check_workers=2"), there's no difference in the load average. Here are all the processes run by the nagios user:
It seems as if 3 workers become "zombie" processes right after restarting Nagios: there are 5 working "worker" processes and 3 zombies, which makes 8, the value I set for the check_workers parameter in nagios.cfg.
Code:
[root@cozy /]# ps afuxww|grep nagios
nagios 38912 0.0 0.0 69540 6740 - SJ Sun06PM 0:05.51 /usr/local/bin/npcd -d -f /usr/local/etc/pnp/npcd.cfg
nagios 86325 0.0 0.0 37704 7084 - SsJ 8:53PM 0:00.09 /usr/local/bin/nagios -d /usr/local/etc/nagios/nagios.cfg
nagios 86326 0.0 0.0 25416 6492 - SJ 8:53PM 0:40.37 /usr/local/bin/nagios --worker /var/spool/nagios/rw/nagios.qh
nagios 86327 0.0 0.0 0 0 - ZJ 8:53PM 0:00.00 <defunct>
nagios 86328 0.0 0.0 25416 6468 - IJ 8:53PM 0:00.05 /usr/local/bin/nagios --worker /var/spool/nagios/rw/nagios.qh
nagios 86329 0.0 0.0 25416 6464 - IJ 8:53PM 0:04.16 /usr/local/bin/nagios --worker /var/spool/nagios/rw/nagios.qh
nagios 86330 0.0 0.0 25416 6476 - IJ 8:53PM 0:04.15 /usr/local/bin/nagios --worker /var/spool/nagios/rw/nagios.qh
nagios 86331 0.0 0.0 25416 6472 - IJ 8:53PM 0:08.25 /usr/local/bin/nagios --worker /var/spool/nagios/rw/nagios.qh
nagios 86332 0.0 0.0 0 0 - ZJ 8:53PM 0:00.00 <defunct>
nagios 86333 0.0 0.0 0 0 - ZJ 8:53PM 0:00.00 <defunct>
nagios 86336 0.0 0.0 37704 6864 - SJ 8:53PM 0:00.00 /usr/local/bin/nagios -d /usr/local/etc/nagios/nagios.cfg
root 88047 0.0 0.0 18724 2124 2 S+J 8:55PM 0:00.00 grep nagios
[root@cozy /]#
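A quick way to tally live versus defunct nagios processes in a listing like the one above (a sketch; a process state beginning with Z marks a zombie on both FreeBSD and Linux, but adjust the field matching to your ps output):

```shell
# Count running vs. zombie processes owned by the nagios user.
ps axo user,stat,comm | awk '
    $1 == "nagios" {
        if ($2 ~ /^Z/) zombies++   # defunct worker
        else alive++
    }
    END { printf "alive=%d zombies=%d\n", alive, zombies }'
```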
Re: Performance problems after upgrading from 3.5.1 to 4.0.7
The core devs are working on this. Some people have had success using mod_gearman for the perl checks, while others have found limiting the number of core workers to be beneficial. I wish I had better suggestions for you, but for now, we will need to wait until the next series of patches.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.