Hi,
I upgraded from version 3.5.1 to 4.0.7 yesterday via the ports system in a FreeBSD 10 jail and, since then, I'm stuck with performance problems in the worker threads (this is a system load comparison from before and after the upgrade).
I thought it might be some incompatible parameters in nagios.cfg, so I took nagios.cfg-sample, renamed it to nagios.cfg and added only the necessary parameters from my 3.5.1 config (mostly perfdata stuff), but with no success.
Also, the logs show no errors.
Does anyone have a hint on how I should begin debugging this?
Performance problems after upgrading from 3.5.1 to 4.0.7
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: Performance problems after upgrading from 3.5.1 to 4.0.7
We have reports of this occurring on several systems but there does not seem to be much of a lead to follow:
http://support.nagios.com/forum/viewtop ... =7&t=27068
Currently, we're looking into it being related to a possible scheduling bug; you may also be able to alleviate this by changing your Core 4 worker configuration:
I recently completed the migration of 28 servers to Nagios 2014R1.1. After seeing the performance improvements, I debated whether to move off of Mod Gearman and finally decided to do so after some minor display problems within the Thruk browser, caused by the modifications Mod Gearman requires on the Nagios 2014R1.1 release. I actually believe the Nagios XI 2014 release runs better without Mod Gearman once it is properly tuned. Since it took several iterations of tuning to stabilize some of my larger server configurations, I thought I would share my experience in an effort to save others some time.
Our main Nagios server configuration used Mod Gearman with 3 remote workers.
Nagios XI Version : 2014R1.1
nagprod01.cellnet.com 2.6.32-279.11.1.el6.x86_64 x86_64
CentOS release 6.3 (Final)
nagios (pid 7793) is running...
NPCD running (pid 23335).
ndo2db (pid 2568) is running...
CPU Load 15: 3.09
Total Hosts: 1907
Total Services: 9263
8 Core CPU with 16GB memory
To remove Mod Gearman, the following modifications were implemented:
Modification to /usr/local/nagios/etc/nagios.cfg
check_workers=6 - the default 4 workers would not keep up
commented out the 2 "embedded_perl" entries
max_host_check_spread=60 - required to provide a little more time to ramp up from a cold start
max_service_check_spread=60 - required to provide a little more time to ramp up from a cold start
use_retained_program_state=1
use_retained_scheduling_info=1
commented out the broker_module= entry for Mod Gearman
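Collected together, the list above corresponds to a nagios.cfg fragment roughly like this (a sketch reconstructed from the notes; the exact embedded-perl option names and the broker_module line are assumptions, so check them against your own config):

```ini
# /usr/local/nagios/etc/nagios.cfg - settings changed when dropping Mod Gearman
check_workers=6                  ; default of 4 would not keep up
max_host_check_spread=60         ; extra ramp-up time from a cold start
max_service_check_spread=60
use_retained_program_state=1
use_retained_scheduling_info=1

; Commented out for the non-Mod-Gearman setup:
;enable_embedded_perl=1          ; assumed names for the two "embedded_perl" entries
;use_embedded_perl_implicitly=1
;broker_module=...               ; the Mod Gearman NEB module line
```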
service restart nagios
service gearmand stop
service mod_gearman_worker stop
chkconfig gearmand off - prevents the process from starting at boot
chkconfig mod_gearman_worker off - prevents the process from starting at boot
__________________________________________
Modifications to /usr/local/nagios/etc/pnp/npcd.cfg
load_threshold = 80.0 (using 10 times the number of CPU cores)
Removing Mod Gearman increases the system load and NPCD will shut down if this threshold is exceeded.
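The 10x-cores rule of thumb can be derived directly on the box (a small sketch; getconf _NPROCESSORS_ONLN reports the online CPU count on both Linux and FreeBSD):

```shell
# Compute an npcd load_threshold of 10 times the CPU core count.
cores=$(getconf _NPROCESSORS_ONLN)
threshold=$((cores * 10))
printf 'load_threshold = %d.0\n' "$threshold"
```

On the 8-core server above this yields load_threshold = 80.0.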
__________________________________________
Modification to /usr/local/nagios/etc/pnp/process_perfdata.cfg
TIMEOUT = 30 - prevents timeouts while collecting perfdata under the increased system load
__________________________________________
Once all modifications are made, a system restart helps ensure a clean start and a stable run.
shutdown -r 0 - restarts the server with a clean, Mod Gearman-free configuration
Current status of server after removal of Mod Gearman
On smaller server configurations with 2 to 4 core CPUs and 2 to 4 GB memory, I needed to throttle down the default Nagios 2014 application to keep from overloading the server. Modify the above changes as follows for the smaller servers:
Modification to /usr/local/nagios/etc/nagios.cfg
check_workers=1 - the default 4 workers would overload the server
max_host_check_spread=120 - required to provide a little more time to ramp up from a cold start
max_service_check_spread=120 - required to provide a little more time to ramp up from a cold start
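For reference, the throttled small-server variant would look like this in nagios.cfg (values come straight from the notes above; a sketch, not a drop-in file):

```ini
; nagios.cfg for smaller boxes (2-4 core CPU, 2-4 GB memory)
check_workers=1                  ; the default 4 workers would overload the server
max_host_check_spread=120        ; extra ramp-up time from a cold start
max_service_check_spread=120
```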
I also experimented with the nagios.cfg setting max_concurrent_checks, but the modifications appeared to have no effect. Comments on this setting are welcome.
The above quotes are from a customer who experimented with Core 4 worker settings, which ultimately helped lower his system load. In my experience, increasing the workers adds only a small additional load on the Nagios system itself but can overload some of the monitored systems. I ran for a while with 12 workers at an average load of 3.0, which is not a problem in itself but could overload some of the systems Nagios is monitoring. I dropped the workers down to 5, which moved the average load down to 1.5 and provides a manageable load on the downstream servers. This is probably unique to my environment, where I have a high number of monitors on a small spread of servers.
Re: Performance problems after upgrading from 3.5.1 to 4.0.7
Thanks for your response.
I've lowered the workers from 4 to 2 and it doesn't seem to change anything so far.
One thing I haven't mentioned is that I have a very small setup: 11 hosts with a total of 39 service checks.
I've decided to convert the transport for internal checks from SSH to NRPE and it helped quite a bit but, still, I'm now averaging a load of 0.30 whereas I was at 0.20 with Core 3.5.1 and SSH transport.
So, here's a new updated graph:
- the left part is core 3.5.1 with SSH transport
- the middle part is core 4.0.7 with SSH transport
- the right part is core 4.0.7 with NRPE transport
Last edited by spiky on Tue Jul 15, 2014 12:09 pm, edited 1 time in total.
Re: Performance problems after upgrading from 3.5.1 to 4.0.7
Interesting that you were getting so much higher a load using check_by_ssh... very odd. When you run top during a spike, can you post the output here? I'd like to see what the top consumers are.
Re: Performance problems after upgrading from 3.5.1 to 4.0.7
Here are the first lines of the top output:
The two Nagios processes you see are these:
What you can see here is the very same situation as when I was running Core 4.0.7 with SSH transport, the only difference being that the 4 worker processes I had back then were taking even more CPU time than what you can see now.
Code:
last pid: 88226; load averages: 0.56, 0.39, 0.34 up 4+16:31:08 13:15:37
178 processes: 2 running, 175 sleeping, 1 zombie
CPU: 0.9% user, 0.0% nice, 11.8% system, 0.0% interrupt, 87.3% idle
Mem: 513M Active, 3363M Inact, 8372M Wired, 832K Cache, 3628M Free
ARC: 6144M Total, 1498M MFU, 3756M MRU, 2926K Anon, 144M Header, 743M Other
Swap: 2048M Total, 2048M Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
5440 nagios 1 81 0 25416K 6724K CPU3 3 81:19 29.69% nagios
1124 root 1 20 0 545M 62364K select 6 13:50 0.20% samba
1091 root 1 20 0 72064K 10060K select 7 19:20 0.10% snmpd
5439 nagios 1 20 0 25416K 6600K select 2 84:25 0.00% nagios
8028 root 4 20 0 1051M 639M vmidle 7 17:13 0.00% bhyve
37534 root 5 20 0 1051M 721M vmidle 5 17:06 0.00% bhyve
5760 root 4 20 0 1051M 554M vmidle 5 11:10 0.00% bhyve
1086 transmission 3 20 0 126M 16816K select 4 3:04 0.00% transmission-daemon
1131 root 1 20 0 542M 58888K select 7 1:34 0.00% samba
Code:
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
nagios 5439 0.9 0.0 25416 6600 - SJ 6:50PM 84:41.67 /usr/local/bin/nagios --worker /var/spool/nagios/rw/nagios.qh
nagios 5090 0.0 0.0 0 0 - ZJ 6:48PM 0:06.64 <defunct>
nagios 5440 0.0 0.0 25416 6724 - SJ 6:50PM 81:33.03 /usr/local/bin/nagios --worker /var/spool/nagios/rw/nagios.qh
Re: Performance problems after upgrading from 3.5.1 to 4.0.7
What kind of hardware is this system running on? I agree that if you were fine on Core 3 before moving to Core 4, the hardware should not really have a bearing on the issue, but it would still be good to know: number of CPUs, number of cores per CPU, and memory size. Also, is this running on local storage, or some sort of networked SAN/NAS?
I'd highly suggest playing around with the worker options in nagios.cfg; you may need to increase the number to flatten the load out, or even decrease it.
Re: Performance problems after upgrading from 3.5.1 to 4.0.7
The server is a "Dell PowerEdge T110 II" with the following:
- 1x CPU E3-1230 V2 @ 3.30GHz
- 4 cores with hyperthreading so a total of 8 threads
- 16 GB of DDR3 ECC memory
- It is running on localized storage.
I will play with workers and report here after trying.
Re: Performance problems after upgrading from 3.5.1 to 4.0.7
If I put "check_workers=8" in nagios.cfg (compared to "check_workers=2"), there's no difference in the load average. Here are all the processes run by the nagios user:
It seems as if 3 workers become "zombie" processes right after restarting Nagios: there are 5 working "worker" processes and 3 zombies, which makes 8, the value I set for the check_workers parameter in nagios.cfg.
Code:
[root@cozy /]# ps afuxww|grep nagios
nagios 38912 0.0 0.0 69540 6740 - SJ Sun06PM 0:05.51 /usr/local/bin/npcd -d -f /usr/local/etc/pnp/npcd.cfg
nagios 86325 0.0 0.0 37704 7084 - SsJ 8:53PM 0:00.09 /usr/local/bin/nagios -d /usr/local/etc/nagios/nagios.cfg
nagios 86326 0.0 0.0 25416 6492 - SJ 8:53PM 0:40.37 /usr/local/bin/nagios --worker /var/spool/nagios/rw/nagios.qh
nagios 86327 0.0 0.0 0 0 - ZJ 8:53PM 0:00.00 <defunct>
nagios 86328 0.0 0.0 25416 6468 - IJ 8:53PM 0:00.05 /usr/local/bin/nagios --worker /var/spool/nagios/rw/nagios.qh
nagios 86329 0.0 0.0 25416 6464 - IJ 8:53PM 0:04.16 /usr/local/bin/nagios --worker /var/spool/nagios/rw/nagios.qh
nagios 86330 0.0 0.0 25416 6476 - IJ 8:53PM 0:04.15 /usr/local/bin/nagios --worker /var/spool/nagios/rw/nagios.qh
nagios 86331 0.0 0.0 25416 6472 - IJ 8:53PM 0:08.25 /usr/local/bin/nagios --worker /var/spool/nagios/rw/nagios.qh
nagios 86332 0.0 0.0 0 0 - ZJ 8:53PM 0:00.00 <defunct>
nagios 86333 0.0 0.0 0 0 - ZJ 8:53PM 0:00.00 <defunct>
nagios 86336 0.0 0.0 37704 6864 - SJ 8:53PM 0:00.00 /usr/local/bin/nagios -d /usr/local/etc/nagios/nagios.cfg
root 88047 0.0 0.0 18724 2124 2 S+J 8:55PM 0:00.00 grep nagios
[root@cozy /]#
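A quick way to tally live versus defunct nagios processes in a listing like the one above (a sketch; a process state beginning with Z marks a zombie on both FreeBSD and Linux, but adjust the field matching to your ps output):

```shell
# Count running vs. zombie processes owned by the nagios user.
ps axo user,stat,comm | awk '
    $1 == "nagios" {
        if ($2 ~ /^Z/) zombies++   # defunct worker
        else alive++
    }
    END { printf "alive=%d zombies=%d\n", alive, zombies }'
```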
Re: Performance problems after upgrading from 3.5.1 to 4.0.7
The core devs are working on this. Some people have had success using mod_gearman for the perl checks, while others have found limiting the number of core workers to be beneficial. I wish I had better suggestions for you, but for now, we will need to wait until the next series of patches.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.