Nagios Mod_gearman worker overloaded
Posted: Tue Oct 01, 2019 2:43 pm
by rajsshah
Hi Team
We currently have around 828 host checks and 8,215 service checks configured in Nagios. We have one mod_gearman worker node, which is overloaded (100% CPU utilization) most of the time. Because of this, checks are pending (shown as jobs waiting) in the gearman_top output.
In the worker config I have set 150 max workers and 50 min workers. I know I can bring utilization down by adding more workers, but that only delays the clogging: sooner or later, as I add more host/service checks, the queue will start backing up again.
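For reference, the throttling knobs mentioned above live in the worker's config file (typically worker.cfg). A sketch with illustrative values only; the hostname and paths are hypothetical and will differ per install:

```ini
# /etc/mod_gearman/worker.cfg (path may differ on your system)
server=gearmand-host:4730     ; address of your gearmand (hypothetical hostname)
min-worker=50                 ; workers kept alive when idle
max-worker=150                ; hard ceiling under load
idle-timeout=30               ; seconds before surplus idle workers exit
max-jobs=1000                 ; recycle a worker process after this many jobs
```

Note that raising max-worker only helps if the worker host has CPU headroom; at sustained 100% CPU it just adds contention.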
Also note that almost 90% of my checks are active checks. Would you suggest using passive checks instead?
And is the mod_gearman worker used for passive checks as well? If so, are there any instructions I can follow?
Re: Nagios Mod_gearman worker overloaded
Posted: Tue Oct 01, 2019 3:37 pm
by mbellerue
I don't know that mod_gearman can be used for passive checks. The goal behind that product was to offload the active check workload to other servers.
If you don't want additional gearman workers, then your best bet is going to be converting some of your active checks to passive checks. The best general advice I can give is to watch your gearman worker and find out which checks consume the most resources for the longest amount of time. See if you can convert those to passive checks first.
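For context, a passive result is submitted by writing an external command into the Nagios command file. A minimal sketch, assuming a hypothetical host "web01" and service "Disk Usage"; the command file path is the Nagios Core default and may differ on your install:

```shell
# Default Nagios Core command file (adjust for your install)
CMDFILE=/usr/local/nagios/var/rw/nagios.cmd

# Build a PROCESS_SERVICE_CHECK_RESULT external command:
# [timestamp] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<return_code>;<output>
NOW=$(date +%s)
CMD="[$NOW] PROCESS_SERVICE_CHECK_RESULT;web01;Disk Usage;0;DISK OK - 42% used"

# In production you would append it to the command file FIFO:
#   echo "$CMD" > "$CMDFILE"
# Here we just print the line to show the format.
echo "$CMD"
```

Return codes follow the usual plugin convention (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).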
And definitely read through this document.
https://assets.nagios.com/downloads/nag ... ios-XI.pdf
And pay special attention to Freshness Checks. One of the nice things about active checks is that if something is broken in an unexpected way, like a server loses power, it stops responding to the active checks. However, if your Nagios server is just waiting for bad news to come rolling in from your server, and the server is unable to send the bad news, well then Nagios is none the wiser. Freshness Checks guard against this.
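As a sketch of the freshness mechanism described above (host and service names are hypothetical, thresholds illustrative; check_dummy ships with the standard Nagios plugins):

```ini
define service {
    host_name               web01                 ; hypothetical host
    service_description     App Heartbeat
    active_checks_enabled   0                     ; passive-only service
    passive_checks_enabled  1
    check_freshness         1
    freshness_threshold     900                   ; stale after 15 min with no result
    ; check_command runs only when the result goes stale, forcing a CRITICAL
    check_command           check_dummy!2!"No passive result received"
    use                     generic-service
}
```

This way a silent server (powered off, network down) still surfaces as a CRITICAL instead of quietly staying green.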
Re: Nagios Mod_gearman worker overloaded
Posted: Tue Oct 29, 2019 6:27 am
by rajsshah
Hi
I see high CPU utilization on the Nagios server, with these worker processes consuming all of the CPU:
[root@weeus01plnagi05 var]# ps -ef | grep nagios.qh
nagios 51560 51559 32 Oct28 ? 10:14:09 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 51561 51559 32 Oct28 ? 10:13:50 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 51562 51559 32 Oct28 ? 10:11:38 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 51563 51559 33 Oct28 ? 10:15:13 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 51564 51559 32 Oct28 ? 10:14:17 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios 51565 51559 32 Oct28 ? 10:09:23 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
I also see lots of [php] <defunct> processes as their children.
After restarting Nagios it remains stable for about a day, but then the same issue returns.
We are monitoring 1,000 servers and 9,500 services.
I also see lots of mod_gearman errors in the logs. Time is synced across all the workers and the gearman server, so time sync is not the issue.
Re: Nagios Mod_gearman worker overloaded
Posted: Tue Oct 29, 2019 2:19 pm
by mbellerue
Is this a result of moving checks from the mod gearman worker to Nagios? Was any performance tuning done on the Nagios server?
https://assets.nagios.com/downloads/nag ... ios-XI.pdf
What about passive checks? Were any of the checks able to be converted from active to passive?
Re: Nagios Mod_gearman worker overloaded
Posted: Wed Oct 30, 2019 3:24 am
by rajsshah
We have not moved any active checks to passive. This morning I noticed the same behavior again after yesterday's restart. We expect the mod_gearman workers to execute the checks, but I think the Nagios/mod_gearman server itself is having some issue.
If I check the output of the sar command, I can see that CPU utilization spiked at 8:30 AM today:
Time         CPU   %user  %nice  %system  %iowait  %steal  %idle
07:40:01 AM all 8.67 0.00 2.78 1.25 0.00 87.30
07:50:01 AM all 7.36 0.00 2.53 1.06 0.00 89.04
08:00:01 AM all 6.17 0.00 2.18 0.91 0.00 90.75
08:10:01 AM all 6.71 0.00 2.24 0.72 0.00 90.33
08:20:01 AM all 5.09 0.00 2.00 0.69 0.00 92.22
08:30:02 AM all 32.03 0.09 63.56 0.03 0.00 4.28
08:40:01 AM all 30.07 0.00 69.93 0.00 0.00 0.00
08:50:01 AM all 30.17 0.00 69.83 0.00 0.00 0.00
Since 8:20 AM I see 113 defunct processes in total (see below). All of their parents are "/usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh".
[root@XXX~]# ps -ef | grep defun
nagios 78619 97753 0 08:20 ? 00:00:00 [php] <defunct>
nagios 78620 97753 0 08:20 ? 00:00:00 [php] <defunct>
nagios 78700 97749 0 08:20 ? 00:00:00 [php] <defunct>
Do you know why we are getting these defunct processes?
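To quantify the zombies while the issue is live, a generic one-liner (not Nagios-specific) can be run periodically; it counts processes whose state is Z (defunct):

```shell
# Count zombie (defunct) processes system-wide; prints a single number.
zombies=$(ps -eo stat= | awk '$1 ~ /^Z/ { n++ } END { print n + 0 }')
echo "$zombies"
```

Pairing this with a timestamp in a cron job makes it easy to correlate the zombie count with the 8:30 AM CPU spike seen in sar.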
Also, in the gearman server logs I see many of the following errors:
closing connection due to previous errno error -> libgearman-server/io.cc:109
lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
Re: Nagios Mod_gearman worker overloaded
Posted: Wed Oct 30, 2019 11:01 am
by mbellerue
Can you PM me the system profile from the Nagios server? Admin -> System Profile -> Download Profile.
I'll also need:
The full ps -aux output from both the Nagios server and the modgearman worker
The modgearman worker.cnf from the worker
The modgearman module.cnf from the Nagios server
The output of gearman_top -b from the Nagios server
Re: Nagios Mod_gearman worker overloaded
Posted: Wed Oct 30, 2019 11:13 am
by rajsshah
Do you need this at the time of the issue, or during normal behaviour?
Re: Nagios Mod_gearman worker overloaded
Posted: Wed Oct 30, 2019 1:41 pm
by mbellerue
At time of issue, please.
Re: Nagios Mod_gearman worker overloaded
Posted: Thu Oct 31, 2019 11:24 am
by rajsshah
I have sent you a PM with all the details. Please check.
Re: Nagios Mod_gearman worker overloaded
Posted: Thu Oct 31, 2019 11:28 am
by rajsshah
Also, I am seeing the errors below in the messages log file:
Oct 31 16:04:14 XXXXX ndo2db: Warning: Retrying message send. This can occur because you have too few messages allowed or too few total bytes allowed in message queues. You are currently using 128000 of 32050 messages and 131072000 of 131072000 bytes in the queue. See README for kernel tuning options.
Oct 31 16:04:34 XXXXX ndo2db: Error: max retries exceeded sending message to queue. Kernel queue parameters may need to be tuned. See README.
Oct 31 16:04:34 XXXXX ndo2db: Warning: queue send error, retrying...
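That ndo2db warning points at the kernel's SysV message queue limits, which the ndo2db README suggests tuning. One way is via /etc/sysctl.conf; the values below are illustrative only and should be sized to your workload (note the byte limit matches the 131072000 figure in the log above):

```ini
# SysV message queue limits (illustrative values; see the ndo2db README)
kernel.msgmnb = 131072000   ; max total bytes in a single queue
kernel.msgmax = 65536       ; max size of one message
kernel.msgmni = 32768       ; max number of queue identifiers
```

Apply with "sysctl -p" and restart ndo2db so it picks up the new limits.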