Nagios XI 5.5.3 and Mod_Gearman compatibility

salami
Posts: 30
Joined: Tue Jun 26, 2018 4:36 am

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Post by salami »

These are the latest errors in /var/log/gearmand/gearmand.log:

Code: Select all

  ERROR 2018-08-08 05:59:44.000000 [  main ] gearman_server_job_add _queue_replay_add(JOB_EXISTS) -> libgearman-server/server.c:820
  ERROR 2018-08-08 05:59:44.000000 [  main ] gearman_server_job_add _queue_replay_add(JOB_EXISTS) -> libgearman-server/server.c:820
  ERROR 2018-08-08 05:59:44.000000 [  main ] gearman_server_job_add _queue_replay_add(JOB_EXISTS) -> libgearman-server/server.c:820
  ERROR 2018-08-08 05:59:44.000000 [  main ] gearman_server_job_add _queue_replay_add(JOB_EXISTS) -> libgearman-server/server.c:820
  ERROR 2018-08-08 05:59:44.000000 [  main ] gearman_server_job_add _queue_replay_add(JOB_EXISTS) -> libgearman-server/server.c:820
  ERROR 2018-08-08 05:59:44.000000 [  main ] gearman_server_job_add _queue_replay_add(JOB_EXISTS) -> libgearman-server/server.c:820
  ERROR 2018-08-08 05:59:44.000000 [  main ] gearman_server_job_add _queue_replay_add(JOB_EXISTS) -> libgearman-server/server.c:820
  ERROR 2018-08-08 06:00:42.000000 [     1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2018-08-08 06:00:42.000000 [     1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2018-08-08 06:00:42.000000 [     3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2018-08-08 06:00:42.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2018-08-08 06:00:42.000000 [     3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2018-08-08 06:00:42.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2018-08-08 06:03:05.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2018-08-08 06:03:05.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2018-08-08 06:04:54.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2018-08-08 06:04:54.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2018-08-08 06:33:03.000000 [     3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2018-08-08 06:33:03.000000 [     3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2018-08-08 06:33:03.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2018-08-08 06:33:03.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2018-08-08 06:38:26.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2018-08-08 06:38:26.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
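A log like this is easier to read when identical messages are grouped. The following is a minimal sketch, assuming the line layout shown above ("ERROR date time [ thread ] message -> source"); the function name summarize_gearmand_errors is hypothetical and not part of gearmand or Mod-Gearman.

```shell
#!/bin/sh
# Summarise gearmand ERROR lines by message, most frequent first.
# Assumption: lines look like
#   ERROR <date> <time> [ <thread> ] <message> -> <source file>:<line>
summarize_gearmand_errors() {
    # 1st sed: keep only ERROR lines and drop everything up to the "]" that
    #          closes the thread field.
    # 2nd sed: drop the trailing " -> file:line" source reference.
    sed -n 's/^ *ERROR [^]]*] *//p' "$1" |
        sed 's/ -> [^ ]*$//' |
        sort | uniq -c | sort -rn
}
```

It could be called as `summarize_gearmand_errors /var/log/gearmand/gearmand.log`, which prints one line per distinct message with its count in front.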

The output of gearman_top2 is as follows:

Code: Select all

 Queue Name                   | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------------
 check_results                |               0  |           0  |           0
 eventhandler                 |             200  |           0  |           0
 host                         |             200  |           0  |           0
 hostgroup_Day                |             200  |           0  |           0
 hostgroup_ansar              |             200  |           0  |           0
 hostgroup_ayandeh            |             200  |           0  |           0
 hostgroup_borse_kala         |             200  |           0  |           0
 hostgroup_eghtesad_novin     |             200  |           0  |           0
 hostgroup_fereshtegan        |             200  |           0  |           0
 hostgroup_gardeshgari        |             200  |           0  |           0
 hostgroup_gostaresh          |             200  |           0  |           0
 hostgroup_hekmat             |             200  |           0  |           0
 hostgroup_iran_peyment       |             200  |           0  |           0
 hostgroup_iran_zamin         |             200  |           0  |           0
 hostgroup_karafarin          |             200  |           0  |           0
 hostgroup_karsazan_ayandeh   |             200  |           0  |           0
 hostgroup_khavarmiyaneh      |             200  |           0  |           0
 hostgroup_kosar              |             200  |           0  |           0
 hostgroup_kpec               |             200  |           0  |           0
 hostgroup_local              |             200  |           0  |           0
 hostgroup_mahak              |             200  |           0  |           0
 hostgroup_maskan             |             200  |           0  |           0
 hostgroup_melal              |             200  |           0  |           0
 hostgroup_mellat             |             200  |           0  |           0
 hostgroup_naji               |             200  |           0  |           0
 hostgroup_ofogh_kurosh       |             200  |           0  |           0
 hostgroup_other              |             200  |           0  |           0
 hostgroup_parsian            |             200  |           0  |           0
 hostgroup_pasargad           |             200  |           0  |           0
 hostgroup_post_bank          |             200  |           0  |           0
 hostgroup_railcom            |             200  |           0  |           0
 hostgroup_refah              |             200  |           0  |           0
 hostgroup_resalat            |             200  |           0  |           0
 hostgroup_saderat            |             200  |           0  |           0
 hostgroup_saman              |             200  |           0  |           0
 hostgroup_samat              |             200  |           0  |           0
 hostgroup_samen              |             200  |           0  |           0
 hostgroup_sarmayeh           |             200  |           0  |           0
 hostgroup_sepah              |             200  |           0  |           0
 hostgroup_shahr              |             200  |           0  |           0
 hostgroup_shahrdari          |             400  |           0  |           0
 hostgroup_sina               |             200  |           0  |           0
 hostgroup_tejarat            |             200  |           0  |           0
 hostgroup_tep                |             200  |           0  |           0
 hostgroup_tosenoavari        |             200  |           0  |           0
 hostgroup_vahdat             |             200  |           0  |           0
 hostgroup_zamzam             |             200  |           0  |           0
 service                      |             200  |           0  |           0
 worker_localhost.localdomain |               3  |           0  |           0
-------------------------------------------------------------------------------
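In the table above, check_results is the only queue with no worker available. With this many queues, that kind of row is easy to miss, so here is a minimal sketch that filters saved gearman_top2 output for queues with no workers or with waiting jobs. It assumes the pipe-separated layout shown above; flag_problem_queues is a hypothetical helper name, not a Mod-Gearman tool.

```shell
#!/bin/sh
# Read gearman_top2-style output on stdin and report queues that have
# no available worker or that have jobs waiting.
# Assumption: columns are "name | workers | waiting | running".
flag_problem_queues() {
    awk -F'|' '
        # Only data rows: four fields and a numeric second column.
        NF == 4 && $2 ~ /^ *[0-9]+ *$/ {
            name = $1
            gsub(/^ +| +$/, "", name)      # trim padding around the queue name
            workers = $2 + 0
            waiting = $3 + 0
            if (workers == 0) print name ": no workers available"
            if (waiting > 0)  print name ": " waiting " jobs waiting"
        }'
}
```

For example, after saving the table to a file (top2.txt is a placeholder name), `flag_problem_queues < top2.txt` would print `check_results: no workers available` for the output above.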
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Post by ssax »

Please edit your /usr/local/nagios/etc/nagios.cfg and set these values:
- Note: The first value is minus 1

Code: Select all

debug_level=-1
debug_verbosity=2
Then restart the nagios service and after it fails, please attach your /usr/local/nagios/var/nagios.debug file so that we can review it.
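If you prefer to script the change rather than edit the file by hand, something like the following could work. It is only a sketch: set_cfg_value is a hypothetical helper, it assumes plain "key=value" lines with no spaces around the equals sign, and it relies on GNU sed's -i option (as shipped on CentOS). Back up nagios.cfg before trying it.

```shell
#!/bin/sh
# set_cfg_value FILE KEY VALUE
# Replace an existing "KEY=..." line in FILE, or append "KEY=VALUE" if absent.
# Assumption: GNU sed (for -i) and simple key=value lines.
set_cfg_value() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file"; then
        # Key already present: rewrite the whole line in place.
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        # Key missing: add it at the end of the file.
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}
```

Usage would then be `set_cfg_value /usr/local/nagios/etc/nagios.cfg debug_level -1` followed by `set_cfg_value /usr/local/nagios/etc/nagios.cfg debug_verbosity 2`.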

Please attach these files as well:

Code: Select all

/usr/local/nagios/etc/nagios.cfg
/etc/mod_gearman2/module.conf
Thank you
WillemDH
Posts: 2320
Joined: Wed Mar 20, 2013 5:49 am
Location: Ghent
Contact:

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Post by WillemDH »

@salami You never mentioned which mod_gearman version you are using. Could you post the gearman package versions you have installed?

For the record:

On my Nagios XI CentOS 6 server I'm using:

Code: Select all

gearmand.x86_64                    1:0.33-2                         @/gearmand-0.33-2.rhel6.x86_64
gearmand-devel.x86_64              1:0.33-2                         @/gearmand-devel-0.33-2.rhel6.x86_64
gearmand-server.x86_64             1:0.33-2                         @/gearmand-server-0.33-2.rhel6.x86_64
mod_gearman2.x86_64                2.1.1-1.el6                      @/mod_gearman2-2.1.1-1.rhel6.x86_64
On my mrtg worker, also CentOS 6:

Code: Select all

gearmand.x86_64                    1:0.33-2                         @/gearmand-0.33-2.rhel6.x86_64
gearmand-devel.x86_64              1:0.33-2                         @/gearmand-devel-0.33-2.rhel6.x86_64
mod_gearman2.x86_64                2.1.1-1.el6                      @/mod_gearman2-2.1.1-1.rhel6.x86_64
Thanks!
Nagios XI 5.8.1
https://outsideit.net
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Post by ssax »

That's a very good point, thanks WillemDH! Please include the versions as well as the information I requested.

Thank you
salami
Posts: 30
Joined: Tue Jun 26, 2018 4:36 am

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Post by salami »

I found the root cause of the issue.
It was caused by a setting in the NEB section of the mod-gearman module.conf file:

Code: Select all

result_workers=1
When I set this item to a value greater than 1, the issue occurs; when I set it back to 1, the issue goes away and the Nagios daemon starts without any problem.
The comments in the configuration file say it can be set higher than 1, but whenever I do so the issue appears:

Code: Select all

# Number of result worker threads. Usually one is
# enough. You may increase the value if your
# result queue is not processed fast enough.
# Default: 1
As I mentioned in the first post, I have more than 12K hosts, and result_workers=1 may not be enough for me.
After the Nagios daemon started, the load average of the main server went very high (more than 500 for the 1-minute average) and the main server hung, so I removed many hosts from the server (I now have only about 1700 hosts). That did not resolve the issue: a large number of PHP processes are spawned, apparently related to event handler processing, and they overload the CPU.
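To confirm which result_workers value is actually in effect before restarting, a small lookup like the one below may help. It is a sketch only: get_conf_value is a hypothetical helper, and it assumes uncommented "key=value" lines, with the last one taken as the active setting.

```shell
#!/bin/sh
# get_conf_value FILE KEY
# Print the value of the last uncommented "KEY=VALUE" line in FILE.
# Lines starting with "#" do not match because KEY must follow only whitespace.
get_conf_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}
```

For example, `get_conf_value /etc/mod_gearman2/module.conf result_workers` should print 1 for the configuration described here.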

I installed the latest version of mod-gearman, using the following packages, on my CentOS 7 server:

Code: Select all

gearmand-debuginfo-0.33-2.x86_64
mod_gearman2-2.1.1-1.el7.centos.x86_64
gearmand-devel-0.33-2.x86_64
gearmand-server-0.33-2.x86_64
gearmand-0.33-2.x86_64

and on 2 workers, each a CentOS 7 server, the following:

Code: Select all

gearmand-0.33-2.x86_64
gearmand-devel-0.33-2.x86_64
gearmand-debuginfo-0.33-2.x86_64
mod_gearman2-2.1.1-1.el7.centos.x86_64
The main server hardware resources are:
CPU 10 cores
RAM 10 GB

The worker servers' hardware resources are:
CPU 4 cores
RAM 4 GB

I offloaded my DB to a remote server, following the Nagios XI documentation, running MariaDB 5.5 with an 8-core CPU and 10 GB RAM.

A RAM disk has also been set up on the main server.


The nagios.cfg, nagios.debug and module.conf files are attached.
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Post by ssax »

It looks like you should leave it at 1, per the developer. Please see here:

https://github.com/sni/mod_gearman/issues/79

They are actually removing that as an option in the code.
salami
Posts: 30
Joined: Tue Jun 26, 2018 4:36 am

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Post by salami »

Thanks for your reply.
I left result_workers=1 and the Nagios daemon starts, but as I mentioned, too many PHP processes appear, which seem to be related to the event handler. After a while (maybe 1 minute) Nagios stops checking hosts and services. In nagios.log I found errors like these:

Code: Select all

[1537281814] wproc: Core Worker 5208: job 731 (pid=28079) timed out. Killing it
[1537281814] wproc: Core Worker 5208: job 0 with pid 28079 reaped at timeout. timeouts=2; started=732
[1537281814] wproc: Core Worker 5209: job 731 (pid=28081) timed out. Killing it
[1537281814] wproc: Core Worker 5209: job 0 with pid 28081 reaped at timeout. timeouts=1; started=732
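To see whether the timeouts are concentrated on a few core workers or spread across all of them, the "timed out" lines can be tallied per worker. This is a minimal sketch assuming the nagios.log line format shown above; count_wproc_timeouts is a hypothetical helper, and it counts only the "timed out. Killing it" lines, not the "reaped at timeout" follow-ups.

```shell
#!/bin/sh
# count_wproc_timeouts FILE
# Print "<core worker id> <number of timed-out jobs>" for each worker,
# sorted by worker id.
# Assumption: lines look like
#   [<epoch>] wproc: Core Worker <id>: job <n> (pid=<pid>) timed out. Killing it
count_wproc_timeouts() {
    awk '/wproc: Core Worker [0-9]+: job [0-9]+ .*timed out/ {
             id = $5
             sub(/:$/, "", id)     # field 5 is "<id>:", strip the colon
             n[id]++
         }
         END { for (id in n) print id, n[id] }' "$1" | sort -n
}
```

It could be run as `count_wproc_timeouts /usr/local/nagios/var/nagios.log`, where the path is an assumption based on the default Nagios XI layout.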

Would you please let me know what the problem is?

Thanks
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Post by ssax »

Please run these commands:

Code: Select all

service nagios stop
service ndo2db stop
pkill -9 nagios
killall -9 nagios
for i in `ipcs -q | grep nagios |awk '{print $2}'`; do ipcrm -q $i; done
service gearmand restart
service ndo2db start
service nagios start
Then once it's started and running, please send the output of the gearman_top2 command.

Are you seeing anything related in your /var/log/gearmand/gearmand.log?
salami
Posts: 30
Joined: Tue Jun 26, 2018 4:36 am

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Post by salami »

All of the commands were run on the server, and the output of gearman_top2 is as follows:

Code: Select all

 Queue Name                   | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------------
 check_results                |               1  |         227  |           1
 eventhandler                 |             250  |           0  |           0
 host                         |             250  |           0  |           0
 hostgroup_Day                |             250  |           0  |           0
 hostgroup_ansar              |             250  |           0  |           0
 hostgroup_ayandeh            |             250  |           0  |           0
 hostgroup_borse_kala         |             250  |           0  |           0
 hostgroup_eghtesad_novin     |             250  |           0  |           0
 hostgroup_fereshtegan        |             250  |           0  |           0
 hostgroup_gardeshgari        |             250  |           0  |           0
 hostgroup_gostaresh          |             250  |           0  |           0
 hostgroup_hekmat             |             250  |           0  |           0
 hostgroup_iran_peyment       |             250  |           0  |           0
 hostgroup_iran_zamin         |             250  |           0  |           0
 hostgroup_karafarin          |             250  |           0  |           0
 hostgroup_karsazan_ayandeh   |             250  |           0  |           0
 hostgroup_khavarmiyaneh      |             250  |           0  |           0
 hostgroup_kosar              |             250  |           0  |           0
 hostgroup_kpec               |             250  |           0  |           0
 hostgroup_linux-server       |              10  |           0  |           0
 hostgroup_mahak              |             250  |           0  |           0
 hostgroup_maskan             |             250  |           0  |           0
 hostgroup_melal              |             250  |           0  |           0
 hostgroup_mellat             |             250  |           0  |           0
 hostgroup_naji               |             250  |           0  |           0
 hostgroup_ofogh_kurosh       |             250  |           0  |           0
 hostgroup_other              |             250  |           0  |           0
 hostgroup_parsian            |             250  |           0  |           0
 hostgroup_pasargad           |             250  |           0  |           0
 hostgroup_post_bank          |             250  |           0  |           0
 hostgroup_railcom            |             250  |           0  |           0
 hostgroup_refah              |             250  |           0  |           0
 hostgroup_resalat            |             250  |           0  |           0
 hostgroup_saderat            |             250  |           0  |           0
 hostgroup_saman              |             250  |           0  |           0
 hostgroup_samat              |             250  |           0  |           0
 hostgroup_samen              |             250  |           0  |           0

Also, there is nothing logged in gearmand.log.
Please note that the debug level of the gearman module is 2.

I ran the following command; the result is:

Code: Select all

ps -C php | grep php | wc -l
1488
The output of the top command is as follows:

Code: Select all

top - 17:02:24 up 7 days,  4:01,  1 user,  load average: 17.90, 18.15, 19.81
Tasks: 1713 total,  20 running, 211 sleeping,   0 stopped, 1482 zombie
%Cpu(s): 16.1 us, 83.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 10072328 total,  8328628 free,   546584 used,  1197116 buff/cache
KiB Swap:  3907580 total,  3588848 free,   318732 used.  8659320 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                             
29268 nagios    20   0   11088   1312    752 R  98.4  0.0  25:35.44 nagios                                                                              
29261 nagios    20   0   11080   1320    752 R  78.9  0.0  25:38.93 nagios                                                                              
29258 nagios    20   0   11080   1304    752 R  67.2  0.0  25:49.32 nagios                                                                              
29257 nagios    20   0   11080   1312    752 R  64.3  0.0  25:43.38 nagios                                                                              
29265 nagios    20   0   11080   1324    752 R  62.3  0.0  25:42.85 nagios                                                                              
29255 nagios    20   0   11080   1324    752 R  61.4  0.0  26:18.84 nagios                                                                              
29256 nagios    20   0   11080   1320    752 R  58.1  0.0  25:31.73 nagios                                                                              
29264 nagios    20   0   11084   1340    752 R  58.1  0.0  25:31.82 nagios                                                                              
29267 nagios    20   0   11084   1336    752 R  57.8  0.0  25:42.77 nagios                                                                              
29262 nagios    20   0   11084   1340    752 R  56.2  0.0  25:49.91 nagios                                                                              
29266 nagios    20   0   11080   1336    752 R  55.8  0.0  25:13.27 nagios                                                                              
29259 nagios    20   0   11080   1324    752 R  53.6  0.0  24:55.28 nagios                                                                              
29254 nagios    20   0  138564  16628   2200 R  53.2  0.2  25:25.61 nagios                                                                              
29263 nagios    20   0   11084   1324    752 R  53.2  0.0  25:04.91 nagios                                                                              
29269 nagios    20   0   11084   1316    752 R  53.2  0.0  26:01.67 nagios                                                                              
29260 nagios    20   0   11084   1308    752 R  51.9  0.0  25:55.71 nagios
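The top header reports 1482 zombie tasks, close to the PHP process count above. To check which command the zombies belong to, a tally like the following could be used. It is a sketch under the assumption that the input is "STAT COMMAND" pairs, as produced by procps ps with `-o stat= -o comm=`; count_zombies is a hypothetical helper name.

```shell
#!/bin/sh
# Read "STAT COMMAND" pairs on stdin and print how many zombie processes
# (state starting with Z) exist per command name, most frequent first.
count_zombies() {
    awk '$1 ~ /^Z/ { n[$2]++ }
         END { for (c in n) print n[c], c }' | sort -rn
}
```

On the server it could be fed with `ps -e -o stat= -o comm= | count_zombies`. If nearly all zombies turn out to be php, that would support the suspicion that they are event handler children that are not being reaped.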
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Post by tgriep »

Can you run this command on the Nagios server so we can see what processes are running on it?

Code: Select all

ps -ef --cols=300
Can you post the worker.conf file from the remote gearman server?