Nagios XI 5.5.3 and Mod_Gearman compatibility

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Postby salami » Tue Sep 11, 2018 11:06 pm

These are the latest errors in /var/log/gearmand/gearmand.log:

Code: Select all
  ERROR 2018-08-08 05:59:44.000000 [  main ] gearman_server_job_add _queue_replay_add(JOB_EXISTS) -> libgearman-server/server.c:820
  ERROR 2018-08-08 05:59:44.000000 [  main ] gearman_server_job_add _queue_replay_add(JOB_EXISTS) -> libgearman-server/server.c:820
  ERROR 2018-08-08 05:59:44.000000 [  main ] gearman_server_job_add _queue_replay_add(JOB_EXISTS) -> libgearman-server/server.c:820
  ERROR 2018-08-08 05:59:44.000000 [  main ] gearman_server_job_add _queue_replay_add(JOB_EXISTS) -> libgearman-server/server.c:820
  ERROR 2018-08-08 05:59:44.000000 [  main ] gearman_server_job_add _queue_replay_add(JOB_EXISTS) -> libgearman-server/server.c:820
  ERROR 2018-08-08 05:59:44.000000 [  main ] gearman_server_job_add _queue_replay_add(JOB_EXISTS) -> libgearman-server/server.c:820
  ERROR 2018-08-08 05:59:44.000000 [  main ] gearman_server_job_add _queue_replay_add(JOB_EXISTS) -> libgearman-server/server.c:820
  ERROR 2018-08-08 06:00:42.000000 [     1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2018-08-08 06:00:42.000000 [     1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2018-08-08 06:00:42.000000 [     3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2018-08-08 06:00:42.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2018-08-08 06:00:42.000000 [     3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2018-08-08 06:00:42.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2018-08-08 06:03:05.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2018-08-08 06:03:05.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2018-08-08 06:04:54.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2018-08-08 06:04:54.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2018-08-08 06:33:03.000000 [     3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2018-08-08 06:33:03.000000 [     3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2018-08-08 06:33:03.000000 [     2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2018-08-08 06:33:03.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2018-08-08 06:38:26.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2018-08-08 06:38:26.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
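
With a log like the one above, grouping the ERROR lines by message makes the dominant failure easier to spot. This is only a minimal sketch: the function name summarize_gearmand_errors is made up for illustration (it is not part of gearmand or mod_gearman), and it assumes every line uses the "[ thread ] message" layout shown in this post.

```shell
# Sketch: count gearmand ERROR lines per distinct message.
# Reads log text on stdin and prints "<count> <message>", most frequent first.
# The function name is hypothetical, not a gearmand or mod_gearman tool.
summarize_gearmand_errors() {
    grep 'ERROR' \
        | sed -e 's/^[^]]*] //' \
        | sort | uniq -c | sort -rn \
        | sed -e 's/^ *//'
}
```

It could be run as: summarize_gearmand_errors < /var/log/gearmand/gearmand.log. The excerpt above reduces to three distinct messages: the JOB_EXISTS queue replay error and the two connection-reset messages.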



The output of gearman_top2 is as follows:
Code: Select all
Queue Name                   | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------------
check_results                |               0  |           0  |           0
eventhandler                 |             200  |           0  |           0
host                         |             200  |           0  |           0
hostgroup_Day                |             200  |           0  |           0
hostgroup_ansar              |             200  |           0  |           0
hostgroup_ayandeh            |             200  |           0  |           0
hostgroup_borse_kala         |             200  |           0  |           0
hostgroup_eghtesad_novin     |             200  |           0  |           0
hostgroup_fereshtegan        |             200  |           0  |           0
hostgroup_gardeshgari        |             200  |           0  |           0
hostgroup_gostaresh          |             200  |           0  |           0
hostgroup_hekmat             |             200  |           0  |           0
hostgroup_iran_peyment       |             200  |           0  |           0
hostgroup_iran_zamin         |             200  |           0  |           0
hostgroup_karafarin          |             200  |           0  |           0
hostgroup_karsazan_ayandeh   |             200  |           0  |           0
hostgroup_khavarmiyaneh      |             200  |           0  |           0
hostgroup_kosar              |             200  |           0  |           0
hostgroup_kpec               |             200  |           0  |           0
hostgroup_local              |             200  |           0  |           0
hostgroup_mahak              |             200  |           0  |           0
hostgroup_maskan             |             200  |           0  |           0
hostgroup_melal              |             200  |           0  |           0
hostgroup_mellat             |             200  |           0  |           0
hostgroup_naji               |             200  |           0  |           0
hostgroup_ofogh_kurosh       |             200  |           0  |           0
hostgroup_other              |             200  |           0  |           0
hostgroup_parsian            |             200  |           0  |           0
hostgroup_pasargad           |             200  |           0  |           0
hostgroup_post_bank          |             200  |           0  |           0
hostgroup_railcom            |             200  |           0  |           0
hostgroup_refah              |             200  |           0  |           0
hostgroup_resalat            |             200  |           0  |           0
hostgroup_saderat            |             200  |           0  |           0
hostgroup_saman              |             200  |           0  |           0
hostgroup_samat              |             200  |           0  |           0
hostgroup_samen              |             200  |           0  |           0
hostgroup_sarmayeh           |             200  |           0  |           0
hostgroup_sepah              |             200  |           0  |           0
hostgroup_shahr              |             200  |           0  |           0
hostgroup_shahrdari          |             400  |           0  |           0
hostgroup_sina               |             200  |           0  |           0
hostgroup_tejarat            |             200  |           0  |           0
hostgroup_tep                |             200  |           0  |           0
hostgroup_tosenoavari        |             200  |           0  |           0
hostgroup_vahdat             |             200  |           0  |           0
hostgroup_zamzam             |             200  |           0  |           0
service                      |             200  |           0  |           0
worker_localhost.localdomain |               3  |           0  |           0
-------------------------------------------------------------------------------
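
With this many queues, the rows that matter are easy to miss. The sketch below filters a saved copy of gearman_top2 output down to queues that have no available workers or that have jobs waiting. It assumes the pipe-separated column layout shown above; the function name flag_gearman_queues is hypothetical.

```shell
# Sketch: flag queues from gearman_top2-style output that have no
# available workers, or that have jobs waiting.
# Reads the table on stdin; prints "<queue> workers=<n> waiting=<n>".
# The function name is hypothetical, not a mod_gearman tool.
flag_gearman_queues() {
    awk -F'|' '
        NF == 4 {
            queue = $1; workers = $2; waiting = $3
            gsub(/[ \t]/, "", queue)
            gsub(/[ \t]/, "", workers)
            gsub(/[ \t]/, "", waiting)
            # Skip the header row and anything else that is not numeric
            if (workers !~ /^[0-9]+$/ || waiting !~ /^[0-9]+$/) next
            if (workers + 0 == 0 || waiting + 0 > 0)
                printf "%s workers=%s waiting=%s\n", queue, workers, waiting
        }'
}
```

Fed the table above, it would report only check_results, which is the one queue showing 0 available workers.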

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Postby ssax » Wed Sep 12, 2018 4:41 pm

Please edit your /usr/local/nagios/etc/nagios.cfg and set these values:
- Note: The first value is minus 1

Code: Select all
debug_level=-1
debug_verbosity=2


Then restart the nagios service and after it fails, please attach your /usr/local/nagios/var/nagios.debug file so that we can review it.

Please attach these files as well:

Code: Select all
/usr/local/nagios/etc/nagios.cfg
/etc/mod_gearman2/module.conf


Thank you

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Postby WillemDH » Fri Sep 14, 2018 4:48 am

@salami You never mentioned which mod_gearman version you are using. Could you post the gearman package versions you have installed?

For the record, on my Nagios XI CentOS 6 server I'm using:

Code: Select all
gearmand.x86_64                    1:0.33-2                         @/gearmand-0.33-2.rhel6.x86_64
gearmand-devel.x86_64              1:0.33-2                         @/gearmand-devel-0.33-2.rhel6.x86_64
gearmand-server.x86_64             1:0.33-2                         @/gearmand-server-0.33-2.rhel6.x86_64
mod_gearman2.x86_64                2.1.1-1.el6                      @/mod_gearman2-2.1.1-1.rhel6.x86_64


On my mrtg worker, also CentOS 6:

Code: Select all
gearmand.x86_64                    1:0.33-2                         @/gearmand-0.33-2.rhel6.x86_64
gearmand-devel.x86_64              1:0.33-2                         @/gearmand-devel-0.33-2.rhel6.x86_64
mod_gearman2.x86_64                2.1.1-1.el6                      @/mod_gearman2-2.1.1-1.rhel6.x86_64


Thanks!

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Postby ssax » Fri Sep 14, 2018 1:28 pm

That's a very good point, thanks WillemDH! Please include the versions as well as the information I requested.

Thank you

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Postby salami » Sat Sep 15, 2018 2:57 am

I found the root cause of the issue.
It was caused by a setting in the NEB section of the mod-gearman module.conf file:
Code: Select all
result_workers=1

When I set this option to a value greater than 1, the issue occurs; when I set it back to 1, the issue goes away and the Nagios daemon starts without any problem.
The comments in the configuration file say the value can be raised above 1, but whenever I do so, the issue appears.

Code: Select all
# Number of result worker threads. Usually one is
# enough. You may increase the value if your
# result queue is not processed fast enough.
# Default: 1
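
A quick check of what a given module.conf actually sets can be sketched as follows. The function name check_result_workers is made up for illustration, and the sketch simply reports the last uncommented result_workers line it finds. It does not claim to reproduce how mod_gearman itself parses the file.

```shell
# Sketch: report the last uncommented result_workers value in a
# mod_gearman module.conf and say whether it is greater than 1.
# The function name is hypothetical, not a mod_gearman tool.
check_result_workers() {
    value=$(awk -F'=' '
        /^[ \t]*result_workers[ \t]*=/ { v = $2; gsub(/[ \t\r]/, "", v) }
        END { print v }' "$1")
    case "$value" in
        '')       echo "result_workers is not set" ;;
        *[!0-9]*) echo "result_workers has a non-numeric value: $value" ;;
        *)
            if [ "$value" -gt 1 ]; then
                echo "result_workers=$value (greater than 1)"
            else
                echo "result_workers=$value"
            fi
            ;;
    esac
}
```

For example: check_result_workers /etc/mod_gearman2/module.conf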


As I mentioned in the first post, I have more than 12K hosts, so result_workers=1 may not be enough for me.
After starting the Nagios daemon, the load average of the main server went very high (more than 500 for the 1-minute average) and the main server hung, so I removed many hosts from the server (I now have only about 1700 hosts). That did not resolve the issue: too many PHP processes are spawned, apparently by the event handler process, and they overload the CPU.

I installed the latest version of mod-gearman on my CentOS 7 server, with the following packages:
Code: Select all
gearmand-debuginfo-0.33-2.x86_64
mod_gearman2-2.1.1-1.el7.centos.x86_64
gearmand-devel-0.33-2.x86_64
gearmand-server-0.33-2.x86_64
gearmand-0.33-2.x86_64



and on 2 workers, each a CentOS 7 server, with the following packages:
Code: Select all
gearmand-0.33-2.x86_64
gearmand-devel-0.33-2.x86_64
gearmand-debuginfo-0.33-2.x86_64
mod_gearman2-2.1.1-1.el7.centos.x86_64


The main server hardware resources are as follows:
CPU: 10 cores
RAM: 10 GB

The worker servers' hardware resources are as follows:
CPU: 4 cores
RAM: 4 GB

I offloaded my DB to a remote server, following the Nagios XI documentation. It runs MariaDB 5.5 with an 8-core CPU and 10 GB of RAM.

A RAM disk has also been set up on the main server.


The nagios.cfg, nagios.debug and module.conf files are attached.
Attachments:
nagios.debug.txt (798.78 KiB; please remove the .txt extension from the end of the file name)
module.conf (5.69 KiB)
nagios.cfg (5.65 KiB)

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Postby ssax » Mon Sep 17, 2018 4:37 pm

It looks like you should leave it at 1, per the developer. Please see here:

https://github.com/sni/mod_gearman/issues/79

They are actually removing that as an option in the code.

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Postby salami » Tue Sep 18, 2018 9:57 am

Thanks for your reply.
I left result_workers=1 and the Nagios daemon starts, but as I mentioned, too many PHP processes appear, which seem to be related to the event handler. After a while (maybe 1 minute) Nagios stops checking hosts and services. When I checked nagios.log, I found errors like these:

Code: Select all
[1537281814] wproc: Core Worker 5208: job 731 (pid=28079) timed out. Killing it
[1537281814] wproc: Core Worker 5208: job 0 with pid 28079 reaped at timeout. timeouts=2; started=732
[1537281814] wproc: Core Worker 5209: job 731 (pid=28081) timed out. Killing it
[1537281814] wproc: Core Worker 5209: job 0 with pid 28081 reaped at timeout. timeouts=1; started=732
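
To see whether the timeouts are spread across all core workers or concentrated in a few, the "timed out" lines can be tallied per worker. This is a rough sketch that assumes the wproc message format shown above; the function name count_wproc_timeouts is hypothetical.

```shell
# Sketch: count "timed out" wproc messages per Nagios core worker.
# Reads nagios.log text on stdin; prints "<worker id> <timeouts>".
# The function name is hypothetical, not a Nagios tool.
count_wproc_timeouts() {
    awk '
        /wproc: Core Worker [0-9]+: job [0-9]+ \(pid=[0-9]+\) timed out/ {
            worker = $5            # fifth field looks like "5208:"
            sub(/:$/, "", worker)
            count[worker]++
        }
        END { for (w in count) printf "%s %d\n", w, count[w] }' | sort -n
}
```

It reads the log on standard input, for example count_wproc_timeouts < nagios.log, using whatever path nagios.log has on your system.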



Would you please let me know what the problem is?

Thanks

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Postby ssax » Tue Sep 18, 2018 4:51 pm

Please run these commands:

Code: Select all
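service nagios stop
service ndo2db stop
# Force-kill any nagios processes that survived the stop
pkill -9 nagios
killall -9 nagios
# Remove leftover message queues that ipcs lists for nagios
for i in $(ipcs -q | grep nagios | awk '{print $2}'); do ipcrm -q "$i"; done
service gearmand restart
service ndo2db start
service nagios start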


Then once it's started and running, please send the output of the gearman_top2 command.

Are you seeing anything related in your /var/log/gearmand/gearmand.log?

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Postby salami » Wed Sep 19, 2018 7:22 am

All the commands have been run on the server, and the output of gearman_top2 is as follows:

Code: Select all
Queue Name                   | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------------
check_results                |               1  |         227  |           1
eventhandler                 |             250  |           0  |           0
host                         |             250  |           0  |           0
hostgroup_Day                |             250  |           0  |           0
hostgroup_ansar              |             250  |           0  |           0
hostgroup_ayandeh            |             250  |           0  |           0
hostgroup_borse_kala         |             250  |           0  |           0
hostgroup_eghtesad_novin     |             250  |           0  |           0
hostgroup_fereshtegan        |             250  |           0  |           0
hostgroup_gardeshgari        |             250  |           0  |           0
hostgroup_gostaresh          |             250  |           0  |           0
hostgroup_hekmat             |             250  |           0  |           0
hostgroup_iran_peyment       |             250  |           0  |           0
hostgroup_iran_zamin         |             250  |           0  |           0
hostgroup_karafarin          |             250  |           0  |           0
hostgroup_karsazan_ayandeh   |             250  |           0  |           0
hostgroup_khavarmiyaneh      |             250  |           0  |           0
hostgroup_kosar              |             250  |           0  |           0
hostgroup_kpec               |             250  |           0  |           0
hostgroup_linux-server       |              10  |           0  |           0
hostgroup_mahak              |             250  |           0  |           0
hostgroup_maskan             |             250  |           0  |           0
hostgroup_melal              |             250  |           0  |           0
hostgroup_mellat             |             250  |           0  |           0
hostgroup_naji               |             250  |           0  |           0
hostgroup_ofogh_kurosh       |             250  |           0  |           0
hostgroup_other              |             250  |           0  |           0
hostgroup_parsian            |             250  |           0  |           0
hostgroup_pasargad           |             250  |           0  |           0
hostgroup_post_bank          |             250  |           0  |           0
hostgroup_railcom            |             250  |           0  |           0
hostgroup_refah              |             250  |           0  |           0
hostgroup_resalat            |             250  |           0  |           0
hostgroup_saderat            |             250  |           0  |           0
hostgroup_saman              |             250  |           0  |           0
hostgroup_samat              |             250  |           0  |           0
hostgroup_samen              |             250  |           0  |           0



Also, there are no new entries in gearmand.log.
Please note that the debug level of the gearman module is 2.

I ran the following command, and the result is:

Code: Select all
ps -C php | grep php | wc -l
1488


The output of the top command is as follows:

Code: Select all
top - 17:02:24 up 7 days,  4:01,  1 user,  load average: 17.90, 18.15, 19.81
Tasks: 1713 total,  20 running, 211 sleeping,   0 stopped, 1482 zombie
%Cpu(s): 16.1 us, 83.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 10072328 total,  8328628 free,   546584 used,  1197116 buff/cache
KiB Swap:  3907580 total,  3588848 free,   318732 used.  8659320 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                             
29268 nagios    20   0   11088   1312    752 R  98.4  0.0  25:35.44 nagios                                                                             
29261 nagios    20   0   11080   1320    752 R  78.9  0.0  25:38.93 nagios                                                                             
29258 nagios    20   0   11080   1304    752 R  67.2  0.0  25:49.32 nagios                                                                             
29257 nagios    20   0   11080   1312    752 R  64.3  0.0  25:43.38 nagios                                                                             
29265 nagios    20   0   11080   1324    752 R  62.3  0.0  25:42.85 nagios                                                                             
29255 nagios    20   0   11080   1324    752 R  61.4  0.0  26:18.84 nagios                                                                             
29256 nagios    20   0   11080   1320    752 R  58.1  0.0  25:31.73 nagios                                                                             
29264 nagios    20   0   11084   1340    752 R  58.1  0.0  25:31.82 nagios                                                                             
29267 nagios    20   0   11084   1336    752 R  57.8  0.0  25:42.77 nagios                                                                             
29262 nagios    20   0   11084   1340    752 R  56.2  0.0  25:49.91 nagios                                                                             
29266 nagios    20   0   11080   1336    752 R  55.8  0.0  25:13.27 nagios                                                                             
29259 nagios    20   0   11080   1324    752 R  53.6  0.0  24:55.28 nagios                                                                             
29254 nagios    20   0  138564  16628   2200 R  53.2  0.2  25:25.61 nagios                                                                             
29263 nagios    20   0   11084   1324    752 R  53.2  0.0  25:04.91 nagios                                                                             
29269 nagios    20   0   11084   1316    752 R  53.2  0.0  26:01.67 nagios                                                                             
29260 nagios    20   0   11084   1308    752 R  51.9  0.0  25:55.71 nagios
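
The top header reports 1482 zombie tasks, close to the 1488 php processes counted earlier. To confirm which commands the zombies belong to, process states can be tallied by command name. A minimal sketch, assuming input in the two-column "STAT COMMAND" form that Linux ps prints for: ps -e -o stat= -o comm= (the function name count_zombies_by_command is made up for illustration):

```shell
# Sketch: count zombie processes per command name.
# Reads "STAT COMMAND" lines on stdin; a state starting with Z is a zombie.
# Prints "<count> <command>", most frequent first.
# The function name is hypothetical, not a standard tool.
count_zombies_by_command() {
    awk '
        $1 ~ /^Z/ { count[$2]++ }
        END { for (c in count) printf "%d %s\n", count[c], c }' | sort -rn
}
```

For example: ps -e -o stat= -o comm= | count_zombies_by_command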

Re: Nagios XI 5.5.3 and Mod_Gearman compatibility

Postby tgriep » Thu Sep 20, 2018 4:47 pm

Can you run this command on the Nagios server so we can see what processes are running on it?
Code: Select all
ps -ef --cols=300


Can you post the worker.conf file from the remote gearman server?
