NagiosXI+Remote-Workers-(Distributed Monitoring)

maartin.pii · Post by **maartin.pii** » Thu Jan 26, 2017 1:04 pm

rkennedy wrote:Can you show us the full output of gearman_top2 from the gearman machine?

2017-01-26 14:56:50  -  localhost:4730  -  v0.33

 Queue Name                                     | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------------------------------
 check_results                                  |               1  |           0  |           0
 eventhandler                                   |             112  |           0  |           0
 host                                           |              82  |           0  |           0
 hostgroup_BPI_HG                               |              17  |           0  |           0
 hostgroup_NetDevCeragonHG                      |               7  |           0  |           0
 hostgroup_NetDevHG                             |               7  |           0  |           1
 hostgroup_mgmon002WorkerHostGroup              |              19  |           0  |           2
 service                                        |              29  |           0  |           3
 servicegroup_CallCenter_SG                     |               0  |           7  |           0
 servicegroup_MSSQL_Service-Group               |              11  |           0  |           0
 servicegroup_MySQL_Service-Group               |              11  |           0  |           0
 servicegroup_NetDevS-Scada-Ceragon             |               7  |           0  |           0
 servicegroup_NetDevS-Scada-R100                |               7  |           0  |           0
 servicegroup_NetDevS-Scada-Radio               |               7  |           0  |           0
 servicegroup_NetDevSG                          |               7  |           0  |           0
 servicegroup_ORACLE_Service-Group              |              11  |           0  |           3
 servicegroup_Scada_SG                          |               0  |          10  |           0
 servicegroup_Unix-Linux_Infra_Service-Group    |              47  |           0  |          14
 servicegroup_Unix-Linux_Services_Service-Group |               6  |           0  |           1
 servicegroup_mg_service-test                   |               0  |           6  |           0
 worker_mgmon001                                |               1  |           0  |           0
 worker_mgmon002                                |               1  |           0  |           0
 worker_mgmon003                                |               1  |           0  |           0
 worker_mgmon004                                |               1  |           0  |           0
 worker_mgmon005                                |               1  |           0  |           0
 worker_mgmon006                                |               1  |           0  |           0
 worker_mgmon010                                |               1  |           0  |           0
-------------------------------------------------------------------------------------------------
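For long gearman_top2 listings like the one above, a small script can pick out queues that have jobs waiting but no worker available. This is only a sketch: the function names are made up for this example, and it assumes the pipe-separated layout shown above.

```python
def parse_gearman_top(text):
    """Parse gearman_top2 table rows into (queue, workers, waiting, running) tuples."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Data rows have four columns, the last three being integer counters.
        # The header row and the dashed separator lines fail this test.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[1:]):
            continue
        rows.append((parts[0], int(parts[1]), int(parts[2]), int(parts[3])))
    return rows


def starved_queues(rows):
    """Return queues that have jobs waiting but no worker registered."""
    return [name for name, workers, waiting, _ in rows if workers == 0 and waiting > 0]


# A few rows copied from the listing above.
sample = """\
 Queue Name                     | Worker Available | Jobs Waiting | Jobs Running
 host                           |              82  |           0  |           0
 servicegroup_CallCenter_SG     |               0  |           7  |           0
 servicegroup_Scada_SG          |               0  |          10  |           0
 servicegroup_mg_service-test   |               0  |           6  |           0
"""

print(starved_queues(parse_gearman_top(sample)))
```

In the listing above this would flag servicegroup_CallCenter_SG, servicegroup_Scada_SG and servicegroup_mg_service-test, which all show 0 workers with jobs waiting.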

What I have also seen are the following errors:

Code:

[root@mgmon001 gearmand]# tailf /var/log/gearmand/gearmand.log
  ERROR 2017-00-26 15:22:19.000000 [     1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-00-26 15:22:19.000000 [     1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-00-26 15:47:19.000000 [     3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-00-26 15:47:19.000000 [     3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-00-26 16:13:05.000000 [     3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-00-26 16:13:05.000000 [     3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-00-26 16:28:55.000000 [     1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-00-26 16:28:55.000000 [     1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2017-00-26 16:53:55.000000 [     1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2017-00-26 16:53:55.000000 [     1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109

I thought this could be a time synchronization issue, so I checked all of my gearman workers: each one has its NTP daemon running and they all use the same NTP server.

What is curious about this log is the timestamp. From what I've read on Google, there is a bug in the gearman log that makes it print the month before (note the 2017-00-26 above). It is also using UTC rather than local time, and I don't know why.

Code:

[root@ gearmand]# timedatectl
      Local time: Thu 2017-01-26 12:56:12 ART
  Universal time: Thu 2017-01-26 15:56:12 UTC
        RTC time: Thu 2017-01-26 15:56:12
       Time zone: America/Argentina/Buenos_Aires (ART, -0300)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: n/a
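To line the gearmand.log entries up with local time, the stamps can be converted by hand or with a few lines of Python. The sketch below rests on two assumptions that are not confirmed in this thread: that the month field is zero-based (so 00 means January, which would match the "month before" bug), and that the time is UTC. The -0300 offset is taken from the timedatectl output above; the function name is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offset from the timedatectl output above (ART, -0300).
ART = timezone(timedelta(hours=-3))


def fix_gearmand_timestamp(stamp, local_tz=ART):
    """Convert a gearmand log stamp such as '2017-00-26 15:22:19.000000'.

    Assumes the month field is zero-based (00 = January) and the time is UTC.
    """
    date_part, time_part = stamp.split(" ")
    year, month, day = (int(x) for x in date_part.split("-"))
    # The seconds field carries a fractional part, so go through float first.
    hour, minute, second = (int(float(x)) for x in time_part.split(":"))
    utc = datetime(year, month + 1, day, hour, minute, second, tzinfo=timezone.utc)
    return utc.astimezone(local_tz)


print(fix_gearmand_timestamp("2017-00-26 15:22:19.000000").isoformat())
# prints 2017-01-26T12:22:19-03:00
```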

.----------------------------------------------------------

Another weird thing: in the image I am uploading you can see that there are 4 or 5 orphaned hosts, all of them on attempt 2/4. However, their duration is from hours ago. This is not good, since the other 2 attempts should have run by now to reach a hard state.

Note: If I force the orphaned checks to run immediately, they return an OK state.

------------------------------------

maartin.pii · Post by **maartin.pii** » Thu Jan 26, 2017 1:08 pm

Another screenshot of what I was trying to explain in my previous reply.

bheden · Post by **bheden** » Thu Jan 26, 2017 5:55 pm

What version of gearmand are you running?

maartin.pii · Post by **maartin.pii** » Fri Jan 27, 2017 9:40 am

bheden wrote:What version of gearmand are you running?

Code:

# rpm -qa | grep gearm
gearmand-devel-0.33-2.x86_64
mod_gearman2-2.1.1-1.el7.centos.x86_64
gearmand-server-0.33-2.x86_64
gearmand-0.33-2.x86_64
gearmand-debuginfo-0.33-2.x86_64

rkennedy · Post by **rkennedy** » Fri Jan 27, 2017 1:39 pm

On the XI machine running gearman, please modify /etc/mod_gearman2/module.conf and turn debugging on by setting debug=3 - then, make sure the logfile= is a valid path / file that is writable.

On the client machine that is set to hosts=yes, open up worker.conf and set debug=3 and logfile= there as well.
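The two settings above can be sanity-checked with a short script before restarting anything. This is a rough sketch, not part of mod_gearman: the helper names are made up, and it only assumes plain key=value lines with # comments. Pass it the path of module.conf or worker.conf.

```python
import os
import sys


def read_settings(path):
    """Read key=value lines from a mod_gearman style config, skipping comments."""
    settings = {}
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()  # last occurrence wins in this sketch
    return settings


def check_debug_logging(path):
    """Return a list of problems with the debug/logfile settings, empty if none."""
    settings = read_settings(path)
    problems = []
    if settings.get("debug") != "3":
        problems.append("debug is %r, expected '3'" % settings.get("debug"))
    logfile = settings.get("logfile", "")
    if not logfile:
        problems.append("logfile is not set")
    else:
        # An existing file must be writable; otherwise its directory must be.
        target = logfile if os.path.exists(logfile) else (os.path.dirname(logfile) or ".")
        if not os.access(target, os.W_OK):
            problems.append("logfile %s is not writable by this user" % logfile)
    return problems


if __name__ == "__main__":
    for conf in sys.argv[1:]:
        print(conf, check_debug_logging(conf) or "ok")
```

The writability test only means something when the script runs as the same user as the Nagios or worker process, for example with /etc/mod_gearman2/module.conf as the argument.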

maartin.pii · Post by **maartin.pii** » Wed Feb 22, 2017 1:33 am

Hi guys - I have been trying to debug this issue and still have no workaround.

I have enabled the debug log but I don't see any errors.

The behavior is always the same:

1) A host check arrives in the queue
2) It is scheduled to run a few minutes later
3) It gets orphaned (1st attempt)
4) It is scheduled to run a few minutes later again
5) It gets orphaned (2nd attempt)

Then it gets stuck on the 2nd attempt and never runs again; it just keeps being rescheduled for a few minutes later.

If I force the check to run, it returns an OK result.

Is there any way to force orphaned checks to re-run until they are OK?
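As a stopgap only, a forced check can be submitted from a script through the Nagios external command SCHEDULE_FORCED_HOST_CHECK, which is roughly what forcing the check by hand does. The sketch below is an illustration, not a fix for the orphaning itself: the command file path is an assumption (the usual default for a source install) and the helper names are made up.

```python
import time

# Assumption: default external command file location; check command_file in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def forced_host_check_command(host_name, now=None):
    """Build a SCHEDULE_FORCED_HOST_CHECK external command line for host_name."""
    now = int(time.time() if now is None else now)
    # Format: [entry_time] SCHEDULE_FORCED_HOST_CHECK;<host_name>;<check_time>
    return "[%d] SCHEDULE_FORCED_HOST_CHECK;%s;%d\n" % (now, host_name, now)


def submit(command, command_file=COMMAND_FILE):
    """Write the command to the Nagios command pipe (needs write permission)."""
    with open(command_file, "w") as pipe:
        pipe.write(command)


# Only build and show the command here; call submit() on the Nagios host to send it.
print(forced_host_check_command("clickod1", now=1487742875), end="")
# prints [1487742875] SCHEDULE_FORCED_HOST_CHECK;clickod1;1487742875
```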

I am uploading some logs and screenshots:

Code:


<---
[2017-02-22 02:54:35][23623][TRACE] 344 +++>
7ZZYwlqEiAwJQDTTOr1y0h4/0CdO8oNbpSI8TUYTVHo/J6xWaWO8YXEhzrMlYQzsbi8/jdQAHnd/7LAcefBZPEPePvHBBo8D6+CkpG82lohDl380biACj57yNBOSE1vp6ZGgy0ysGyjFr/cr28jVenUb4oV83fFoYkFU3vSLg5U+TsUuTw9r3RdleAQQap4M5cJkTLzkBMSJ6yfS8eoe/BYnJtiuBf5DJPlBUyUJRrVDO3/MGacppQsSm/x+AYZ3xTuOJ+yUZItlYYcCVzPzQxVPfDOopTGPSEZYlWv56Nr2OUqSdJE/hmyV5aSqOPiBb43jlHhblmq+ENStMdTzgg==
<+++
[2017-02-22 02:54:35][23623][TRACE] add_job_to_queue() finished successfully: 0 0
[2017-02-22 02:54:35][23623][TRACE] handle_host_check() finished successfully -> 206
[2017-02-22 02:54:35][23623][TRACE] handle_host_check(7)
[2017-02-22 02:54:35][23623][TRACE] ---------------
host Job -> 7, 804
[2017-02-22 02:54:35][23623][DEBUG] received job for queue host: clickod1
[2017-02-22 02:54:35][23623][DEBUG] host: 'clickod1', next_check is at 2017-02-22 02:54:35, latency so far: 0
[2017-02-22 02:54:35][23623][TRACE] cmd_line: /usr/local/nagios/libexec/check_ping -H 10.65.231.150 -w 3000.0,80% -c 5000.0,100% -p 5
[2017-02-22 02:54:35][23623][TRACE] add_job_to_queue(host, clickod1, 2, 1, 1, 1)
[2017-02-22 02:54:35][23623][TRACE] 246 --->type=host
result_queue=check_results
host_name=clickod1
start_time=1487742875.0
next_check=1487742875.0
timeout=30
core_time=1487742875.204230
command_line=/usr/local/nagios/libexec/check_ping -H 10.65.231.150 -w 3000.0,80% -c 5000.0,100% -p 5

Code:



<---
[2017-02-22 02:54:15][23623][TRACE] 384 +++>
QXIKTj7+crAnoZTrS98VuuXpvyA88jjWeiyNx0//a3M+an+DLE37AODGugsjpdo2AUgEx8/kg3ZfZgc9Hay0dDeTPG82EB3x4PXSwAwPK3WH3u6SrrEoIAKLmFvZnQNhhTsWjpZGV6xep+0FyYazQ9Sale5NmJPZHP0KR/doRbgRJipDppEEqOt7ITQygQoq+0SW8QFjVVAiihPqabiBJex0705xI8sNYm14dPH9vcmMUCEsRZ0lUkZfYyxHk/RUl5o9w39i+krCfwOhKnve2SCRHSmyjlc5SLe0PHWAyW4jsiOBUVVj1nX4Mikl90I6YBJRjUUe/G9mmRKe7Z1ZYAbb7MtLWPf7fKTiPmiSLpbCcj/aC2GN76dostEksQUn
<+++
[2017-02-22 02:54:15][23623][TRACE] add_job_to_queue() finished successfully: 0 0
[2017-02-22 02:54:15][23623][TRACE] handle_svc_check() finished successfully
[2017-02-22 02:54:15][23623][TRACE] handle_svc_check() finished successfully -> 206
[2017-02-22 02:54:15][23623][TRACE] handle_host_check(7)
[2017-02-22 02:54:15][23623][TRACE] ---------------
host Job -> 7, 804
[2017-02-22 02:54:15][23623][DEBUG] received job for queue host: clickod1
[2017-02-22 02:54:15][23623][DEBUG] host: 'clickod1', next_check is at 2017-02-22 02:54:15, latency so far: 0
[2017-02-22 02:54:15][23623][TRACE] cmd_line: /usr/local/nagios/libexec/check_ping -H 10.65.231.150 -w 3000.0,80% -c 5000.0,100% -p 5
[2017-02-22 02:54:15][23623][TRACE] add_job_to_queue(host, clickod1, 2, 1, 1, 1)
[2017-02-22 02:54:15][23623][TRACE] 246 --->type=host
result_queue=check_results
host_name=clickod1
start_time=1487742855.0
next_check=1487742855.0
timeout=30
core_time=1487742855.322881
command_line=/usr/local/nagios/libexec/check_ping -H 10.65.231.150 -w 3000.0,80% -c 5000.0,100% -p 5


<---
[2017-02-22 02:54:15][23623][TRACE] 344 +++>
7ZZYwlqEiAwJQDTTOr1y0h4/0CdO8oNbpSI8TUYTVHrJZGI/oFB3HiAo/jVGdiaRMJ/Eh9saNytHs9qopitHAKjZXrLCIe6liLXzToUQnVgWrTZW12bvj1R6a1zE8sdtrLYjVLDAC6HTTvSj6R6q0ob6HzOyoEIB8NxKO68maxn/cx35cnNZ5RwVOpp2vwz6n+7+wSxp9rd136c1Ik9YrlgNPWzJlmpJjsRK/YJC4SmedkhlESPDV0lrtjTPXyoyWctNtqk27Us4ecp84DCPrFxQAXKu4GNOHKUIfawIuJ8XHuabpWFjMxrrLoCmetcx2HwrzHz5mIqavpHIhdWpjQ==
<+++
[2017-02-22 02:54:15][23623][TRACE] add_job_to_queue() finished successfully: 0 0
[2017-02-22 02:54:15][23623][DEBUG] host check for clickod1 orphaned
[2017-02-22 02:54:15][23623][TRACE] handle_host_check() finished successfully -> 206
[2017-02-22 02:54:15][23623][TRACE] handle_host_check(7)
[2017-02-22 02:54:15][23623][TRACE] ---------------
host Job -> 7, 804
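When going through a large debug log like the excerpts above, it can help to count the "orphaned" lines per host and to turn the epoch values in the job payload (start_time, next_check) into readable times. The following is a sketch under the assumption that the log lines keep the exact format shown above; the function names are made up.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Matches lines like:
# [2017-02-22 02:54:15][23623][DEBUG] host check for clickod1 orphaned
ORPHAN_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[\d+\]\[DEBUG\] host check for (?P<host>\S+) orphaned"
)


def count_orphans(lines):
    """Count 'host check for <host> orphaned' DEBUG lines per host."""
    counts = Counter()
    for line in lines:
        match = ORPHAN_RE.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


def epoch_to_utc(value):
    """Convert a start_time/next_check value such as '1487742875.0' to UTC."""
    return datetime.fromtimestamp(float(value), tz=timezone.utc)


sample = [
    "[2017-02-22 02:54:15][23623][DEBUG] received job for queue host: clickod1",
    "[2017-02-22 02:54:15][23623][DEBUG] host check for clickod1 orphaned",
]
print(dict(count_orphans(sample)))              # prints {'clickod1': 1}
print(epoch_to_utc("1487742875.0").isoformat())  # prints 2017-02-22T05:54:35+00:00
```

05:54:35 UTC is 02:54:35 at -0300, which matches the timestamp on the corresponding log line, so the module log appears to be written in local time.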

maartin.pii · Post by **maartin.pii** » Wed Feb 22, 2017 1:34 am

Code:

2017-02-22 03:34:03  -  localhost:4730  -  v0.33

 Queue Name                                     | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------------------------------
 check_results                                  |               1  |           0  |           0
 eventhandler                                   |             117  |           0  |           0
 host                                           |              87  |           0  |           0
 hostgroup_BPI_HG                               |              21  |           0  |           0
 hostgroup_NetDevCeragonHG                      |              18  |           0  |           0
 hostgroup_NetDevHG                             |              18  |           0  |           0
 hostgroup_mgmon002WorkerHostGroup              |              19  |           0  |           2
 service                                        |              44  |           0  |           7
 servicegroup_CallCenter_SG                     |               0  |           7  |           0
 servicegroup_MSSQL_Service-Group               |              11  |           0  |           0
 servicegroup_MySQL_Service-Group               |              11  |           0  |           0
 servicegroup_NetDevS-Scada-Ceragon             |              18  |           0  |           0
 servicegroup_NetDevS-Scada-R100                |              18  |           0  |           0
 servicegroup_NetDevS-Scada-Radio               |              18  |           0  |           0
 servicegroup_NetDevSG                          |              18  |           0  |           0
 servicegroup_ORACLE_Service-Group              |              11  |           0  |           1
 servicegroup_Scada_SG                          |               0  |          10  |           0
 servicegroup_Unix-Linux_Infra_Service-Group    |              37  |           0  |          29
 servicegroup_Unix-Linux_Services_Service-Group |               6  |           0  |           0
 servicegroup_mg_service-test                   |               0  |           6  |           0
 worker_mgmon001                                |               1  |           0  |           0
 worker_mgmon002                                |               1  |           0  |           0
 worker_mgmon003                                |               1  |           0  |           0
 worker_mgmon004                                |               1  |           0  |           0
 worker_mgmon005                                |               1  |           0  |           0
 worker_mgmon006                                |               1  |           0  |           0
 worker_mgmon010                                |               1  |           0  |           0
-------------------------------------------------------------------------------------------------

Post by **tgriep** » Wed Feb 22, 2017 4:34 pm

I looked in the module.conf file for the Gearman Server and the following entry is in there twice

Code:

localhostgroups=

One has hostgroups defined and the other is empty.
Remove the empty entry and see if that helps.
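Duplicate entries like this are easy to miss by eye. A minimal sketch for listing keys that appear more than once in a key=value config is shown below; the sample text is hypothetical (the hostgroup name is just borrowed from the queue list earlier in the thread) and the function name is made up.

```python
from collections import defaultdict


def duplicate_keys(text):
    """Map each key that appears more than once to its (line_number, value) pairs."""
    seen = defaultdict(list)
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        seen[key.strip()].append((number, value.strip()))
    return {key: hits for key, hits in seen.items() if len(hits) > 1}


# Hypothetical excerpt, not the poster's actual module.conf.
sample = """\
debug=3
localhostgroups=BPI_HG
# a second, empty entry further down
localhostgroups=
"""

for key, hits in duplicate_keys(sample).items():
    for number, value in hits:
        print("%s line %d: %r" % (key, number, value))
```

A repeated key is not necessarily an error, so treat the output as a list of places to look at rather than a list of mistakes.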

I do have a question, what host group is the clickod1 host in?

maartin.pii · Post by **maartin.pii** » Fri Feb 24, 2017 3:30 pm

tgriep wrote:I looked in the module.conf file for the Gearman Server and the following entry is in there twice
Code:
localhostgroups=
One has hostgroups defined and the other is empty.
Remove the empty entry and see if that helps.

I do have a question, what host group is the clickod1 host in?

That server is not part of any hostgroup. The plan was to split the services of the Unix/Linux servers into servicegroups that are checked by two workers, and to distribute the host checks for those hosts across the workers that have the "hosts=yes" setting.

This is explained in my previous replies in this thread.

Regards

Post by **tgriep** » Mon Feb 27, 2017 9:53 am

It is hard to say why it is failing. You may want to look at the worker log and see if there is any better information.
