Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
Posted: Thu Jan 26, 2017 1:04 pm
by maartin.pii
rkennedy wrote:Can you show us the full output of gearman_top2 from the gearman machine?
Code:
2017-01-26 14:56:50 - localhost:4730 - v0.33
Queue Name | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------------------------------
check_results | 1 | 0 | 0
eventhandler | 112 | 0 | 0
host | 82 | 0 | 0
hostgroup_BPI_HG | 17 | 0 | 0
hostgroup_NetDevCeragonHG | 7 | 0 | 0
hostgroup_NetDevHG | 7 | 0 | 1
hostgroup_mgmon002WorkerHostGroup | 19 | 0 | 2
service | 29 | 0 | 3
servicegroup_CallCenter_SG | 0 | 7 | 0
servicegroup_MSSQL_Service-Group | 11 | 0 | 0
servicegroup_MySQL_Service-Group | 11 | 0 | 0
servicegroup_NetDevS-Scada-Ceragon | 7 | 0 | 0
servicegroup_NetDevS-Scada-R100 | 7 | 0 | 0
servicegroup_NetDevS-Scada-Radio | 7 | 0 | 0
servicegroup_NetDevSG | 7 | 0 | 0
servicegroup_ORACLE_Service-Group | 11 | 0 | 3
servicegroup_Scada_SG | 0 | 10 | 0
servicegroup_Unix-Linux_Infra_Service-Group | 47 | 0 | 14
servicegroup_Unix-Linux_Services_Service-Group | 6 | 0 | 1
servicegroup_mg_service-test | 0 | 6 | 0
worker_mgmon001 | 1 | 0 | 0
worker_mgmon002 | 1 | 0 | 0
worker_mgmon003 | 1 | 0 | 0
worker_mgmon004 | 1 | 0 | 0
worker_mgmon005 | 1 | 0 | 0
worker_mgmon006 | 1 | 0 | 0
worker_mgmon010 | 1 | 0 | 0
-------------------------------------------------------------------------------------------------
I have also seen the following errors:
Code:
[root@mgmon001 gearmand]# tailf /var/log/gearmand/gearmand.log
ERROR 2017-00-26 15:22:19.000000 [ 1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-00-26 15:22:19.000000 [ 1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-00-26 15:47:19.000000 [ 3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-00-26 15:47:19.000000 [ 3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-00-26 16:13:05.000000 [ 3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-00-26 16:13:05.000000 [ 3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-00-26 16:28:55.000000 [ 1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-00-26 16:28:55.000000 [ 1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-00-26 16:53:55.000000 [ 1 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-00-26 16:53:55.000000 [ 1 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
I thought this could be a time synchronization issue, so I checked that all of my gearman workers have their NTP daemon running and point to the same NTP server, and they all do.
What is very curious about this log is the timestamp. From what I've been reading on Google, there is a known bug where the gearmand log prints the month as one less than the actual month. It also logs in UTC rather than local time; I don't know why it does this.
Code:
[root@ gearmand]# timedatectl
Local time: Thu 2017-01-26 12:56:12 ART
Universal time: Thu 2017-01-26 15:56:12 UTC
RTC time: Thu 2017-01-26 15:56:12
Time zone: America/Argentina/Buenos_Aires (ART, -0300)
NTP enabled: yes
NTP synchronized: yes
RTC in local TZ: no
DST active: n/a
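Converting one of the log's UTC timestamps back to local time confirms the -0300 offset. A minimal sketch, assuming GNU date, and remembering the month field in the log is off by one (so 00 = January):

```shell
# The log line "ERROR 2017-00-26 15:22:19" is really 2017-01-26 15:22:19 UTC
# (month off by one); in ART (UTC-3) that is 12:22:19 local time.
TZ=America/Argentina/Buenos_Aires date -d '2017-01-26 15:22:19 UTC' '+%Y-%m-%d %H:%M:%S'
```

This matches the 3-hour gap between "Local time" and "Universal time" in the timedatectl output above.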
----------------------------------------------------------
Another weird thing: in the image I am uploading you can see there are 4 or 5 orphaned hosts, and all of them are on their 2/4 attempt. However, the duration shows they have been sitting there for hours - and that is not good, since they should run the other 2 attempts to reach their hard state.
Note: if I force the orphaned checks to run immediately, they return an OK state.
------------------------------------
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
Posted: Thu Jan 26, 2017 1:08 pm
by maartin.pii
Another screenshot of what I was trying to explain in my previous reply.
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
Posted: Thu Jan 26, 2017 5:55 pm
by bheden
What version of gearmand are you running?
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
Posted: Fri Jan 27, 2017 9:40 am
by maartin.pii
bheden wrote:What version of gearmand are you running?
Code:
# rpm -qa | grep gearm
gearmand-devel-0.33-2.x86_64
mod_gearman2-2.1.1-1.el7.centos.x86_64
gearmand-server-0.33-2.x86_64
gearmand-0.33-2.x86_64
gearmand-debuginfo-0.33-2.x86_64
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
Posted: Fri Jan 27, 2017 1:39 pm
by rkennedy
On the XI machine running gearman, please modify /etc/mod_gearman2/module.conf and turn debugging on by setting debug=3. Then make sure logfile= points to a valid, writable path/file.
On the client machine that is set to hosts=yes, open worker.conf and turn on debug=3 and set logfile= there as well.
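As a sketch, the settings in question would look something like this (the log-file paths are illustrative only - keep whatever paths your install already uses):

```
# /etc/mod_gearman2/module.conf on the XI machine running gearman
debug=3
logfile=/var/log/mod_gearman/module.log    # must exist and be writable

# worker.conf on the client machine set to hosts=yes
debug=3
logfile=/var/log/mod_gearman/worker.log
```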
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
Posted: Wed Feb 22, 2017 1:33 am
by maartin.pii
Hi guys - I have been trying to debug this issue and still have no workaround for it.
I have enabled the debug log but I don't see any errors.
The behavior is always the same:
1) A host check arrives in the queue
2) It is scheduled to run a few minutes later
3) It gets orphaned (1st attempt)
4) It is scheduled to run a few minutes later again
5) It gets orphaned (2nd attempt)
And then it gets stuck on the 2nd attempt and never runs again; it just keeps getting rescheduled for a few minutes later.
If I force the check to run, it returns an OK result.
Is there any way to force orphaned checks to re-run until they are OK?
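For reference, one way to force a check from the command line is the Nagios external command pipe. A sketch only: SCHEDULE_FORCED_HOST_CHECK is a standard Nagios external command, "clickod1" is the orphaned host from my logs, and the command-file path in the comment is the Core default:

```shell
# Build a forced host-check command for the orphaned host "clickod1".
# In practice the output is redirected into the Nagios command pipe,
# e.g.  >> /usr/local/nagios/var/rw/nagios.cmd  (default Core path).
now=$(date +%s)
echo "[$now] SCHEDULE_FORCED_HOST_CHECK;clickod1;$now"
```

That only papers over the problem for one host, though; it doesn't explain why the checks orphan in the first place.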
I am uploading some logs and screenshots:
Code:
<---
[2017-02-22 02:54:35][23623][TRACE] 344 +++>
7ZZYwlqEiAwJQDTTOr1y0h4/0CdO8oNbpSI8TUYTVHo/J6xWaWO8YXEhzrMlYQzsbi8/jdQAHnd/7LAcefBZPEPePvHBBo8D6+CkpG82lohDl380biACj57yNBOSE1vp6ZGgy0ysGyjFr/cr28jVenUb4oV83fFoYkFU3vSLg5U+TsUuTw9r3RdleAQQap4M5cJkTLzkBMSJ6yfS8eoe/BYnJtiuBf5DJPlBUyUJRrVDO3/MGacppQsSm/x+AYZ3xTuOJ+yUZItlYYcCVzPzQxVPfDOopTGPSEZYlWv56Nr2OUqSdJE/hmyV5aSqOPiBb43jlHhblmq+ENStMdTzgg==
<+++
[2017-02-22 02:54:35][23623][TRACE] add_job_to_queue() finished successfully: 0 0
[2017-02-22 02:54:35][23623][TRACE] handle_host_check() finished successfully -> 206
[2017-02-22 02:54:35][23623][TRACE] handle_host_check(7)
[2017-02-22 02:54:35][23623][TRACE] ---------------
host Job -> 7, 804
[2017-02-22 02:54:35][23623][DEBUG] received job for queue host: clickod1
[2017-02-22 02:54:35][23623][DEBUG] host: 'clickod1', next_check is at 2017-02-22 02:54:35, latency so far: 0
[2017-02-22 02:54:35][23623][TRACE] cmd_line: /usr/local/nagios/libexec/check_ping -H 10.65.231.150 -w 3000.0,80% -c 5000.0,100% -p 5
[2017-02-22 02:54:35][23623][TRACE] add_job_to_queue(host, clickod1, 2, 1, 1, 1)
[2017-02-22 02:54:35][23623][TRACE] 246 --->type=host
result_queue=check_results
host_name=clickod1
start_time=1487742875.0
next_check=1487742875.0
timeout=30
core_time=1487742875.204230
command_line=/usr/local/nagios/libexec/check_ping -H 10.65.231.150 -w 3000.0,80% -c 5000.0,100% -p 5
Code:
<---
[2017-02-22 02:54:15][23623][TRACE] 384 +++>
QXIKTj7+crAnoZTrS98VuuXpvyA88jjWeiyNx0//a3M+an+DLE37AODGugsjpdo2AUgEx8/kg3ZfZgc9Hay0dDeTPG82EB3x4PXSwAwPK3WH3u6SrrEoIAKLmFvZnQNhhTsWjpZGV6xep+0FyYazQ9Sale5NmJPZHP0KR/doRbgRJipDppEEqOt7ITQygQoq+0SW8QFjVVAiihPqabiBJex0705xI8sNYm14dPH9vcmMUCEsRZ0lUkZfYyxHk/RUl5o9w39i+krCfwOhKnve2SCRHSmyjlc5SLe0PHWAyW4jsiOBUVVj1nX4Mikl90I6YBJRjUUe/G9mmRKe7Z1ZYAbb7MtLWPf7fKTiPmiSLpbCcj/aC2GN76dostEksQUn
<+++
[2017-02-22 02:54:15][23623][TRACE] add_job_to_queue() finished successfully: 0 0
[2017-02-22 02:54:15][23623][TRACE] handle_svc_check() finished successfully
[2017-02-22 02:54:15][23623][TRACE] handle_svc_check() finished successfully -> 206
[2017-02-22 02:54:15][23623][TRACE] handle_host_check(7)
[2017-02-22 02:54:15][23623][TRACE] ---------------
host Job -> 7, 804
[2017-02-22 02:54:15][23623][DEBUG] received job for queue host: clickod1
[2017-02-22 02:54:15][23623][DEBUG] host: 'clickod1', next_check is at 2017-02-22 02:54:15, latency so far: 0
[2017-02-22 02:54:15][23623][TRACE] cmd_line: /usr/local/nagios/libexec/check_ping -H 10.65.231.150 -w 3000.0,80% -c 5000.0,100% -p 5
[2017-02-22 02:54:15][23623][TRACE] add_job_to_queue(host, clickod1, 2, 1, 1, 1)
[2017-02-22 02:54:15][23623][TRACE] 246 --->type=host
result_queue=check_results
host_name=clickod1
start_time=1487742855.0
next_check=1487742855.0
timeout=30
core_time=1487742855.322881
command_line=/usr/local/nagios/libexec/check_ping -H 10.65.231.150 -w 3000.0,80% -c 5000.0,100% -p 5
<---
[2017-02-22 02:54:15][23623][TRACE] 344 +++>
7ZZYwlqEiAwJQDTTOr1y0h4/0CdO8oNbpSI8TUYTVHrJZGI/oFB3HiAo/jVGdiaRMJ/Eh9saNytHs9qopitHAKjZXrLCIe6liLXzToUQnVgWrTZW12bvj1R6a1zE8sdtrLYjVLDAC6HTTvSj6R6q0ob6HzOyoEIB8NxKO68maxn/cx35cnNZ5RwVOpp2vwz6n+7+wSxp9rd136c1Ik9YrlgNPWzJlmpJjsRK/YJC4SmedkhlESPDV0lrtjTPXyoyWctNtqk27Us4ecp84DCPrFxQAXKu4GNOHKUIfawIuJ8XHuabpWFjMxrrLoCmetcx2HwrzHz5mIqavpHIhdWpjQ==
<+++
[2017-02-22 02:54:15][23623][TRACE] add_job_to_queue() finished successfully: 0 0
[2017-02-22 02:54:15][23623][DEBUG] host check for clickod1 orphaned
[2017-02-22 02:54:15][23623][TRACE] handle_host_check() finished successfully -> 206
[2017-02-22 02:54:15][23623][TRACE] handle_host_check(7)
[2017-02-22 02:54:15][23623][TRACE] ---------------
host Job -> 7, 804
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
Posted: Wed Feb 22, 2017 1:34 am
by maartin.pii
Code:
2017-02-22 03:34:03 - localhost:4730 - v0.33
Queue Name | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------------------------------
check_results | 1 | 0 | 0
eventhandler | 117 | 0 | 0
host | 87 | 0 | 0
hostgroup_BPI_HG | 21 | 0 | 0
hostgroup_NetDevCeragonHG | 18 | 0 | 0
hostgroup_NetDevHG | 18 | 0 | 0
hostgroup_mgmon002WorkerHostGroup | 19 | 0 | 2
service | 44 | 0 | 7
servicegroup_CallCenter_SG | 0 | 7 | 0
servicegroup_MSSQL_Service-Group | 11 | 0 | 0
servicegroup_MySQL_Service-Group | 11 | 0 | 0
servicegroup_NetDevS-Scada-Ceragon | 18 | 0 | 0
servicegroup_NetDevS-Scada-R100 | 18 | 0 | 0
servicegroup_NetDevS-Scada-Radio | 18 | 0 | 0
servicegroup_NetDevSG | 18 | 0 | 0
servicegroup_ORACLE_Service-Group | 11 | 0 | 1
servicegroup_Scada_SG | 0 | 10 | 0
servicegroup_Unix-Linux_Infra_Service-Group | 37 | 0 | 29
servicegroup_Unix-Linux_Services_Service-Group | 6 | 0 | 0
servicegroup_mg_service-test | 0 | 6 | 0
worker_mgmon001 | 1 | 0 | 0
worker_mgmon002 | 1 | 0 | 0
worker_mgmon003 | 1 | 0 | 0
worker_mgmon004 | 1 | 0 | 0
worker_mgmon005 | 1 | 0 | 0
worker_mgmon006 | 1 | 0 | 0
worker_mgmon010 | 1 | 0 | 0
-------------------------------------------------------------------------------------------------
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
Posted: Wed Feb 22, 2017 4:34 pm
by tgriep
I looked in the module.conf file for the Gearman Server and the following entry is in there twice
One has hostgroups defined and the other is empty.
Remove the empty entry and see if that helps.
I do have a question: what host group is the clickod1 host in?
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
Posted: Fri Feb 24, 2017 3:30 pm
by maartin.pii
tgriep wrote:I looked in the module.conf file for the Gearman Server and the following entry is in there twice
One has hostgroups defined and the other is empty.
Remove the empty entry and see if that helps.
I do have a question, what host group is the clickod1 host in?
That server is not part of any hostgroup... The plan was to split the services of the Unix/Linux servers into servicegroups checked by two workers, and to distribute the host checks of those hosts across the workers that have the "hosts=yes" config.
Refer to the previous replies in this post; it's explained there.
Regards
Re: NagiosXI+Remote-Workers-(Distributed Monitoring)
Posted: Mon Feb 27, 2017 9:53 am
by tgriep
It is hard to say why it is failing. You may want to look at the worker log to see if there is any better information.