gearman troubles

WillemDH
Posts: 2320
Joined: Wed Mar 20, 2013 5:49 am
Location: Ghent

Re: gearman troubles

Post by WillemDH »

Our current supported version that we've done testing with is ..
We just need a stable Gearman worker on CentOS 7. I tried upgrading because we kept having issues with the worker. Are you saying we need to downgrade again in order to be supported?
Yes, we are working on our own version of gearman in-house.
Good to hear that. I'm sure an in-house supported Gearman version should be less of a hassle.

I can see in your repolist that you are using CentOS 6.

Code: Select all

yum repolist                                    [17-01-17 17:38:25]
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: centos.mirror.iweb.ca
 * epel: ftp.nluug.nl
 * extras: centos.mirror.iweb.ca
 * updates: centos.mirror.iweb.ca
repo id                      repo name                                       status
base/7/x86_64                CentOS-7 - Base                                  9,363
epel/x86_64                  Extra Packages for Enterprise Linux 7 - x86_64  11,046
extras/7/x86_64              CentOS-7 - Extras                                  200
labs_consol_stable/x86_64    labs_consol_stable                                  22
updates/7/x86_64             CentOS-7 - Updates
While writing this, I'm having issues again... :(

Attached a screenshot. I can temporarily 'fix it' by executing:

Code: Select all

service nagios stop
service gearmand stop

service gearmand start
service nagios start
But as you can see, it stopped working again 15 minutes later..
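The restart workaround above can be wrapped in a small script so that the stop/start order stays the same every time. This is only a minimal sketch, not an official procedure: the names `run`, `restart_stack` and `DRY_RUN` are made up for illustration, and it assumes the `nagios` and `gearmand` service names used in the commands above.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nagios XI or Mod-Gearman.
# With DRY_RUN=1 the commands are only printed, not executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Same order as in the post: stop nagios, stop gearmand,
# start gearmand, start nagios. Stops at the first failing step.
restart_stack() {
    run service nagios stop &&
    run service gearmand stop &&
    run service gearmand start &&
    run service nagios start
}
```

After sourcing the file, setting `DRY_RUN=1` and calling `restart_stack` shows what would be run without touching the services. It only automates the temporary fix and does nothing about whatever makes the checks stall again.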
Nagios XI 5.8.1
https://outsideit.net
SteveBeauchemin
Posts: 524
Joined: Mon Oct 14, 2013 7:19 pm

Re: gearman troubles

Post by SteveBeauchemin »

Willem...
You have a 2 in the package name: Mod_Gearman versus Mod_Gearman2.

Maybe that is an issue.

Steve B
XI 5.7.3 / Core 4.4.6 / NagVis 1.9.8 / LiveStatus 1.5.0p11 / RRDCached 1.7.0 / Redis 3.2.8 /
SNMPTT / Gearman 0.33-7 / Mod_Gearman 3.0.7 / NLS 2.0.8 / NNA 2.3.1 /
NSClient 0.5.0 / NRPE Solaris 3.2.1 Linux 3.2.1 HPUX 3.2.1

Re: gearman troubles

Post by WillemDH »

Ah yes, indeed I have a 2, but that was recommended to me by you / Nagios a long time ago to solve a memory leak issue.

https://support.nagios.com/forum/viewto ... =+gearman2
https://support.nagios.com/forum/viewto ... t=gearman2

It seems mod_gearman2 is installed on my XI prod server and on both worker nodes. Before upgrading to mod_gearman2 we had memory leak issues, duplicate Nagios processes and instability, all of which were more or less resolved after upgrading to 2.

I'm a bit confused about what I'm supposed to do.. I'm not sure what the difference is between:
- mod_gearman2.x86_64 2.1.1-1.el7.centos
- mod_gearman.x86_64 3.0.0-1.el7.centos
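To see at a glance which variant a node is running, the package lines can be reduced to a name and a major version. A minimal sketch, assuming input in the `name.arch version-release` form of the two lines above (without the leading dash, for example pasted from a package listing); the function name `gearman_variants` is hypothetical.

```shell
# Hypothetical helper: read "name.arch version-release" lines on stdin and
# print the package name and major version of every mod_gearman variant.
gearman_variants() {
    awk '$1 ~ /^mod_gearman2?\./ {
        split($1, name, ".")   # name[1] = package name without the arch
        split($2, ver, ".")    # ver[1]  = major version number
        print name[1], "major", ver[1]
    }'
}
```

Running the same listing through it on the XI server and on each worker makes a mix of the `mod_gearman2` 2.x and `mod_gearman` 3.x packages easy to spot.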

Did Gearman really get forked? It's time this native worker thing gets implemented... :)

I'm really not looking forward to tampering with my one worker node, which has been very stable for 6 months now. But I do need a new CentOS 7 worker.. And I have no time...
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: gearman troubles

Post by tgriep »

At the end of October last year, a new version of Mod-Gearman, 3.0, was released. Take a look at this link for more details.
https://github.com/sni/mod_gearman/blob/master/Changes
Be sure to check out our Knowledgebase for helpful articles and solutions!

Re: gearman troubles

Post by WillemDH »

I was able to install 3.0 on the worker, but I'm having issues again.. The weird thing is that after restarting the nagios and gearmand services on the XI server, it works for a few checks. But 20 minutes later, the 'last check' stops updating in XI and gearman_top2 starts showing a rising number of 'Jobs Running'..
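One way to put numbers on a backlog like this is to read the queue status from gearmand directly. The sketch below assumes gearmand's plain-text `status` output (as printed by `gearadmin --status`, if that tool is installed): one tab-separated line per queue with name, total jobs, running jobs and available workers, ended by a line containing a single dot. The function name `stuck_queues` and its limit argument are hypothetical.

```shell
# Hypothetical helper: read gearmand "status" lines on stdin and print every
# queue whose waiting jobs (total minus running) exceed a limit (default 0).
stuck_queues() {
    awk -F '\t' -v limit="${1:-0}" '
        $0 == "." { next }    # end-of-list marker
        NF == 4 {
            waiting = $2 - $3
            if (waiting > limit)
                printf "%s waiting=%d running=%d workers=%d\n", $1, waiting, $3, $4
        }'
}
```

Something like `gearadmin --status | stuck_queues 10`, run on the XI server every minute, would show whether jobs pile up while workers are still registered, or whether the workers have dropped off entirely.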

Logs on the worker:

Code: Select all

[2017-01-18 12:14:55][21725][DEBUG] child started with pid: 21725
[2017-01-18 12:14:55][21724][DEBUG] child started with pid: 21724
[2017-01-18 12:15:06][20434][INFO ] no checks in 2minutes, restarting all workers
[2017-01-18 12:15:10][21890][DEBUG] child started with pid: 21890
[2017-01-18 12:15:10][21891][DEBUG] child started with pid: 21891
[2017-01-18 12:15:10][21894][DEBUG] child started with pid: 21894
[2017-01-18 12:15:10][21893][DEBUG] child started with pid: 21893
[2017-01-18 12:15:10][21892][DEBUG] child started with pid: 21892
[2017-01-18 12:15:26][21973][DEBUG] child started with pid: 21973
[2017-01-18 12:15:41][22064][DEBUG] child started with pid: 22064
[2017-01-18 12:15:41][22065][DEBUG] child started with pid: 22065
[2017-01-18 12:15:41][22068][DEBUG] child started with pid: 22068
[2017-01-18 12:15:41][22067][DEBUG] child started with pid: 22067
[2017-01-18 12:15:41][22066][DEBUG] child started with pid: 22066
[2017-01-18 12:15:42][22067][DEBUG] got service job: google.be_ext - NET_Ping
[2017-01-18 12:15:57][22166][DEBUG] child started with pid: 22166
[2017-01-18 12:16:12][22257][DEBUG] child started with pid: 22257
[2017-01-18 12:16:12][22258][DEBUG] child started with pid: 22258
[2017-01-18 12:16:12][22261][DEBUG] child started with pid: 22261
[2017-01-18 12:16:12][22260][DEBUG] child started with pid: 22260
[2017-01-18 12:16:12][22259][DEBUG] child started with pid: 22259
[2017-01-18 12:16:28][22366][DEBUG] child started with pid: 22366
[2017-01-18 12:16:43][22457][DEBUG] child started with pid: 22457
[2017-01-18 12:16:43][22458][DEBUG] child started with pid: 22458
[2017-01-18 12:16:43][22460][DEBUG] child started with pid: 22460
[2017-01-18 12:16:43][22461][DEBUG] child started with pid: 22461
[2017-01-18 12:16:43][22459][DEBUG] child started with pid: 22459
[2017-01-18 12:16:47][22459][DEBUG] got service job: google.be_ext - URL_Content
[2017-01-18 12:16:59][22559][DEBUG] child started with pid: 22559
[2017-01-18 12:17:14][22650][DEBUG] child started with pid: 22650
[2017-01-18 12:17:14][22651][DEBUG] child started with pid: 22651
[2017-01-18 12:17:14][22652][DEBUG] child started with pid: 22652
[2017-01-18 12:17:14][22653][DEBUG] child started with pid: 22653
[2017-01-18 12:17:18][22678][DEBUG] child started with pid: 22678
[2017-01-18 12:17:30][22755][DEBUG] child started with pid: 22755
[2017-01-18 12:17:45][22903][DEBUG] child started with pid: 22903
[2017-01-18 12:17:45][22904][DEBUG] child started with pid: 22904
[2017-01-18 12:17:45][22906][DEBUG] child started with pid: 22906
[2017-01-18 12:17:45][22905][DEBUG] child started with pid: 22905
[2017-01-18 12:17:49][22931][DEBUG] child started with pid: 22931
[2017-01-18 12:18:01][23004][DEBUG] child started with pid: 23004
[2017-01-18 12:18:16][23095][DEBUG] child started with pid: 23095
[2017-01-18 12:18:16][23096][DEBUG] child started with pid: 23096
[2017-01-18 12:18:16][23098][DEBUG] child started with pid: 23098
[2017-01-18 12:18:16][23097][DEBUG] child started with pid: 23097
[2017-01-18 12:18:20][23123][DEBUG] child started with pid: 23123
[2017-01-18 12:18:32][23199][DEBUG] child started with pid: 23199
[2017-01-18 12:18:47][23284][DEBUG] child started with pid: 23284
[2017-01-18 12:18:47][23285][DEBUG] child started with pid: 23285
[2017-01-18 12:18:47][23287][DEBUG] child started with pid: 23287
[2017-01-18 12:18:47][23286][DEBUG] child started with pid: 23286
[2017-01-18 12:18:48][20434][INFO ] no checks in 2minutes, restarting all workers
[2017-01-18 12:18:52][23318][DEBUG] child started with pid: 23318
[2017-01-18 12:18:52][23319][DEBUG] child started with pid: 23319
[2017-01-18 12:18:52][23320][DEBUG] child started with pid: 23320
[2017-01-18 12:18:52][23321][DEBUG] child started with pid: 23321
[2017-01-18 12:18:52][23322][DEBUG] child started with pid: 23322
[2017-01-18 12:19:03][23389][DEBUG] child started with pid: 23389
[2017-01-18 12:19:23][23517][DEBUG] child started with pid: 23517
[2017-01-18 12:19:23][23518][DEBUG] child started with pid: 23518
[2017-01-18 12:19:23][23521][DEBUG] child started with pid: 23521
[2017-01-18 12:19:23][23520][DEBUG] child started with pid: 23520
[2017-01-18 12:19:23][23519][DEBUG] child started with pid: 23519
[2017-01-18 12:19:34][23629][DEBUG] child started with pid: 23629
[2017-01-18 12:19:54][23756][DEBUG] child started with pid: 23756
[2017-01-18 12:19:54][23757][DEBUG] child started with pid: 23757
[2017-01-18 12:19:54][23760][DEBUG] child started with pid: 23760
[2017-01-18 12:19:54][23759][DEBUG] child started with pid: 23759
[2017-01-18 12:19:54][23758][DEBUG] child started with pid: 23758
[2017-01-18 12:20:05][23905][DEBUG] child started with pid: 23905
[2017-01-18 12:20:25][24026][DEBUG] child started with pid: 24026
[2017-01-18 12:20:25][24028][DEBUG] child started with pid: 24028
[2017-01-18 12:20:25][24030][DEBUG] child started with pid: 24030
[2017-01-18 12:20:25][24027][DEBUG] child started with pid: 24027
[2017-01-18 12:20:25][24029][DEBUG] child started with pid: 24029
[2017-01-18 12:20:36][24097][DEBUG] child started with pid: 24097
[2017-01-18 12:20:49][20434][INFO ] no checks in 2minutes, restarting all workers
[2017-01-18 12:20:53][24200][DEBUG] child started with pid: 24200
[2017-01-18 12:20:53][24201][DEBUG] child started with pid: 24201
[2017-01-18 12:20:53][24203][DEBUG] child started with pid: 24203
[2017-01-18 12:20:53][24204][DEBUG] child started with pid: 24204
[2017-01-18 12:20:53][24202][DEBUG] child started with pid: 24202
[2017-01-18 12:21:07][24289][DEBUG] child started with pid: 24289
[2017-01-18 12:21:24][24411][DEBUG] child started with pid: 24411
[2017-01-18 12:21:24][24412][DEBUG] child started with pid: 24412
[2017-01-18 12:21:24][24414][DEBUG] child started with pid: 24414
[2017-01-18 12:21:24][24415][DEBUG] child started with pid: 24415
[2017-01-18 12:21:24][24413][DEBUG] child started with pid: 24413
[2017-01-18 12:21:38][24482][DEBUG] child started with pid: 24482
[2017-01-18 12:21:55][24585][DEBUG] child started with pid: 24585
[2017-01-18 12:21:55][24586][DEBUG] child started with pid: 24586
[2017-01-18 12:21:55][24589][DEBUG] child started with pid: 24589
[2017-01-18 12:21:55][24588][DEBUG] child started with pid: 24588
[2017-01-18 12:21:55][24587][DEBUG] child started with pid: 24587
[2017-01-18 12:22:09][24674][DEBUG] child started with pid: 24674
[2017-01-18 12:22:26][24810][DEBUG] child started with pid: 24810
[2017-01-18 12:22:26][24811][DEBUG] child started with pid: 24811
[2017-01-18 12:22:26][24814][DEBUG] child started with pid: 24814
[2017-01-18 12:22:26][24813][DEBUG] child started with pid: 24813
[2017-01-18 12:22:26][24812][DEBUG] child started with pid: 24812
[2017-01-18 12:22:40][24924][DEBUG] child started with pid: 24924
[2017-01-18 12:22:50][20434][INFO ] no checks in 2minutes, restarting all workers
[2017-01-18 12:22:54][25041][DEBUG] child started with pid: 25041
[2017-01-18 12:22:54][25042][DEBUG] child started with pid: 25042
[2017-01-18 12:22:54][25043][DEBUG] child started with pid: 25043
[2017-01-18 12:22:54][25044][DEBUG] child started with pid: 25044
[2017-01-18 12:22:54][25045][DEBUG] child started with pid: 25045
[2017-01-18 12:23:11][25148][DEBUG] child started with pid: 25148
[2017-01-18 12:23:25][25233][DEBUG] child started with pid: 25233
[2017-01-18 12:23:25][25232][DEBUG] child started with pid: 25232
[2017-01-18 12:23:25][25236][DEBUG] child started with pid: 25236
[2017-01-18 12:23:25][25235][DEBUG] child started with pid: 25235
[2017-01-18 12:23:25][25234][DEBUG] child started with pid: 25234
[2017-01-18 12:23:42][25325][DEBUG] child started with pid: 25325
[2017-01-18 12:23:56][25410][DEBUG] child started with pid: 25410
[2017-01-18 12:23:56][25411][DEBUG] child started with pid: 25411
[2017-01-18 12:23:56][25414][DEBUG] child started with pid: 25414
[2017-01-18 12:23:56][25413][DEBUG] child started with pid: 25413
[2017-01-18 12:23:56][25412][DEBUG] child started with pid: 25412
[2017-01-18 12:24:13][25520][DEBUG] child started with pid: 25520
[2017-01-18 12:24:27][25606][DEBUG] child started with pid: 25606
[2017-01-18 12:24:27][25610][DEBUG] child started with pid: 25610
[2017-01-18 12:24:27][25607][DEBUG] child started with pid: 25607
[2017-01-18 12:24:27][25608][DEBUG] child started with pid: 25608
[2017-01-18 12:24:27][25609][DEBUG] child started with pid: 25609
[2017-01-18 12:24:44][25774][DEBUG] child started with pid: 25774
[2017-01-18 12:24:51][20434][INFO ] no checks in 2minutes, restarting all workers
For now I have switched to plain check_nrpe, executing the HTTP checks with a custom -H parameter. Really looking forward to the native workers. :) Go Bryan.
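The worker log above has a clear pattern: children keep being started, only a couple of jobs arrive, and the parent restarts all workers whenever no checks have come in for two minutes. A minimal sketch for counting those events, assuming the log line format shown above; the function name `summarize_worker_log` is hypothetical, and job types other than `service` are assumed to be logged with the same wording.

```shell
# Hypothetical helper: summarize a Mod-Gearman worker log read from stdin.
summarize_worker_log() {
    awk '
        /child started with pid/ { started++ }
        / got [a-z]+ job: /      { jobs++ }
        /restarting all workers/ { restarts++ }
        END {
            printf "children started: %d\n", started
            printf "jobs received: %d\n", jobs
            printf "worker restarts: %d\n", restarts
        }'
}
```

Feeding it the worker log file on stdin gives three counters; many started children and restarts against very few received jobs points at the worker not getting work from the queue rather than at failing checks.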

Re: gearman troubles

Post by tgriep »

Do you see any errors in the gearman server or worker log files?
Did you update the gearman server to the 3.x version?

Re: gearman troubles

Post by WillemDH »

Nope, I didn't... But that is exactly the problem. The XI server and the MRTG CentOS 6 worker have been running very stably for more than half a year now... This MRTG worker node monitors all our network devices and is considered very critical.
I would have to find a window where I can snapshot both XI and the MRTG worker, then upgrade, test and, if it fails, revert.. I can't afford for this MRTG worker to become unstable (again).

@bheden
Just a quick question: will the planned 'native workers' be able to work alongside Gearman, or will implementing a 'native worker' imply I have to migrate all our Gearman workers to it?
bheden
Product Development Manager
Posts: 179
Joined: Thu Feb 13, 2014 9:50 am
Location: Nagios Enterprises

Re: gearman troubles

Post by bheden »

@WillemDH,

You should be able to run them side by side. The queue functionality is extremely similar, so as long as they don't overlap (in which case whichever NEB module loaded first would win the rights to check that host/service/whatever) there shouldn't be a problem with it!

This is an interesting point that I didn't consider, though. I'll make sure that functionality is solid and documented by release time. Thanks!

Nagios Enterprises
Senior Developer