Re: gearman troubles
Posted: Tue Jan 17, 2017 11:00 am
by WillemDH
Our current supported version that we've done testing with is ..
We just need a stable Gearman CentOS 7 worker. I tried upgrading as we kept having issues with the worker. Are you saying we need to downgrade again in order to be supported?
Yes, we are working on our own version of gearman in house.
Good to hear that. I'm sure an in-house, supported Gearman version will be less of a hassle.
I can see in your repolist that you are using CentOS 6.
Code: Select all
yum repolist [17-01-17 17:38:25]
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: centos.mirror.iweb.ca
* epel: ftp.nluug.nl
* extras: centos.mirror.iweb.ca
* updates: centos.mirror.iweb.ca
repo id                      repo name                                         status
base/7/x86_64                CentOS-7 - Base                                   9,363
epel/x86_64                  Extra Packages for Enterprise Linux 7 - x86_64    11,046
extras/7/x86_64              CentOS-7 - Extras                                 200
labs_consol_stable/x86_64    labs_consol_stable                                22
updates/7/x86_64             CentOS-7 - Updates
While writing this, I'm having issues again...
I've attached a screenshot. I can temporarily 'fix' it by executing:
Code: Select all
service nagios stop
service gearmand stop
service gearmand start
service nagios start
But as you can see, it stopped working 15 minutes later.
Re: gearman troubles
Posted: Tue Jan 17, 2017 12:29 pm
by SteveBeauchemin
Willem...
You have the "2" variant installed: Mod_Gearman2 rather than Mod_Gearman.
Maybe that is the issue.
Steve B
Re: gearman troubles
Posted: Tue Jan 17, 2017 12:42 pm
by WillemDH
Ah yes, indeed I have a 2, but that was recommended to me by you / Nagios quite some time ago to solve a memory leak issue.
https://support.nagios.com/forum/viewto ... =+gearman2
https://support.nagios.com/forum/viewto ... t=gearman2
It seems mod_gearman2 is installed on my XI prod server and on both worker nodes. Before upgrading to mod_gearman2 we had memory leak issues, duplicate Nagios processes and instability, all of which were more or less resolved after upgrading to 2.
I'm a bit confused about what I'm supposed to do. I'm not sure what the difference is between:
- mod_gearman2.x86_64 2.1.1-1.el7.centos
- mod_gearman.x86_64 3.0.0-1.el7.centos
Did Gearman really get forked? It's time this native worker thing gets implemented...
I'm really not looking forward to tampering with my one worker node, which has been very stable for 6 months now. But I do need a new CentOS 7 worker, and I have no time...
Re: gearman troubles
Posted: Tue Jan 17, 2017 5:59 pm
by tgriep
At the end of October last year, a new version of Mod Gearman, 3.0, was released. Take a look at this link for more details.
https://github.com/sni/mod_gearman/blob/master/Changes
Re: gearman troubles
Posted: Wed Jan 18, 2017 6:49 am
by WillemDH
I was able to install 3.0 on the worker, but I'm having issues again. The weird thing is that after restarting the nagios and gearmand services on the XI server, it works for a few checks. But 20 minutes later, the 'last check' stops updating in XI and gearman_top2 starts showing a rising number of 'Jobs Running'.
Logs on the worker:
Code: Select all
[2017-01-18 12:14:55][21725][DEBUG] child started with pid: 21725
[2017-01-18 12:14:55][21724][DEBUG] child started with pid: 21724
[2017-01-18 12:15:06][20434][INFO ] no checks in 2minutes, restarting all workers
[2017-01-18 12:15:10][21890][DEBUG] child started with pid: 21890
[2017-01-18 12:15:10][21891][DEBUG] child started with pid: 21891
[2017-01-18 12:15:10][21894][DEBUG] child started with pid: 21894
[2017-01-18 12:15:10][21893][DEBUG] child started with pid: 21893
[2017-01-18 12:15:10][21892][DEBUG] child started with pid: 21892
[2017-01-18 12:15:26][21973][DEBUG] child started with pid: 21973
[2017-01-18 12:15:41][22064][DEBUG] child started with pid: 22064
[2017-01-18 12:15:41][22065][DEBUG] child started with pid: 22065
[2017-01-18 12:15:41][22068][DEBUG] child started with pid: 22068
[2017-01-18 12:15:41][22067][DEBUG] child started with pid: 22067
[2017-01-18 12:15:41][22066][DEBUG] child started with pid: 22066
[2017-01-18 12:15:42][22067][DEBUG] got service job: google.be_ext - NET_Ping
[2017-01-18 12:15:57][22166][DEBUG] child started with pid: 22166
[2017-01-18 12:16:12][22257][DEBUG] child started with pid: 22257
[2017-01-18 12:16:12][22258][DEBUG] child started with pid: 22258
[2017-01-18 12:16:12][22261][DEBUG] child started with pid: 22261
[2017-01-18 12:16:12][22260][DEBUG] child started with pid: 22260
[2017-01-18 12:16:12][22259][DEBUG] child started with pid: 22259
[2017-01-18 12:16:28][22366][DEBUG] child started with pid: 22366
[2017-01-18 12:16:43][22457][DEBUG] child started with pid: 22457
[2017-01-18 12:16:43][22458][DEBUG] child started with pid: 22458
[2017-01-18 12:16:43][22460][DEBUG] child started with pid: 22460
[2017-01-18 12:16:43][22461][DEBUG] child started with pid: 22461
[2017-01-18 12:16:43][22459][DEBUG] child started with pid: 22459
[2017-01-18 12:16:47][22459][DEBUG] got service job: google.be_ext - URL_Content
[2017-01-18 12:16:59][22559][DEBUG] child started with pid: 22559
[2017-01-18 12:17:14][22650][DEBUG] child started with pid: 22650
[2017-01-18 12:17:14][22651][DEBUG] child started with pid: 22651
[2017-01-18 12:17:14][22652][DEBUG] child started with pid: 22652
[2017-01-18 12:17:14][22653][DEBUG] child started with pid: 22653
[2017-01-18 12:17:18][22678][DEBUG] child started with pid: 22678
[2017-01-18 12:17:30][22755][DEBUG] child started with pid: 22755
[2017-01-18 12:17:45][22903][DEBUG] child started with pid: 22903
[2017-01-18 12:17:45][22904][DEBUG] child started with pid: 22904
[2017-01-18 12:17:45][22906][DEBUG] child started with pid: 22906
[2017-01-18 12:17:45][22905][DEBUG] child started with pid: 22905
[2017-01-18 12:17:49][22931][DEBUG] child started with pid: 22931
[2017-01-18 12:18:01][23004][DEBUG] child started with pid: 23004
[2017-01-18 12:18:16][23095][DEBUG] child started with pid: 23095
[2017-01-18 12:18:16][23096][DEBUG] child started with pid: 23096
[2017-01-18 12:18:16][23098][DEBUG] child started with pid: 23098
[2017-01-18 12:18:16][23097][DEBUG] child started with pid: 23097
[2017-01-18 12:18:20][23123][DEBUG] child started with pid: 23123
[2017-01-18 12:18:32][23199][DEBUG] child started with pid: 23199
[2017-01-18 12:18:47][23284][DEBUG] child started with pid: 23284
[2017-01-18 12:18:47][23285][DEBUG] child started with pid: 23285
[2017-01-18 12:18:47][23287][DEBUG] child started with pid: 23287
[2017-01-18 12:18:47][23286][DEBUG] child started with pid: 23286
[2017-01-18 12:18:48][20434][INFO ] no checks in 2minutes, restarting all workers
[2017-01-18 12:18:52][23318][DEBUG] child started with pid: 23318
[2017-01-18 12:18:52][23319][DEBUG] child started with pid: 23319
[2017-01-18 12:18:52][23320][DEBUG] child started with pid: 23320
[2017-01-18 12:18:52][23321][DEBUG] child started with pid: 23321
[2017-01-18 12:18:52][23322][DEBUG] child started with pid: 23322
[2017-01-18 12:19:03][23389][DEBUG] child started with pid: 23389
[2017-01-18 12:19:23][23517][DEBUG] child started with pid: 23517
[2017-01-18 12:19:23][23518][DEBUG] child started with pid: 23518
[2017-01-18 12:19:23][23521][DEBUG] child started with pid: 23521
[2017-01-18 12:19:23][23520][DEBUG] child started with pid: 23520
[2017-01-18 12:19:23][23519][DEBUG] child started with pid: 23519
[2017-01-18 12:19:34][23629][DEBUG] child started with pid: 23629
[2017-01-18 12:19:54][23756][DEBUG] child started with pid: 23756
[2017-01-18 12:19:54][23757][DEBUG] child started with pid: 23757
[2017-01-18 12:19:54][23760][DEBUG] child started with pid: 23760
[2017-01-18 12:19:54][23759][DEBUG] child started with pid: 23759
[2017-01-18 12:19:54][23758][DEBUG] child started with pid: 23758
[2017-01-18 12:20:05][23905][DEBUG] child started with pid: 23905
[2017-01-18 12:20:25][24026][DEBUG] child started with pid: 24026
[2017-01-18 12:20:25][24028][DEBUG] child started with pid: 24028
[2017-01-18 12:20:25][24030][DEBUG] child started with pid: 24030
[2017-01-18 12:20:25][24027][DEBUG] child started with pid: 24027
[2017-01-18 12:20:25][24029][DEBUG] child started with pid: 24029
[2017-01-18 12:20:36][24097][DEBUG] child started with pid: 24097
[2017-01-18 12:20:49][20434][INFO ] no checks in 2minutes, restarting all workers
[2017-01-18 12:20:53][24200][DEBUG] child started with pid: 24200
[2017-01-18 12:20:53][24201][DEBUG] child started with pid: 24201
[2017-01-18 12:20:53][24203][DEBUG] child started with pid: 24203
[2017-01-18 12:20:53][24204][DEBUG] child started with pid: 24204
[2017-01-18 12:20:53][24202][DEBUG] child started with pid: 24202
[2017-01-18 12:21:07][24289][DEBUG] child started with pid: 24289
[2017-01-18 12:21:24][24411][DEBUG] child started with pid: 24411
[2017-01-18 12:21:24][24412][DEBUG] child started with pid: 24412
[2017-01-18 12:21:24][24414][DEBUG] child started with pid: 24414
[2017-01-18 12:21:24][24415][DEBUG] child started with pid: 24415
[2017-01-18 12:21:24][24413][DEBUG] child started with pid: 24413
[2017-01-18 12:21:38][24482][DEBUG] child started with pid: 24482
[2017-01-18 12:21:55][24585][DEBUG] child started with pid: 24585
[2017-01-18 12:21:55][24586][DEBUG] child started with pid: 24586
[2017-01-18 12:21:55][24589][DEBUG] child started with pid: 24589
[2017-01-18 12:21:55][24588][DEBUG] child started with pid: 24588
[2017-01-18 12:21:55][24587][DEBUG] child started with pid: 24587
[2017-01-18 12:22:09][24674][DEBUG] child started with pid: 24674
[2017-01-18 12:22:26][24810][DEBUG] child started with pid: 24810
[2017-01-18 12:22:26][24811][DEBUG] child started with pid: 24811
[2017-01-18 12:22:26][24814][DEBUG] child started with pid: 24814
[2017-01-18 12:22:26][24813][DEBUG] child started with pid: 24813
[2017-01-18 12:22:26][24812][DEBUG] child started with pid: 24812
[2017-01-18 12:22:40][24924][DEBUG] child started with pid: 24924
[2017-01-18 12:22:50][20434][INFO ] no checks in 2minutes, restarting all workers
[2017-01-18 12:22:54][25041][DEBUG] child started with pid: 25041
[2017-01-18 12:22:54][25042][DEBUG] child started with pid: 25042
[2017-01-18 12:22:54][25043][DEBUG] child started with pid: 25043
[2017-01-18 12:22:54][25044][DEBUG] child started with pid: 25044
[2017-01-18 12:22:54][25045][DEBUG] child started with pid: 25045
[2017-01-18 12:23:11][25148][DEBUG] child started with pid: 25148
[2017-01-18 12:23:25][25233][DEBUG] child started with pid: 25233
[2017-01-18 12:23:25][25232][DEBUG] child started with pid: 25232
[2017-01-18 12:23:25][25236][DEBUG] child started with pid: 25236
[2017-01-18 12:23:25][25235][DEBUG] child started with pid: 25235
[2017-01-18 12:23:25][25234][DEBUG] child started with pid: 25234
[2017-01-18 12:23:42][25325][DEBUG] child started with pid: 25325
[2017-01-18 12:23:56][25410][DEBUG] child started with pid: 25410
[2017-01-18 12:23:56][25411][DEBUG] child started with pid: 25411
[2017-01-18 12:23:56][25414][DEBUG] child started with pid: 25414
[2017-01-18 12:23:56][25413][DEBUG] child started with pid: 25413
[2017-01-18 12:23:56][25412][DEBUG] child started with pid: 25412
[2017-01-18 12:24:13][25520][DEBUG] child started with pid: 25520
[2017-01-18 12:24:27][25606][DEBUG] child started with pid: 25606
[2017-01-18 12:24:27][25610][DEBUG] child started with pid: 25610
[2017-01-18 12:24:27][25607][DEBUG] child started with pid: 25607
[2017-01-18 12:24:27][25608][DEBUG] child started with pid: 25608
[2017-01-18 12:24:27][25609][DEBUG] child started with pid: 25609
[2017-01-18 12:24:44][25774][DEBUG] child started with pid: 25774
[2017-01-18 12:24:51][20434][INFO ] no checks in 2minutes, restarting all workers
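The pattern in the log above (many children started, very few jobs received, then a forced restart) can be tallied with a short script. This is only a minimal sketch: the line format is inferred from the excerpt above, `summarize` and `SAMPLE` are hypothetical names, and matching any "got ... job" message is an assumption, since only "got service job" appears in the excerpt.

```python
import re
from collections import Counter

# Line format inferred from the worker log excerpt, e.g.:
# [2017-01-18 12:15:06][20434][INFO ] no checks in 2minutes, restarting all workers
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<pid>\d+)\]\[(?P<level>[A-Z]+)\s*\]\s(?P<msg>.*)$"
)


def summarize(lines):
    """Count child starts, received jobs and forced restarts in a worker log."""
    counts = Counter()
    restart_times = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that does not look like a log line
        msg = match.group("msg")
        if msg.startswith("child started"):
            counts["child_started"] += 1
        elif re.match(r"got \w+ job", msg):
            # Assumption: other job types are logged in the same "got ... job" form.
            counts["jobs"] += 1
        elif "restarting all workers" in msg:
            counts["restarts"] += 1
            restart_times.append(match.group("ts"))
    return counts, restart_times


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
[2017-01-18 12:14:55][21725][DEBUG] child started with pid: 21725
[2017-01-18 12:15:06][20434][INFO ] no checks in 2minutes, restarting all workers
[2017-01-18 12:15:10][21890][DEBUG] child started with pid: 21890
[2017-01-18 12:15:42][22067][DEBUG] got service job: google.be_ext - NET_Ping
""".splitlines()

if __name__ == "__main__":
    counts, restart_times = summarize(SAMPLE)
    print(dict(counts))
    print(restart_times)
```

To run it against a real log, pass an open file object instead of `SAMPLE`. A high ratio of child starts to jobs would suggest the worker is idling and restarting rather than receiving work from the gearman server.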
For now I switched to using a clean check_nrpe and executing the http checks with a custom -H param. Really looking forward to the native workers.

Go Bryan.
Re: gearman troubles
Posted: Wed Jan 18, 2017 11:26 am
by tgriep
Do you see any errors in the gearman server or worker log files?
Did you update the gearman server to the 3.x version?
Re: gearman troubles
Posted: Wed Jan 18, 2017 12:42 pm
by WillemDH
Nope, I didn't... But that is exactly the problem. XI and the Mrtg CentOS 6 worker have been working very stably for more than half a year now... This mrtg worker node monitors all our network devices and is considered very critical.
I would have to find a period of time where I can snapshot both XI and the mrtg worker, then upgrade, test, and revert if it fails. I can't afford for this mrtg worker to become unstable (again).
@bheden
Just a quick question: will the planned 'native workers' be able to work alongside Gearman, or will implementing a 'native worker' imply that I have to migrate all our Gearman workers to it?
Re: gearman troubles
Posted: Wed Jan 18, 2017 2:43 pm
by bheden
@WillemDH,
You should be able to run them side by side. The queue functionality is extremely similar, so as long as they don't overlap (in which case whichever NEB module loaded first would win the rights to check that host/service/whatever) there shouldn't be a problem with it!
This is an interesting point that I didn't consider, though. I'll make sure that functionality is solid and documented by release time. Thanks!
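The "whichever NEB module loaded first wins" rule described above can be illustrated with a small sketch. Everything here is hypothetical: the module names, the `hostgroup:` labels and the `assign_checks` helper are invented for illustration and do not reflect the actual Mod Gearman or native worker configuration.

```python
def assign_checks(modules):
    """Resolve check ownership for modules given in NEB load order.

    modules: list of (module_name, set_of_claimed_objects).
    Returns (owner_by_object, overlaps), where overlaps maps an object
    to the later-loaded modules that also claimed it and lost.
    """
    owner = {}
    overlaps = {}
    for name, objects in modules:
        for obj in sorted(objects):
            if obj in owner:
                # An earlier-loaded module already owns this object.
                overlaps.setdefault(obj, []).append(name)
            else:
                owner[obj] = name
    return owner, overlaps


# Hypothetical load order: the Gearman module is loaded before the native worker.
load_order = [
    ("mod_gearman", {"hostgroup:network", "hostgroup:web"}),
    ("native_worker", {"hostgroup:web", "hostgroup:linux"}),
]

owner, overlaps = assign_checks(load_order)

if __name__ == "__main__":
    print(owner)
    print(overlaps)
```

In this example only `hostgroup:web` is claimed twice, so it stays with the module loaded first. Keeping the claimed sets disjoint avoids depending on load order at all.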