gearman : Too many open files

Posted: Wed Nov 30, 2016 11:58 am
by Sisa
Hello everybody,

We are running a distributed monitoring solution with Nagios and Gearman.

Sometimes we face the following issue:
In the Thruk interface, the Nagios hosts are shown as down, while the Nagios services keep the status they had before the issue.
The message for the host checks:

Code:

host check orphaned, is the mod-gearman worker on queue 'XXXXX' running?
It looks like this issue: https://support.nagios.com/forum/viewto ... =6&t=31202


In the /var/log/messages log file, we get error messages like these:

Code:

Nov 30 08:58:43 marmara nagios: Warning: The check of service 'Lustre OSS health' on host 'scratchopera-oss2spg01a' looks like it was orphaned (results never came back; last_check=1480441715; next_check=1480492003).  I'm scheduling an immediate check of the service...
Nov 30 08:58:43 marmara nagios: Warning: The check of service 'NTP' on host 'scratchopera-oss2spg01b' looks like it was orphaned (results never came back; last_check=1480440279; next_check=1480492003).  I'm scheduling an immediate check of the service...
Nov 30 08:58:43 marmara nagios: Warning: The check of service 'disks state' on host 'taiwan060' looks like it was orphaned (results never came back; last_check=1480441706; next_check=1480492003).  I'm scheduling an immediate check of the service...
Nov 30 08:58:43 marmara nagios: Warning: The check of service '[META] Service' on host 'vavau' looks like it was orphaned (results never came back; last_check=1480443879; next_check=1480492003).  I'm scheduling an immediate check of the service...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-1' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-3' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-4' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-6' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-7' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-8' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-e2600-a' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-qlogic-1' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-qlogic-2' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
In the gearmand log:

Code:

 576808   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
 576809   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
 576810   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
 576811   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
 576812   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
 576813   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
 576814   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691

Nagios version: Nagios Core 4.1.1
gearmand version: gearmand 0.33
Most common mod_gearman version: 1.4_nagios4 running on libgearman 1.1.12
The open-file limit for gearmand on the Nagios server (/etc/security/limits.conf):

Code:

gearmand		hard	nofile		22000
gearmand		soft	nofile		22000 
Maybe we should monitor the number of files opened by gearmand.
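One way to do that monitoring on Linux is to count the entries in /proc/<pid>/fd and compare them with the limit the running process actually has, which is listed in /proc/<pid>/limits. The effective limit is worth checking because /etc/security/limits.conf is applied through PAM sessions, so a daemon started from an init script may not pick it up. The following is a minimal sketch, not the script used in this thread; the function names and the output format are hypothetical.

```python
import os
import sys
import time


def parse_nofile_limit(limits_text):
    """Return (soft, hard) from the text of /proc/<pid>/limits.

    A value of None means the limit is reported as 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' line found")


def count_open_fds(pid):
    """Count the entries in /proc/<pid>/fd (needs root or the process owner)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__" and len(sys.argv) == 2:
    pid = int(sys.argv[1])  # assumption: the gearmand PID is passed as the only argument
    with open(f"/proc/{pid}/limits") as fh:
        soft, _hard = parse_nofile_limit(fh.read())
    used = count_open_fds(pid)
    # Hypothetical output format: timestamp;open fds;soft limit
    print(f"{time.strftime('%Y-%m-%d %H:%M:%S')};{used};{soft}")
```

Run from cron once a minute with the gearmand PID as argument, this would give a time series that can be graphed or alerted on, for example when the count passes some fraction of the soft limit.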

Our workaround is to restart gearmand and nagios core.

Sorry for my English. Could you please help?

Re: gearman : Too many open files

Posted: Wed Nov 30, 2016 1:10 pm
by dwhitfield
Mod_gearman is not our product, so we can only offer limited support. With that out of the way...

The error you are getting is due to a performance issue. Although much of it is not directly on point, https://support.nagios.com/kb/article.php?id=19 may be of use.

Our mod_gearman install script (http://assets.nagios.com/downloads/nagi ... Install.sh) uses 2.1.1-1. That script is written for XI, but the current version of XI uses Core 4.1.1, so I suspect it will work. Mod_gearman 2.0 had some performance improvements.

If the mod_gearman improvements aren't enough, updating Nagios Core to the 4.2.0 series brings some performance improvements as well. Of course, I'd suggest going straight to 4.2.3 to get the latest bug fixes.

Please let us know if you still have the issue after running through my suggestions.

Re: gearman : Too many open files

Posted: Thu Dec 01, 2016 7:30 am
by Sisa
Hello,

Thank you for your answer. We will apply your suggestions and let you know if we still have the issue.

We'll stay in touch.

Re: gearman : Too many open files

Posted: Thu Dec 01, 2016 10:40 am
by dwhitfield
My guess is the issue is severe enough that you do not want to wait, but at some point next week we should be releasing Core 4.2.4. Of course, you can always upgrade now and then upgrade later. :)

Re: gearman : Too many open files

Posted: Tue Dec 13, 2016 8:35 am
by Sisa
Hello dwhitfield,

We have updated Nagios Core to version 4.2.4 :)
We also set up monitoring of the number of files opened by gearmand.
So now we are waiting to see whether gearmand crashes again. Maybe, with the Nagios update, it won't crash anymore :lol:

Thank you again,
Sisavang

Re: gearman : Too many open files

Posted: Tue Dec 13, 2016 10:30 am
by dwhitfield
So, the next step if you run into crashes is to upgrade mod_gearman, but at some point we are going to be releasing our own off-loader, so depending on how big of a deal upgrading gearman is, you might just want to wait. Unfortunately, all I can see is that development is "done" on our off-loader and that we are currently testing it. I suspect it will come out with Nagios Core 5, but maybe 4.3? Maybe it'll be its own separate release? Too early to say. Just keep your eye on the upgrade notices and news on the front page of Core. :)

Re: gearman : Too many open files

Posted: Wed Jan 11, 2017 8:26 am
by Sisa
The gearmand daemon crashed last week. According to the monitoring I put in place, the maximum number of open files was reached.

Code:

05/01/17 02:31:02;7798
05/01/17 02:32:03;7146
05/01/17 02:33:02;7477
05/01/17 02:34:03;6946
05/01/17 02:35:02;8763
05/01/17 02:36:03;10030
05/01/17 02:37:04;11415
05/01/17 02:38:03;12179
05/01/17 02:39:03;13122
05/01/17 02:40:05;14095
05/01/17 02:41:05;14793
05/01/17 02:42:03;15421
05/01/17 02:43:04;16031
05/01/17 02:44:04;16358
05/01/17 02:45:04;16884
05/01/17 02:46:04;17043
05/01/17 02:47:03;17344
05/01/17 02:48:05;17461
05/01/17 02:49:04;17680
05/01/17 02:50:04;17807
05/01/17 02:51:04;17911
05/01/17 02:52:04;17967
05/01/17 02:53:05;18153
05/01/17 02:54:04;18220
05/01/17 02:55:04;18390
05/01/17 02:56:04;18468
05/01/17 02:57:04;18640
05/01/17 02:58:04;18676
05/01/17 02:59:05;18793
05/01/17 03:00:04;18905
05/01/17 03:01:05;19033
05/01/17 03:02:04;19017
05/01/17 03:03:04;19230
05/01/17 03:04:04;19277
05/01/17 03:05:04;19426
05/01/17 03:06:05;19567
05/01/17 03:07:04;19766
05/01/17 03:08:05;19819
05/01/17 03:09:04;19990
05/01/17 03:10:06;20155
05/01/17 03:11:04;20297
05/01/17 03:12:07;20366
05/01/17 03:13:04;20548
05/01/17 03:14:05;20645
05/01/17 03:15:05;20861
05/01/17 03:16:04;21028
05/01/17 03:17:05;21235
05/01/17 03:18:05;21334
05/01/17 03:19:04;21533
05/01/17 03:20:05;21632
05/01/17 03:21:04;21763
05/01/17 03:22:06;21786
05/01/17 03:23:04;21932
05/01/17 03:24:05;22015
05/01/17 03:25:04;22015
05/01/17 03:26:05;22015
05/01/17 03:27:04;22015
I have to check why, maybe a network outage. All the workers looked fine.
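To narrow down when to look in the other logs, the samples can be parsed to find the moment the count started climbing and the moment it hit the limit. In the data above, the count hovers around 7,000 to 7,800 until 02:34, then rises every minute and flattens at 22015 from 03:24, roughly 50 minutes later. The sketch below assumes the timestamp format is DD/MM/YY HH:MM:SS; the function names and the "five consecutive rises" rule are illustrative choices, not part of the original monitoring.

```python
from datetime import datetime


def parse_samples(text):
    """Parse 'DD/MM/YY HH:MM:SS;count' lines into (datetime, int) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stamp, count = line.split(";")
        # Assumption: day comes first in the date (05/01/17 = 5 January 2017).
        samples.append((datetime.strptime(stamp, "%d/%m/%y %H:%M:%S"), int(count)))
    return samples


def ramp_start(samples, run=5):
    """Timestamp of the first sample followed by `run` consecutive increases."""
    counts = [count for _, count in samples]
    for i in range(len(counts) - run):
        if all(counts[j] < counts[j + 1] for j in range(i, i + run)):
            return samples[i][0]
    return None


def first_at_or_above(samples, limit):
    """Timestamp of the first sample whose count reaches `limit`, or None."""
    for stamp, count in samples:
        if count >= limit:
            return stamp
    return None
```

With the posted data, `ramp_start` points at the 02:34:03 sample and `first_at_or_above(samples, 22000)` at 03:24:05, so the window just before 02:35 is where worker, network, and gearmand logs are most likely to show what changed.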

About your offloader, I can't wait to test it!

Re: gearman : Too many open files

Posted: Wed Jan 11, 2017 12:06 pm
by dwhitfield
Sisa wrote: I have to check why, maybe a network outage.
Please let us know if you need any help digging through logs.
Sisa wrote: About your offloader, I can't wait to test it!
So, I can give some more info about this. No guarantees, but here's the plan:
Nagios Core 4.3 is currently scheduled to be released around April of 2017. This will primarily be an enhancement release. Some of the changes projected to be in this release are:
Native remote workers
More info at https://www.nagios.com/roadmaps/

Re: gearman : Too many open files

Posted: Wed Jan 18, 2017 4:41 am
by Sisa
Hello,

Thank you for your help.
I will check the workers' logs. I also think I found a post on the gearman Google group that concerns our issue:
https://groups.google.com/forum/#!searc ... BAXQi6DQAJ

But the problem doesn't seem to be solved in that discussion.
Moreover, even if I increase the maximum number of open files on the system, I think gearmand will still reach it, because the root cause is unknown.

Re: gearman : Too many open files

Posted: Wed Jan 18, 2017 3:17 pm
by rkennedy
Are any checks running that could potentially not be closing files properly? I imagine there are quite a few files in use. You might be able to track it down with ls -l /proc/<pid>/fd

I've seen a memory leak in the past, but not a 'file leak'. A single plugin could be the culprit, to be honest.
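With some 20,000 descriptors, the raw ls output is hard to read, so it may help to group the symlink targets in /proc/<pid>/fd by kind first. Since the gearmand error comes from accept(), one guess is that most of them are sockets (client or worker connections) rather than regular files, but that is an assumption to verify. A minimal sketch, with hypothetical function names:

```python
import os
import sys
from collections import Counter


def classify_target(target):
    """Map a /proc/<pid>/fd symlink target to a coarse kind."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    return "file"


def summarize_fds(pid):
    """Count the open descriptors of `pid` by kind (needs root or the owner)."""
    kinds = Counter()
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        kinds[classify_target(target)] += 1
    return kinds


if __name__ == "__main__" and len(sys.argv) == 2:
    # Assumption: the gearmand PID is passed as the only argument.
    for kind, count in summarize_fds(int(sys.argv[1])).most_common():
        print(f"{kind}\t{count}")
```

If sockets dominate, the next step would be to see which peers hold the connections (for example with ss or netstat); if regular files dominate, the paths themselves should point at the component leaving them open.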