gearman : Too many open files

Sisa
Posts: 10
Joined: Mon Oct 10, 2016 7:44 am

gearman : Too many open files

Post by Sisa »

Hello everybody,

We are running a distributed monitoring setup with Nagios and Mod-Gearman.

Sometimes we run into the following issue:
In the Thruk interface, the Nagios hosts show as down while the Nagios services keep the status they had before the issue.
The message for the host check:

host check orphaned, is the mod-gearman worker on queue 'XXXXX' running?

It looks like this issue: https://support.nagios.com/forum/viewto ... =6&t=31202


In /var/log/messages, we get error messages like these:

Nov 30 08:58:43 marmara nagios: Warning: The check of service 'Lustre OSS health' on host 'scratchopera-oss2spg01a' looks like it was orphaned (results never came back; last_check=1480441715; next_check=1480492003).  I'm scheduling an immediate check of the service...
Nov 30 08:58:43 marmara nagios: Warning: The check of service 'NTP' on host 'scratchopera-oss2spg01b' looks like it was orphaned (results never came back; last_check=1480440279; next_check=1480492003).  I'm scheduling an immediate check of the service...
Nov 30 08:58:43 marmara nagios: Warning: The check of service 'disks state' on host 'taiwan060' looks like it was orphaned (results never came back; last_check=1480441706; next_check=1480492003).  I'm scheduling an immediate check of the service...
Nov 30 08:58:43 marmara nagios: Warning: The check of service '[META] Service' on host 'vavau' looks like it was orphaned (results never came back; last_check=1480443879; next_check=1480492003).  I'm scheduling an immediate check of the service...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-1' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-3' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-4' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-6' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-7' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-8' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-e2600-a' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-qlogic-1' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-qlogic-2' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...

In the gearmand log:

 576808   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
 576809   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
 576810   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
 576811   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
 576812   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
 576813   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
 576814   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691

Nagios version: Nagios Core 4.1.1
gearmand version: gearmand 0.33
Most common mod_gearman version: 1.4_nagios4 running on libgearman 1.1.12

The open-file limit for gearmand on the Nagios server (/etc/security/limits.conf):

gearmand		hard	nofile		22000
gearmand		soft	nofile		22000 

Maybe we need to monitor the number of files gearmand has open.
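
One way to do that monitoring, as a minimal sketch rather than a tested tool: on Linux, each open descriptor of a process appears as an entry under /proc/<pid>/fd, and the limit actually in force for the running process is listed in /proc/<pid>/limits. The second check is worth doing because limits.conf is applied through PAM sessions, so a daemon started by an init script does not necessarily get the values configured there. The function names and the proc_root parameter are made up for illustration, and reading another user's process normally requires root or that user's privileges.

```python
import os
from typing import Optional


def open_fd_count(pid: int, proc_root: str = "/proc") -> int:
    """Count the descriptors currently open in process `pid`."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def soft_nofile_limit(pid: int, proc_root: str = "/proc") -> Optional[int]:
    """Return the soft 'Max open files' limit in force for `pid`.

    Returns None if the limit is unlimited or the line is missing.
    """
    path = os.path.join(proc_root, str(pid), "limits")
    with open(path) as handle:
        for line in handle:
            if line.startswith("Max open files"):
                # Columns: "Max open files", soft limit, hard limit, units.
                soft = line.split()[3]
                return None if soft == "unlimited" else int(soft)
    return None
```

Calling both functions once a minute (from cron, for example) with the gearmand PID and logging the two numbers side by side would show both how close the daemon is to the limit and whether the 22000 from limits.conf is really the value it runs with.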

Our workaround is to restart gearmand and Nagios Core.

Sorry for my English. Could you please help?
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: gearman : Too many open files

Post by dwhitfield »

Mod_gearman is not our product, so we can only offer limited support. With that out of the way...

The error you are getting is due to a performance issue. Although much of it is not directly on point, https://support.nagios.com/kb/article.php?id=19 may be of use.

Our mod_gearman install script (http://assets.nagios.com/downloads/nagi ... Install.sh) uses 2.1.1-1. That script is written for XI, but the current version of XI uses Core 4.1.1, so I suspect it will work. Gearman 2.0 had some performance improvements.

If the mod_gearman improvements aren't enough, updating Nagios Core to the 4.2 series also brings some performance improvements. Of course, I'd suggest going straight to 4.2.3 to get the latest bug fixes.

Please let us know if you still have the issue after running through my suggestions.
Sisa
Posts: 10
Joined: Mon Oct 10, 2016 7:44 am

Re: gearman : Too many open files

Post by Sisa »

Hello,

Thank you for your answer. We will try your suggestions and let you know if we still have the issue.

We'll stay in touch.
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: gearman : Too many open files

Post by dwhitfield »

My guess is the issue is severe enough that you do not want to wait, but at some point next week we should be releasing Core 4.2.4. Of course, you can always upgrade now and then upgrade later. :)
Sisa
Posts: 10
Joined: Mon Oct 10, 2016 7:44 am

Re: gearman : Too many open files

Post by Sisa »

Hello dwhitfield,

We have updated Nagios Core to version 4.2.4 :)
We have also set up monitoring of the number of files opened by gearmand.
So now we are waiting for the next gearmand crash. Maybe, with the Nagios update, it won't crash anymore :lol:

Thank you again,
Sisavang
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: gearman : Too many open files

Post by dwhitfield »

So, the next step if you run into crashes is to upgrade mod_gearman, but at some point we are going to be releasing our own off-loader, so depending on how big of a deal upgrading gearman is, you might just want to wait. Unfortunately, all I can see is that development is "done" on our off-loader and we are currently testing it. I suspect it will come out with Nagios Core 5, but maybe 4.3? Maybe it'll be its own separate release? Too early to say. Just keep your eye on the upgrade notice and news on the front page of Core. :)
Sisa
Posts: 10
Joined: Mon Oct 10, 2016 7:44 am

Re: gearman : Too many open files

Post by Sisa »

The gearmand daemon crashed last week. According to the monitoring I put in place, the maximum number of open files was reached.

05/01/17 02:31:02;7798
05/01/17 02:32:03;7146
05/01/17 02:33:02;7477
05/01/17 02:34:03;6946
05/01/17 02:35:02;8763
05/01/17 02:36:03;10030
05/01/17 02:37:04;11415
05/01/17 02:38:03;12179
05/01/17 02:39:03;13122
05/01/17 02:40:05;14095
05/01/17 02:41:05;14793
05/01/17 02:42:03;15421
05/01/17 02:43:04;16031
05/01/17 02:44:04;16358
05/01/17 02:45:04;16884
05/01/17 02:46:04;17043
05/01/17 02:47:03;17344
05/01/17 02:48:05;17461
05/01/17 02:49:04;17680
05/01/17 02:50:04;17807
05/01/17 02:51:04;17911
05/01/17 02:52:04;17967
05/01/17 02:53:05;18153
05/01/17 02:54:04;18220
05/01/17 02:55:04;18390
05/01/17 02:56:04;18468
05/01/17 02:57:04;18640
05/01/17 02:58:04;18676
05/01/17 02:59:05;18793
05/01/17 03:00:04;18905
05/01/17 03:01:05;19033
05/01/17 03:02:04;19017
05/01/17 03:03:04;19230
05/01/17 03:04:04;19277
05/01/17 03:05:04;19426
05/01/17 03:06:05;19567
05/01/17 03:07:04;19766
05/01/17 03:08:05;19819
05/01/17 03:09:04;19990
05/01/17 03:10:06;20155
05/01/17 03:11:04;20297
05/01/17 03:12:07;20366
05/01/17 03:13:04;20548
05/01/17 03:14:05;20645
05/01/17 03:15:05;20861
05/01/17 03:16:04;21028
05/01/17 03:17:05;21235
05/01/17 03:18:05;21334
05/01/17 03:19:04;21533
05/01/17 03:20:05;21632
05/01/17 03:21:04;21763
05/01/17 03:22:06;21786
05/01/17 03:23:04;21932
05/01/17 03:24:05;22015
05/01/17 03:25:04;22015
05/01/17 03:26:05;22015
05/01/17 03:27:04;22015

I have to check why, maybe a network outage. All the workers looked fine.
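
The samples above can also be read programmatically to see how much warning an alert threshold would have given. This is only a sketch: it assumes the timestamps are in day/month/year order, it uses the 22000 from limits.conf as the limit, and the 80 % warning level is an arbitrary choice for illustration. Only a short excerpt of the samples is embedded.

```python
from datetime import datetime

# Excerpt of the samples posted above (timestamp;open file count).
SAMPLES = """\
05/01/17 02:48:05;17461
05/01/17 02:49:04;17680
05/01/17 03:23:04;21932
05/01/17 03:24:05;22015
05/01/17 03:25:04;22015
"""

LIMIT = 22000                      # nofile value from limits.conf
WARN_AT = LIMIT * 80 // 100        # hypothetical warning level (80 %)


def parse_samples(text):
    """Turn 'DD/MM/YY HH:MM:SS;count' lines into (datetime, int) pairs."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stamp, count = line.split(";")
        # Assumption: the date is written day/month/year.
        when = datetime.strptime(stamp, "%d/%m/%y %H:%M:%S")
        samples.append((when, int(count)))
    return samples


def first_at_or_above(samples, threshold):
    """Return the timestamp of the first sample >= threshold, or None."""
    for when, count in samples:
        if count >= threshold:
            return when
    return None


samples = parse_samples(SAMPLES)
warned = first_at_or_above(samples, WARN_AT)
full = first_at_or_above(samples, LIMIT)
print("warning level crossed:", warned)
print("limit reached:", full)
print("lead time:", full - warned)
```

On this data the 80 % level is first crossed by the 02:49:04 sample and the count stops rising at 03:24:05, so a Nagios check on the descriptor count would have left roughly 35 minutes to react before gearmand stopped accepting connections.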

About your off-loader, I can't wait to test it!
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: gearman : Too many open files

Post by dwhitfield »

Sisa wrote: I have to check why, maybe a network outage.

Please let us know if you need any help digging through logs.

Sisa wrote: About your off-loader, I can't wait to test it!

So, I can give some more info about this. No guarantees, but here's the plan:

Nagios Core 4.3 is currently scheduled to be released around April of 2017. This will primarily be an enhancement release. Some of the changes projected to be in this release are:
Native remote workers

More info at https://www.nagios.com/roadmaps/
Sisa
Posts: 10
Joined: Mon Oct 10, 2016 7:44 am

Re: gearman : Too many open files

Post by Sisa »

Hello,

Thank you for your help.
I will check the workers' logs. I think I have also found a post on the gearman Google group that concerns our issue:
https://groups.google.com/forum/#!searc ... BAXQi6DQAJ

But the problem doesn't seem to be solved in that discussion.
Moreover, even if I increase the maximum number of open files on the system, I think gearmand will still reach it, because the root cause is unknown.
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: gearman : Too many open files

Post by rkennedy »

Are any checks running that could potentially not be closing their files properly? I imagine there are quite a few files in use. You might be able to track it down with ls -l /proc/<pid>/fd
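
With around 22000 entries, the ls output is hard to read by eye, so a small summary by descriptor type may help. The sketch below assumes Linux, where each entry in /proc/<pid>/fd is a symlink whose target looks like socket:[inode], pipe:[inode], anon_inode:[...] or a regular path. The function names and the proc_root parameter are hypothetical. Since the gearmand error comes from accept(), a count dominated by sockets would point at connections that are never closed rather than at ordinary files, but that is a guess until the numbers are in.

```python
import os
from collections import Counter


def classify(target: str) -> str:
    """Map a /proc/<pid>/fd symlink target to a coarse descriptor type."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    return "file"


def fd_summary(pid: int, proc_root: str = "/proc") -> Counter:
    """Count the open descriptors of `pid`, grouped by type."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    kinds = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        kinds[classify(target)] += 1
    return kinds
```

Running fd_summary() on the gearmand PID while the count is climbing, and again a few minutes later, would show which type is growing.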

I've seen a memory leak in the past, but not a 'file leak'. A single plugin could be the culprit, to be honest.
Former Nagios Employee