gearman : Too many open files

Postby Sisa » Wed Nov 30, 2016 11:58 am

Hello everybody,

We are running a distributed monitoring setup with Nagios and Gearman.

Sometimes we face the following issue:
In the Thruk interface, the Nagios hosts are shown as down, while the Nagios services keep the status they had before the issue.
The message for the host check is:
host check orphaned, is the mod-gearman worker on queue 'XXXXX' running?


It looks like this issue : viewtopic.php?f=6&t=31202


In the /var/log/messages log file, we get error messages like these:
Nov 30 08:58:43 marmara nagios: Warning: The check of service 'Lustre OSS health' on host 'scratchopera-oss2spg01a' looks like it was orphaned (results never came back; last_check=1480441715; next_check=1480492003).  I'm scheduling an immediate check of the service...
Nov 30 08:58:43 marmara nagios: Warning: The check of service 'NTP' on host 'scratchopera-oss2spg01b' looks like it was orphaned (results never came back; last_check=1480440279; next_check=1480492003).  I'm scheduling an immediate check of the service...
Nov 30 08:58:43 marmara nagios: Warning: The check of service 'disks state' on host 'taiwan060' looks like it was orphaned (results never came back; last_check=1480441706; next_check=1480492003).  I'm scheduling an immediate check of the service...
Nov 30 08:58:43 marmara nagios: Warning: The check of service '[META] Service' on host 'vavau' looks like it was orphaned (results never came back; last_check=1480443879; next_check=1480492003).  I'm scheduling an immediate check of the service...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-1' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-3' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-4' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-6' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-7' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-cisco-8' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-e2600-a' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-qlogic-1' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
Nov 30 08:59:42 marmara nagios: Warning: The check of host 'oleron-qlogic-2' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...


In the gearmand log:
576808   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
576809   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
576810   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
576811   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
576812   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
576813   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691
576814   ERROR 2016-10-29 17:34:40.000000 [  main ] accept(Too many open files) -> libgearman-server/gearmand.cc:691



Nagios version : Nagios Core 4.1.1
gearmand version : gearmand 0.33
mod_gearman version (most installations): 1.4_nagios4 running on libgearman 1.1.12
The open-file limit for gearmand on the Nagios server (/etc/security/limits.conf):
gearmand      hard   nofile      22000
gearmand      soft   nofile      22000
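
A side note on checking that this limit is really the one in force: limits.conf is applied by pam_limits to PAM sessions, so a daemon started by an init script may not pick it up. The sketch below reads the "Max open files" row of /proc/<pid>/limits for a running process. It assumes a Linux /proc filesystem; the function names are made up for illustration and are not part of gearmand or Nagios.

```python
from pathlib import Path


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits.

    Values are returned as strings because a limit can be 'unlimited'.
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Row layout: name, soft limit, hard limit, units
            fields = line[len(label):].split()
            return fields[0], fields[1]
    raise ValueError("no 'Max open files' row found")


def effective_nofile(pid):
    """Read the limits of a live process (hypothetical helper, Linux only)."""
    return parse_max_open_files(Path(f"/proc/{pid}/limits").read_text())
```

Calling effective_nofile() with the gearmand PID should return ('22000', '22000') if the limits.conf entries above are being applied to the daemon.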


Maybe we should monitor the number of files opened by gearmand.

Our workaround is to restart gearmand and Nagios Core.

Sorry for my English. Could you please help?
Sisa
 
Posts: 8
Joined: Mon Oct 10, 2016 7:44 am


Postby dwhitfield » Wed Nov 30, 2016 1:10 pm

Mod_gearman is not our product, so we can only offer limited support. With that out of the way...

The error you are getting is due to a performance issue. Although much of it is not directly on point, https://support.nagios.com/kb/article.php?id=19 may be of use.

Our mod_gearman install script (http://assets.nagios.com/downloads/nagi ... Install.sh) uses 2.1.1-1. That script is written for XI, but the current version of XI uses Core 4.1.1, so I suspect it will work. Gearman 2.0 had some performance improvements.

If the mod_gearman improvements aren't enough, you can get some further performance improvements by updating Nagios Core to the 4.2 series. Of course, I'd suggest going straight to 4.2.3 to get the latest bug fixes.

Please let us know if you still have the issue after running through my suggestions.
Be sure to check out our Knowledgebase for helpful articles and solutions!
dwhitfield
The Doctor
 
Posts: 2114
Joined: Wed Sep 21, 2016 10:29 am
Location: Nagios Enterprises, LLC


Postby Sisa » Thu Dec 01, 2016 7:30 am

Hello,

Thank you for your answer. We will try your suggestions and let you know if we still have the issue.

We'll keep in touch.

Postby dwhitfield » Thu Dec 01, 2016 10:40 am

My guess is the issue is severe enough that you do not want to wait, but at some point next week we should be releasing Core 4.2.4. Of course, you can always upgrade now and then upgrade later. :)

Postby Sisa » Tue Dec 13, 2016 8:35 am

Hello dwhitfield,

We have updated Nagios Core to version 4.2.4 :)
We also set up monitoring of the number of files opened by gearmand.
So now we are waiting for the next gearmand crash. Maybe, with the Nagios update, it won't crash anymore :lol:
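
For anyone who wants to set up similar monitoring, here is a minimal sketch of what it could look like. It assumes Linux, where each open descriptor of a process appears as an entry under /proc/<pid>/fd, and it has to run as root or as the user that owns gearmand. The function names and the "DD/MM/YY HH:MM:SS;count" record format are just one possible choice, not anything provided by gearmand or Nagios.

```python
import os
import time


def count_open_fds(pid, proc_root="/proc"):
    """Number of file descriptors currently open by process `pid`."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def sample_line(pid, now=None, proc_root="/proc"):
    """Build one 'DD/MM/YY HH:MM:SS;count' record for a log file.

    `now` is a Unix timestamp; None means the current time.
    """
    stamp = time.strftime("%d/%m/%y %H:%M:%S", time.localtime(now))
    return f"{stamp};{count_open_fds(pid, proc_root)}"
```

Run from cron once a minute with the gearmand PID and append the result to a file; that gives a history to look at after the next crash.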

Thank you again,
Sisavang

Postby dwhitfield » Tue Dec 13, 2016 10:30 am

So, the next step if you run into crashes is to upgrade mod_gearman, but at some point we are going to be releasing our own off-loader, so depending on how big of a deal upgrading gearman is, you might just want to wait. Unfortunately, all I can see is that development is "done" on our off-loader and we are currently testing. I suspect it will come out with Nagios Core 5, but maybe 4.3? Maybe it'll be its own separate release? Too early to say. Just keep your eye on the upgrade notice and news on the front page of Core. :)

Postby Sisa » Wed Jan 11, 2017 8:26 am

The gearmand daemon crashed last week. According to the monitoring I put in place, the maximum number of open files was reached.

05/01/17 02:31:02;7798
05/01/17 02:32:03;7146
05/01/17 02:33:02;7477
05/01/17 02:34:03;6946
05/01/17 02:35:02;8763
05/01/17 02:36:03;10030
05/01/17 02:37:04;11415
05/01/17 02:38:03;12179
05/01/17 02:39:03;13122
05/01/17 02:40:05;14095
05/01/17 02:41:05;14793
05/01/17 02:42:03;15421
05/01/17 02:43:04;16031
05/01/17 02:44:04;16358
05/01/17 02:45:04;16884
05/01/17 02:46:04;17043
05/01/17 02:47:03;17344
05/01/17 02:48:05;17461
05/01/17 02:49:04;17680
05/01/17 02:50:04;17807
05/01/17 02:51:04;17911
05/01/17 02:52:04;17967
05/01/17 02:53:05;18153
05/01/17 02:54:04;18220
05/01/17 02:55:04;18390
05/01/17 02:56:04;18468
05/01/17 02:57:04;18640
05/01/17 02:58:04;18676
05/01/17 02:59:05;18793
05/01/17 03:00:04;18905
05/01/17 03:01:05;19033
05/01/17 03:02:04;19017
05/01/17 03:03:04;19230
05/01/17 03:04:04;19277
05/01/17 03:05:04;19426
05/01/17 03:06:05;19567
05/01/17 03:07:04;19766
05/01/17 03:08:05;19819
05/01/17 03:09:04;19990
05/01/17 03:10:06;20155
05/01/17 03:11:04;20297
05/01/17 03:12:07;20366
05/01/17 03:13:04;20548
05/01/17 03:14:05;20645
05/01/17 03:15:05;20861
05/01/17 03:16:04;21028
05/01/17 03:17:05;21235
05/01/17 03:18:05;21334
05/01/17 03:19:04;21533
05/01/17 03:20:05;21632
05/01/17 03:21:04;21763
05/01/17 03:22:06;21786
05/01/17 03:23:04;21932
05/01/17 03:24:05;22015
05/01/17 03:25:04;22015
05/01/17 03:26:05;22015
05/01/17 03:27:04;22015
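
To put numbers on the samples above (the count sits around 7,000, starts climbing at 02:35 and flattens at 22015 from 03:24), a small sketch like this can compute the growth between consecutive samples. It assumes the records are in "DD/MM/YY HH:MM:SS;count" form as listed; the function names are only illustrative.

```python
from datetime import datetime


def parse_samples(text):
    """Parse 'DD/MM/YY HH:MM:SS;count' lines into (datetime, int) pairs."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stamp, count = line.split(";")
        samples.append((datetime.strptime(stamp, "%d/%m/%y %H:%M:%S"), int(count)))
    return samples


def growth_per_minute(samples):
    """Descriptors gained per minute between each pair of consecutive samples."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        minutes = (t1 - t0).total_seconds() / 60
        if minutes <= 0:
            continue  # skip duplicate or out-of-order timestamps
        rates.append((t1, (c1 - c0) / minutes))
    return rates
```

The first timestamp with a large positive rate is the one to match against the worker and network logs.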


I have to find out why; maybe a network outage. All the workers looked fine.

About your offloader, I can't wait to test it!

Postby dwhitfield » Wed Jan 11, 2017 12:06 pm

Sisa wrote: I have to check why, maybe a network outage.


Please let us know if you need any help digging through logs.

Sisa wrote: About your offloader, I can't wait for testing it !


So, I can give some more info about this. No guarantees, but here's the plan:

Nagios Core 4.3 is currently scheduled to be released around April of 2017. This will primarily be an enhancement release. Some of the changes projected to be in this release are:
Native remote workers


More info at https://www.nagios.com/roadmaps/

Postby Sisa » Wed Jan 18, 2017 4:41 am

Hello,

Thank you for your help.
I will check the workers' logs. I think I found a post on the Gearman Google group that concerns our issue:
https://groups.google.com/forum/#!searchin/gearman/too$20many$20open$20files|sort:relevance/gearman/AxkDVxIxMj0/LFBAXQi6DQAJ

But the problem doesn't seem to be solved in that discussion.
Moreover, even if I increase the maximum number of open files on the system, I think gearmand will still reach it, because the root cause is unknown.

Postby rkennedy » Wed Jan 18, 2017 3:17 pm

Are any checks running that could potentially not be closing files properly? I imagine there are quite a few files in use; you might be able to track it down with ls -l /proc/<pid>/fd
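
With thousands of entries the ls output is hard to read, so here is a rough sketch that groups the descriptors by what they point to. It assumes Linux, where each entry in /proc/<pid>/fd is a symlink whose target looks like socket:[...], pipe:[...], anon_inode:[...] or a file path, and it needs root or the gearmand user. The function names are made up for illustration. Since the gearmand errors come from accept(), my guess is that most of them are sockets, but that is only an assumption to verify.

```python
import os
from collections import Counter


def classify(target):
    """Bucket an fd symlink target as shown by ls -l /proc/<pid>/fd."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    return "file"


def fd_summary(pid):
    """Count the open descriptors of `pid` per bucket."""
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        counts[classify(target)] += 1
    return counts
```

If the socket count is the one that grows, the next thing to look at is which peers those connections belong to.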

I've seen a memory leak in the past, but not a 'file leak'. To be honest, a single plugin could be the culprit.
rkennedy
 
Posts: 6545
Joined: Mon Oct 05, 2015 11:45 am
