Re: return code 127 on test that seems invalid

Posted: Tue Dec 04, 2012 5:55 pm
by ucemike
Additional testing. Created a test "check_test" that will output user running the script and the ulimit -a output.

User running it was nagios, however the ulimit output indicated it was not making use of the /etc/security/limits.conf changes I had made.

Code:

nagios          soft    nproc           14000
nagios          soft    nofile          165000
nagios          hard    nproc           14000
nagios          hard    nofile          165000

(These are well above the values they needed to be, but I was testing.)

When su'd as nagios, ulimit -a showed this (as expected):

Code:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 191979
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 165000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 14000
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

The check_test script, however, showed these results:

Code:

Whoami=nagios

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 191979
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 191979
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

As you can see, check_test reports 1024 as the open files limit... I am going to do some digging around and see if I can find out why this is.
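
The check_test script itself was not posted. A minimal sketch of a check along the lines described above might look like the following; the function name report_limits is hypothetical, and the real script may have differed. It prints the effective user and the limits inherited by the shell that runs the check.

```shell
#!/bin/bash
# Hypothetical stand-in for check_test (the original script was not posted).
# Prints the user running the check and the resource limits that this
# shell inherited from whatever process started it.
report_limits() {
    echo "Whoami=$(whoami)"
    echo
    ulimit -a
}

report_limits
```

Run from an interactive su session and again as a Nagios check, the two outputs can be compared line by line, which is how the 1024 versus 165000 difference shows up.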

Re: return code 127 on test that seems invalid

Posted: Tue Dec 04, 2012 6:10 pm
by jsmurphy
I think you are right about it being a cap. Change to the nagios user and type "ulimit -u", which will tell you the number of processes the user is allowed to spawn. You can edit these limits in /etc/security/limits.conf.

And... I'm going to stop typing because you already updated your post doing exactly this :)

I'm not sure whether ulimit changes take effect immediately on a running session. Maybe get your check_test to run "ulimit -u 165000", which should force it to update the limit.

Re: return code 127 on test that seems invalid

Posted: Tue Dec 04, 2012 6:15 pm
by ucemike
I made the limits.conf changes a couple of weeks back and, to make sure, rebooted the host before running the above tests today. I am still getting the same ulimit output after that ;(

Still poking around to see what can be done. This is quite the needle.
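
One way to see which limits a running process actually has, as opposed to what a fresh login shell gets, is to read /proc/<pid>/limits on Linux kernels that provide that file. Background that may be relevant here: /etc/security/limits.conf is applied by pam_limits when a PAM session is opened, so a daemon started from an init script at boot does not necessarily pick it up. The sketch below uses a hypothetical helper name, show_limits, and takes the PID as an argument; how to find the Nagios PID on a given host (for example with pidof nagios) is an assumption about the local setup.

```shell
#!/bin/bash
# Hypothetical helper: print the open-files and process limits of a
# running process by PID, read from /proc/<pid>/limits (Linux only).
show_limits() {
    local pid="$1"
    grep -E '^Max (open files|processes)' "/proc/${pid}/limits"
}

# Example: inspect the current shell. For the daemon, pass its PID instead.
show_limits "$$"
```

The columns printed are the soft limit, the hard limit, and the unit.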

Re: return code 127 on test that seems invalid

Posted: Tue Dec 04, 2012 6:37 pm
by jsmurphy
I'm quickly devolving into "wild speculation" mode here, but I wonder if it has something to do with a limit on the number of available file descriptors... Have a look at sections 3.2 to 3.3.1 and 3.4 to 3.4.2: https://access.redhat.com/knowledge/doc ... uning.html

I also found this, which outlines this exact problem: https://access.redhat.com/knowledge/solutions/38043 but it only seems to relate to xinetd.

Re: return code 127 on test that seems invalid

Posted: Wed Dec 05, 2012 11:20 am
by ucemike
I tried using "/sbin/runuser - nagios -c "$NagiosBin -d $NagiosCfgFile"" in the /etc/init.d/rc.d/nagios startup script but it did not resolve the problem.

Based on the Red Hat fix, it seems they moved from select() to poll(). I did some grepping of the source and found both select() and poll() in the Nagios source. Not knowing just where the calls are made to run a commands.conf entry, I am not sure which is used. I am guessing that it is using select(), since the problem appears to be exactly the same (1024 open file limit).

There are checks for sys/poll.h and it does exist on the system but I am unsure how to verify it actually uses that in the build once completed.

config.status had:
${ac_dA}HAVE_SYS_POLL_H${ac_dB}HAVE_SYS_POLL_H${ac_dC}1${ac_dD}
and
${ac_uA}HAVE_SYS_POLL_H${ac_uB}HAVE_SYS_POLL_H${ac_uC}1${ac_uD}

Which seems to indicate it thinks it does have poll and uses it?

Re: return code 127 on test that seems invalid

Posted: Wed Dec 05, 2012 5:17 pm
by jsmurphy
This is going further into the source than I have ever delved. I think it might be time to escalate this one to the mighty grey-beards on the nagios-devel mailing list (nobody tell Andreas I said that). Maybe for some reason it's not detecting poll() and is falling back on select()? But your excerpt from config.status seems to suggest otherwise...

At any rate the mailing list is here: https://lists.sourceforge.net/lists/lis ... gios-devel

Re: return code 127 on test that seems invalid

Posted: Wed Jan 02, 2013 3:39 pm
by ucemike
So, per some suggestions in another forum, I added the ulimit commands to the /etc/init.d/nagios script, and the check_test script now reports the updated values.

That would be wonderful news if it had actually stopped the 127 errors, but it has not. The only way I can still fix that is to reduce the number of service tests below the 2770-2780 threshold.

The dev forum seems to have dried up on suggestions.
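
The exact ulimit commands added to the init script were not posted. A sketch of what such an addition might look like is below; the function name raise_limits is hypothetical, and the default values are simply the ones from the limits.conf entries earlier in the thread. It would need to run as root, before the line that starts the daemon.

```shell
#!/bin/bash
# Hypothetical excerpt for an init script such as /etc/init.d/nagios.
# With neither -S nor -H given, bash's ulimit sets both the soft and the
# hard limit. Failures are reported but do not abort the script.
raise_limits() {
    local nofile="${1:-165000}"   # open files, value from limits.conf above
    local nproc="${2:-14000}"     # max user processes, value from limits.conf above
    ulimit -n "$nofile" 2>/dev/null || echo "could not set open files limit to $nofile" >&2
    ulimit -u "$nproc" 2>/dev/null || echo "could not set max user processes to $nproc" >&2
}

# Usage inside the init script, before starting the daemon:
#   raise_limits
```

Because limits are inherited by child processes, anything the daemon spawns afterwards, including checks, should report the new values.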

Re: return code 127 on test that seems invalid

Posted: Fri Jan 04, 2013 3:40 pm
by ucemike
So, I was poking around and found some info on large site tuning, and was fiddling with the settings to see if I could fix my problem and... it did. However, it makes NO sense to me why it did.

Setting in nagios.cfg:
enable_environment_macros=0

With that set to 0, the 127 errors stopped. When I turned it back on, they started up again.

What the heck?
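
One possible way to investigate: when enable_environment_macros is on, Nagios passes its macros to each check as environment variables whose names start with NAGIOS_. Measuring how many such variables a check receives, and roughly how many bytes they take up, may show whether the size of the environment is a factor. This is a diagnostic sketch only, not a confirmed explanation of the 127 errors, and the helper name env_macro_stats is hypothetical.

```shell
#!/bin/bash
# Hypothetical diagnostic: count the NAGIOS_* environment variables seen
# by this process and approximate their total size in bytes.
# Values containing newlines are undercounted, since only lines that
# start with NAGIOS_ are matched.
env_macro_stats() {
    env | awk '/^NAGIOS_/ { n++; b += length($0) + 1 }
               END { printf "nagios_env_vars=%d nagios_env_bytes=%d\n", n, b }'
}

env_macro_stats
```

Calling this from a test check with the setting on and then off would show the difference in what each check inherits.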

Re: return code 127 on test that seems invalid

Posted: Fri Jan 04, 2013 4:35 pm
by slansing
One possibility: do you have any custom macros in your notification template?

Re: return code 127 on test that seems invalid

Posted: Fri Jan 04, 2013 5:05 pm
by ucemike
Nope.