Re: Unable to ACK (again)

Posted: Tue Mar 01, 2016 2:57 pm
by highness
Captured a little more information when this happened again today:

Code: Select all

[email protected] (Linux) $ ps aux | grep nagios.cfg
root     31318  0.0  0.0 103252   880 pts/0    S+   11:53   0:00 grep nagios.cfg
nagios   54440  5.4  0.0  95064 73892 ?        Rs   11:18   1:53 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
And here are the file permissions:

Code: Select all

total 12
drwxrwsr-x 2 nagios nagcmd 4096 Mar  1 11:18 .
drwxrwxr-x 6 nagios nagios 4096 Mar  1 11:53 ..
-rw-rw-r-- 1 root   nagcmd  493 Mar  1 11:53 nagios.cmd
srw-rw---- 1 nagios nagcmd    0 Mar  1 11:18 nagios.qh
drwxrwsr-x 2 nagios nagcmd 4096 Mar  1 11:18 /usr/local/nagios/var/rw/
It looks like nagios.cmd is owned by root, and I'm not sure why.

Doing this fixed it:

Code: Select all

service nagios stop
rm -rf /usr/local/nagios/var/rw/*
chown nagios:nagcmd /usr/local/nagios/var/rw
chmod g+s /usr/local/nagios/var/rw
service nagios start
For now...

Re: Unable to ACK (again)

Posted: Wed Mar 02, 2016 12:00 pm
by tmcdonald
If this happens again, I'd like you to try a slightly different ps command: ps -ef | grep bin/nagios. The output should look similar to this:

Code: Select all

[root@localhost ~]# ps -ef | grep bin/nagios
root      6381 12390  0 11:58 pts/2    00:00:00 grep bin/nagios
nagios   32503     1  0 10:44 ?        00:00:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   32507 32503  0 10:44 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   32508 32503  0 10:44 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   32509 32503  0 10:44 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   32510 32503  0 10:44 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   32519 32503  0 10:44 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
You'll see two nagios processes with the -d parameter: one will have a PPID (third column) of 1, and the other will have a PPID equal to the PID (second column) of that first process. In between you should have a number of workers whose PPID is also the PID of the first process. This is a healthy system. If you have more than two "-d" processes, something is wrong; at that point it is best to run service nagios restart and let us know here.
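If you want to script that check, here is a minimal sketch. It assumes the usual ps -ef column order (UID, PID, PPID, C, STIME, TTY, TIME, CMD) as in the listing above. The function name count_nagios_daemons is made up for illustration, and it reads ps text on stdin so it can be tried against saved output:

```shell
#!/bin/sh
# Hypothetical helper (not part of Nagios): count "nagios -d" processes
# in `ps -ef` output read from stdin.
# Assumes the layout: UID PID PPID C STIME TTY TIME CMD [ARGS...]
count_nagios_daemons() {
    awk '
        # Field 8 is the command path, field 9 its first argument.
        # The "grep bin/nagios" line is skipped because its field 8 is "grep".
        $8 ~ /bin\/nagios$/ && $9 == "-d" { n++ }
        END { print n + 0 }
    '
}

# Usage on a live system (not run here):
#   ps -ef | count_nagios_daemons
```

Going by the description above, a result of 2 is the healthy case, and anything higher suggests a leftover daemon.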

Re: Unable to ACK (again)

Posted: Wed Mar 02, 2016 12:07 pm
by ssax
Ahh, I see why it's occurring but not how:

Code: Select all

-rw-rw-r-- 1 root   nagcmd  493 Mar  1 11:53 nagios.cmd
Your nagios.cmd file is being created as a regular file instead of a named pipe. You should see a "p" at the beginning of the permissions string, like this:

Code: Select all

prw-rw---- 1 nagios nagcmd 0 Mar  2 10:51 nagios.cmd
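A quick way to test for this without reading ls output is the shell's -p test, which is true only for a named pipe. This is just a sketch; the function name cmd_file_type is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: report what kind of file sits at the given path.
cmd_file_type() {
    if [ -p "$1" ]; then
        echo "pipe"       # healthy: a FIFO, shown with a leading "p" by ls -l
    elif [ -f "$1" ]; then
        echo "regular"    # the broken state seen in this thread
    elif [ -e "$1" ]; then
        echo "other"
    else
        echo "missing"
    fi
}

# Usage (path taken from the listings in this thread, not run here):
#   cmd_file_type /usr/local/nagios/var/rw/nagios.cmd
```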
Do you have SNMP traps set up on this machine? If so, please attach your /usr/local/bin/snmptraphandling.py script.

Do you have any other cron jobs, checks, event handlers, or anything else that writes to the nagios.cmd file?

Re: Unable to ACK (again)

Posted: Wed Mar 09, 2016 2:03 pm
by highness
ssax wrote: Do you have SNMP traps set up on this machine? If so, please attach your /usr/local/bin/snmptraphandling.py script.

Do you have any other cron jobs, checks, event handlers, or anything else that writes to the nagios.cmd file?

We don't have SNMP traps set up on this box, and we don't have any other cron jobs that touch the nagios.cmd file.

Re: Unable to ACK (again)

Posted: Thu Mar 10, 2016 12:14 pm
by tmcdonald
Let's audit this file. Assuming you have the auditd service running (you likely do by default), please run the following to start an attribute change watch on the nagios.cmd file:

auditctl -w /usr/local/nagios/var/rw/nagios.cmd -p a -k nagcmdfile

Do a few Apply Configs to try to force the issue. Once you have verified that the file is showing as a regular file (and not a pipe), run the following:

ausearch -f nagios.cmd

This should show us exactly what is going on with that file.

For reference, after adding the audit rule I did a chmod 777 on the nagios.cmd file and got this in the log:

Code: Select all

time->Thu Mar 10 12:12:04 2016
type=PATH msg=audit(1457629924.171:1779325): item=0 name="nagios.cmd" inode=397930 dev=fd:00 mode=010660 ouid=500 ogid=501 rdev=00:00 nametype=NORMAL
type=CWD msg=audit(1457629924.171:1779325):  cwd="/usr/local/nagios/var/rw"
type=SYSCALL msg=audit(1457629924.171:1779325): arch=c000003e syscall=268 success=yes exit=0 a0=ffffffffffffff9c a1=10f70f0 a2=1ff a3=7ffd13fb23f0 items=1 ppid=12390 pid=6655 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts2 ses=61283 comm="chmod" exe="/bin/chmod" key="nagcmdfile"
It might not be too easy to read at first, but it should give us a hint as to what is going on.
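To make the record a little easier to read: the mode= value is octal, with the file type in the high bits and the permissions in the low bits, so mode=010660 is a FIFO with 0660 permissions. On x86_64 (arch=c000003e), syscall 268 is fchmodat, and a2=1ff is the hex form of the 0777 that chmod asked for. Here is a small sketch for decoding those two fields; the function names are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: split an audit "mode=" value (octal) into
# file type and permission bits.
decode_audit_mode() {
    type_oct=$(printf '%o' $(( $1 & 0170000 )))   # S_IFMT mask
    perm_oct=$(printf '%04o' $(( $1 & 07777 )))   # permission bits
    case $type_oct in
        10000)  kind=fifo ;;        # S_IFIFO
        40000)  kind=directory ;;   # S_IFDIR
        100000) kind=regular ;;     # S_IFREG
        140000) kind=socket ;;      # S_IFSOCK
        *)      kind=other ;;
    esac
    echo "$kind $perm_oct"
}

# Hypothetical helper: convert a hex syscall argument (e.g. a2=1ff) to octal.
hex_to_octal() {
    printf '%o\n' "$(( 0x$1 ))"
}

# Usage:
#   decode_audit_mode 010660
#   hex_to_octal 1ff
```

If a record for nagios.cmd ever decodes as a regular file rather than a FIFO, that would tell us the file was already wrong at the time of the event.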

Re: Unable to ACK (again)

Posted: Thu Mar 10, 2016 2:02 pm
by highness
We do have auditd running, and I added the watch on the nagios.cmd file. This problem sometimes doesn't surface for a few days, so let's leave this thread open and I'll update it once it happens again.

Re: Unable to ACK (again)

Posted: Thu Mar 10, 2016 3:00 pm
by rkennedy
Sounds good - let us know when you have an update.

Re: Unable to ACK (again)

Posted: Fri Mar 11, 2016 1:18 pm
by highness
OK, it happened again, and there is nothing in the audit log about that file being modified. See below:

Code: Select all

[email protected] (Linux) $ ausearch -f nagios.cmd
<no matches>
[email protected] (Linux) $ ls -la /usr/local/nagios/var/rw/
total 12
drwxrwsr-x 2 nagios nagcmd 4096 Mar 11 09:46 .
drwxrwxr-x 6 nagios nagios 4096 Mar 11 10:09 ..
-rw-rw-r-- 1 root   nagcmd  520 Mar 11 10:09 nagios.cmd
srw-rw---- 1 nagios nagcmd    0 Mar 11 09:46 nagios.qh
[email protected] (Linux) $ ausearch -f nagios.cmd
<no matches>
[email protected] (Linux) $ service nagios restart
Stopping nagios:. done.
Starting nagios: done.
[email protected] (Linux) $ ls -la /usr/local/nagios/var/rw/
total 8
drwxrwsr-x 2 nagios nagcmd 4096 Mar 11 10:10 .
drwxrwxr-x 6 nagios nagios 4096 Mar 11 10:10 ..
prw-rw---- 1 nagios nagcmd    0 Mar 11 10:10 nagios.cmd
srw-rw---- 1 nagios nagcmd    0 Mar 11 10:10 nagios.qh
[email protected] (Linux) $ ausearch -f nagios.cmd
<no matches>

Re: Unable to ACK (again)

Posted: Fri Mar 11, 2016 3:36 pm
by tmcdonald
Odd. It looks like the file may not be getting modified, but rather is sometimes created improperly as a regular file when nagios restarts. I'd like to let this run over the weekend to get some more data, if that's alright. I am still not convinced that my observation is correct; I think it is more likely that it *is* being modified and we just haven't seen it happen.

Either way, I am having our Core dev take a look at this to see if there are any conditions he can think of that would force that file to be written out as a non-pipe.

Update: Just to get my thoughts/fears out there, I could see a situation where a nagios restart from an Apply Config takes just long enough that the new process is spinning up while the old one is spinning down, and that causes issues. The old one says "I'm done with this file, get rid of it," so the kernel marks it for deletion. The name nagios.cmd is now available for a new file to take its place, but the in-memory reference to the pipe is still there. So when the new nagios process tries to create the file, it is able to, but since the named pipe is still in memory, it has to be created as a regular file.

Total conjecture on my part, probably isn't that complex, but it's just geeky enough that I'd be somewhat impressed to see this happen.

Re: Unable to ACK (again)

Posted: Mon Mar 14, 2016 9:46 am
by tmcdonald
Do you have passive checks enabled? Is anything on a remote machine sending results in to NRDP or NSCA? Either of those could potentially cause nagios.cmd to be overwritten as a regular file.

Update: I am almost certain this is the cause - notice that the file size is 520 bytes when it should be 0 for a normal pipe file. Can you cat the contents of the nagios.cmd file?
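To illustrate how that can happen: any script that submits commands with a plain shell redirect (echo ... >> nagios.cmd) will create a regular file if the pipe does not exist at that moment, for example while nagios is restarting. If the script runs as root, the file ends up owned by root, which would match the listings earlier in the thread. Below is a minimal sketch of a guarded writer, assuming the standard "[timestamp] COMMAND" external command format. The function name submit_external_command is hypothetical, and the check narrows the window rather than closing it completely:

```shell
#!/bin/sh
# Hypothetical guarded writer: only submit when the target really is a FIFO.
# An unguarded `echo ... >> "$cmdfile"` would instead create a regular file
# whenever the pipe is missing.
submit_external_command() {
    cmdfile=$1
    shift
    if [ ! -p "$cmdfile" ]; then
        echo "refusing to write: $cmdfile is not a named pipe" >&2
        return 1
    fi
    # External command format: [epoch seconds] COMMAND;arg;arg...
    printf '[%s] %s\n' "$(date +%s)" "$*" > "$cmdfile"
}

# Usage (host1 and svc1 are placeholder names, not run here):
#   submit_external_command /usr/local/nagios/var/rw/nagios.cmd \
#       "PROCESS_SERVICE_CHECK_RESULT;host1;svc1;0;OK"
```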