Unable to ACK (again)

highness
Posts: 192
Joined: Thu May 01, 2014 4:25 pm

Re: Unable to ACK (again)

Post by highness »

Captured a little more information when this happened again today:

Code: Select all

[email protected] (Linux) $ ps aux | grep nagios.cfg
root     31318  0.0  0.0 103252   880 pts/0    S+   11:53   0:00 grep nagios.cfg
nagios   54440  5.4  0.0  95064 73892 ?        Rs   11:18   1:53 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
total 12
And here are the file permissions:

Code: Select all

drwxrwsr-x 2 nagios nagcmd 4096 Mar  1 11:18 .
drwxrwxr-x 6 nagios nagios 4096 Mar  1 11:53 ..
-rw-rw-r-- 1 root   nagcmd  493 Mar  1 11:53 nagios.cmd
srw-rw---- 1 nagios nagcmd    0 Mar  1 11:18 nagios.qh
drwxrwsr-x 2 nagios nagcmd 4096 Mar  1 11:18 /usr/local/nagios/var/rw/
It looks like nagios.cmd is owned by root, and I'm not sure why.

Doing this fixed it:

Code: Select all

service nagios stop
rm -rf /usr/local/nagios/var/rw/*
chown nagios:nagcmd /usr/local/nagios/var/rw
chmod g+s /usr/local/nagios/var/rw
service nagios start
For now...
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Unable to ACK (again)

Post by tmcdonald »

If this happens again, I'd like you to try a slightly different ps command: ps -ef | grep bin/nagios. The output should look similar to this:

Code: Select all

[root@localhost ~]# ps -ef | grep bin/nagios
root      6381 12390  0 11:58 pts/2    00:00:00 grep bin/nagios
nagios   32503     1  0 10:44 ?        00:00:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   32507 32503  0 10:44 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   32508 32503  0 10:44 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   32509 32503  0 10:44 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   32510 32503  0 10:44 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   32519 32503  0 10:44 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
You'll see two nagios processes with the -d parameter. One will have a PPID of 1 (the third column), and the other will have a PPID equal to the PID (the second column) of the first. In between you should have a number of workers whose PPID is also the PID of the first process. This is a healthy system. If you have more than two "-d" processes, something is wrong; at that point it is best to run service nagios restart and let us know here.
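For anyone who wants to script that check, here is a minimal sketch. It assumes the usual `ps -ef` column layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD) and that the daemon is started as `.../bin/nagios -d ...` as in the output above. The function names are made up for illustration and are not part of Nagios.

```shell
# Count nagios daemon ("-d") processes in `ps -ef` output read from stdin.
# Field 8 is the command and field 9 is its first argument.
count_nagios_daemons() {
    awk '$8 ~ /\/bin\/nagios$/ && $9 == "-d" { n++ } END { print n + 0 }'
}

# Compare the count with the healthy layout (exactly two "-d" processes).
check_nagios_daemons() {
    count=$(count_nagios_daemons)
    if [ "$count" -eq 2 ]; then
        echo "OK: 2 daemon processes"
    else
        echo "WARNING: $count daemon processes (expected 2)"
        return 1
    fi
}
```

Typical use would be `ps -ef | check_nagios_daemons`. The grep line and the `--worker` processes are ignored because their command or first argument does not match.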
Former Nagios employee
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Unable to ACK (again)

Post by ssax »

Ahh, I see why it's occurring but not how:

Code: Select all

-rw-rw-r-- 1 root   nagcmd  493 Mar  1 11:53 nagios.cmd
Your nagios.cmd file is being created as a regular file instead of a named pipe. You should see a "p" at the beginning of the permissions, like this:

Code: Select all

prw-rw---- 1 nagios nagcmd 0 Mar  2 10:51 nagios.cmd
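If reading the first character of `ls -l` gets tedious, the same test can be scripted. This is only a sketch: `file_kind` is a hypothetical helper name, and it relies on nothing beyond the standard `-p` (named pipe) and `-f` (regular file) tests of the shell.

```shell
# Print what kind of file a path is: fifo, regular, missing or other.
file_kind() {
    if [ -p "$1" ]; then
        echo fifo          # named pipe, what nagios.cmd should be
    elif [ -f "$1" ]; then
        echo regular       # the broken state shown above
    elif [ ! -e "$1" ]; then
        echo missing
    else
        echo other
    fi
}
```

For example, `file_kind /usr/local/nagios/var/rw/nagios.cmd` should print `fifo` on a healthy system.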
Do you have SNMP traps set up on this machine? If so, please attach your /usr/local/bin/snmptraphandling.py script.

Do you have any other cron jobs, checks, event handlers, or anything else that writes to the nagios.cmd file?
highness
Posts: 192
Joined: Thu May 01, 2014 4:25 pm

Re: Unable to ACK (again)

Post by highness »

ssax wrote: Do you have SNMP traps set up on this machine? If so, please attach your /usr/local/bin/snmptraphandling.py script.

Do you have any other cron jobs, checks, event handlers, or anything else that writes to the nagios.cmd file?
We don't have SNMP traps set up on this box, and don't have any other cron jobs that touch the nagios.cmd file.
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Unable to ACK (again)

Post by tmcdonald »

Let's audit this file. Assuming you have the auditd service running (you likely do by default), please run the following to start an attribute change watch on the nagios.cmd file:

auditctl -w /usr/local/nagios/var/rw/nagios.cmd -p a -k nagcmdfile

Do a few Apply Configs, try to force the issue, then once you have verified that the file is still showing as a regular file (and not a pipe) run the following:

ausearch -f nagios.cmd

This should show us exactly what is going on with that file.

For reference, after adding the audit rule I did a chmod 777 on the nagios.cmd file and got this in the log:

Code: Select all

time->Thu Mar 10 12:12:04 2016
type=PATH msg=audit(1457629924.171:1779325): item=0 name="nagios.cmd" inode=397930 dev=fd:00 mode=010660 ouid=500 ogid=501 rdev=00:00 nametype=NORMAL
type=CWD msg=audit(1457629924.171:1779325):  cwd="/usr/local/nagios/var/rw"
type=SYSCALL msg=audit(1457629924.171:1779325): arch=c000003e syscall=268 success=yes exit=0 a0=ffffffffffffff9c a1=10f70f0 a2=1ff a3=7ffd13fb23f0 items=1 ppid=12390 pid=6655 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts2 ses=61283 comm="chmod" exe="/bin/chmod" key="nagcmdfile"
It might not be too easy to read at first, but it should give us a hint as to what is going on.
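To make the record above a little easier to read, here is a small sketch that decodes two of its fields. It assumes `mode=` in the PATH record is the octal st_mode (file-type bits plus permissions) and that `a2` in the SYSCALL record is the hexadecimal mode passed to chmod. As far as I know, syscall 268 on x86_64 is fchmodat. The function names are invented for this example.

```shell
# Map the octal "mode=" value of an audit PATH record to a file type.
audit_mode_type() {
    # Keep only the file-type bits (S_IFMT = 0170000) and print them in octal.
    type_bits=$(printf '%o' $(( $1 & 0170000 )))
    case "$type_bits" in
        10000)  echo "named pipe" ;;
        100000) echo "regular file" ;;
        40000)  echo "directory" ;;
        140000) echo "socket" ;;
        *)      echo "other ($type_bits)" ;;
    esac
}

# Print the permission bits of an audit PATH mode in octal.
audit_mode_perms() {
    printf '%o\n' $(( $1 & 07777 ))
}

# Convert a hexadecimal syscall argument (a0..a3) to octal.
audit_hex_to_octal() {
    printf '%o\n' $(( 0x$1 ))
}

audit_mode_type 010660     # prints: named pipe
audit_mode_perms 010660    # prints: 660
audit_hex_to_octal 1ff     # prints: 777
```

So the record says a named pipe with mode 660 was the target of a chmod to 777, which matches the test I ran.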
highness
Posts: 192
Joined: Thu May 01, 2014 4:25 pm

Re: Unable to ACK (again)

Post by highness »

tmcdonald wrote: Let's audit this file. [...] please run the following to start an attribute change watch on the nagios.cmd file: auditctl -w /usr/local/nagios/var/rw/nagios.cmd -p a -k nagcmdfile
We do have auditd running and I added the nagios.cmd file. This problem sometimes doesn't surface for a few days, so let's leave this thread open and I'll update this once it happens again.
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Unable to ACK (again)

Post by rkennedy »

Sounds good - let us know when you have an update.
Former Nagios Employee
highness
Posts: 192
Joined: Thu May 01, 2014 4:25 pm

Re: Unable to ACK (again)

Post by highness »

OK, it happened again, and there is nothing in the audit log about that file being modified. See below:

Code: Select all

[email protected] (Linux) $ ausearch -f nagios.cmd
<no matches>
[email protected] (Linux) $ ls -la /usr/local/nagios/var/rw/
total 12
drwxrwsr-x 2 nagios nagcmd 4096 Mar 11 09:46 .
drwxrwxr-x 6 nagios nagios 4096 Mar 11 10:09 ..
-rw-rw-r-- 1 root   nagcmd  520 Mar 11 10:09 nagios.cmd
srw-rw---- 1 nagios nagcmd    0 Mar 11 09:46 nagios.qh
[email protected] (Linux) $ ausearch -f nagios.cmd
<no matches>
[email protected] (Linux) $ service nagios restart
Stopping nagios:. done.
Starting nagios: done.
[email protected] (Linux) $ ls -la /usr/local/nagios/var/rw/
total 8
drwxrwsr-x 2 nagios nagcmd 4096 Mar 11 10:10 .
drwxrwxr-x 6 nagios nagios 4096 Mar 11 10:10 ..
prw-rw---- 1 nagios nagcmd    0 Mar 11 10:10 nagios.cmd
srw-rw---- 1 nagios nagcmd    0 Mar 11 10:10 nagios.qh
[email protected] (Linux) $ ausearch -f nagios.cmd
<no matches>
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Unable to ACK (again)

Post by tmcdonald »

Odd. It looks like the file may not be getting modified, but rather is sometimes created improperly as a normal file when nagios restarts. I'd like to let this run over the weekend to get some more data, if that's alright. I am still not convinced that my observation is correct; I think it is more likely that it *is* being modified and we just haven't seen it happen.

Either way, I am having our Core dev take a look at this to see if there are any conditions he can think of that would force that file to be written out as a non-pipe.

Update: Just to get my thoughts/fears out there, I could see a situation where a nagios restart from an Apply Config takes just long enough that the new process is spinning up while the old one is spinning down, and that causes issues. The old one says "I'm done with this file, get rid of it," so the kernel marks it for deletion. The name nagios.cmd is now available for a new file to take its place, but the in-memory reference to the pipe is still there. So when the new nagios process tries to create the file, it is able to, but since the named pipe is still in memory, it has to be created as a normal file.

Total conjecture on my part, probably isn't that complex, but it's just geeky enough that I'd be somewhat impressed to see this happen.
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Unable to ACK (again)

Post by tmcdonald »

Do you have passive checks enabled? Anything on a remote machine sending in to NRDP/NSCA? Either of those could potentially cause nagios.cmd to be overwritten as a normal file.

Update: I am almost certain this is the cause. Notice that the file size is 520 bytes when it should be 0 for a named pipe. Can you cat the contents of the nagios.cmd file?
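For background on why a writer can do this: a shell redirection such as `>> nagios.cmd` creates a regular file when the target does not exist, so anything that writes during the short window when the pipe is missing (for example, in the middle of a restart) would leave a plain file behind. A script can guard against that by refusing to write unless the target is already a named pipe. The following is a sketch under that assumption, not what NRDP or NSCA actually do, and `submit_nagios_cmd` is a hypothetical name.

```shell
# Write one external command to the Nagios command pipe, but only if the
# target already exists as a named pipe. Otherwise fail instead of
# creating a regular file in its place.
submit_nagios_cmd() {
    cmd_pipe=$1
    shift
    if [ ! -p "$cmd_pipe" ]; then
        echo "refusing to write: $cmd_pipe is not a named pipe" >&2
        return 1
    fi
    # External commands use the form "[unix_timestamp] COMMAND;arg;arg".
    printf '[%s] %s\n' "$(date +%s)" "$*" > "$cmd_pipe"
}
```

Usage would look like `submit_nagios_cmd /usr/local/nagios/var/rw/nagios.cmd "PROCESS_SERVICE_CHECK_RESULT;web01;HTTP;0;OK"`, where the host and service names are placeholders. There is still a small gap between the check and the write, so this narrows the window but does not close it.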