Odd crashing behavior

Posted: Fri Jul 20, 2012 12:03 pm
by gwakem
OK, I have another strange one.

Quick specs: we have 1403 servers and 7078 services monitored currently (and climbing). The machine we have as a master is a 16-core 2.4GHz box with 128GB of memory and 10k SAS drives.

We apply the configuration, and it takes two or three attempts to get things running again. I see the following messages in /var/log/messages (used so I don't have to convert from epoch time):

Code:

Jul 20 10:00:56 sidhqmonm0 nagios: Caught SIGTERM, shutting down...
Jul 20 10:00:56 sidhqmonm0 nagios: Successfully shutdown... (PID=11492)
Jul 20 10:00:56 sidhqmonm0 nagios: Event broker module '/usr/lib64/mod_gearman/mod_gearman.o' deinitialized successfully.
Jul 20 10:00:56 sidhqmonm0 nagios: ndomod: Shutdown complete.
Jul 20 10:00:56 sidhqmonm0 nagios: Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
Jul 20 10:00:58 sidhqmonm0 nagios: Nagios 3.4.1 starting... (PID=27118)
Jul 20 10:00:58 sidhqmonm0 nagios: Local time is Fri Jul 20 10:00:58 MDT 2012
Jul 20 10:00:58 sidhqmonm0 nagios: LOG VERSION: 2.0
Jul 20 10:00:58 sidhqmonm0 nagios: mod_gearman: initialized version 1.3.0 (libgearman 0.25)
Jul 20 10:00:58 sidhqmonm0 nagios: Event broker module '/usr/lib64/mod_gearman/mod_gearman.o' initialized successfully.
Jul 20 10:00:58 sidhqmonm0 nagios: ndomod: NDOMOD 1.5.1 (05-15-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
Jul 20 10:00:58 sidhqmonm0 nagios: ndomod: Successfully connected to data sink.  0 queued items to flush.
Jul 20 10:00:58 sidhqmonm0 nagios: Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.

[CLIP]

Jul 20 10:01:07 sidhqmonm0 nagios: Finished daemonizing... (New PID=27244)
Jul 20 10:01:07 sidhqmonm0 nagios: Error: Could not create external command file '/usr/local/nagios/var/rw/nagios.cmd' as named pipe: (17) -> File exists.  If this file already exists and you are sure that another copy of Nagios is not running, you should delete this file.
Jul 20 10:01:07 sidhqmonm0 nagios: Bailing out due to errors encountered while trying to initialize the external command file... (PID=27244)
Jul 20 10:01:07 sidhqmonm0 nagios: Event broker module '/usr/lib64/mod_gearman/mod_gearman.o' deinitialized successfully.
Jul 20 10:01:07 sidhqmonm0 nagios: ndomod: Shutdown complete.
Jul 20 10:01:07 sidhqmonm0 nagios: Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
Around the third apply, Nagios seems to come back. To test, I applied while watching /usr/local/nagios/var/rw/nagios.cmd, and the file did exist until *right* after the message about it already existing. Then it went away for about 5 seconds and came back with these permissions:

Code:

rw-rw-rw-+ 1 nagios nagcmd 0 Jul 20 10:46 /usr/local/nagios/var/rw/nagios.cmd
I started nagios manually with a service nagios start, and the above nagios.cmd disappeared again for about 3 seconds, then came back like this:

Code:

prw-rw----+ 1 nagios nagcmd 0 Jul 20 10:46 /usr/local/nagios/var/rw/nagios.cmd
Which is normal while it's running. The nagios service started successfully and checks started flowing again. We timed our last apply and it takes a touch over 4 minutes to complete. I don't know if that is a contributing factor or not, but figured I'd mention it. Any ideas on where I could start looking? Nothing significant has changed except that we disabled logging yesterday; however, this only started occurring today.
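
For anyone wanting to tell the two states above apart without eyeballing ls output, here is a minimal Python sketch that reports whether the command file is currently a named pipe, a regular file, or missing. The path is the one from the log lines above; describe_command_file is a hypothetical helper name, not part of Nagios.

```python
import os
import stat


def describe_command_file(path):
    """Return 'missing', 'fifo', 'regular', or 'other' for the given path."""
    try:
        # lstat so that a symlink is reported as 'other' rather than followed.
        mode = os.lstat(path).st_mode
    except FileNotFoundError:
        return "missing"
    if stat.S_ISFIFO(mode):
        return "fifo"      # what a running Nagios normally creates
    if stat.S_ISREG(mode):
        return "regular"   # something else wrote to the path as a plain file
    return "other"


if __name__ == "__main__":
    # Path assumed from the log output in this post.
    print(describe_command_file("/usr/local/nagios/var/rw/nagios.cmd"))
```

Running this in a loop during an apply might show whether a regular file appears at that path before Nagios tries to create the pipe, which would be consistent with the "File exists" error.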

Re: Odd crashing behavior

Posted: Fri Jul 20, 2012 1:47 pm
by mguthrie
Most of these steps are contained in the /etc/init.d/nagios init script.

Do you guys have SELinux enabled?

Here's my nagios.cmd file. I don't suppose you know what the '+' sign denotes...?

Code:

prw-rw---- 1 nagios nagios       0 Jul 20 13:01 nagios.cmd
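
For reference, with GNU ls the first character of the mode string is the file type ('p' is a named pipe, '-' a regular file), and a trailing '+' generally means the file carries an ACL on top of the normal permission bits. A small Python sketch for decoding such a string; parse_ls_mode is a hypothetical helper written for this thread, not an existing tool:

```python
def parse_ls_mode(mode):
    """Split an `ls -l` mode string into file type, permission bits, ACL flag."""
    types = {
        "-": "regular file",
        "d": "directory",
        "p": "named pipe (FIFO)",
        "l": "symlink",
        "s": "socket",
        "c": "character device",
        "b": "block device",
    }
    return {
        "type": types.get(mode[0], "unknown"),
        "perms": mode[1:10],           # the nine rwx characters
        "has_acl": mode[10:] == "+",   # GNU ls appends '+' when an ACL is set
    }


if __name__ == "__main__":
    print(parse_ls_mode("prw-rw----+"))
    print(parse_ls_mode("prw-rw----"))
```

To see the actual ACL entries on the real file, getfacl on the path would be the usual tool.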