Odd crashing behavior

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
gwakem
Posts: 238
Joined: Mon Jan 23, 2012 2:02 pm
Location: Asheville, NC

OK, I have another strange one.

Quick specs: we currently monitor 1403 servers and 7078 services (and climbing). The master is a 16-core 2.4GHz machine with 128GB of memory and 10k SAS drives.

When we apply a configuration, it takes two or three attempts to get things running again. I see the following messages in /var/log/messages (used so I don't have to convert from epoch time):

Code:

Jul 20 10:00:56 sidhqmonm0 nagios: Caught SIGTERM, shutting down...
Jul 20 10:00:56 sidhqmonm0 nagios: Successfully shutdown... (PID=11492)
Jul 20 10:00:56 sidhqmonm0 nagios: Event broker module '/usr/lib64/mod_gearman/mod_gearman.o' deinitialized successfully.
Jul 20 10:00:56 sidhqmonm0 nagios: ndomod: Shutdown complete.
Jul 20 10:00:56 sidhqmonm0 nagios: Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
Jul 20 10:00:58 sidhqmonm0 nagios: Nagios 3.4.1 starting... (PID=27118)
Jul 20 10:00:58 sidhqmonm0 nagios: Local time is Fri Jul 20 10:00:58 MDT 2012
Jul 20 10:00:58 sidhqmonm0 nagios: LOG VERSION: 2.0
Jul 20 10:00:58 sidhqmonm0 nagios: mod_gearman: initialized version 1.3.0 (libgearman 0.25)
Jul 20 10:00:58 sidhqmonm0 nagios: Event broker module '/usr/lib64/mod_gearman/mod_gearman.o' initialized successfully.
Jul 20 10:00:58 sidhqmonm0 nagios: ndomod: NDOMOD 1.5.1 (05-15-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
Jul 20 10:00:58 sidhqmonm0 nagios: ndomod: Successfully connected to data sink.  0 queued items to flush.
Jul 20 10:00:58 sidhqmonm0 nagios: Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.

[CLIP]

Jul 20 10:01:07 sidhqmonm0 nagios: Finished daemonizing... (New PID=27244)
Jul 20 10:01:07 sidhqmonm0 nagios: Error: Could not create external command file '/usr/local/nagios/var/rw/nagios.cmd' as named pipe: (17) -> File exists.  If this file already exists and you are sure that another copy of Nagios is not running, you should delete this file.
Jul 20 10:01:07 sidhqmonm0 nagios: Bailing out due to errors encountered while trying to initialize the external command file... (PID=27244)
Jul 20 10:01:07 sidhqmonm0 nagios: Event broker module '/usr/lib64/mod_gearman/mod_gearman.o' deinitialized successfully.
Jul 20 10:01:07 sidhqmonm0 nagios: ndomod: Shutdown complete.
Jul 20 10:01:07 sidhqmonm0 nagios: Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
Around the third apply, it seems to come back. To test, I applied while watching /usr/local/nagios/var/rw/nagios.cmd, and the file existed until *right* after the message about it already existing. Then it went away for about 5 seconds and came back with these permissions:

Code:

rw-rw-rw-+ 1 nagios nagcmd 0 Jul 20 10:46 /usr/local/nagios/var/rw/nagios.cmd
I started nagios manually with a service nagios start, and the nagios.cmd above disappeared again for about 3 seconds, then came back like this:

Code:

prw-rw----+ 1 nagios nagcmd 0 Jul 20 10:46 /usr/local/nagios/var/rw/nagios.cmd
Which is normal while it's running. The nagios service started successfully and checks started flowing again. We timed our last apply, and it takes a touch over 4 minutes to complete. I don't know if that is a contributing factor or not, but figured I'd mention it. Any ideas on where I could start looking? Nothing significant has changed except that we disabled logging yesterday; however, this only started occurring today.
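
For anyone chasing the same symptom: the second listing starts with p, which marks a named pipe, while the first listing shows no type character at all. A quick way to tell which kind of file is sitting at that path at a given moment is to check the file type directly. The following is only a minimal Python sketch, not something that ships with Nagios; the function name describe_command_file is made up for illustration, and the path is the one from the log above.

```python
import os
import stat


def describe_command_file(path):
    """Classify whatever currently exists at `path`.

    Returns one of: "fifo", "regular", "missing", "other".
    (Hypothetical helper, named here only for illustration.)
    """
    try:
        # os.stat only reads metadata; it does not open the pipe,
        # so it will not block waiting for a reader or writer.
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    if stat.S_ISFIFO(mode):
        return "fifo"      # shown as a leading 'p' in ls -l
    if stat.S_ISREG(mode):
        return "regular"   # shown as a leading '-' in ls -l
    return "other"


if __name__ == "__main__":
    # Path taken from the error message in the log above.
    print(describe_command_file("/usr/local/nagios/var/rw/nagios.cmd"))
```

Running this in a loop during an apply would show whether the path flips between "missing", "regular" and "fifo", which is roughly what watching the directory by hand suggested.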
--
Griffin Wakem
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Odd crashing behavior

Most of these steps are contained in the /etc/init.d/nagios init script.

Do you guys have SELinux enabled?

Here's my nagios.cmd file. I don't suppose you know what the '+' sign denotes...?

Code:

prw-rw---- 1 nagios nagios       0 Jul 20 13:01 nagios.cmd
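
On the '+' question: in GNU ls long listings, a trailing '+' after the permission bits generally means the file has an additional access method such as a POSIX ACL, and a trailing '.' means an SELinux security context with nothing else attached. As a rough illustration of how to read that first column, here is a small Python sketch; parse_ls_mode is a hypothetical helper written for this thread, and it only interprets the text of the mode field, it does not query the filesystem.

```python
# Leading character of the ls -l mode field -> file type.
FILE_TYPES = {
    "-": "regular file",
    "d": "directory",
    "l": "symlink",
    "p": "named pipe (FIFO)",
    "s": "socket",
    "c": "character device",
    "b": "block device",
}


def parse_ls_mode(field):
    """Split an ls -l mode field such as 'prw-rw----+' into its parts.

    Hypothetical helper for illustration only.
    """
    if len(field) < 10 or field[0] not in FILE_TYPES:
        raise ValueError("not a complete ls -l mode field: %r" % field)
    suffix = field[10:]  # '', '+', or '.' on GNU ls
    return {
        "type": FILE_TYPES[field[0]],
        "permissions": field[1:10],
        "has_acl": suffix == "+",
        "selinux_context_only": suffix == ".",
    }


if __name__ == "__main__":
    print(parse_ls_mode("prw-rw----+"))  # listing from the first post
    print(parse_ls_mode("prw-rw----"))   # listing from this reply
```

If the '+' does turn out to be an ACL, getfacl on the file and on the rw directory should show what is being applied.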