Odd crashing behavior
Posted: Fri Jul 20, 2012 12:03 pm
OK, I have another strange one.
Quick specs: we currently monitor 1403 servers and 7078 services (and climbing). The master is a 16-core 2.4GHz machine with 128GB of memory and 10k SAS drives.
We apply, and it takes two or three attempts to get things running again. I see the following messages in /var/log/messages (used so I don't have to convert from epoch):
Around the third apply, it seems to come back. To test, I applied while watching /usr/local/nagios/var/rw/nagios.cmd, and the file did exist until *right* after the message about it already existing. It then went away for about 5 seconds and came back with these permissions:
I started nagios manually with a service nagios start, and the above nagios.cmd disappeared again for about 3 seconds and came back like this:
Which is normal while it's running. The nagios service started successfully and checks started flowing again. We timed our last apply, and it takes a touch over 4 minutes to complete. I don't know if that is a contributing factor or not, but figured I'd mention it. Any ideas on where I could start looking? Nothing significant has changed except that we disabled logging yesterday; however, this only started occurring today.
Code:
Jul 20 10:00:56 sidhqmonm0 nagios: Caught SIGTERM, shutting down...
Jul 20 10:00:56 sidhqmonm0 nagios: Successfully shutdown... (PID=11492)
Jul 20 10:00:56 sidhqmonm0 nagios: Event broker module '/usr/lib64/mod_gearman/mod_gearman.o' deinitialized successfully.
Jul 20 10:00:56 sidhqmonm0 nagios: ndomod: Shutdown complete.
Jul 20 10:00:56 sidhqmonm0 nagios: Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
Jul 20 10:00:58 sidhqmonm0 nagios: Nagios 3.4.1 starting... (PID=27118)
Jul 20 10:00:58 sidhqmonm0 nagios: Local time is Fri Jul 20 10:00:58 MDT 2012
Jul 20 10:00:58 sidhqmonm0 nagios: LOG VERSION: 2.0
Jul 20 10:00:58 sidhqmonm0 nagios: mod_gearman: initialized version 1.3.0 (libgearman 0.25)
Jul 20 10:00:58 sidhqmonm0 nagios: Event broker module '/usr/lib64/mod_gearman/mod_gearman.o' initialized successfully.
Jul 20 10:00:58 sidhqmonm0 nagios: ndomod: NDOMOD 1.5.1 (05-15-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
Jul 20 10:00:58 sidhqmonm0 nagios: ndomod: Successfully connected to data sink. 0 queued items to flush.
Jul 20 10:00:58 sidhqmonm0 nagios: Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
[CLIP]
Jul 20 10:01:07 sidhqmonm0 nagios: Finished daemonizing... (New PID=27244)
Jul 20 10:01:07 sidhqmonm0 nagios: Error: Could not create external command file '/usr/local/nagios/var/rw/nagios.cmd' as named pipe: (17) -> File exists. If this file already exists and you are sure that another copy of Nagios is not running, you should delete this file.
Jul 20 10:01:07 sidhqmonm0 nagios: Bailing out due to errors encountered while trying to initialize the external command file... (PID=27244)
Jul 20 10:01:07 sidhqmonm0 nagios: Event broker module '/usr/lib64/mod_gearman/mod_gearman.o' deinitialized successfully.
Jul 20 10:01:07 sidhqmonm0 nagios: ndomod: Shutdown complete.
Jul 20 10:01:07 sidhqmonm0 nagios: Event broker module '/usr/local/nagios/bin/ndomod.o' deinitialized successfully.
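For reference, the (17) in the error above is EEXIST, which is what mkfifo() returns when anything already exists at the target path, whether that is a pipe or a plain file. Here is a minimal Python sketch of that behavior, assuming the command file is created with an mkfifo-style call as the "as named pipe" wording suggests. It uses a temporary directory as a stand-in for /usr/local/nagios/var/rw, and the helper names are hypothetical; this illustrates the system call only, not Nagios's actual startup code.

```python
import errno
import os
import tempfile


def try_create_fifo(path):
    """Try to create a named pipe; return None on success, else the errno."""
    try:
        os.mkfifo(path, 0o660)
        return None
    except OSError as exc:
        return exc.errno


def demo_stale_file():
    """Show mkfifo failing on a leftover file, then succeeding after cleanup."""
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in path for illustration, not the real command file.
        cmd_path = os.path.join(tmp, "nagios.cmd")

        # Simulate something leaving a regular file at the pipe's path.
        open(cmd_path, "w").close()
        with_stale_file = try_create_fifo(cmd_path)

        # After removing the leftover file, creating the pipe works.
        os.remove(cmd_path)
        after_cleanup = try_create_fifo(cmd_path)

    return with_stale_file, after_cleanup


stale_result, clean_result = demo_stale_file()
print("stale file present -> EEXIST:", stale_result == errno.EEXIST)
print("after cleanup -> created:", clean_result is None)
```

If this is what is happening here, the thing to look for would be whatever creates a file at that path between the old process shutting down and the new one starting.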
Code:
rw-rw-rw-+ 1 nagios nagcmd 0 Jul 20 10:46 /usr/local/nagios/var/rw/nagios.cmd
Code:
prw-rw----+ 1 nagios nagcmd 0 Jul 20 10:46 /usr/local/nagios/var/rw/nagios.cmd
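The two listings above differ in the file type shown in the mode string, not just in the permission bits. To catch exactly when the path changes type during an apply, something like the following could be left running. It is only a rough sketch: describe_path and watch are hypothetical helper names, and the polling interval and duration are arbitrary assumptions.

```python
import os
import stat
import time


def describe_path(path):
    """Classify what currently exists at path (without following symlinks)."""
    try:
        mode = os.lstat(path).st_mode
    except FileNotFoundError:
        return "missing"
    if stat.S_ISFIFO(mode):
        return "fifo"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"


def watch(path, interval=0.5, duration=10.0):
    """Print a timestamped line each time the path's type changes."""
    last = None
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        current = describe_path(path)
        if current != last:
            print(time.strftime("%H:%M:%S"), current)
            last = current
        time.sleep(interval)
```

For example, calling watch("/usr/local/nagios/var/rw/nagios.cmd", duration=300) just before an apply would cover the roughly 4 minute window mentioned above and show whether the path ever turns into a regular file instead of a pipe.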