Monitoring engine stops some time after apply changes
Posted: Mon Sep 20, 2021 9:15 am
We have a customer who runs into problems some time after applying configuration following CCM changes in Nagios XI 5.8.3. A short time (tens of minutes at most) after they make changes and apply the config, the Nagios monitoring engine crashes and they have to start it back up again. We put a band-aid in place that works around the problem but doesn't fix the underlying issue. A simple "service nagios restart" brings it back, and it will then run indefinitely until the next CCM changes are applied. Note that the changes are applied and the system runs for a while afterwards before it crashes. Restarting brings it back up and running properly, so it doesn't seem to be related to the changes themselves, per se.
I did some digging and found that there's a SIGSEGV during each of these events. I'm including a log that shows the output of an "egrep -B 5 SIGSEGV" run against the log files from August (which is when the problems seem to have started). I've replaced all Host/Services with XXX, as there doesn't seem to be a pattern to them (and I need to obfuscate some information).
Code:
nagios-08-19-2021-00.log-[1629316295] NDO-3: Ended service_check thread
nagios-08-19-2021-00.log-[1629316295] NDO-3: Ended comment thread
nagios-08-19-2021-00.log-[1629316295] HOST ALERT: XXX;DOWN;SOFT;1;CRITICAL - 10.123.32.81: rta nan, lost 100%
nagios-08-19-2021-00.log-[1629316295] SERVICE ALERT: XXX;CRITICAL;HARD;1;CRITICAL - Plugin timed out while executing system call
nagios-08-19-2021-00.log-[1629316295] NDO-3: Ended timed_event thread
nagios-08-19-2021-00.log:[1629316295] Caught SIGSEGV, shutting down...
--
nagios-08-19-2021-00.log-[1629316350] SERVICE ALERT: XXX;OK;SOFT;2;SNMP OK - 1
nagios-08-19-2021-00.log-[1629316350] SERVICE ALERT: XXX;OK;SOFT;2;SNMP OK - 1
nagios-08-19-2021-00.log-[1629316350] SERVICE ALERT: XXX;OK;SOFT;2;SNMP OK - 1
nagios-08-19-2021-00.log-[1629316350] SERVICE ALERT: XXX;OK;SOFT;2;SNMP OK - 1
nagios-08-19-2021-00.log-[1629316350] NDO-3: Ended timed_event thread
nagios-08-19-2021-00.log:[1629316350] Caught SIGSEGV, shutting down...
--
nagios-08-22-2021-00.log-[1629481479] SERVICE ALERT: XXX;CRITICAL;HARD;1;CRITICAL - Plugin timed out while executing system call
nagios-08-22-2021-00.log-[1629481479] SERVICE ALERT: XXX;CRITICAL;HARD;1;CRITICAL - Plugin timed out while executing system call
nagios-08-22-2021-00.log-[1629481479] NDO-3: Ended downtime thread
nagios-08-22-2021-00.log-[1629481479] HOST ALERT: XXX;DOWN;SOFT;2;CRITICAL - 10.208.129.124: rta nan, lost 100%
nagios-08-22-2021-00.log-[1629481479] NDO-3: Ended timed_event thread
nagios-08-22-2021-00.log:[1629481479] Caught SIGSEGV, shutting down...
--
nagios-08-27-2021-00.log-[1629974636] SERVICE ALERT: XXX;OK;HARD;3;SNMP OK - "None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None"
nagios-08-27-2021-00.log-[1629974636] SERVICE ALERT: XXX;OK;HARD;3;SNMP OK - "False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False"
nagios-08-27-2021-00.log-[1629974636] SERVICE ALERT: XXX;OK;HARD;3;SNMP OK - "False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False"
nagios-08-27-2021-00.log-[1629974636] NDO-3: Ended service_check thread
nagios-08-27-2021-00.log-[1629974636] NDO-3: Ended downtime thread
nagios-08-27-2021-00.log:[1629974636] Caught SIGSEGV, shutting down...
--
nagios-08-27-2021-00.log-[1630008866] NDO-3: Ended notification thread
nagios-08-27-2021-00.log-[1630008866] NDO-3: Ended service_check thread
nagios-08-27-2021-00.log-[1630008866] NDO-3: Ended downtime thread
nagios-08-27-2021-00.log-[1630008866] HOST ALERT: XXX;DOWN;SOFT;1;CRITICAL - 216.99.17.101: rta nan, lost 100%
nagios-08-27-2021-00.log-[1630008866] NDO-3: Ended timed_event thread
nagios-08-27-2021-00.log:[1630008866] Caught SIGSEGV, shutting down...
--
nagios-08-30-2021-00.log-[1630251627] SERVICE DOWNTIME ALERT: XXX;STARTED; Service has entered a period of scheduled downtime
nagios-08-30-2021-00.log-[1630251627] SERVICE DOWNTIME ALERT: XXX;STARTED; Service has entered a period of scheduled downtime
nagios-08-30-2021-00.log-[1630251627] SERVICE DOWNTIME ALERT: XXX;STARTED; Service has entered a period of scheduled downtime
nagios-08-30-2021-00.log-[1630251628] NDO-3: Ended host_check thread
nagios-08-30-2021-00.log-[1630251629] NDO-3: Ended host_status thread
nagios-08-30-2021-00.log:[1630251647] Caught SIGSEGV, shutting down...
--
nagios-08-30-2021-00.log-[1630286027] SERVICE ALERT: XXX;CRITICAL;SOFT;4;SNMP CRITICAL - *"Night"*
nagios-08-30-2021-00.log-[1630286028] SERVICE ALERT: XXX;OK;HARD;3;SNMP OK - "Night"
nagios-08-30-2021-00.log-[1630286028] SERVICE ALERT: XXX;CRITICAL;HARD;1;CRITICAL - Plugin timed out while executing system call
nagios-08-30-2021-00.log-[1630286028] SERVICE ALERT: XXX;CRITICAL;SOFT;1;SNMP CRITICAL - *"Night"*
nagios-08-30-2021-00.log-[1630286028] SERVICE ALERT: XXX;CRITICAL;SOFT;3;SNMP CRITICAL - *"Night"*
nagios-08-30-2021-00.log:[1630286028] Caught SIGSEGV, shutting down...
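
The bracketed values in the log are Unix epoch seconds, which makes it awkward to line the crashes up against the times the config was applied. Below is a minimal sketch, not anything that ships with Nagios XI, that pulls the SIGSEGV lines out of grep output like the above and converts the timestamps to UTC. The names `SIGSEGV_RE`, `crash_times` and `SAMPLE` are made up for illustration, and it assumes the `file:[epoch] message` layout that grep produces for matching lines (context lines use `-` instead of `:` and are skipped).

```python
import re
from datetime import datetime, timezone

# grep marks matching lines with ':' after the file name and context
# lines (from -B 5) with '-', so anchoring on ':[' keeps only the
# SIGSEGV lines themselves.
SIGSEGV_RE = re.compile(r"^(?P<file>\S+?\.log):\[(?P<epoch>\d+)\] Caught SIGSEGV")

# Two lines copied from the excerpt above: one context line, one match.
SAMPLE = [
    "nagios-08-19-2021-00.log-[1629316295] NDO-3: Ended timed_event thread",
    "nagios-08-19-2021-00.log:[1629316295] Caught SIGSEGV, shutting down...",
]


def crash_times(lines):
    """Return a list of (log file name, UTC datetime) for each SIGSEGV line."""
    crashes = []
    for line in lines:
        match = SIGSEGV_RE.match(line)
        if match:
            epoch = int(match.group("epoch"))
            when = datetime.fromtimestamp(epoch, tz=timezone.utc)
            crashes.append((match.group("file"), when))
    return crashes


if __name__ == "__main__":
    for log_file, when in crash_times(SAMPLE):
        print(log_file, when.isoformat())
```

Feeding it the full egrep output (for example, read from a saved text file) should give one line per crash, which can then be compared with the apply-config times recorded elsewhere.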