Monitoring engine stops some time after apply changes

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Locked
eloyd
Cool Title Here
Posts: 2129
Joined: Thu Sep 27, 2012 9:14 am
Location: Rochester, NY

Post by eloyd »

We have a customer that seems to have problems some time after they apply configs following CCM changes in Nagios XI 5.8.3. A short time (tens of minutes at most) after they make changes and apply config, the Nagios monitoring engine craps out and they have to start it back up again. We put a band-aid on it that works around the problem but doesn't fix the underlying issue. A simple "service nagios restart" solves the problem, and it will run forever until the next changes in CCM are applied. Note that the changes are applied, and the system runs for a while after, but then craps out. Restarting brings it back up and running properly, so it doesn't seem to be related to the changes themselves, per se.

I did some digging and found that there's a SIGSEGV during each of these events. I'm including a log that shows the output of an "egrep -B 5 SIGSEGV" for the log files from August (which is when the problems seem to have started). I've replaced all Host/Services with XXX, as there doesn't seem to be a pattern to them (and I need to obfuscate some information).

Code: Select all

nagios-08-19-2021-00.log-[1629316295] NDO-3: Ended service_check thread
nagios-08-19-2021-00.log-[1629316295] NDO-3: Ended comment thread
nagios-08-19-2021-00.log-[1629316295] HOST ALERT: XXX;DOWN;SOFT;1;CRITICAL - 10.123.32.81: rta nan, lost 100%
nagios-08-19-2021-00.log-[1629316295] SERVICE ALERT: XXX;CRITICAL;HARD;1;CRITICAL - Plugin timed out while executing system call
nagios-08-19-2021-00.log-[1629316295] NDO-3: Ended timed_event thread
nagios-08-19-2021-00.log:[1629316295] Caught SIGSEGV, shutting down...
--
nagios-08-19-2021-00.log-[1629316350] SERVICE ALERT: XXX;OK;SOFT;2;SNMP OK - 1
nagios-08-19-2021-00.log-[1629316350] SERVICE ALERT: XXX;OK;SOFT;2;SNMP OK - 1
nagios-08-19-2021-00.log-[1629316350] SERVICE ALERT: XXX;OK;SOFT;2;SNMP OK - 1
nagios-08-19-2021-00.log-[1629316350] SERVICE ALERT: XXX;OK;SOFT;2;SNMP OK - 1
nagios-08-19-2021-00.log-[1629316350] NDO-3: Ended timed_event thread
nagios-08-19-2021-00.log:[1629316350] Caught SIGSEGV, shutting down...
--
nagios-08-22-2021-00.log-[1629481479] SERVICE ALERT: XXX;CRITICAL;HARD;1;CRITICAL - Plugin timed out while executing system call
nagios-08-22-2021-00.log-[1629481479] SERVICE ALERT: XXX;CRITICAL;HARD;1;CRITICAL - Plugin timed out while executing system call
nagios-08-22-2021-00.log-[1629481479] NDO-3: Ended downtime thread
nagios-08-22-2021-00.log-[1629481479] HOST ALERT: XXX;DOWN;SOFT;2;CRITICAL - 10.208.129.124: rta nan, lost 100%
nagios-08-22-2021-00.log-[1629481479] NDO-3: Ended timed_event thread
nagios-08-22-2021-00.log:[1629481479] Caught SIGSEGV, shutting down...
--
nagios-08-27-2021-00.log-[1629974636] SERVICE ALERT: XXX;OK;HARD;3;SNMP OK - "None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None,None"
nagios-08-27-2021-00.log-[1629974636] SERVICE ALERT: XXX;OK;HARD;3;SNMP OK - "False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False"
nagios-08-27-2021-00.log-[1629974636] SERVICE ALERT: XXX;OK;HARD;3;SNMP OK - "False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False"
nagios-08-27-2021-00.log-[1629974636] NDO-3: Ended service_check thread
nagios-08-27-2021-00.log-[1629974636] NDO-3: Ended downtime thread
nagios-08-27-2021-00.log:[1629974636] Caught SIGSEGV, shutting down...
--
nagios-08-27-2021-00.log-[1630008866] NDO-3: Ended notification thread
nagios-08-27-2021-00.log-[1630008866] NDO-3: Ended service_check thread
nagios-08-27-2021-00.log-[1630008866] NDO-3: Ended downtime thread
nagios-08-27-2021-00.log-[1630008866] HOST ALERT: XXX;DOWN;SOFT;1;CRITICAL - 216.99.17.101: rta nan, lost 100%
nagios-08-27-2021-00.log-[1630008866] NDO-3: Ended timed_event thread
nagios-08-27-2021-00.log:[1630008866] Caught SIGSEGV, shutting down...
--
nagios-08-30-2021-00.log-[1630251627] SERVICE DOWNTIME ALERT: XXX;STARTED; Service has entered a period of scheduled downtime
nagios-08-30-2021-00.log-[1630251627] SERVICE DOWNTIME ALERT: XXX;STARTED; Service has entered a period of scheduled downtime
nagios-08-30-2021-00.log-[1630251627] SERVICE DOWNTIME ALERT: XXX;STARTED; Service has entered a period of scheduled downtime
nagios-08-30-2021-00.log-[1630251628] NDO-3: Ended host_check thread
nagios-08-30-2021-00.log-[1630251629] NDO-3: Ended host_status thread
nagios-08-30-2021-00.log:[1630251647] Caught SIGSEGV, shutting down...
--
nagios-08-30-2021-00.log-[1630286027] SERVICE ALERT: XXX;CRITICAL;SOFT;4;SNMP CRITICAL - *"Night"*
nagios-08-30-2021-00.log-[1630286028] SERVICE ALERT: XXX;OK;HARD;3;SNMP OK - "Night"
nagios-08-30-2021-00.log-[1630286028] SERVICE ALERT: XXX;CRITICAL;HARD;1;CRITICAL - Plugin timed out while executing system call
nagios-08-30-2021-00.log-[1630286028] SERVICE ALERT: XXX;CRITICAL;SOFT;1;SNMP CRITICAL - *"Night"*
nagios-08-30-2021-00.log-[1630286028] SERVICE ALERT: XXX;CRITICAL;SOFT;3;SNMP CRITICAL - *"Night"*
nagios-08-30-2021-00.log:[1630286028] Caught SIGSEGV, shutting down...
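
For anyone triaging a similar dump, here is a minimal Python sketch that reads `egrep -B 5 SIGSEGV` output like the above, converts each crash's epoch timestamp to UTC, and lists which NDO-3 threads logged "Ended" in the context lines just before it. The function name `summarize_crashes` and the regular expressions are illustrative assumptions based only on the line layout shown here; nothing in it ships with Nagios XI.

```python
import re
from datetime import datetime, timezone

# Assumed layout of `egrep -B 5` output: "<file>.log" followed by "-" for a
# context line or ":" for the matching line, then "[epoch] message".
LINE_RE = re.compile(r"^(?P<file>\S+?\.log)[-:]\[(?P<epoch>\d+)\] (?P<msg>.*)$")
NDO_RE = re.compile(r"^NDO-3: Ended (\w+) thread$")


def summarize_crashes(grep_output):
    """Return one dict per 'Caught SIGSEGV' line found in the grep output."""
    events = []
    context = []  # messages seen since the last separator or crash
    for line in grep_output.splitlines():
        if line.strip() == "--":  # grep's separator between match groups
            context = []
            continue
        m = LINE_RE.match(line)
        if not m:
            continue
        msg = m.group("msg")
        if "Caught SIGSEGV" in msg:
            when = datetime.fromtimestamp(int(m.group("epoch")), tz=timezone.utc)
            threads = [t.group(1) for t in map(NDO_RE.match, context) if t]
            events.append({
                "file": m.group("file"),
                "time_utc": when.isoformat(),
                "ndo_threads_ended": threads,
            })
            context = []
        else:
            context.append(msg)
    return events
```

Run against the first group in the log above, this should report `service_check`, `comment`, and `timed_event` as the NDO-3 threads that ended just before the crash, which makes it easier to see whether the same threads show up every time.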
Eric Loyd • http://everwatch.global • 844.240.EVER • @EricLoyd • I'm a Nagios Fanatic!
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Post by pbroste »

Hello @eloyd

Thanks for reaching out. It looks like "Caught SIGSEGV" happens most of the time right after an NDO-3 event. Do you see issues in the database log (/var/log/mysqld.log)?

A database repair would be a good start. Do you also see anything of concern in the Core Configuration Manager pre-flight check?

Code: Select all

/usr/local/nagios/bin/nagios -vvv /usr/local/nagios/etc/nagios.cfg

Please let us know how things look,
Perry

Post by eloyd »

Pre-flight check is clean. Database logs are clean, but it hasn't happened for a couple of weeks at this point. I'll see what we can find on that track. Thanks.

Post by pbroste »

Hello @eloyd

Thanks for following up. Apply Configuration can take some time because it needs to wait for the current checks to finish and stop the Nagios process. It also writes the current configuration from the database out to files, runs a verification on those files, and then restarts the nagios service. It's a pretty I/O-intensive operation.

I also want to provide several support documents to help optimize your Nagios XI environment.
Please let us know if you have further questions,
Perry

Post by eloyd »

Thanks. I appreciate the note, but I'm very aware of what the Apply Config does. :D The problem is not directly related to the Apply Config. It only occurs if changes are made to existing entities and the config is then applied, and even then the crash comes maybe 20-30 minutes later. If you add a new entity and apply config, it does not happen.

For what it's worth, it's not the message queue issue that you pointed to in your URL. Also, we did the PHP optimization when we first set this customer up. I've just never seen a SIGSEGV 20-30 minutes AFTER the event that most likely triggered it (the Apply Config). So I opened this topic for general tracking and conversation.

We did the database repair as part of our normal troubleshooting, and the issue has not occurred again, but we're keeping an eye on it.
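
To put a number on that delay, one rough approach is to measure the time between the engine's most recent startup entry and the next SIGSEGV entry in nagios.log. The sketch below assumes Nagios Core writes a startup line of the form `[epoch] Nagios <version> starting... (PID=<pid>)`; check your own logs before relying on that pattern. `crash_delays` is a hypothetical helper name, not part of any Nagios tooling.

```python
import re

# Assumed layout of raw nagios.log entries: "[epoch] message".
ENTRY_RE = re.compile(r"^\[(\d+)\] (.*)$")
# Assumed startup message; verify against your own log before trusting it.
START_RE = re.compile(r"^Nagios \S+ starting\.\.\.")


def crash_delays(log_lines):
    """Return (start_epoch, crash_epoch, minutes) for each SIGSEGV that
    follows a startup entry. Crashes with no preceding start are skipped."""
    delays = []
    last_start = None
    for line in log_lines:
        m = ENTRY_RE.match(line)
        if not m:
            continue
        epoch, msg = int(m.group(1)), m.group(2)
        if START_RE.match(msg):
            last_start = epoch
        elif "Caught SIGSEGV" in msg and last_start is not None:
            delays.append((last_start, epoch, (epoch - last_start) / 60))
            last_start = None  # wait for the next restart
    return delays
```

Because Apply Config restarts the engine, a cluster of delays in the 20-30 minute range would support the pattern described here, while widely scattered values would point elsewhere.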

Post by pbroste »

Hello @eloyd

Great, it sounds like things are rolling along. You have an excellent understanding of what is going on and will reach out if problems persist, so I will go ahead and lock this post for the time being. Please let us know if you need anything going forward.

Thanks,
Perry

Post by pbroste »

Hello @eloyd

Thanks for pinging me on this.

Perry