Caught Sig Term, Shutting Down - Unknown Cause

Posted: Mon Aug 23, 2021 4:21 pm
by JFox
Linux Distribution and version? CentOS 7, 64 bit, manual install
Are there special configurations on your system, ie; is Gnome installed? Are you using a proxy? Are you using SSL?

I found the Monitoring Engine Event Queue dashlet within the Monitoring Process\Process info page was empty and managed to catch the following in the event log history.

2021-08-23 20:05:16 - Event broker module '/usr/lib64/mod_gearman/mod_gearman_nagios4.o' initialized successfully
2021-08-23 20:05:16 - mod_gearman: initialized version 3.0.7 (libgearman 0.33)
2021-08-23 20:05:16 - Event broker module '/usr/local/nagios/bin/ndo.so' initialized successfully.
2021-08-23 20:05:10 - Successfully shutdown... (PID...)
2021-08-23 20:05:10 - Caught SIGTERM, shutting down

This was followed by several minutes of initial state declarations and then checks. We have had several instances of this, and most of the time we are able to use the Monitoring Engine Process dashlet as an end user (not a CCM user or admin) to restart the process. At other times it requires a push from CCM to restart the Nagios process. I will have my team look for more information in the logs, but any suggestions are helpful. We have other issues that may or may not be related, so I will start those in a different thread. This is our most common issue, and it is a hard stop on our monitoring service until someone notices it manually and restarts the process.


----

Nagios XI - System Info
Nagios XI version: 5.8.3
Release info: 3.10.0-1062.9.1.el7.x86_64 x86_64
CentOS Linux release 7.0.1406 (Core)
Gnome is not installed
Apache Information
PHP Version: 5.4.16
Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36
Server Port: 443
Date/Time
PHP Timezone: UTC
PHP Time: Mon, 23 Aug 2021 21:23:08 +0000
System Time: Mon, 23 Aug 2021 21:23:08 +0000
Nagios XI Data
Install Type: manual/unknown

6592 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
9832 /usr/local/nagios/bin/npcd -d -f /usr/local/nagios/etc/pnp/npcd.cfg
CPU Load 15: 0.66
Total Hosts: 892
Total Services: 5881


Nagios XI Components, abbreviated

ccm 3.1.1
deploydashboard 1.3.2
deploynotification 1.3.4
escalationwizard 1.5.1
freevariabletab 1.1.0
globaleventhandler 1.3.0
graphexplorer 2.3.0
helpsystem 2.0.1
ldap_ad_integration 1.2.2
massacknowledge 2.2.2
massimmediatecheck 1.0.2
metrics 1.3.5
nagiosbpi 3.0.3
nagiosim 2.2.6
nagiosna 1.4.4

Re: Caught Sig Term, Shutting Down - Unknown Cause

Posted: Tue Aug 24, 2021 10:39 am
by pbroste
Hello @jfox

Thanks for reaching out and providing the info, but it looks like we will need to get the System Profile from you as well so we can see what is going on.

Let's start off with a check on the nagios.service:

Code:

journalctl -u nagios.service -f > results.txt &
Make a note of the PID number that is printed; we will end this process later by killing it.

Code:

systemctl restart nagios
Let's find out which event brokers are configured:

Code:

grep broker /usr/local/nagios/etc/nagios.cfg
To send us your system profile:
  • Login to the Nagios XI GUI using a web browser.
  • Click the "Admin" > "System Profile" Menu
  • Click the "Download Profile" button
  • Save the profile.zip file and share in a private message or upload it to the post/ticket
Now let's kill that process that we started earlier by killing the PID.

Code:

kill ##pid_number_from_the_journalctl_command###
Please send the results via Private Message.
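
If copying the PID by hand is inconvenient, the same capture can be scripted. This is only a minimal sketch: capture_during is a hypothetical helper name (not part of Nagios XI), and it assumes the follower command's arguments contain no spaces that need quoting. It relies on the shell's $! variable, which holds the PID of the most recent background job.

```shell
#!/bin/sh
# Sketch: capture a "follow" command's output while some action runs.
# capture_during is a hypothetical helper name, not part of Nagios XI.
capture_during() {
    outfile=$1      # where the follower's output is written
    follow_cmd=$2   # long-running command, e.g. "journalctl -u nagios.service -f"
    action_cmd=$3   # command to observe, e.g. "systemctl restart nagios"

    # exec replaces the wrapper shell, so $! is the PID of the follower itself.
    sh -c "exec $follow_cmd" > "$outfile" 2>&1 &
    follow_pid=$!

    action_status=0
    sh -c "$action_cmd" || action_status=$?

    sleep 1                                   # let the follower write trailing lines
    kill "$follow_pid" 2>/dev/null || true    # stop the follower; no PID to copy by hand
    wait "$follow_pid" 2>/dev/null || true
    return "$action_status"
}

# Intended use on the XI server, as root (left commented out here):
# capture_during results.txt "journalctl -u nagios.service -f" "systemctl restart nagios"
# grep broker /usr/local/nagios/etc/nagios.cfg
```

The function returns the exit status of the observed action, so a failed restart is still visible to the caller.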

Thanks,
Perry

Re: Caught Sig Term, Shutting Down - Unknown Cause

Posted: Tue Aug 24, 2021 11:19 am
by JFox
Perry, thank you for the update. I will reach out to my engineers to get you the information requested. I sent you a PM with a trimmed system profile.

Re: Caught Sig Term, Shutting Down - Unknown Cause

Posted: Wed Aug 25, 2021 10:05 am
by pbroste
Hello @jfox

Thanks for following up and providing the System Profile. After reviewing the events and logs we don't see anything that sticks out.

Typically, when we see "Caught SIGTERM, shutting down", this is an error (signal) caused by an invalid memory reference or a segmentation fault.

The most common cause of this would be the server running out of memory. From the system information provided, we see that the environment is running Nagios-based applications with the Gearman module. We would like you to take a look at /var/log/gearmand/gearmand.log to verify that nothing related shows up there. Otherwise, from the snapshot that we received, the RAM usage looks good, unless you see anything different ('watch -n 5 free') on your end.
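
One way to check the out-of-memory theory after the fact is to search the system log for kernel OOM-killer messages. The sketch below is an assumption-laden example: oom_events is a hypothetical helper name, and the two patterns are based on typical kernel messages, whose exact wording can vary between kernel versions.

```shell
#!/bin/sh
# Sketch: list kernel OOM-killer messages in a syslog-style file.
# oom_events is a hypothetical helper name; the patterns are assumptions
# based on typical kernel wording and may need adjusting for your kernel.
oom_events() {
    logfile=${1:-/var/log/messages}
    # -i: ignore case, -E: extended regex with two alternatives
    grep -i -E 'invoked oom-killer|out of memory: kill' "$logfile"
}

# Intended use on the XI server (left commented out here):
# oom_events /var/log/messages
# grep -i -E 'error|fail' /var/log/gearmand/gearmand.log
```

grep exits with status 1 when nothing matches, so an empty result with a non-zero status simply means no OOM messages were found in that file.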

We also want to provide the following support articles that reference performance optimizations:

Thanks,
Perry

Re: Caught Sig Term, Shutting Down - Unknown Cause

Posted: Mon Sep 13, 2021 2:12 pm
by mvikhman
Hi Perry,
I want to follow up on this issue. One of the things we see periodically is that when we make a change in Nagios XI and apply the config, our Monitoring Engine Event Queue shows zero in the queue. We then need to restart the engine or Nagios to make Nagios start checking again.
My question is: the next time we see the Monitoring Engine Event Queue stop showing checks, what do you recommend we check (logs, memory, etc.) to see why the monitoring engine stopped running? At this point we are having a sporadic issue but don't know what is causing it.

Thank you for your help.

Mike

Re: Caught Sig Term, Shutting Down - Unknown Cause

Posted: Tue Sep 14, 2021 9:08 am
by pbroste
Hello @mvikhman

Thanks for following up; there are several things to look at. First, '/var/log/messages' or '/var/log/syslog', depending on the distro. Second, '/usr/local/nagios/var/nagios.log'. Third, the output of ipcs (we are interested in the queues and the overall picture).

All of these will help us understand what is going on.
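
SIGTERM is normally sent by another process (a service restart, a kill), so the lines just before each occurrence in nagios.log are a reasonable place to start looking for what requested the stop. A minimal sketch for pulling that context out follows; sigterm_context is a hypothetical helper name, and the default of 5 context lines is an arbitrary choice.

```shell
#!/bin/sh
# Sketch: show log lines surrounding each "Caught SIGTERM" entry.
# sigterm_context is a hypothetical helper name.
sigterm_context() {
    logfile=${1:-/usr/local/nagios/var/nagios.log}
    lines=${2:-5}   # number of context lines before and after each match
    # -n prefixes line numbers; -C prints context around each match
    grep -n -C "$lines" 'Caught SIGTERM' "$logfile"
}

# Intended use on the XI server (left commented out here):
# sigterm_context /usr/local/nagios/var/nagios.log 10
# ipcs -q    # message queues only
# ipcs -a    # overall picture: queues, shared memory, semaphores
```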

Thanks,
Perry

Re: Caught Sig Term, Shutting Down - Unknown Cause

Posted: Wed Sep 22, 2021 1:36 pm
by mvikhman
Hi Perry,
We had this issue again: we applied the config, and Nagios seemed to stop running checks until we applied the config again without making any changes.
I have attached /var/log/messages, /usr/local/nagios/var/nagios.log and a screenshot of the config snapshots.
A config change was made and applied on 9/21/2021 at 15:35:25 UTC.
However, we didn't notice that Nagios was not updating checks until 9/21/2021 16:50:26 UTC, at which time we applied the config again without actually making any config changes. That seems to restart Nagios, and the checks start to process.
We noticed that starting at 15:35:25 UTC (when the config was first applied), there was no data in the "Scheduled Events Over Time" dashlet. Only after we applied the config a second time at 16:50 did we see data in "Scheduled Events Over Time".
Please review the logs and let me know if you can find anything.
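
A window like this can also be looked for directly in nagios.log. The sketch below assumes each log line starts with an epoch timestamp in square brackets, and log_gaps is a hypothetical helper name. It prints every pair of consecutive entries that are at least a given number of seconds apart. A quiet log is not proof that checks stopped, since not every check is logged, but a long gap right after an apply is a useful pointer.

```shell
#!/bin/sh
# Sketch: report long silences between consecutive nagios.log entries.
# log_gaps is a hypothetical helper name; it assumes lines look like
# "[<epoch seconds>] message".
log_gaps() {
    logfile=${1:-/usr/local/nagios/var/nagios.log}
    min_gap=${2:-600}   # minimum gap to report, in seconds
    awk -v min="$min_gap" '
        substr($0, 1, 1) == "[" {
            # text between "[" and the first "]"
            ts = substr($0, 2, index($0, "]") - 2)
            if (ts !~ /^[0-9]+$/) next
            # output columns: previous timestamp, current timestamp, gap in seconds
            if (prev != "" && ts - prev >= min) print prev, ts, ts - prev
            prev = ts
        }' "$logfile"
}

# Intended use on the XI server (left commented out here):
# log_gaps /usr/local/nagios/var/nagios.log 900
# date -u -d @<epoch>   # GNU date: convert a reported timestamp to UTC
```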

Mike

Re: Caught Sig Term, Shutting Down - Unknown Cause

Posted: Thu Sep 23, 2021 11:18 am
by pbroste
Hello @mvikhman

Could you resend the System Profile from the previous post in the thread? Please go ahead and send via Private Message [PM].

Thanks,
Perry

Re: Caught Sig Term, Shutting Down - Unknown Cause

Posted: Wed Sep 29, 2021 9:50 am
by mvikhman
Hi Perry,
I uploaded profile.zip in a private message; please confirm you have received it.

Mike

Re: Caught Sig Term, Shutting Down - Unknown Cause

Posted: Wed Sep 29, 2021 2:42 pm
by pbroste
Hello @mvikhman

Thanks, I have received it and will follow up after review.
Perry