Hi Perry,
I have not heard back from you and wanted to see if you had any luck with the logs and found anything.
We had another similar issue: the Nagios backups ran, and I guess they restart Nagios during the backup process.
On Saturday morning the backups kicked off, but Nagios didn't start afterward. We tried to start it in the Web UI and kept getting an error.
I tried to start it manually on the command line with "systemctl start nagios", and this also kept failing.
We were able to get Nagios going by doing a dummy config push. Basically, we open a config for editing, make no changes, and close it. At this point we get a message that the config needs to be applied. We apply the config and Nagios starts working again. Here is the log file from when it shut down until the startup failed.
Caught Sig Term, Shutting Down - Unknown Cause
Re: Caught Sig Term, Shutting Down - Unknown Cause
Hello @mvikhman
Thanks for checking in on this issue. Since it has been a while, let's review.
- Looking through the output, we see that the "pre-flight check" is not failing with errors.
Code:
/usr/local/nagios/bin/nagios -vvv /usr/local/nagios/etc/nagios.cfg
- ModGearman
- "Puppet Agent Freshness" on host is throwing "duplicate definitions found", but we don't see issues other than that.
- Memory seems okay, looking at the System Profile snapshot.
- From the latest review, there appears to be a 'servicedependencies.cfg' error, but that is probably a housekeeping issue, since the pre-flight check is not reporting it as a show stopper.
- The messages "Unit nagios.service entered failed state" and "systemd: nagios.service failed" are not very telling as to the cause. When we see "Caught Sig Term, Shutting Down - Unknown Cause", this typically means that there is a resource issue.
- To verify further, we want to see what nagios.service is doing on restart or startup.
Code:
journalctl -xefu nagios.service
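To line up the shutdowns with the backup window, something like the sketch below could pull the SIGTERM entries out of the Nagios log and print them with readable UTC timestamps. It assumes GNU date, log lines of the form "[epoch] message", and the usual log path /usr/local/nagios/var/nagios.log; the function name sigterm_events is only illustrative and is not part of Nagios.

```shell
# Print every SIGTERM shutdown entry found in a Nagios log, with a UTC time.
# Assumption: each log line looks like "[<epoch seconds>] <message>".
sigterm_events() {
    local logfile epoch msg line
    logfile="${1:-/usr/local/nagios/var/nagios.log}"
    # Case-insensitive match covers "Caught SIGTERM" and "Caught Sig Term".
    grep -iE 'caught sig ?term' "$logfile" | while IFS= read -r line; do
        # Take the digits between the leading "[" and the first "]".
        epoch="${line#\[}"
        epoch="${epoch%%\]*}"
        # Everything after "] " is the message itself.
        msg="${line#*\] }"
        printf '%s  %s\n' "$(date -u -d "@${epoch}" '+%Y-%m-%d %H:%M:%S')" "$msg"
    done
}
```

Running sigterm_events with no argument reads the assumed default log; pass a path to check a rotated or archived log instead.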
In your latest update, you stated that a "dummy config push" helped resolve the issue. With that in mind, we want to have you reindex the Core configs.
Here are the steps to reindex the Core Configuration Manager (CCM) configs:
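- 1: List all running /bin/nagios processes:
Code:
ps -aux | grep -E '/bin/nagios'
- 2: Kill any nagios processes that are still running:
Code:
pkill -f /bin/nagios
- 3: Remove any leftover files from the import directory:
Code:
rm -rf /usr/local/nagios/etc/import/*
- 4: Restart nagios.service:
Code:
systemctl restart nagios
- 5: Nagios XI web console ==> Core Configuration Manager (CCM) ==> Config File Management ==> [Delete Files] ==> [Write Files] ==> [Verify Files]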
- 6: Core Configuration Manager (CCM) ==> Under Quick Tools ==> "Apply Configuration"
- 7: Restart nagios.service with the terminal command:
Code:
systemctl restart nagios
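For a production box, the terminal commands above could be wrapped in a dry-run script so they can be reviewed before anything is killed or deleted. The following is only a sketch: the DRY_RUN variable and the run and reset_nagios_processes names are hypothetical helpers, not Nagios XI tooling, and the CCM steps in the web console still have to be done by hand.

```shell
# DRY_RUN=1 (the default in this sketch) only prints each command.
# Set DRY_RUN=0 and run as root to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'would run: %s\n' "$*"
    else
        "$@"
    fi
}

reset_nagios_processes() {
    local import_dir
    import_dir="${1:-/usr/local/nagios/etc/import}"
    # List running nagios processes; "[/]" keeps grep from matching itself.
    run sh -c "ps aux | grep -E '[/]bin/nagios'"
    # Kill any nagios processes that are still running.
    run pkill -f /bin/nagios
    # Clear the import directory only if it exists and contains something.
    if [ -d "$import_dir" ] && [ -n "$(ls -A "$import_dir" 2>/dev/null)" ]; then
        run rm -rf "$import_dir"/*
    else
        printf 'skip: %s is empty or missing\n' "$import_dir"
    fi
    # Restart the service.
    run systemctl restart nagios
    printf 'next: finish the CCM steps in the web console, then restart nagios again\n'
}
```

Calling reset_nagios_processes with the default DRY_RUN=1 prints the plan without touching the system, which can be pasted into a change request.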
Verify that the hosts and services look good, and verify that there are no errors in Core by running:
Code:
/usr/local/nagios/bin/nagios -vvv /usr/local/nagios/etc/nagios.cfg
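The verification output can be long, so a small filter like the one below may help reduce it to a pass/fail summary. This is a sketch that assumes the output ends with "Total Warnings:" and "Total Errors:" lines; preflight_summary is a made-up helper name.

```shell
# Read pre-flight check output on stdin and print "warnings=N errors=M".
# Exit status: 0 when errors is zero, 1 when errors were reported,
# 2 when no summary lines were found at all.
preflight_summary() {
    awk '
        /^Total Warnings:/ { w = $3 }
        /^Total Errors:/   { e = $3 }
        END {
            if (w == "" || e == "") { print "no summary found"; exit 2 }
            printf "warnings=%d errors=%d\n", w, e
            exit (e > 0)
        }'
}
```

It would be used by piping the verification command into it, for example: /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg | preflight_summary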
Let us know how things are looking,
Perry
Re: Caught Sig Term, Shutting Down - Unknown Cause
Hi Perry,
Thank you for the feedback.
Before I execute the procedure, since this is our production system, I am a little nervous about running this step:
"Nagios XI web console ==> Core Configuration Manager (CCM) ==> Config File Management ==> [Delete Files] ==> [Write Files] ==> [Verify Files]"
Can you provide some information on what this is doing on the back end, and how long a Nagios outage, if any, will last?
Also, there are no files in /usr/local/nagios/etc/import/, so nothing to delete.
I just need to tell my management what can potentially break and how long things can be offline, and whether there is a rollback procedure if this fails.
Thank you.
Michael.
Re: Caught Sig Term, Shutting Down - Unknown Cause
Hello @mvikhman
[Delete Files] will delete the "Core" configs, and then [Write Files] will rewrite the configs from the Nagios database.
[Verify Files] will check for errors.
It is a good idea to take a snapshot or run a backup (/usr/local/nagiosxi/scripts/backup_xi.sh) first.
Downtime is minimal: when the nagios service is restarted, the checks will stop for a few seconds.
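As an extra safety net alongside the snapshot or backup_xi.sh, the on-disk config files could be archived right before [Delete Files]. The sketch below is an assumption on my part rather than an official rollback procedure: snapshot_configs is a hypothetical helper, /usr/local/nagios/etc is assumed to be the config directory, and the archive covers only the flat files, not the database.

```shell
# Archive a config directory into a timestamped tarball and print its path.
# Arguments: source directory (assumed default: /usr/local/nagios/etc)
#            destination directory for the archive (default: /tmp)
snapshot_configs() {
    local src dest_dir archive
    src="${1:-/usr/local/nagios/etc}"
    dest_dir="${2:-/tmp}"
    archive="${dest_dir}/nagios-etc-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C keeps paths in the archive relative to the parent of the source dir.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    printf '%s\n' "$archive"
}
```

If the rewritten files ever needed to be compared or put back, the archive could be unpacked with tar -xzf into the parent directory it was taken from.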
Thanks,
Perry