Nagios Support Forum

Posted: **Sat Jul 13, 2013 1:29 pm**

Our system has been using Nagios 3.2.0 for 3 years without issues. I've since set up 2 other Nagios host servers and am mostly familiar with all the configurations and files associated with it. We use 1 server as the host and run the nrpe daemon on each of our client boxes to handle service calls. Again, to note, this was installed and set up several years ago and has been working successfully for years.

However, a day ago the nagios daemon died on the host server without logging any errors. The odd thing is that the web interface was still working, so no one even noticed until a check in a "warning" state that should have cleared up never recovered. Checking the server, we saw that the nagios daemon was not running. We restarted it and a recovery message came out, but the check remained in the same state on the web interface. Checking again showed the daemon had died shortly after starting, living just long enough to do the initial check. Restarting it again sent out another recovery message, but then it crashed and was left in the same state as before.

I've rebooted the system, and I've reloaded and restarted the nagios daemons. I've restarted the httpd service. I've triple checked all relevant config files. Nothing has changed in the config files in the past few days, with the exception of my troubleshooting attempts. It seemed to be related to running manual checks through the cgi (based on the "premature end of script headers" errors in the httpd error_log file), but I removed the objects.cache and retention.dat files to see if it would help, and when I restart the daemon, it crashes during the first service check. As a test, I shut off all active checks on services and restarted, and it stays alive and checks the status of all of the hosts. If I try to run a service check at that point, the nagios daemon dies and the service check hangs.

I've done extensive searches for this problem and have tried everything suggested, from checking config files, to restarting services, to removing files and checking permissions. In the case of permissions/config files everything has been set up correctly for years, so those did not pan out. The only real clue I have right now is that the nagios daemon will only stay alive until a service check is attempted, either through the cgi or an automatic one set up in my services.cfg file.

I thought maybe some sort of corrupt service check was being run, but it happens no matter what the check is and no matter whether it's an automatic or a forced check. Has anyone seen this happen to their nagios daemon? Is there any way to force it to log when it dies? Is there some reason why the web interface still acts like it's working even though the nagios daemon is not running?

Posted: **Mon Jul 15, 2013 10:28 am**

The webui should only continue working if there is a nagios process running. Otherwise looking at any page that it would pull objects from should show a pretty obvious error. Are you certain that there was not a second nagios process running? As for the checks causing nagios to exit, do you happen to have selinux enabled or something else that might restrict a process from forking? An strace of nagios when attempting to execute a check might help as well.

Code:

sestatus
strace -f -o /tmp/nagios.strace /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
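
To answer the "second nagios process" question, something like the following might help. It is only a rough sketch: `count_nagios_daemons` is a made-up helper name, it assumes a procps-style `ps`, and it assumes a daemonized nagios shows up with a parent PID of 1. Nagios also forks short-lived children to run checks, so only processes whose parent is PID 1 are counted here.

```shell
# Hypothetical helper: count daemonized processes named "nagios" in `ps` output.
# Reads lines of "PID PPID COMMAND" from stdin and counts those with PPID 1.
count_nagios_daemons() {
    awk '$2 == 1 && $3 == "nagios" { n++ } END { print n + 0 }'
}

# Usage (assumes procps `ps`; the trailing "=" suppresses column headers):
#   ps -eo pid=,ppid=,comm= | count_nagios_daemons
```

A result above 1 would suggest an older daemon is still running next to the one you just started.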

Posted: **Mon Jul 15, 2013 11:36 am**

We got some fresh eyes on it this morning. While there was nothing in the logs to indicate what was happening, and the web interface continued 'working' when the daemon died (i.e. you could get everywhere but could not apply changes to notifications, checks, comments), we did figure out the cause of the problem.

Apparently, because we had "process_performance_data" turned on in the nagios.cfg file, nagios was logging performance data to a file in /tmp. That file had grown to 2GB, which was too large for nagios to read or write, so we assume the daemon died each time it tried to open it.
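
For anyone who wants to catch this before it happens, here is a rough sketch of a size check. The function names and the 90% threshold are made up for illustration, `stat -c %s` assumes GNU coreutils, and the 2^31 - 1 byte limit is an assumption about the cause (it is the largest file size a program built without large-file support can address), not something we confirmed.

```shell
# Largest offset a signed 32-bit off_t can hold: 2^31 - 1 bytes.
LIMIT=2147483647

# Hypothetical helper: succeed if a size (bytes) is at or above a
# percentage of LIMIT. $1 = size in bytes, $2 = percent (default 90).
near_limit() {
    size=$1
    pct=${2:-90}
    # Divide before multiplying so the arithmetic stays below 2^31.
    [ "$size" -ge $(( LIMIT / 100 * pct )) ]
}

# Hypothetical helper: print a warning if the given file is near the limit.
check_file() {
    f=$1
    [ -f "$f" ] || return 0                 # nothing to check if it does not exist
    size=$(stat -c %s "$f") || return 1     # GNU stat: file size in bytes
    if near_limit "$size"; then
        echo "WARNING: $f is $size bytes, close to the 2GB limit"
    fi
}
```

Calling `check_file` on the performance data file from cron (the path depends on your nagios.cfg) would give some warning before the file gets that large.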

It was still very odd, and odder still that it did not log why it was dying. But it is fixed now.


Nagios Daemon dies if any service checks are run
