Nagios Daemon dies if any service checks are run
Posted: Sat Jul 13, 2013 1:29 pm
Our system has been using Nagios 3.2.0 for 3 years without issues. I've since set up 2 other Nagios host servers and am mostly familiar with all the configurations and files associated with it. We use 1 server as the host and run the nrpe daemon on each of our client boxes to handle service calls. Again, to note, this was installed and set up several years ago and has been working successfully for years.
However, about a day ago the nagios daemon died on the host server without logging any errors. The odd thing is that the web interface was still working, so no one even noticed until a check in a "warning" state that should have cleared up never recovered. Checking the server, we saw that the nagios daemon was not running. We restarted it and a recovery message went out, but the check remained unchanged on the web interface. Checking again showed the daemon had died shortly after starting, living just long enough to do the initial check. Restarting it again sent out another recovery message, but then it crashed and the check remained in the same state as before.
I've rebooted the system, and I've reloaded and restarted the nagios daemons. I've restarted the httpd service. I've triple-checked all relevant config files. Nothing has changed in the config files in the past few days with the exception of my troubleshooting attempts. At first it seemed related to running manual checks through the cgi (based on the "premature end of script headers" errors in the httpd error_log file), but I removed the objects.cache and retention.dat files to see if it would help, and when I restart the daemon, it still crashes during the first service check. As a test, I shut off all active checks on services and restarted, and it stays alive and checks the status of all of the hosts. If I try to run a service check at that point, the nagios daemon dies and the service check hangs.
I've done extensive searches for this problem and have tried everything suggested, from checking config files, to restarting services, to removing files and checking permissions. In the case of permissions/config files everything has been set up correctly for years, so those did not pan out. The only real clue I have right now is that the nagios daemon will only stay alive until a service check is attempted, either through the cgi or an automatic one set up in my services.cfg file.
I thought maybe some sort of corrupt service check was being run, but it happens no matter what the check is and no matter whether it's an automatic or a forced check. Has anyone experienced this with their nagios daemon? Is there any way to force it to log when it dies? Is there some reason why the web interface still acts like it's working even though the nagios daemon is not running?
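In the meantime, a small watchdog along these lines could at least record the moment the daemon disappears, since nagios itself logs nothing. This is only a sketch, not anything that ships with Nagios: the lock file path is an assumption (use whatever lock_file is set to in nagios.cfg), and the function names are made up for illustration. It reads the PID from the lock file and probes it with signal 0, which tests for the process without affecting it.

```python
import os
import time

# Assumption: adjust to the lock_file value in your nagios.cfg.
LOCK_FILE = "/usr/local/nagios/var/nagios.lock"


def read_pid(path):
    """Return the PID stored in the lock file, or None if it cannot be read."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


def pid_is_alive(pid):
    """Probe the process with signal 0, which checks existence only."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def watch(path=LOCK_FILE, interval=5):
    """Poll until the daemon is gone, then print a timestamped line."""
    while True:
        pid = read_pid(path)
        if pid is None or not pid_is_alive(pid):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print("%s nagios daemon not running (pid=%s)" % (stamp, pid))
            return
        time.sleep(interval)


if __name__ == "__main__":
    watch()
```

Started right after the daemon, it would give a timestamp to line up against the last service check in nagios.log and the httpd error_log.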