We have a Nagios/pnp4nagios installation that is several years old and has been periodically updated to recent versions. Suddenly the nagios service stopped for no apparent reason, with no obvious errors in any of the logs I've gone through. When restarted, it starts checking hosts and fails at the same point, after one specific host check and before the next. The only outward sign is that the last check times are old. When I attempt to manually schedule a host check from the interface, nothing happens when the commit button is pressed, where normally it would confirm the check was scheduled and show a "done" link. This is presumably because the service is stopped and the interface gets no response. I have gone through the configuration files to no avail, specifically checking the host configurations of the last host checked before the service stops and the one scheduled next.
The installation is on a VMware virtual server. When restored to earlier snapshots, nagios runs fine for a couple of weeks and then the problem happens again. Hardware resources do not seem taxed at all and there is plenty of space on the file systems.
I have found similar descriptions of this problem where NDO was the culprit, but we are using pnp4nagios/rrdtool so that does not apply. I upgraded to Nagios 3.4.4 and pnp4nagios 0.6.19 and the service fails at exactly the same point.
Any help would be greatly appreciated, especially assistance with deeper/more comprehensive troubleshooting tips. This is my first post so please gently point me in the right direction if this is the wrong forum.
Thanks!!!
Nagios Service Stops Unexpectedly
-
slansing
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: Nagios Service Stops Unexpectedly
This is extremely odd. Let's start with the basics: have you verified the configuration files?
What else is running on the system besides Nagios? Are you using it for web hosting or any such thing?
Code:
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
Re: Nagios Service Stops Unexpectedly
Actually just solved my own problem. For future reference and possible help to others, here is what was going on...
Root cause: The service-perfdata file grew too big and caused the Nagios service to crash.
There's an error somewhere in the original configuration which caused service-perfdata and host-perfdata to grow continuously. Not being familiar enough with Nagios, I had not noticed it in the past. However, while looking last night I saw that the service-perfdata file size was 2,147,483,647 bytes. I realized that a file size of about 2GB on a 32-bit system was trouble, and sure enough, 2^31 = 2,147,483,648. At 2^31 - 1 bytes the file had reached the largest size a signed 32-bit file offset allows, so Nagios was crashing on the next attempt to write to the file. I stopped Nagios gracefully, renamed service-perfdata, did a "touch service-perfdata" to start a new file, restarted Nagios and voilà!
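For anyone who wants to check for the same condition, the arithmetic above can be turned into a small script. This is only a sketch: the directory /usr/local/nagios/var and the 80% warning threshold are assumptions for illustration, not values taken from this installation.

```python
import os

# Largest file size a signed 32-bit file offset can represent.
MAX_32BIT_FILE_SIZE = 2**31 - 1  # 2,147,483,647 bytes


def headroom(size_bytes, limit=MAX_32BIT_FILE_SIZE):
    """Return how many more bytes fit before the file reaches the limit."""
    return max(limit - size_bytes, 0)


def files_near_limit(directory, threshold=0.8, limit=MAX_32BIT_FILE_SIZE):
    """Return sorted (path, size) pairs for regular files whose size is
    at or above threshold * limit."""
    flagged = []
    for entry in os.scandir(directory):
        if entry.is_file(follow_symlinks=False):
            size = entry.stat(follow_symlinks=False).st_size
            if size >= threshold * limit:
                flagged.append((entry.path, size))
    return sorted(flagged)


if __name__ == "__main__":
    # Assumed location of the perfdata files; adjust to your install.
    var_dir = "/usr/local/nagios/var"
    if os.path.isdir(var_dir):
        for path, size in files_near_limit(var_dir):
            print(f"{path}: {size:,} bytes, {headroom(size):,} bytes left")
```

Run from cron, something like this would flag a perfdata file well before it reaches the limit, rather than after the service has already stopped.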
My next step is to hunt down the configuration settings causing the continuous growth of service-perfdata. If anyone happens to have a quick pointer on that one I'd be happy to hear it. Otherwise, I'll dig it up and post back to this thread.
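One place to start looking, offered as a guess rather than a diagnosis: the perfdata directives in nagios.cfg. If a perfdata file is configured but nothing processes it at a regular interval, it would be expected to keep growing. The sketch below reads the main config and reports that combination. The directive names are the standard Nagios Core ones; the config path and the rule that "interval 0 or no processing command" is suspicious are assumptions.

```python
def parse_nagios_cfg(text):
    """Parse 'key=value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def perfdata_warnings(settings):
    """Return warnings for perfdata files that nothing appears to process."""
    warnings = []
    if settings.get("process_performance_data", "0") != "1":
        return warnings  # perfdata is not being written at all
    for kind in ("host", "service"):
        if f"{kind}_perfdata_file" not in settings:
            continue
        interval = settings.get(f"{kind}_perfdata_file_processing_interval", "0")
        command = settings.get(f"{kind}_perfdata_file_processing_command")
        if interval == "0":
            warnings.append(f"{kind}_perfdata_file has no processing interval")
        if not command:
            warnings.append(f"{kind}_perfdata_file has no processing command")
    return warnings


if __name__ == "__main__":
    import os

    cfg_path = "/usr/local/nagios/etc/nagios.cfg"  # assumed path
    if os.path.isfile(cfg_path):
        with open(cfg_path) as f:
            for warning in perfdata_warnings(parse_nagios_cfg(f.read())):
                print(warning)
```

A clean result here does not rule out other causes, for example a processing command that is defined but fails every time it runs.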
Thanks,
Jim
Last edited by jalr on Tue Feb 26, 2013 1:59 pm, edited 1 time in total.
-
slansing
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: Nagios Service Stops Unexpectedly
That is odd but it certainly would cause that problem. On the point of configuration checking, do you monitor a lot of switches/routers/SNMP polled systems that return performance data? Or do you have a group of hosts/services running multiple checks within a minute reporting performance data?
Re: Nagios Service Stops Unexpectedly
If I understand your question correctly: we have several service and host groups defined and run checks using the groups. However, we only monitor a few services every minute, most are checked every 5 minutes or less often, and we aren't polling any devices. What is your suspicion or line of thinking here?
-
slansing
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: Nagios Service Stops Unexpectedly
My thought process was that your system was pulling in huge amounts of performance data, though that does not seem to be the case now. This is an interesting one.