Problem after Nagios restart all host checks completing
Posted: Sat Oct 12, 2013 9:06 pm
by ericfeldhusen
I have several Nagios 3.2.3 servers that monitor dynamically changing networks (new hosts appearing, old hosts disappearing, etc.), where a script dynamically generates a file with hosts and then restarts the nagios service. What we're finding is that if we restart the nagios service too often, host checks fail to complete for more and more hosts as the number of hosts grows.
I suspect we're missing an option in the nagios config file. If X hosts were checked most recently and Y hosts have very old checks, the restart causes nagios to start at the beginning of the list of X hosts and continue on, rather than starting with the hosts with the oldest checks and working up to the hosts with the most recent checks.
I think I'm getting stuck on the terminology of what I think I'm looking for in configuration options instead of understanding the options I'm looking for.
Any suggestions?
Eric Feldhusen
Re: Problem after Nagios restart all host checks completing
Posted: Mon Oct 14, 2013 10:42 am
by abrist
Host checks should not hang when nagios restarts, though if the restarts happen too often, you may have issues with multiple nagios parent processes running concurrently. My suggestion would be to have your scripts check whether nagios is running before they restart it. If it is not running, wait and hold off on the restart until it is.
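A minimal sketch of such a guard in Python. It assumes nagios writes its PID to a lock file (the path below is a common source-install default; the lock_file setting in nagios.cfg decides the real location) and that the restart command is passed in by the caller. The names LOCK_FILE, nagios_is_running and restart_when_running are made up for illustration.

```python
import os
import subprocess
import time

# Assumed path: check the lock_file setting in nagios.cfg for the real one.
LOCK_FILE = "/usr/local/nagios/var/nagios.lock"


def nagios_is_running(lock_file=LOCK_FILE):
    """Return True if the PID stored in lock_file belongs to a live process."""
    try:
        with open(lock_file) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        return False  # no lock file, or it does not contain a PID
    if pid <= 0:
        return False  # never signal a process group by accident
    try:
        os.kill(pid, 0)  # signal 0 only tests for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def restart_when_running(restart_cmd, lock_file=LOCK_FILE, retries=6, delay=10):
    """Restart only once nagios is confirmed up; give up after `retries` polls."""
    for _ in range(retries):
        if nagios_is_running(lock_file):
            subprocess.check_call(restart_cmd)
            return True
        time.sleep(delay)
    return False
```

It could be called as restart_when_running(["service", "nagios", "restart"]), where the restart command is again an assumption about the init setup.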
Re: Problem after Nagios restart all host checks completing
Posted: Mon Oct 14, 2013 11:31 am
by ericfeldhusen
It's not that the nagios process or host checks are hanging. It's that the checks start over from the top of the host files when the nagios service is reloaded/restarted, so newer hosts added at the bottom get checked only very intermittently or not at all.
Re: Problem after Nagios restart all host checks completing
Posted: Mon Oct 14, 2013 11:42 am
by abrist
How often are you restarting nagios?
Re: Problem after Nagios restart all host checks completing
Posted: Mon Oct 14, 2013 11:49 am
by ericfeldhusen
Anywhere from every 10 to 60 minutes. It depends on how many new devices are being added to the network and appended to the host file as a new host to check.
Even if we slow down to every 60 minutes, with checks every 5 minutes, we have hosts that aren't being rechecked in that 60 minutes.
Eric Feldhusen
Re: Problem after Nagios restart all host checks completing
Posted: Mon Oct 14, 2013 12:20 pm
by abrist
1. How large is the installation?
2. Is the date/time/timezone on the server correct?
3. When you restart nagios, are the checks scheduled correctly, or are some of them scheduled out by an hour or more?
Re: Problem after Nagios restart all host checks completing
Posted: Mon Oct 14, 2013 1:57 pm
by ericfeldhusen
1. Depending on the server, approximately 5000-6000 hosts with one service check each.
2. All date/time/timezone settings are correct on the servers.
3. When I restart, I see every "next_check" scheduled between the moment of restart and no later than 20 minutes out. If I restart every 10-20 minutes, that could be the problem, but when I've adjusted the schedule to 30 minutes, I'm still not getting checks.
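One way to put numbers on this is to read last_check out of the status file and list the hosts with the oldest checks. The sketch below assumes the Nagios 3.x status.dat layout of "hoststatus { key=value ... }" blocks with epoch timestamps; the function names are made up for illustration, and the status file location depends on the status_file setting in nagios.cfg.

```python
import time


def parse_host_status(text):
    """Yield one dict per `hoststatus { ... }` block in status.dat text."""
    block = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("hoststatus") and line.endswith("{"):
            block = {}
        elif line == "}" and block is not None:
            yield block
            block = None
        elif block is not None and "=" in line:
            key, _, value = line.partition("=")
            block[key] = value


def stale_hosts(text, max_age, now=None):
    """Return (host_name, age_seconds) for hosts whose last check is older
    than max_age seconds, oldest first. A last_check of 0 (never checked)
    shows up with an age equal to `now`."""
    now = time.time() if now is None else now
    stale = []
    for host in parse_host_status(text):
        age = now - int(host.get("last_check", 0))
        if age > max_age:
            stale.append((host.get("host_name", "?"), int(age)))
    return sorted(stale, key=lambda item: item[1], reverse=True)
```

Running stale_hosts(open(path).read(), max_age=1800) right before each restart would show whether the same hosts keep falling behind.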
Re: Problem after Nagios restart all host checks completing
Posted: Mon Oct 14, 2013 2:01 pm
by abrist
With that size of installation, you should be able to set the check interval to 5 minutes, instead of 20. If your checks are configured for 5 minute intervals but taking 20 minutes, then indeed something is amiss and must be hunted down.
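As a rough back-of-the-envelope check using the numbers given earlier in the thread (about 6000 hosts with one service each), the average check rate the server has to sustain can be worked out as below. It assumes every host and every service is actively checked once per interval, which is a simplification.

```python
def required_checks_per_second(hosts, services_per_host, interval_minutes):
    """Average rate needed so each host and service is checked once per interval."""
    total_checks = hosts + hosts * services_per_host
    return total_checks / (interval_minutes * 60)


# 6000 hosts + 6000 services every 5 minutes
print(required_checks_per_second(6000, 1, 5))   # 40.0
# The same load spread over 20 minutes
print(required_checks_per_second(6000, 1, 20))  # 10.0
```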
Re: Problem after Nagios restart all host checks completing
Posted: Fri Oct 18, 2013 2:44 am
by bananagios
Hello to all,
I'm sorry to reopen this quite old topic, but my problem is very similar.
I have a large installation with around 4000 services (fully distributed topology) to check on nagios version 3.4.4.
Every tuning suggested by the documentation was applied, but the problem remains the same: after some hours (typically 3-4), many checks (seemingly at random) are not performed any more.
I put critical files and folders on a ramdisk as well.
At the moment I use the following workaround: restart nagios from crontab every 2 hours, and it works well.
The virtual machine that nagios is installed on doesn't show any RAM or CPU problems (4 CPUs and 16 GB RAM). The VMware server is very powerful and hosts only 6 other small VMs.
Looking at the nagios code, I found that a list is used to process the check results and that the method and calls that free memory sit inside an if statement.
I suppose that the list, under some conditions that depend on host name and service description, is not cleaned up correctly; as a result the list grows and performance drops dramatically.
This would match the restart workaround (maybe).
Do you know if this is a known problem and if a solution exists?
Do you have any suggestions?
I'm going to create another server in order to reduce the number of checks that each server performs.
Thank you in advance.
Rob
Re: Problem after Nagios restart all host checks completing
Posted: Fri Oct 18, 2013 10:39 am
by slansing
Do checks just not return? Is there a valid next scheduled check time on the host/service? Are they all suddenly going critical?
Can you correlate any messages in the system log during this time? Or the nagios log?
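To line up the nagios log with the system log, it helps to convert the epoch timestamps that prefix each nagios.log entry ("[1382090400] ...") into readable times for the window when checks stop. This is a minimal sketch; the function name and the sample window are made up for illustration.

```python
import re
from datetime import datetime, timezone

# nagios.log entries start with an epoch timestamp in square brackets.
LOG_LINE = re.compile(r"^\[(\d+)\]\s+(.*)$")


def entries_between(lines, start, end):
    """Yield (utc_datetime, message) for log lines stamped within [start, end]."""
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip anything that is not a timestamped entry
        stamp = int(match.group(1))
        if start <= stamp <= end:
            yield datetime.fromtimestamp(stamp, tz=timezone.utc), match.group(2)
```

Passing an open nagios.log file object as `lines`, with start and end set around the time the checks stopped, gives a short list to compare against the system log.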