
Weird scheduling issues

Posted: Tue Apr 09, 2013 7:52 pm
by jsmurphy
Hi guys,

I'm still only beginning my investigation but thought I would open a ticket here in case you've seen this before. For the last three days we've been having issues where Nagios will never execute checks and just continually schedule hosts 10 minutes into the future (which is our usual check interval).

Today is a little different: instead of affecting every host as on the previous couple of days, it is only affecting some hosts with little to nothing in common (different templates, host groups, etc.). Restarting Nagios seems to resolve the problem for somewhere between 12 and 24 hours before it starts again. We are currently running XI r1.6. We've attempted a restart of the Nagios server, and I've confirmed that this is occurring at the Nagios Core level and is not just some database oddity.

I'm hoping someone has seen something similar to this before and can save me a little time.

Thanks!

End of day edit:

Well, I learned nothing of value. The databases are a-ok, I've updated to XI 1.7, upgraded the VMware tools that were out of date, and discovered 8000 files in /tmp/ called checkXXXXXX, which I've removed (what's the deal with those?). I couldn't find anything else out of the ordinary. I've also done the requisite amount of finger crossing, so let me know if there's something else I should check.

Re: Weird scheduling issues

Posted: Wed Apr 10, 2013 1:11 pm
by mguthrie
Are you using DNX or Mod Gearman event brokers on this system?

For the leftover check files in /tmp: if you extend the time the /etc/init.d/nagios init script waits for Nagios to exit during a restart, it should give check results extra time to close out before the parent process kills them off.

See /etc/init.d/nagios, around line 137:

Code:

                #echo -n 'Waiting for nagios to exit .'
                for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30; do
                    if status_nagios > /dev/null; then

Re: Weird scheduling issues

Posted: Wed Apr 10, 2013 8:09 pm
by jsmurphy
Nope, no DNX or Mod Gearman. The performance improvements in the last few versions have given us plenty of room for growth without additional servers. Additionally, there were no scheduling issues this morning, so the finger crossing appears to be working rather well so far.

I did however come across an entirely separate issue but I've opened that in another thread: http://support.nagios.com/forum/viewtop ... 16&t=10062

Re: Weird scheduling issues

Posted: Thu Apr 11, 2013 9:20 am
by mguthrie
OK, let us know if you see the scheduling oddities again. We'll keep our eyes out for it.

Re: Weird scheduling issues

Posted: Sun Apr 14, 2013 6:57 pm
by jsmurphy
Ok I think I might have got this one licked.

We moved the Nagios server to a different data center and forgot to change the DNS configuration to point to the local DNS servers. Overnight, some fairly big jobs run over the links between our data centres, which causes hours of rather slow name resolution... it's as if Nagios eventually gets far enough behind in its processing queue that it just gives up, takes its ball and goes home.

I'm not 100% certain (I'm not even 90% certain really), but this is looking like the root cause.
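
For anyone wanting to test a similar theory, here is a rough sketch for timing a single name lookup from the Nagios server. The function name lookup_ms is made up for illustration, and it assumes GNU date (for %N) and getent are available, which is typical on Linux but not guaranteed everywhere.

```shell
#!/bin/sh
# lookup_ms NAME
# Print how long one name lookup takes, in milliseconds.
# getent follows the system resolver configuration (nsswitch.conf,
# resolv.conf), so it should see roughly the same delays as other
# programs on the host.
# Assumption: GNU date, which supports %N (nanoseconds).
lookup_ms() {
    name="$1"
    start=$(date +%s%N)
    # Only the elapsed time matters here, so output and errors are discarded.
    getent hosts "$name" > /dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}
```

Running something like lookup_ms some-monitored-host a few times during the overnight window and again during the day would show whether resolution really slows down when the inter-site links are busy. The hostname there is a placeholder.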

Re: Weird scheduling issues

Posted: Mon Apr 15, 2013 9:45 am
by scottwilkerson
You may want to consider also making the changes mguthrie suggested here
http://support.nagios.com/forum/viewtop ... 048#p50571

It is possible that if too many check tmp files get left behind, there could be a timeout caused by Nagios just trying to get a list of them.
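
A quick way to keep an eye on this is to count the leftover files. The sketch below is only an illustration: the function names are made up, and the check* pattern is an assumption based on the checkXXXXXX names reported earlier in this thread. It uses find with -maxdepth and -mmin, which GNU and BSD find support.

```shell
#!/bin/sh
# count_check_files [DIR]
# Count regular files named check* directly inside DIR (default /tmp).
count_check_files() {
    dir="${1:-/tmp}"
    find "$dir" -maxdepth 1 -type f -name 'check*' | wc -l | tr -d ' '
}

# list_stale_check_files [DIR] [MINUTES]
# List, without deleting, check* files last modified more than
# MINUTES minutes ago (default 60).
list_stale_check_files() {
    dir="${1:-/tmp}"
    minutes="${2:-60}"
    find "$dir" -maxdepth 1 -type f -name 'check*' -mmin +"$minutes"
}
```

Comparing count_check_files /tmp before and after a restart should show whether the restart is what leaves files behind. The listing function deliberately only prints paths, so nothing is removed until you have looked at what it found.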

Again, let us know if this pops up again.

Re: Weird scheduling issues

Posted: Mon Apr 15, 2013 4:46 pm
by jsmurphy
Oh I definitely already did that too, 8000 files in any one directory is too many for me ;)

Re: Weird scheduling issues

Posted: Mon Apr 15, 2013 4:50 pm
by scottwilkerson
Not really on-topic, but we'd love to have you (jsmurphy) speak at this year's conference again. If you are able, a call for papers was just recently posted
http://www.nagios.com/events/nagiosworl ... lforpapers

Re: Weird scheduling issues

Posted: Wed Apr 24, 2013 12:36 pm
by vAJ
scottwilkerson wrote:
You may want to consider also making the changes mguthrie suggested here
http://support.nagios.com/forum/viewtop ... 048#p50571

It is possible that if too many check tmp files get left behind, there could be a timeout caused by Nagios just trying to get a list of them.

Again, let us know if this pops up again.

I'm seeing the same issues with check files building up in /tmp.

Per Mike's post above, is 30 the original wait time, meaning we need to extend it further (say 45? 60?), or is 30 enough? Mine is currently set as shown, but I've inherited this system... You never know what is out of the box and what has been mucked with.

I've had support cases with you guys before, you know I'm running a large # of checks (~800 hosts, ~14k services).

Re: Weird scheduling issues

Posted: Wed Apr 24, 2013 2:33 pm
by scottwilkerson
The default is just 10.

On larger systems you may need to raise it, especially if you have some long-running checks. Ideally, it should be long enough for all of your checks to return.
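
To make the wait easier to tune than editing a hard-coded list of numbers, the same idea can be sketched with the limit in a variable. This is not the stock init script: STOP_WAIT_SECONDS and wait_for_exit are illustrative names, and kill -0 stands in for the script's own status_nagios check.

```shell
#!/bin/sh
# Sketch of a wait-for-exit loop with the wait time in one variable.
# STOP_WAIT_SECONDS and wait_for_exit are illustrative names only.
STOP_WAIT_SECONDS="${STOP_WAIT_SECONDS:-30}"

# wait_for_exit PID [SECONDS]
# Returns 0 once the process is gone, or 1 if it is still running
# after roughly SECONDS seconds.
wait_for_exit() {
    pid="$1"
    limit="${2:-$STOP_WAIT_SECONDS}"
    i=0
    while [ "$i" -lt "$limit" ]; do
        # kill -0 sends no signal; it only tests whether the process
        # exists and can be signalled by the current user.
        if kill -0 "$pid" 2>/dev/null; then
            sleep 1
        else
            return 0
        fi
        i=$((i + 1))
    done
    # One last check after the final sleep.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}
```

With this shape, trying 45 or 60 is a one-line change (or STOP_WAIT_SECONDS=60 in the environment), and the non-zero return makes it obvious when the limit was hit before the process exited.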