Weird scheduling issues

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
jsmurphy
Posts: 989
Joined: Wed Aug 18, 2010 9:46 pm

Post by jsmurphy »

Hi guys,

I'm still only beginning my investigation but thought I would open a ticket here in case you've seen this before. For the last three days we've been having issues where Nagios never executes checks and just continually reschedules hosts 10 minutes into the future (which is our usual check interval).

Today is a little different: instead of doing this for every host like the previous couple of days, it is only doing it for some hosts with little to nothing in common (different templates, host groups, etc.). Restarting Nagios seems to resolve the problem for somewhere between 12 and 24 hours before it starts again. We are currently running XI r1.6. We've attempted a restart of the Nagios server, and I've confirmed that this is occurring at the Nagios Core level and is not just some database oddity.

I'm hoping someone has seen something similar to this before and can save me a little time.

Thanks!

End of day edit:

Well, I learned nothing of value. The databases are A-OK. I've updated to XI 1.7, upgraded the VMware tools that were out of date, and discovered 8000 files in /tmp/ called checkXXXXXX, which I've removed (what's the deal with those?). I couldn't find anything else out of the ordinary. I've also done the prerequisite amount of finger crossing, so let me know if there's something else I should check.
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Weird scheduling issues

Post by mguthrie »

Are you using DNX or Mod Gearman event brokers on this system?

As for the leftover check files in /tmp: if you extend the time the /etc/init.d/nagios init script waits for Nagios to exit on a restart, it should give check results extra time to close out before the parent process kills them off.

Look in /etc/init.d/nagios, around line 137:

                #echo -n 'Waiting for nagios to exit .'
                for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30; do
                    if status_nagios > /dev/null; then
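
The quoted snippet is cut off, and listing every second by hand gets unwieldy for longer waits. A minimal sketch of the same idea with an adjustable timeout might look like the following. WAIT_SECS and wait_for_exit are hypothetical names, not part of the stock init script, and the sketch assumes (as the quoted "if" implies) that status_nagios returns 0 while Nagios is still running.

```shell
# Sketch only: WAIT_SECS and wait_for_exit are illustrative names,
# not something the stock /etc/init.d/nagios script defines.
WAIT_SECS="${WAIT_SECS:-30}"

# wait_for_exit CMD...: run CMD once per second while it keeps succeeding
# (assumed to mean "process still running"). Returns 0 as soon as CMD fails
# (process gone), or 1 if WAIT_SECS polls pass and CMD still succeeds.
wait_for_exit() {
    i=0
    while [ "$i" -lt "$WAIT_SECS" ]; do
        if "$@" > /dev/null 2>&1; then
            sleep 1
        else
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}
```

Inside the init script this would be called as "wait_for_exit status_nagios", with WAIT_SECS raised to whatever suits the install.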
jsmurphy
Posts: 989
Joined: Wed Aug 18, 2010 9:46 pm

Re: Weird scheduling issues

Post by jsmurphy »

Nope, no DNX or Mod Gearman; the performance improvements in the last few versions have given us plenty of room for growth without additional servers. Additionally, there were no scheduling issues this morning, so finger crossing appears to be working rather well so far.

I did however come across an entirely separate issue but I've opened that in another thread: http://support.nagios.com/forum/viewtop ... 16&t=10062
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Weird scheduling issues

Post by mguthrie »

OK, let us know if you see the scheduling oddities again. We'll keep our eyes out for it.
jsmurphy
Posts: 989
Joined: Wed Aug 18, 2010 9:46 pm

Re: Weird scheduling issues

Post by jsmurphy »

Ok I think I might have got this one licked.

We moved the Nagios server to a different data centre and forgot to change the DNS configuration to point to the local DNS servers. Overnight, some fairly big jobs run over the links between our data centres, which would cause hours of rather slow name resolution... it's like Nagios eventually gets far enough behind in its processing queue that it just gives up, takes its ball and goes home.

I'm not 100% certain (I'm not even 90% certain really), but this is looking like the root cause.
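
One rough way to test a theory like this is to time lookups through the system resolver while the overnight jobs are running. The sketch below is only an illustration and not something taken from this thread: resolve_secs and slow_lookups are hypothetical helper names, and it assumes getent and "date +%s" are available, as they are on typical Linux installs.

```shell
# resolve_secs HOST: print how many whole seconds one lookup of HOST
# through the system resolver took (a failed lookup is timed as well).
resolve_secs() {
    rs_start=$(date +%s)
    getent hosts "$1" > /dev/null 2>&1 || true
    rs_end=$(date +%s)
    echo $((rs_end - rs_start))
}

# slow_lookups LIMIT HOST...: print each HOST whose lookup took
# longer than LIMIT seconds, along with the measured time.
slow_lookups() {
    sl_limit=$1
    shift
    for sl_host in "$@"; do
        sl_secs=$(resolve_secs "$sl_host")
        if [ "$sl_secs" -gt "$sl_limit" ]; then
            echo "$sl_host ${sl_secs}s"
        fi
    done
}
```

Running "grep '^nameserver' /etc/resolv.conf" first shows which DNS servers the box is actually pointed at; after that, something like "slow_lookups 2 hostA hostB" (placeholder host names) from cron overnight would show whether lookups really crawl during the big jobs.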
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Weird scheduling issues

Post by scottwilkerson »

You may want to consider also making the changes mguthrie suggested here
http://support.nagios.com/forum/viewtop ... 048#p50571

It is possible that if too many check tmp files get left behind, there could be a timeout caused by Nagios just trying to get a list of them.

Again, let us know if this pops up again.
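
To keep an eye on the buildup, a small sketch along these lines could list the leftovers. It assumes the files are named "check" followed by exactly six characters, as the checkXXXXXX pattern reported earlier suggests, and that anything older than the chosen age is no longer in use. stale_check_files is a hypothetical helper name.

```shell
# stale_check_files DIR MINUTES: list regular files directly inside DIR
# whose names are "check" plus six characters and whose last modification
# is more than MINUTES minutes old. Nothing is deleted.
stale_check_files() {
    find "$1" -maxdepth 1 -type f -name 'check??????' -mmin +"$2"
}
```

"stale_check_files /tmp 60 | wc -l" gives a count. Once the list looks right, piping it to "xargs rm -f" would clear the files, though that step is worth doing by hand the first time.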
Former Nagios employee
Creator:
Human Design Website
Get Your Human Design Chart
jsmurphy
Posts: 989
Joined: Wed Aug 18, 2010 9:46 pm

Re: Weird scheduling issues

Post by jsmurphy »

Oh, I definitely already did that too; 8000 files in any one directory is too many for me ;)
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Weird scheduling issues

Post by scottwilkerson »

Not really on-topic, but we'd love to have you (jsmurphy) speak at this year's conference again. If you are able, a call for papers was just posted:
http://www.nagios.com/events/nagiosworl ... lforpapers
vAJ
Posts: 456
Joined: Thu Nov 08, 2012 5:09 pm
Location: Austin, TX

Re: Weird scheduling issues

Post by vAJ »

scottwilkerson wrote:
You may want to consider also making the changes mguthrie suggested here
http://support.nagios.com/forum/viewtop ... 048#p50571

It is possible that if too many check tmp files get left behind, there could be a timeout caused by Nagios just trying to get a list of them.

Again, let us know if this pops up again.

I'm seeing the same issues with check files building up in /tmp.

Per Mike's post above, is 30 what the wait time is out of the box, and do we need to extend it further (say 45? 60?), or is 30 enough? Mine is currently set as shown, but I've inherited this system... you never know what is out of the box and what has been mucked with.

I've had support cases with you guys before, so you know I'm running a large number of checks (~800 hosts, ~14k services).
Andrew J. - Do you even grok?
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Weird scheduling issues

Post by scottwilkerson »

The default is just 10.

On larger systems you may need to raise it, especially if you have some long-running checks. Ideally, it should be long enough for all of your checks to return.
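
One possible way to pick a number, offered as a heuristic rather than official guidance: no check should run longer than the check timeouts set in nagios.cfg, so the larger of service_check_timeout and host_check_timeout is a reasonable upper bound for the wait. The sketch below reads those two directives from a config file. max_check_timeout is a hypothetical helper name, and the path in the usage note is the usual source-install location, which may differ on your system.

```shell
# max_check_timeout CFG: print the larger of service_check_timeout and
# host_check_timeout found in CFG, or 0 if neither is set.
# Commented-out lines (starting with #) are ignored.
max_check_timeout() {
    awk -F= '
        /^[ \t]*(service|host)_check_timeout[ \t]*=/ {
            v = $2 + 0
            if (v > max) max = v
        }
        END { print max + 0 }
    ' "$1"
}
```

For example, "max_check_timeout /usr/local/nagios/etc/nagios.cfg" prints the value to compare against the loop count in the init script.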