Nagios core slowness and orphan issues

btmikkelsen
Posts: 38
Joined: Wed Feb 23, 2011 10:29 am

Nagios core slowness and orphan issues

Post by btmikkelsen »

Hello - Nagios is throwing hundreds of warnings like these:
Warning: The check of service 'x' on host 'y' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service...
and
Warning: The check of host 'x' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the host...

Verified in status.dat that the services are valid, and I can run the check command manually for both the service and the host.
Verified there are no "extra" nagios processes per the troubleshooting guide on the nagios support site.
Removed all /tmp/checkXXXXX files; I even removed the retention.dat file to force a full requery of all services.
I've tried removing NDO from the configuration.
I've upped debugging in both NDO and nagios core - don't see what could be causing this.

I think whatever is causing this is also keeping my checks from running within the defined check interval, which is a problem in its own right.

Nagios XI 2011R1.4, manually installed
2.6.18-238.12.1.el5 #1 SMP Tue May 31 13:22:04 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
no ssl; no proxy; no specialized configurations

Any help would be appreciated.
Last edited by btmikkelsen on Wed Jun 22, 2011 9:57 am, edited 1 time in total.
ormsbeec
Posts: 35
Joined: Fri May 27, 2011 1:18 pm

Re: Nagios core slowness and orphan issues

Post by ormsbeec »

Interestingly, I'm having the exact same issue (also on 2011R1.4); it started at around 1-2 am last night for me. Everything had been working perfectly, and I was going to move this server to production tonight :\
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Nagios core slowness and orphan issues

Post by mguthrie »

Can I have you check the following?

Is the hard drive full? (I hate to ask, but it can cause this.)

What do the permissions look like for /usr/local/nagios/var/spool/checkresults?
Should be:

Code:

drwxrwxr-x 2 nagios nagios  36864 Jun 22 14:57 checkresults
Can you run:

Code:

killall -9 nagios
service ndo2db stop
service nagios start
service ndo2db start
Someone else is reporting this as well, so we'll keep a close eye out for it.
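
For anyone following along, the two checks asked for above (free disk space and the mode and owner of the spool directory) can be run in one go. This is only a minimal sketch for a POSIX shell: the function name check_spool is made up for illustration, and the default path is simply the one quoted in this thread, so adjust it for your install.

```shell
# Hypothetical helper: report disk usage and mode/owner for a spool directory.
# Default path is the one quoted in this thread (an assumption for other installs).
check_spool() {
    dir="${1:-/usr/local/nagios/var/spool/checkresults}"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    # Capacity column of the filesystem that holds the directory.
    df -P "$dir" | awk 'NR==2 {print "disk used: " $5}'
    # Mode, owner and group of the directory itself.
    ls -ld "$dir" | awk '{print "mode: " $1 "  owner: " $3 "  group: " $4}'
}
```

Calling check_spool with no argument inspects the default path; the mode line should start with drwxrwxr-x and show nagios for both owner and group if it matches the listing above.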
btmikkelsen
Posts: 38
Joined: Wed Feb 23, 2011 10:29 am

Re: Nagios core slowness and orphan issues

Post by btmikkelsen »

Thank you:

/dev/mapper/VolGroup00-LogVol00 550497960 85045128 437038064 17% /

and

ls -ld /usr/local/nagios/var/spool/checkresults
drwxrwxr-x 2 nagios nagios 107823104 Jun 22 13:55 /usr/local/nagios/var/spool/checkresults

Ran the killall -9 and restarted everything.


Just FYI: when nagios first starts, it seems to process ok - it takes about 5-15 minutes for the errors to begin showing up. Also - I can submit a command via the web gui or via the .cmd file, and nagios takes it - but the check is not performed. The check does get rescheduled for a later time, though. I don't know if that is part of this problem, or a new one.
btmikkelsen
Posts: 38
Joined: Wed Feb 23, 2011 10:29 am

Re: Nagios core slowness and orphan issues

Post by btmikkelsen »

Something odd happened. It's working now.

When I looked at the checkresults directory it was HUGE... I couldn't even do an ls. With nagios down, I moved the directory aside and recreated it. When I restarted, all my jobs stuck in pending immediately fired off, and it seems to be scheduling properly again.
In the OLD checkresults directory, there are 4093491 files. This is kind of odd because I have only 11k services and 800 hosts.
Nagios has been up for about 20 minutes now, and I only have 963 files in the NEW checkresults directory.
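
The steps described here can be sketched roughly as follows. This is an illustration, not the exact commands that were run: the function names count_files and rotate_spool are hypothetical, -maxdepth is a GNU/BSD find extension rather than strict POSIX, and mode 775 is taken from the drwxrwxr-x listing earlier in the thread. Counting through find avoids the sort that a plain ls performs, which is likely what made ls unusable on a directory with millions of entries.

```shell
# Count regular files in a directory without sorting the listing.
count_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Move the directory aside and recreate it empty.
# Assumes Nagios is already stopped. When run as root, ownership still has
# to be restored afterwards, e.g. chown nagios:nagios "$dir".
rotate_spool() {
    dir="$1"
    old="${dir}.old.$$"
    # Print the new location of the old directory on success.
    mv "$dir" "$old" && mkdir "$dir" && chmod 775 "$dir" && echo "$old"
}
```

After restarting Nagios, running count_files on the new directory a few times should show a small, fluctuating number rather than a steadily growing one.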


The only other thing I did today was apply that patch tonyyarusso gave us for the check_xi_service_nsclient problem from earlier today.
http://support.nagios.com/forum/viewtop ... 3&start=10.
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Nagios core slowness and orphan issues

Post by mguthrie »

Those files aren't supposed to get left behind; they should be deleted as soon as they're processed. What do the permissions look like for the /usr/local/nagios/var directory?
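
Since result files should disappear shortly after they are written, one way to spot leftovers is to list anything in the spool directory older than a few minutes. A minimal sketch, assuming a find that supports -maxdepth and -mmin (GNU and BSD do); the function name stale_results and the 10-minute default are arbitrary choices for illustration, not Nagios settings.

```shell
# List result files older than N minutes (default 10, an arbitrary threshold).
stale_results() {
    dir="$1"
    minutes="${2:-10}"
    find "$dir" -maxdepth 1 -type f -mmin +"$minutes"
}
```

For example, stale_results /usr/local/nagios/var/spool/checkresults 10 | wc -l gives a count of leftovers; a number that keeps climbing suggests results are not being processed.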
btmikkelsen
Posts: 38
Joined: Wed Feb 23, 2011 10:29 am

Re: Nagios core slowness and orphan issues

Post by btmikkelsen »

ls -ld /usr/local/nagios/var
drwxrwxr-x 6 nagios nagios 4096 Jun 22 14:34 /usr/local/nagios/var

and my current file count in the checkresults directory is 9, so it seems to be functional now. I wonder if that check_xi_service_nsclient problem was related. That's the only other thing I did (other than debug) after the upgrade to 2011R1.4.