Service check scheduling issue

Post by **sebastiaopburnay** » Wed Mar 06, 2013 3:00 pm

Hi,

I manage a distributed Nagios Core infrastructure. The central node only receives passive checks, sends notifications, and handles data persistence through NDOUtils.

So far, all my servers (central and distributed) run Ubuntu 10.04 LTS and Nagios Core 3.3.1.

On the latest distributed instance to be deployed, I took a big leap:

I started using a virtualized Ubuntu Server 12.04 (hosted on a Windows Server R2 with VMware Player) and Nagios Core 3.4.1, and I've also deployed this distributed instance with MySQL server and the NDOUtils persistence engine.

The fact that so much has changed compared to my usual environment is making this diagnosis hard.

The issue is that although I specify normal_check_interval individually for each service check (ranging from 5 to 60 minutes), some services go unchecked for long periods and show old timestamps in the 'Last Check Time' field of the web interface.

The VM shows no signs of CPU starvation or memory shortage, and about a quarter of the approximately 700 service checks on this instance keep fresh 'Last Check Time' values.

I've read in other forums/topics that this issue could be related to:
- NDOUtils: I tried disabling it, with no success
- The hardware clock on the server: I don't believe an @hourly cron job that syncs with NTP would cause it

I also tried restarting the nagios service with use_retained_scheduling_info set to zero, with no success.

I'm losing hair and head over this. In case it helps, here are the scheduling directives from nagios.cfg on this misbehaving remote monitor:
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=180
use_retained_scheduling_info=1
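
A quick way to confirm which values are actually active is to read them straight from the file. The following is a minimal Python sketch, not part of the original post: the helper name `read_directives` is hypothetical, and the sample text simply reuses the directives quoted above (plus one commented-out line to show that comments are skipped).

```python
# Directives quoted in the post; extend the tuple to inspect others.
SCHEDULING_KEYS = (
    "auto_reschedule_checks",
    "auto_rescheduling_interval",
    "auto_rescheduling_window",
    "use_retained_scheduling_info",
)


def read_directives(cfg_text, keys=SCHEDULING_KEYS):
    """Return {directive: value} for the requested keys, skipping comment lines."""
    found = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out directive
        name, sep, value = line.partition("=")
        if sep and name.strip() in keys:
            found[name.strip()] = value.strip()
    return found


# Sample text standing in for the real nagios.cfg (illustration only).
sample = """\
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=180
use_retained_scheduling_info=1
#use_timezone=US/Mountain
"""

print(read_directives(sample))
```

To run it against a real installation, pass the contents of nagios.cfg instead of `sample`. Note that the listing shows use_retained_scheduling_info=1 while the test described above used 0, so a check like this helps confirm which value the file holds at restart time.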

If I don't find a solution, I will go back to Ubuntu 10.04 with Nagios 3.3.1 hosted on Windows 7 with VMware Player (it seems stupid, but it works like a charm).

Thank you for your time, patience and, hopefully, knowledge.

Post by **abrist** » Thu Mar 07, 2013 11:42 am

What method are you using for the passive checks? If it is NSCA, which version is running on the core server and on the client systems?

Post by **sebastiaopburnay** » Thu Mar 07, 2013 3:57 pm

I'm using the NSCA daemon/server, version 2.7.2, on the central server.

The remote instance uses a submit_service/host shell script that calls send_nsca whenever the check state is HARD (the send_nsca client is also version 2.7.2).

But the passive checks aren't the problem here; they are sent smoothly from the remote to the central server.

My problem is the active checks, which the remote server does not execute on time. They simply do not respect the normal_check_interval values and sometimes run more than 24 to 48 hours late.

Post by **abrist** » Fri Mar 08, 2013 12:47 pm

Curious. So the remote shell script is not firing off anywhere near the right time? Are the delays consistent?

Post by **sebastiaopburnay** » Fri Mar 08, 2013 4:08 pm

abrist wrote: So the remote shell script is not firing off anywhere near the right time?

That's correct... unless it is being executed and somehow not processed or taken into account by Nagios.

Those scripts (mostly the check_nrpe and check_nt plugins) return valid, well-formed results when run from the CLI.

abrist wrote: Are the delays consistent?

I'm not sure what you mean by that, but I'm attaching to this reply a view of the scheduling queue. It shows the Nagios process scheduling 'Next Check Time' values at points in time earlier than the current time (which makes no sense at all).

Another «fun»-fact is that host checks are being correctly scheduled and executed.

The last time I restarted Nagios, I forced it not to retain status (forcing the process to check all hosts/services). As a result, only 148 of the 704 service checks were executed; the other 556 appear as «PENDING».
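
Both symptoms (services stuck in PENDING and 'Next Check Time' values in the past) can be counted from the status file instead of the web interface. The sketch below is a hedged illustration, not something from the thread: it assumes the Nagios 3 status file layout of `servicestatus { ... }` blocks with `has_been_checked` and `next_check` fields, and the host and service names in the sample are made up.

```python
import time


def parse_blocks(text, block_type="servicestatus"):
    """Yield one dict per '<block_type> { key=value ... }' block in the text."""
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if current is None:
            if line == block_type + " {":
                current = {}
        elif line == "}":
            yield current
            current = None
        elif "=" in line:
            key, _, value = line.partition("=")
            current[key] = value


def summarize(text, now=None):
    """Return (pending_count, overdue) where overdue lists services
    whose next_check timestamp is already in the past."""
    now = time.time() if now is None else now
    pending, overdue = 0, []
    for svc in parse_blocks(text):
        if svc.get("has_been_checked") == "0":
            pending += 1  # never executed since the last restart
        next_check = int(float(svc.get("next_check", "0")))
        if 0 < next_check < now:
            overdue.append(
                (svc.get("host_name"), svc.get("service_description"), next_check)
            )
    return pending, overdue


# Hypothetical sample; real data would come from the file named by
# the status_file directive in nagios.cfg.
sample = """\
servicestatus {
    host_name=hostA
    service_description=CPU
    has_been_checked=1
    next_check=1000
    }
servicestatus {
    host_name=hostA
    service_description=Disk
    has_been_checked=0
    next_check=5000
    }
"""

print(summarize(sample, now=2000))
```

Running this periodically and comparing the overdue list between runs would show whether the same services are always late, which also answers the earlier question about whether the delays are consistent.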

Post by **scottwilkerson** » Fri Mar 08, 2013 4:35 pm

Out of curiosity, what are the following set to?

Code:

check_for_orphaned_services
check_for_orphaned_hosts
use_timezone

Does the machine have the correct timezone?

Post by **sebastiaopburnay** » Mon Mar 11, 2013 9:10 am

scottwilkerson wrote: Out of curiosity, what are the following set to?
Code:
check_for_orphaned_services
check_for_orphaned_hosts
use_timezone

I have the following settings:

Code:

check_for_orphaned_services=1
check_for_orphaned_hosts=1
#use_timezone=US/Mountain             ; commented out
#use_timezone=Australia/Brisbane      ; commented out

scottwilkerson wrote: does the machine have the correct timezone?

The remote network has an NTP server, and an @hourly cron job syncs the VM's clock with it.
I'm using Lisbon/London/Greenwich time.
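
NTP sync only corrects the clock itself; the timezone question is about how a process on the machine converts that clock to local time. A small Python sketch (the function name `local_clock_report` is hypothetical) can show what the system reports, assuming it runs on the monitoring VM under the same environment as the Nagios process:

```python
import time
from datetime import datetime, timezone


def local_clock_report(epoch=None):
    """Describe one instant in UTC and in the machine's configured local zone."""
    epoch = time.time() if epoch is None else epoch
    # astimezone() on a naive datetime attaches the system's local zone.
    local = datetime.fromtimestamp(epoch).astimezone()
    return {
        "utc": datetime.fromtimestamp(epoch, tz=timezone.utc).isoformat(),
        "local": local.isoformat(),
        "tz_name": local.tzname(),
        "utc_offset_hours": local.utcoffset().total_seconds() / 3600,
    }


print(local_clock_report())
```

For a machine set to Lisbon or London time in early March, the offset should come out as 0.0; any other value would suggest the zone is not what was intended.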

Post by **abrist** » Mon Mar 11, 2013 10:40 am

What are your timezone settings in /etc/php.ini? Have you configured /etc/localtime for the desired timezone correctly?
EDIT: Though this may not matter as host checks are running on schedule. This is very strange.

Post by **sebastiaopburnay** » Mon Mar 11, 2013 10:56 am

abrist wrote: What are your timezone settings in /etc/php.ini?

Well, this is embarrassing for a sysadmin: I have no such file.

Code:

root@SERVER:/# cat /etc/php.ini
cat: /etc/php.ini: No such file or directory

abrist wrote: Have you configured /etc/localtime for the desired timezone correctly?

I never tampered with /etc/localtime; it probably kept the values set during the OS install. When I ran cat on that file, all I got was binary rubbish.
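
Binary output is expected here: /etc/localtime is normally a compiled zoneinfo file (or a symlink to one), and such files begin with the four bytes `TZif`. The following sketch, with hypothetical helper names, checks that header rather than printing the file; on Debian and Ubuntu releases of that era the zone name is usually also stored as plain text in /etc/timezone, though that is an assumption about this particular system.

```python
import os


def looks_like_tzfile(data):
    """True if the bytes start with the 'TZif' magic of a compiled zoneinfo file."""
    return data[:4] == b"TZif"


def describe_localtime(path="/etc/localtime"):
    """Report whether path is a zoneinfo file and, if a symlink, where it points."""
    with open(path, "rb") as handle:
        header = handle.read(4)
    target = os.path.realpath(path) if os.path.islink(path) else None
    return {"is_tzfile": looks_like_tzfile(header), "symlink_target": target}


if __name__ == "__main__":
    # Only meaningful on systems that actually have the file.
    if os.path.exists("/etc/localtime"):
        print(describe_localtime())
```

If `symlink_target` is set, its path under the zoneinfo directory names the zone directly; if it is a plain copy, comparing it with the expected zone file is the more reliable check.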

abrist wrote: This is very strange.

Indeed it is.

Out of despair, I'm temporarily abandoning this 'Ubuntu 12.04 x64' with 'Nagios 3.4.1' setup and going back to the good old 10.04 x86 with Nagios 3.3.1.

But if you are willing, I'm up for brainstorming a solution to this intriguing 'bug'/misconfiguration.

Post by **abrist** » Mon Mar 11, 2013 11:23 am

It could be in one of these locations:

Code:

/etc/php/php.ini
/etc/php5/php.ini
/usr/bin/php5/bin/php.ini

Or you could just "find" it:

Code:

find / -name php.ini

You should be able to set your local time in Ubuntu with the utility "tzconfig": http://manpages.ubuntu.com/manpages/dap ... fig.8.html
Let's make sure these few things are set up correctly, then we will move on to the really strange stuff.
So host checks run on time. Service checks are delayed by an inconsistent amount. Weird.

Nagios Support Forum
