Nagios Checks were not happening - system time change

Posted: Thu Aug 28, 2014 5:48 am
by ashok
Dear all,

Today I ran into a strange issue in Nagios: no checks were executed for about 30 minutes.

When I looked at nagios.log, I found the entries below. Please help me find the root cause.

About 10 minutes after the checks stopped, this line appeared in nagios.log:

[1409219057] Warning: A system time change of 0d 0h 16m 56s (forwards in time) has been detected. Compensating...
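
For anyone reading a line like this, the bracketed number is a Unix epoch timestamp. Below is a minimal Python sketch, assuming the message always has the shape of the single line quoted above, that converts the timestamp to UTC and extracts the size and direction of the jump. The regex and the function name `parse_time_change` are illustrative, not part of Nagios.

```python
import re
from datetime import datetime, timedelta, timezone

# Pattern modeled on the one warning line quoted in this thread (an assumption,
# not an official Nagios log grammar).
TIME_CHANGE = re.compile(
    r"^\[(?P<epoch>\d+)\] Warning: A system time change of "
    r"(?P<d>\d+)d (?P<h>\d+)h (?P<m>\d+)m (?P<s>\d+)s "
    r"\((?P<direction>forwards|backwards) in time\)"
)

def parse_time_change(line):
    """Return (logged_at_utc, jump, direction), or None if the line does not match."""
    m = TIME_CHANGE.match(line)
    if m is None:
        return None
    logged_at = datetime.fromtimestamp(int(m["epoch"]), tz=timezone.utc)
    jump = timedelta(
        days=int(m["d"]), hours=int(m["h"]),
        minutes=int(m["m"]), seconds=int(m["s"]),
    )
    return logged_at, jump, m["direction"]

line = ("[1409219057] Warning: A system time change of 0d 0h 16m 56s "
        "(forwards in time) has been detected. Compensating...")
logged_at, jump, direction = parse_time_change(line)
print(logged_at.isoformat(), int(jump.total_seconds()), direction)
# prints: 2014-08-28T09:44:17+00:00 1016 forwards
```

The time is shown in UTC on purpose; converting it to the server's local time zone is left to the reader, since the thread does not state which zone the server uses.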

I did not make any changes on the server; I had not even logged in.

After this, the log filled with continuous entries like these:

[Thu Aug 28 15:20:36 2014] Max concurrent service checks (3000) has been reached. Nudging Ranchi by 10 seconds...
[Thu Aug 28 15:20:36 2014] Max concurrent service checks (3000) has been reached. Nudging Raipur by 10 seconds...
[Thu Aug 28 15:20:37 2014] Max concurrent service checks (3000) has been reached. Nudging Vapi by 5 seconds...
[Thu Aug 28 15:20:37 2014] Max concurrent service checks (3000) has been reached. Nudging Indore by 13 seconds...
[Thu Aug 28 15:20:37 2014] Max concurrent service checks (3000) has been reached. Nudging Plant by 6 seconds...
[Thu Aug 28 15:20:37 2014] Max concurrent service checks (3000) has been reached. Nudging DNS Server by 5 seconds...
[Thu Aug 28 15:20:38 2014] Max concurrent service checks (3000) has been reached. Nudging Ghaziabad:Environment by 7 seconds...
[Thu Aug 28 15:20:38 2014] Max concurrent service checks (3000) has been reached. Nudging Noida by 6 seconds...
[Thu Aug 28 15:20:38 2014] Max concurrent service checks (3000) has been reached. Nudging Lucknow by 13 seconds...
[Thu Aug 28 15:20:38 2014] Max concurrent service checks (3000) has been reached. Nudging FARIDABAD:Uptime by 11 seconds...
[Thu Aug 28 15:20:39 2014] Max concurrent service checks (3000) has been reached. Nudging Raipur by 9 seconds...
[Thu Aug 28 15:20:39 2014] Max concurrent service checks (3000) has been reached. Nudging Mahindra-REVA Link by 9 seconds...
[Thu Aug 28 15:20:39 2014] Max concurrent service checks (3000) has been reached. Nudging Goregaon:Environment by 9 seconds...
[Thu Aug 28 15:20:39 2014] Max concurrent service checks (3000) has been reached. Nudging :Environment by 10 seconds...
[Thu Aug 28 15:20:40 2014] Max concurrent service checks (3000) has been reached. Nudging Ghaziabad:Memory Used by 8 seconds...
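
To get a feel for how widespread the nudging is, the entries above can be tallied. This is a rough Python sketch, assuming every line follows the format shown; the regex and the helper `summarize_nudges` are hypothetical names for illustration, and the "name" is whatever text appears between "Nudging" and "by" in the log.

```python
import re
from collections import Counter

# Pattern modeled on the "Nudging" lines quoted above (an assumption about the format).
NUDGE = re.compile(
    r"Max concurrent service checks \((?P<limit>\d+)\) has been reached\.\s+"
    r"Nudging (?P<name>.+) by (?P<seconds>\d+) seconds"
)

def summarize_nudges(lines):
    """Return (nudge count per logged name, total seconds of delay added)."""
    per_name = Counter()
    total_delay = 0
    for line in lines:
        m = NUDGE.search(line)
        if m:
            per_name[m["name"]] += 1
            total_delay += int(m["seconds"])
    return per_name, total_delay

sample = [
    "[Thu Aug 28 15:20:36 2014] Max concurrent service checks (3000) has been reached. Nudging Raipur by 10 seconds...",
    "[Thu Aug 28 15:20:37 2014] Max concurrent service checks (3000) has been reached. Nudging Vapi by 5 seconds...",
    "[Thu Aug 28 15:20:39 2014] Max concurrent service checks (3000) has been reached. Nudging Raipur by 9 seconds...",
]
counts, delay = summarize_nudges(sample)
print(counts.most_common(), delay)
```

Running it over the full nagios.log for the affected window would show whether a few checks are nudged repeatedly or the whole schedule is backed up.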

All of these (Raipur, Vapi, Indore, Plant server, etc.) are network devices and servers.

Every check after that was logged like this.

After about half an hour, the issue resolved itself without any intervention, and 5 to 7 minutes later I found this log entry:


[1409221391] Auto-save of retention data completed successfully.



Please help me get this issue resolved. If it happens again, I will be in trouble.

Thanks in advance.

Regards,
Ashok.

Re: Nagios Checks were not happening - system time change

Posted: Thu Aug 28, 2014 1:15 pm
by slansing
Looks like you've got some time drift. Follow these steps to resolve it:

http://assets.nagios.com/downloads/nagi ... m_Time.pdf

The easiest way would be to set up NTP. As far as concurrent checks go... sheesh, that is a lot. What are your check intervals? How many services do you have on each interval? Here are some options for the concurrent checks setting:

http://nagios.sourceforge.net/docs/nagi ... uning.html

Re: Nagios Checks were not happening - system time change

Posted: Mon Sep 01, 2014 12:43 am
by ashok
Hi slansing,

We have already configured NTP, and the Nagios server is in sync with our NTP server.

I also checked another server, and there was no time change there.

Because our environment is very large, we had to configure it like that, and we are able to manage it.

The global check interval is 15 minutes, and it varies with the criticality of the servers: for a few it is 10 minutes, 5 minutes, 3 minutes, etc.
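
As a rough way to answer the question about how many services sit on each interval, the average check rate can be estimated from the interval mix. This is only a back-of-the-envelope Python sketch: the service counts are hypothetical (the thread does not give the real ones), and it ignores host checks, retries, and check run time.

```python
def average_checks_per_second(services_per_interval):
    """Estimate the steady-state check rate.

    services_per_interval maps a check interval in minutes to the number of
    services scheduled on that interval.
    """
    return sum(
        count / (minutes * 60)
        for minutes, count in services_per_interval.items()
    )

# Hypothetical counts for illustration only.
example = {15: 9000, 5: 600, 3: 180}
print(round(average_checks_per_second(example), 2))
```

Comparing this rate with how long a typical check takes gives a sense of how many checks are normally in flight, and therefore how far the normal load is from the concurrent check limit.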

I think that when this warning appeared:

Warning: A system time change of 0d 0h 16m 56s (forwards in time) has been detected. Compensating...

Nagios stopped the checks because of the time difference: the checks were already scheduled, and it could not find the exact timestamps.

I want clarity on why the time changed and where it changed: on the NTP server or on the Nagios server?

When I checked with the NTP server team, they said they had not made any changes, and I did not make any changes on the Nagios server either.

Why this happened is where I am stuck.

Please help me out.

Re: Nagios Checks were not happening - system time change

Posted: Tue Sep 02, 2014 6:17 am
by ashok
Can anybody give me a clue on this, please?

I need to provide the root cause by tomorrow EOD.

I also need confirmation of whether this issue lies with the Nagios application, the server OS, or the NTP server.

Please help.

Re: Nagios Checks were not happening - system time change

Posted: Wed Sep 03, 2014 1:26 pm
by abrist
Well, Nagios just detects the system time change; it has no control over actually changing it. I would assume your server experienced some drift and the NTP daemon corrected the offset. Nagios detected the change and attempted to compensate. As far as I know, Nagios, by default, does not effect system time changes.

Re: Nagios Checks were not happening - system time change

Posted: Wed Sep 03, 2014 1:31 pm
by eloyd
Warning: A system time change of 0d 0h 16m 56s (forwards in time) has been detected. Compensating...
This is NTP changing your clock. Make sure your hardware clock is in sync with your software clock and that your NTP server is reachable and has the right time. With a properly synced NTP client, a drift of almost 17 minutes should never happen.
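
For context on what "NTP changing your clock" means numerically, here is a minimal Python sketch of the standard NTP offset and delay calculation from the four timestamps of one request/response exchange. The function name and the sample values are made up for illustration; the sample simply reproduces a local clock that is 16m 56s (1016 seconds) behind the server.

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """Compute clock offset and round-trip delay for one NTP exchange.

    t1: client send time, t4: client receive time (both by the client's clock)
    t2: server receive time, t3: server send time (both by the server's clock)
    All values are in seconds. A positive offset means the client clock is behind.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay

# Illustrative values only: client clock 1016 s behind, 0.10 s round trip.
offset, delay = ntp_offset_and_delay(100.0, 1116.05, 1116.06, 100.11)
print(round(offset, 3), round(delay, 3))
```

An offset this large would normally be applied as a step rather than a gradual slew, which would match the sudden forward jump that Nagios reported.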

Re: Nagios Checks were not happening - system time change

Posted: Fri Sep 05, 2014 7:26 am
by ashok
Thank you very much for your valuable suggestions. :)

Re: Nagios Checks were not happening - system time change

Posted: Fri Sep 05, 2014 10:13 am
by tmcdonald
Did that resolve the problem?

Re: Nagios Checks were not happening - system time change

Posted: Mon Sep 08, 2014 1:02 am
by ashok
Actually, we did not do anything on the Nagios application side. It corrected itself after 17 minutes, without any service restarts or troubleshooting.
I have asked the concerned team to investigate why the drift is happening.

Re: Nagios Checks were not happening - system time change

Posted: Mon Sep 08, 2014 7:58 am
by eloyd
Depending on the age and manufacturer of the machine, you may have a bad or dying battery on the motherboard. Worth looking at, but glad it's working.