Alert won't reset

Posted: Wed Sep 17, 2014 4:14 am
by oz123
Hi,

We recently upgraded to Nagios XI 2014R1.4.
Since then, some of our service alerts won’t reset even though the checked value has returned to within its normal thresholds.
If I click “Schedule a forced immediate check”, the alert resets. Also, if I run the check command manually from the Linux shell, I can see that the state is OK.
The “Next Check” date and time is set correctly, which means the alert should already have been reset.
Note that we haven't made any changes to the service configurations; I think the problem started after the upgrade.
Attached is the related information.

Thank you

Re: Alert won't reset

Posted: Wed Sep 17, 2014 10:03 am
by jwelch
What is the 'check period' of this service?
If the 'check period' is not 24x7 then that might be why the
alert is not clearing. From your attachment it looked like the
next scheduled check was not for almost another 2 hours.

Re: Alert won't reset

Posted: Wed Sep 17, 2014 12:31 pm
by Box293
In your screenshot it does appear that for this service the next check is in two hours. If you don't do anything, when the next check time is reached, does this jump forward another two hours?

Your service settings do appear to be correct.

Can you please execute this at the CLI and post the results:

Code:

cat /usr/local/nagios/etc/nagios.cfg | grep interval_length

Is it just this host / service or is it random?
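
As background on why interval_length matters here: Nagios multiplies a service's check_interval by interval_length to get the number of seconds between regular checks. The following is only a rough sketch of that arithmetic; the helper name check_interval_seconds is made up for illustration, and the fallback of 60 assumes the stock default.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the effective number of seconds between checks.
# $1 = path to nagios.cfg, $2 = the service's check_interval (in time units)
check_interval_seconds() {
    local cfg="$1" units="$2" len
    # Use the last uncommented interval_length=N line, if there is one.
    len=$(sed -n 's/^[[:space:]]*interval_length[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$cfg" | tail -n 1)
    # Assumption: fall back to 60 seconds per unit when the directive is absent.
    echo $(( ${len:-60} * units ))
}
```

For example, `check_interval_seconds /usr/local/nagios/etc/nagios.cfg 5` prints 300 when interval_length is 60, i.e. a check every five minutes.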

Re: Alert won't reset

Posted: Thu Sep 18, 2014 9:44 am
by oz123
Hi,

1) I hardly ever get host alerts, so I don't know whether it's only a service issue. As I said, the problem started only after the last update.
2) In my screenshot, the last check time equals the time the alert fired, and it doesn't change even when the next check time arrives.
3) The screenshot was taken at 07:51, but the alert still hadn't reset after 07:53 (the next check time).
4) The next check moves forward every 5 minutes (not 2 hours), as shown in the screenshot under the check settings configuration.
5) cat /usr/local/nagios/etc/nagios.cfg | grep interval_length reports interval_length = 60
6) I verified that the check period is 24x7.
So you can see from the Linux output that when I ran the check manually at 07:51, the alert should have been reset.
Below is the log. For some reason the alert reset only at 12:51, which means nothing happens when the next check time is reached:

Sep 15 00:00:27 nagios nagios: CURRENT SERVICE STATE: NTP;Memory Usage;OK;HARD;1;Memory usage: total:16381.31 MB - used: 5601.27 MB (34%) - free: 10780.04 MB (66%)
Sep 16 00:00:22 nagios nagios: CURRENT SERVICE STATE: NTP;Memory Usage;OK;HARD;1;Memory usage: total:16381.31 MB - used: 9530.82 MB (58%) - free: 6850.48 MB (42%)
Sep 17 00:00:25 nagios nagios: CURRENT SERVICE STATE: NTP;Memory Usage;OK;HARD;1;Memory usage: total:16381.31 MB - used: 12437.31 MB (76%) - free: 3944.00 MB (24%)
Sep 17 05:57:23 nagios nagios: SERVICE ALERT: NTP;Memory Usage;WARNING;SOFT;1;Memory usage: total:16381.31 MB - used: 13942.92 MB (85%) - free: 2438.38 MB (15%)
Sep 17 05:59:22 nagios nagios: SERVICE ALERT: NTP;Memory Usage;WARNING;HARD;2;Memory usage: total:16381.31 MB - used: 13941.13 MB (85%) - free: 2440.18 MB (15%)
Sep 17 12:51:00 nagios nagios: SERVICE ALERT: NTP;Memory Usage;OK;HARD;2;Memory usage: total:16381.31 MB - used: 5057.06 MB (31%) - free: 11324.25 MB (69%)
Sep 18 00:00:24 nagios nagios: CURRENT SERVICE STATE: NTP;Memory Usage;OK;HARD;1;Memory usage: total:16381.31 MB - used: 5122.05 MB (31%) - free: 11259.26 MB (69%)
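
To pick out only the state transitions and skip the midnight CURRENT SERVICE STATE entries, the log can be filtered per service. This is just a sketch that assumes syslog-style lines like the ones above; the function name service_transitions is hypothetical, and the log path in the usage note is an assumption about where syslog writes on this system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list state transitions for one service.
# $1 = log file, $2 = host name, $3 = service description
service_transitions() {
    awk -v key="SERVICE ALERT: $2;$3;" '
        index($0, key) {
            # After the key the line continues: state;state_type;attempt;plugin_output
            rest = substr($0, index($0, key) + length(key))
            split(rest, f, ";")
            # Print month, day, time, state and state type.
            print $1, $2, $3, f[1], f[2]
        }' "$1"
}
```

Run as `service_transitions /var/log/messages NTP "Memory Usage"` against the lines posted above, this would print three rows: `Sep 17 05:57:23 WARNING SOFT`, `Sep 17 05:59:22 WARNING HARD` and `Sep 17 12:51:00 OK HARD`. That makes the gap of almost seven hours between the HARD WARNING and the recovery easy to spot.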

Re: Alert won't reset

Posted: Thu Sep 18, 2014 10:19 am
by Box293
I'm going to suggest that we delete the retention.dat file.

Its purpose is as follows:
This is the file that Nagios will use for storing status, downtime, and comment information before it shuts down. When Nagios is restarted it will use the information stored in this file for setting the initial states of services and hosts before it starts monitoring anything. In order to make Nagios retain state information between program restarts, you must enable the retain_state_information option.
So after deleting it and restarting Nagios, everything will appear in a pending state and all checks will be scheduled.

Do the following at the CLI:

Code:

service nagios stop
rm /usr/local/nagios/var/retention.dat
service nagios start

Let's see if that fixes your problem.
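
A more cautious variant, if you would rather keep a copy of the old state, is to move the file aside instead of deleting it. This is only a sketch; the helper name backup_retention and the timestamped .bak suffix are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move the retention file aside instead of deleting it.
# $1 = path to retention.dat; prints the path of the backup copy.
backup_retention() {
    local src="$1" dest
    dest="${src}.bak.$(date +%Y%m%d%H%M%S)"
    mv -- "$src" "$dest" && echo "$dest"
}

# Intended use, as root. Stop Nagios first, because it writes the
# retention file again when it shuts down:
#   service nagios stop
#   backup_retention /usr/local/nagios/var/retention.dat
#   service nagios start
```

The effect on Nagios is the same as the rm above, since the file is gone from its expected location when the service starts.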

Re: Alert won't reset

Posted: Mon Sep 22, 2014 3:00 am
by oz123
Hi,

I did what you suggested a couple of days ago, but it didn't help.
Attached is another example where you can see a CRITICAL alert that won't reset. If I "Schedule a forced immediate check", it does reset.
The black area shows the command being run from the Linux shell at 07:44.

Thanks

Re: Alert won't reset

Posted: Mon Sep 22, 2014 1:12 pm
by lmiltchev
Do you have many services set with a check_interval of 1 minute? You may try the following:

1. Make sure that the "auto_rescheduling_window" directive in the "nagios.cfg" is set LOWER than the smallest check interval. For example, if your check interval is 1 min, you can set "auto_rescheduling_window" in the nagios.cfg to 45 sec.

Code:

auto_rescheduling_window=45

2. Make sure that "auto_rescheduling_interval" is lower than "auto_rescheduling_window". For example:

Code:

auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=45

Restart Nagios.

Code:

service nagios restart

Let me know if this helped.
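
The two conditions above (interval lower than window, window lower than the smallest check interval) can be checked before restarting. The following is a minimal sketch; cfg_value and rescheduling_ok are hypothetical helper names, and the smallest check interval has to be supplied by hand in seconds.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a numeric directive from a nagios.cfg-style file.
# $1 = config file, $2 = directive name
cfg_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p" "$1" | tail -n 1
}

# Hypothetical helper: succeed only if
#   auto_rescheduling_interval < auto_rescheduling_window < smallest check interval.
# $1 = config file, $2 = smallest check interval in seconds
rescheduling_ok() {
    local interval window
    interval=$(cfg_value "$1" auto_rescheduling_interval)
    window=$(cfg_value "$1" auto_rescheduling_window)
    # Fail if either directive is missing or the ordering does not hold.
    [ -n "$interval" ] && [ -n "$window" ] &&
        [ "$interval" -lt "$window" ] && [ "$window" -lt "$2" ]
}
```

With a 1 minute check interval, `rescheduling_ok /usr/local/nagios/etc/nagios.cfg 60 && echo OK` prints OK for the 30/45 values shown above.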