reset alert after acknowledge

matthew_77
Posts: 1
Joined: Fri Jan 20, 2012 4:14 am

reset alert after acknowledge

Post by matthew_77 »

I'd like to raise an alert when a string appears in a log. Then I'd like to reset the alert after I have acknowledged it.
I'm using the check_logfiles plugin (http://labs.consol.de/lang/en/nagios/check_logfiles/).
My service is defined as follows, using the "is_volatile" parameter in order to reset the alert after an acknowledgement... but it doesn't work.
Any suggestion?
Thanks

define service{
        use                     local-service
        host_name               localhost
        service_description     chexk authlog
        check_command           check_logfiles
        flap_detection_enabled  0
        max_check_attempts      1
        normal_check_interval   1
        retry_check_interval    1
        is_volatile             1
}
cseitan
Posts: 5
Joined: Wed Sep 18, 2013 11:27 am

Re: reset alert after acknowledge

Post by cseitan »

Hi Matthew,

I need the same feature and I am new to this forum. I did not see any replies to your question. Did you find a solution?


Best Regards,
Costel
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: reset alert after acknowledge

Post by abrist »

is_volatile will treat every check as a state change, and it will definitely not work the way you want. By acknowledging, you should suppress future alerts as long as the state does not change. After acknowledging, you could submit a passive check, resetting the status to 'OK'. Your check will need to be smart enough to know when to alert - or in other words, the script needs to be able to identify when the strings appeared in the logs and not alert on lines it has already alerted on.
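
To make the last point concrete, here is a minimal sketch of a check that remembers how far it has already read, so lines it has alerted on are not reported again. It is only an illustration, not how check_logfiles is implemented; the function name scan_new_lines and the state file are hypothetical.

```python
import os
import re
import tempfile


def scan_new_lines(log_path, state_path, pattern):
    """Return the lines matching `pattern` that were appended since the last call."""
    # Read the byte offset saved by the previous run (0 on the first run).
    try:
        with open(state_path) as state:
            offset = int(state.read().strip() or 0)
    except (FileNotFoundError, ValueError):
        offset = 0

    # If the log is now shorter than the saved offset, assume it was
    # rotated or truncated and start again from the beginning.
    if os.path.getsize(log_path) < offset:
        offset = 0

    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()
        new_offset = log.tell()

    # Remember where this run stopped so the next run skips these lines.
    with open(state_path, "w") as state:
        state.write(str(new_offset))

    regex = re.compile(pattern)
    lines = data.decode("utf-8", errors="replace").splitlines()
    return [line for line in lines if regex.search(line)]
```

A plugin built on this would exit with 1 (WARNING) or 2 (CRITICAL) when the returned list is non-empty and 0 (OK) otherwise, so the service returns to OK on the next check unless new matching lines have arrived. The sketch does not handle a line that is only partly written at the moment of the scan.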
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
cseitan
Posts: 5
Joined: Wed Sep 18, 2013 11:27 am

Re: reset alert after acknowledge

Post by cseitan »

Thank you for this quick answer.

I do indeed use is_volatile to treat every check as a state change. But let's say that we have defined a SERVICE_LOG_MONITORING service monitoring /var/adm/messages, and we are looking for several different patterns inside it:
"WARNING: pattern 1"
"WARNING: pattern 2"
"CRITICAL: pattern 1"

If the check finds the first warning in the log file, the state will become "Warning" (yellow). Once I have fixed the issue, I don't want the service state to remain on "Warning". This way, if the second "Warning" pattern appears in the log on the next check, I can see there is a different problem.
As I understand it, acknowledging the service alert only means that I am working on it, but it does not change the status.

The feature I am trying to implement looks to me like something mandatory for log monitoring, but I have not found a solution yet. In fact, this could be addressed simply if we had the possibility to launch a manual command from the Nagios GUI, for instance. But I do not see how this can be done. The event_handler doesn't help.

What are the best practices to monitor log files with Nagios so that we do not miss some critical event?

Thanks,
C
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: reset alert after acknowledge

Post by abrist »

Have you tried to submit a passive check with status 'OK'?
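
For reference, a passive result can be submitted by writing a PROCESS_SERVICE_CHECK_RESULT line to the Nagios external command file. The sketch below assumes that external commands and passive checks are enabled for the service, and that the command file lives at /usr/local/nagios/var/rw/nagios.cmd, which is a common default but may differ on your install. The helper names are made up for the example.

```python
import time


def passive_ok_command(host, service, output="Reset by operator", now=None):
    """Build the external command line that sets a service to OK (return code 0)."""
    timestamp = int(time.time() if now is None else now)
    return "[{}] PROCESS_SERVICE_CHECK_RESULT;{};{};0;{}\n".format(
        timestamp, host, service, output
    )


def submit(command, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write the command to the Nagios command file (the path is an assumption)."""
    with open(command_file, "w") as cmd:
        cmd.write(command)
```

For the service in the first post this would be called as submit(passive_ok_command("localhost", "chexk authlog")), run as a user that is allowed to write to the command file.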
cseitan
Posts: 5
Joined: Wed Sep 18, 2013 11:27 am

Re: reset alert after acknowledge

Post by cseitan »

Hi,

Yes, the passive check will change the status in Nagios to OK. But on the next check the problem will come back, because the plug-in is not aware that the Nagios status was changed by our decision.
For the plug-in, the issue is still not fixed.

This is what we want to a certain extent, because when we come in in the morning we want to see the messages received overnight. But after analysis we want to be able to confirm, from the GUI to the plug-in, that the issue has been fixed.

Since I do not see how to pass new arguments to the plug-in from the GUI, or how to execute an external command from the GUI, to say "the problem is fixed", the plug-in will continue to report that the problem is still there.

I hope I made it clearer. :)

Any ideas?

Regards,
C.
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: reset alert after acknowledge

Post by slansing »

"This is what we want to a certain extent because when coming in the morning we want to see messages received over night. But after analysis we want to be able to confirm from the GUI, to the plug-in, that the issue has been fixed."
The plugin is not responsible for acknowledging or altering what causes alerts, notifications, etc. That is all handled at the Nagios system level. If you acknowledge that a service is in a critical state, alerts will be suppressed until it changes to a different state. If there is still a problem with what that passive check reports on from the remote system, it will continue to send a non-OK result (for example, exit code 2 for critical) based on the thresholds you have set, until that problem is resolved on the remote system. This is why acknowledging alerts is a powerful tool.

However, as you said, acknowledging will not change the state; it will only change how Nagios perceives and reacts to that object's current state.
"This way, if the second "Warning" pattern appears in the log on next check, I can see there is a different problem.
THe feature I am trying to implement looks to me as something mandatory for log monitoring but I did not find a solution yet. In fact, this could be simply addressed if we would have the possibility to launch a manual cmd from the NAGIOS GUI for instance. But I do not see how this can be done. The event_handler doesn't help."
This would not really be possible with passive checks, as they are only sent when cron comes around to their local service check on the remote host. If that check is run and "pattern 1" is still present in the log file you are monitoring, it will update the Nagios service status with the same information, moving on through its max_check_attempts.
"What are the best practices to monitor log files with Nagios so that we do not miss some critical event?"
I'm not sure we really have best practices for monitoring logs; this is very environment specific.

Once again, acknowledging a state change from a service's perspective will never reset the state to OK. There may be a way to script an event handler to do this. However, you will continue to receive that state change if the issue is not really fixed on the remote host. This is the purpose of Nagios; it is not a log server that monitors logs in that fashion.
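
As a rough sketch of what such a script could look like: event handlers run on state changes rather than on acknowledgements, so this example swaps in a notification command instead, relying on the $NOTIFICATIONTYPE$ macro being ACKNOWLEDGEMENT when an acknowledgement notification is sent. The function name is hypothetical, and whether the hook fires at all depends on how notifications and contacts are configured on your system.

```python
import time


def reset_on_acknowledgement(notification_type, host, service, now=None):
    """Return a passive OK command line for an ACKNOWLEDGEMENT, otherwise None."""
    # Only react to acknowledgements; PROBLEM, RECOVERY, etc. are ignored.
    if notification_type != "ACKNOWLEDGEMENT":
        return None
    timestamp = int(time.time() if now is None else now)
    return "[{}] PROCESS_SERVICE_CHECK_RESULT;{};{};0;Reset after acknowledgement\n".format(
        timestamp, host, service
    )
```

The wrapper script would receive $NOTIFICATIONTYPE$, $HOSTNAME$ and $SERVICEDESC$ as arguments and write any returned line to the Nagios external command file. As noted above, the service will go non-OK again on the next check if the plugin still sees the problem.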