how to reset the state and attempt count

Posted: Mon Jul 10, 2017 8:09 am
by jenithangel
removed

Re: how to reset the state and attempt count

Posted: Mon Jul 10, 2017 9:46 am
by mcapra
jenithangel wrote: How to change the state to OK again after it reached CRITICAL HARD.
Typically the Nagios plugin associated with your check is supposed to take care of this. Plugins return 2 for a CRITICAL and 0 for an OK, so if the state has reverted to OK, your plugin should return 0 and subsequently your Nagios check's state should revert to OK.
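As a rough illustration of that exit-code contract, here is a minimal shell sketch. The function name run_check and the idea of passing the probe in as a command are assumptions made for this example; they are not part of check_http or any shipped plugin.

```shell
#!/bin/sh
# Hypothetical plugin skeleton. Nagios reads the plugin's exit status:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.

run_check() {
    # "$@" is the probe command to run (an assumption for illustration).
    if "$@" >/dev/null 2>&1; then
        echo "OK - probe succeeded"
        return 0
    else
        echo "CRITICAL - probe failed"
        return 2
    fi
}
```

A real plugin script would end with something like `run_check <probe command>; exit $?`, so that the status reaches Nagios. Once the probe succeeds again, the exit status goes back to 0 and the service should return to OK on the next check.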

Can you share the service definition used for this check, the associated command definition used in the check_command parameter of the service, and any scripts/plugins associated with the command?

Re: how to reset the state and attempt count

Posted: Mon Jul 10, 2017 11:33 am
by jenithangel
command.cfg
-----------------


# 'check_http' command definition
define command{
command_name check_http
command_line $USER1$/check_http -I $HOSTADDRESS$ $ARG1$
}


# 'check-host-alive' command definition
define command{
command_name check-host-alive
command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
}

define command {
command_name check_http_url
command_line $USER1$/check_http -I $ARG1$ -p $ARG2$ -u $ARG3$
}

define command {
command_name check_http_port
command_line $USER1$/check_http -I $ARG1$ -p $ARG2$
}

define command {
command_name recover-service
command_line /u/gls/Monitoring/nagios/scripts/serviceRecovery.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ $ARG5$ $ARG6$ $ARG7$
}

Re: how to reset the state and attempt count

Posted: Mon Jul 10, 2017 11:55 am
by jenithangel
The main problem is:

The service check reaches the CRITICAL HARD state once the current attempt reaches the max_retry value, and the state becomes 2. Even after the Java service comes back up following a network failure, subsequent validation doesn't happen because the service state is still HARD, even after a Nagios restart. The worker tries to run the same notification and exits.

Re: how to reset the state and attempt count

Posted: Mon Jul 10, 2017 4:43 pm
by tgriep
Are you restarting the Nagios process in your script if the status of the service check doesn't return to the OK state?

Re: how to reset the state and attempt count

Posted: Tue Jul 11, 2017 12:17 am
by jenithangel
If the status of the service is CRITICAL SOFT
-> and the service attempt has reached the maximum retry count, we try to stop and start the affected process, NOT Nagios.
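For reference, a sketch of how such an event-handler decision could look in shell, assuming the arguments arrive in the order $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ as in the recover-service command above. The function decide_action and its fourth argument (the service's max_check_attempts) are hypothetical and are not taken from serviceRecovery.sh. One thing worth keeping in mind: in Nagios Core the failing check that reaches max_check_attempts is already reported as HARD, so a branch that waits for SOFT at the maximum attempt may never run.

```shell
#!/bin/sh
# Sketch only: decide whether the handler should restart the application.
# Usage: decide_action <state> <state_type> <attempt> <max_check_attempts>
decide_action() {
    state=$1; state_type=$2; attempt=$3; max_attempts=$4
    case "$state" in
        CRITICAL)
            case "$state_type" in
                SOFT)
                    # Act on the last SOFT attempt, one before the state turns HARD.
                    if [ "$attempt" -eq $((max_attempts - 1)) ]; then
                        echo restart
                        return 0
                    fi
                    ;;
                HARD)
                    # Also act when the handler is invoked with a HARD state.
                    echo restart
                    return 0
                    ;;
            esac
            ;;
    esac
    echo none
}
```

With max_check_attempts = 4, `decide_action CRITICAL SOFT 3 4` and `decide_action CRITICAL HARD 4 4` both print `restart`, while earlier SOFT attempts and OK states print `none`.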

Re: how to reset the state and attempt count

Posted: Tue Jul 11, 2017 6:24 am
by jenithangel
Let me explain the issue clearly:

I have a Java service, say xxx.

This service is checked every 2 minutes (normal_check_interval).

The network goes down for some time due to an outage.

After 4 tries (max_check_attempts = 4), the service reaches the CRITICAL HARD state.

Thereafter Nagios sends an email notification every 2 hours (notification_interval) until the service comes back UP BY ITSELF. Is there a way to reset the state to SOFT and the attempt count back to 1, so that Nagios will try to start the affected application xxx even after it has reached CRITICAL HARD?

Re: how to reset the state and attempt count

Posted: Tue Jul 11, 2017 10:37 am
by mcapra
jenithangel wrote: so that nagios will try to start this issue application xxx even after it reached CRITICAL HARD.
So is Nagios Core trying to restart this application using an event handler? That's a pretty critical piece of information which makes a lot more sense in regards to the use case.

You might try marking the service as "volatile" using the is_volatile option.
What's So Special About Volatile Services?

Volatile services differ from "normal" services in three important ways. Each time they are checked when they are in a hard non-OK state, and the check returns a non-OK state (i.e. no state change has occurred):
  • the non-OK service state is logged
  • contacts are notified about the problem (if that's what should be done). Note: Notification intervals are ignored for volatile services.
  • the event handler for the service is run (if one has been defined)
So, if the service is marked as is_volatile, the event handler should execute every time the check returns a non-OK state, even if the check is currently in a HARD state.
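A service definition with the volatile flag set might look roughly like the fragment below. The host name, the check_command arguments, and the bare event_handler line are placeholders for illustration; only is_volatile, max_check_attempts = 4, and the recover-service command name come from this thread.

```
define service {
    host_name               app-host                        ; hypothetical host
    service_description     xxx                             ; placeholder name used in this thread
    check_command           check_http_port!127.0.0.1!8080  ; hypothetical address and port
    max_check_attempts      4
    is_volatile             1                               ; run the event handler on every non-OK result while HARD
    event_handler           recover-service                 ; arguments omitted here
}
```

Since notification intervals are ignored for volatile services, this would also send a notification on every failing check, so it may be worth reviewing the notification settings at the same time.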

If you still really wanted to reset the state of the service check programmatically, so that the event handler would be fired again, one solution would be to use the external commands file. You'll need to be mindful of how this command is called, though, to prevent infinite looping of your event handler (I've done this!).

You may find the PROCESS_HOST_CHECK_RESULT and PROCESS_SERVICE_CHECK_RESULT commands particularly useful in terms of forcibly "resetting" a check's state to OK.
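A minimal shell sketch of submitting such a result is below. The line format `[timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;plugin_output` is the documented external command format; the function names, the host and service names, and the command file location are assumptions for this example. It also assumes external commands and passive service checks are enabled for the service.

```shell
#!/bin/sh
# Sketch only: build and submit a passive service check result.

# Print one external-command line for the given host, service, code, and output.
format_passive_result() {
    host=$1; service=$2; code=$3; output=$4
    now=$(date +%s)
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$now" "$host" "$service" "$code" "$output"
}

# Write the line to the external command file given as the first argument.
submit_passive_result() {
    cmd_file=$1; shift
    format_passive_result "$@" > "$cmd_file"
}
```

For example, `submit_passive_result /path/to/nagios.cmd app-host xxx 0 "OK - reset by recovery script"` would submit an OK result for the hypothetical service xxx. If the application is still down, the next active check fails again and the SOFT retry cycle starts over, which is exactly where the looping risk mentioned above comes from, so some guard (a retry counter or a cooldown) is probably a good idea.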

Re: how to reset the state and attempt count

Posted: Tue Jul 11, 2017 11:08 am
by tgriep
Thanks mcapra for the help.
Adding the Volatile setting to that check would be the easiest thing to do, without having to create a script to reset the state after it has reached the Critical Hard state.

Re: how to reset the state and attempt count

Posted: Tue Jul 11, 2017 11:41 am
by jenithangel
Thanks tgriep and mcapra for your responses. I will try the volatile option and post a reply tomorrow.

One more issue I noticed is that Nagios leaves behind too many defunct processes.

Process started by nagios
------
us32860s4000d0a:/u/gls/middleware/tools/nagios/var> psg nagios
gls 14718 4497 0 11:20 ? 00:00:00 [nagios] <defunct>
gls 14719 4497 0 11:20 ? 00:00:00 [nagios] <defunct>
gls 14724 4497 0 11:20 ? 00:00:00 [nagios] <defunct>
gls 14725 4497 0 11:20 ? 00:00:00 [nagios] <defunct>
gls 14980 4497 0 11:20 ? 00:00:00 /u/gls/middleware/tools/nagios/bin/nagios --worker /u/gls/middleware/tools/nagios/var/rw/nagios.qh
gls 14981 4497 0 11:20 ? 00:00:00 /u/gls/middleware/tools/nagios/bin/nagios --worker /u/gls/middleware/tools/nagios/var/rw/nagios.qh
gls 14983 4497 0 11:20 ? 00:00:00 /u/gls/middleware/tools/nagios/bin/nagios --worker /u/gls/middleware/tools/nagios/var/rw/nagios.qh
gls 14984 4497 0 11:20 ? 00:00:00 /u/gls/middleware/tools/nagios/bin/nagios --worker /u/gls/middleware/tools/nagios/var/rw/nagios.qh
gls 4497 1 0 03:39 ? 00:00:05 /u/gls/middleware/tools/nagios/bin/nagios -d /u/gls/middleware/tools/nagios/etc/nagios.cfg
gls 4499 4497 0 03:39 ? 00:00:00 [nagios] <defunct>
gls 4500 4497 0 03:39 ? 00:00:00 [nagios] <defunct>
gls 4501 4497 0 03:39 ? 00:00:00 [nagios] <defunct>
gls 4502 4497 0 03:39 ? 00:00:00 [nagios] <defunct>
gls 4520 4497 0 03:39 ? 00:00:02 /u/gls/middleware/tools/nagios/bin/nagios -d /u/gls/middleware/tools/nagios/etc/nagios.cfg
us32860s4000d0a:/u/gls/middleware/tools/nagios/var> date

nagios.log info
-----------------

[1499762354] nerd: Fully initialized and ready to rock!
[1499762354] wproc: Successfully registered manager as @wproc with query handler
[1499762354] wproc: Registry request: name=Core Worker 4501;pid=4501
[1499762354] wproc: Registry request: name=Core Worker 4500;pid=4500
[1499762354] wproc: Registry request: name=Core Worker 4499;pid=4499
[1499762354] wproc: Registry request: name=Core Worker 4502;pid=4502
[1499762354] Successfully launched command file worker with pid 4520


[1499779879] wproc: NOTIFY job 806 from worker Core Worker 4499 is a non-check helper but exited with return code 2

[1499787553] Auto-save of retention data completed successfully.
[1499787896] SERVICE NOTIFICATION: mm-websocket-support;localhost;dev-mm-websocket-service;CRITICAL;notify-service-by-email;connect to address localhost and port 23114: Connection refused
[1499790025] wproc: Socket to worker Core Worker 4499 broken, removing
[1499790025] wproc: Socket to worker Core Worker 4500 broken, removing
[1499790025] wproc: Socket to worker Core Worker 4501 broken, removing
[1499790025] wproc: Socket to worker Core Worker 4502 broken, removing
[1499790025] Caught SIGHUP, restarting...
[1499790025] Event broker module 'NERD' deinitialized successfully.
[1499790025] Nagios 4.1.1 starting... (PID=4497)
[1499790025] Local time is Tue Jul 11 11:20:25 CDT 2017

I suspect that these get created when it tries to restart and fails. Please let me know.