how to reset the state and attempt count

jenithangel
Posts: 9
Joined: Mon Jul 10, 2017 7:31 am

how to reset the state and attempt count

Post by jenithangel »

removed
Last edited by jenithangel on Tue Jul 11, 2017 4:34 am, edited 2 times in total.
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: how to reset the state and attempt count

Post by mcapra »

jenithangel wrote: How to change the state to OK again after it reached CRITICAL HARD.
Typically the Nagios plugin associated with your check is supposed to take care of this. Plugins return 2 for CRITICAL and 0 for OK, so once the monitored service has recovered, your plugin should return 0 and your Nagios check's state should then revert to OK.
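
To make the return-code convention concrete, here is a minimal sketch of how a plugin maps what it found to a status line and an exit code. The function name plugin_result and the "up"/"down" input are hypothetical stand-ins for a real probe; only the exit codes (0 = OK, 2 = CRITICAL, 3 = UNKNOWN) are the standard Nagios plugin convention.

```shell
# Hypothetical helper: translate a probe result into a Nagios plugin
# status line and exit code. "$1" stands in for whatever a real plugin
# would measure (it is an assumption, not part of any shipped plugin).
plugin_result() {
    case "$1" in
        up)   echo "OK - service is responding";        return 0 ;;
        down) echo "CRITICAL - service not responding"; return 2 ;;
        *)    echo "UNKNOWN - unexpected probe result"; return 3 ;;
    esac
}
```

Nagios only looks at the exit code and the first line of output, so a check returns to OK as soon as the plugin exits with 0 again.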

Can you share the service definition used for this check, the associated command definition used in the check_command parameter of the service, and any scripts/plugins associated with the command?
Former Nagios employee
https://www.mcapra.com/
jenithangel
Posts: 9
Joined: Mon Jul 10, 2017 7:31 am

Re: how to reset the state and attempt count

Post by jenithangel »

command.cfg
-----------------


# 'check_http' command definition
define command{
        command_name    check_http
        command_line    $USER1$/check_http -I $HOSTADDRESS$ $ARG1$
        }


# 'check-host-alive' command definition
define command{
        command_name    check-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
        }

define command {
        command_name    check_http_url
        command_line    $USER1$/check_http -I $ARG1$ -p $ARG2$ -u $ARG3$
        }

define command {
        command_name    check_http_port
        command_line    $USER1$/check_http -I $ARG1$ -p $ARG2$
        }

define command {
        command_name    recover-service
        command_line    /u/gls/Monitoring/nagios/scripts/serviceRecovery.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ $ARG5$ $ARG6$ $ARG7$
        }
Last edited by jenithangel on Tue Jul 11, 2017 3:43 am, edited 2 times in total.
jenithangel
Posts: 9
Joined: Mon Jul 10, 2017 7:31 am

Re: how to reset the state and attempt count

Post by jenithangel »

The main problem is:

The service check reaches the CRITICAL HARD state: the current attempt reaches the max_retry value and the state becomes 2. Even after the Java service comes back up following a network failure, no subsequent validation happens, because the service state stays HARD even after a Nagios restart. The worker tries to run the same notification and exits.
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: how to reset the state and attempt count

Post by tgriep »

Are you restarting the nagios process in your script if the status of the service check doesn't return back to the OK state?
Be sure to check out our Knowledgebase for helpful articles and solutions!
jenithangel
Posts: 9
Joined: Mon Jul 10, 2017 7:31 am

Re: how to reset the state and attempt count

Post by jenithangel »

If the status of the service is CRITICAL SOFT:
-> if the service attempt has reached the maximum retry count, we try to stop and start the affected process, NOT Nagios.
jenithangel
Posts: 9
Joined: Mon Jul 10, 2017 7:31 am

Re: how to reset the state and attempt count

Post by jenithangel »

Let me explain the issue clearly:

I have a Java service, say xxx.

This service is checked every 2 minutes (normal_check_interval).

The network goes down for some time due to an outage.

After 4 tries (max_check_attempts = 4), the service reaches the CRITICAL HARD state.

From then on, Nagios sends an email notification every 2 hours (notification_interval) until the service comes back UP BY ITSELF. Is there a way to reset the state to SOFT and the attempt count back to 1, so that Nagios will try to start the affected application xxx even after it has reached CRITICAL HARD?
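
The flow described above can be sketched as the decision an event handler makes from its first three arguments. This is only an illustration, assuming the handler receives $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ in that order (as the recover-service command posted earlier passes them) and that max_check_attempts is 4; the function name handler_action is hypothetical and this is not the actual serviceRecovery.sh.

```shell
# Hypothetical decision function for an event handler.
# Arguments: $1=$SERVICESTATE$  $2=$SERVICESTATETYPE$  $3=$SERVICEATTEMPT$
# Prints "restart" when a restart should be attempted, "skip" otherwise.
handler_action() {
    state=$1; state_type=$2; attempt=$3
    if [ "$state" != "CRITICAL" ]; then
        echo skip
        return 0
    fi
    case "$state_type" in
        SOFT)
            # Assumption: max_check_attempts = 4, so attempt 3 is the
            # last check that is still SOFT.
            if [ "$attempt" -eq 3 ]; then echo restart; else echo skip; fi
            ;;
        HARD)
            # For a normal (non-volatile) service the handler reaches
            # this branch once, when the state first becomes HARD.
            echo restart
            ;;
        *)
            echo skip
            ;;
    esac
}
```

Because the fourth failed check is already HARD, a handler that only acts on "SOFT at the maximum attempt" would never fire; that is why the sketch treats the last SOFT attempt and the HARD state as separate cases.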
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: how to reset the state and attempt count

Post by mcapra »

jenithangel wrote: so that nagios will try to start this issue application xxx even after it reached CRITICAL HARD.
So is Nagios Core trying to restart this application using an event handler? That's a pretty critical piece of information, and it makes the use case a lot easier to understand.

You might try marking the service as "volatile" using the is_volatile option.
What's So Special About Volatile Services?

Volatile services differ from "normal" services in three important ways. Each time they are checked when they are in a hard non-OK state, and the check returns a non-OK state (i.e. no state change has occurred):
  • the non-OK service state is logged
  • contacts are notified about the problem (if that's what should be done). Note: Notification intervals are ignored for volatile services.
  • the event handler for the service is run (if one has been defined)
So, if the service is marked as is_volatile, the event handler should execute every time the check returns a non-OK state, even if the check is currently in a HARD state.
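
As a rough illustration of the suggestion above, a service definition with is_volatile enabled might look like the fragment below. The host and service names are taken from the SERVICE NOTIFICATION line in the nagios.log excerpt later in this thread; the generic-service template and the check_command are guesses, and the recover-service arguments are left out because the thread does not show them.

```
define service {
        use                     generic-service           ; assumption: template from the sample config
        host_name               localhost
        service_description     dev-mm-websocket-service
        check_command           check_http_port!localhost!23114   ; guess, based on the posted command.cfg
        max_check_attempts      4
        is_volatile             1                         ; re-run the event handler on every non-OK HARD result
        event_handler           recover-service           ; arguments omitted (not shown in the thread)
        }
```

Keep in mind that, per the quoted documentation, notification intervals are ignored for volatile services, so contacts may be notified on every failed check.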

If you still really wanted to reset the state of the service check programmatically, so that the event handler would be fired again, one solution would be to use the external commands file. You'll need to be mindful of how this command is called, though, to prevent infinite looping of your event handler (I've done this!).

You may find the PROCESS_HOST_CHECK_RESULT and PROCESS_SERVICE_CHECK_RESULT commands particularly useful in terms of forcibly "resetting" a check's state to OK.
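
A minimal sketch of submitting such a result through the external command file follows. The line format ([timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;plugin_output) is the documented one; the helper name passive_result_line, the default command file path, and the host/service names in the usage comment are assumptions for illustration. It also assumes external commands and passive service checks are enabled.

```shell
# Build one PROCESS_SERVICE_CHECK_RESULT line for the external command file.
# Arguments: $1=unix timestamp  $2=host_name  $3=service_description
#            $4=return code (0 = OK)  $5=plugin output
passive_result_line() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Assumption: adjust to the command_file setting in your nagios.cfg.
COMMAND_FILE=${COMMAND_FILE:-/usr/local/nagios/var/rw/nagios.cmd}

# Usage (left commented out so the sketch has no side effects):
# passive_result_line "$(date +%s)" localhost dev-mm-websocket-service 0 \
#     "OK - state reset by script" > "$COMMAND_FILE"
```

If this is called from the event handler itself, guard it (for example, only submit once per outage), since forcing the state to OK causes another state change and another handler run on the next failure.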
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: how to reset the state and attempt count

Post by tgriep »

Thanks mcapra for the help.
Adding the volatile setting to that check would be the easiest thing to do, without having to create a script to reset the state after it has reached the CRITICAL HARD state.
jenithangel
Posts: 9
Joined: Mon Jul 10, 2017 7:31 am

Re: how to reset the state and attempt count

Post by jenithangel »

Thanks tgriep and mcapra for your responses. I will try the volatile option and post the result tomorrow.

One more issue I noticed is that Nagios leaves too many defunct processes.

Processes started by nagios
------
us32860s4000d0a:/u/gls/middleware/tools/nagios/var> psg nagios
gls 14718 4497 0 11:20 ? 00:00:00 [nagios] <defunct>
gls 14719 4497 0 11:20 ? 00:00:00 [nagios] <defunct>
gls 14724 4497 0 11:20 ? 00:00:00 [nagios] <defunct>
gls 14725 4497 0 11:20 ? 00:00:00 [nagios] <defunct>
gls 14980 4497 0 11:20 ? 00:00:00 /u/gls/middleware/tools/nagios/bin/nagios --worker /u/gls/middleware/tools/nagios/var/rw/nagios.qh
gls 14981 4497 0 11:20 ? 00:00:00 /u/gls/middleware/tools/nagios/bin/nagios --worker /u/gls/middleware/tools/nagios/var/rw/nagios.qh
gls 14983 4497 0 11:20 ? 00:00:00 /u/gls/middleware/tools/nagios/bin/nagios --worker /u/gls/middleware/tools/nagios/var/rw/nagios.qh
gls 14984 4497 0 11:20 ? 00:00:00 /u/gls/middleware/tools/nagios/bin/nagios --worker /u/gls/middleware/tools/nagios/var/rw/nagios.qh
gls 4497 1 0 03:39 ? 00:00:05 /u/gls/middleware/tools/nagios/bin/nagios -d /u/gls/middleware/tools/nagios/etc/nagios.cfg
gls 4499 4497 0 03:39 ? 00:00:00 [nagios] <defunct>
gls 4500 4497 0 03:39 ? 00:00:00 [nagios] <defunct>
gls 4501 4497 0 03:39 ? 00:00:00 [nagios] <defunct>
gls 4502 4497 0 03:39 ? 00:00:00 [nagios] <defunct>
gls 4520 4497 0 03:39 ? 00:00:02 /u/gls/middleware/tools/nagios/bin/nagios -d /u/gls/middleware/tools/nagios/etc/nagios.cfg
us32860s4000d0a:/u/gls/middleware/tools/nagios/var> date

nagios.log info
-----------------

[1499762354] nerd: Fully initialized and ready to rock!
[1499762354] wproc: Successfully registered manager as @wproc with query handler
[1499762354] wproc: Registry request: name=Core Worker 4501;pid=4501
[1499762354] wproc: Registry request: name=Core Worker 4500;pid=4500
[1499762354] wproc: Registry request: name=Core Worker 4499;pid=4499
[1499762354] wproc: Registry request: name=Core Worker 4502;pid=4502
[1499762354] Successfully launched command file worker with pid 4520


[1499779879] wproc: NOTIFY job 806 from worker Core Worker 4499 is a non-check helper but exited with return code 2

[1499787553] Auto-save of retention data completed successfully.
[1499787896] SERVICE NOTIFICATION: mm-websocket-support;localhost;dev-mm-websocket-service;CRITICAL;notify-service-by-email;connect to address localhost and port 23114: Connection refused
[1499790025] wproc: Socket to worker Core Worker 4499 broken, removing
[1499790025] wproc: Socket to worker Core Worker 4500 broken, removing
[1499790025] wproc: Socket to worker Core Worker 4501 broken, removing
[1499790025] wproc: Socket to worker Core Worker 4502 broken, removing
[1499790025] Caught SIGHUP, restarting...
[1499790025] Event broker module 'NERD' deinitialized successfully.
[1499790025] Nagios 4.1.1 starting... (PID=4497)
[1499790025] Local time is Tue Jul 11 11:20:25 CDT 2017

I suspect these defunct processes get created when it tries to restart and the restart fails. Please let me know.
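
To track whether the defunct entries keep growing, a small filter like the one below can count zombie children of the main Nagios process. It is a sketch only: count_zombies is a hypothetical helper, and it assumes a ps that supports the pid, ppid and stat output columns (as Linux procps does).

```shell
# Count defunct (zombie) children of a given parent PID.
# Reads "PID PPID STAT" triples on stdin and prints the number of
# lines whose PPID matches $1 and whose STAT starts with "Z".
count_zombies() {
    awk -v parent="$1" '$2 == parent && $3 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Usage, with the parent PID from the listing above (4497):
# ps -eo pid=,ppid=,stat= | count_zombies 4497
```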