how to reset the state and attempt count
-
- Posts: 9
- Joined: Mon Jul 10, 2017 7:31 am
how to reset the state and attempt count
removed
Last edited by jenithangel on Tue Jul 11, 2017 4:34 am, edited 2 times in total.
Re: how to reset the state and attempt count
jenithangel wrote: How to change the state to OK again after it reached CRITICAL HARD.
Typically the Nagios plugin associated with your check is supposed to take care of this. Plugins return 2 for CRITICAL and 0 for OK, so once the monitored service has recovered, your plugin should return 0 and your Nagios check's state should subsequently revert to OK.
Can you share the service definition used for this check, the associated command definition used in the check_command parameter of the service, and any scripts/plugins associated with the command?
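To make the return-code convention above concrete, here is a minimal sketch of a plugin-style check in shell. The function name check_pid_alive and the idea of checking a PID are hypothetical and not taken from this thread; only the numeric codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard plugin convention.

```shell
#!/bin/sh
# Standard Nagios plugin return codes.
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# check_pid_alive PID  (hypothetical helper, for illustration only)
# Prints one line of plugin output and returns the matching code.
check_pid_alive() {
    pid="$1"
    if [ -z "$pid" ]; then
        echo "UNKNOWN - no PID given"
        return "$STATE_UNKNOWN"
    fi
    # kill -0 sends no signal; it only tests whether the process exists.
    if kill -0 "$pid" 2>/dev/null; then
        echo "OK - process $pid is running"
        return "$STATE_OK"
    fi
    echo "CRITICAL - process $pid is not running"
    return "$STATE_CRITICAL"
}
```

A real plugin would end with `exit` using the same code, and Nagios would derive the check state from that exit status on every run.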
Former Nagios employee
https://www.mcapra.com/
Re: how to reset the state and attempt count
command.cfg
-----------------
# 'check_http' command definition
define command {
    command_name    check_http
    command_line    $USER1$/check_http -I $HOSTADDRESS$ $ARG1$
}
# 'check-host-alive' command definition
define command {
    command_name    check-host-alive
    command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
}
define command {
    command_name    check_http_url
    command_line    $USER1$/check_http -I $ARG1$ -p $ARG2$ -u $ARG3$
}
define command {
    command_name    check_http_port
    command_line    $USER1$/check_http -I $ARG1$ -p $ARG2$
}
define command {
    command_name    recover-service
    command_line    /u/gls/Monitoring/nagios/scripts/serviceRecovery.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ $ARG5$ $ARG6$ $ARG7$
}
Last edited by jenithangel on Tue Jul 11, 2017 3:43 am, edited 2 times in total.
Re: how to reset the state and attempt count
The main problem is:
The service check reaches the CRITICAL HARD state: the current attempt reaches the maximum retry value and the state becomes 2 (CRITICAL). Even after the Java service comes back up following a network failure, no subsequent validation happens, because the service state is still HARD, even after a Nagios restart. The worker tries to run the same notification and exits.
Re: how to reset the state and attempt count
Are you restarting the Nagios process in your script if the status of the service check doesn't return to the OK state?
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: how to reset the state and attempt count
If the status of the service is CRITICAL SOFT:
-> if the service attempt has reached the maximum retry count, we try to stop and start the affected process, NOT Nagios.
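The decision logic described above can be sketched as a small shell function. This is not the poster's serviceRecovery.sh, which was not shared; decide_action and its fourth argument are hypothetical, although the value could be supplied with the $MAXSERVICEATTEMPTS$ macro. One point worth checking against your own setup: as far as I understand the Nagios state model, the check that reaches max_check_attempts is already reported as HARD, so a handler that waits for a SOFT state at the maximum attempt may never fire. The sketch therefore acts on the last SOFT retry and on HARD.

```shell
#!/bin/sh
# decide_action STATE STATETYPE ATTEMPT MAX_ATTEMPTS
# Prints "restart" when the handler should restart the application,
# otherwise "none". The first three arguments correspond to
# $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$;
# MAX_ATTEMPTS is a hypothetical extra argument.
decide_action() {
    state="$1"; state_type="$2"; attempt="$3"; max_attempts="$4"
    case "$state" in
        CRITICAL)
            case "$state_type" in
                SOFT)
                    # Act on the last soft retry, before the state turns HARD.
                    if [ "$attempt" -eq $((max_attempts - 1)) ]; then
                        echo restart
                        return 0
                    fi
                    ;;
                HARD)
                    # Retries are exhausted; try the restart here as well.
                    echo restart
                    return 0
                    ;;
            esac
            ;;
    esac
    echo none
}
```

With max_check_attempts = 4, `decide_action CRITICAL SOFT 3 4` and `decide_action CRITICAL HARD 4 4` both print `restart`, while an OK state prints `none`. Note that for a non-volatile service the handler normally runs on state changes, so the HARD branch would fire once, on the transition into HARD.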
Re: how to reset the state and attempt count
Let me explain the issue clearly:
I have a Java service, say xxx.
This service is checked every 2 minutes (normal_check_interval).
The network goes down for some time due to an outage.
After 4 tries (max_check_attempts = 4), the service reaches the CRITICAL HARD state.
From then on, a notification email is sent every 2 hours (notification_interval) until the service comes back up by itself. Is there a way to reset the state to SOFT and the attempt count back to 1, so that Nagios will try to start the application xxx even after it has reached CRITICAL HARD?
Re: how to reset the state and attempt count
jenithangel wrote: so that nagios will try to start this issue application xxx even after it reached CRITICAL HARD.
So is Nagios Core trying to restart this application using an event handler? That's a pretty critical piece of information which makes a lot more sense in regards to the use case.
You might try marking the service as "volatile" using the is_volatile option.
What's So Special About Volatile Services?
Volatile services differ from "normal" services in three important ways. Each time they are checked when they are in a hard non-OK state, and the check returns a non-OK state (i.e. no state change has occurred):
- the non-OK service state is logged
- contacts are notified about the problem (if that's what should be done). Note: Notification intervals are ignored for volatile services.
- the event handler for the service is run (if one has been defined)
So, if the service is marked as is_volatile, the event handler should execute every time the check returns a non-OK state, even if the check is currently in a HARD state.
If you still really wanted to reset the state of the service check programmatically, so that the event handler would be fired again, one solution would be to use the external command file. You'll need to be mindful of how this command is called, though, to prevent infinite looping of your event handler (I've done this!).
You may find the PROCESS_HOST_CHECK_RESULT and PROCESS_SERVICE_CHECK_RESULT commands particularly useful in terms of forcibly "resetting" a check's state to OK.
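As a rough illustration of the external-command approach, the sketch below builds a PROCESS_SERVICE_CHECK_RESULT line and appends it to the command file. The function names are hypothetical, and the command file path is an assumption: it is whatever the command_file directive in your nagios.cfg points to. The sketch also assumes external commands and passive checks are enabled for the service.

```shell
#!/bin/sh
# build_service_result HOST SERVICE RETURN_CODE OUTPUT [TIMESTAMP]
# Prints one external-command line in the documented format:
# [time] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
build_service_result() {
    # Use the given timestamp, or the current epoch time if omitted.
    ts="${5:-$(date +%s)}"
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$ts" "$1" "$2" "$3" "$4"
}

# submit_service_result COMMAND_FILE HOST SERVICE RETURN_CODE OUTPUT [TIMESTAMP]
# Appends the command line to COMMAND_FILE, which on a real installation
# is the named pipe configured by command_file in nagios.cfg.
submit_service_result() {
    cmd_file="$1"
    shift
    build_service_result "$@" >> "$cmd_file"
}

# Example call (the path is an assumption, not taken from this thread):
# submit_service_result /usr/local/nagios/var/rw/nagios.cmd \
#     localhost dev-mm-websocket-service 0 "OK - state reset by handler"
```

Submitting return code 0 this way reports the service as OK to Nagios regardless of its real condition, so if the application is still down the next active check will start counting attempts again. That is also where the looping risk mentioned above comes from: if the event handler itself submits the reset, guard it so it does not retrigger indefinitely.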
Re: how to reset the state and attempt count
Thanks mcapra for the help.
Adding the volatile setting to that check would be the easiest thing to do, without having to create a script to reset the state after it has reached the CRITICAL HARD state.
Re: how to reset the state and attempt count
Thanks tgriep and mcapra for your responses. I will try the volatile option and post a reply tomorrow.
One more issue I noticed: Nagios leaves behind too many defunct processes.
Processes started by Nagios
------
us32860s4000d0a:/u/gls/middleware/tools/nagios/var> psg nagios
gls 14718 4497 0 11:20 ? 00:00:00 [nagios] <defunct>
gls 14719 4497 0 11:20 ? 00:00:00 [nagios] <defunct>
gls 14724 4497 0 11:20 ? 00:00:00 [nagios] <defunct>
gls 14725 4497 0 11:20 ? 00:00:00 [nagios] <defunct>
gls 14980 4497 0 11:20 ? 00:00:00 /u/gls/middleware/tools/nagios/bin/nagios --worker /u/gls/middleware/tools/nagios/var/rw/nagios.qh
gls 14981 4497 0 11:20 ? 00:00:00 /u/gls/middleware/tools/nagios/bin/nagios --worker /u/gls/middleware/tools/nagios/var/rw/nagios.qh
gls 14983 4497 0 11:20 ? 00:00:00 /u/gls/middleware/tools/nagios/bin/nagios --worker /u/gls/middleware/tools/nagios/var/rw/nagios.qh
gls 14984 4497 0 11:20 ? 00:00:00 /u/gls/middleware/tools/nagios/bin/nagios --worker /u/gls/middleware/tools/nagios/var/rw/nagios.qh
gls 4497 1 0 03:39 ? 00:00:05 /u/gls/middleware/tools/nagios/bin/nagios -d /u/gls/middleware/tools/nagios/etc/nagios.cfg
gls 4499 4497 0 03:39 ? 00:00:00 [nagios] <defunct>
gls 4500 4497 0 03:39 ? 00:00:00 [nagios] <defunct>
gls 4501 4497 0 03:39 ? 00:00:00 [nagios] <defunct>
gls 4502 4497 0 03:39 ? 00:00:00 [nagios] <defunct>
gls 4520 4497 0 03:39 ? 00:00:02 /u/gls/middleware/tools/nagios/bin/nagios -d /u/gls/middleware/tools/nagios/etc/nagios.cfg
us32860s4000d0a:/u/gls/middleware/tools/nagios/var> date
nagios.log info
-----------------
[1499762354] nerd: Fully initialized and ready to rock!
[1499762354] wproc: Successfully registered manager as @wproc with query handler
[1499762354] wproc: Registry request: name=Core Worker 4501;pid=4501
[1499762354] wproc: Registry request: name=Core Worker 4500;pid=4500
[1499762354] wproc: Registry request: name=Core Worker 4499;pid=4499
[1499762354] wproc: Registry request: name=Core Worker 4502;pid=4502
[1499762354] Successfully launched command file worker with pid 4520
[1499779879] wproc: NOTIFY job 806 from worker Core Worker 4499 is a non-check helper but exited with return code 2
[1499787553] Auto-save of retention data completed successfully.
[1499787896] SERVICE NOTIFICATION: mm-websocket-support;localhost;dev-mm-websocket-service;CRITICAL;notify-service-by-email;connect to address localhost and port 23114: Connection refused
[1499790025] wproc: Socket to worker Core Worker 4499 broken, removing
[1499790025] wproc: Socket to worker Core Worker 4500 broken, removing
[1499790025] wproc: Socket to worker Core Worker 4501 broken, removing
[1499790025] wproc: Socket to worker Core Worker 4502 broken, removing
[1499790025] Caught SIGHUP, restarting...
[1499790025] Event broker module 'NERD' deinitialized successfully.
[1499790025] Nagios 4.1.1 starting... (PID=4497)
[1499790025] Local time is Tue Jul 11 11:20:25 CDT 2017
I suspect these get created when it tries to restart and fails. Please let me know.