Stuck in critical
Posted: Mon Mar 25, 2013 10:17 pm
I have a service monitor that gets stuck in critical. It's a simple check_tcp. The service is rather unstable and at times goes in and out of flapping. Eventually it may end up in critical and never come out.
Running the check from the console completes just fine:
Code:
$ ./check_tcp -H 70.57.237.99 -e "# javAPRSSrvr" -p 14580
TCP OK - 0.248 second response time on port 14580 [# javAPRSSrvr 3.15b08]|time=0.247537s;;;0.000000;10.000000
But Nagios reports:
Code:
Current Status: CRITICAL (for 16d 1h 59m 26s)
Status Information: No data received from host
Performance Data:
Current Attempt: 3/3 (HARD state)
Last Check Time: 03-25-2013 23:00:04
Check Type: ACTIVE
Check Latency / Duration: 0.098/0.581 seconds
Next Scheduled Check: 03-25-2013 23:05:04
Last State Change: 03-09-2013 20:01:04
Last Notification: 03-09-2013 20:05:13 (notification 1)
Is This Service Flapping? NO (0.00% state change)
In Scheduled Downtime? NO
Last Update: 03-25-2013 23:00:27 ( 0d 0h 0m 3s ago)
It looks like it's checking, and I believe that it really is, but the end result is wrong. Restarting the service does not clear it up; I have to restart the server.
How can it get stuck in such a state? Is there a way to monitor the checks and verify the results it's getting? Is there an easier way to get it to start seeing correct results from the active checks, short of restarting the server?
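To cross-check what a banner check like this sees, independent of Nagios, here is a minimal Python sketch. It assumes the check amounts to "connect, read the first bytes, look for the expected string"; that is my reading of the -e option, not a description of how check_tcp is actually implemented. probe_banner is a hypothetical helper name, not part of Nagios or its plugins, and it does a single read, so a banner that arrives in several pieces could be missed.

```python
import socket


def probe_banner(host, port, expect, timeout=10.0):
    """Connect, read up to 1 KiB once, and report a Nagios-like state.

    Hypothetical helper for manual cross-checking; not part of Nagios.
    Returns a (state, detail) tuple.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = sock.recv(1024)  # single read; enough for a short banner
    except OSError as exc:
        # Covers refused connections, unreachable hosts, and timeouts.
        return "CRITICAL", f"connection problem: {exc}"

    if not data:
        # The peer accepted the connection but closed it without sending.
        return "CRITICAL", "no data received from host"

    text = data.decode("latin-1").strip()
    if expect in text:
        return "OK", text
    return "CRITICAL", f"unexpected response: {text!r}"
```

Calling probe_banner("70.57.237.99", 14580, "# javAPRSSrvr") from the monitoring server, ideally as the same user Nagios runs under, and comparing the result with the status page would show whether the two disagree at the network level or only inside Nagios.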