Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
I have a service monitor that gets stuck in CRITICAL. It's a simple check_tcp. The service is rather unstable and at times goes in and out of flapping. Eventually it may end up in CRITICAL and never come out.
Running the check from the console completes just fine:
$ ./check_tcp -H 70.57.237.99 -e "# javAPRSSrvr" -p 14580
TCP OK - 0.248 second response time on port 14580 [# javAPRSSrvr 3.15b08]|time=0.247537s;;;0.000000;10.000000
Current Status: CRITICAL (for 16d 1h 59m 26s)
Status Information: No data received from host
Performance Data:
Current Attempt: 3/3 (HARD state)
Last Check Time: 03-25-2013 23:00:04
Check Type: ACTIVE
Check Latency / Duration: 0.098/0.581 seconds
Next Scheduled Check: 03-25-2013 23:05:04
Last State Change: 03-09-2013 20:01:04
Last Notification: 03-09-2013 20:05:13 (notification 1)
Is This Service Flapping? NO (0.00% state change)
In Scheduled Downtime? NO
Last Update: 03-25-2013 23:00:27 ( 0d 0h 0m 3s ago)
It looks like it's checking, and I believe that it really is, but the end result is wrong. Restarting the service does not clear it up; I have to restart the server.
How can it get stuck in such a state? Is there a way to monitor the checks and verify the results it's getting? Is there an easier way to get it to start seeing correct results from the active checks, short of restarting the server?
define service{
name service-aprs-14580
check_command check_aprs!-p 14580
register 0
}
This is used as a service definition on 3 instances, on 5 servers, and has not been modified since about 2008. It only ever happens on the one service that bounces a lot, so it seems like it may be a timing issue if you aggravate it enough. A few months ago I used a network trace to verify that it was running the check and that it was succeeding. For whatever reason it seems like Nagios never sees the result.
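One way to see what Nagios actually receives from a check is to put a small logging wrapper in front of the plugin. The following is only a sketch: the function name log_check, the CHECK_LOG variable, and the default log path are made up for illustration, and the log file has to be writable by whatever user runs the checks.

```shell
#!/bin/sh
# Hypothetical wrapper: record what a plugin returns when the scheduler runs it.
# The log path is an assumption; pick any file the checking user can write to.
LOG="${CHECK_LOG:-/tmp/check_wrapper.log}"

log_check() {
    # Run the plugin exactly as passed in, capturing stdout and stderr.
    output=$("$@" 2>&1)
    status=$?
    # Record timestamp, effective user, exit code, command line, and output.
    printf '%s user=%s rc=%s cmd=%s out=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$(id -un)" "$status" "$*" "$output" >> "$LOG"
    # Hand the plugin's output and exit code back unchanged.
    printf '%s\n' "$output"
    return "$status"
}

# When the script is invoked with arguments, act as a drop-in wrapper.
if [ "$#" -gt 0 ]; then
    log_check "$@"
    exit $?
fi
```

Saved as a script and placed in front of the real plugin in the command definition, this would leave one log line per check showing the exit code and output the scheduler was handed, which can then be compared with a run from the console.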
$ ./check_tcp -H 70.57.237.99 -e "# javAPRSSrvr" -p 14580
TCP OK - 0.220 second response time on port 14580 [# javAPRSSrvr 3.15b08]|time=0.219601s;;;0.000000;10.000000
Current Status: CRITICAL (for 0d 1h 11m 49s)
Status Information: Connection refused
Performance Data:
Current Attempt: 3/3 (HARD state)
Last Check Time: 03-27-2013 11:05:36
Check Type: ACTIVE
Check Latency / Duration: 0.061 / 0.286 seconds
Next Scheduled Check: 03-27-2013 11:10:36
Last State Change: 03-27-2013 09:56:36
Last Notification: 03-27-2013 10:00:41 (notification 1)
Is This Service Flapping? NO (4.54% state change)
In Scheduled Downtime? NO
Last Update: 03-27-2013 11:08:21 ( 0d 0h 0m 4s ago)
-t, --timeout=INTEGER
Seconds before connection times out (default: 10)
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Alright, this is bizarre. I've run a network trace and the Nagios probes really are failing. The probes from the CLI are not. The SYN packets are identical, apart from source port, checksum, and timestamp.
I'm running a deeper trace to see if there is some signature that might be tripping over an IDS.
Are you running some type of IDS on the Nagios server? If so, what type? When you run the command from the CLI, are you doing it as user root, apache, or nagios?
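To rule out a difference between users, the same check can be run under the account the daemon uses. This is a rough sketch that assumes sudo is installed and that the daemon runs as user nagios; the helper names state_name and run_check_as are invented for this example. The exit codes 0 to 3 are the standard plugin return codes.

```shell
#!/bin/sh
# Map a plugin exit code to the state name Nagios associates with it.
state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo OUT_OF_RANGE ;;   # anything else is not a valid plugin code
    esac
}

# Run a command as another user and report its exit code.
# Assumes sudo is available and permits switching to that user.
run_check_as() {
    user=$1
    shift
    sudo -u "$user" "$@"
    rc=$?
    echo "exit code $rc ($(state_name "$rc"))"
    return "$rc"
}

# Example (run from the plugin directory):
# run_check_as nagios ./check_tcp -H 70.57.237.99 -e "# javAPRSSrvr" -p 14580
```

If the result differs between root and nagios, the cause is more likely on the local machine (permissions, environment, per-user firewall rules) than on the network path.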