Stuck in critical

tlum · Post by **tlum** » Mon Mar 25, 2013 10:17 pm

I have a service monitor that gets stuck in critical. It's a simple check_tcp. The service is rather unstable and at times goes in and out of flapping. Eventually it may end up in critical and never come out.

Running the check from the console completes just fine:

Code: Select all

$ ./check_tcp -H 70.57.237.99 -e "# javAPRSSrvr" -p 14580
TCP OK - 0.248 second response time on port 14580 [# javAPRSSrvr 3.15b08]|time=0.247537s;;;0.000000;10.000000

But Nagios reports:

Code: Select all

Current Status: CRITICAL (for 16d  1h 59m 26s)
Status Information: No data received from host
Performance Data:
Current Attempt: 3/3 (HARD state)
Last Check Time: 03-25-2013 23:00:04
Check Type: ACTIVE
Check Latency / Duration: 0.098/0.581 seconds
Next Scheduled Check: 03-25-2013 23:05:04
Last State Change: 03-09-2013 20:01:04
Last Notification: 03-09-2013 20:05:13 (notification 1)
Is This Service Flapping? NO (0.00% state change)
In Scheduled Downtime? NO
Last Update: 03-25-2013 23:00:27 ( 0d  0h  0m  3s ago)

It looks like it's checking, and I believe that it really is, but the end result is wrong. Restarting the service does not clear it up; I have to restart the server.

How can it get stuck in such a state? Is there a way to monitor the checks and verify the results it's getting? Is there an easier way to get it to start seeing correct results from the active checks, short of restarting the server?

scottwilkerson · Post by **scottwilkerson** » Tue Mar 26, 2013 7:37 am

Nagios is not getting the same return you are from the command line. Can you post your command and the service command definition?

tlum · Post by **tlum** » Tue Mar 26, 2013 6:34 pm

I built the CLI example by cutting and pasting from the command definition, so I'm pretty sure it's the same.

Code: Select all

define command{
  command_name check_aprs
  command_line $USER1$/check_tcp -H $HOSTADDRESS$ -e "# javAPRSSrvr" $ARG1$
  }

Code: Select all

define service{
  name          service-aprs-14580
  check_command check_aprs!-p 14580
  register      0
  }

This is used as a service definition on 3 instances, on 5 servers, and has not been modified since about 2008. It only ever happens on the one service that bounces a lot, so it seems like it may be a timing issue if you aggravate it enough. A few months ago I used a network trace to verify that it was running the check and that it was succeeding. For whatever reason, it seems like Nagios never sees the result.

scottwilkerson · Post by **scottwilkerson** » Wed Mar 27, 2013 7:31 am

Your service definition doesn't have a host_name or hostgroup_name directive. How does it know what host it is checking?

tlum · Post by **tlum** » Wed Mar 27, 2013 8:50 am

It's not supposed to know what host; that's a template. Notice the register 0.

It's applied in the host files like this:

Code: Select all

define service{
  host_name             cwop.fuller.net
  use                   service-aprs-14580
}
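
For comparison with the console test, here is a rough sketch of how the definitions above would expand into the command Nagios runs. `expand_check_aprs` is a hypothetical helper written only for illustration; the `$USER1$` path and the host address for cwop.fuller.net are assumptions, so substitute the values from your own resource.cfg and host definition.

```shell
# Hypothetical helper: mimic how Nagios fills in the macros of the
# check_aprs command_line. This is an illustration, not Nagios code.
expand_check_aprs() {
  user1=$1        # value of $USER1$ (assumed plugin directory)
  hostaddress=$2  # value of $HOSTADDRESS$ (assumed address of the host)
  arg1=$3         # text after the "!" in check_command, i.e. $ARG1$
  printf '%s/check_tcp -H %s -e "# javAPRSSrvr" %s\n' \
    "$user1" "$hostaddress" "$arg1"
}

# Assumed values; only "-p 14580" comes from the service template.
expand_check_aprs /usr/local/nagios/libexec 70.57.237.99 "-p 14580"
```

If the printed line differs from what was typed at the console, the two tests are not equivalent.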

tlum · Post by **tlum** » Wed Mar 27, 2013 12:37 pm

It's stuck again.

Code: Select all

$ ./check_tcp -H 70.57.237.99 -e "# javAPRSSrvr" -p 14580
TCP OK - 0.220 second response time on port 14580 [# javAPRSSrvr 3.15b08]|time=0.219601s;;;0.000000;10.000000

Code: Select all

Current Status: CRITICAL (for 0d 1h 11m 49s)
Status Information: Connection refused
Performance Data:
Current Attempt: 3/3 (HARD state)
Last Check Time: 03-27-2013 11:05:36
Check Type: ACTIVE
Check Latency / Duration: 0.061 / 0.286 seconds
Next Scheduled Check: 03-27-2013 11:10:36
Last State Change: 03-27-2013 09:56:36
Last Notification: 03-27-2013 10:00:41 (notification 1)
Is This Service Flapping? NO (4.54% state change)
In Scheduled Downtime? NO
Last Update: 03-27-2013 11:08:21 (0d 0h 0m 4s ago)

abrist · Post by **abrist** » Wed Mar 27, 2013 12:46 pm

Have you tried increasing the timeout?

Code: Select all

 -t, --timeout=INTEGER
    Seconds before connection times out (default: 10)
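
A quick way to try this from the console is to rerun the same check with a few timeout values. This is only a sketch: `try_timeouts` is a hypothetical helper, not part of the plugins, and it relies on nothing beyond the `-t` option shown above.

```shell
# Hypothetical helper: run the given check once per timeout value,
# appending "-t N" each time and printing the plugin's exit code.
try_timeouts() {
  timeouts=$1   # space-separated list of seconds, e.g. "10 30 60"
  shift         # the remaining arguments are the check command itself
  for t in $timeouts; do
    printf 't=%s: ' "$t"
    "$@" -t "$t"
    echo "exit=$?"
  done
}

# Example, run from the plugin directory:
# try_timeouts "10 30 60" ./check_tcp -H 70.57.237.99 -e "# javAPRSSrvr" -p 14580
```

To make a longer timeout permanent, the same option could be appended to the service's argument, for example `check_aprs!-p 14580 -t 30`.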

tlum · Post by **tlum** » Wed Mar 27, 2013 12:53 pm

Alright, this is bizarre. I've run a network trace and the Nagios probes are really failing. The probes from the CLI are not. The SYN packets are identical, aside from source port, checksum, and timestamp.

I'm running a deeper trace to see if there is some signature that might be tripping an IDS.

abrist · Post by **abrist** » Wed Mar 27, 2013 2:14 pm

Are you running some type of IDS on the Nagios server? If so, what type? When you run the command from the CLI, are you doing it as user root, apache, or nagios?
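
To compare results between accounts, and to keep a record of what each run returns, something like the following may help. It is a minimal sketch: `state_name` and `log_check` are hypothetical helpers, and the only thing assumed about the plugin is the standard exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
# Map a plugin exit code to the corresponding Nagios state name.
state_name() {
  case "$1" in
    0) echo OK ;;
    1) echo WARNING ;;
    2) echo CRITICAL ;;
    3) echo UNKNOWN ;;
    *) echo "INVALID($1)" ;;
  esac
}

# Run a check command and print one timestamped line with its state
# and output, so repeated runs can be compared against the web UI.
log_check() {
  output=$("$@" 2>&1)
  rc=$?   # exit status of the command substitution above
  printf '%s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(state_name "$rc")" "$output"
}

# Example, run from the plugin directory:
# log_check ./check_tcp -H 70.57.237.99 -e "# javAPRSSrvr" -p 14580
# To test as the nagios user (assumes sudo is available):
# sudo -u nagios ./check_tcp -H 70.57.237.99 -e "# javAPRSSrvr" -p 14580; echo $?
```

Running it in a loop, or from cron, under each account would show whether the failures follow the user or the time of day.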

tlum · Post by **tlum** » Wed Mar 27, 2013 2:54 pm

The rejection is coming from the remote endpoint. I have a call in to see what's going on over there. I'll update once this gets sorted out.

Nagios Support Forum