Stuck in critical

Post by tlum »

I have a service monitor that gets stuck in critical. It's a simple check_tcp. The service is rather unstable and at times goes in and out of flapping. Eventually it may end up in critical and never come out.

Running the check from the console completes just fine:

$ ./check_tcp -H 70.57.237.99 -e "# javAPRSSrvr" -p 14580
TCP OK - 0.248 second response time on port 14580 [# javAPRSSrvr 3.15b08]|time=0.247537s;;;0.000000;10.000000

But Nagios reports:

Current Status: CRITICAL (for 16d  1h 59m 26s)
Status Information: No data received from host
Performance Data:
Current Attempt: 3/3 (HARD state)
Last Check Time: 03-25-2013 23:00:04
Check Type: ACTIVE
Check Latency / Duration: 0.098/0.581 seconds
Next Scheduled Check: 03-25-2013 23:05:04
Last State Change: 03-09-2013 20:01:04
Last Notification: 03-09-2013 20:05:13 (notification 1)
Is This Service Flapping? NO (0.00% state change)
In Scheduled Downtime? NO
Last Update: 03-25-2013 23:00:27 ( 0d  0h  0m  3s ago)

It looks like it's checking, and I believe that it really is, but the end result is wrong. Restarting the service does not clear it up; I have to restart the server.

How can it get stuck in such a state? Is there a way to monitor the checks and verify the results it's getting? Is there an easier way to get it to start seeing correct results from the active checks, short of restarting the server?

Post by scottwilkerson »

Nagios is not getting the same return that you get from the command line. Can you post your command and the service command definition?

Post by tlum »

I built the CLI example by cutting and pasting from the command definition, so I'm pretty sure it's the same.

define command{
  command_name check_aprs
  command_line $USER1$/check_tcp -H $HOSTADDRESS$ -e "# javAPRSSrvr" $ARG1$
  }

define service{
  name          service-aprs-14580
  check_command check_aprs!-p 14580
  register      0
  }

This is used as a service definition on 3 instances, on 5 servers, and has not been modified since about 2008. It only ever happens on the one service that bounces a lot, so it seems like it may be a timing issue that shows up if you aggravate it enough. A few months ago I used a network trace to verify that it was running the check and that it was succeeding. For whatever reason it seems like Nagios never sees the result.
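
As a rough illustration of why the console command and the check_aprs definition should be equivalent, the macro substitution can be mimicked by hand. This is only a sketch: `expand_check_aprs` is a hypothetical helper, not part of Nagios, and the plugin directory used for `$USER1$` and the address used for `$HOSTADDRESS$` are assumptions (the thread does not show resource.cfg or the host definition).

```shell
#!/bin/sh
# Hypothetical helper: fill in the three macros used by the check_aprs
# command definition. It only imitates the substitution for illustration.
expand_check_aprs() {
    user1=$1        # value of $USER1$ (plugin directory, assumed)
    hostaddress=$2  # value of $HOSTADDRESS$ (assumed to match the CLI test)
    arg1=$3         # value of $ARG1$ (the text after "!" in check_command)
    printf '%s/check_tcp -H %s -e "# javAPRSSrvr" %s\n' \
        "$user1" "$hostaddress" "$arg1"
}

# Print the command line that results from "check_aprs!-p 14580".
expand_check_aprs /usr/local/nagios/libexec 70.57.237.99 '-p 14580'
```

If the printed line differs from what was typed at the console (for example, a different address for the host), the two checks are not actually the same test.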

Post by scottwilkerson »

Your service definition doesn't have a host_name or hostgroup_name directive. How does it know what host it is checking?

Post by tlum »

It's not supposed to know what host; that's a template. Notice the register 0.

It's applied in the host files like this:

define service{
  host_name             cwop.fuller.net
  use                   service-aprs-14580
}

Post by tlum »

It's stuck again:

$ ./check_tcp -H 70.57.237.99 -e "# javAPRSSrvr" -p 14580
TCP OK - 0.220 second response time on port 14580 [# javAPRSSrvr 3.15b08]|time=0.219601s;;;0.000000;10.000000

Current Status:	CRITICAL (for 0d 1h 11m 49s)
Status Information: Connection refused
Performance Data:	
Current Attempt: 3/3  (HARD state)
Last Check Time: 03-27-2013 11:05:36
Check Type: ACTIVE
Check Latency / Duration: 0.061 / 0.286 seconds
Next Scheduled Check: 03-27-2013 11:10:36
Last State Change: 03-27-2013 09:56:36
Last Notification: 03-27-2013 10:00:41 (notification 1)
Is This Service Flapping? NO  (4.54% state change)
In Scheduled Downtime? NO  
Last Update: 03-27-2013 11:08:21  ( 0d 0h 0m 4s ago)

Post by abrist »

Have you tried increasing the timeout?

 -t, --timeout=INTEGER
    Seconds before connection times out (default: 10)

Post by tlum »

Alright, this is bizarre. I've run a network trace and the Nagios probes really are failing. The probes from the CLI are not. The SYN packets are identical, aside from source port, checksum, and timestamp.

I'm running a deeper trace to see if there is some signature that might be tripping over an IDS.

Post by abrist »

Are you running some type of IDS on the Nagios server? If so, what type? When you run the command from the CLI, are you doing it as user root, apache, or nagios?
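
One way to compare users is to run the plugin through a small wrapper that shows the exit code, since Nagios derives the service state from that code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The following is a minimal sketch; `nagios_state` and `run_check` are hypothetical helper names, and the last line uses a stand-in command so the script runs without any plugin installed.

```shell
#!/bin/sh
# Map a plugin exit code to the state name Nagios assigns to it.
nagios_state() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo OUT_OF_RANGE ;;   # anything outside 0-3
    esac
}

# Run any command, print its output, then print the exit code and state.
run_check() {
    output=$("$@" 2>&1)
    code=$?
    printf '%s\n' "$output"
    printf 'exit=%s state=%s\n' "$code" "$(nagios_state "$code")"
}

# Stand-in for a failing plugin; replace with the real check on the server.
run_check sh -c 'echo "TCP CRITICAL - simulated"; exit 2'
```

On the monitoring server the same wrapper could be pointed at the real plugin under the account the daemon uses, for example `run_check sudo -u nagios /usr/local/nagios/libexec/check_tcp -H 70.57.237.99 -e "# javAPRSSrvr" -p 14580`. The plugin path and the `nagios` account name are assumptions about this installation.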

Post by tlum »

The rejection is coming from the remote endpoint. I have a call in to see what's going on over there. I'll update once this gets sorted out.
Last edited by tlum on Wed Mar 27, 2013 5:03 pm, edited 1 time in total.