check_nrpe plugin can't connect, Nagios reports service "OK"

Post by pato »

I just upgraded Nagios Core from 3.5.1 to 4.0.4 and noticed the following when testing notifications by disabling a remote NRPE service:

Nagios reports the service as being in "OK" state, when it should be reporting a critical error. The 4 services in OK state are checked by the check_nrpe plugin.

Running the check from the command line produces the following output:

$ ../../libexec/check_nrpe -H aumelbou-kof -c check_load
  connect to address 10.210.8.228 port 5666: Connection refused
  connect to host aumelbou-kof port 5666: Connection refused
$ echo $?
  255

I don't know why check_nrpe exits with 255 on a connection error, but Nagios 3.5.1 treated out-of-bounds exit codes (anything outside 0-3) as errors. How can I change the new behaviour, which treats them as "OK"?
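
Until the behaviour is fixed in Nagios itself, one possible stopgap is to wrap the plugin so that any exit code outside 0-3 is rewritten before Nagios sees it. The sketch below is only an illustration under stated assumptions: the function names `normalize_exit_code` and `run_plugin` are hypothetical (they are not part of Nagios or NRPE), and mapping stray codes to 2 (CRITICAL) is a choice; 3 (UNKNOWN) would be just as defensible.

```shell
#!/bin/sh
# Hypothetical helpers for a plugin wrapper; not part of Nagios or NRPE.

# Map a raw exit code onto the range Nagios plugins are expected to use:
# 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
# Assumption: anything outside that range is reported as 2 (CRITICAL).
normalize_exit_code() {
    case "$1" in
        0|1|2|3) echo "$1" ;;
        *)       echo 2 ;;
    esac
}

# Run the given plugin command line, pass its output through unchanged,
# and return the normalized exit code.
run_plugin() {
    rc=0
    "$@" || rc=$?
    return "$(normalize_exit_code "$rc")"
}
```

A wrapper script built on these helpers would end with `run_plugin "$@"` followed by `exit $?`, and would be placed in front of the check_nrpe call in the command definition.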

Post by slansing »

You are correct, they should be returned as criticals. Can you share the service configuration of one of these checks, and also the command from commands.cfg that is used by that service?

In addition, what version of the Nagios plugins are you running? When you check a host that has port 5666 open (so that you can actually run the check successfully), what is returned? Do you have multiple hosts that are out of bounds and are returning OK states instead of critical?

grep 'out of bounds' /usr/local/nagios/var/status.dat

It may also be that the error code is being read incorrectly. I'm going to look into that.

Post by pato »

Most significantly, I've found that this error only occurs with version 2.15 of the check_nrpe plugin. I've rolled back to an earlier version, v2.12, which exits with code 2 on the same connection error, and Nagios interprets this correctly.

Service config for memory check:

define service{
        use local-service,graphed-service
        host_name aumelbou-kof
        service_description Memory
        contact_groups admins
        check_command check_win_nrpe!alias_mem
}

Templates referenced:

define service{
        name                            local-service     
        use                             generic-service   
        max_check_attempts              4             
        normal_check_interval           2              
        retry_check_interval            1                
        register                        0                       
}

define service{
        name                            graphed-service
        action_url                      /nagios/cgi-bin/show.cgi?host=$HOSTNAME$&service=$SERVICEDESC$' onMouseOver='showGraphPopup(this)' onMouseOut='hideGraphPopup()' rel='/nagios/cgi-bin/showgraph.cgi?host=$HOSTNAME$&service=$SERVICEDESC$&period=week&rrdopts=-w+450+-j
        register                        0
}

define service{
        name                            generic-service  
        active_checks_enabled           1              
        passive_checks_enabled          1             
        parallelize_check               1                   
        obsess_over_service             1                
        check_freshness                 0                  
        notifications_enabled           1                 
        event_handler_enabled           1               
        flap_detection_enabled          1                
        process_perf_data               1                  
        retain_status_information       1               
        retain_nonstatus_information    1                 
        is_volatile                     0
        check_period                    24x7            
        max_check_attempts              3           
        normal_check_interval           10           
        retry_check_interval            2               
        contact_groups                  admins        
        notification_options            w,u,c,r        
        notification_interval           0      
        notification_period             24x7           
        register                        0                    
}

And the command called by the service check:

define command{
        command_name    check_win_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTNAME$ -c $ARG1$
}

slansing wrote: When you check a host that has port 5666 open (so that you can actually run the check successfully), what is returned?

$ ../../libexec/check_nrpe -H aumelbou-kof -c alias_mem
  OK: physical memory: Total: 7.97G - Used: 5.22G (65%) - Free: 2.75G (35%)|'physical memory %'=65%;100;100 'physical memory'=5.21999G;7.96699;7.96699;0;7.96699
$ echo $?
  0

slansing wrote: Do you have multiple hosts that are out of bounds and are returning OK states instead of critical?

No, the out-of-bounds error only occurs when the Nagios service cannot reach the remote NRPE service (due to intermittent network issues or when I purposely turn off NRPE).

Thanks very much for your help so far, I'm really grateful that such a helpful community exists.

Post by lmiltchev »

Let's clarify: when you use NRPE v2.12 and you get a "Connection refused" error, running:

echo $?

produces the correct output:

2

but if you use NRPE v2.15, you get:

255

Right?

Post by pato »

lmiltchev wrote: Let's clarify: when you use NRPE v2.12 and you get a "Connection refused" error, running:

echo $?

produces the correct output:

2

but if you use NRPE v2.15, you get:

255

Right?

That's correct (and sorry for the late reply on this one).

Post by pato »

Dragging this thread back up for a brief update. I've since found another scenario in which an out-of-bounds exit code is interpreted as "OK" when it really (really) shouldn't be. It happens when a command file is not executable by the nagios user:

$ whoami
nagios
$ /usr/local/nagios/libexec/check_psqsq20_waiting_tasks.pl
bash: /usr/local/nagios/libexec/check_psqsq20_waiting_tasks.pl: Permission denied
$ echo $?
126

This results in an "OK" status in Nagios.

This is bad! It went unnoticed for almost a month because of the incorrect status applied to the return code. I've since put in a check for "out of bounds" in status.dat, but ideally this would be handled correctly by Nagios.
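
A check for "out of bounds" in status.dat could look roughly like the sketch below. It is not the script actually used here: the function name `check_out_of_bounds` is hypothetical, the default path is simply the one quoted earlier in this thread, and treating any match as CRITICAL is an assumption.

```shell
#!/bin/sh
# Hypothetical plugin sketch: report CRITICAL if status.dat contains any
# "out of bounds" return-code message. Not part of Nagios itself.

check_out_of_bounds() {
    # Assumed default location; pass a different path as the first argument.
    status_file="${1:-/usr/local/nagios/var/status.dat}"

    if [ ! -r "$status_file" ]; then
        echo "UNKNOWN: cannot read $status_file"
        return 3
    fi

    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, so "|| true" keeps that from being treated as a failure.
    count=$(grep -c 'out of bounds' "$status_file" || true)

    if [ "$count" -gt 0 ]; then
        echo "CRITICAL: $count line(s) in $status_file mention 'out of bounds'"
        return 2
    fi

    echo "OK: no 'out of bounds' return codes in $status_file"
    return 0
}
```

Run as a local service check (a script that calls `check_out_of_bounds "$@"` and exits with its return code), it would raise an alert whenever a plugin exit code is being misread.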

Now on Nagios Core 4.0.5, by the way.

Post by lmiltchev »

We will be testing this and filing a bug report if we are able to recreate the issue. If you wish to post a bug report on our bug tracker on your own, you are welcome to do so. Thanks for the feedback!

Post by pato »

lmiltchev wrote: We will be testing this and filing a bug report if we are able to recreate the issue. If you wish to post a bug report on our bug tracker on your own, you are welcome to do so. Thanks for the feedback!

Thank you. I've submitted a report: http://tracker.nagios.org/view.php?id=602

Post by lmiltchev »

Thanks, pato!

Post by pato »

Fixes in Nagios Core 4.0.6 resolve this issue :)