NSCA and Distributed Nagios

Post by **petronagios** » Tue Sep 11, 2012 6:04 am

Hi, I’m having problems setting up a distributed monitoring environment. Can you help?

Service checks run on the distributed server and are forwarded to the master, but the service check isn't updated on the master. If I look in /var/log/messages on the master, I can see the following:

Sep 11 09:40:48 ablxpn02 xinetd[1552]: START: nsca pid=12214 from=10.4.24.227
Sep 11 09:40:48 ablxpn02 nsca[12214]: Handling the connection...
Sep 11 09:40:49 ablxpn02 nsca[12214]: End of connection...
Sep 11 09:40:49 ablxpn02 xinetd[1552]: EXIT: nsca status=0 pid=12214 duration=1(sec)

Distributed server config
enable_notifications=0
obsess_over_services=1
ocsp_command=submit_check_result
nsca is running under xinetd

define service{
name generic-service
active_checks_enabled 1
passive_checks_enabled 1
parallelize_check 1
obsess_over_service 1
check_freshness 0
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
failure_prediction_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 2
retry_check_interval 2
contact_groups admins
notification_options w,u,c,r
notification_interval 60
notification_period 24x7
register 0
}

# Local service definition template - This is NOT a real service, just a template!

define service{
name local-service
use generic-service
max_check_attempts 4
normal_check_interval 5
retry_check_interval 1
register 0
}

define service{
use local-service
host_name tvpl0682
service_description Root Partition
check_command check_local_disk!20%!10%!/
}

Master server Config
execute_service_checks=1
check_external_commands=1
accept_passive_service_checks=1
nsca is running under xinetd

# Define a passive check template
define service{
#use generic-service
name passive_service
active_checks_enabled 0
passive_checks_enabled 1
parallelize_check 1
flap_detection_enabled 0
register 0
is_volatile 0
check_period 24x7
max_check_attempts 1
normal_check_interval 1
retry_check_interval 1
check_freshness 0
contact_groups admins
check_command check_dummy!0
notification_interval 45
notification_period 24x7
notification_options w,u,c,r
stalking_options w,c,u
process_perf_data 1
}

define service{
use passive_service
host_name tvpl0682
service_description Root Partition
active_checks_enabled 0
check_command check_dummy!0
}

Many thanks
Steve.

Post by **mguthrie** » Tue Sep 11, 2012 9:24 am

Try turning on:

log_external_commands=1

in the main nagios.cfg. That way you can just tail the nagios.log file and you should see the reason why it's failing.
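
To make the relevant entries easier to spot while tailing the log, something like the following could help. It is only a sketch: the function name filter_passive_entries is made up for illustration, and the second sample line is a placeholder, not real Nagios output.

```shell
#!/bin/sh
# Keep only external-command and passive-check entries from a Nagios log
# read on standard input. The function name is hypothetical.
filter_passive_entries() {
    grep -E 'EXTERNAL COMMAND:|PASSIVE (HOST|SERVICE) CHECK:'
}

# Demo: the first line is quoted from later in this thread,
# the second is a placeholder that should be filtered out.
printf '%s\n' \
    '[1347531830] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;tvpl0682;0;DISK CRITICAL - free space: / 953 MB (6% inode=92%):' \
    '[1347531831] placeholder line that is not a passive check entry' \
    | filter_passive_entries
```

On a live system the function would be fed from the log, for example `tail -f /usr/local/nagios/var/nagios.log | filter_passive_entries`. That path is an assumption based on a default source install; use whatever log_file is set to in nagios.cfg.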

Post by **petronagios** » Thu Sep 13, 2012 7:30 am

Hi, thanks for your reply. I can now see what's happening, but I'm not sure how to fix it!

The following is being sent from the distributed host via send_nsca

tvpl0682 Root Partition DISK CRITICAL - free space: / 953 MB (6% inode=92%):

And the following two lines appear in the Nagios log on the Master

[1347531830] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;tvpl0682;0;DISK CRITICAL - free space: / 953 MB (6% inode=92%):
[1347531839] PASSIVE HOST CHECK: tvpl0682;0;DISK CRITICAL - free space: / 953 MB (6% inode=92%):

The problem is that the check is being submitted as a PROCESS_HOST_CHECK_RESULT, so it doesn't update the passive service check on the master. What do I change to make it submit a PROCESS_SERVICE_CHECK_RESULT?

My event handler is just a cut and paste from the Nagios website:

cat submit_check_result
#!/bin/sh

# Arguments:
# $1 = host_name (Short name of host that the service is
# associated with)
# $2 = svc_description (Description of the service)
# $3 = state_string (A string representing the status of
# the given service - "OK", "WARNING", "CRITICAL"
# or "UNKNOWN")
# $4 = plugin_output (A text string that should be used
# as the plugin output for the service checks)
#

# Convert the state string to the corresponding return code
Return_Code=-1

case "$3" in
OK)
Return_Code=0
;;
WARNING)
Return_Code=1
;;
CRITICAL)
Return_Code=2
;;
UNKNOWN)
Return_Code=-1
;;
esac

# pipe the service check info into the send_nsca program, which
# in turn transmits the data to the nsca daemon on the central
# monitoring server

/usr/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$Return_Code" "$4" | /usr/local/nagios/bin/send_nsca sblxppns01 -c /usr/local/nagios/etc/send_nsca.cfg >> /tmp/output

eventhandlers]#

Many thanks
Steve
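
For reference, the state-to-return-code step of the script above can be isolated as a small function. This is a sketch, not the script from the Nagios documentation: the function name state_to_code is hypothetical, and it deliberately differs from the posted script by mapping UNKNOWN (and, as an assumption, any unrecognised string) to 3, which is the usual plugin return code for UNKNOWN, rather than -1.

```shell
#!/bin/sh
# Map a Nagios state string to the plugin return-code convention:
# OK=0, WARNING=1, CRITICAL=2, UNKNOWN=3.
# Assumption: any unrecognised string is also reported as 3.
state_to_code() {
    case "$1" in
        OK)       echo 0 ;;
        WARNING)  echo 1 ;;
        CRITICAL) echo 2 ;;
        *)        echo 3 ;;
    esac
}

state_to_code CRITICAL   # prints 2
state_to_code UNKNOWN    # prints 3
```

When a script like this is called, arguments that contain spaces (such as 'Root Partition') need to be quoted so that each one arrives as a single positional parameter.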

Post by **mguthrie** » Thu Sep 13, 2012 10:35 am

Here's the format for a passive service result:
[<timestamp>] PROCESS_SERVICE_CHECK_RESULT;<host_name>;<svc_description>;<return_code>;<plugin_output>

Passive host result:
[<timestamp>] PROCESS_HOST_CHECK_RESULT;<host_name>;<host_status>;<plugin_output>

(Pulled from the following Core doc)
http://nagios.sourceforge.net/docs/3_0/ ... hecks.html

You need both the host name, and service description for the service result. I'm guessing that's the issue.
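
On the sending side, send_nsca reads tab-separated lines on standard input: four fields for a service result and three for a host result. The sketch below builds both kinds of line and counts the fields, which is a quick way to check what a script is really sending. The helper names are hypothetical, and the host-line output text is a placeholder.

```shell
#!/bin/sh
# Passive SERVICE result: host, service description, return code, plugin output.
nsca_service_line() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Passive HOST result: host, return code, plugin output (no service description).
nsca_host_line() {
    printf '%s\t%s\t%s\n' "$1" "$2" "$3"
}

# Print the number of tab-separated fields on each input line.
count_fields() {
    awk -F '\t' '{ print NF }'
}

nsca_service_line tvpl0682 'Root Partition' 2 \
    'DISK CRITICAL - free space: / 953 MB (6% inode=92%):' | count_fields   # prints 4
nsca_host_line tvpl0682 0 'placeholder output' | count_fields               # prints 3
```

As far as I understand it, a line that reaches send_nsca with only three fields is treated as a host result, so a missing or merged field is worth ruling out when a service result shows up as PROCESS_HOST_CHECK_RESULT.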

Post by **petronagios** » Fri Sep 14, 2012 8:38 am

Great, thanks for your reply. I re-created the event handler on the distributed server and it now submits the remote command as PROCESS_SERVICE_CHECK_RESULT, which is great! But as soon as the service check turns red/critical in the Nagios GUI, it goes back to green/OK, even though the file system is still full. See the log file entries below from the master server.

[1347628821] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;tvpl0682;Root Partition;2;DISK CRITICAL - free space: / 949 MB (6% inode=92%):
[1347628827] PASSIVE SERVICE CHECK: tvpl0682;Root Partition;2;DISK CRITICAL - free space: / 949 MB (6% inode=92%):
[1347628827] SERVICE ALERT: tvpl0682;Root Partition;CRITICAL;HARD;1;DISK CRITICAL - free space: / 949 MB (6% inode=92%):
[1347628827] SERVICE ALERT: tvpl0682;Root Partition;OK;HARD;1;OK

I don’t know where the HARD OK is coming from as the file system is still full!

My passive service check template/service definition is

define service{
name passive-check
use generic-service,srv-pnp
max_check_attempts 1
is_volatile 1
normal_check_interval 2
active_checks_enabled 0
passive_checks_enabled 1
retry_check_interval 1
flap_detection_enabled 0
check_period 24x7
notification_interval 0
notification_period workhours
notification_options w,u,c,r
register 0
}

define service{
use passive-check
host_name tvpl0682
service_description Root Partition
active_checks_enabled 0
check_command check_dummy!0
}

How do I get the service to stay critical until a HARD OK is sent from the distributed server?

Thanks
Steve.

Post by **mguthrie** » Fri Sep 14, 2012 9:34 am

That seems a little bit odd. I'm noticing on the log output that you posted that there's no plugin output with that either. Is it possible that the sending script has a bug and could be returning false for the return code? "false" will evaluate to 0, and show up as OK for the service status.

Post by **petronagios** » Tue Sep 18, 2012 9:35 am

Thanks for your help, it's all working now. The check didn't stay critical because I had defined the passive check_command as follows:

define service{
use passive-service ; Name of service template to use
host_name tvpl0682
service_description Root Partition
active_checks_enabled 0
check_command check_dummy!0
}

instead of check_dummy!$ARG1$

Many thanks for your help
Steve
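
The stray "OK;HARD;1;OK" alert in the earlier log is consistent with a dummy check being run with a fixed state of 0, since its plugin output is just "OK". The sketch below imitates that behaviour to show why a hard-coded 0 can only ever report OK. The function dummy_check is a hypothetical stand-in written for illustration, not the check_dummy plugin's source, and its message for unsupported states is invented.

```shell
#!/bin/sh
# Hypothetical stand-in for a dummy check: print the state name for the
# numeric state in $1 (plus optional text in $2) and return that state.
dummy_check() {
    case "$1" in
        0) name=OK ;;
        1) name=WARNING ;;
        2) name=CRITICAL ;;
        3) name=UNKNOWN ;;
        *) echo "unsupported state: $1"; return 3 ;;
    esac
    if [ -n "$2" ]; then
        echo "$name: $2"
    else
        echo "$name"
    fi
    return "$1"
}

dummy_check 0                                          # prints OK, returns 0
dummy_check 2 'disk full' || echo "exit status: $?"    # prints CRITICAL: disk full, then exit status: 2
```

With the state passed in as an argument instead of a literal 0, the result follows whatever value is supplied.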

Post by **mguthrie** » Tue Sep 18, 2012 9:43 am

Good deal, glad it's working for you!

Nagios Support Forum
