Distributed monitoring issue - SOLVED


Post by sebastiaopburnay »

Hey everyone, I've been investigating and developing a Nagios solution for monitoring several networks and a bunch of services they support.

I'm doing this neither full time nor for more than a couple of months, so I'm still not able to give a lot of support, even though I like to spread the good word about Nagios.

Right now I'm trying to get the two remote Nagios instances to communicate, and I have configured the firewalls correctly to allow NRPE and NSCA traffic to pass from/to these Nagios instances on their ports (5666 and 5667).

When I tried to run the command to send a service check result to the central Nagios from the remote one, I got an error complaining about the send_nsca.cfg configuration file:

Code:

[root@sebas distributed-monitoring]# ./submit_check_result_via_nsca remote-monitor nsca_check_load 1 testando

Could not open config file '/usr/local/nagios/etc/send_nsca.cfg' for reading.
Error: Config file '/usr/local/nagios/etc/send_nsca.cfg' contained errors..

That error is troubling me, since the file only contains two directives:

Code:

password=xxxyyyzzz
encryption_method=1

I'm guessing that anyone who has successfully gotten two instances communicating has run into the same problem, so with that in mind I'm asking here for directions on how to fix this issue.


Thank you very much.

Re: Distributed monitoring issue

Post by mguthrie »

It's saying that the file could not be opened. You may want to double-check the directory location in the script and of the actual file, as well as the read permissions.
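
For anyone hitting the same message, a quick sanity check (assuming the default /usr/local/nagios prefix; adjust the paths to your install) is to confirm the file exists where the script expects it and that the user running the script can read it:

Code:

# does the file exist at the expected path, and with what permissions?
ls -l /usr/local/nagios/etc/send_nsca.cfg
# can the nagios user actually read it? (-s forces a shell, since the
# nagios account is often created without a login shell)
su -s /bin/sh nagios -c 'cat /usr/local/nagios/etc/send_nsca.cfg'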

Re: Distributed monitoring issue

Post by sebastiaopburnay »

You're absolutely right.

I ended up figuring it out by myself.

The thing is that now I'm getting a timeout, so I'm wondering about all the possibilities that could be causing this unsuccessful output.

Re: Distributed monitoring issue

Post by mguthrie »

Can you show the actual error output? If it timed out, then it's either a firewall issue, or the receiving server does not have your client machine added to the list of "allowed hosts." Check that your iptables rules are allowing traffic on ports 5666 and 5667, and that your client machine is an allowed host in nagios/etc/nsca.cfg; if you're running it under xinetd, you'll have to create a definition file for nsca and add your client machine to its list of allowed hosts (the only_from line). That's probably as clear as mud right now, so if you need more detail on any of this, let us know.
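
A quick way to narrow this down (the IP placeholder and tool choices here are only examples; substitute your own) is to verify that the daemon is actually listening and that the port is reachable through the firewall:

Code:

# on the central server: is anything listening on the NSCA port?
netstat -tln | grep 5667
# from the remote server: is the port reachable at all?
nc -zv <central-server-ip> 5667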

Re: Distributed monitoring issue

Post by sebastiaopburnay »

Well, I solved that issue, which is great.

Problem is that part of the solution was to put the two instances on the same subnet. I'm positive that's not what actually fixed it, though, as the fix was achieved by correcting the configurations on both machines.

What's bothering me now is this question: with the two instances at remote locations on different subnets (private addresses of one subnet make no sense inside the other subnet and vice versa), should I omit the address attribute of the remotely monitored hosts on the central server?

Thank you for replying and contributing.
Attachment: route.png (connecting diagram)

Re: Distributed monitoring issue

Post by mguthrie »

I'm assuming you're referring to the "address" attribute in the host config on the passive server?

Code:

define host {
        host_name                       <host_name>
        use                             xiwizard_genericnetdevice_host
        address                         <address>
        max_check_attempts              5
        check_interval                  5
        retry_interval                  1
        ...
}
This shouldn't really matter either way if it's receiving passive results; I'm not sure about NRPE. I don't see any harm in leaving it. Is there a potential problem that you're seeing that I'm missing?
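
For what it's worth, the Nagios documentation says that if the address directive is omitted, the host_name is used as the address, so a passive-only host can be defined without it. A minimal sketch (the host and template names here are hypothetical):

Code:

define host {
        host_name               remote-web-01   ; hypothetical host, no address directive
        use                     generic-host    ; assumes such a template exists
        max_check_attempts      5
        active_checks_enabled   0               ; results arrive passively via NSCA
        passive_checks_enabled  1
        }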

Re: Distributed monitoring issue

Post by sebastiaopburnay »

Thanks, all those problems have been solved and, unfortunately, new ones have shown up.

Now, my central nagios server does not recognize some warning and critical passive service check results.

Additionally, the central Nagios server is not getting the host checks from the majority of the remotely monitored hosts (it's only getting those for the remote monitoring server itself plus one hostgroup), and I can't seem to find a cause by analyzing the various .cfg files.

With my best regards,
sebastiaopburnay.
Attachment: trouble.png (This service check returns warning, but the web interface doesn't show it as a warning state.)

Re: Distributed monitoring issue

Post by mguthrie »

My guess is that your checks are all sending a "0" return code for everything, which Nagios reads as UP, or OK. See the following:

Host passive string:
MyPassiveMachine;0;'Host is up'

<hostname> ; <state code (0-2 for hosts, 0-3 for services)> ; <message or output>

Service passive string:
MyPassiveMachine;PROC;0;`$LIB/check_procs` -> at the moment, this is set up to always send an OK result

<hostname> ; <servicename> ; <state code> ; <message or plugin output>

NRPE is a little bit nicer in that the check results are interpreted for you.
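
To make that concrete (the host, service, and IP below are placeholders), a service result with an explicit WARNING state would be sent like this. Note that send_nsca itself takes tab-delimited input, which the NSCA daemon then turns into the semicolon-separated external command shown above:

Code:

# send a WARNING (state 1) result for service PROC on MyPassiveMachine
printf "%s\t%s\t%s\t%s\n" "MyPassiveMachine" "PROC" "1" "WARNING: too many processes" \
    | /usr/local/nagios/bin/send_nsca -H <central-server-ip> -c /usr/local/nagios/etc/send_nsca.cfg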

Re: Distributed monitoring issue

Post by ashish »

Hi,

We are getting an error while configuring central and distributed Nagios servers.

We are trying to set up distributed Nagios monitoring with one central Nagios server and two distributed Nagios servers. We have installed the NSCA daemon on the central server and added the NRPE service under xinetd. We have also set up the distributed servers with send_nsca. Our central and distributed server setups and configurations are below.

If we execute a send_nsca command manually, it works fine, like below:

Code:

root# printf "temphost\ttempservice\t0\t Ashish Singh NSCA\n" | /usr/local/nagios/bin/send_nsca -H 192.168.16.2 -c /usr/local/nagios/etc/send_nsca.cfg
1 data packet(s) sent to host successfully.

and it is logged on the central server in nagios.log as below:

Code:

[1299216743] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;temphost;tempservice;0; Ashish Singh NSCA

But we cannot understand why the Nagios logs on the central and distributed servers are giving the following errors. It looks like the distributed server is not sending the Nagios data to the central server, and the central server is trying to force checks instead. Please help us with this:


[1299216315] Warning: The results of service 'Current Load' on host '192.168.16.10' are stale by 0d 0h 0m 30s (threshold=0d 0h 6m 0s). I'm forcing an immediate check of the service.
[1299216315] Warning: The results of service 'Current Users' on host '192.168.16.10' are stale by 0d 0h 0m 30s (threshold=0d 0h 6m 0s). I'm forcing an immediate check of the service.
[1299216315] Warning: The results of service 'MySql' on host '192.168.16.10' are stale by 0d 0h 0m 30s (threshold=0d 0h 6m 0s). I'm forcing an immediate check of the service.
[1299216315] Warning: The results of service 'Total Processes' on host '192.168.16.10' are stale by 0d 0h 0m 30s (threshold=0d 0h 6m 0s). I'm forcing an immediate check of the service.
[1299216405] Warning: The results of service 'Current Load' on host '192.168.16.20' are stale by 0d 0h 0m 30s (threshold=0d 0h 6m 0s). I'm forcing an immediate check of the service.
[1299216405] Warning: The results of service 'Current Users' on host '192.168.16.20' are stale by 0d 0h 0m 30s (threshold=0d 0h 6m 0s). I'm forcing an immediate check of the service.
[1299216405] Warning: The results of service 'PING' on host '192.168.16.20' are stale by 0d 0h 0m 30s (threshold=0d 0h 6m 0s). I'm forcing an immediate check of the service.
[1299216525] Warning: The results of service 'PING' on host '192.168.16.10' are stale by 0d 0h 0m 30s (threshold=0d 0h 6m 0s). I'm forcing an immediate check of the service.
[1299216525] Warning: The results of service 'Root Partition' on host '192.168.16.10' are stale by 0d 0h 0m 30s (threshold=0d 0h 6m 0s). I'm forcing an immediate check of the service.
[1299216585] Warning: The results of service 'SSH' on host '192.168.16.10' are stale by 0d 0h 0m 30s (threshold=0d 0h 6m 0s). I'm forcing an immediate check of the service.

#################################### CENTRAL SERVER ####################################

# cd nsca-2.7.2
# ./configure
# make all

# cp src/nsca /usr/local/nagios/bin
# cp sample-config/nsca.cfg /usr/local/nagios/etc/nsca.cfg
# vi /usr/local/nagios/etc/nsca.cfg
#password=********** (commented out; we are not using a password)
#decryption_method=3 (commented out; we are not using encryption)

# cp sample-config/nsca.xinetd /etc/xinetd.d/nsca

#vi /etc/xinetd.d/nsca
# default: on
# description: NSCA (Nagios Service Check Acceptor)
service nsca
{
flags = REUSE
socket_type = stream
wait = no
user = nagios
group = nagios
server = /usr/local/nagios/bin/nsca
server_args = -c /usr/local/nagios/etc/nsca.cfg --inetd
log_on_failure += USERID
disable = no
only_from = 127.0.0.1 192.168.16.2
}
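
(After dropping a definition like this into /etc/xinetd.d, the usual follow-up, assuming a SysV-style init script, is to reload xinetd and confirm the daemon is listening:)

Code:

# reload xinetd so it picks up the new nsca service definition
/etc/init.d/xinetd restart
# confirm something is now listening on the NSCA port
netstat -tln | grep 5667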





# vi /usr/local/nagios/etc/nagios.cfg
cfg_dir=/usr/local/nagios/etc/servers
enable_notifications=1
execute_service_checks=1
check_external_commands=1
accept_passive_service_checks=1
accept_passive_host_checks=1
translate_passive_host_checks=1
obsess_over_services=0
#ocsp_command=
check_service_freshness=1
service_freshness_check_interval=60
check_host_freshness=1
host_freshness_check_interval=60
enable_event_handlers=1


#vi /usr/local/nagios/etc/objects/template.cfg

# Generic host definition template - This is NOT a real host, just a template!

define host{
name linux-server ; The name of this host template
use linux-server ; This template inherits other values from the generic-host template
check_period 24x7 ; By default, Linux hosts are checked round the clock
check_interval 5 ; Actively check the host every 5 minutes
retry_interval 1 ; Schedule host check retries at 1 minute intervals
max_check_attempts 10 ; Check each Linux host 10 times (max)
notifications_enabled 1
notification_period workhours ; Linux admins hate to be woken up, so we only notify during the day
notification_interval 30 ; Resend notifications every 30 minutes
notification_options d,u,r ; Only send notifications for specific host states
check_freshness 1 ; Freshness checking is enabled for this host
event_handler_enabled 1 ; Host event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
notification_period 24x7 ; Send host notifications at any time
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
process_perf_data 1 ; Process performance data
contact_groups admins ; Notifications get sent to the admins by default
active_checks_enabled 0 ; Active checks are disabled for this host
passive_checks_enabled 1 ; Passive checks are enabled/accepted
# check_command check-host-alive ; Default command to check Linux hosts

}
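
(As an aside: a template whose use directive points at its own name, as linux-server does above, is a circular reference that Nagios rejects at configuration verification, so a working config would presumably inherit from a different template, layered roughly like this sketch:)

Code:

define host{
        name            linux-server    ; this template...
        use             generic-host    ; ...inherits from a separate, more generic template
        register        0               ; template only, not a real host
        }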

# Generic service definition template - This is NOT a real service, just a template!

define service{

name generic-service ; The name of this service template
active_checks_enabled 0 ; Active service checks are disabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 0 ; Active service checks are not parallelized (disabling this can lead to major performance problems)
obsess_over_service 0
check_period 24x7 ; By default, Linux hosts are checked round the clock
check_interval 5 ; Actively check the host every 5 minutes
retry_interval 1 ; Schedule host check retries at 1 minute intervals
retry_check_interval 10 ; Retry the check every 10 minutes until a hard state is reached
max_check_attempts 10 ; Check each Linux host 10 times (max)
notifications_enabled 1
notification_period workhours ; Linux admins hate to be woken up, so we only notify during the day
notification_interval 60 ; Resend notifications every 60 minutes
#notification_options d,u,r ; Only send notifications for specific host states
check_freshness 1 ; Freshness checking is enabled for this service
normal_check_interval 3 ; Check the service every 3 minutes under normal conditions
freshness_threshold 30
contact_groups admins,nagios ; Notifications get sent to the admins by default
event_handler_enabled 1 ; Host event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
is_volatile 0
retain_nonstatus_information 1 ; Retain non-status information across program restarts
notification_period 24x7 ; Send host notifications at any time
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!

}


# mkdir /usr/local/nagios/etc/servers

# /etc/init.d/nagios start

# /usr/local/nagios/bin/nsca -c /usr/local/nagios/etc/nsca.cfg --inetd




===========================================================================
DISTRIBUTED SERVER SETUP

$ cd nsca-2.7.2
$ ./configure
$ make all

# cp src/send_nsca /usr/local/nagios/bin
# cp sample-config/send_nsca.cfg /usr/local/nagios/etc/send_nsca.cfg
# vi /usr/local/nagios/etc/send_nsca.cfg
password=**********
encryption_method=3

# chown nagios /usr/local/nagios/etc/send_nsca.cfg
# chmod 400 /usr/local/nagios/etc/send_nsca.cfg

# vi nagios.cfg
cfg_dir=/usr/local/nagios/etc/servers
enable_notifications=0
obsess_over_services=1
ocsp_command=submit_check_result
obsess_over_hosts=1
accept_passive_service_checks=0
accept_passive_host_checks=0


#ls -l /usr/local/nagios/libexec/eventhandlers/submit_check_result
-rwxrwxrwx 1 nagios root 1119 Mar 8 20:54 /usr/local/nagios/libexec/eventhandlers/submit_check_result

###################################################################

#vi submit_check_result

#!/bin/bash
#
# Arguments:
#  $1 = host_name (short name of the host that the service is associated with)
#  $2 = svc_description (description of the service)
#  $3 = state_string (a string representing the status of the given service:
#       "OK", "WARNING", "CRITICAL" or "UNKNOWN")
#  $4 = plugin_output (a text string that should be used as the plugin
#       output for the service check)

# Convert the state string to the corresponding return code
return_code=-1

case "$3" in
    OK)
        return_code=0
        ;;
    WARNING)
        return_code=1
        ;;
    CRITICAL)
        return_code=2
        ;;
    UNKNOWN)
        return_code=-1
        ;;
esac

# Pipe the service check info into the send_nsca program, which in turn
# transmits the data to the nsca daemon on the central monitoring server
/usr/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" | /usr/local/nagios/bin/send_nsca -H 192.168.16.2 -c /usr/local/nagios/etc/send_nsca.cfg



#sh submit_check_result
0 data packet(s) sent to host successfully.




# vi etc/objects/command.cfg
define command{
        command_name    submit_check_result
        command_line    /usr/local/nagios/libexec/eventhandlers/submit_check_result $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATEID$ '$SERVICEOUTPUT$'
}


#chmod 755 libexec/eventhandlers/submit_check_result

#vi /etc/services
nsca 5667/tcp # NSCA
nrpe 5666/tcp # NRPE


#vi /usr/local/nagios/etc/objects/templates.cfg



define host{
name generic-host ; The name of this host template
use generic-host ; This template inherits other values from the generic-host template
check_period 24x7 ; By default, Linux hosts are checked round the clock
check_interval 5 ; Actively check the host every 5 minutes
retry_interval 1 ; Schedule host check retries at 1 minute intervals
max_check_attempts 10 ; Check each Linux host 10 times (max)
notifications_enabled 1
notification_period workhours ; Linux admins hate to be woken up, so we only notify during the day
notification_interval 60 ; Resend notifications every 60 minutes
notification_options d,u,r ; Only send notifications for specific host states
check_freshness 1 ; Freshness checking is enabled for this host
contact_groups admins ; Notifications get sent to the admins by default
event_handler_enabled 1 ; Host event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
notification_period 24x7 ; Send host notifications at any time
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
process_perf_data 1 ; Process performance data


}
define service{

name generic-service ; The name of this service template
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 0 ; Passive service checks are disabled
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 0
check_period 24x7 ; By default, Linux hosts are checked round the clock
check_interval 5 ; Actively check the host every 5 minutes
retry_interval 1 ; Schedule host check retries at 1 minute intervals
retry_check_interval 10 ; Retry the check every 10 minutes until a hard state is reached
max_check_attempts 10 ; Check each Linux host 10 times (max)
notifications_enabled 1
notification_period workhours ; Linux admins hate to be woken up, so we only notify during the day
notification_interval 60 ; Resend notifications every 60 minutes
check_freshness 1 ; Freshness checking is enabled for this service
normal_check_interval 3 ; Check the service every 3 minutes under normal conditions
freshness_threshold 30
contact_groups admins,nagios ; Notifications get sent to the admins by default
event_handler_enabled 1 ; Host event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
is_volatile 0
retain_nonstatus_information 1 ; Retain non-status information across program restarts
notification_period 24x7 ; Send host notifications at any time
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}


=======================================================================
Please help me out in resolving this issue. You can also mail me at a9ever@gmail.com.

Ashish

Re: Distributed monitoring issue

Post by mguthrie »

You either need to disable your checks for host/service freshness (check_service_freshness=0), or raise the freshness_threshold to a higher value. Note that freshness_threshold is given in seconds: for example, if your passive checks are scheduled to come in every 10 minutes, the freshness_threshold should be longer than 10 minutes (more than 600 seconds). freshness_threshold needs to be defined in your templates.
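
For example, a sketch of a passive service template with a longer threshold (the values here are illustrative only):

Code:

define service{
        name                    passive-service
        active_checks_enabled   0
        passive_checks_enabled  1
        check_freshness         1
        freshness_threshold     900     ; seconds: allow 15 minutes before forcing an active check
        register                0       ; template only, not a real service
        }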