NSCA with NSClient++ results not holding
Posted: Sat Jul 18, 2015 3:43 pm
OK, I've spent 3 weeks trying to get this working and I am at the limit of my understanding.
I am a 4-week Nagios/Linux noob.
Setup: 2 Windows web servers on the internet. I cannot open inbound ports because they sit behind a NATed network, and it would be impractical going forward anyway, as I wish to apply this to other internet-facing servers.
I have installed NSClient++ and am using NSCA to send results to Nagios every 4 minutes.
I believe the messages are arriving OK, as all alerts appear with the correct info at one stage or another in the Nagios web portal.
The problem is that, whilst I can see them coming in OK in the messages log, they don't stay green. They flick to critical with the error "CHECK_NRPE: Socket timeout after 10 seconds.", even though there is nothing wrong with the monitored machine.
I am sure my configuration files are currently far from perfect, as I have tweaked a few of the services on the Nagios server to see if it makes any difference.
This is a home project, so I only get on in the evenings, and I am in the UK, so please don't think I am being rude if my replies are slow. I am learning Nagios and Linux for this project, so I apologise in advance if I take my time providing the info required to help me.
You will find some alerts duplicated; again, this was done to try different configs to see if it helps.
Here is my NSClient++ ini file:
[/modules]
CheckSystem=enabled
CheckDisk=enabled
CheckExternalScripts=enabled
CheckHelpers=enabled
Scheduler=enabled
NSCAClient=enabled
CheckWMI=enabled
CheckSystem=1
CheckDisk=1
FileLogger.dll
NSClientListener.dll
;[/settings/scheduler/schedules/foo]
;command=bar
;[/settings/scheduler/schedules/alias]
;command=command
[/settings/scheduler/schedules/default]
interval=4m
[/settings/scheduler/schedules]
cpu=alias_cpu
mem=alias_mem
disk=alias_disk
service=alias_service
uptime=check_uptime
Check Up Time=CheckUpTime
CPU Usage=checkCPU warn=80 crit=90 time=30m time=20s time=10s
Memory Usage=checkMem MaxWarn=90% MaxCrit=98% ShowAll
Pagefile Usage=checkMem MaxWarn=90% MaxCrit=98% ShowAll type=page
All Drive Space Usage=CheckDriveSize -a FilterType=FIXED matching=.*[CD].* ShowAll=long MinWarn=10%
MinCrit=5% CheckAll
FilterType=FIXED
C:\ Drive Space Usage=CheckDriveSize MinWarn=10% MinCrit=5% Drive=c:\
FilterType=FIXED
Apache Service Status=checkServiceState Apache2.2
NSClient Service Status=checkServiceState nscp
Operating System Version = CheckWMI "Query=Select Version,Caption from win32_OperatingSystem"
;# The following is the host check (always ok/up)
host_check=CheckOK Machine is okay
[/settings/NSCA/client]
hostname=Houseswiftweb1
[/settings/NSCA/client/targets/default]
address=**.**.**.** Obscured
encryption=xor
password=********** Obscured
allow arguments=1
**********************************************************************************************************************
Here is a sample passive template and host:
define host{
name windows-server-passive ; The name of this host template
use generic-host ; Inherit default values from the generic-host template
check_period 24x7 ; By default, Windows servers are monitored round the clock
check_interval 5 ; Actively check the server every 5 minutes
retry_interval 1 ; Schedule host check retries at 1 minute intervals
max_check_attempts 10 ; Check each server 10 times (max)
check_command check-host-alive ; Default command to check if servers are "alive"
notification_period 24x7 ; Send notification out at any time - day or night
notification_interval 30 ; Resend notifications every 30 minutes
notification_options d,r ; Only send notifications for specific host states
contact_groups admins ; Notifications get sent to the admins by default
hostgroups windows-servers-passive ; Host groups that Windows servers should be a member of
register 0 ; DONT REGISTER THIS - ITS JUST A TEMPLATE
}
define host{
use windows-server-passive ; Inherit default values from a template
host_name **********web1 ; The name we're giving to this host
alias ******* Web Server ; A longer name associat$
passive_checks_enabled 1
active_checks_enabled 0
address 77.*.*.* ; IP address of the host
}
*******************************************************************************************
Here are a couple of services
define service{
use generic-service-passive
hostgroup_name windows-servers-passive
passive_checks_enabled 1
active_checks_enabled 0
service_description CPU Usage
check_command check_nrpe!checkCPU 80 90 30 20 10
}
define service{
use generic-service-passive
hostgroup_name windows-servers-passive
passive_checks_enabled 1
active_checks_enabled 0
service_description Memory Usage
check_command check_nrpe!checkMem 90 98
}
********************************************************************************************************************
Here is an extract of my log:
Jul 17 02:18:43 localhost xinetd[10564]: EXIT: nsca status=0 pid=7908 duration=5(sec)
Jul 17 02:18:47 localhost nsca[7909]: Handling the connection...
Jul 17 02:18:47 localhost nsca[7909]: SERVICE CHECK -> Host Name: '******* web1', Service Description: 'Pagefile Usage', Return Code: '0', Output: 'OK: committed: Total: 47.988GB - Used: 18.767GB (39%) - Free: 29.221GB (60%)|'committed'=18.76693GB;43.1895;47.02856;0;47.98833 'committed %'=39%;89;97;0;100'
Jul 17 02:18:47 localhost nsca[7909]: End of connection...
Jul 17 02:18:48 localhost nsca[7910]: Handling the connection...
Jul 17 02:18:48 localhost nsca[7910]: SERVICE CHECK -> Host Name: '********** - Uranus', Service Description: 'CPU Usage', Return Code: '0', Output: 'OK: CPU load is ok.|'total 20m'=0%;80;90 'total 10s'=0%;80;90 'total 4'=0%;80;90'
Jul 17 02:18:48 localhost nsca[7910]: End of connection...
Jul 17 02:18:48 localhost xinetd[10564]: EXIT: nsca status=0 pid=7910 duration=5(sec)
Jul 17 02:18:55 localhost xinetd[10564]: START: nsca pid=7921 from=::ffff:*.*.*.*
Jul 17 02:19:00 localhost nsca[7921]: Handling the connection...
Jul 17 02:19:00 localhost nsca[7921]: SERVICE CHECK -> Host Name: '******* web1', Service Description: 'CPU Usage', Return Code: '0', Output: 'OK: CPU load is ok.|'total 30m'=12%;80;90 'total 20s'=11%;80;90 'total 10s'=12%;80;90'
Jul 17 02:19:00 localhost nsca[7921]: End of connection...
Jul 17 02:19:00 localhost xinetd[10564]: EXIT: nsca status=0 pid=7921 duration=5(sec)
Jul 17 02:19:02 localhost xinetd[10564]: START: nsca pid=7927 from=::ffff:*.*.*.*
Jul 17 02:19:07 localhost nsca[7927]: Handling the connection...
Jul 17 02:19:07 localhost nsca[7927]: SERVICE CHECK -> Host Name: '******* web1', Service Description: 'Check Up Time', Return Code: '0', Output: 'OK: uptime: 49w 299d 7176:429503h, boot: 2014-Aug-08 09:05:14 (UTC)|'uptime'=29763186s;172800;86400'
Jul 17 02:19:07 localhost nsca[7927]: End of connection...
Jul 17 02:19:07 localhost xinetd[10564]: EXIT: nsca status=0 pid=7927 duration=5(sec)
Jul 17 02:19:25 localhost xinetd[10564]: START: nsca pid=7939 from=::ffff:*.*.*.*
Jul 17 02:19:28 localhost xinetd[10564]: START: nsca pid=7940 from=::ffff:*.*.*.*
Jul 17 02:19:30 localhost nsca[7939]: Handling the connection...
Jul 17 02:19:30 localhost nsca[7939]: SERVICE CHECK -> Host Name: '******* web1', Service Description: 'uptime', Return Code: '0', Output: 'OK: uptime: 49w 299d 7176:429503h, boot: 2014-Aug-08 09:05:14 (UTC)|'uptime'=29763207s;172800;86400'
Jul 17 02:19:30 localhost nsca[7939]: End of connection...
Jul 17 02:19:30 localhost xinetd[10564]: EXIT: nsca status=0 pid=7939 duration=5(sec)
Despite these arriving, my alerts are still showing the NRPE error.
Is anyone able to help me please?
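In case it helps anyone compare what is arriving against the host_name and service_description values defined on the Nagios server, the log extract above could be reduced to its host/service pairs. This is only a minimal Python sketch, assuming the syslog line format shown in the extract; the function name received_checks and the sample list are made up for illustration and are not part of NSCA or Nagios.

```python
import re

# Matches the "SERVICE CHECK" lines written by the nsca daemon,
# assuming the format shown in the log extract above.
SERVICE_CHECK = re.compile(
    r"SERVICE CHECK -> Host Name: '(?P<host>[^']*)', "
    r"Service Description: '(?P<service>[^']*)', "
    r"Return Code: '(?P<code>\d+)'"
)


def received_checks(lines):
    """Return the set of (host, service, return_code) tuples found in the log lines."""
    seen = set()
    for line in lines:
        match = SERVICE_CHECK.search(line)
        if match:
            seen.add((match["host"], match["service"], int(match["code"])))
    return seen


# Two lines copied from the extract above (host name still masked).
sample = [
    "Jul 17 02:19:00 localhost nsca[7921]: Handling the connection...",
    "Jul 17 02:19:00 localhost nsca[7921]: SERVICE CHECK -> Host Name: '******* web1', "
    "Service Description: 'CPU Usage', Return Code: '0', "
    "Output: 'OK: CPU load is ok.|'total 30m'=12%;80;90 'total 20s'=11%;80;90 'total 10s'=12%;80;90'",
]

if __name__ == "__main__":
    for host, service, code in sorted(received_checks(sample)):
        print(f"{host!r} / {service!r} -> return code {code}")
```

Each pair printed can then be checked by eye against the host and service definitions, since a passive result is matched on both names.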